Justin Salamon

Compute Deep Image & Audio Embeddings with OpenL3

27/1/2020

OpenL3 computes deep image & audio embeddings using a self-supervised L3-Net model and can be installed with a simple "pip install openl3"!

Back in May 2019 we announced the release of OpenL3, an open-source python library for computing deep audio embeddings using an improved L3-Net architecture trained on AudioSet. Today we're happy to announce the release of OpenL3 v0.3.0 which can also compute deep image embeddings!

Installing OpenL3 is easy (requires TensorFlow):
pip install openl3
And computing image embeddings is as easy as:
import openl3
from skimage.io import imread
image = imread('/path/to/file.png')
emb = openl3.get_image_embedding(image, embedding_size=512)
Computing audio embeddings is equally simple:
import openl3
import soundfile
audio, sr = soundfile.read('/path/to/file.wav')
embedding, timestamps = openl3.get_audio_embedding(audio, sr)
You can even process video files directly to obtain both image and audio embeddings:
openl3.process_video_file(video_filepath, output_dir='/path/to/output/folder')
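If you want to featurize a whole folder of audio clips, say to train a downstream classifier, a minimal sketch along the following lines should work. The paths are placeholders, and averaging the frame embeddings over time is just one common way to get a fixed-length vector per clip, not something OpenL3 imposes:

import glob
import numpy as np
import openl3
import soundfile as sf

features = []
for path in sorted(glob.glob('/path/to/clips/*.wav')):
    audio, sr = sf.read(path)
    emb, ts = openl3.get_audio_embedding(audio, sr, embedding_size=512)
    # One fixed-length vector per clip: average the frame embeddings over time.
    features.append(emb.mean(axis=0))

X = np.vstack(features)  # shape: (n_clips, 512)
np.save('openl3_features.npy', X)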
Full instructions and all available options are described in the OpenL3 tutorial. There's also an OpenL3 command line interface (CLI) if you want to script outside of Python.

Full details about the embedding models and how they were trained can be found in:

Look, Listen and Learn More: Design Choices for Deep Audio Embeddings
J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello.
IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp 3852-3856, Brighton, UK, May 2019.
[IEEE][PDF][BibTeX][Copyright]

We hope machine learning researchers, in both computer vision and machine listening, find OpenL3 useful in their work!

Adobe Audio Research Interns Submit Papers to ICASSP, CVPR, ICLR, CHI, IEEE VR, INTERSPEECH, ISMIR

24/1/2020

This past summer we had the pleasure of hosting 14 brilliant students (most pursuing PhDs, along with Master's and Bachelor's students) for audio research internships in San Francisco and Seattle. We worked together on a wide range of audio-related topics including speech processing, music, video, animation, spatial audio, VR, DAFX... the works! The projects also covered a range of disciplines, including machine learning (with a healthy dose of deep learning), signal processing and human-computer interaction, and a range of problems within these disciplines, such as self-supervision and representation learning, metric learning, classification, transformation, and generation (synthesis).

As a mentor, it was an incredible experience to guide, collaborate with, and learn from this diverse group of people coming from a variety of disciplines and universities across 5 different countries. Between interns and mentors we spanned 11 different countries of origin, making it a truly international group! I'm also delighted to say that a large proportion of the internship projects have resulted in paper submissions to top-tier venues including ICASSP, CVPR, ICLR, CHI and IEEE VR, with further submissions to INTERSPEECH and ISMIR in preparation!

The 2019 audio research interns were:
  • Emma Frid, KTH Royal Institute of Technology
  • Nathan Keil, Rensselaer Polytechnic Institute
  • Jongpil Lee, Korea Advanced Institute of Science and Technology
  • Stylianos Mimilakis, Technical University of Ilmenau
  • Max Morrison, Northwestern University
  • Kaizhi Qian, University of Illinois at Urbana-Champaign
  • Lucas Rencker, University of Surrey
  • Oona Risse-Adams, University of California, Santa Cruz
  • Jiaqi Su, Princeton University
  • Zhenyu Tang, University of Maryland-College Park
  • Yapeng Tian, University of Rochester
  • Yu Wang, New York University
  • Karren Yang, Massachusetts Institute of Technology
  • Yang Zhou, University of Massachusetts Amherst

I look forward to keeping in touch with everyone and hope we get to collaborate again in the future!
The 2019 San Francisco audio research mentors and interns (Seattle team, we love you!)

Elected to the IEEE Audio and Acoustic Signal Processing Technical Committee

14/1/2020

I'm happy to report I've been elected to the IEEE Audio and Acoustic Signal Processing Technical Committee. 

The AASP TC's mission is to support, nourish and lead scientific and technological development in all areas of audio and acoustic signal processing. These areas are currently seeing increased levels of interest and significant growth, providing fertile ground for a broad range of specific and interdisciplinary research and development. Ranging from array processing for microphones and loudspeakers to music genre classification, from psychoacoustics to machine learning, from consumer electronics devices to blue-sky research, this remit encompasses countless technical challenges and many hot topics. The TC numbers some 30 appointed volunteer members, drawn roughly equally from leading academic and industrial organizations around the world and unified by the common aim of offering their expertise in the service of the scientific community.

Looking forward to doing my bit for this excellent scientific community!


SONYC-UST: A Multilabel Dataset from an Urban Acoustic Sensor Network

5/1/2020

SONYC Urban Sound Tagging (SONYC-UST) is a dataset for the development and evaluation of machine listening systems for real-world urban noise monitoring. It consists of 3068 audio recordings from the "Sounds of New York City" (SONYC) acoustic sensor network.
Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 fine-grained classes that were chosen in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes:
The SONYC-UST taxonomy

For more details please see:

SONYC Urban Sound Tagging (SONYC-UST): A Multilabel Dataset from an Urban Acoustic Sensor Network
M. Cartwright, A. E. Mendez Mendez, J. Cramer, V. Lostanlen, G. Dove, H.-H. Wu, J. Salamon, O. Nov, and J.P. Bello
Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), pages 35-39, New York University, NY, USA, Oct. 2019.

Download SONYC-UST: 
https://doi.org/10.5281/zenodo.3338310
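If you want to poke around the annotations, a minimal sketch like the following should get you started. It assumes the release ships with an annotation table in CSV form; the file and column names below are placeholders for illustration, so check the README included with the Zenodo download for the exact schema.

import pandas as pd

# Hypothetical file/column names; see the dataset README for the real schema.
annotations = pd.read_csv('annotations.csv')

print(annotations.head())
print('Annotated recordings:', annotations['audio_filename'].nunique())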

TriCycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision

4/11/2019

Self-supervised representation learning with deep neural networks is a powerful tool for machine learning tasks with limited labeled data but extensive unlabeled data. To learn representations, self-supervised models are typically trained on a pretext task to predict structure in the data (e.g. audio-visual correspondence, short-term temporal sequence, word sequence) that is indicative of higher-level concepts relevant to a target, downstream task. Sensor networks are promising yet unexplored sources of data for self-supervised learning - they collect large amounts of unlabeled yet timestamped data over extended periods of time and typically exhibit long-term temporal structure (e.g., over hours, months, years) not observable at the short time scales previously explored in self-supervised learning (e.g., seconds). This structure can be present even in single-modal data and therefore could be exploited for self-supervision in many types of sensor networks. In this work, we present a model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network. We then demonstrate the utility of the learned audio representation in an urban sound event detection task with limited labeled data.
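To make the idea of a cyclic pretext target concrete, here is a toy sketch (not the paper's actual pipeline or architecture) of how a time-of-day label can be encoded as a cyclic regression target that a network could be trained to predict from audio features:

import numpy as np

def cyclic_time_targets(hours_of_day):
    # Map hour-of-day (0-24) onto the unit circle so that 23:59 and 00:01
    # end up close together, giving a cyclic regression target.
    theta = 2 * np.pi * np.asarray(hours_of_day, dtype=float) / 24.0
    return np.stack([np.sin(theta), np.cos(theta)], axis=-1)

# Example: targets for clips recorded at 6 am, noon, and 11 pm.
print(cyclic_time_targets([6, 12, 23]))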

Read the full paper here:

TriCycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision
M. Cartwright, J. Cramer, J. Salamon, and J.P. Bello
In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2019.
[IEEE][PDF][BibTeX][Copyright]

Robust Sound Event Detection in Bioacoustic Sensor Networks

24/10/2019

The innovation: we present a context-adaptive deep network that uses an auxiliary sub-network to model the background environment, improving robustness to changing environments at inference time.

The results: the proposed model produces state-of-the-art flight call detection results that are robust to environmental changes across time and space.

The surprise: neither context adaptation alone nor PCEN pre-processing alone helps much, but combining the two leads to dramatic gains.
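For readers unfamiliar with PCEN: it applies short-term automatic gain control to each subband of the mel spectrogram, which helps suppress stationary background noise. A minimal sketch of computing a PCEN-ed mel spectrogram with librosa (the settings here are illustrative, not the exact front-end configuration used in the paper):

import librosa

# Load a field recording (resampled to 22.05 kHz by librosa's default).
y, sr = librosa.load('/path/to/field_recording.wav', sr=22050)

# Magnitude mel spectrogram (power=1), then per-channel energy normalization.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, power=1)
# Scaling by 2**31 follows the librosa documentation example, since PCEN's
# default parameters assume integer-valued input.
S_pcen = librosa.pcen(S * (2 ** 31), sr=sr)

print(S_pcen.shape)  # (n_mels, n_frames)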

The tech: We release BirdVoxDetect, an open-source tool for automatically detecting avian flight calls in continuous audio recordings: https://github.com/BirdVox/birdvoxdetect

Installing BirdVoxDetect (assuming Python is installed) is as easy as calling: pip install birdvoxdetect

Full paper:
Robust Sound Event Detection in Bioacoustic Sensor Networks
V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, and J.P. Bello
PLoS ONE 14(10): e0214168, 2019. DOI: https://doi.org/10.1371/journal.pone.0214168
[PLoS ONE][PDF][BibTeX]

Abstract:
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNN) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 ms) and long-term (30 min) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Secondly, we replace the last dense layer in the network by a context-adaptive neural network (CA-NN) layer, i.e. an affine layer whose weights are dynamically adapted at prediction time by an auxiliary network taking long-term summary statistics of spectrotemporal features as input. We show that PCEN reduces temporal overfitting across dawn vs. dusk audio clips whereas context adaptation on PCEN-based summary statistics reduces spatial overfitting across sensor locations. Moreover, combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone. We release a pre-trained version of our best performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.

Build a Web-Based Ukulele Tuner using CREPE

7/10/2019

The YouTube coding channel The Coding Train just published its latest Coding Challenge: "Ukulele Tuner with Machine Learning Pitch Detection Model", where they use our CREPE model to build a web-based ukulele tuner! Awesome!

They use CREPE via the ml5.js library, which offers a variety of ML algorithms for the web. It was a delightful surprise to discover that CREPE is being used to power the pitchDetection function.
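The Coding Train's tuner runs in the browser via ml5.js, but the same idea is easy to prototype in Python with the crepe package (pip install crepe). Here is a rough sketch of the tuner logic, assuming standard GCEA ukulele tuning and a recording of a single plucked string; the file path and confidence threshold are illustrative:

import crepe
import numpy as np
from scipy.io import wavfile

# Standard ukulele (GCEA) string frequencies in Hz.
UKULELE_STRINGS = {'G4': 392.00, 'C4': 261.63, 'E4': 329.63, 'A4': 440.00}

sr, audio = wavfile.read('/path/to/plucked_string.wav')
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)

# Use the median pitch over confident frames as the string's pitch estimate.
f0 = np.median(frequency[confidence > 0.8])
name, target = min(UKULELE_STRINGS.items(), key=lambda kv: abs(kv[1] - f0))
cents = 1200 * np.log2(f0 / target)
print(f"Detected {f0:.1f} Hz -> nearest string {name}, off by {cents:+.0f} cents")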

You can watch the full video on The Coding Train's YouTube channel.

What's Broken in Music Informatics Research? Three Uncomfortable Statements

20/6/2019

Companion website for my invited talk at the Machine Learning for Music Discovery workshop at ICML 2019: "What's Broken in Music Informatics Research? Three Uncomfortable Statements".

PAPER:

What's Broken in Music Informatics Research? Three Uncomfortable Statements
Justin Salamon

Invited talk, Machine Learning for Music Discovery workshop, International Conference on Machine Learning (ICML), Long Beach, California, USA, June 2019.
[PDF][Video]


SUPPLEMENTARY MATERIAL:

Figure 1 (see paper for details):
Figure 1: (a) ground truth melody f0 (pitch) sequence, (b) first algorithm's melody estimate, (c) second algorithm's melody estimate.

Bioacoustics Datasets: A New Website Listing Bioacoustics Datasets and Repositories

25/5/2019


I couldn't find a centralized list of #bioacoustics datasets (for #machinelearning or otherwise), so I created this page. Feedback welcome! #opendata https://t.co/42YHDBiBtM

— Justin Salamon (@justin_salamon) May 24, 2019
To explore Bioacoustic Datasets, a centralized list of bioacoustics datasets and repositories, visit: https://bioacousticsdatasets.weebly.com

OpenL3: A Competitive and Open Deep Audio Embedding

7/5/2019

We're excited to announce the release of OpenL3, an open-source deep audio embedding based on the self-supervised L3-Net. OpenL3 is an improved version of L3-Net, and outperforms VGGish and SoundNet (and the original L3-Net) on several sound recognition tasks. Most importantly, OpenL3 is open source and readily available for everyone to use: if you have TensorFlow installed, just run pip install openl3 and you're good to go!

Full details are provided in our paper:

Look, Listen and Learn More: Design Choices for Deep Audio Embeddings
J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello.
IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp 3852-3856, Brighton, UK, May 2019.
[IEEE][PDF][BibTeX][Copyright]

How well does it work?

In the paper we compare classification results on three environmental sound datasets using embeddings from OpenL3, SoundNet and VGGish as input to a simple 2-layer MLP.
Using OpenL3 we are able to match the current state of the art on UrbanSound8K, the most challenging of the three datasets, using a simple MLP without any of the tricks usually necessary for relatively small datasets (such as data augmentation). 
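As a rough illustration of this kind of downstream evaluation (not the exact protocol from the paper, which follows UrbanSound8K's predefined folds), one could train a small MLP on per-clip OpenL3 embeddings along these lines; the file names and hidden layer sizes are placeholders:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# One OpenL3 embedding per clip (e.g. frame embeddings averaged over time).
X = np.load('openl3_features.npy')  # shape: (n_clips, 512)
y = np.load('labels.npy')           # shape: (n_clips,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(512, 128), max_iter=500)
clf.fit(X_train, y_train)
print('Test accuracy:', clf.score(X_test, y_test))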

Using OpenL3

Installing OpenL3, a Python module, is as easy as calling (assuming TensorFlow is already installed):
$ pip install openl3

Once installed, using OpenL3 in Python can be done like this (simplest use case, without setting custom parameter values):
import openl3
import soundfile as sf

audio, sr = sf.read('/path/to/file.wav')
embedding, timestamps = openl3.get_embedding(audio, sr)

We also provide a command-line interface (CLI) that can be launched by calling "openl3" from the command line:
$ openl3 /path/to/file.wav

The API (both Python and CLI) includes more options, such as changing the hop size used to extract the embedding, the output dimensionality of the embedding, and several other parameters. A good place to start is the OpenL3 tutorial.
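For instance, something along these lines requests the environmental-sound model, a 512-dimensional embedding, and a 0.5-second hop; treat the keyword names below as a sketch and double-check them against the tutorial, which is the authoritative reference:

import openl3
import soundfile as sf

audio, sr = sf.read('/path/to/file.wav')

# Smaller embedding (512-d instead of the default 6144-d), environmental-sound
# model, and a 0.5 s hop between analysis frames.
embedding, timestamps = openl3.get_embedding(
    audio, sr, content_type='env', embedding_size=512, hop_size=0.5)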

How was OpenL3 trained?

OpenL3 is an improved version of L3-Net by Arandjelovic and Zisserman, and is trained on a subset of AudioSet using self-supervision, exploiting the correspondence between sound and visual objects in video data.
The embedding is obtained by taking the output of the final convolutional layer of the audio subnetwork. For more details please see our paper.

We look forward to seeing what the community does with OpenL3!

...and, if you're attending ICASSP 2019, be sure to stop by our poster on Friday, May 17, between 13:30 and 15:30 (session MLSP-P17: Deep Learning V, Poster Area G, paper 2149)!