Justin Salamon

Few-shot Drum Transcription in Polyphonic Music

11/10/2020


 
Data-driven approaches to automatic drum transcription (ADT) are often limited to a predefined, small vocabulary of percussion instrument classes. Such models cannot recognize out-of-vocabulary classes nor are they able to adapt to finer-grained vocabularies. In this work, we address open vocabulary ADT by introducing few-shot learning to the task. We train a Prototypical Network on a synthetic dataset and evaluate the model on multiple real-world ADT datasets with polyphonic accompaniment. We show that, given just a handful of selected examples at inference time, we can match, and in some cases outperform, a state-of-the-art supervised ADT approach under a fixed vocabulary setting. At the same time, we show that our model can successfully generalize to finer-grained or extended vocabularies unseen during training, a scenario where supervised approaches cannot operate at all. We provide a detailed analysis of our experimental results, including a breakdown of performance by sound class and by polyphony.
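For readers less familiar with Prototypical Networks, here's a minimal sketch of the core inference step (my own illustration in plain numpy, not the paper's code): each class prototype is the mean embedding of its few support examples, and each query frame is assigned to its nearest prototype.

import numpy as np

def prototypical_predict(support_emb, support_labels, query_emb):
    """Nearest-prototype classification of query embeddings.

    support_emb:    (n_support, d) embeddings of the few labeled examples
    support_labels: (n_support,)   integer class labels (e.g. drum classes)
    query_emb:      (n_query, d)   embeddings of the frames to classify
    """
    classes = np.unique(support_labels)
    # A prototype is the mean embedding of a class's support examples
    prototypes = np.stack([support_emb[support_labels == c].mean(axis=0)
                           for c in classes])
    # Squared Euclidean distance from every query to every prototype
    dists = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

In this setting, switching to a new or finer-grained vocabulary just means supplying a different handful of support examples at inference time; no retraining is needed.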

To learn more please read our paper:

Few-Shot Drum Transcription in Polyphonic Music
Y. Wang, J. Salamon, M. Cartwright, N. J. Bryan, J. P. Bello
In 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (virtual), Oct. 2020.

You can find more related materials, including a short video presentation and a poster, here:

https://program.ismir2020.net/poster_1-14.html

The YouTube-ASMR-300K Dataset

15/6/2020


 
We're excited to release the YouTube-ASMR-300K dataset, the largest video dataset with spatial audio published to date!
Picture

Learning from videos with spatial audio is a relatively new domain. While the amount of spatial audio content is increasing, currently there are few video datasets with spatial audio in which the visual content is spatially aligned with the audio content. We therefore introduce a new large-scale dataset of ASMR videos collected from YouTube that contains stereo audio.

ASMR (autonomous sensory meridian response) videos are readily available online and typically feature an individual actor or "ASMRtist" making different sounds while facing towards a camera set up with stereo/binaural or paired microphones.

The audio in these videos contains binaural cues that, when listened to with headphones, create a highly immersive experience in which listeners perceive the sounds as if they were happening around them. Thus, there is a strong correspondence between the visual and spatial audio content in these videos.

​Our full dataset, YouTube-ASMR-300K, consists of approximately 300K 10-second video clips with spatial audio. From this full dataset, we also manually curated a subset of 30K clips from 30 ASMR channels that feature more sound events moving spatially for training our models. We call this curated dataset YouTube-ASMR. We split the video clips into training, validation, and test sets in an 80-10-10 ratio.


Compared to existing datasets, YouTube-ASMR-300K is (1) larger by at least 8x, (2) collected in the wild, and (3) contains sound sources in motion (e.g., a user waves a tuning fork across the field of view), which is important for training models on diverse spatial cues. Here is how YouTube-ASMR and YouTube-ASMR-300K compare to existing video datasets with spatial audio:
Dataset                 # Unique videos                     Duration (hours)
Lu et al., ICIP 2019    N/R                                 9.3
FAIR-Play               N/R (2000 10-sec clips in total)    5.2
YouTube-360             1146                                114
YouTube-ASMR            3520                                96
YouTube-ASMR-300K       33725                               904

​YouTube-ASMR-300K was compiled as part of our CVPR 2020 paper:

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
K. Yang, B. Russell, J. Salamon
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9932-9941.
[CVF][PDF][BibTeX]

The YouTube-ASMR-300K dataset and other project materials are available on our companion website:

https://karreny.github.io/telling-left-from-right/

Teaching Machines to Locate and Generate 3D Sound in Videos by Watching ASMR

11/6/2020


 

Teaching machines to sense the world around them is a challenging problem. They need to:
  • Learn to SEE the world
  • Learn to HEAR the world
  • Learn to LINK what they SEE with what they HEAR

We've seen dramatic progress in this area in recent years, but significant challenges remain:
  • We have access to millions of videos, but they aren't labeled:
    • Can we teach machines from videos without labels?
  • In the same way objects move on screen, the sounds they make move too. Humans leverage this spatial correspondence between sight & sound to better understand the world:
    • Can machines leverage this audiovisual spatial correspondence?

We present a novel self-supervised approach to spatial audiovisual learning: we train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. We also show that the technique generalizes to 360° videos with 3D sound by replacing left/right flipping with front/back audio rotation.

Here's the idea in a nutshell:


In a video with binaural audio, the location of what we see on screen corresponds to the perceived location of the sound. The spatial audio effect is thanks to the stereo audio having distinct left and right channels.
But what if we flip the left and right audio channels? Now the audiovisual spatial correspondence is broken, and the sound will be perceived as coming from the opposite direction.
We can leverage audiovisual spatial correspondence to learn a spatial audiovisual representation via self-supervision!

Concretely, we train a model to predict whether the audio channels have been flipped or not.
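To make the pretext task concrete, here's a minimal sketch (my own illustration, not the paper's code) of how training examples can be generated on the fly: the video frames are passed through unchanged, and the label simply records whether the two audio channels were swapped.

import numpy as np

def make_flip_example(stereo_audio, rng):
    """stereo_audio: (n_samples, 2) array with left and right channels.
    Returns (audio, label), where label = 1 if the channels were swapped."""
    flipped = rng.random() < 0.5
    if flipped:
        stereo_audio = stereo_audio[:, ::-1]  # swap the left and right channels
    return stereo_audio, int(flipped)

# The model sees the (unchanged) video frames together with the possibly flipped
# audio, and is trained to predict the label.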
This surprisingly simple task results in a strong audiovisual representation that's useful in a variety of applications, including:
  • Sounding face tracking
  • On/off-screen sound source separation
  • Automatic audio spatialization of 2D and 3D video, a.k.a. upmixing: converting a video with mono audio into one with stereo or ambisonic audio, so that the sound of visible objects moves in synchrony with their on-screen motion, creating a surround sound effect.

In case you missed it, the upmixing example is also included in the demo video at the top of this page.

The YouTube-ASMR-300K dataset

To learn this spatial audiovisual representation, we compiled a new dataset containing hundreds of thousands of videos with spatial audio. But... where do you find videos with spatial audio?

Cue ASMR videos to the rescue!


ASMR (autonomous sensory meridian response) videos typically feature an individual actor or "ASMRtist" making different sounds while facing towards a camera set up with stereo/binaural or paired microphones. Some people watch ASMR videos to fall asleep. Others enjoy the tingling sensation the spatial audio gives them. Others can't stand them. However people react to ASMR, one thing is clear - ASMR videos are trending on YouTube... in fact, there are millions of them!

And here's the important part - the audio in these videos contains binaural cues, such that there is a strong correspondence between the visual content and the spatial audio cues.

We're delighted to formally release the YouTube-ASMR-300K dataset, the largest video dataset with spatial audio to date.
To download YouTube-ASMR-300K, please visit our companion website:

https://karreny.github.io/telling-left-from-right/

Learn more

You can also learn more about our work by reading our CVPR 2020 paper:

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
Karren Yang, Bryan Russell, Justin Salamon
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9932-9941.
[CVF][PDF][BibTeX]

DESED: A New Dataset for Sound Event Detection and a Comparative Evaluation of SOTA Systems

4/5/2020


 
We present a comparative analysis of the performance of state-of-the-art sound event detection systems. In particular, we study the robustness of the systems to noise and signal degradation, which is known to impact model generalization. Our analysis is based on the results of task 4 of the DCASE 2019 challenge, where submitted systems were evaluated on, in addition to real-world recordings, a series of synthetic soundscapes that allow us to carefully control for different soundscape characteristics. Our results show that while overall systems exhibit significant improvements compared to previous work, they still suffer from biases that could prevent them from generalizing to real-world scenarios.

New Dataset: DESED (Domestic Environment Sound Event Detection)

The study was performed using a new dataset for sound event detection, DESED:
  • Dataset website: https://project.inria.fr/desed/
  • Includes both recorded and synthesized soundscapes
  • Includes isolated sound events, allowing users to generate new soundscapes with Scaper (see the sketch below)
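For example, here's a minimal Scaper sketch based on its documented API (the folder paths and the 'Speech' label are illustrative; this is not the official DESED generation script). It places one foreground event over a randomly chosen background and writes out the audio together with a JAMS annotation:

import scaper

# Folders of isolated sounds, organized into one sub-folder per label (illustrative paths)
sc = scaper.Scaper(10.0, 'foreground/', 'background/')  # 10-second soundscape
sc.ref_db = -50

sc.add_background(label=('choose', []),       # pick any available background label
                  source_file=('choose', []),
                  source_time=('const', 0))

sc.add_event(label=('const', 'Speech'),       # illustrative foreground label
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 8),
             event_duration=('truncnorm', 2, 1, 0.5, 4),
             snr=('normal', 6, 3),
             pitch_shift=None,
             time_stretch=None)

sc.generate('soundscape.wav', 'soundscape.jams')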

Sound Event Detection in Synthetic Domestic Environments
R. Serizel, N. Turpault, A. Shah, J. Salamon

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
[IEEE][PDF][BibTeX][Copyright]

Chirping up the Right Tree: Incorporating Biological Taxonomies into Deep Bioacoustic Classifiers

3/5/2020


 
Class imbalance in the training data hinders the generalization ability of machine listening systems. In the context of bioacoustics, this issue may be circumvented by aggregating species labels into super-groups of higher taxonomic rank: genus, family, order, and so forth. However, different applications of machine listening to wildlife monitoring may require different levels of granularity. This paper introduces TaxoNet, a deep neural network for structured classification of signals from living organisms. TaxoNet is trained as a multitask and multilabel model, following a new architectural principle in end-to-end learning named "hierarchical composition": shallow layers extract a shared representation to predict a root taxon, while deeper layers specialize recursively to lower-rank taxa. In this way, TaxoNet is capable of handling taxonomic uncertainty, out-of-vocabulary labels, and open-set deployment settings. An experimental benchmark on two new bioacoustic datasets (ANAFCC and BirdVox-14SD) leads to state-of-the-art results in bird species classification. Furthermore, on a task of coarse-grained classification, TaxoNet also outperforms a flat single-task model trained on aggregate labels.
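To give a feel for the "hierarchical composition" idea, here's a simplified sketch of my own (not the TaxoNet code): a shared trunk feeds a coarse-taxon classifier, while a deeper branch specializes the same representation towards a fine-grained (e.g. species-level) classifier.

import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Toy multitask model: coarse taxon predicted from shallow layers,
    fine taxon predicted from deeper layers built on the shared representation."""
    def __init__(self, n_features, n_coarse, n_fine):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU())
        self.coarse_head = nn.Linear(128, n_coarse)   # e.g. order or family
        self.fine_branch = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.fine_head = nn.Linear(128, n_fine)       # e.g. species

    def forward(self, x):
        h = self.trunk(x)  # shared representation
        return self.coarse_head(h), self.fine_head(self.fine_branch(h))

# Training sums one classification loss per taxonomic level, e.g.
# loss = ce(coarse_logits, y_coarse) + ce(fine_logits, y_fine)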

Try it out: pip install birdvoxdetect (https://github.com/BirdVox/birdvoxdetect)

Chirping up the Right Tree: Incorporating Biological Taxonomies into Deep Bioacoustic Classifiers
J. Cramer, V. Lostanlen, A. Farnsworth, J. Salamon, J.P. Bello
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
[IEEE][PDF][BibTeX][Copyright]

Few-Shot Sound Event Detection

3/5/2020


 
Locating perceptually similar sound events within a continuous recording is a common task for various audio applications. However, current tools require users to manually listen to and label all the locations of the sound events of interest, which is tedious and time-consuming. In this work, we (1) adapt state-of-the-art metric-based few-shot learning methods to automate the detection of similar-sounding events, requiring only one or a few examples of the target event, (2) develop a method to automatically construct a partial set of labeled examples (negative samples) to reduce user labeling effort, and (3) develop an inference-time data augmentation method to increase detection accuracy. To validate our approach, we perform extensive comparative analysis of few-shot learning methods for the task of keyword detection in speech. We show that our approach successfully adapts closed-set few-shot learning approaches to an open-set sound event detection problem.

Few-Shot Sound Event Detection
Y. Wang, J. Salamon, N.J. Bryan and J.P. Bello
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
[IEEE][PDF][BibTeX][Copyright]

Disentangled Multidimensional Metric Learning for Music Similarity

3/5/2020


 
Music similarity search is useful for a variety of creative tasks such as replacing one music recording with another recording with a similar "feel", a common task in video editing. For this task, it is typically necessary to define a similarity metric to compare one recording to another. Music similarity, however, is hard to define and depends on multiple simultaneous notions of similarity (i.e. genre, mood, instrument, tempo). While prior work ignores this issue, we embrace this idea and introduce the concept of multidimensional similarity, unifying both global and specialized similarity metrics into a single, semantically disentangled multidimensional similarity metric. To do so, we adapt a variant of deep metric learning called conditional similarity networks to the audio domain and extend it using track-based information to control the specificity of our model. We evaluate our method and show that our single, multidimensional model outperforms both specialized similarity spaces and alternative baselines. We also run a user study and show that our approach is favored by human annotators as well.
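For intuition, here's a rough sketch (my own illustration, not our model code) of the masking mechanism at the heart of conditional similarity networks: a single shared embedding is modulated by a learned, non-negative mask per similarity dimension (genre, mood, instrument, tempo), and distances are computed in the masked subspace.

import torch
import torch.nn as nn

class ConditionalEmbedding(nn.Module):
    def __init__(self, n_features, emb_dim, n_dimensions):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, emb_dim), nn.ReLU(),
                                     nn.Linear(emb_dim, emb_dim))
        # One learnable mask per similarity dimension
        self.masks = nn.Parameter(torch.rand(n_dimensions, emb_dim))

    def forward(self, x, dim_idx):
        z = self.encoder(x)                  # shared embedding
        m = torch.relu(self.masks[dim_idx])  # non-negative mask for this dimension
        return z * m                         # dimension-specific embedding

# A triplet loss is then applied to the masked embeddings, with each triplet's
# similarity dimension selecting which mask to use.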

Disentangled Multidimensional Metric Learning for Music Similarity
J. Lee, N.J. Bryan, J. Salamon, Z. Jin, J. Nam
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
[IEEE][PDF][BibTeX][Copyright]

Sox vs Rubberband for Audio Pitch Shifting and Time Stretching

30/4/2020


 
This page compares the sound quality for audio pitch shifting and audio time stretching between Sox and Rubberband on different types of audio inputs.

I recently compared the two to decide which one to use in my Scaper soundscape synthesis and augmentation library, and figured I'd share what I found here in case anyone finds it helpful.

Sox and Rubberband are two excellent open-source command-line tools for audio processing. Both have python wrappers too: pysox and pyrubberband.


Both Sox and Rubberband provide optional arguments that allow you to fine-tune the shifting/stretching algorithm for the specific audio content being processed. In my very quick and dirty (and non-comprehensive) exploration, I found one setting for Sox and one setting for Rubberband that generally gave the best-sounding results (to my ears), so I used the same setup for all comparisons below.

​In all the audio examples below,  the following commands were used:


Pitch shifting:
  • sox infile.wav outfile.wav pitch 100
  • rubberband --pitch 1 infile.wav outfile.wav -c 6

Time stretching:
  • sox infile.wav outfile.wav tempo -s 0.833
  • rubberband --time 1.2 infile.wav outfile.wav -c 6
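If you'd rather stay in Python, the same operations can be scripted through the pysox and pyrubberband wrappers. Here's a rough equivalent of the commands above (a sketch only; in particular, the rubberband -c 6 crispness option is not reproduced here):

import soundfile as sf
import sox
import pyrubberband as pyrb

# --- Sox (via pysox) ---
tfm = sox.Transformer()
tfm.pitch(1.0)                # shift up by 1 semitone (= 100 cents)
tfm.build('infile.wav', 'sox_pitch.wav')

tfm = sox.Transformer()
tfm.tempo(0.833, 's')         # slow down to 0.833x speed ('s' = speech mode, as in tempo -s)
tfm.build('infile.wav', 'sox_tempo.wav')

# --- Rubberband (via pyrubberband) ---
audio, sr = sf.read('infile.wav')
sf.write('rb_pitch.wav', pyrb.pitch_shift(audio, sr, n_steps=1), sr)      # up 1 semitone
sf.write('rb_stretch.wav', pyrb.time_stretch(audio, sr, rate=0.833), sr)  # ~1.2x longer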

I strongly recommend listening to the examples below with good headphones, since the differences are sometimes quite subtle (but definitely noticeable with a good pair of cans).

Speech
[Audio examples: Original | Pitch shift: Sox | Pitch shift: Rubberband | Time stretch: Sox | Time stretch: Rubberband]

Engine
[Audio examples: Original | Pitch shift: Sox | Pitch shift: Rubberband | Time stretch: Sox | Time stretch: Rubberband]

Music: male voice
[Audio examples: Original | Pitch shift: Sox | Pitch shift: Rubberband | Time stretch: Sox | Time stretch: Rubberband]

Music: female voice
[Audio examples: Original | Pitch shift: Sox | Pitch shift: Rubberband | Time stretch: Sox | Time stretch: Rubberband]

So... which one's better? Like most things in life, it depends. It depends on your target application and what matters most in that context, it depends on whether you're only doing pitch shifting or time stretching or both, it depends on your audio content, the parameters you choose...

For my application I have a relatively clear winner, but I won't bias you with my opinion ;)

Compute Deep Image & Audio Embeddings with OpenL3

27/1/2020


 
OpenL3 computes deep image & audio embeddings using a self-supervised L3-Net model and can be installed with a simple "pip install openl3"!

Back in May 2019 we announced the release of OpenL3, an open-source python library for computing deep audio embeddings using an improved L3-Net architecture trained on AudioSet. Today we're happy to announce the release of OpenL3 v0.3.0 which can also compute deep image embeddings!

Installing OpenL3 is easy (requires TensorFlow):
pip install openl3
And computing image embeddings is as easy as:
import openl3
from skimage.io import imread  # any image loader returning a numpy array works
image = imread('/path/to/file.png')
emb = openl3.get_image_embedding(image, embedding_size=512)
Computing audio embeddings is equally simple:
import soundfile
audio, sr = soundfile.read('/path/to/file.wav')
embedding, timestamps = openl3.get_audio_embedding(audio, sr)
You can even process video files directly to obtain both image and audio embeddings:
openl3.process_video_file(video_filepath, output_dir='/path/to/output/folder')
Full instructions and all available options are described in the OpenL3 Tutorial. There's also an OpenL3 command line interface (CLI) if you want to script outside of python.

Full details about the embedding models and how they were trained can be found in:

Look, Listen and Learn More: Design Choices for Deep Audio Embeddings
J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello.
IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp 3852-3856, Brighton, UK, May 2019.
[IEEE][PDF][BibTeX][Copyright]

We hope the machine learning community, including both computer vision and machine listening researchers, find OpenL3 useful in their work!

Adobe Audio Research Interns Submit Papers to ICASSP, CVPR, ICLR, CHI, IEEE VR, INTERSPEECH, ISMIR

24/1/2020


 
This past summer we had the pleasure of hosting 14 brilliant students (most of them pursuing their PhD, but also Master's and Bachelor's students) for audio research internships in San Francisco and Seattle. We worked together on a wide range of audio-related topics including speech processing, music, video, animation, spatial audio, VR, DAFX... the works! The projects also covered a range of disciplines, including machine learning (with a healthy dose of deep learning), signal processing and human-computer interaction; and a range of problems within these disciplines, such as self-supervision and representation learning, metric learning, classification, transformation and generation (synthesis).

As a mentor, it was an incredible experience to guide, collaborate with, and learn from this diverse group of people coming from a variety of disciplines and universities in 5 different countries. Between interns and mentors we spanned 11 different countries of origin, making it a truly international group! I'm also delighted to say that a large proportion of the internship projects have resulted in paper submissions to top-tier venues including ICASSP, CVPR, ICLR, CHI and IEEE VR, with submissions to INTERSPEECH and ISMIR in preparation!

The 2019 audio research interns were:
  • Emma Frid, KTH Royal Institute of Technology
  • Nathan Keil, Rensselaer Polytechnic Institute
  • Jongpil Lee, Korea Advanced Institute of Science and Technology
  • Stylianos Mimilakis, Technical University of Ilmenau
  • Max Morrison, Northwestern University
  • Kaizhi Qian, University of Illinois at Urbana-Champaign
  • Lucas Rencker, University of Surrey
  • Oona Risse-Adams, University of California, Santa Cruz
  • Jiaqi Su, Princeton University
  • Zhenyu Tang, University of Maryland-College Park
  • Yapeng Tian, University of Rochester
  • Yu Wang, New York University
  • Karren Yang, Massachusetts Institute of Technology
  • Yang Zhou, University of Massachusetts Amherst

I look forward to keeping in touch with everyone and hope we get to collaborate again in the future!
[Photo: The 2019 San Francisco audio research mentors and interns (Seattle team we love you!)]

