Justin Salamon

Few-shot Drum Transcription in Polyphonic Music

11/10/2020


 
Data-driven approaches to automatic drum transcription (ADT) are often limited to a predefined, small vocabulary of percussion instrument classes. Such models cannot recognize out-of-vocabulary classes nor are they able to adapt to finer-grained vocabularies. In this work, we address open vocabulary ADT by introducing few-shot learning to the task. We train a Prototypical Network on a synthetic dataset and evaluate the model on multiple real-world ADT datasets with polyphonic accompaniment. We show that, given just a handful of selected examples at inference time, we can match, and in some cases outperform, a state-of-the-art supervised ADT approach under a fixed vocabulary setting. At the same time, we show that our model can successfully generalize to finer-grained or extended vocabularies unseen during training, a scenario where supervised approaches cannot operate at all. We provide a detailed analysis of our experimental results, including a breakdown of performance by sound class and by polyphony.
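For readers less familiar with Prototypical Networks, here's a minimal sketch of the core inference step (my own illustration in plain numpy, not the paper's code): each class prototype is the mean embedding of its few support examples, and each query frame is assigned to its nearest prototype.

import numpy as np

def prototypical_predict(support_emb, support_labels, query_emb):
    """Nearest-prototype classification of query embeddings.

    support_emb:    (n_support, d) embeddings of the few labeled examples
    support_labels: (n_support,)   integer class labels (e.g. drum classes)
    query_emb:      (n_query, d)   embeddings of the frames to classify
    """
    classes = np.unique(support_labels)
    # A prototype is the mean embedding of a class's support examples
    prototypes = np.stack([support_emb[support_labels == c].mean(axis=0)
                           for c in classes])
    # Squared Euclidean distance from every query to every prototype
    dists = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

In this setting, switching to a new or finer-grained vocabulary just means supplying a different handful of support examples at inference time; no retraining is needed.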

To learn more please read our paper:

Few-Shot Drum Transcription in Polyphonic Music
Y. Wang, J. Salamon, M. Cartwright, N. J. Bryan, J. P. Bello
In 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (virtual), Oct. 2020.

You can find more related materials, including a short video presentation and a poster, here:

https://program.ismir2020.net/poster_1-14.html

The YouTube-ASMR-300K Dataset

15/6/2020


 
We're excited to release the YouTube-ASMR-300K dataset, the largest video dataset with spatial audio published to date!
Picture

Learning from videos with spatial audio is a relatively new domain. While the amount of spatial audio content is increasing, currently there are few video datasets with spatial audio in which the visual content is spatially aligned with the audio content. We therefore introduce a new large-scale dataset of ASMR videos collected from YouTube that contains stereo audio.

ASMR (autonomous sensory meridian response) videos are readily available online and typically feature an individual actor or "ASMRtist" making different sounds while facing towards a camera set up with stereo/binaural or paired microphones.

The audio in these videos contains binaural cues that, when listened to with headphones, create a highly immersive experience in which listeners perceive the sounds as if they were happening around them. Thus, there is a strong correspondence between the visual and spatial audio content in these videos.

​Our full dataset, YouTube-ASMR-300K, consists of approximately 300K 10-second video clips with spatial audio. From this full dataset, we also manually curated a subset of 30K clips from 30 ASMR channels that feature more sound events moving spatially for training our models. We call this curated dataset YouTube-ASMR. We split the video clips into training, validation, and test sets in an 80-10-10 ratio.


Compared to existing datasets, YouTube-ASMR-300K is (1) larger by at least 8x, (2) collected in the wild, and (3) contains sound sources in motion (e.g., a user waves a tuning fork across the field of view), which is important for training models on diverse spatial cues. Here is how YouTube-ASMR and YouTube-ASMR-300K compare to existing video datasets with spatial audio:
Dataset                 # Unique videos                     Duration (hours)
Lu et al., ICIP 2019    N/R                                 9.3
FAIR-Play               N/R (2000 10-sec clips in total)    5.2
YouTube-360             1146                                114
YouTube-ASMR            3520                                96
YouTube-ASMR-300K       33725                               904

​YouTube-ASMR-300K was compiled as part of our CVPR 2020 paper:

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
K. Yang, B. Russell, J. Salamon
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9932-9941.
[CVF][PDF][BibTeX]

The YouTube-ASMR-300K dataset and other project materials are available on our companion website:

https://karreny.github.io/telling-left-from-right/

Teaching Machines to Locate and Generate 3D Sound in Videos by Watching ASMR

11/6/2020


 

Teaching machines to sense the world around them is a challenging problem. They need to:
  • Learn to SEE the world
  • Learn to HEAR the world
  • Learn to LINK what they SEE with what they HEAR

We've seen dramatic progress in this area in recent years, but significant challenges remain:
  • We have access to millions of videos, but they aren't labeled:
    • Can we teach machines from videos without labels?
  • In the same way objects move on screen, the sounds they make move too. Humans leverage this spatial correspondence between sight & sound to better understand the world:
    • Can machines leverage this audiovisual spatial correspondence?

We present a novel self-supervised approach to spatial audiovisual learning: we train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. We also show that the technique generalizes to 360° videos with 3D sound by replacing left/right flipping with front/back audio rotation.

Here's the idea in a nutshell:


In a video with binaural audio, the location of what we see on screen corresponds to the perceived location of the sound. The spatial audio effect is thanks to the stereo audio having distinct left and right channels.
But what if we flip the left and right audio channels? Now the audiovisual spatial correspondence is broken, and the sound will be perceived as coming from the opposite direction.
We can leverage audiovisual spatial correspondence to learn a spatial audiovisual representation via self-supervision!

Concretely, we train a model to predict whether the audio channels have been flipped or not.
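To make the pretext task concrete, here's a minimal sketch (my own illustration, not the paper's code) of how training examples can be generated on the fly: the video frames are passed through unchanged, and the label simply records whether the two audio channels were swapped.

import numpy as np

def make_flip_example(stereo_audio, rng):
    """stereo_audio: (n_samples, 2) array with left and right channels.
    Returns (audio, label), where label = 1 if the channels were swapped."""
    flipped = rng.random() < 0.5
    if flipped:
        stereo_audio = stereo_audio[:, ::-1]  # swap the left and right channels
    return stereo_audio, int(flipped)

# The model sees the (unchanged) video frames together with the possibly flipped
# audio, and is trained to predict the label.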
This surprisingly simple task results in a strong audiovisual representation that's useful in a variety of applications, including:
  • Sounding face tracking
  • On/off-screen sound source separation
  • Automatic audio spatialization of 2D and 3D video, a.k.a. upmixing: converting a video with mono audio into one with stereo or ambisonic audio, so that the sound of visible objects moves in synchrony with their on-screen motion, creating a surround sound effect.

In case you missed it, the upmixing example is also included in the demo video at the top of this page.

The YouTube-ASMR-300K dataset

To learn this spatial audiovisual representation, we compiled a new dataset containing hundreds of thousands of videos with spatial audio. But... where do you find videos with spatial audio?

Cue ASMR videos to the rescue!


ASMR (autonomous sensory meridian response) videos typically feature an individual actor or "ASMRtist" making different sounds while facing towards a camera set up with stereo/binaural or paired microphones. Some people watch ASMR videos to fall asleep. Others enjoy the tingling sensation the spatial audio gives them. Others can't stand them. However people react to ASMR, one thing is clear - ASMR videos are trending on YouTube... in fact, there are millions of them!

And here's the important part - the audio in these videos contains binaural cues, such that there is a strong correspondence between the visual content and the spatial audio cues.

We're delighted to formally release the YouTube-ASMR-300K dataset, the largest video dataset with spatial audio to date.
To download YouTube-ASMR-300K, please visit our companion website:

https://karreny.github.io/telling-left-from-right/

Learn more

You can also learn more about our work by reading our CVPR 2020 paper:

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
Karren Yang, Bryan Russell, Justin Salamon
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9932-9941.
[CVF][PDF][BibTeX]

DESED: A New Dataset for Sound Event Detection and a Comparative Evaluation of SOTA Systems

4/5/2020


 
We present a comparative analysis of the performance of state-of-the-art sound event detection systems. In particular, we study the robustness of the systems to noise and signal degradation, which is known to impact model generalization. Our analysis is based on the results of task 4 of the DCASE 2019 challenge, where submitted systems were evaluated on, in addition to real-world recordings, a series of synthetic soundscapes that allow us to carefully control for different soundscape characteristics. Our results show that while overall systems exhibit significant improvements compared to previous work, they still suffer from biases that could prevent them from generalizing to real-world scenarios.

New Dataset: DESED (Domestic Environment Sound Event Detection)

The study was performed using a new dataset for sound event detection, DESED:
  • Dataset website: https://project.inria.fr/desed/
  • Includes both recorded and synthesized soundscapes
  • Includes isolated sound events, allowing users to generate new soundscapes with Scaper (see the sketch below)
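For example, here's a minimal Scaper sketch based on its documented API (the folder paths and the 'Speech' label are illustrative; this is not the official DESED generation script). It places one foreground event over a randomly chosen background and writes out the audio together with a JAMS annotation:

import scaper

# Folders of isolated sounds, organized into one sub-folder per label (illustrative paths)
sc = scaper.Scaper(10.0, 'foreground/', 'background/')  # 10-second soundscape
sc.ref_db = -50

sc.add_background(label=('choose', []),       # pick any available background label
                  source_file=('choose', []),
                  source_time=('const', 0))

sc.add_event(label=('const', 'Speech'),       # illustrative foreground label
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 8),
             event_duration=('truncnorm', 2, 1, 0.5, 4),
             snr=('normal', 6, 3),
             pitch_shift=None,
             time_stretch=None)

sc.generate('soundscape.wav', 'soundscape.jams')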

Sound Event Detection in Synthetic Domestic Environments
R. Serizel, N. Turpault, A. Shah, J. Salamon

In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
[IEEE][PDF][BibTeX][Copyright]

Chirping up the Right Tree: Incorporating Biological Taxonomies into Deep Bioacoustic Classifiers

3/5/2020


 
Class imbalance in the training data hinders the generalization ability of machine listening systems. In the context of bioacoustics, this issue may be circumvented by aggregating species labels into super-groups of higher taxonomic rank: genus, family, order, and so forth. However, different applications of machine listening to wildlife monitoring may require different levels of granularity. This paper introduces TaxoNet, a deep neural network for structured classification of signals from living organisms. TaxoNet is trained as a multitask and multilabel model, following a new architectural principle in end-to-end learning named "hierarchical composition": shallow layers extract a shared representation to predict a root taxon, while deeper layers specialize recursively to lower-rank taxa. In this way, TaxoNet is capable of handling taxonomic uncertainty, out-of-vocabulary labels, and open-set deployment settings. An experimental benchmark on two new bioacoustic datasets (ANAFCC and BirdVox-14SD) leads to state-of-the-art results in bird species classification. Furthermore, on a task of coarse-grained classification, TaxoNet also outperforms a flat single-task model trained on aggregate labels.
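To give a feel for the "hierarchical composition" idea, here's a simplified sketch of my own (not the TaxoNet code): a shared trunk feeds a coarse-taxon classifier, while a deeper branch specializes the same representation towards a fine-grained (e.g. species-level) classifier.

import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Toy multitask model: coarse taxon predicted from shallow layers,
    fine taxon predicted from deeper layers built on the shared representation."""
    def __init__(self, n_features, n_coarse, n_fine):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU())
        self.coarse_head = nn.Linear(128, n_coarse)   # e.g. order or family
        self.fine_branch = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.fine_head = nn.Linear(128, n_fine)       # e.g. species

    def forward(self, x):
        h = self.trunk(x)  # shared representation
        return self.coarse_head(h), self.fine_head(self.fine_branch(h))

# Training sums one classification loss per taxonomic level, e.g.
# loss = ce(coarse_logits, y_coarse) + ce(fine_logits, y_fine)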

Try it out: pip install birdvoxdetect (https://github.com/BirdVox/birdvoxdetect)

Chirping up the Right Tree: Incorporating Biological Taxonomies into Deep Bioacoustic Classifiers
J. Cramer, V. Lostanlen, A. Farnsworth, J. Salamon, J.P. Bello
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
[IEEE][PDF][BibTeX][Copyright]

Few-Shot Sound Event Detection

3/5/2020


 
Locating perceptually similar sound events within a continuous recording is a common task for various audio applications. However, current tools require users to manually listen to and label all the locations of the sound events of interest, which is tedious and time-consuming. In this work, we (1) adapt state-of-the-art metric-based few-shot learning methods to automate the detection of similar-sounding events, requiring only one or a few examples of the target event, (2) develop a method to automatically construct a partial set of labeled examples (negative samples) to reduce user labeling effort, and (3) develop an inference-time data augmentation method to increase detection accuracy. To validate our approach, we perform extensive comparative analysis of few-shot learning methods for the task of keyword detection in speech. We show that our approach successfully adapts closed-set few-shot learning approaches to an open-set sound event detection problem.

Few-Shot Sound Event Detection
Y. Wang, J. Salamon, N.J. Bryan and J.P. Bello
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
[IEEE][PDF][BibTeX][Copyright]

Disentangled Multidimensional Metric Learning for Music Similarity

3/5/2020


 
Music similarity search is useful for a variety of creative tasks such as replacing one music recording with another recording with a similar "feel", a common task in video editing. For this task, it is typically necessary to define a similarity metric to compare one recording to another. Music similarity, however, is hard to define and depends on multiple simultaneous notions of similarity (i.e. genre, mood, instrument, tempo). While prior work ignores this issue, we embrace this idea and introduce the concept of multidimensional similarity, unifying both global and specialized similarity metrics into a single, semantically disentangled multidimensional similarity metric. To do so, we adapt a variant of deep metric learning called conditional similarity networks to the audio domain and extend it using track-based information to control the specificity of our model. We evaluate our method and show that our single, multidimensional model outperforms both specialized similarity spaces and alternative baselines. We also run a user study and show that our approach is favored by human annotators as well.
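For intuition, here's a rough sketch (my own illustration, not our model code) of the masking mechanism at the heart of conditional similarity networks: a single shared embedding is modulated by a learned, non-negative mask per similarity dimension (genre, mood, instrument, tempo), and distances are computed in the masked subspace.

import torch
import torch.nn as nn

class ConditionalEmbedding(nn.Module):
    def __init__(self, n_features, emb_dim, n_dimensions):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, emb_dim), nn.ReLU(),
                                     nn.Linear(emb_dim, emb_dim))
        # One learnable mask per similarity dimension
        self.masks = nn.Parameter(torch.rand(n_dimensions, emb_dim))

    def forward(self, x, dim_idx):
        z = self.encoder(x)                  # shared embedding
        m = torch.relu(self.masks[dim_idx])  # non-negative mask for this dimension
        return z * m                         # dimension-specific embedding

# A triplet loss is then applied to the masked embeddings, with each triplet's
# similarity dimension selecting which mask to use.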

Disentangled Multidimensional Metric Learning for Music Similarity
J. Lee, N.J. Bryan, J. Salamon, Z. Jin, J. Nam
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 2020.
[IEEE][PDF][BibTeX][Copyright]

Sox vs Rubberband for Audio Pitch Shifting and Time Stretching

30/4/2020


 
This page compares the sound quality for audio pitch shifting and audio time stretching between Sox and Rubberband on different types of audio inputs.

I recently compared the two to decide which one to use in my Scaper soundscape synthesis and augmentation library, and figured I'd share what I found here in case anyone finds it helpful.

Sox and Rubberband are two excellent open-source command-line tools for audio processing. Both have python wrappers too: pysox and pyrubberband.


Both Sox and Rubberband provide optional arguments that allow you to fine-tune the shifting/stretching algorithm for the specific audio content being processed. In my very quick and dirty (and non-comprehensive) exploration, I found one setting for Sox and one setting for Rubberband that generally gave the best-sounding results (to my ears), so I used the same setup for all comparisons below.

​In all the audio examples below,  the following commands were used:


Pitch shifting:
  • sox infile.wav outfile.wav pitch 100
  • rubberband --pitch 1 infile.wav outfile.wav -c 6

Time stretching:
  • sox infile.wav outfile.wav tempo -s 0.833
  • rubberband --time 1.2 infile.wav outfile.wav -c 6
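If you'd rather stay in Python, the same operations can be scripted through the pysox and pyrubberband wrappers. Here's a rough equivalent of the commands above (a sketch only; in particular, the rubberband -c 6 crispness option is not reproduced here):

import soundfile as sf
import sox
import pyrubberband as pyrb

# --- Sox (via pysox) ---
tfm = sox.Transformer()
tfm.pitch(1.0)                # shift up by 1 semitone (= 100 cents)
tfm.build('infile.wav', 'sox_pitch.wav')

tfm = sox.Transformer()
tfm.tempo(0.833, 's')         # slow down to 0.833x speed ('s' = speech mode, as in tempo -s)
tfm.build('infile.wav', 'sox_tempo.wav')

# --- Rubberband (via pyrubberband) ---
audio, sr = sf.read('infile.wav')
sf.write('rb_pitch.wav', pyrb.pitch_shift(audio, sr, n_steps=1), sr)      # up 1 semitone
sf.write('rb_stretch.wav', pyrb.time_stretch(audio, sr, rate=0.833), sr)  # ~1.2x longer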

I strongly recommend listening to the examples below with good headphones, since the differences are sometimes quite subtle (but definitely noticeable with a good pair of cans).

Speech
[Audio examples: Original | Pitch shift: Sox | Pitch shift: Rubberband | Time stretch: Sox | Time stretch: Rubberband]

Engine
[Audio examples: Original | Pitch shift: Sox | Pitch shift: Rubberband | Time stretch: Sox | Time stretch: Rubberband]

Music: male voice
[Audio examples: Original | Pitch shift: Sox | Pitch shift: Rubberband | Time stretch: Sox | Time stretch: Rubberband]

Music: female voice
[Audio examples: Original | Pitch shift: Sox | Pitch shift: Rubberband | Time stretch: Sox | Time stretch: Rubberband]

So... which one's better? Like most things in life, it depends. It depends on your target application and what matters most in that context, it depends on whether you're only doing pitch shifting or time stretching or both, it depends on your audio content, the parameters you choose...

For my application I have a relatively clear winner, but I won't bias you with my opinion ;)

Compute Deep Image & Audio Embeddings with OpenL3

27/1/2020


 
OpenL3 computes deep image & audio embeddings using a self-supervised L3-Net model and can be installed with a simple "pip install openl3"!

Back in May 2019 we announced the release of OpenL3, an open-source python library for computing deep audio embeddings using an improved L3-Net architecture trained on AudioSet. Today we're happy to announce the release of OpenL3 v0.3.0 which can also compute deep image embeddings!

Installing OpenL3 is easy (requires TensorFlow):
pip install openl3
And computing image embeddings is as easy as:
import openl3
from skimage.io import imread  # any image loader returning a numpy array works
image = imread('/path/to/file.png')
emb = openl3.get_image_embedding(image, embedding_size=512)
Computing audio embeddings is equally simple:
import soundfile
audio, sr = soundfile.read('/path/to/file.wav')
embedding, timestamps = openl3.get_audio_embedding(audio, sr)
You can even process video files directly to obtain both image and audio embeddings:
openl3.process_video_file(video_filepath, output_dir='/path/to/output/folder')
Full instructions and all available options are described in the OpenL3 Tutorial. There's also an OpenL3 command line interface (CLI) if you want to script outside of python.

Full details about the embedding models and how they were trained can be found in:

Look, Listen and Learn More: Design Choices for Deep Audio Embeddings
J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello.
IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp 3852-3856, Brighton, UK, May 2019.
[IEEE][PDF][BibTeX][Copyright]

We hope the machine learning community, including both computer vision and machine listening researchers, find OpenL3 useful in their work!

Adobe Audio Research Interns Submit Papers to ICASSP, CVPR, ICLR, CHI, IEEE VR, INTERSPEECH, ISMIR

24/1/2020


 
This past summer we had the pleasure of hosting 14 brilliant students (most of them pursuing their PhD, but also Master's and Bachelor's students) for audio research internships in San Francisco and Seattle. We worked together on a wide range of audio-related topics including speech processing, music, video, animation, spatial audio, VR, DAFX... the works! The projects also covered a range of disciplines, including machine learning (with a healthy dose of deep learning), signal processing and human-computer interaction; and a range of problems within these disciplines, such as self-supervision and representation learning, metric learning, classification, transformation and generation (synthesis).

As a mentor, it was an incredible experience to guide, collaborate with, and learn from this diverse group of people coming from a variety of disciplines and universities in 5 different countries. Between interns and mentors we spanned 11 different countries of origin, making it a truly international group! I'm also delighted to say that a large proportion of the internship projects have resulted in paper submissions to top-tier venues including ICASSP, CVPR, ICLR, CHI and IEEE VR, with submissions to INTERSPEECH and ISMIR in preparation!

The 2019 audio research interns were:
  • Emma Frid, KTH Royal Institute of Technology
  • Nathan Keil, Rensselaer Polytechnic Institute
  • Jongpil Lee, Korea Advanced Institute of Science and Technology
  • Stylianos Mimilakis, Technical University of Ilmenau
  • Max Morrison, Northwestern University
  • Kaizhi Qian, University of Illinois at Urbana-Champaign
  • Lucas Rencker, University of Surrey
  • Oona Risse-Adams, University of California, Santa Cruz
  • Jiaqi Su, Princeton University
  • Zhenyu Tang, University of Maryland-College Park
  • Yapeng Tian, University of Rochester
  • Yu Wang, New York University
  • Karren Yang, Massachusetts Institute of Technology
  • Yang Zhou, University of Massachusetts Amherst

I look forward to keeping in touch with everyone and hope we get to collaborate again in the future!
[Photo: The 2019 San Francisco audio research mentors and interns (Seattle team we love you!)]

