Justin Salamon

Audio-Text Models Do Not Yet Leverage Natural Language

27/3/2023

Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In this work, we show that state-of-the-art audio-text models do not yet really understand natural language, especially contextual concepts such as sequential or concurrent ordering of sound events. Our results suggest that existing benchmarks are not sufficient to assess these models’ capabilities to match complex contexts from the audio and text modalities. We propose a Transformer-based architecture and show that, unlike prior work, it is capable of modeling the sequential relationship between sound events in the text and audio, given appropriate benchmark data. We advocate for the collection or generation of additional, diverse data to allow future research to fully leverage natural language for audio-text modeling.
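
For readers new to this area, the training recipe behind most audio-text models is a symmetric contrastive objective: matching (audio, caption) pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. Here is a minimal PyTorch sketch of that generic objective; the embeddings are random placeholders and this is not the architecture proposed in the paper:

import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (batch, dim) embeddings of paired clips and captions.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                       # (batch, batch) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)  # matching pairs sit on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)          # audio -> text retrieval direction
    loss_t2a = F.cross_entropy(logits.T, targets)        # text -> audio retrieval direction
    return 0.5 * (loss_a2t + loss_t2a)

# Toy usage with random placeholder embeddings:
loss = audio_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))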

Audio-Text Models Do Not Yet Leverage Natural Language
H-H. Wu, O. Nieto, J.P. Bello, J. Salamon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
[PDF][arXiv][Copyright]

HEAR 2021: Holistic Evaluation of Audio Representations

13/4/2022

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.
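
To give a flavor of how submissions work, each entry is a Python module exposing a small common embedding API that the evaluation harness calls for every task. The sketch below shows roughly what such a module looks like; the function and attribute names follow my recollection of the HEAR API (check the challenge website below for the authoritative spec), and the "model" is a trivial placeholder rather than a real embedding network:

import torch

class TinyModel(torch.nn.Module):
    sample_rate = 16000            # the harness resamples audio to this rate
    scene_embedding_size = 64
    timestamp_embedding_size = 64

def load_model(model_file_path: str = "") -> torch.nn.Module:
    return TinyModel()

def get_scene_embeddings(audio: torch.Tensor, model) -> torch.Tensor:
    # audio: (n_sounds, n_samples) -> one embedding per clip.
    frames = audio.unfold(1, 1024, 512)                  # (n_sounds, n_frames, 1024)
    return frames.abs().mean(dim=1)[:, :model.scene_embedding_size]

def get_timestamp_embeddings(audio: torch.Tensor, model):
    # audio: (n_sounds, n_samples) -> framewise embeddings plus timestamps in ms.
    frames = audio.unfold(1, 1024, 512)
    emb = frames.abs()[:, :, :model.timestamp_embedding_size]
    hop_ms = 512 / model.sample_rate * 1000.0
    ts = torch.arange(emb.shape[1]) * hop_ms
    return emb, ts.expand(emb.shape[0], -1)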

For more information please see our paper:

HEAR 2021: Holistic Evaluation of Audio Representations
J. Turian, J. Shier, H. Raj Khan, B. Raj, B.W. Schuller, C.J. Steinmetz, C. Malloy, G. Tzanetakis, G. Velarde, K. McNally, M. Henry, N. Pinto, C. Noufi, C. Clough, D. Herremans, E. Fonseca, J. Engel, J. Salamon, P. Esling, P. Manocha, S. Watanabe, Z. Jin, Y. Bisk
Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track, 2022.
[arXiv]

You can also visit the challenge website:
http://neuralaudio.ai/hear.html

Deep Embeddings and Section Fusion Improve Music Segmentation

8/11/2021

Music segmentation algorithms identify the structure of a music recording by automatically dividing it into sections and determining which sections repeat and when. Since the desired granularity of the sections may vary by application, multi-level segmentation produces several levels of segmentation ordered by granularity from one section (the whole song) up to N unique sections, and has proven to be a challenging MIR task. In this work we propose a multi-level segmentation method that leverages deep audio embeddings learned via other tasks. Our approach builds on an existing multi-level segmentation algorithm, replacing manually engineered features with deep embeddings learned through audio classification problems where data are abundant. Additionally, we propose a novel section fusion algorithm that leverages the multi-level segmentation to consolidate short segments at each level in a way that is consistent with the segmentations at lower levels. Through a series of experiments we show that replacing handcrafted features with deep embeddings can lead to significant improvements in multi-level music segmentation performance, and that section fusion further improves the results by cleaning up spurious short sections. We compare our approach to two strong baselines and show that it yields state-of-the-art results.
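
To illustrate the first ingredient (swapping handcrafted features for deep embeddings), here is a toy sketch that extracts OpenL3 embeddings and clusters the frames into contiguous sections at several granularities. It assumes the openl3, librosa, scipy and scikit-learn packages, uses a hypothetical input file, and is only a stand-in for the actual algorithm in the paper; in particular it omits the section fusion step:

import numpy as np
import librosa
import openl3
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

audio, sr = librosa.load("song.wav", sr=None, mono=True)   # hypothetical input file
emb, ts = openl3.get_audio_embedding(audio, sr, content_type="music",
                                     embedding_size=512, hop_size=0.5)

# Only adjacent frames may merge, so every cluster is a contiguous section in time.
n = emb.shape[0]
connectivity = diags([np.ones(n - 1), np.ones(n - 1)], offsets=[-1, 1])

# One labeling per granularity, from one section (the whole song) up to ten sections.
levels = {}
for k in range(1, 11):
    labels = AgglomerativeClustering(n_clusters=k,
                                     connectivity=connectivity).fit_predict(emb)
    levels[k] = list(zip(ts, labels))   # (timestamp, section id) per embedding frame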

To learn more please see our paper:

Deep Embeddings and Section Fusion Improve Music Segmentation
J. Salamon, O. Nieto, N. J. Bryan
In 22nd International Society for Music Information Retrieval Conference (ISMIR), Nov. 2021.
[ISMIR][PDF][BibTeX]

Who calls the shots? Rethinking few-shot learning for audio

20/10/2021

Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, they have often focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique properties such as polyphony and signal-to-noise ratios (SNR). This leads to unanswered questions concerning the impact such audio properties may have on few-shot learning system design, performance, and human-computer interaction, as it is typically up to the user to collect and provide inference-time support set examples. We address these questions through a series of experiments designed to elucidate them. We introduce two novel datasets, FSD-MIX-CLIPS and FSD-MIX-SED, whose programmatic generation allows us to explore these questions systematically. Our experiments lead to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, and support set selection criterion. Rather, it depends on the expected application scenario.
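
To make the setup concrete, here is a minimal prototypical-style sketch of inference-time few-shot prediction from a user-provided support set, adapted to the multi-label case with an independent sigmoid per class. It is illustrative only and not one of the models evaluated in the paper; the embedding network is a placeholder:

import torch
import torch.nn.functional as F

def few_shot_predict(embed, support_audio, support_labels, query_audio, tau=10.0):
    # embed: frozen audio embedding network, (n, n_samples) -> (n, dim)
    # support_audio: (n_support, n_samples); support_labels: (n_support, n_classes) multi-hot
    s = F.normalize(embed(support_audio), dim=-1)
    q = F.normalize(embed(query_audio), dim=-1)
    # One prototype per novel class: the mean of the support embeddings labeled with it.
    counts = support_labels.sum(dim=0).clamp(min=1).unsqueeze(-1)
    prototypes = F.normalize(support_labels.T @ s / counts, dim=-1)
    # Multi-label: independent sigmoid over scaled cosine similarity to each prototype.
    return torch.sigmoid(tau * q @ prototypes.T)          # (n_query, n_classes)

# Toy usage: 5 novel classes, 15 support clips, 4 query clips, placeholder embedding.
embed = torch.nn.Linear(16000, 128)
probs = few_shot_predict(embed, torch.randn(15, 16000),
                         torch.randint(0, 2, (15, 5)).float(), torch.randn(4, 16000))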

To learn more please read our paper:

Who calls the shots? Rethinking few-shot learning for audio
Y. Wang, N. Bryan, J. Salamon, M. Cartwright, J.P. Bello
In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.
[PDF][Copyright][arXiv]

You can access our code and the new FSD-MIX-CLIPS and FSD-MIX-SED datasets via our GitHub page:
https://github.com/wangyu/rethink-audio-fsl

UPDATE: our paper has been awarded a WASPAA 2021 Special Best Paper Award!

Invited talk at CVPR 2021 Sight and Sound workshop: On the road to artistic AudioVisual representation learning

24/6/2021

This past Sunday I had the pleasure of giving an invited talk at the CVPR 2021 Sight and Sound workshop.

In my talk, titled "On the road to artistic AudioVisual representation learning", I gave a selective overview of my research journey through audio and video representation learning (supervised and self-supervised), which has led me to some of the questions I'm interested in now on the topic of artistic AudioVisual representation learning.

I tried to keep the talk light, fun, and full of demos :)

The complete talk is now available online.

What's all the FUSS about Free Universal Sound Separation Data?

9/6/2021

We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.8 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
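
Since the headline numbers are reported as SI-SNR and SI-SNRi, here is a small numpy reference implementation of the standard definitions (a sketch of the metric itself, not code from the FUSS toolkit):

import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    # Scale-invariant SNR: project the estimate onto the reference and compare
    # the energy of that "target" component with the energy of the residual.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

def si_snr_improvement(estimate, reference, mixture):
    # SI-SNRi: how much better the separated estimate is than the unprocessed mixture.
    return si_snr(estimate, reference) - si_snr(mixture, reference)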

What’s all the Fuss about Free Universal Sound Separation Data?
S. Wisdom, H. Erdogan, D.P.W. Ellis, R. Serizel, N. Turpault, E. Fonseca, J. Salamon, P. Seetharaman, J.R. Hershey
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 186 - 190, 2021
[IEEE][PDF][Copyright]

Few-Shot Continual Learning for Audio Classification

8/6/2021

Supervised learning for audio classification typically imposes a fixed class vocabulary, which can be limiting for real-world applications where the target class vocabulary is not known a priori or changes dynamically. In this work, we introduce a few-shot continual learning framework for audio classification, where we can continuously expand a trained base classifier to recognize novel classes based on only a few labeled examples at inference time. This enables fast and interactive model updates by end-users with minimal human effort. To do so, we leverage the dynamic few-shot learning technique and adapt it to a challenging multi-label audio classification scenario. We incorporate a recent state-of-the-art audio feature extraction model as a backbone and perform a comparative analysis of our approach on two popular audio datasets (ESC-50 and AudioSet). We conduct an in-depth evaluation to illustrate the complexities of the problem and show that, while there is still room for improvement, our method outperforms three baselines on novel class detection while maintaining its performance on base classes.
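
The core mechanic, growing a trained classifier with weights for a novel class computed from just a few examples, can be sketched roughly as follows; the backbone, the simple prototype-based weight generator, and all shapes are placeholders rather than our implementation:

import torch
import torch.nn.functional as F

class ExpandableClassifier(torch.nn.Module):
    def __init__(self, backbone, base_weights):
        super().__init__()
        self.backbone = backbone     # frozen audio feature extractor, (n, samples) -> (n, dim)
        self.weights = base_weights  # (n_base_classes, dim), assumed L2-normalized

    def add_novel_class(self, support_audio):
        # Simplest possible weight generator: the normalized mean support embedding.
        emb = F.normalize(self.backbone(support_audio), dim=-1)
        proto = F.normalize(emb.mean(dim=0, keepdim=True), dim=-1)
        self.weights = torch.cat([self.weights, proto], dim=0)   # base + 1 novel class

    def forward(self, audio, tau=10.0):
        feat = F.normalize(self.backbone(audio), dim=-1)
        return torch.sigmoid(tau * feat @ self.weights.T)        # multi-label probabilities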

Few-Shot Continual Learning for Audio Classification
Y. Wang, N.J. Bryan, M. Cartwright, J.P. Bello, J. Salamon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 321 - 325, 2021
[IEEE][PDF][Copyright]

2020 IEEE SPS Signal Processing Letters Best Paper Award

14/1/2021

I'm thrilled to report that our paper Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification has won a 2020 IEEE SPS Signal Processing Letters Best Paper Award.

The paper appeared in the March 2017 issue of the IEEE SPL journal. It was honored for its "exceptional merit and broad interest on a subject related to the Society's technical scope." (To be eligible for consideration, an article must have appeared in Signal Processing Letters within a five-year window.) Here's the NYU press release about the award.
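
For context, the key ingredient of the paper is label-preserving audio data augmentation, such as time stretching and pitch shifting, to expand scarce environmental sound training data. Here is a minimal librosa sketch of the idea, with illustrative parameter values and a hypothetical input clip rather than the exact settings from the paper:

import librosa

y, sr = librosa.load("siren.wav", sr=None)                 # hypothetical labeled clip

augmented = [
    librosa.effects.time_stretch(y, rate=0.9),             # slightly slower
    librosa.effects.time_stretch(y, rate=1.1),             # slightly faster
    librosa.effects.pitch_shift(y, sr=sr, n_steps=2),      # up two semitones
    librosa.effects.pitch_shift(y, sr=sr, n_steps=-2),     # down two semitones
]
# Each deformed clip keeps the original label and is added to the training set.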

The research was conducted in the context of the Sounds of New York City (SONYC) project.

The paper represents the culmination of several years of research on urban sound, and builds on our earlier work in this area including:
  • A Dataset and Taxonomy for Urban Sound Research
    J. Salamon, C. Jacoby and J. P. Bello, 2014
  • Unsupervised Feature Learning for Urban Sound Classification
    J. Salamon and J. P. Bello, 2015
  • Feature Learning with Deep Scattering for Urban Sound Analysis
    J. Salamon and J. P. Bello, 2015
  • The Implementation of Low-cost Urban Acoustic Monitoring Devices
    C. Mydlarz, J. Salamon and J. P. Bello, 2016

I want to express my sincere gratitude to my colleague and mentor Prof. Juan P. Bello of NYU for the years of fruitful collaboration that led to this work and made it possible.

You can read the paper here:

Deep Convolutional Neural Networks and Data Augmentation For Environmental Sound Classification
J. Salamon and J. P. Bello
IEEE Signal Processing Letters, 24(3), pages 279 - 283, 2017.
[IEEE][PDF][BibTeX][Copyright]
Awkward screenshot of me talking about the work from the SONYC video

Controllable Neural Prosody Synthesis

25/10/2020

Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a real or synthesized speech recording, our model allows a user to input prosody constraints for certain time frames and generates the remaining time frames from input text and contextual prosody. We also propose a pitch-shifting neural vocoder to modify input speech to match the synthesized prosody. Through objective and subjective evaluations we show that we can successfully incorporate user control into our prosody generation model without sacrificing the overall naturalness of the synthesized speech.
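
The user-control interface can be pictured as a partially observed prosody sequence plus a mask marking which frames the user has pinned; everything else is left for the generator to fill in. The toy sketch below only illustrates that conditioning format, with placeholder shapes and a dummy generator rather than the model from the paper:

import torch

n_frames = 200
user_pitch = torch.zeros(n_frames)   # user-specified pitch values in Hz (0 = unconstrained)
mask = torch.zeros(n_frames)         # 1 where the user pinned a value
user_pitch[50:80] = 220.0            # e.g., force a higher emphasis on frames 50-80
mask[50:80] = 1.0

def generate_prosody(text_features, user_pitch, mask):
    # Dummy generator: keep the user's constrained frames and fill the rest with a
    # flat contour; a real model predicts the unconstrained frames from text and context.
    fill = torch.full_like(user_pitch, 180.0)
    return mask * user_pitch + (1 - mask) * fill

pitch_contour = generate_prosody(None, user_pitch, mask)   # (n_frames,) in Hz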

Max presented the work at INTERSPEECH 2020.

To hear more examples please visit Max's paper companion website.

For further details please read our paper:

Controllable Neural Prosody Synthesis
M. Morrison, Z. Jin, J. Salamon, N.J. Bryan, G. Mysore
Proc. Interspeech. October 2020.
[INTERSPEECH][PDF][arXiv]

Metric Learning vs Classification for Disentangled Music Representation Learning

13/10/2020

Deep representation learning offers a powerful paradigm for mapping input data onto an organized embedding space and is useful for many music information retrieval tasks. Two central methods for representation learning include deep metric learning and classification, both having the same goal of learning a representation that can generalize well across tasks. Along with generalization, the emerging concept of disentangled representations is also of great interest, where multiple semantic concepts (e.g., genre, mood, instrumentation) are learned jointly but remain separable in the learned representation space. In this paper we present a single representation learning framework that elucidates the relationship between metric learning, classification, and disentanglement in a holistic manner. For this, we (1) outline past work on the relationship between metric learning and classification, (2) extend this relationship to multi-label data by exploring three different learning approaches and their disentangled versions, and (3) evaluate all models on four tasks (training time, similarity retrieval, auto-tagging, and triplet prediction). We find that classification-based models are generally advantageous for training time, similarity retrieval, and auto-tagging, while deep metric learning exhibits better performance for triplet prediction. Finally, we show that our proposed approach yields state-of-the-art results for music auto-tagging.
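
To make the comparison concrete, here is a small sketch of the two training signals applied to the same embedding, plus a disentangled triplet variant in which each semantic concept (e.g., genre, mood) owns a slice of the embedding and distances for a triplet are computed only on the slice of the concept it was sampled for. The dimensions, slices and heads are placeholders, not the exact models from the paper:

import torch
import torch.nn.functional as F

emb_dim, n_tags = 256, 50
concept_slices = {"genre": slice(0, 128), "mood": slice(128, 256)}   # assumed split

def classification_loss(embedding, tag_head, tags):
    # Multi-label auto-tagging: binary cross-entropy over tag logits.
    return F.binary_cross_entropy_with_logits(tag_head(embedding), tags)

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def disentangled_triplet_loss(anchor, positive, negative, concept, margin=0.2):
    # Compare only the sub-space dedicated to the concept this triplet encodes.
    s = concept_slices[concept]
    return triplet_loss(anchor[:, s], positive[:, s], negative[:, s], margin)

# e.g., tag_head = torch.nn.Linear(emb_dim, n_tags); both losses consume the same
# (batch, emb_dim) embeddings produced by whatever encoder is being trained.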

To learn more please read our paper:

Metric Learning vs Classification for Disentangled Music Representation Learning
J. Lee, N. J. Bryan, J. Salamon, Z. Jin, J. Nam
In 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (virtual), Oct. 2020.

You can find more related materials including a short video presentation on our companion website:

https://jongpillee.github.io/metric-vs-classification/