Justin Salamon

Audio-Text Models Do Not Yet Leverage Natural Language

27/3/2023

Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In this work, we show that state-of-the-art audio-text models do not yet really understand natural language, especially contextual concepts such as sequential or concurrent ordering of sound events. Our results suggest that existing benchmarks are not sufficient to assess these models’ capabilities to match complex contexts from the audio and text modalities. We propose a Transformer-based architecture and show that, unlike prior work, it is capable of modeling the sequential relationship between sound events in the text and audio, given appropriate benchmark data. We advocate for the collection or generation of additional, diverse data to allow future research to fully leverage natural language for audio-text modeling.
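
For readers new to this area, the training recipe behind most audio-text models is a symmetric contrastive objective: matching (audio, caption) pairs are pulled together in a shared embedding space while mismatched pairs are pushed apart. Here is a minimal PyTorch sketch of that generic objective; the embeddings are random placeholders and this is not the architecture proposed in the paper:

import torch
import torch.nn.functional as F

def audio_text_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb, text_emb: (batch, dim) embeddings of paired clips and captions.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                       # (batch, batch) similarity matrix
    targets = torch.arange(a.shape[0], device=a.device)  # matching pairs sit on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)          # audio -> text retrieval direction
    loss_t2a = F.cross_entropy(logits.T, targets)        # text -> audio retrieval direction
    return 0.5 * (loss_a2t + loss_t2a)

# Toy usage with random placeholder embeddings:
loss = audio_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))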

Audio-Text Models Do Not Yet Leverage Natural Language
H-H. Wu, O. Nieto, J.P. Bello, J. Salamon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
[PDF][arXiv][Copyright]

HEAR 2021: Holistic Evaluation of Audio Representations

13/4/2022

What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR 2021 NeurIPS challenge is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR 2021 evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. In the spirit of shared exchange, each participant submitted an audio embedding model following a common API that is general-purpose, open-source, and freely available to use. Twenty-nine models by thirteen external teams were evaluated on nineteen diverse downstream tasks derived from sixteen datasets. Open evaluation code, submitted models and datasets are key contributions, enabling comprehensive and reproducible evaluation, as well as previously impossible longitudinal studies. It still remains an open question whether one single general-purpose audio representation can perform as holistically as the human ear.
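
To give a flavor of how submissions work, each entry is a Python module exposing a small common embedding API that the evaluation harness calls for every task. The sketch below shows roughly what such a module looks like; the function and attribute names follow my recollection of the HEAR API (check the challenge website below for the authoritative spec), and the "model" is a trivial placeholder rather than a real embedding network:

import torch

class TinyModel(torch.nn.Module):
    sample_rate = 16000            # the harness resamples audio to this rate
    scene_embedding_size = 64
    timestamp_embedding_size = 64

def load_model(model_file_path: str = "") -> torch.nn.Module:
    return TinyModel()

def get_scene_embeddings(audio: torch.Tensor, model) -> torch.Tensor:
    # audio: (n_sounds, n_samples) -> one embedding per clip.
    frames = audio.unfold(1, 1024, 512)                  # (n_sounds, n_frames, 1024)
    return frames.abs().mean(dim=1)[:, :model.scene_embedding_size]

def get_timestamp_embeddings(audio: torch.Tensor, model):
    # audio: (n_sounds, n_samples) -> framewise embeddings plus timestamps in ms.
    frames = audio.unfold(1, 1024, 512)
    emb = frames.abs()[:, :, :model.timestamp_embedding_size]
    hop_ms = 512 / model.sample_rate * 1000.0
    ts = torch.arange(emb.shape[1]) * hop_ms
    return emb, ts.expand(emb.shape[0], -1)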

For more information please see our paper:

HEAR 2021: Holistic Evaluation of Audio Representations
J. Turian, J. Shier, H. Raj Khan, B. Raj, B.W. Schuller, C.J. Steinmetz, C. Malloy, G. Tzanetakis, G. Velarde, K. McNally, M. Henry, N. Pinto, C. Noufi, C. Clough, D. Herremans, E. Fonseca, J. Engel, J. Salamon, P. Esling, P. Manocha, S. Watanabe, Z. Jin, Y. Bisk
Proceedings of Machine Learning Research (PMLR): NeurIPS 2021 Competition Track, 2022.
[arXiv]

You can also visit the challenge website:
http://neuralaudio.ai/hear.html

Deep Embeddings and Section Fusion Improve Music Segmentation

8/11/2021

Music segmentation algorithms identify the structure of a music recording by automatically dividing it into sections and determining which sections repeat and when. Since the desired granularity of the sections may vary by application, multi-level segmentation produces several levels of segmentation ordered by granularity from one section (the whole song) up to N unique sections, and has proven to be a challenging MIR task. In this work we propose a multi-level segmentation method that leverages deep audio embeddings learned via other tasks. Our approach builds on an existing multi-level segmentation algorithm, replacing manually engineered features with deep embeddings learned through audio classification problems where data are abundant. Additionally, we propose a novel section fusion algorithm that leverages the multi-level segmentation to consolidate short segments at each level in a way that is consistent with the segmentations at lower levels. Through a series of experiments we show that replacing handcrafted features with deep embeddings can lead to significant improvements in multi-level music segmentation performance, and that section fusion further improves the results by cleaning up spurious short sections. We compare our approach to two strong baselines and show that it yields state-of-the-art results.
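
To illustrate the first ingredient (swapping handcrafted features for deep embeddings), here is a toy sketch that extracts OpenL3 embeddings and clusters the frames into contiguous sections at several granularities. It assumes the openl3, librosa, scipy and scikit-learn packages, uses a hypothetical input file, and is only a stand-in for the actual algorithm in the paper; in particular it omits the section fusion step:

import numpy as np
import librosa
import openl3
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

audio, sr = librosa.load("song.wav", sr=None, mono=True)   # hypothetical input file
emb, ts = openl3.get_audio_embedding(audio, sr, content_type="music",
                                     embedding_size=512, hop_size=0.5)

# Only adjacent frames may merge, so every cluster is a contiguous section in time.
n = emb.shape[0]
connectivity = diags([np.ones(n - 1), np.ones(n - 1)], offsets=[-1, 1])

# One labeling per granularity, from one section (the whole song) up to ten sections.
levels = {}
for k in range(1, 11):
    labels = AgglomerativeClustering(n_clusters=k,
                                     connectivity=connectivity).fit_predict(emb)
    levels[k] = list(zip(ts, labels))   # (timestamp, section id) per embedding frame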

To learn more please see our paper:

Deep Embeddings and Section Fusion Improve Music Segmentation
J. Salamon, O. Nieto, N. J. Bryan
In 22nd International Society for Music Information Retrieval Conference (ISMIR), Nov. 2021.
[ISMIR][PDF][BibTeX]

Who calls the shots? Rethinking few-shot learning for audio

20/10/2021

Few-shot learning aims to train models that can recognize novel classes given just a handful of labeled examples, known as the support set. While the field has seen notable advances in recent years, they have often focused on multi-class image classification. Audio, in contrast, is often multi-label due to overlapping sounds, resulting in unique properties such as polyphony and signal-to-noise ratios (SNR). This leads to unanswered questions concerning the impact such audio properties may have on few-shot learning system design, performance, and human-computer interaction, as it is typically up to the user to collect and provide inference-time support set examples. We address these questions through a series of experiments designed to elucidate them. We introduce two novel datasets, FSD-MIX-CLIPS and FSD-MIX-SED, whose programmatic generation allows us to explore these questions systematically. Our experiments lead to audio-specific insights on few-shot learning, some of which are at odds with recent findings in the image domain: there is no best one-size-fits-all model, method, and support set selection criterion. Rather, it depends on the expected application scenario.
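
To make the setup concrete, here is a minimal prototypical-style sketch of inference-time few-shot prediction from a user-provided support set, adapted to the multi-label case with an independent sigmoid per class. It is illustrative only and not one of the models evaluated in the paper; the embedding network is a placeholder:

import torch
import torch.nn.functional as F

def few_shot_predict(embed, support_audio, support_labels, query_audio, tau=10.0):
    # embed: frozen audio embedding network, (n, n_samples) -> (n, dim)
    # support_audio: (n_support, n_samples); support_labels: (n_support, n_classes) multi-hot
    s = F.normalize(embed(support_audio), dim=-1)
    q = F.normalize(embed(query_audio), dim=-1)
    # One prototype per novel class: the mean of the support embeddings labeled with it.
    counts = support_labels.sum(dim=0).clamp(min=1).unsqueeze(-1)
    prototypes = F.normalize(support_labels.T @ s / counts, dim=-1)
    # Multi-label: independent sigmoid over scaled cosine similarity to each prototype.
    return torch.sigmoid(tau * q @ prototypes.T)          # (n_query, n_classes)

# Toy usage: 5 novel classes, 15 support clips, 4 query clips, placeholder embedding.
embed = torch.nn.Linear(16000, 128)
probs = few_shot_predict(embed, torch.randn(15, 16000),
                         torch.randint(0, 2, (15, 5)).float(), torch.randn(4, 16000))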

To learn more please read our paper:

Who calls the shots? Rethinking few-shot learning for audio
Y. Wang, N. Bryan, J. Salamon, M. Cartwright, J.P. Bello
In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021.
[PDF][Copyright][arXiv]

You can access our code and the new FSD-MIX-CLIPS and FSD-MIX-SED datasets via our GitHub page:
https://github.com/wangyu/rethink-audio-fsl

UPDATE: our paper has been awarded a WASPAA 2021 Special Best Paper Award!

Invited talk at CVPR 2021 Sight and Sound workshop: On the road to artistic AudioVisual representation learning

24/6/2021

This past Sunday I had the pleasure of giving an invited talk at the CVPR 2021 Sight and Sound workshop.

In my talk, titled "On the road to artistic AudioVisual representation learning", I gave a selective overview of my research journey through audio and video representation learning (supervised and self-supervised), which has led me to some of the questions I'm interested in now on the topic of artistic AudioVisual representation learning.

I tried to keep the talk light, fun, and full of demos :)

The complete talk is now available online.

What's all the FUSS about Free Universal Sound Separation Data?

9/6/2021

We introduce the Free Universal Sound Separation (FUSS) dataset, a new corpus for experiments in separating mixtures of an unknown number of sounds from an open domain of sound types. The dataset consists of 23 hours of single-source audio data drawn from 357 classes, which are used to create mixtures of one to four sources. To simulate reverberation, an acoustic room simulator is used to generate impulse responses of box-shaped rooms with frequency-dependent reflective walls. Additional open-source data augmentation tools are also provided to produce new mixtures with different combinations of sources and room simulations. Finally, we introduce an open-source baseline separation model, based on an improved time-domain convolutional network (TDCN++), that can separate a variable number of sources in a mixture. This model achieves 9.8 dB of scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources, while reconstructing single-source inputs with 35.8 dB absolute SI-SNR. We hope this dataset will lower the barrier to new research and allow for fast iteration and application of novel techniques from other machine learning domains to the sound separation challenge.
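
Since the headline numbers are reported as SI-SNR and SI-SNRi, here is a small numpy reference implementation of the standard definitions (a sketch of the metric itself, not code from the FUSS toolkit):

import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    # Scale-invariant SNR: project the estimate onto the reference and compare
    # the energy of that "target" component with the energy of the residual.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))

def si_snr_improvement(estimate, reference, mixture):
    # SI-SNRi: how much better the separated estimate is than the unprocessed mixture.
    return si_snr(estimate, reference) - si_snr(mixture, reference)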

What’s all the Fuss about Free Universal Sound Separation Data?
S. Wisdom, H. Erdogan, D.P.W. Ellis, R. Serizel, N. Turpault, E. Fonseca, J. Salamon, P. Seetharaman, J.R. Hershey
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 186 - 190, 2021
[IEEE][PDF][Copyright]

Few-Shot Continual Learning for Audio Classification

8/6/2021

Supervised learning for audio classification typically imposes a fixed class vocabulary, which can be limiting for real-world applications where the target class vocabulary is not known a priori or changes dynamically. In this work, we introduce a few-shot continual learning framework for audio classification, where we can continuously expand a trained base classifier to recognize novel classes based on only a few labeled examples at inference time. This enables fast and interactive model updates by end-users with minimal human effort. To do so, we leverage the dynamic few-shot learning technique and adapt it to a challenging multi-label audio classification scenario. We incorporate a recent state-of-the-art audio feature extraction model as a backbone and perform a comparative analysis of our approach on two popular audio datasets (ESC-50 and AudioSet). We conduct an in-depth evaluation to illustrate the complexities of the problem and show that, while there is still room for improvement, our method outperforms three baselines on novel class detection while maintaining its performance on base classes.
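
The core mechanic, growing a trained classifier with weights for a novel class computed from just a few examples, can be sketched roughly as follows; the backbone, the simple prototype-based weight generator, and all shapes are placeholders rather than our implementation:

import torch
import torch.nn.functional as F

class ExpandableClassifier(torch.nn.Module):
    def __init__(self, backbone, base_weights):
        super().__init__()
        self.backbone = backbone     # frozen audio feature extractor, (n, samples) -> (n, dim)
        self.weights = base_weights  # (n_base_classes, dim), assumed L2-normalized

    def add_novel_class(self, support_audio):
        # Simplest possible weight generator: the normalized mean support embedding.
        emb = F.normalize(self.backbone(support_audio), dim=-1)
        proto = F.normalize(emb.mean(dim=0, keepdim=True), dim=-1)
        self.weights = torch.cat([self.weights, proto], dim=0)   # base + 1 novel class

    def forward(self, audio, tau=10.0):
        feat = F.normalize(self.backbone(audio), dim=-1)
        return torch.sigmoid(tau * feat @ self.weights.T)        # multi-label probabilities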

Few-Shot Continual Learning for Audio Classification
Y. Wang, N.J. Bryan, M. Cartwright, J.P. Bello, J. Salamon
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 321 - 325, 2021
[IEEE][PDF][Copyright]

2020 IEEE SPS Signal Processing Letters Best Paper Award

14/1/2021

I'm thrilled to report that our paper Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification has won a 2020 IEEE SPS Signal Processing Letters Best Paper Award.

The paper appeared in the March 2017 issue of the IEEE SPL journal. It was honored for its "exceptional merit and broad interest on a subject related to the Society's technical scope." (To be eligible for consideration, an article must have appeared in Signal Processing Letters within a five-year window.) Here's the NYU press release about the award.
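
For context, the key ingredient of the paper is label-preserving audio data augmentation, such as time stretching and pitch shifting, to expand scarce environmental sound training data. Here is a minimal librosa sketch of the idea, with illustrative parameter values and a hypothetical input clip rather than the exact settings from the paper:

import librosa

y, sr = librosa.load("siren.wav", sr=None)                 # hypothetical labeled clip

augmented = [
    librosa.effects.time_stretch(y, rate=0.9),             # slightly slower
    librosa.effects.time_stretch(y, rate=1.1),             # slightly faster
    librosa.effects.pitch_shift(y, sr=sr, n_steps=2),      # up two semitones
    librosa.effects.pitch_shift(y, sr=sr, n_steps=-2),     # down two semitones
]
# Each deformed clip keeps the original label and is added to the training set.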

The research was conducted in the context of the Sounds of New York City (SONYC) project.

The paper represents the culmination of several years of research on urban sound, and builds on our earlier work in this area including:
  • A Dataset and Taxonomy for Urban Sound Research
    J. Salamon, C. Jacoby and J. P. Bello, 2014
  • Unsupervised Feature Learning for Urban Sound Classification
    J. Salamon and J. P. Bello, 2015
  • Feature Learning with Deep Scattering for Urban Sound Analysis
    J. Salamon and J. P. Bello, 2015
  • The Implementation of Low-cost Urban Acoustic Monitoring Devices
    C. Mydlarz, J. Salamon and J. P. Bello, 2016

I want to express my sincere gratitude to my colleague and mentor Prof. Juan P. Bello of NYU for the years of fruitful collaboration that led to this work and made it possible.

You can read the paper here:

Deep Convolutional Neural Networks and Data Augmentation For Environmental Sound Classification
J. Salamon and J. P. Bello
IEEE Signal Processing Letters, 24(3), pages 279 - 283, 2017.
[IEEE][PDF][BibTeX][Copyright]
Awkward screenshot of me talking about the work from the SONYC video

Controllable Neural Prosody Synthesis

25/10/2020

Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a real or synthesized speech recording, our model allows a user to input prosody constraints for certain time frames and generates the remaining time frames from input text and contextual prosody. We also propose a pitch-shifting neural vocoder to modify input speech to match the synthesized prosody. Through objective and subjective evaluations we show that we can successfully incorporate user control into our prosody generation model without sacrificing the overall naturalness of the synthesized speech.
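
The user-control interface can be pictured as a partially observed prosody sequence plus a mask marking which frames the user has pinned; everything else is left for the generator to fill in. The toy sketch below only illustrates that conditioning format, with placeholder shapes and a dummy generator rather than the model from the paper:

import torch

n_frames = 200
user_pitch = torch.zeros(n_frames)   # user-specified pitch values in Hz (0 = unconstrained)
mask = torch.zeros(n_frames)         # 1 where the user pinned a value
user_pitch[50:80] = 220.0            # e.g., force a higher emphasis on frames 50-80
mask[50:80] = 1.0

def generate_prosody(text_features, user_pitch, mask):
    # Dummy generator: keep the user's constrained frames and fill the rest with a
    # flat contour; a real model predicts the unconstrained frames from text and context.
    fill = torch.full_like(user_pitch, 180.0)
    return mask * user_pitch + (1 - mask) * fill

pitch_contour = generate_prosody(None, user_pitch, mask)   # (n_frames,) in Hz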

Max presented the work at INTERSPEECH 2020.

To hear more examples please visit Max's paper companion website.

For further details please read our paper:

Controllable Neural Prosody Synthesis
M. Morrison, Z. Jin, J. Salamon, N.J. Bryan, G. Mysore
Proc. Interspeech. October 2020.
[INTERSPEECH][PDF][arXiv]

Metric Learning vs Classification for Disentangled Music Representation Learning

13/10/2020

Deep representation learning offers a powerful paradigm for mapping input data onto an organized embedding space and is useful for many music information retrieval tasks. Two central methods for representation learning include deep metric learning and classification, both having the same goal of learning a representation that can generalize well across tasks. Along with generalization, the emerging concept of disentangled representations is also of great interest, where multiple semantic concepts (e.g., genre, mood, instrumentation) are learned jointly but remain separable in the learned representation space. In this paper we present a single representation learning framework that elucidates the relationship between metric learning, classification, and disentanglement in a holistic manner. For this, we (1) outline past work on the relationship between metric learning and classification, (2) extend this relationship to multi-label data by exploring three different learning approaches and their disentangled versions, and (3) evaluate all models on four tasks (training time, similarity retrieval, auto-tagging, and triplet prediction). We find that classification-based models are generally advantageous for training time, similarity retrieval, and auto-tagging, while deep metric learning exhibits better performance for triplet prediction. Finally, we show that our proposed approach yields state-of-the-art results for music auto-tagging.
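
To make the comparison concrete, here is a small sketch of the two training signals applied to the same embedding, plus a disentangled triplet variant in which each semantic concept (e.g., genre, mood) owns a slice of the embedding and distances for a triplet are computed only on the slice of the concept it was sampled for. The dimensions, slices and heads are placeholders, not the exact models from the paper:

import torch
import torch.nn.functional as F

emb_dim, n_tags = 256, 50
concept_slices = {"genre": slice(0, 128), "mood": slice(128, 256)}   # assumed split

def classification_loss(embedding, tag_head, tags):
    # Multi-label auto-tagging: binary cross-entropy over tag logits.
    return F.binary_cross_entropy_with_logits(tag_head(embedding), tags)

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def disentangled_triplet_loss(anchor, positive, negative, concept, margin=0.2):
    # Compare only the sub-space dedicated to the concept this triplet encodes.
    s = concept_slices[concept]
    return triplet_loss(anchor[:, s], positive[:, s], negative[:, s], margin)

# e.g., tag_head = torch.nn.Linear(emb_dim, n_tags); both losses consume the same
# (batch, emb_dim) embeddings produced by whatever encoder is being trained.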

To learn more please read our paper:

Metric Learning vs Classification for Disentangled Music Representation Learning
J. Lee, N. J. Bryan, J. Salamon, Z. Jin, J. Nam
In 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (virtual), Oct. 2020.

You can find more related materials including a short video presentation on our companion website:

https://jongpillee.github.io/metric-vs-classification/