News - Justin Salamon

Elected to the IEEE Audio and Acoustic Signal Processing Technical Committee

14/1/2020

I'm happy to report I've been elected to the IEEE Audio and Acoustic Signal Processing Technical Committee.

The AASP TC's mission is to support, nourish and lead scientific and technological development in all areas of audio and acoustic signal processing. These areas are currently seeing increased levels of interest and significant growth providing a fertile ground for a broad range of specific and interdisciplinary research and development. Ranging from array processing for microphones and loudspeakers to music genre classification, from psychoacoustics to machine learning, from consumer electronics devices to blue-sky research, this remit encompasses countless technical challenges and many hot topics. The TC numbers some 30 appointed volunteer members drawn roughly equally from leading academic and industrial organizations around the world, unified by the common aim to offer their expertise in the service of the scientific community.

Looking forward to doing my bit for this excellent scientific community!

SONYC-UST: A Multilabel Dataset from an Urban Acoustic Sensor Network

5/1/2020

SONYC Urban Sound Tagging (SONYC-UST) is a dataset for the development and evaluation of machine listening systems for real-world urban noise monitoring. It consists of 3068 audio recordings from the "Sounds of New York City" (SONYC) acoustic sensor network.

Via the Zooniverse citizen science platform, volunteers tagged the presence of 23 fine-grained classes that were chosen in consultation with the New York City Department of Environmental Protection. These 23 fine-grained classes can be grouped into eight coarse-grained classes:

The SONYC-UST taxonomy
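The fine-to-coarse grouping can be pictured as a simple lookup table. The sketch below is illustrative only: the class names are a small assumed subset rather than the full taxonomy, and the helper functions are hypothetical, not part of any SONYC-UST tooling.

```python
# Illustrative subset of a fine-to-coarse mapping (assumed names; see the
# paper for the actual 23 fine-grained and 8 coarse-grained classes).
FINE_TO_COARSE = {
    "small-sounding-engine": "engine",
    "large-sounding-engine": "engine",
    "car-horn": "alert-signal",
    "siren": "alert-signal",
    "dog-barking-whining": "dog",
}


def coarse_tags(fine_tags):
    """Return the sorted coarse classes implied by a clip's fine-grained tags."""
    return sorted({FINE_TO_COARSE[tag] for tag in fine_tags})


def multi_hot(tags, classes):
    """Encode a collection of tags as a 0/1 vector over an ordered class list."""
    present = set(tags)
    return [1 if c in present else 0 for c in classes]


# A clip can carry several tags at once, which is what makes the task multilabel.
clip_fine = ["car-horn", "siren", "small-sounding-engine"]
clip_coarse = coarse_tags(clip_fine)
coarse_classes = sorted(set(FINE_TO_COARSE.values()))
print(clip_coarse)
print(multi_hot(clip_coarse, coarse_classes))
```

Because several fine-grained tags can map to the same coarse class, the coarse label vector is typically shorter and less sparse than the fine-grained one.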

For more details please see:

SONYC Urban Sound Tagging (SONYC-UST): A Multilabel Dataset from an Urban Acoustic Sensor Network
M. Cartwright, A. E. Mendez Mendez, J. Cramer, V. Lostanlen, G. Dove, H.-H. Wu, J. Salamon, O. Nov, and J.P. Bello
Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), pages 35-39, New York University, NY, USA, Oct. 2019.

Download SONYC-UST: https://doi.org/10.5281/zenodo.3338310

TriCycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision

4/11/2019

Self-supervised representation learning with deep neural networks is a powerful tool for machine learning tasks with limited labeled data but extensive unlabeled data. To learn representations, self-supervised models are typically trained on a pretext task to predict structure in the data (e.g. audio-visual correspondence, short-term temporal sequence, word sequence) that is indicative of higher-level concepts relevant to a target, downstream task. Sensor networks are promising yet unexplored sources of data for self-supervised learning - they collect large amounts of unlabeled yet timestamped data over extended periods of time and typically exhibit long-term temporal structure (e.g., over hours, months, years) not observable at the short time scales previously explored in self-supervised learning (e.g., seconds). This structure can be present even in single-modal data and therefore could be exploited for self-supervision in many types of sensor networks. In this work, we present a model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network. We then demonstrate the utility of the learned audio representation in an urban sound event detection task with limited labeled data.
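To make the idea of cyclic temporal structure concrete, here is a minimal sketch of how recording timestamps could be turned into cyclic prediction targets. The choice of cycles (day, week, year), the sine/cosine encoding and the function name are assumptions made for illustration; the paper describes the actual pretext task.

```python
import math
from datetime import datetime


def cyclic_targets(ts):
    """Map a timestamp to (sin, cos) phase pairs for three assumed cycles."""
    day_phase = (ts.hour + ts.minute / 60.0) / 24.0     # position within the day
    week_phase = ts.weekday() / 7.0                     # position within the week
    year_phase = (ts.timetuple().tm_yday - 1) / 365.0   # position within the year (leap days ignored)
    phases = {"day": day_phase, "week": week_phase, "year": year_phase}
    # Encoding the phase on the unit circle keeps 23:59 close to 00:00.
    return {
        name: (math.sin(2 * math.pi * p), math.cos(2 * math.pi * p))
        for name, p in phases.items()
    }


targets = cyclic_targets(datetime(2019, 1, 1, 6, 0))
print(targets["day"])
```

The timestamps come for free with the sensor data, so targets like these require no manual labeling.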

Read the full paper here:

TriCycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision
M. Cartwright, J. Cramer, J. Salamon, and J.P. Bello
In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2019.
[IEEE][PDF][BibTeX][Copyright]

Robust Sound Event Detection in Bioacoustic Sensor Networks

24/10/2019

The innovation: we present a context-adaptive deep network that uses an auxiliary sub-network to model the background environment to improve inference-time robustness to changing environments.

The results: The proposed model produces state-of-the-art results on flight call detection that are robust to environmental changes across time and space.

The surprise: Interestingly, we find that while context adaptation alone doesn't help significantly, and applying PCEN pre-processing doesn't help much either, applying both combined leads to dramatic gains.

The tech: We release BirdVoxDetect, an open-source tool for automatically detecting avian flight calls in continuous audio recordings: https://github.com/BirdVox/birdvoxdetect

Installing BirdVoxDetect (assuming Python is installed) is as easy as calling: pip install birdvoxdetect

Full paper:
Robust Sound Event Detection in Bioacoustic Sensor Networks
V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, and J.P. Bello
PLoS ONE 14(10): e0214168, 2019. DOI: https://doi.org/10.1371/journal.pone.0214168
[PLoS ONE][PDF][BibTeX]

Model block diagram:

Abstract:
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs), can record sounds of wildlife over long periods of time in scalable and minimally invasive ways. Deriving per-species abundance estimates from these sensors requires detection, classification, and quantification of animal vocalizations as individual acoustic events. Yet, variability in ambient noise, both over time and across sensors, hinders the reliability of current automated systems for sound event detection (SED), such as convolutional neural networks (CNN) in the time-frequency domain. In this article, we develop, benchmark, and combine several machine listening techniques to improve the generalizability of SED models across heterogeneous acoustic environments. As a case study, we consider the problem of detecting avian flight calls from a ten-hour recording of nocturnal bird migration, recorded by a network of six ARUs in the presence of heterogeneous background noise. Starting from a CNN yielding state-of-the-art accuracy on this task, we introduce two noise adaptation techniques, respectively integrating short-term (60 ms) and long-term (30 min) context. First, we apply per-channel energy normalization (PCEN) in the time-frequency domain, which applies short-term automatic gain control to every subband in the mel-frequency spectrogram. Secondly, we replace the last dense layer in the network by a context-adaptive neural network (CA-NN) layer, i.e. an affine layer whose weights are dynamically adapted at prediction time by an auxiliary network taking long-term summary statistics of spectrotemporal features as input. We show that PCEN reduces temporal overfitting across dawn vs. dusk audio clips whereas context adaptation on PCEN-based summary statistics reduces spatial overfitting across sensor locations. Moreover, combining them yields state-of-the-art results that are unmatched by artificial data augmentation alone.
We release a pre-trained version of our best performing system under the name of BirdVoxDetect, a ready-to-use detector of avian flight calls in field recordings.
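For readers unfamiliar with PCEN, the following is a minimal pure-Python sketch of the commonly used formulation: each subband energy E is divided by a smoothed version of itself, M, and then compressed, giving (E / (eps + M)^alpha + delta)^r - delta^r. The parameter values are illustrative defaults and the function name is hypothetical; this is not the implementation or configuration used in the paper or in BirdVoxDetect.

```python
def pcen(energy, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Apply PCEN to a spectrogram given as a list of frames,
    where each frame is a list of non-negative subband energies."""
    smoothed = list(energy[0])  # initialise the smoother with the first frame
    out = []
    for frame in energy:
        # First-order low-pass filter per subband: M[t] = (1 - s) * M[t-1] + s * E[t]
        smoothed = [(1 - s) * m + s * e for m, e in zip(smoothed, frame)]
        # Automatic gain control (division by M^alpha), then root compression
        out.append([
            (e / (eps + m) ** alpha + delta) ** r - delta ** r
            for e, m in zip(frame, smoothed)
        ])
    return out


quiet = [[1.0, 2.0], [1.5, 1.0], [0.5, 4.0]]
loud = [[100.0 * e for e in frame] for frame in quiet]

# With alpha = 1 the gain control is (almost) exactly invariant to a constant gain.
pcen_quiet = pcen(quiet, alpha=1.0)
pcen_loud = pcen(loud, alpha=1.0)
max_diff = max(
    abs(a - b)
    for fa, fb in zip(pcen_quiet, pcen_loud)
    for a, b in zip(fa, fb)
)
print(max_diff)
```

The near-zero difference between the quiet and loud versions illustrates why this kind of normalization can help when background noise levels vary across sensors and over time.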

Build a Web-Based Ukulele Tuner using CREPE

7/10/2019

The YouTube coding channel The Coding Train just published its latest Coding Challenge: "Ukulele Tuner with Machine Learning Pitch Detection Model", where they use our CREPE model to build a web-based ukulele tuner! Awesome!

They use CREPE via the ml5.js library which offers a variety of ML algorithms for the web. It was a delightful surprise to discover that CREPE is being used to power the pitchDetection function.
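The tuning arithmetic behind such a tool is simple once a pitch estimate is available. The sketch below is written in Python rather than the JavaScript used in the video, and covers only the step after pitch detection: it assumes standard ukulele tuning (G4, C4, E4, A4) with A4 = 440 Hz, and the function names are hypothetical.

```python
import math

# Standard ukulele tuning in Hz (assumed: equal temperament, A4 = 440 Hz).
UKULELE_STRINGS = {"G4": 392.00, "C4": 261.63, "E4": 329.63, "A4": 440.00}


def cents_off(freq_hz, target_hz):
    """Signed distance from the target pitch in cents (100 cents = 1 semitone)."""
    return 1200.0 * math.log2(freq_hz / target_hz)


def nearest_string(freq_hz):
    """Return (string name, cents offset) for the closest ukulele string."""
    name = min(
        UKULELE_STRINGS,
        key=lambda n: abs(cents_off(freq_hz, UKULELE_STRINGS[n])),
    )
    return name, cents_off(freq_hz, UKULELE_STRINGS[name])


# A detected pitch slightly above A4: positive cents means the string is sharp.
name, offset = nearest_string(445.0)
print(name, round(offset, 1))
```

A tuner would feed each pitch estimate from the detector into a function like nearest_string and show the sign and size of the offset to the player.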

You can watch the full video here:

What's Broken in Music Informatics Research? Three Uncomfortable Statements

20/6/2019

Companion website for my invited talk at the Machine Learning for Music Discovery workshop at ICML 2019: "What's Broken in Music Informatics Research? Three Uncomfortable Statements".

VIDEO:

PAPER:

What's Broken in Music Informatics Research? Three Uncomfortable Statements
Justin Salamon
Invited talk, Machine Learning for Music Discovery workshop, International Conference on Machine Learning (ICML), Long Beach, California, USA, June 2019.
[PDF][Video]

SUPPLEMENTARY MATERIAL:

Figure 1 (see paper for details):

Figure 1: (a) ground truth melody f0 (pitch) sequence, (b) first algorithm's melody estimate, (c) second algorithm's melody estimate.

Bioacoustics Datasets: A New Website Listing Bioacoustics Datasets and Repositories

25/5/2019

I couldn't find a centralized list of #bioacoustics datasets (for #machinelearning or otherwise), so I created this page. Feedback welcome! #opendata https://t.co/42YHDBiBtM
- Justin Salamon (@justin_salamon) May 24, 2019

To explore Bioacoustics Datasets, a centralized list of bioacoustics datasets and repositories, visit: https://bioacousticsdatasets.weebly.com

OpenL3: A Competitive and Open Deep Audio Embedding

7/5/2019

We're excited to announce the release of OpenL3, an open-source deep audio embedding based on the self-supervised L3-Net. OpenL3 is an improved version of L3-Net, and outperforms VGGish and SoundNet (and the original L3-Net) on several sound recognition tasks. Most importantly, OpenL3 is open source and readily available for everyone to use: if you have TensorFlow installed just run pip install openl3 and you're good to go!

Full details are provided in our paper:

Look, Listen and Learn More: Design Choices for Deep Audio Embeddings
J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello.
IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 3852-3856, Brighton, UK, May 2019.
[IEEE][PDF][BibTeX][Copyright]

How well does it work?

Here's a comparison of classification results on three environmental sound datasets using embeddings from OpenL3 (blue), SoundNet (orange) and VGGish (green) as input to a simple 2-layer MLP:

Using OpenL3 we are able to match the current state of the art on UrbanSound8K, the most challenging of the three datasets, using a simple MLP without any of the tricks usually necessary for relatively small datasets (such as data augmentation).

Using OpenL3

Installing OpenL3, a Python module, is as easy as calling (assuming TensorFlow is already installed):

$ pip install openl3

Once installed, OpenL3 can be used in Python like this (simplest use case, without setting custom parameter values):

import openl3
import soundfile as sf

# Load the audio samples and sample rate, then compute the embedding
audio, sr = sf.read('/path/to/file.wav')
embedding, timestamps = openl3.get_embedding(audio, sr)

We also provide a command-line interface (CLI) that can be launched by calling "openl3" from the command line:

$ openl3 /path/to/file.wav

The API (both Python and CLI) includes more options, such as changing the hop size used to extract the embedding, the output dimensionality of the embedding and several other parameters. A good place to start is the OpenL3 tutorial.
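As a rough illustration of those options, the sketch below validates a set of keyword arguments and forwards them to openl3.get_embedding. The keyword names and allowed values reflect the OpenL3 documentation as we understand it, so treat them as assumptions and check them against your installed version; check_options and embed_file are hypothetical helpers, not part of OpenL3.

```python
# Option values assumed to be accepted by openl3.get_embedding
# (verify against the documentation of your installed version).
ALLOWED = {
    "content_type": {"music", "env"},
    "input_repr": {"linear", "mel128", "mel256"},
    "embedding_size": {512, 6144},
}


def check_options(**options):
    """Validate embedding options before passing them to OpenL3."""
    for key, value in options.items():
        if key in ALLOWED and value not in ALLOWED[key]:
            choices = sorted(ALLOWED[key], key=str)
            raise ValueError(f"{key}={value!r} is not one of {choices}")
    if options.get("hop_size", 0.1) <= 0:
        raise ValueError("hop_size must be a positive number of seconds")
    return options


def embed_file(path, **options):
    """Hypothetical helper: read an audio file and extract its embedding."""
    # Imported lazily so the option checks can run without OpenL3 installed.
    import openl3
    import soundfile as sf

    audio, sr = sf.read(path)
    return openl3.get_embedding(audio, sr, **check_options(**options))


# A smaller embedding with a coarser hop, using the environmental-sound model.
options = check_options(
    content_type="env", input_repr="mel256", embedding_size=512, hop_size=0.5
)
# embedding, timestamps = embed_file('/path/to/file.wav', **options)
```

A larger hop size yields fewer embedding frames per second of audio, and the smaller embedding size reduces storage and downstream compute.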

How was OpenL3 trained?

OpenL3 is an improved version of L3-Net by Arandjelovic and Zisserman, which is trained on a subset of AudioSet using self-supervision by exploiting the correspondence between sound and visual objects in video data:

The embedding is obtained by taking the output of the final convolutional layer of the audio subnetwork. For more details please see our paper.

We look forward to seeing what the community does with OpenL3!

...and, if you're attending ICASSP 2019, be sure to stop by our poster on Friday, May 17 between 13:30-15:30 (session MLSP-P17: Deep Learning V, Poster Area G, paper 2149)!

HistoryTracker: Minimizing Human Interactions in Baseball Game Annotation

4/5/2019

The sports data tracking systems available today are based on specialized hardware (high-definition cameras, speed radars, RFID) to detect and track targets on the field. While effective, implementing and maintaining these systems pose a number of challenges, including high cost and the need for close human monitoring. On the other hand, the sports analytics community has been exploring human computation and crowdsourcing in order to produce tracking data that is trustworthy, cheaper and more accessible. However, state-of-the-art methods require a large number of users to perform the annotation, or place too much burden on a single user. We propose HistoryTracker, a methodology that facilitates the creation of tracking data for baseball games by warm-starting the annotation process using a vast collection of historical data. We show that HistoryTracker helps users to produce tracking data in a fast and reliable way.

HistoryTracker: Minimizing Human Interactions in Baseball Game Annotation
J. P. Ono, A. Gjoka, J. Salamon, C. A. Dietrich, and C. T. Silva
Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI'19), Glasgow, UK, May 2019.
[ACM][PDF][BibTeX]

The paper received a CHI 2019 Honorable Mention Award:

30-second teaser video:

Jorge's full presentation at CHI 2019:

DCASE 2019 Workshop in NYC: Call for Papers

22/3/2019

The 4th Workshop on Detection and Classification of Acoustic Scenes and Events, DCASE 2019, will be held in New York City on October 25-26, 2019:
http://dcase.community/workshop2019/

The workshop immediately follows WASPAA 2019 and SANE 2019, also hosted in New York, offering a full week of exciting audio-related research!

As in previous years the workshop is organized in conjunction with the DCASE challenge. We aim to bring together researchers from many different universities and companies with an interest in the topic, and provide the opportunity for scientific exchange of ideas and opinions.

The technical program will include invited speakers on the topic of computational everyday sound analysis and recognition, and oral and poster presentations of accepted papers. In addition, a special poster session will be dedicated to the DCASE 2019 challenge entries and results.

We invite submissions on the topics of computational analysis of acoustic scenes and sound events, including but not limited to:

Tasks in computational environmental audio analysis

Acoustic scene classification
Sound event detection and localization
Audio tagging
Challenges in real-life applications (e.g., rare events, overlapping sound events, weak labels)

Methods for computational environmental audio analysis

Signal processing methods
Machine learning methods
Auditory-motivated methods
Cross-disciplinary methods involving, e.g., acoustics, biology, psychology, geography, materials science, transport science

Resources, applications, and evaluation of computational environmental audio analysis

Publicly available datasets or software, taxonomies and ontologies, evaluation procedures
Ethics, privacy, responsible research
Applications
Description of systems submitted to the DCASE 2019 Challenge, expanded from the challenge technical report submissions to include evaluation results and comparison.

Reproducible research with open-source code and open data is encouraged (but not mandatory).

Important notice for challenge participants
Note that while each DCASE challenge submission must be accompanied by a technical report describing the system, in order to be considered for publication at the peer reviewed workshop, such reports must be augmented with final results from the challenge and a careful analysis of those results in the context of the other submissions in a way that provides meaningful, useable insight.

IMPORTANT DATES

05 Jul 2019: Workshop paper abstract submission
12 Jul 2019: Workshop paper submission
23 Aug 2019: Notification of paper acceptance
25 Oct 2019 - 26 Oct 2019: Workshop

We look forward to receiving your submissions!

GENERAL CHAIRS
Juan P. Bello, New York University
Mark Cartwright, New York University

PROGRAM CHAIRS
Daniel P. W. Ellis, Google, Inc.
Michael Mandel, Brooklyn College (CUNY)
Justin Salamon, Adobe Research

LOCAL ORGANIZATION
Vincent Lostanlen, Cornell University