To explore Bioacoustic Datasets, a centralized list of bioacoustics datasets and repositories, visit: https://bioacousticsdatasets.weebly.com
We're excited to announce the release of OpenL3, an open-source deep audio embedding based on the self-supervised L3-Net. OpenL3 is an improved version of L3-Net, and outperforms VGGish and SoundNet (and the original L3-Net) on several sound recognition tasks. Most importantly, OpenL3 is open source and readily available for everyone to use: if you have TensorFlow installed, just run pip install openl3 and you're good to go!
Full details are provided in our paper:
Look, Listen and Learn More: Design Choices for Deep Audio Embeddings
J. Cramer, H.-H. Wu, J. Salamon, and J. P. Bello.
IEEE Int. Conf. on Acoustics, Speech and Signal Proc. (ICASSP), pp. 3852-3856, Brighton, UK, May 2019.
How well does it work?
Here's a comparison of classification results on three environmental sound datasets using embeddings from OpenL3 (blue), SoundNet (orange) and VGGish (green) as input to a simple 2-layer MLP:
Using OpenL3 we are able to match the current state of the art on UrbanSound8K, the most challenging of the three datasets, using a simple MLP without any of the tricks usually necessary for relatively small datasets (such as data augmentation).
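For a sense of scale, the downstream classifier really is minimal. Here's a hedged Keras sketch of what such a 2-layer MLP could look like (the hidden-layer sizes are illustrative assumptions, not necessarily those used in the paper; 6144 is OpenL3's default embedding dimensionality, and 10 matches UrbanSound8K's class count):

import keras

model = keras.models.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(6144,)),  # embedding in
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dense(10, activation='softmax'),  # one output per class
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])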
Installing OpenL3, a Python module, is as easy as calling (assuming TensorFlow is already installed):
$ pip install openl3
Once installed, using OpenL3 in Python can be done like this (simplest use case, without setting custom parameter values):
import openl3
import soundfile as sf

audio, sr = sf.read('/path/to/file.wav')
embedding, timestamps = openl3.get_embedding(audio, sr)
We also provide a command-line interface (CLI) that can be launched by calling "openl3" from the command line:
$ openl3 /path/to/file.wav
The API (both Python and CLI) includes more options, such as changing the hop size used to extract the embedding, the output dimensionality of the embedding, and several other parameters. A good place to start is the OpenL3 tutorial.
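For instance, here's a hedged sketch of requesting a smaller embedding with a larger hop; parameter names follow the openl3 release current at the time of writing, so treat the tutorial and API docs as authoritative:

import openl3
import soundfile as sf

audio, sr = sf.read('/path/to/file.wav')
# Environmental-sound model, 512-dimensional embedding, 0.5 s hop:
embedding, timestamps = openl3.get_embedding(
    audio, sr, content_type='env', input_repr='mel256',
    embedding_size=512, hop_size=0.5)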
How was OpenL3 trained?
OpenL3 is an improved version of L3-Net by Arandjelović and Zisserman, which is trained on a subset of AudioSet using self-supervision by exploiting the correspondence between sound and visual objects in video data:
The embedding is obtained by taking the output of the final convolutional layer of the audio subnetwork. For more details please see our paper.
We look forward to seeing what the community does with OpenL3!
...and, if you're attending ICASSP 2019, be sure to stop by our poster on Friday, May 17 between 13:30-15:30 (session MLSP-P17: Deep Learning V, Poster Area G, paper 2149)!
The 4th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2019) will be held in New York City on October 25-26, 2019:
The workshop immediately follows WASPAA 2019 and SANE 2019, also hosted in New York, offering a full week of exciting audio-related research!
As in previous years the workshop is organized in conjunction with the DCASE challenge. We aim to bring together researchers from many different universities and companies with an interest in the topic, and provide the opportunity for scientific exchange of ideas and opinions.
The technical program will include invited speakers on the topic of computational everyday sound analysis and recognition, and oral and poster presentations of accepted papers. In addition, a special poster session will be dedicated to the DCASE 2019 challenge entries and results.
We invite submissions on the topics of computational analysis of acoustic scenes and sound events, including but not limited to:
Tasks in computational environmental audio analysis
Methods for computational environmental audio analysis
Resources, applications, and evaluation of computational environmental audio analysis
Reproducible research with open-source code and open data is encouraged (but not mandatory).
Important notice for challenge participants
Note that each DCASE challenge submission must be accompanied by a technical report describing the system. However, to be considered for publication at the peer-reviewed workshop, such reports must be augmented with the final results from the challenge and a careful analysis of those results in the context of the other submissions, in a way that provides meaningful, usable insight.
We look forward to receiving your submissions!
Juan P. Bello, New York University
Mark Cartwright, New York University
Daniel P. W. Ellis, Google, Inc.
Michael Mandel, Brooklyn College (CUNY)
Justin Salamon, Adobe Research
Vincent Lostanlen, Cornell University
After a marvelous five years at NYU I've decided it's time for me to move on, and I'm pretty excited to announce that I've accepted a research scientist position with Adobe Research in San Francisco!
I'd like to start this post by thanking my colleagues at the Music and Audio Research Lab (MARL) and the Center for Urban Science and Progress (CUSP) for an inspiring five years of research collaboration and friendship. It has been a privilege to work with this fantastic group of people, from whom I have learned a great deal. In particular I'd like to express my gratitude to Juan Pablo Bello who gave me the opportunity to join MARL, provided invaluable guidance and mentorship, and has always been highly supportive of me charting my own research path.
At Adobe I'm joining the newly formed Audio Research Group spearheaded by Gautham Mysore, along with Nick Bryan, Zeyu Jin and Juan-Pablo Caceres. I'm excited to be part of this new initiative, collaborate with the broader Adobe Research team (including the image folks!) and have the opportunity to conduct novel research in audio and video that will (hopefully) help shape the future of creative audio-visual applications and experiences.
I will also, importantly, maintain my ties and involvement with the academic research community, be it publishing in conferences and journals, reviewing, organizing events (such as the upcoming DCASE 2019 workshop and challenge), and collaborating with researchers in academia.
Speaking of which, Adobe Research has a (summer) internship program! If you're a PhD student working at the cutting edge of audio (and audio-visual) research and would be interested in spending a few months working with Adobe researchers in San Francisco feel free to get in touch! It's a great opportunity to open up new research avenues or expand existing ones, strengthen your research network and enjoy a few months in a fantastic city. And, in case you were wondering, you can (and are expected to) publish the results of your research internship! The 2019 summer internship program is almost finalized, but keep an eye out for 2020, the call for applications usually goes out in the fall of the previous year (i.e. fall 2019).
Finally, it is not lost on me that I am also another (white) man moving to the Bay Area to work in the tech industry. At the same time as being excited about this new adventure, I am also aware of the challenges the tech boom has caused in the San Francisco Bay Area, including gentrification, the housing shortage and the related homelessness crisis. It is my every hope that I will also find a way to be part of the solution to these issues. Recommendations for volunteer work, mentorship (e.g. I am a mentor in the WiMIR mentoring program), and programs that seek to empower underserved communities in the Bay Area are most welcome.
Oh, and if you find yourself in the Bay Area be sure to come say hello!
Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent, Sustainable, and Reproducible Audio Research
Writing quality open-source scientific software is crucial for comparable, reproducible research. Many of us learned this the hard way over many years of work (and lines of code). So we decided to write the article we wish we could have read at the beginning of our research careers. We hope you find it useful!
Open-Source Practices for Music Signal Processing Research: Recommendations for Transparent, Sustainable, and Reproducible Audio Research
B. McFee, J. W. Kim, M. Cartwright, J. Salamon, R. M. Bittner, and J. P. Bello.
IEEE Signal Processing Magazine, 36(1):128-137, Jan. 2019.
But... is this really such a big deal? Do you really need an open source implementation of your research? Can't other researchers just work off the description in your paper? In the plot below (reproduced from the article) we show the performance of 8 different onset detection systems that were all implemented based on the same description (in the caption) with "minor" changes to the implementation:
Performance of 8 systems implemented based on the following description with "minor" implementation differences: "Audio was converted to 22050 Hz (mono), and a 2048-point short-time Fourier transform (STFT) was computed with a 64-sample hop. The STFT was reduced to 128 Mel-frequency bands, and magnitudes were compressed by log scaling. An onset envelope was computed using thresholded spectral differencing, and peaks were selected using the method of Böck et al."
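To make the ambiguity concrete, here is one possible reading of that description as a hedged librosa sketch; even this version silently fixes choices (window type, differencing threshold, peak-picking constants) that the text leaves open:

import librosa

y, sr = librosa.load('/path/to/file.wav', sr=22050, mono=True)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=64, n_mels=128)
log_S = librosa.power_to_db(S)  # log-magnitude compression
onset_env = librosa.onset.onset_strength(S=log_S, sr=sr)  # spectral differencing
onset_frames = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr, hop_length=64)
onset_times = librosa.frames_to_time(onset_frames, sr=sr, hop_length=64)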
In the full article we share experiences and advice gained from developing open source software for MIR research, with the hope that practitioners in other related disciplines may benefit from our findings and become effective developers of open source scientific software.
Many of the issues we encounter in MIR applications are likely to recur in more general signal processing areas, as data sets increase in complexity, evaluation becomes more integrated and realistic, and traditionally small research components become integrated with larger systems.
We're pleased to announce the release of five new(ish) datasets for research on monophonic, melody, bass and multi-f0 pitch tracking:
The datasets are released as part of our ISMIR 2017 publication:
An analysis/synthesis framework for automatic f0 annotation of multitrack datasets
J. Salamon, R. M. Bittner, J. Bonada, J. J. Bosch, E. Gómez, and J. P. Bello.
In 18th Int. Soc. for Music Info. Retrieval Conf., Suzhou, China, Oct. 2017.
And consist of:
65 songs from the MedleyDB dataset in which the melody track has been resynthesized to obtain a perfect melody f0 annotation using the analysis/synthesis method described in the paper.
71 songs from the MedleyDB dataset in which the bass track has been resynthesized to obtain a perfect bass f0 annotation using the analysis/synthesis method described in the paper.
85 songs from the MedleyDB dataset in which polyphonic pitched instruments (such as piano and guitar) have been removed and all monophonic pitched instruments (such as bass and voice) have been resynthesized to obtain perfect f0 annotations using the analysis/synthesis method described in the paper.
230 solo stems (tracks) from the MedleyDB dataset spanning a variety of musical instruments and voices, which have been resynthesized to obtain a perfect f0 annotation using the analysis/synthesis method described in the paper.
10 classical music pieces (four-part J.S. Bach chorales) from the Bach10 dataset where each instrument (bassoon, clarinet, saxophone and violin) has been resynthesized to obtain perfect f0 annotations using the analysis/synthesis method described in the paper.
To hear some samples from these datasets please see the examples page.
We hope you find these datasets useful! For example, at MARL we used MDB-stem-synth to train and evaluate our convnet-based monophonic pitch tracker CREPE.
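If you'd like to use the annotations for evaluation, here's a hedged sketch using mir_eval; it assumes the annotations ship as two-column (time, frequency in Hz) CSV files as in MedleyDB, and the file names are hypothetical:

import numpy as np
import mir_eval

ref = np.loadtxt('song_MELODY.csv', delimiter=',')          # ground-truth f0 annotation
est = np.loadtxt('my_algorithm_output.csv', delimiter=',')  # your tracker's output
scores = mir_eval.melody.evaluate(ref[:, 0], ref[:, 1], est[:, 0], est[:, 1])
print(scores['Overall Accuracy'])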
We're also working on an open-source library implementing the analysis/synthesis method described in the paper so that people can use it to create new f0 annotations for multitrack datasets. It's still in the works, we'll send a follow-up message once it's released.
Thanks and sorry it's taken us this long to release the data!
Justin, Rachel, Jordi, Juanjo, Emilia and Juan.
In the context of automatic speech recognition and acoustic event detection, an adaptive procedure named per-channel energy normalization (PCEN) has recently been shown to outperform the pointwise logarithm of the mel-frequency spectrogram (logmelspec) as an acoustic frontend. This article investigates the adequacy of PCEN for spectrogram-based pattern recognition in far-field noisy recordings, from both theoretical and practical standpoints. First, we apply PCEN on various datasets of natural acoustic environments and find empirically that it Gaussianizes distributions of magnitudes while decorrelating frequency bands. Secondly, we describe the asymptotic regimes of each component in PCEN: temporal integration, gain control, and dynamic range compression. Thirdly, we give practical advice for adapting PCEN parameters to the temporal properties of the noise to be mitigated, the signal to be enhanced, and the choice of time-frequency representation. As it converts a large class of real-world soundscapes into additive white Gaussian noise (AWGN), PCEN is a computationally efficient frontend for robust detection and classification of acoustic events in heterogeneous environments.
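For reference, here is the PCEN transformation written out (reconstructed from the definitions in the paper):

$$\mathrm{PCEN}(t,f) = \left(\frac{E(t,f)}{(\varepsilon + M(t,f))^{\alpha}} + \delta\right)^{r} - \delta^{r}$$

where E(t,f) is the mel-frequency spectrogram, M(t,f) is its low-pass-filtered version (temporal integration), the exponent α implements gain control, and δ and r implement dynamic range compression.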
Read the full paper here:
Per-Channel Energy Normalization: Why and How
V. Lostanlen, J. Salamon, M. Cartwright, B. McFee, A. Farnsworth, S. Kelling, and J. P. Bello.
IEEE Signal Processing Letters, 26(1): 39-43, Jan. 2019.
Here's a plot from our paper comparing the application of log vs PCEN on a mel-spectrogram computed from an audio recording captured by a remote acoustic sensor for avian flight call detection (as part of our BirdVox project). In the top plot (log) we clearly see energy from undesired noise sources such as insects and a car, whereas in the bottom plot (PCEN) we see these confounding factors have been attenuated, while the flight calls we wish to detect (which appear as very short chirps) are kept.
Fig. 1. A soundscape comprising bird calls, insect stridulations, and a passing vehicle. The logarithmic transformation of the mel-frequency spectrogram (a) maps all magnitudes to a decibel-like scale, whereas per-channel energy normalization (b) enhances transient events (bird calls) while discarding stationary noise (insects) as well as slow changes in loudness (vehicle). Data provided by BirdVox. Mel-frequency spectrogram and PCEN computed with default librosa 0.6.1 parameters and T = 60 ms (see Section IV).
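If you'd like to reproduce this kind of comparison, here's a hedged librosa sketch (the file path is hypothetical; the (2 ** 31) rescaling follows the example in librosa's pcen documentation, which expects magnitudes on an integer scale):

import librosa

y, sr = librosa.load('/path/to/sensor_recording.wav')
S = librosa.feature.melspectrogram(y=y, sr=sr)
log_S = librosa.power_to_db(S)                                   # panel (a): log scaling
pcen_S = librosa.pcen(S * (2 ** 31), sr=sr, time_constant=0.06)  # panel (b): T = 60 ms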
On Thursday October 18th I'll be giving a talk at the Speech and Audio in the Northeast (SANE) 2018 workshop hosted by Google Boston, covering some of our most recent work on sound recognition with acoustic sensor networks as part of the SONYC and BirdVox projects. Hope you can join us!
The combination of remote acoustic sensors with automatic sound recognition represents a powerful emerging technology for studying both natural and urban environments. At NYU we've been working on two projects whose aim is to develop and leverage this technology: the Sounds of New York City (SONYC) project is using acoustic sensors to understand noise patterns across NYC to improve noise mitigation efforts, and the BirdVox project is using them for the purpose of tracking bird migration patterns in collaboration with the Cornell Lab of Ornithology. Acoustic sensors present both unique opportunities and unique challenges when it comes to developing machine listening algorithms for automatic sound event detection: they facilitate the collection of large quantities of audio data, but the data is unlabeled, constraining our ability to leverage supervised machine learning algorithms. Training generalizable models becomes particularly challenging when training data come from a limited set of sensor locations (and times), and yet our models must generalize to unseen natural and urban environments with unknown and sometimes surprising confounding factors. In this talk I will present our work towards tackling these challenges along several different lines with neural network architectures, including novel pooling layers that allow us to better leverage weakly labeled training data, self-supervised audio embeddings that allow us to train high-accuracy models with a limited amount of labeled data, and context-adaptive networks that improve the robustness of our models to heterogeneous acoustic environments.
UPDATE: thanks everyone for attending the talk! Here's a video recording of the talk as well as the slides:
Robust Sound Event Detection in Acoustic Sensor Networks
Speech and Audio in the Northeast (SANE), Google, Cambridge, MA, Oct. 2018
We present AutoPool, an adaptive (trainable) pooling operator which smoothly interpolates between common pooling operators, such as min-, max-, or average-pooling, automatically adapting to the characteristics of the data.
AutoPool can be readily applied to any differentiable model for time-series label prediction. AutoPool is presented in the following paper, where it is evaluated in conjunction with convolutional neural networks for Sound Event Detection (SED) using a multiple instance learning (MIL) paradigm:
Adaptive pooling operators for weakly labeled sound event detection
B. McFee, J. Salamon, and J. P. Bello.
IEEE/ACM Transactions on Audio, Speech and Language Processing, 26(11): 2180-2193, Nov. 2018.
For ease of use, we've implemented AutoPool as a Keras layer and made it available on PyPI:
$ pip install autopool
Using AutoPool is as straightforward as using any standard Keras pooling layer, for example:
from autopool import AutoPool

bag_pred = AutoPool(axis=1)(instance_pred)
Further details and examples are provided in the documentation.
AutoPool extends softmax-weighted pooling by adding a trainable parameter α (alpha) to be learned jointly with all other trainable model parameters:
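Written out (reconstructed here from the definition in the paper), the operator is:

$$P(Y \mid X) = \sum_{x \in X} \frac{\exp\left(\alpha\, p(Y \mid x)\right)}{\sum_{z \in X} \exp\left(\alpha\, p(Y \mid z)\right)}\, p(Y \mid x)$$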
Here, p(Y|x) denotes the prediction for an instance x, and X denotes a set (bag) of instances. The aggregated prediction P(Y|X) is a weighted average of the instance-level predictions. Note that when α = 0 this reduces to an unweighted mean; when α = 1 it simplifies to softmax pooling; and as α → ∞ it approaches the max operator. Hence the name: AutoPool.
Constrained and Regularized AutoPool
In the paper we show there can be benefits to either constraining the range α can take, or, alternatively, applying l2 regularization on α; this results in constrained AutoPool (CAP) and regularized AutoPool (RAP), respectively. Since AutoPool is implemented as a Keras layer, CAP and RAP can be achieved through the layer's optional arguments:
CAP with non-negative α:
bag_pred = AutoPool(axis=1, kernel_constraint=keras.constraints.non_neg())(instance_pred)
CAP with α norm-constrained to some value alpha_max:
bag_pred = AutoPool(axis=1, kernel_constraint=keras.constraints.max_norm(alpha_max, axis=0))(instance_pred)
Heuristics for determining sensible values of alpha_max are given in the paper (section III.E).
RAP with l2-regularized α:
bag_pred = AutoPool(axis=1, kernel_regularizer=keras.regularizers.l2(l=1e-4))(instance_pred)
CAP and RAP can be combined, of course, by applying both a kernel constraint and a kernel regularizer.
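For example, a hedged sketch of both at once, reusing the illustrative values from above:

bag_pred = AutoPool(axis=1,
                    kernel_constraint=keras.constraints.non_neg(),
                    kernel_regularizer=keras.regularizers.l2(l=1e-4))(instance_pred)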
AutoPool directly generalizes to multi-label settings, in which multiple class labels may be applied to each instance x (for example, "car" and "siren" may both be present in an instance). In this setting, a separate auto-pooling operator is applied to each class: rather than a single parameter α, there is a vector of parameters α_c, where c indexes the output vocabulary. This allows a jointly trained model to adapt the pooling strategies independently for each category, as sketched below. Please see the paper for further details.
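Here's a minimal sketch of how this looks with the Keras layer (shapes and class count are illustrative; in practice instance_pred would be the frame-level output of your CNN rather than a raw input):

import keras
from autopool import AutoPool

n_classes = 10  # illustrative
# Frame-level (instance) predictions, shape (batch, time, n_classes):
instance_pred = keras.layers.Input(shape=(None, n_classes))
# AutoPool learns one alpha per class and pools over the time axis,
# yielding clip-level (bag) predictions of shape (batch, n_classes):
bag_pred = AutoPool(axis=1)(instance_pred)
model = keras.models.Model(inputs=instance_pred, outputs=bag_pred)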
Here's an example output of a deep CNN trained with a regularized AutoPool layer: the model was trained on weakly labeled data (time series with sound event tags, but without any timestamps) and produces strong predictions (i.e. sound event labels with start/end times):
For more experimental results and a detailed evaluation of AutoPool compared to other pooling functions across several datasets, please see our paper.
I'm happy to announce the release of scaper v0.2.0.
Scaper is a Python library for soundscape synthesis and augmentation. It can be used to quickly and easily generate audio data for machine learning experiments, for example for sound event detection in environmental or bioacoustic audio, or (with some work) audio source separation as well!
The latest release includes improvements to how scaper computes the loudness of foreground and background events, resulting in more reliable SNR in the output audio.
Installing the latest release is as easy as calling:
$ pip install scaper
If you haven't used scaper before, you might want to check out the scaper documentation and in particular the scaper tutorial.
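As a taste, here's a hedged sketch of generating a single 10-second soundscape (paths and label names are hypothetical; the tutorial covers the full API):

import scaper

sc = scaper.Scaper(duration=10.0, fg_path='foreground/', bg_path='background/')
sc.ref_db = -20  # loudness reference for the background

sc.add_background(label=('const', 'park'),
                  source_file=('choose', []),
                  source_time=('const', 0))

sc.add_event(label=('choose', []),  # pick a random foreground label
             source_file=('choose', []),
             source_time=('const', 0),
             event_time=('uniform', 0, 9),
             event_duration=('truncnorm', 3, 1, 0.5, 5),
             snr=('uniform', 6, 30),
             pitch_shift=None,
             time_stretch=None)

sc.generate('soundscape.wav', 'soundscape.jams')  # audio + JAMS annotation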
And, if you really want to dig into the why/what/how behind scaper, please see the scaper-paper:
Scaper: A Library for Soundscape Synthesis and Augmentation
J. Salamon, D. MacConnell, M. Cartwright, P. Li, and J. P. Bello.
In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, Oct. 2017.