Justin Salamon

Controllable Neural Prosody Synthesis

25/10/2020

0 Comments

 
Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a real or synthesized speech recording, our model allows a user to input prosody constraints for certain time frames and generates the remaining time frames from input text and contextual prosody. We also propose a pitch-shifting neural vocoder to modify input speech to match the synthesized prosody. Through objective and subjective evaluations we show that we can successfully incorporate user control into our prosody generation model without sacrificing the overall naturalness of the synthesized speech.

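To make the interaction model concrete, here is a toy sketch (illustrative names and logic, not the paper's code): the user pins pitch values for a few frames via a constraint mask, and a generator fills in the remaining frames. The real generator is a neural network conditioned on the input text and contextual prosody; the stand-in below simply interpolates between the constrained frames.

    import numpy as np

    def infill_prosody(f0_constraints, constraint_mask):
        """Toy stand-in for the neural prosody generator: frames marked in
        constraint_mask keep the user-specified pitch; unconstrained frames
        are filled by interpolating between the constrained ones. The real
        model conditions on text and contextual prosody instead."""
        frames = np.arange(len(f0_constraints))
        known = frames[constraint_mask]
        if len(known) == 0:
            return np.full_like(f0_constraints, 150.0)  # fallback: flat pitch
        return np.interp(frames, known, f0_constraints[known])

    # The user pins the pitch of a few frames (e.g., to move an emphasis)
    # and lets the generator produce the rest.
    n_frames = 20
    f0 = np.zeros(n_frames)
    mask = np.zeros(n_frames, dtype=bool)
    f0[[0, 10, 19]] = [120.0, 220.0, 110.0]  # user-specified pitch targets (Hz)
    mask[[0, 10, 19]] = True
    print(infill_prosody(f0, mask))
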
Here's Max's presentation of the work at INTERSPEECH 2020:

To hear more examples please visit Max's paper companion website.

For further details please read our paper:

Controllable Neural Prosody Synthesis
M. Morrison, Z. Jin, J. Salamon, N.J. Bryan, G. Mysore
Proc. Interspeech. October 2020.
[INTERSPEECH][PDF][arXiv]

Time Lattice: A Data Structure for the Interactive Visual Analysis of Large Time Series

19/7/2018

0 Comments

 
Advances in technology coupled with the availability of low-cost sensors have resulted in the continuous generation of large time series from several sources. In order to visually explore and compare these time series at different scales, analysts need to execute online analytical processing (OLAP) queries that include constraints and group-by's at multiple temporal hierarchies. Effective visual analysis requires these queries to be interactive. However, while existing OLAP cube-based structures can support interactive query rates, the exponential memory requirement to materialize the data cube is often unsuitable for large data sets. Moreover, none of the recent space-efficient cube data structures allow for updates. Thus, the cube must be re-computed whenever there is new data, making them impractical in a streaming scenario. We propose Time Lattice, a memory-efficient data structure that makes use of the implicit temporal hierarchy to enable interactive OLAP queries over large time series. Time Lattice is a subset of a fully materialized cube and is designed to handle fast updates and streaming data. We perform an experimental evaluation which shows that the space efficiency of the data structure does not hamper its performance when compared to the state of the art. In collaboration with signal processing and acoustics research scientists, we use the Time Lattice data structure to design the Noise Profiler, a web-based visualization framework that supports the analysis of noise from cities. We demonstrate the utility of Noise Profiler through a set of case studies.

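To give a rough feel for how an implicit temporal hierarchy keeps both streaming updates and aggregate queries cheap, here is a minimal sketch (my own illustration, not the authors' implementation): aggregates are kept at every level of the hierarchy, each new sample updates one bucket per level, and range queries are answered from pre-aggregated buckets rather than raw samples.

    from collections import defaultdict

    # Bucket widths, in seconds, for each level of the temporal hierarchy.
    LEVELS = {"minute": 60, "hour": 3600, "day": 86400}

    class TemporalAggregates:
        """Toy multi-resolution aggregate store. The real Time Lattice stores a
        carefully chosen subset of the full cube and supports constraints and
        group-bys at multiple temporal resolutions."""

        def __init__(self):
            self.sums = {lvl: defaultdict(float) for lvl in LEVELS}
            self.counts = {lvl: defaultdict(int) for lvl in LEVELS}

        def insert(self, timestamp, value):
            # Streaming update: one bucket per level, O(depth) per sample.
            for lvl, width in LEVELS.items():
                bucket = int(timestamp) // width
                self.sums[lvl][bucket] += value
                self.counts[lvl][bucket] += 1

        def mean(self, level, t_start, t_end):
            # Aggregate query answered from pre-aggregated buckets only.
            width = LEVELS[level]
            buckets = range(int(t_start) // width, int(t_end) // width + 1)
            total = sum(self.sums[level][b] for b in buckets)
            n = sum(self.counts[level][b] for b in buckets)
            return total / n if n else None

    agg = TemporalAggregates()
    agg.insert(1_530_000_000, 72.5)  # e.g., a dB reading from a noise sensor
    agg.insert(1_530_000_030, 68.0)
    print(agg.mean("hour", 1_529_999_000, 1_530_001_000))
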
For example, we used the Noise Profiler to rapidly explore and visualize noise patterns in NYC during weekdays versus weekends across multiple locations, using time series data from SONYC noise sensors:
Noise patterns on weekdays vs. weekends from a variety of locations in NYC. Time series data from SONYC noise sensors explored and visualized using the Noise Profiler tool built with Time Lattice.

For further details see our paper:

Time Lattice: A Data Structure for the Interactive Visual Analysis of Large Time Series
F. Miranda, M. Lage, H. Doraiswamy, C. Mydlarz, J. Salamon, Y. Lockerman, J. Freire, C. Silva
Computer Graphics Forum (EuroVis '18), 37(3), 2018, 13-22
[Wiley][PDF][BibTeX]

BirdVox-full-night: A Dataset and Benchmark for Avian Flight Call Detection

17/4/2018

0 Comments

 
We've just released BirdVox-full-night, a challenging new dataset for machine learning on bioacoustic data! Details about the dataset and the models we benchmarked are provided in our ICASSP 2018 paper:

BirdVox-full-night: A Dataset and Benchmark for Avian Flight Call Detection
V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, and J. P. Bello
In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Calgary, Canada, Apr. 2018.
[PDF][Copyright]

This article addresses the automatic detection of vocal, nocturnally migrating birds from a network of acoustic sensors. Thus far, owing to the lack of annotated continuous recordings, existing methods have been benchmarked in a binary classification setting (presence vs. absence). Instead, with the aim of comparing them in event detection, we release BirdVox-full-night, a dataset of 62 hours of audio comprising 35,402 flight calls of nocturnally migrating birds, as recorded from 6 sensors. We find a large performance gap between energy-based detection functions and data-driven machine listening. The best model is a deep convolutional neural network trained with data augmentation. We correlate recall with the density of flight calls over time and frequency and identify the main causes of false alarm.

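For context, the energy-based detection functions mentioned above are of roughly the following flavour (a generic sketch of that baseline family, not the specific detectors benchmarked in the paper): compute short-time energy and flag frames that rise above a threshold.

    import numpy as np

    def energy_detection_function(y, frame_len=1024, hop=512):
        """Generic energy-based detector: short-time log-energy followed by
        thresholding. Data-driven models such as the CNN in the paper replace
        this hand-crafted detection function with a learned one."""
        n_frames = 1 + max(0, (len(y) - frame_len) // hop)
        energy = np.array([
            np.sum(y[i * hop:i * hop + frame_len] ** 2) for i in range(n_frames)
        ])
        log_energy = 10.0 * np.log10(energy + 1e-10)
        threshold = log_energy.mean() + 2.0 * log_energy.std()
        return log_energy, log_energy > threshold

    # Synthetic example: low-level noise with a short, louder burst in the middle.
    sr = 24000
    y = 0.01 * np.random.randn(2 * sr)
    y[sr:sr + 2000] += 0.2 * np.sin(2 * np.pi * 7000 * np.arange(2000) / sr)
    curve, detections = energy_detection_function(y)
    print(detections.sum(), "frames above threshold")
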
You can download the dataset here: https://wp.nyu.edu/birdvox/birdvox-full-night/

You can also check out additional bioacoustic datasets for machine learning we have released as part of the BirdVox project here: https://wp.nyu.edu/birdvox/codedata/#datasets

Finally, if you're at ICASSP 2018 and want to learn more, be sure to grab my esteemed colleague Vincent Lostanlen for a chat!


New Book Chapter: Sound Analysis in Smart Cities

3/10/2017

0 Comments

 
This chapter introduces the concept of smart cities and discusses the importance of sound as a source of information about urban life. It describes a wide range of applications for the computational analysis of urban sounds and focuses on two high-impact areas, audio surveillance and noise pollution monitoring, which sit at the intersection of dense sensor networks and machine listening. For sensor networks we focus on the pros and cons of mobile versus static sensing strategies, and describe a low-cost acoustic sensing solution that supports distributed machine listening. For sound event detection and classification we focus on the challenges presented by this task, solutions including feature design and learning strategies, and how a combination of convolutional networks and data augmentation results in the current state of the art. We close with a discussion about the potential and challenges of mobile sensing, the limitations imposed by the data currently available for research, and a few areas for future exploration.

Sound analysis in smart cities
J. P. Bello, C. Mydlarz, and J. Salamon.
In T. Virtanen, M. D. Plumbley, and D. P. W. Ellis, editors, Computational Analysis of Sound Scenes and Events, pages 373–397. Springer International Publishing, 2018.
[Springer][PDF][BibTeX]


Best Student Paper Award at 2017 AES International Conference on Semantic Audio

23/6/2017

0 Comments

 
I'm excited to report that our paper “Pitch Contours as a Mid-Level Representation for Music Informatics” has won the Best Student Paper Award at the 2017 AES International Conference on Semantic Audio. The paper, led and presented by my colleague Rachel Bittner, proposes a factored architecture for a variety of pitch-informed MIR tasks such as predominant and multiple f0 estimation and genre, gender and singing style classification, with pitch contours serving as a powerful and semantically rich mid-level representation.

So... should all machine learning for music be end-to-end? See what we found in the full paper:

Pitch Contours as a Mid-Level Representation for Music Informatics
R. M. Bittner, J. Salamon, J. J. Bosch, and J. P. Bello.
In AES Conference on Semantic Audio, Erlangen, Germany, Jun. 2017.
[PDF]

Deep Convolutional Neural Networks and Data Augmentation For Environmental Sound Classification

20/1/2017

0 Comments

 
The ability of deep convolutional neural networks (CNNs) to learn discriminative spectro-temporal patterns makes them well suited to environmental sound classification. However, the relative scarcity of labeled data has impeded the exploitation of this family of high-capacity models. This study has two primary contributions: first, we propose a deep convolutional neural network architecture for environmental sound classification; second, we propose the use of audio data augmentation for overcoming the problem of data scarcity and explore the influence of different augmentations on the performance of the proposed CNN architecture. Combined with data augmentation, the proposed model produces state-of-the-art results for environmental sound classification. We show that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation. Finally, we examine the influence of each augmentation on the model’s classification accuracy for each class, and observe that the accuracy for each class is influenced differently by each augmentation, suggesting that the performance of the model could be improved further by applying class-conditional data augmentation.
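
As a simplified illustration of audio data augmentation, the snippet below generates deformed copies of a clip with librosa. The deformation types and parameter values are illustrative only and do not reproduce the exact augmentation set or ranges explored in the paper.

    import librosa

    def augment(y, sr):
        """Return deformed copies of a clip: time-stretched and pitch-shifted
        versions. The paper studies several augmentations and parameter ranges
        and analyses their effect on a per-class basis."""
        copies = []
        for rate in (0.81, 1.23):        # time stretching (slower / faster)
            copies.append(librosa.effects.time_stretch(y, rate=rate))
        for n_steps in (-2, 2):          # pitch shifting, in semitones
            copies.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps))
        return copies

    # Toy input: a 2-second 440 Hz tone standing in for an urban sound clip.
    sr = 22050
    y = librosa.tone(440.0, sr=sr, duration=2.0)
    for c in augment(y, sr):
        print(len(c) / sr, "seconds")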

For further details see our paper:

Deep Convolutional Neural Networks and Data Augmentation For Environmental Sound Classification
J. Salamon and J. P. Bello
IEEE Signal Processing Letters, In Press, 2017.
[IEEE][PDF][BibTeX][Copyright]


Fusing Shallow and Deep Learning for Bioacoustic Bird Species Classification

15/12/2016

0 Comments

 
Automated classification of organisms to species based on their vocalizations would contribute tremendously to our ability to monitor biodiversity, with a wide range of applications in the field of ecology. In particular, automated classification of migrating birds’ flight calls could yield new biological insights and conservation applications for birds that vocalize during migration. In this paper we explore state-of-the-art classification techniques for large-vocabulary bird species classification from flight calls. In particular, we contrast a “shallow learning” approach based on unsupervised dictionary learning with a deep convolutional neural network combined with data augmentation. We show that the two models perform comparably on a dataset of 5,428 flight calls spanning 43 different species, with both significantly outperforming an MFCC baseline. Finally, we show that by combining the models using a simple late-fusion approach we can further improve the results, obtaining a state-of-the-art classification accuracy of 0.96.

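The simple late-fusion approach can be illustrated roughly as follows (a generic sketch; the exact fusion rule and weights used in the paper may differ): each model outputs per-class probabilities, and the fused prediction is taken from their weighted average.

    import numpy as np

    def late_fusion(proba_shallow, proba_deep, weight=0.5):
        """Fuse per-class probability estimates from the two models by a
        weighted average, then pick the most likely species per clip."""
        fused = weight * proba_shallow + (1.0 - weight) * proba_deep
        return fused.argmax(axis=1), fused

    # Toy example: 3 clips, 43 species, random "probabilities" from each model.
    rng = np.random.default_rng(0)
    p1 = rng.dirichlet(np.ones(43), size=3)  # shallow dictionary-learning model
    p2 = rng.dirichlet(np.ones(43), size=3)  # deep CNN with data augmentation
    labels, fused = late_fusion(p1, p2)
    print(labels)
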
Fusing Shallow and Deep Learning for Bioacoustic Bird Species Classification
J. Salamon, J. P. Bello, A. Farnsworth and S. Kelling
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.

[IEEE][PDF][BibTeX][Copyright]


Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring

23/11/2016

0 Comments

 
A white-throated sparrow, one of the species targeted in the study. Image by Simon Pierre Barrette, license CC-BY-SA 3.0.
Automatic classification of animal vocalizations has great potential to enhance the monitoring of species movements and behaviors. This is particularly true for monitoring nocturnal bird migration, where automated classification of migrants’ flight calls could yield new biological insights and conservation applications for birds that vocalize during migration. In this paper we investigate the automatic classification of bird species from flight calls, and in particular the relationship between two different problem formulations commonly found in the literature: classifying a short clip containing one of a fixed set of known species (N-class problem) and the continuous monitoring problem, the latter of which is relevant to migration monitoring. We implemented a state-of-the-art audio classification model based on unsupervised feature learning and evaluated it on three novel datasets, one for studying the N-class problem including over 5000 flight calls from 43 different species, and two realistic datasets for studying the monitoring scenario comprising hundreds of thousands of audio clips that were compiled by means of remote acoustic sensors deployed in the field during two migration seasons. We show that the model achieves high accuracy when classifying a clip to one of N known species, even for a large number of species. In contrast, the model does not perform as well in the continuous monitoring case. Through a detailed error analysis (that included full expert review of false positives and negatives) we show the model is confounded by varying background noise conditions and previously unseen vocalizations. We also show that the model needs to be parameterized and benchmarked differently for the continuous monitoring scenario. Finally, we show that despite the reduced performance, given the right conditions the model can still characterize the migration pattern of a specific species. The paper concludes with directions for future research.

The full article is available freely (open access) on PLOS ONE:

Towards the Automatic Classification of Avian Flight Calls for Bioacoustic Monitoring
J. Salamon, J. P. Bello, A. Farnsworth, M. Robbins, S. Keen, H. Klinck and S. Kelling
PLOS ONE 11(11): e0166866, 2016. doi: 10.1371/journal.pone.0166866.
[PLOS ONE][PDF][BibTeX]

Along with the article, we have also published the three new bioacoustic machine learning datasets that were compiled for this study.


The Implementation of Low-cost Urban Acoustic Monitoring Devices

16/6/2016

0 Comments

 

The urban sound environment of New York City (NYC) can be, amongst other things, loud, intrusive, exciting and dynamic. As indicated by the large number of noise complaints registered with the NYC 311 information/complaints line, the urban sound environment has a profound effect on the quality of life of the city’s inhabitants. To monitor and ultimately understand these sonic environments, a process of long-term acoustic measurement and analysis is required. The traditional method of environmental acoustic monitoring utilizes short-term measurement periods using expensive equipment, set up and operated by experienced and costly personnel. In this paper a different approach to this application is proposed, which implements a smart, low-cost, static acoustic sensing device based around consumer hardware. These devices can be deployed in numerous and varied urban locations for long periods of time, allowing for the collection of longitudinal urban acoustic data. The varied environmental conditions of urban settings make for a challenge in gathering calibrated sound pressure level data for prospective stakeholders. This paper details the sensors’ design, development and potential future applications, with a focus on the calibration of the devices’ microelectromechanical systems (MEMS) microphones in order to generate reliable decibel levels at the type/class 2 level.

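To give a sense of what producing reliable decibel levels involves, here is a simplified sketch (not the calibration procedure from the paper): blocks of samples are converted to sound pressure levels using a per-device offset obtained by comparing the MEMS microphone against a calibrated reference meter.

    import numpy as np

    def spl_db(samples, calibration_offset_db):
        """Convert a block of samples to a sound pressure level in dB.
        calibration_offset_db is a hypothetical per-device offset measured
        against a calibrated reference; frequency weighting (e.g., A-weighting)
        is assumed to have been applied upstream in this sketch."""
        rms = np.sqrt(np.mean(np.square(samples)))
        return 20.0 * np.log10(rms + 1e-12) + calibration_offset_db

    # A full-scale 1 kHz sine has an RMS of 1/sqrt(2), i.e. about 3 dB below
    # the calibration offset in this convention.
    sr = 48000
    x = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)
    print(round(spl_db(x, calibration_offset_db=94.0), 1))
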
For further details see our paper:

The Implementation of Low-cost Urban Acoustic Monitoring Devices
C. Mydlarz, J. Salamon and J. P. Bello
Applied Acoustics, special issue on Acoustics for Smart Cities, 2016.
[Elsevier][PDF]

This paper is part of the SONYC project.


A Comparison of Melody Extraction Methods Based on Source-Filter Modelling

26/5/2016

1 Comment

This work explores the use of source-filter models for pitch salience estimation and their combination with different pitch tracking and voicing estimation methods for automatic melody extraction. Source-filter models are used to create a mid-level representation of pitch that implicitly incorporates timbre information. The spectrogram of a musical audio signal is modelled as the sum of the leading voice (produced by human voice or pitched musical instruments) and accompaniment. The leading voice is then modelled with a Smoothed Instantaneous Mixture Model (SIMM) based on a source-filter model. The main advantage of such a pitch salience function is that it enhances the leading voice even without explicitly separating it from the rest of the signal. We show that this is beneficial for melody extraction, increasing pitch estimation accuracy and reducing octave errors in comparison with simpler pitch salience functions. An adequate combination with voicing detection techniques based on pitch contour characterisation leads to significant improvements over state-of-the-art methods, for both vocal and instrumental music.
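
To make the notion of a pitch salience function concrete, here is a toy harmonic-summation salience (a far simpler function than the SIMM source-filter salience studied in the paper, shown only to illustrate the concept): the salience of a candidate f0 is a weighted sum of the spectral magnitude at its harmonics.

    import numpy as np

    def harmonic_sum_salience(spectrum, freqs, f0_grid, n_harmonics=8, alpha=0.8):
        """Toy harmonic-summation salience: each candidate f0 is scored by a
        weighted sum of spectral magnitude at its harmonics. The SIMM-based
        salience in the paper additionally models the leading voice's timbre."""
        salience = np.zeros(len(f0_grid))
        for i, f0 in enumerate(f0_grid):
            for h in range(1, n_harmonics + 1):
                idx = np.argmin(np.abs(freqs - h * f0))
                salience[i] += (alpha ** (h - 1)) * spectrum[idx]
        return salience

    # Toy spectrum with energy at 220 Hz and its first few harmonics.
    freqs = np.linspace(0, 4000, 2048)
    spectrum = np.zeros_like(freqs)
    for h in range(1, 6):
        spectrum[np.argmin(np.abs(freqs - 220 * h))] = 1.0 / h
    f0_grid = np.arange(55.0, 880.0, 5.0)
    print(f0_grid[np.argmax(harmonic_sum_salience(spectrum, freqs, f0_grid))])
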
For further details see our paper:

A Comparison of Melody Extraction Methods Based on Source-Filter Modelling
J. J. Bosch, R. M. Bittner, J. Salamon, and E. Gómez
Proc. 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York City, USA, Aug. 2016.

