News - Justin Salamon

Invited talk at CVPR 2021 Sight and Sound workshop: On the road to artistic AudioVisual representation learning

24/6/2021

This past Sunday I had the pleasure of giving an invited talk at the CVPR 2021 Sight and Sound workshop.

In my talk, titled "On the road to artistic AudioVisual representation learning", I gave a selective overview of my research journey through audio and video representation learning (supervised and self-supervised), which has led me to some of the questions I'm interested in now on the topic of artistic AudioVisual representation learning.

I tried to keep the talk light, fun, and full of demos :)

The complete talk is now available here:

Deep Salience Representations for F0 Estimation in Polyphonic Music

15/7/2017

Estimating fundamental frequencies in polyphonic music remains a notoriously difficult task in Music Information Retrieval. While other tasks, such as beat tracking and chord recognition, have seen improvement with the application of deep learning models, little work has been done to apply deep learning methods to fundamental-frequency-related tasks, including multi-f0 and melody tracking, primarily due to the scarce availability of labeled data. In this work, we describe a fully convolutional neural network for learning salience representations for estimating fundamental frequencies, trained using a large, semi-automatically generated f0 dataset. We demonstrate the effectiveness of our model for learning salience representations for both multi-f0 and melody tracking in polyphonic audio, and show that our models achieve state-of-the-art performance on several multi-f0 and melody datasets. We conclude with directions for future research.

Deep Salience Representations for f0 Estimation in Polyphonic Music
R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello.
In 18th International Society for Music Information Retrieval Conference, Suzhou, China, Oct. 2017.
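
To make the idea of a salience representation a little more concrete, here is a minimal sketch of how f0 values could be read off a time-frequency salience matrix once a model has produced one. It is not taken from the paper: the function names, the threshold of 0.3, and the bin layout (`fmin_hz=32.7`, `bins_per_octave=60`) are all illustrative assumptions, and the network itself is not shown.

```python
def bin_to_hz(bin_index, fmin_hz=32.7, bins_per_octave=60):
    """Map a log-frequency bin index to a frequency in Hz (assumed bin layout)."""
    return fmin_hz * 2.0 ** (bin_index / bins_per_octave)


def melody_from_salience(salience, threshold=0.3):
    """One f0 per frame: the most salient bin, or 0.0 if it is below threshold.

    `salience` is a list of frames, each a list of per-bin values in [0, 1].
    """
    melody = []
    for frame in salience:
        best_bin = max(range(len(frame)), key=frame.__getitem__)
        voiced = frame[best_bin] >= threshold
        melody.append(bin_to_hz(best_bin) if voiced else 0.0)
    return melody


def multi_f0_from_salience(salience, threshold=0.3):
    """Several f0s per frame: every local peak at or above threshold."""
    result = []
    for frame in salience:
        peaks = []
        for b, value in enumerate(frame):
            left = frame[b - 1] if b > 0 else float("-inf")
            right = frame[b + 1] if b < len(frame) - 1 else float("-inf")
            if value >= threshold and value > left and value >= right:
                peaks.append(bin_to_hz(b))
        result.append(peaks)
    return result


# Toy salience matrix: 2 frames x 5 bins (hypothetical values).
toy_salience = [
    [0.10, 0.90, 0.20, 0.00, 0.60],
    [0.05, 0.10, 0.10, 0.20, 0.10],
]
print(melody_from_salience(toy_salience))
print(multi_f0_from_salience(toy_salience))
```

The same matrix serves both tasks in this sketch: melody tracking keeps a single maximum per frame, while multi-f0 keeps every peak that clears the threshold.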

Best Student Paper Award at 2017 AES International Conference on Semantic Audio

23/6/2017

I'm excited to report that our paper "Pitch Contours as a Mid-Level Representation for Music Informatics" has won the Best Student Paper Award at the 2017 AES International Conference on Semantic Audio. The paper, led and presented by my colleague Rachel Bittner, proposes a factored architecture for a variety of pitch-informed MIR tasks, such as predominant and multiple f0 estimation and genre, gender and singing style classification, with pitch contours as a powerful and semantically rich mid-level representation.

So... should all machine learning for music be end-to-end? See what we found in the full paper:

Pitch Contours as a Mid-Level Representation for Music Informatics
R. M. Bittner, J. Salamon, J. J. Bosch, and J. P. Bello.
In AES Conference on Semantic Audio, Erlangen, Germany, Jun. 2017.
[PDF]

Pitch Contours as a Mid-Level Representation for Music Informatics

14/4/2017

Content-based Music Informatics includes tasks that involve estimating the pitched content of music, such as the main melody or the bass line. To date, the field lacks a good machine representation that models the human perception of pitch, with each task using specific, tailored representations. This paper proposes factoring pitch estimation problems into two stages, where the output of the first stage for all tasks is a multipitch contour representation. Further, we propose the adoption of pitch contours as a unit of pitch organization. We give a review of the existing work on contour extraction and characterization and present experiments that demonstrate the discriminability of pitch contours.

Agree? Disagree? Get the full details here:

Pitch Contours as a Mid-Level Representation for Music Informatics
R. M. Bittner, J. Salamon, J. J. Bosch, and J. P. Bello.
In AES Conference on Semantic Audio, Erlangen, Germany, Jun. 2017.
[PDF]

Fusing Shallow and Deep Learning for Bioacoustic Bird Species Classification

15/12/2016

Automated classification of organisms to species based on their vocalizations would contribute tremendously to abilities to monitor biodiversity, with a wide range of applications in the field of ecology. In particular, automated classification of migrating birds’ flight calls could yield new biological insights and conservation applications for birds that vocalize during migration. In this paper we explore state-of-the-art classification techniques for large-vocabulary bird species classification from flight calls. In particular, we contrast a “shallow learning” approach based on unsupervised dictionary learning with a deep convolutional neural network combined with data augmentation. We show that the two models perform comparably on a dataset of 5428 flight calls spanning 43 different species, with both significantly outperforming an MFCC baseline. Finally, we show that by combining the models using a simple late-fusion approach we can further improve the results, obtaining a state-of-the-art classification accuracy of 0.96.

Fusing Shallow and Deep Learning for Bioacoustic Bird Species Classification
J. Salamon, J. P. Bello, A. Farnsworth and S. Kelling
In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, March 2017.
[IEEE][PDF][BibTeX][Copyright]
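
For readers unfamiliar with late fusion, the sketch below shows one common form of it: blending the class probabilities that two independently trained models assign to the same clip and taking the top class of the blend. The abstract only says "a simple late-fusion approach", so the weighted average used here, the function name `late_fusion`, and the toy probabilities are assumptions for illustration rather than the exact rule from the paper.

```python
def late_fusion(probs_a, probs_b, weight_a=0.5):
    """Blend two models' class-probability vectors.

    Returns (index of the winning class, fused probability vector).
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same set of classes")
    fused = [
        weight_a * a + (1.0 - weight_a) * b
        for a, b in zip(probs_a, probs_b)
    ]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Hypothetical outputs for one flight call over three species.
shallow_probs = [0.50, 0.40, 0.10]  # e.g. the dictionary-learning model
deep_probs = [0.20, 0.70, 0.10]     # e.g. the convolutional network

label, fused = late_fusion(shallow_probs, deep_probs)
print(label, fused)
```

In this toy case the shallow model alone would pick class 0, but the fused scores favour class 1 because the deep model is much more confident about it.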

Pitch Analysis for Active Music Discovery @ ICML 2016

23/6/2016

Today I'll be giving an invited talk at the Machine Learning for Music Discovery Workshop as part of the ICML 2016 conference.

The talk is about Pitch Analysis for Active Music Discovery:

A significant proportion of commercial music is comprised of pitched content: a melody, a bass line, a famous guitar solo, etc. Consequently, algorithms that are capable of extracting and understanding this type of pitched content open up numerous opportunities for active music discovery, ranging from query-by-humming to musical-feature-based exploration of Indian art music or recommendation based on singing style. In this talk I will describe some of my work on algorithms for pitch content analysis of music audio signals and their application to music discovery, the role of machine learning in these algorithms, and the challenge posed by the scarcity of labeled data and how we may address it.

And here's the extended abstract:

Pitch Analysis for Active Music Discovery
J. Salamon
Machine Learning for Music Discovery workshop, International Conference on Machine Learning (ICML), invited talk, New York City, NY, USA, June 2016.
[PDF]

The workshop has a great program lined up. If you're attending ICML 2016, be sure to drop by!

A Comparison of Melody Extraction Methods Based on Source-Filter Modelling

26/5/2016

This work explores the use of source-filter models for pitch salience estimation and their combination with different pitch tracking and voicing estimation methods for automatic melody extraction. Source-filter models are used to create a mid-level representation of pitch that implicitly incorporates timbre information. The spectrogram of a musical audio signal is modelled as the sum of the leading voice (produced by human voice or pitched musical instruments) and accompaniment. The leading voice is then modelled with a Smoothed Instantaneous Mixture Model (SIMM) based on a source-filter model. The main advantage of such a pitch salience function is that it enhances the leading voice even without explicitly separating it from the rest of the signal. We show that this is beneficial for melody extraction, increasing pitch estimation accuracy and reducing octave errors in comparison with simpler pitch salience functions. The adequate combination with voicing detection techniques based on pitch contour characterisation leads to significant improvements over state-of-the-art methods, for both vocal and instrumental music.

For further details see our paper:

A Comparison of Melody Extraction Methods Based on Source-Filter Modelling
J. J. Bosch, R. M. Bittner, J. Salamon, and E. Gómez
Proc. 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York City, USA, Aug. 2016.

Ensemble: A Hybrid Human-Machine System for Generating Melody Scores From Audio

20/5/2016

Music transcription is a highly complex task that is difficult for automated algorithms, and equally challenging to people, even those with many years of musical training. Furthermore, there is a shortage of high-quality datasets for training automated transcription algorithms. In this research, we explore a semi-automated, crowdsourced approach to generate music transcriptions, by first running an automatic melody transcription algorithm on a (polyphonic) song to produce a series of discrete notes representing the melody, and then soliciting the crowd to correct this melody. We present a novel web-based interface that enables the crowd to correct transcriptions, report results from an experiment to understand the capabilities of non-experts to perform this challenging task, and characterize the characteristics and actions of workers and how they correlate with transcription performance.

For further details check out our paper:

Ensemble: A Hybrid Human-Machine System for Generating Melody Scores From Audio
T. Tse, J. Salamon, A. Williams, H. Jiang and E. Law
Proc. 17th International Society for Music Information Retrieval Conference (ISMIR 2016), New York City, USA, Aug. 2016.
[ISMIR][PDF][BibTex]

Screenshot of the Ensemble interface

Best Oral Presentation Award at ISMIR 2015

2/11/2015

Our paper "Melody Extraction by Contour Classification", presented by my colleague and first author Rachel Bittner, has won the Best Oral Presentation Award at the ISMIR 2015 conference!

A huge congratulations to my co-authors Rachel Bittner, Slim Essid and Juan Pablo Bello, and especially to Rachel for doing such an excellent job at presenting the paper at the conference!

R. Bittner, J. Salamon, S. Essid and J. P. Bello. "Melody Extraction by Contour Classification". Proc. 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, Oct. 2015.
[ISMIR][PDF][BibTex]

Melody Extraction by Contour Classification

21/7/2015

Due to the scarcity of labeled data, most melody extraction algorithms do not rely on fully data-driven processing blocks but rather on careful engineering. For example, the Melodia melody extraction algorithm employs a pitch contour selection stage that relies on a number of heuristics for selecting the melodic output. In this paper we explore the use of a discriminative model to perform purely data-driven melodic contour selection. Specifically, a discriminative binary classifier is trained to distinguish melodic from non-melodic contours. This classifier is then used to predict likelihoods for a track’s extracted contours, and these scores are decoded to generate a single melody output. The results are compared with the Melodia algorithm and with a generative model used in a previous study. We show that the discriminative model outperforms the generative model in terms of contour classification accuracy, and the melody output from our proposed system performs comparably to Melodia. The results are complemented with error analysis and avenues for future improvements.

For further details please see our paper:

R. Bittner, J. Salamon, S. Essid and J. P. Bello. "Melody Extraction by Contour Classification". Proc. 16th International Society for Music Information Retrieval Conference (ISMIR 2015), Malaga, Spain, Oct. 2015.
[ISMIR][PDF][BibTex]
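
As a rough illustration of the last step in that pipeline, the sketch below turns a set of scored contours into a single melody line. It assumes the classifier has already assigned each contour a melodic likelihood, and it uses a greedy per-frame choice (the highest-scoring contour above a threshold wins) as a stand-in for the decoding described in the paper. The `Contour` class, `decode_melody`, and the 0.5 threshold are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contour:
    start_frame: int      # index of the contour's first frame
    f0_hz: List[float]    # one f0 value per frame of the contour
    score: float          # classifier's melodic likelihood in [0, 1]


def decode_melody(contours, n_frames, threshold=0.5):
    """Greedy decoding: per frame, keep the f0 of the best-scoring contour.

    Frames covered by no contour scoring at least `threshold` are marked
    unvoiced with 0.0.
    """
    melody = [0.0] * n_frames
    best_score = [float("-inf")] * n_frames
    for contour in contours:
        if contour.score < threshold:
            continue  # treated as non-melodic
        for offset, f0 in enumerate(contour.f0_hz):
            frame = contour.start_frame + offset
            if 0 <= frame < n_frames and contour.score > best_score[frame]:
                best_score[frame] = contour.score
                melody[frame] = f0
    return melody


# Three hypothetical contours over a six-frame excerpt.
contours = [
    Contour(start_frame=0, f0_hz=[220.0, 221.0, 222.0], score=0.90),
    Contour(start_frame=1, f0_hz=[110.0, 110.0, 110.0], score=0.95),
    Contour(start_frame=3, f0_hz=[440.0, 440.0], score=0.20),
]
print(decode_melody(contours, n_frames=6))
# [220.0, 110.0, 110.0, 110.0, 0.0, 0.0]
```

A greedy choice like this can jump between overlapping contours from frame to frame, which is one reason a decoder that also considers temporal continuity may be preferable in practice.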