Blog Archives

Controllable Neural Prosody Synthesis

25/10/2020

Speech synthesis has recently seen significant improvements in fidelity, driven by the advent of neural vocoders and neural prosody generators. However, these systems lack intuitive user controls over prosody, making them unable to rectify prosody errors (e.g., misplaced emphases and contextually inappropriate emotions) or generate prosodies with diverse speaker excitement levels and emotions. We address these limitations with a user-controllable, context-aware neural prosody generator. Given a real or synthesized speech recording, our model allows a user to input prosody constraints for certain time frames and generates the remaining time frames from input text and contextual prosody. We also propose a pitch-shifting neural vocoder to modify input speech to match the synthesized prosody. Through objective and subjective evaluations we show that we can successfully incorporate user control into our prosody generation model without sacrificing the overall naturalness of the synthesized speech.
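As a rough picture of what "prosody constraints for certain time frames" could look like as data, the sketch below represents a pitch contour as one value per frame and marks unconstrained frames with `None`. The function name `apply_constraints`, the Hz values, and the overwrite-style merge are illustrative assumptions only. In the work described above the generator produces the remaining frames from the text and the contextual prosody, which this toy merge does not attempt to model.

```python
def apply_constraints(generated_f0, user_f0):
    """Keep user-specified pitch frames; fall back to generated values elsewhere.

    generated_f0: per-frame pitch (Hz) from some prosody generator (hypothetical here).
    user_f0: per-frame pitch constraints, with None where the user gave no constraint.
    """
    if len(generated_f0) != len(user_f0):
        raise ValueError("both contours must have the same number of frames")
    return [g if u is None else u for g, u in zip(generated_f0, user_f0)]


# Toy five-frame example: the user pins frames 2 and 3 to add emphasis.
generated = [110.0, 112.0, 115.0, 118.0, 120.0]
user = [None, None, 140.0, 145.0, None]
merged = apply_constraints(generated, user)
```

The point of the sketch is only the frame-level layout: a constraint is a sparse set of pinned frames, and everything else is left to the model.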

Max presented this work at INTERSPEECH 2020.

To hear more examples please visit Max's paper companion website.

For further details please read our paper:

Controllable Neural Prosody Synthesis
M. Morrison, Z. Jin, J. Salamon, N.J. Bryan, G. Mysore
Proc. Interspeech. October 2020.

Metric Learning vs Classification for Disentangled Music Representation Learning

13/10/2020

Deep representation learning offers a powerful paradigm for mapping input data onto an organized embedding space and is useful for many music information retrieval tasks. Two central methods for representation learning are deep metric learning and classification, both having the same goal of learning a representation that can generalize well across tasks. Along with generalization, the emerging concept of disentangled representations is also of great interest, where multiple semantic concepts (e.g., genre, mood, instrumentation) are learned jointly but remain separable in the learned representation space. In this paper we present a single representation learning framework that elucidates the relationship between metric learning, classification, and disentanglement in a holistic manner. For this, we (1) outline past work on the relationship between metric learning and classification, (2) extend this relationship to multi-label data by exploring three different learning approaches and their disentangled versions, and (3) evaluate all models on four tasks (training time, similarity retrieval, auto-tagging, and triplet prediction). We find that classification-based models are generally advantageous for training time, similarity retrieval, and auto-tagging, while deep metric learning exhibits better performance for triplet prediction. Finally, we show that our proposed approach yields state-of-the-art results for music auto-tagging.
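For readers unfamiliar with the metric learning side, the following minimal sketch shows a standard triplet margin loss and a simple triplet prediction check on toy 2-D embeddings. The Euclidean distance, the margin of 0.2, and the function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pushes the positive closer to the anchor than the negative."""
    d_ap = math.dist(anchor, positive)
    d_an = math.dist(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def triplet_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    correct = sum(1 for a, p, n in triplets if math.dist(a, p) < math.dist(a, n))
    return correct / len(triplets)


anchor, near, far = [0.0, 0.0], [0.1, 0.0], [1.0, 0.0]
loss_good = triplet_loss(anchor, near, far)  # positive is already closer
loss_bad = triplet_loss(anchor, far, near)   # positive is farther away
accuracy = triplet_accuracy([(anchor, near, far), (anchor, far, near)])
```

A classification-based model would instead train the embedding through a softmax or sigmoid output layer over tags; the triplet check above can still be run on its embeddings at evaluation time.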

To learn more please read our paper:

Metric Learning vs Classification for Disentangled Music Representation Learning
J. Lee, N. J. Bryan, J. Salamon, Z. Jin, J. Nam
In 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (virtual), Oct. 2020.

You can find more related materials including a short video presentation on our companion website:
https://jongpillee.github.io/metric-vs-classification/

Few-shot Drum Transcription in Polyphonic Music

11/10/2020

Data-driven approaches to automatic drum transcription (ADT) are often limited to a predefined, small vocabulary of percussion instrument classes. Such models cannot recognize out-of-vocabulary classes, nor can they adapt to finer-grained vocabularies. In this work, we address open vocabulary ADT by introducing few-shot learning to the task. We train a Prototypical Network on a synthetic dataset and evaluate the model on multiple real-world ADT datasets with polyphonic accompaniment. We show that, given just a handful of selected examples at inference time, we can match and in some cases outperform a state-of-the-art supervised ADT approach under a fixed vocabulary setting. At the same time, we show that our model can successfully generalize to finer-grained or extended vocabularies unseen during training, a scenario where supervised approaches cannot operate at all. We provide a detailed analysis of our experimental results, including a breakdown of performance by sound class and by polyphony.
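The inference step of a Prototypical Network can be sketched in a few lines: average the embeddings of the few labeled examples of each class into a prototype, then score a query by its distance to every prototype. The example below uses hand-written 2-D vectors in place of a trained embedding network, and the class names and values are made up for illustration. It shows the generic classification rule, not necessarily the exact setup used in the paper.

```python
import math


def prototypes(support):
    """Mean embedding per class. support maps label -> list of embedding vectors."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return protos


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query, protos):
    """Softmax over negative squared distances from the query to each prototype."""
    labels = list(protos)
    logits = [-sq_dist(query, protos[label]) for label in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Two labeled examples per class, supplied at inference time (toy embeddings).
support = {
    "kick": [[0.0, 0.1], [0.2, -0.1]],
    "snare": [[1.0, 1.1], [0.8, 0.9]],
    "cowbell": [[-1.0, 1.0], [-1.2, 0.8]],
}
protos = prototypes(support)
probs = classify([0.9, 1.0], protos)
predicted = max(probs, key=probs.get)
```

Because classes are defined only by the support examples, adding a new percussion class means adding a new entry to `support`, with no retraining of the embedding model.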

To learn more please read our paper:

Few-Shot Drum Transcription in Polyphonic Music
Y. Wang, J. Salamon, M. Cartwright, N. J. Bryan, J. P. Bello
In 21st International Society for Music Information Retrieval Conference (ISMIR), Montreal, Canada (virtual), Oct. 2020.

You can find more related materials including a short video presentation and a poster here:
https://program.ismir2020.net/poster_1-14.html
