Justin Salamon
  • Home
  • News
  • Research
  • Publications
  • Code/Data
  • Melody Extraction
  • PhD Thesis
  • Contact
    • Music
    • Music Technology

The YouTube-ASMR-300K Dataset

15/6/2020

0 Comments

 
We're excited to release the YouTube-ASMR-300K dataset, the largest video dataset with spatial audio published to date!
Picture

Learning from videos with spatial audio is a relatively new domain. While the amount of spatial audio content is increasing, currently there are few video datasets with spatial audio in which the visual content is spatially aligned with the audio content. We therefore introduce a new large-scale dataset of ASMR videos collected from YouTube that contains stereo audio.

ASMR (autonomous sensory meridian response) videos are readily available online and typically feature an individual actor or "ASMRtist" making different sounds while facing towards a camera set up with
stereo/binaural or paired microphones. 

The audio in these videos contains binaural cues that, when listened to with headphones, create a highly immersive experience in which listeners perceive the sounds as if they were happening around
them. Thus there is strong correspondence between the visual and spatial audio content in these videos.

​Our full dataset, YouTube-ASMR-300K, consists of approximately 300K 10-second video clips with spatial audio. From this full dataset, we also manually curated a subset of 30K clips from 30 ASMR channels that feature more sound events moving spatially for training our models. We call this curated dataset YouTube-ASMR. We split the video clips into training, validation, and test sets in an 80-10-10 ratio.


Compared to the existing datasets, YouTube-ASMR300K is (1) larger by at least 8X, (2) collected
in-the-wild, and (3) contains sound sources in motion (e.g., a user waves a tuning fork across the field of view), which is important for training models on diverse spatial cues. how YouTube-ASMR and YouTube-ASMR-300K compare to existing video dataset with spatial audio:
Dataset
​# Unique videos
Durations (hours)
Lu et al., ICIP 2019
​N/R
9.3
FAIR-play
N/R (2000 10-sec clips in total)
5.2
YouTube-360
1146
114
YouTube-ASMR
3520
96
YouTube-ASMR-300K
33725
904

​YouTube-ASMR-300K was compiled as part of our CVPR 2020 paper:

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
K. Yang, B. Russell, J. Salamon
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9932-9941.
[CVF][PDF][BibTeX]

The YouTube-ASMR-300K dataset and other project materials are available on our companion website:

https://karreny.github.io/telling-left-from-right/
0 Comments

Teaching Machines to Locate and Generate 3D Sound in Videos by Watching ASMR

11/6/2020

0 Comments

 

Teaching machines to sense the world around them is a challenging problem. They need to:
  • Learn to SEE the world
  • Learn to HEAR the world
  • Learn to LINK what they SEE with what they HEAR

We've seen dramatic progress in this area in recent years, but significant challenges remain:
  • We have access to millions of videos, but they aren't labeled:
    • Can we teach machines from videos without labels?
  • In the same way objects move on screen, the sounds they make move too. Humans leverage this spatial correspondence between sight & sound to better understand the world:
    • Can machines leverage this audiovisual spatial correspondence?

We present a novel self-supervised approach to spatial, audiovisual learning: we train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. We also show the technique generalizes to 360 videos with 3D sound by replacing left/right flipping with front/back audio rotation.

Here's the idea in a nutshell:


In a video with binaural audio, the location of what we see on screen corresponds to the perceived location of the sound. The spatial audio effect is thanks to the stereo audio with Left and Right channels:
Picture
But what if we flip the left and right audio channels? Now the audiovisual spatial correspondence is broken, and the sound will be perceived as coming from the opposite direction:
Picture
We can leverage audiovisual spatial correspondence to learn a spatial audiovisual representation via self-supervision!

Concretely, we train a model to predict whether the audio channels have been flipped or not:
Picture
This surprisingly simple task results in a strong audiovisual representation that's useful in a variety of  applications, including:
  • Sounding face tracking
  • On/off-screen sound source separation
  • Automatic audio spatialization in 2D and 3D video, a.k.a upmixing: converting a video with mono audio to a video with stereo or ambisonic audio giving a surround sound effect where the sound of visible objects moves in synchrony with the object's motion.

In case you missed in the demo video at the top of this page, here's the upmixing example again:

The YouTube-ASMR-300K dataset

To learn this spatial audiovisual representation, we compiled a new dataset containing hundreds of thousands of videos with spatial audio. But... where do you find video with spatial audio?

Cue in ASMR videos to the rescue!


ASMR (autonomous sensory meridian response) videos typically feature an individual actor or "ASMRtist" making different sounds while facing towards a camera set up with stereo/binaural or paired microphones. Some people watch ASMR videos to fall asleep. Others enjoy the tingling sensation the spatial audio gives them. Others can't stand them. However people react to ASMR, one thing is clear - ASMR videos are trending on YouTube... in fact, there are millions of them!

And here's the important part - the audio in these videos contains binaural cues such that there is strong audiovisual correspondence between the visual and spatial audio cues.

We're delighted to formally release the YouTube-ASMR-300K dataset, the largest video dataset with spatial audio to date:
Picture
To download YouTube-ASMR-300K, please visit our companion website:

https://karreny.github.io/telling-left-from-right/

Learn more

You can also learn more about our work by reading our CVPR 2020 paper:

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
Karren Yang, Bryan Russell, Justin Salamon
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9932-9941.
[CVF][PDF][BibTeX]
Picture
0 Comments

    NEWS

    Machine listening research, code, data & hacks!

    Archives

    April 2022
    November 2021
    October 2021
    June 2021
    January 2021
    October 2020
    June 2020
    May 2020
    April 2020
    January 2020
    November 2019
    October 2019
    June 2019
    May 2019
    March 2019
    February 2019
    January 2019
    November 2018
    October 2018
    August 2018
    July 2018
    May 2018
    April 2018
    February 2018
    October 2017
    August 2017
    July 2017
    June 2017
    April 2017
    March 2017
    January 2017
    December 2016
    November 2016
    October 2016
    August 2016
    June 2016
    May 2016
    April 2016
    February 2016
    January 2016
    November 2015
    October 2015
    July 2015
    June 2015
    April 2015
    February 2015
    November 2014
    October 2014
    September 2014
    June 2014
    April 2014
    March 2014
    February 2014
    December 2013
    September 2013
    July 2013
    May 2013
    February 2013
    January 2013
    December 2012
    November 2012
    October 2012
    August 2012
    July 2012
    June 2012

    Categories

    All
    ACM MM'13
    ACM MM'14
    Acoustic Ecology
    Acoustic Event Detection
    Acoustic Sensing
    AES
    Applied Acoustics
    Article
    Audio-annotator
    Audio To Midi
    Auditory Scene Analysis
    Avian
    Award
    Baseball
    Beer
    Best Oral Presentation
    Best Paper Award
    Best Student Paper Award
    BigApps
    Bioacoustics
    BirdVox
    Book
    Chapter
    CHI
    Citizen Science
    Classification
    Computer Vision
    Conference
    Connected Cities
    Convolutional Neural Networks
    Cornell Lab Of Ornithology
    Coursera
    Cover Detection
    CREPE
    Crowdcrafting
    Crowdsourcing
    CUSP
    CVPR
    Data Augmentation
    Data Science
    Dataset
    Data Structures
    Dcase
    Deep Learning
    Domain
    Education
    Entrepreneurship
    Environmental Sound
    Essentia
    Eusipco
    Eusipco2015
    Evaluation
    Few-shot Learning
    Flight Calls
    Girl Scouts
    Grant
    Hackathon
    Hackday
    Hackfest
    HCI
    Hildegard Von Bingen
    ICASSP
    ICASSP 2020
    IEEE Signal Processing Letters
    Ieee Spm
    Indian Classical Music
    Interface
    Interspeech
    Interview
    Ismir 2012
    Ismir2014
    Ismir2015
    Ismir2016
    Ismir2017
    Ismir2020
    ITP
    Jams
    Javascript
    JNMR
    Journal
    Machine Learning
    Machine Listening
    Map
    Media
    Melodia
    Melody Extraction
    Metric Learning
    Midi
    Migration Monitoring
    MIR
    Mir_eval
    MOOC
    MTG-QBH
    Music Informatics
    Music Information Retrieval
    Music Similarity
    National Science Foundation
    Neumerator
    New York Times
    Noise Pollution
    Notebook
    NPR
    NSF
    NYC
    NYU
    Open Source
    Pitch
    Pitch Contours
    Pitch Tracking
    Plos One
    Plug In
    Plug-in
    Presentation
    Press
    PRI
    Prosody
    Publication
    Python
    Query By Humming
    Query-by-humming
    Radio
    Representation Learning
    Research
    Robots
    Scaper
    Science And The City
    Science Friday
    Self-supervision
    Sensor Network
    Sensors
    Sight And Sound Workshop
    Smart Cities
    Software
    SONYC
    Sound Classification
    Sound Education
    Sound Event Detection
    Soundscape
    Sounds Of New York City
    Sound Workshop
    Speech
    STEM
    Synthesis
    Taste Of Science
    Taxonomy
    Technical Report
    Time Series
    Tonic ID
    Tony
    Tutorial
    Unsupervised Feature Learning
    Urban
    Urban Sound Analysis
    Urban Sound Tagging
    Vamp
    Version Identification
    Visualization
    Vocaloid
    Vocoder
    Warblers
    Wav To Midi
    Welcome
    Wired
    WNYC
    Women In Science
    Workshop
    World Domination
    Wsf14
    Youtube

    RSS Feed

Powered by Create your own unique website with customizable templates.
  • Home
  • News
  • Research
  • Publications
  • Code/Data
  • Melody Extraction
  • PhD Thesis
  • Contact
    • Music
    • Music Technology