Blog Archives

The YouTube-ASMR-300K Dataset

15/6/2020

We're excited to release the YouTube-ASMR-300K dataset, the largest video dataset with spatial audio published to date!

Learning from videos with spatial audio is a relatively new domain. While the amount of spatial audio content is increasing, currently there are few video datasets with spatial audio in which the visual content is spatially aligned with the audio content. We therefore introduce a new large-scale dataset of ASMR videos collected from YouTube that contains stereo audio.

ASMR (autonomous sensory meridian response) videos are readily available online and typically feature an individual actor or "ASMRtist" making different sounds while facing towards a camera set up with stereo/binaural or paired microphones.

The audio in these videos contains binaural cues that, when listened to with headphones, create a highly immersive experience in which listeners perceive the sounds as if they were happening around them. Thus there is strong correspondence between the visual and spatial audio content in these videos.

Our full dataset, YouTube-ASMR-300K, consists of approximately 300K 10-second video clips with spatial audio. From this full dataset, we also manually curated a subset of 30K clips from 30 ASMR channels that feature more sound events moving spatially for training our models. We call this curated dataset YouTube-ASMR. We split the video clips into training, validation, and test sets in an 80-10-10 ratio.
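As a rough illustration of an 80-10-10 split, here is a minimal Python sketch. The post does not say how the split was actually produced (per clip, per video, or per channel), so the function name `split_clips`, the seed, and the made-up clip IDs are all assumptions for illustration only.

```python
import random

def split_clips(clip_ids, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle clip IDs reproducibly and cut them into train/val/test lists."""
    ids = sorted(clip_ids)              # fixed starting order, independent of input order
    random.Random(seed).shuffle(ids)    # seeded shuffle so the split is repeatable
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # remainder, so every clip lands in one set
    }

# Hypothetical clip IDs, invented for this example.
clips = [f"clip_{i:05d}" for i in range(1000)]
splits = split_clips(clips)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting per clip is the simplest option. When several clips come from the same video, grouping by video before splitting is a common way to keep near-duplicate content out of the test set.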

Compared to existing datasets, YouTube-ASMR-300K is (1) larger by at least 8x, (2) collected in the wild, and (3) contains sound sources in motion (e.g., a user waves a tuning fork across the field of view), which is important for training models on diverse spatial cues. The table below shows how YouTube-ASMR and YouTube-ASMR-300K compare to existing video datasets with spatial audio:

Dataset	# Unique videos	Duration (hours)
Lu et al., ICIP 2019	N/R	9.3
FAIR-play	N/R (2000 10-sec clips in total)	5.2
YouTube-360	1146	114
YouTube-ASMR	3520	96
YouTube-ASMR-300K	33725	904
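As a quick sanity check, the hours in the table can be converted into a count of 10-second clips. The sketch below assumes the reported duration is simply the total length of the 10-second clips, which the post does not state explicitly; `implied_clips` is a helper name made up for this example.

```python
# Total hours copied from the table above.
HOURS = {"YouTube-ASMR": 96, "YouTube-ASMR-300K": 904}
CLIP_SECONDS = 10  # clip length stated in the post

def implied_clips(hours, clip_seconds=CLIP_SECONDS):
    """Number of back-to-back clips of the given length that fit in `hours`."""
    return hours * 3600 // clip_seconds

for name, hours in HOURS.items():
    print(f"{name}: about {implied_clips(hours):,} ten-second clips")
# YouTube-ASMR: about 34,560 ten-second clips
# YouTube-ASMR-300K: about 325,440 ten-second clips
```

Under that assumption, both counts are in line with the "30K" and "approximately 300K" clip figures quoted in the post.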

YouTube-ASMR-300K was compiled as part of our CVPR 2020 paper:

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
K. Yang, B. Russell, J. Salamon
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9932-9941.
[CVF][PDF][BibTeX]

The YouTube-ASMR-300K dataset and other project materials are available on our companion website:

https://karreny.github.io/telling-left-from-right/

Teaching Machines to Locate and Generate 3D Sound in Videos by Watching ASMR

11/6/2020

Teaching machines to sense the world around them is a challenging problem. They need to:

Learn to SEE the world
Learn to HEAR the world
Learn to LINK what they SEE with what they HEAR

We've seen dramatic progress in this area in recent years, but significant challenges remain:

We have access to millions of videos, but they aren't labeled:
- Can we teach machines from videos without labels?
In the same way objects move on screen, the sounds they make move too. Humans leverage this spatial correspondence between sight & sound to better understand the world:
- Can machines leverage this audiovisual spatial correspondence?

We present a novel self-supervised approach to spatial, audiovisual learning: we train a model to determine whether the left and right audio channels have been flipped, forcing it to reason about spatial localization across the visual and audio streams. We also show the technique generalizes to 360 videos with 3D sound by replacing left/right flipping with front/back audio rotation.

Here's the idea in a nutshell:

In a video with binaural audio, the location of what we see on screen corresponds to the perceived location of the sound. This spatial effect comes from the stereo audio, with its left and right channels:

But what if we flip the left and right audio channels? Now the audiovisual spatial correspondence is broken, and the sound will be perceived as coming from the opposite direction:
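To make the effect of a flip concrete, here is a small Python sketch using one simple spatial cue: the level difference between the two channels. This is only an illustration, not the model from the paper. Real binaural audio also carries timing differences and other cues, and the function name `level_difference_db` and the toy signal are assumptions made for this example.

```python
import math

def level_difference_db(left, right, eps=1e-12):
    """Left-to-right energy ratio in dB; positive means louder on the left."""
    e_left = sum(x * x for x in left) + eps
    e_right = sum(x * x for x in right) + eps
    return 10.0 * math.log10(e_left / e_right)

# Toy stereo signal: the same 440 Hz tone, twice the amplitude on the left.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
left = [2.0 * x for x in tone]
right = tone

original = level_difference_db(left, right)
flipped = level_difference_db(right, left)  # same audio with the channels swapped

print(f"original: {original:+.2f} dB, flipped: {flipped:+.2f} dB")
```

Swapping the channels negates the level difference, so a source that sounded closer to the left ear now sounds closer to the right, while the picture on screen has not changed.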

We can leverage audiovisual spatial correspondence to learn a spatial audiovisual representation via self-supervision!

Concretely, we train a model to predict whether the audio channels have been flipped or not:
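The training examples for this task can be generated without any human labels, as the following minimal Python sketch shows. It is not the paper's data pipeline: the helper name `make_flip_example`, the 50% flip probability, and the use of plain lists for audio are assumptions for illustration. In a real setup the channels would be arrays, and the unmodified video frames would be passed to the model alongside the audio.

```python
import random

def make_flip_example(left, right, rng):
    """Return ((left, right), label); label 1 means the channels were swapped."""
    flipped = rng.random() < 0.5  # assumed: flip half of the examples
    if flipped:
        left, right = right, left
    return (left, right), int(flipped)

rng = random.Random(0)  # seeded for repeatability
left, right = [0.9, 0.8, 0.7], [0.1, 0.2, 0.1]  # toy channel samples
examples = [make_flip_example(left, right, rng) for _ in range(8)]

for audio, label in examples:
    print(label, audio)
```

Because the label is produced by the flip itself, any video with stereo audio can serve as training data, which is what makes the task self-supervised.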

This surprisingly simple task results in a strong audiovisual representation that's useful in a variety of applications, including:

Sounding face tracking
On/off-screen sound source separation
Automatic audio spatialization in 2D and 3D video, a.k.a. upmixing: converting a video with mono audio into a video with stereo or ambisonic audio, giving a surround-sound effect in which the sound of visible objects moves in sync with their motion.

In case you missed it in the demo video at the top of this page, here's the upmixing example again:

The YouTube-ASMR-300K dataset

To learn this spatial audiovisual representation, we compiled a new dataset containing hundreds of thousands of videos with spatial audio. But... where do you find video with spatial audio?

ASMR videos to the rescue!

ASMR (autonomous sensory meridian response) videos typically feature an individual actor or "ASMRtist" making different sounds while facing towards a camera set up with stereo/binaural or paired microphones. Some people watch ASMR videos to fall asleep. Others enjoy the tingling sensation the spatial audio gives them. Others can't stand them. However people react to ASMR, one thing is clear - ASMR videos are trending on YouTube... in fact, there are millions of them!

And here's the important part - the audio in these videos contains binaural cues, so there is strong correspondence between the visual content and the spatial audio.

We're delighted to formally release the YouTube-ASMR-300K dataset, the largest video dataset with spatial audio to date:

To download YouTube-ASMR-300K, please visit our companion website:

https://karreny.github.io/telling-left-from-right/

Learn more

You can also learn more about our work by reading our CVPR 2020 paper:

Telling Left From Right: Learning Spatial Correspondence of Sight and Sound
Karren Yang, Bryan Russell, Justin Salamon
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9932-9941.
[CVF][PDF][BibTeX]
