Earlier this month, two researchers, Ruohan Gao of the University of Texas and Kristen Grauman of Facebook AI Research, introduced a method that teaches an AI system to convert ordinary mono sound into binaural sound. They call the concept “2.5D visual sound”; it uses video to generate synthetic 3D sound.
According to the researchers, binaural audio gives a listener a 3D sound sensation that allows a rich experience of the scene. However, such recordings are not easily available and require expertise and equipment to obtain. The researchers note that humans generally determine the direction of a sound with the help of visual cues, so they take a similar approach: a machine learning system is given a video of a scene along with its mono sound recording. Using the video, the ML system works out the direction of the sounds and then adjusts the “interaural time and level differences” (the small differences in arrival time and loudness between the two ears) to create the effect of 3D sound for the listener.
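To make “interaural time and level differences” concrete, here is a minimal sketch in plain Python. It is not the researchers' method: it simply delays and attenuates the ear farther from the source. The function name `spatialize`, the Woodworth-style delay approximation, the head radius, and the 6 dB maximum level difference are all illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed room-temperature air
HEAD_RADIUS = 0.0875     # m, assumed average head radius
MAX_ILD_DB = 6.0         # assumed level difference at 90 degrees (illustrative)


def spatialize(mono, sample_rate, azimuth_deg):
    """Return (left, right) sample lists for a source at azimuth_deg.

    Positive azimuth means the source is to the listener's right.
    """
    theta = math.radians(max(-90.0, min(90.0, azimuth_deg)))
    angle = abs(theta)
    # Woodworth approximation of the interaural time difference, in seconds.
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (angle + math.sin(angle))
    delay = round(itd * sample_rate)  # delay of the far ear, in samples
    # Simple level difference that grows with the sine of the angle.
    far_gain = 10 ** (-MAX_ILD_DB * math.sin(angle) / 20)

    near = list(mono)
    far = [0.0] * delay + [far_gain * s for s in mono]
    far = far[:len(mono)]  # keep both channels the same length

    # A source on the right reaches the right ear first and louder.
    return (far, near) if theta > 0 else (near, far)


# Usage: a 0.1 s, 440 Hz tone placed fully to the listener's right.
sr = 44100
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 10)]
left, right = spatialize(tone, sr, 90)
```

With these assumed constants, the left channel is a quieter copy of the tone that starts well under a millisecond later than the right channel, which is the kind of cue the ear uses to place a sound.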
The researchers say they devised a deep convolutional neural network (CNN) that learns to decode the monaural (single-channel) soundtrack into its binaural counterpart. Visual information about the objects and the scene is injected into the CNN throughout the process.
“We call the resulting output 2.5D visual sound—the visual stream helps “lift” the flat single channel audio into spatialized sound. In addition to sound generation, we show the self-supervised representation learned by our network benefits audio-visual source separation”, say researchers.
Training method used
For training, the researchers first created a database of examples of the effect they wanted the machine learning system to learn. Grauman and Gao built it from binaural recordings of 2,265 musical clips, each accompanied by video.
The researchers mention in the paper, “Our intent was to capture a variety of sound-making objects in a variety of spatial contexts, by assembling different combinations of instruments and people in the room. We post-process the raw data into 10s clips. In the end, our BINAURAL-MUSIC-ROOM dataset consists of 2,265 short clips of musical performances, totaling 6.3 hours”.
The equipment used for this project involved a 3Dio Free Space XLR binaural microphone, a GoPro HERO6 Black camera, and a Tascam DR-60D recorder as an audio pre-amplifier. The GoPro camera was mounted on top of the 3Dio binaural microphone to mimic a person seeing and hearing, respectively. The GoPro camera records videos at 30fps with stereo audio.
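The dataset figures quoted above can be checked with a few lines of arithmetic. This sketch assumes every clip is exactly 10 seconds long and that every video frame is kept; the frame totals are only a rough indication of scale, not figures reported by the researchers.

```python
NUM_CLIPS = 2265     # clips in the dataset, as quoted from the paper
CLIP_SECONDS = 10    # assumed exact length of each clip
FPS = 30             # GoPro frame rate mentioned above

total_hours = NUM_CLIPS * CLIP_SECONDS / 3600
frames_per_clip = CLIP_SECONDS * FPS
total_frames = NUM_CLIPS * frames_per_clip

print(f"{total_hours:.1f} hours, {frames_per_clip} frames per clip, "
      f"{total_frames} frames in total")
# prints: 6.3 hours, 300 frames per clip, 679500 frames in total
```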
The researchers then used the recordings in the dataset to train a machine learning algorithm to recognize the direction of sound from a video of the scene. Once the system has learned this, it can watch a video and adjust a monaural recording to simulate where the sound ought to be coming from.
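One way to picture this training setup is that every binaural clip supplies its own training pair: mixing the two channels gives a mono input, and the original channels serve as the target. The sketch below assumes the mono input is the sum of the two channels and that the model is asked to predict their difference. That formulation and the function names are assumptions for illustration, not details stated in this article.

```python
def make_training_pair(left, right):
    """Build (mono_input, difference_target) from one binaural clip."""
    mono = [l + r for l, r in zip(left, right)]   # what the model hears
    diff = [l - r for l, r in zip(left, right)]   # what it must predict
    return mono, diff


def reconstruct(mono, predicted_diff):
    """Recover both channels from the mono mix and a predicted difference."""
    left = [(m + d) / 2 for m, d in zip(mono, predicted_diff)]
    right = [(m - d) / 2 for m, d in zip(mono, predicted_diff)]
    return left, right


# Toy binaural clip: the source is louder in the left channel.
clip_left = [0.5, -0.25, 0.75]
clip_right = [0.25, -0.125, 0.375]
mono, diff = make_training_pair(clip_left, clip_right)

# A perfect prediction of the difference gives back the original channels.
rebuilt_left, rebuilt_right = reconstruct(mono, diff)
```

Because the target comes from the recording itself, no manual labeling is needed; the video's job is to help the model guess the difference signal that the mono mix alone does not reveal.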
The accompanying video shows the results of the research, which are quite good. In it, the 2.5D recordings are compared against the monaural recordings.
However, the system cannot generate complete 3D sound, and there are certain situations that the algorithm finds difficult to handle. In addition, the ML system cannot account for any sound source that is not visible in the video, or for sources it has not been trained on.
The researchers say the method works best for music videos, and they plan to extend its applications. “Generating binaural audio for off-the-shelf video could potentially close the gap between transporting audio and visual experiences, and will be useful for new applications in VR/AR. As future work, we plan to explore ways to incorporate object localization and motion, and explicitly model scene sounds,” say the researchers.
For more information, check out the official research paper.