Over 70 years ago, Alan Turing proposed the Turing Test: a test of machine intelligence that a computer passes if a human cannot distinguish it from another human being.
Researchers are still trying to figure out whether it is possible to create robots and machines that behave so much like humans that we can’t tell the difference.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have now produced sound effects that fool humans, suggesting that perhaps we can’t tell the difference after all.
Late last year, MIT researchers helped develop a system that passed a "visual" Turing Test by producing handwritten characters that fooled humans. Now CSAIL has demonstrated a deep-learning algorithm that passes the Turing Test for sound.
Shown a silent video clip of an object being hit, the algorithm can produce a sound for the hit that is realistic enough to fool human viewers.
The researchers envision future versions of similar algorithms being used to automatically produce sound effects for movies and TV shows, as well as to help robots better understand objects' properties.
“When you run your finger across a wine glass, the sound it makes reflects how much liquid is in it,” said Andrew Owens, CSAIL PhD student and lead author on the paper. “An algorithm that models such sounds can reveal key information about objects’ shapes and material types, as well as the force and motion of their interactions with the world.”
The team employed deep-learning techniques, which involve teaching computers to sift through enormous amounts of data so that they can find patterns on their own. Deep-learning approaches are beneficial because they spare computer scientists from having to hand-design algorithms; instead, researchers simply supervise the system’s progress as it learns.
How It Works
First, the team needed to give the machine sounds to study. The researchers recorded about 1,000 videos containing an estimated 46,000 sounds of various objects being hit, scraped and prodded with a drumstick.
The team then fed those videos to a deep-learning algorithm, which analyzed the sounds’ pitch, loudness and other features.
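For a rough sense of what this feature analysis looks like, the sketch below computes per-frame loudness and a pitch track from a recorded sound. It is only an illustration: the team’s actual feature representation is richer, and the use of the librosa library and these particular feature choices are assumptions of the sketch, not details from the paper.

```python
# Illustrative sketch only: per-frame loudness (RMS) and pitch estimates
# for a recorded sound. The team's actual features are richer; librosa and
# these feature choices are assumptions, not the researchers' code.
import numpy as np
import librosa

def extract_features(wav_path, hop_length=512):
    y, sr = librosa.load(wav_path, sr=None)                   # audio at its native sample rate
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]  # loudness per frame
    pitch = librosa.yin(y, fmin=50, fmax=2000, sr=sr,         # per-frame pitch estimate (Hz)
                        hop_length=hop_length)
    n = min(len(rms), len(pitch))
    return np.stack([rms[:n], pitch[:n]], axis=1)             # (frames, 2) feature matrix
```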
“To then predict the sound of a new video, the algorithm looks at the sound properties of each frame of that video, and matches them to the most similar sounds in the database,” said Owens. “Once the system has those bits of audio, it stitches them together to create one coherent sound.”
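The matching step Owens describes can be pictured as a nearest-neighbour lookup followed by concatenation, as in this hedged sketch. Every name here is hypothetical, and the real system blends the retrieved audio far more carefully than this naive concatenation.

```python
# Hedged sketch of "match then stitch": for each frame's predicted sound
# features, retrieve the closest recorded snippet from the training
# database, then concatenate the retrieved snippets into one waveform.
# All names are hypothetical; the real system blends joins smoothly.
import numpy as np

def stitch_sound(predicted_feats, db_feats, db_snippets):
    """predicted_feats: (frames, d) features predicted for the silent video.
    db_feats: (n, d) features of the n recorded snippets.
    db_snippets: list of n 1-D waveforms, aligned with the rows of db_feats."""
    pieces = []
    for f in predicted_feats:
        idx = int(np.argmin(np.linalg.norm(db_feats - f, axis=1)))  # nearest neighbour
        pieces.append(db_snippets[idx])
    return np.concatenate(pieces)
```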
The algorithm was able to accurately recreate the intricacies of different hits. Pitch was not a problem either: the machine could synthesize hit sounds ranging from the low-pitched “thuds” of a soft couch to the high-pitched “clicks” of a hard wooden railing.
To test how realistic the fake sounds were, the team conducted an online study in which subjects watched two versions of a video of a collision: one with the actual recorded sound and one with the algorithm’s synthesized sound. They then had to determine which sounds were real and which were fake.
Subjects picked the fake sound over the real one twice as often as they did for a baseline algorithm, and were particularly fooled by materials like leaves and dirt.
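Scoring such a study comes down to one number: the fraction of trials on which subjects judged the synthesized clip to be the real one. A toy example with invented responses:

```python
# Toy example with invented data: the "fooling rate" is the share of trials
# where a subject picked the synthesized sound as the real one.
responses = ["fake", "real", "fake", "fake", "real"]  # hypothetical responses
fooling_rate = responses.count("fake") / len(responses)
print(f"Synthesized sound chosen as real on {fooling_rate:.0%} of trials")
```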
Additionally, the team found that the materials’ sounds revealed key aspects of their physical properties.
Even though the system tricked humans, the researchers say there is still room for improvement.
For example, if the drumstick moves especially erratically in a video, the algorithm may miss a hit or hallucinate a false one.
“From the gentle blowing of the wind to the buzzing of laptops, at any given moment there are so many ambient sounds that aren’t related to what we’re actually looking at,” said Owens. “What would be really exciting is to somehow simulate sound that is less directly associated to the visuals.”
According to the MIT team, this work could improve robots’ ability to interact with their surroundings.
“Being able to predict sound is an important first step toward being able to predict the consequences of physical interactions with the world,” said Owens.