Researchers at the University of Washington have developed algorithms to solve a challenge that has been plaguing the computer vision world: how to turn audio clips into realistic lip-synced video of a person speaking the words.
The team has successfully generated a highly realistic video of former U.S. President Barack Obama talking about terrorism, fatherhood, job creation and other topics using audio clips of those speeches and existing weekly video addresses that were originally on a different topic.
"These types of results have never been shown before," said Ira Kemelmacher-Shlizerman, an assistant professor at the UW's Paul G. Allen School of Computer Science & Engineering. "Realistic audio-to-video conversion has practical applications like improving video conferencing for meetings, as well as futuristic ones such as being able to hold a conversation with a historical figure in virtual reality by creating visuals just from audio. This is the kind of breakthrough that will help enable those next steps."
The system converts audio files of a person’s speech into realistic mouth shapes, which are then grafted onto and blended with the head of that person from another video.
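The two stages described above can be sketched at a high level. This is a hypothetical illustration of the pipeline's structure, not the researchers' actual code: every function here is a toy stand-in, and the array sizes are arbitrary.

```python
import numpy as np

# Hypothetical sketch of the pipeline: predict a mouth shape from audio,
# render it as a texture, and blend it into a frame of the target video.
# All functions are illustrative stubs, not the researchers' code.

H, W = 64, 64  # toy frame size

def predict_mouth_shape(audio_features):
    # Stand-in for the learned audio -> mouth-shape model.
    return np.tanh(audio_features[:18])

def render_mouth(shape):
    # Stand-in for rendering a lips/teeth texture from shape coefficients.
    tex = np.zeros((H, W))
    tex[40:56, 20:44] = shape.mean()   # fake "mouth region" patch
    return tex

def blend(mouth_tex, frame):
    # Simple alpha blend over the mouth region; the real system also has
    # to match pose, lighting and jaw motion for the result to look real.
    mask = mouth_tex != 0
    out = frame.copy()
    out[mask] = 0.5 * frame[mask] + 0.5 * mouth_tex[mask]
    return out

frame = np.ones((H, W))                 # a frame of the reference video
audio = np.linspace(-1.0, 1.0, 30)      # fake per-frame audio features
result = blend(render_mouth(predict_mouth_shape(audio)), frame)
print(result.shape)  # (64, 64)
```

The point of the sketch is the decomposition: the hard learned step (audio to mouth shape) is separated from the graphics step (rendering and compositing onto an existing head).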
The team chose Obama as the subject for their research because the machine learning technique needs an available video of the person to learn from, and there are hours of video of the former president in the public domain.
"In the future, video chat tools like Skype or Messenger will enable anyone to collect videos that could be used to train computer models," Kemelmacher-Shlizerman said.
Streaming audio over the internet takes up less bandwidth than video, so the new system has the potential to prevent video chats from timing out over poor connections.
"When you watch Skype or Google Hangouts, often the connection is stuttery and low-resolution and really unpleasant, but often the audio is pretty good," said co-author and Allen School professor Steve Seitz. "So if you could use the audio to produce a much higher-quality video that would be terrific."
By reversing the process and feeding video into the network instead of audio, the team could potentially develop algorithms that detect whether a video is real or manufactured.
The new learning tool makes progress against the “uncanny valley” problem that has dogged efforts to create realistic video from audio: when synthesized human likenesses appear almost real but are just a bit off the mark, people find them creepy or off-putting.
"People are particularly sensitive to any areas of your mouth that don't look realistic," said lead author Supasorn Suwajanakorn, a recent doctoral graduate in the Allen School. "If you don't render teeth right or the chin moves at the wrong time, people can spot it right away and it's going to look fake. So you have to render the mouth region perfectly to get beyond the uncanny valley."
Previous audio-to-video conversion processes have involved filming multiple people in a studio saying the same sentences over and over to capture how particular sounds correlate with different mouth shapes, a method that is time-consuming, expensive and tedious. The algorithms the research team developed can instead learn from videos that already exist on the internet.
Rather than synthesizing the final video directly from audio, the team first trained a neural network to watch videos of a person and translate different audio sounds into basic mouth shapes.
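The learned step can be pictured as a sequence model that consumes per-frame audio features and emits a low-dimensional mouth-shape vector for each video frame. The sketch below is an assumption-laden illustration: the feature dimensions, the single recurrent layer, and the random (untrained) weights are all placeholders, not the paper's architecture.

```python
import numpy as np

# Illustrative sketch: map a sequence of audio features (MFCC-style vectors)
# to mouth-shape coefficients, one per video frame. Dimensions and the
# single recurrent cell are hypothetical, not the researchers' actual model.

rng = np.random.default_rng(0)

N_AUDIO = 13    # audio features per frame (MFCC-style)
N_HIDDEN = 32   # recurrent state size
N_MOUTH = 18    # mouth-shape coefficients (e.g. PCA of lip landmarks)

# Randomly initialized weights stand in for a trained model.
W_in = rng.normal(scale=0.1, size=(N_HIDDEN, N_AUDIO))
W_h = rng.normal(scale=0.1, size=(N_HIDDEN, N_HIDDEN))
W_out = rng.normal(scale=0.1, size=(N_MOUTH, N_HIDDEN))

def audio_to_mouth(features):
    """Simple recurrent pass: one mouth-shape vector per audio frame."""
    h = np.zeros(N_HIDDEN)
    shapes = []
    for x in features:
        h = np.tanh(W_in @ x + W_h @ h)   # update state from this audio frame
        shapes.append(W_out @ h)          # decode a mouth-shape vector
    return np.array(shapes)

audio = rng.normal(size=(100, N_AUDIO))   # 100 frames of fake audio features
mouths = audio_to_mouth(audio)
print(mouths.shape)   # (100, 18): one mouth shape per frame
```

Keeping the output low-dimensional (shape coefficients rather than raw pixels) is what makes this stage learnable from ordinary found footage.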
The researchers combined previous research from the UW Graphics and Image Laboratory with a new mouth synthesis technique, allowing them to realistically superimpose and blend the synthesized mouth shapes and textures onto an existing reference video of the subject. Another key insight was to allow a small time shift so the neural network can anticipate what the speaker will say next.
The new lip-syncing process enabled the researchers to create realistic videos of Obama speaking in the White House using words he spoke on a talk show or in an interview years earlier.
The neural network is designed to learn the patterns of only one individual at a time, so Obama’s voice is the only information used to “drive” the synthesized video. Future steps include helping the algorithms generalize across situations and recognize a person’s voice and speech patterns with less data.
This research was funded by Samsung, Google, Facebook, Intel and the UW Animation Research Labs. The team will be presenting a paper on this research on August 2 at SIGGRAPH 2017.