Researchers at MIT have developed a new machine learning system that can identify objects inside an image based on a spoken description of the image.
Given an image and an audio caption, the system was able to locate, in real time, the relevant regions of the image being described.
The system could enhance speech-recognition systems such as Siri or Google Voice Search, which require transcriptions of many thousands of hours of speech recordings. This model doesn’t require manual transcriptions or annotations. Instead, it learns words directly from objects in raw images and recorded speech clips, then associates the two with one another.
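The article doesn’t spell out the training procedure, but learning from paired images and audio without transcripts is commonly done with a contrastive (triplet) objective: a matching image–caption pair should score higher than a mismatched one. The sketch below is a minimal illustration of that idea, not the researchers’ actual method; the projection matrices and random vectors stand in for real network outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W):
    """Project raw features into a shared embedding space and L2-normalize."""
    v = x @ W
    return v / np.linalg.norm(v)

def triplet_loss(img, pos_audio, neg_audio, margin=1.0):
    """Hinge loss: the true caption should outscore a mismatched caption."""
    s_pos = img @ pos_audio   # similarity to the matching caption
    s_neg = img @ neg_audio   # similarity to some other caption
    return max(0.0, margin - s_pos + s_neg)

# Toy stand-ins: 512-dim image features and 256-dim audio features,
# both mapped into a shared 128-dim space (all sizes are illustrative).
W_img = rng.normal(size=(512, 128))
W_aud = rng.normal(size=(256, 128))
image = embed(rng.normal(size=512), W_img)
caption = embed(rng.normal(size=256), W_aud)
other = embed(rng.normal(size=256), W_aud)

loss = triplet_loss(image, caption, other)
```

Minimizing such a loss over many image–caption pairs pulls matching pairs together in the shared space without any word-level labels.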
“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to,” said David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group. “We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing.”
Currently, the model can recognize only several hundred different words and objects, but the researchers are working to expand it; they hope it can save countless hours of manual labor and open new opportunities in speech and image recognition.
The researchers demonstrated the model on an image of a young girl with blonde hair and blue eyes, wearing a blue dress, with a white lighthouse with a red roof in the background. As the audio caption was narrated, the model learned which pixels in the image corresponded to which words.
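One way to picture this pixel-to-word association is a similarity grid between every audio frame and every cell of an image feature map, from which the best-matching region for a stretch of speech can be read off. The sketch below is an illustrative assumption of that computation (random arrays stand in for the outputs of the image and audio networks; all dimensions and the frame range are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

D = 64              # shared embedding dimension (illustrative)
H, W, T = 8, 8, 20  # image feature grid size and number of audio frames

# Stand-ins for an image network's H x W grid of D-dim features and an
# audio network's one D-dim vector per frame of the spoken caption.
image_grid = rng.normal(size=(H, W, D))
audio_frames = rng.normal(size=(T, D))

# Similarity of every audio frame to every image cell: shape (T, H, W).
simmap = np.einsum('hwd,td->thw', image_grid, audio_frames)

# Localize a word: average the maps over the frames spanning it
# (frames 5-9 here, chosen arbitrarily) and take the strongest cell.
word_map = simmap[5:10].mean(axis=0)
row, col = np.unravel_index(word_map.argmax(), word_map.shape)
```

The `(row, col)` cell marks the image region most strongly associated with that stretch of the spoken caption, which is what lets the system highlight relevant regions as the description plays.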
The system could also be used to learn translations between different languages without the need for a bilingual annotator. Of the estimated 7,000 languages spoken worldwide, only about 100 have enough transcription data for speech recognition. The researchers envision the system eventually becoming an all-in-one mechanism that can translate between multiple languages for a single user.