Machine Learning System Can Identify Objects in an Image Based on Recorded Speech Clips

18 September 2018
How the machine learning system was taught to recognize objects. Source: MIT

Researchers at MIT have developed a new machine learning system that can identify objects inside an image based on a spoken description of the image.

Given an image and an audio caption, the system located the relevant regions of the image in real time as they were described.

The system could be used to enhance speech-recognition systems such as Siri or Google Voice Search, which require transcriptions of many thousands of hours of speech recordings. This model, by contrast, doesn’t require manual transcriptions and annotations. Instead, it learns words directly from recorded speech clips and objects in raw images and then associates them with one another.

“We wanted to do speech recognition in a way that’s more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don’t typically have access to,” said David Harwath, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Spoken Language Systems Group. “We got the idea of training a model in a manner similar to walking a child through the world and narrating what you’re seeing.”

Currently, the model can recognize only several hundred different words and objects, but the researchers hope that with further development the technique could save countless hours of manual labor and create new opportunities in speech and image recognition.

Researchers demonstrated the model on an image of a young girl with blonde hair and blue eyes, wearing a blue dress, with a white lighthouse and a red roof in the background. The model learned which pixels in the image corresponded to the words in a narrated audio caption.
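
The article does not describe the model's internals, so the following is only a minimal sketch of how matching spoken words to image regions might work. It assumes image regions and audio segments have already been mapped into a shared vector space and are compared by dot product. The function names, the grid cells and the toy vectors are all hypothetical, not the researchers' code or data.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Cell = Tuple[int, int]  # (row, column) of a region in a coarse image grid


def dot(u: Vector, v: Vector) -> float:
    """Dot product, used here as the similarity score (an assumption)."""
    return sum(a * b for a, b in zip(u, v))


def match_scores(regions: Dict[Cell, Vector],
                 segments: List[Vector]) -> List[Dict[Cell, float]]:
    """For each audio segment, score every image region against it."""
    return [{cell: dot(vec, seg) for cell, vec in regions.items()}
            for seg in segments]


def best_region_per_segment(regions: Dict[Cell, Vector],
                            segments: List[Vector]) -> List[Cell]:
    """Pick the highest-scoring region for each audio segment."""
    return [max(scores, key=scores.get)
            for scores in match_scores(regions, segments)]


# Hypothetical embeddings for a 2x2 grid of image regions.
regions = {
    (0, 0): [1.0, 0.0, 0.0],  # pretend this region shows the girl
    (0, 1): [0.0, 1.0, 0.0],  # pretend this region shows the lighthouse
    (1, 0): [0.0, 0.0, 1.0],  # pretend this region shows the roof
    (1, 1): [0.1, 0.1, 0.1],  # background
}

# Hypothetical embeddings for two spoken segments of the caption.
segments = [
    [0.9, 0.1, 0.0],  # pretend this is the audio for "girl"
    [0.0, 0.8, 0.2],  # pretend this is the audio for "lighthouse"
]

best = best_region_per_segment(regions, segments)
print(best)
```

In this toy setup each spoken segment is assigned the grid cell whose vector it most resembles. A trained system would have to learn both sets of vectors from paired images and audio captions.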

The system could also be used to learn translations between different languages without the need for a bilingual annotator. Of the estimated 7,000 languages spoken worldwide, only about 100 have enough transcription data for speech recognition. Researchers envision the system becoming an all-in-one mechanism that can translate multiple different languages for one user.
