Although computer vision algorithms can do some remarkable things — like identifying specific faces out of a sea of millions — they fail miserably at some very basic tasks, such as determining whether two objects in an image are different.
In the interest of discovering how computer vision might become smarter, researchers at Brown University have explored a more basic question: Why, in this context, are computers so dumb in the first place?
“We think that by working to understand the limitations of current computer vision systems…we can really move toward new, much more advanced systems, rather than simply tweaking the systems we already have,” said Thomas Serre, an associate professor of cognitive, linguistic and psychological sciences at Brown and the senior author of a new study on the subject.
To explore the scope of the problem, Serre and his colleagues used state-of-the-art computer vision algorithms to analyze simple black-and-white images containing two or more randomly generated shapes. In some cases the objects were identical; sometimes they were the same, but with one object rotated in relation to the other; sometimes they were completely different. The computer was asked to identify the same-or-different relationship.
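To make the setup concrete, here is a minimal Python sketch of the kind of stimulus the task involves: two randomly generated shapes on a blank canvas, labeled by whether they match. The function names, canvas size, and placement scheme are illustrative assumptions, not the researchers' actual stimulus generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_shape(size=12):
    """Generate a random binary blob to stand in for an abstract 'shape'."""
    return rng.integers(0, 2, (size, size)).astype(np.uint8)

def make_stimulus(same, canvas=64, size=12):
    """Place two shapes on a blank canvas; label 1 if identical, 0 if different.

    Illustrative sketch only: sizes and placement are arbitrary assumptions.
    """
    img = np.zeros((canvas, canvas), dtype=np.uint8)
    a = random_shape(size)
    b = a.copy() if same else random_shape(size)
    # Drop the shapes at random, non-overlapping positions.
    x1, y1 = rng.integers(0, canvas // 2 - size, 2)
    x2, y2 = rng.integers(canvas // 2, canvas - size, 2)
    img[y1:y1 + size, x1:x1 + size] = a
    img[y2:y2 + size, x2:x2 + size] = b
    return img, int(same)
```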
The result? Even after hundreds of thousands of training examples, the algorithms’ ability to recognize the appropriate relationship was no better than chance.
The researchers suspected the failure stems from the algorithms' inability to individuate objects: the systems see only a collection of pixels that they have learned to associate with labels, and they cannot tell where one object in the image stops and the background, or another object, begins.
That suspicion was borne out by follow-up experiments that presented the objects in two separate images, relieving the computer of the individuation task. Freed from having to view both objects at once, the algorithms had no trouble learning the same-or-different relationship.
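One way to picture that simplification: give the network each object as its own image and let a shared encoder process them independently, so no segmentation is ever required. The sketch below is a generic two-branch comparator in PyTorch, an illustrative assumption rather than the architecture the team actually tested.

```python
import torch
import torch.nn as nn

class TwoStreamComparator(nn.Module):
    """Encode each object image separately, then judge same vs. different.

    Illustrative only: the study's exact architecture is not specified here.
    """
    def __init__(self):
        super().__init__()
        # A shared convolutional encoder is applied to each image on its own,
        # so the network never has to carve two objects out of one scene.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 2)  # outputs: same vs. different

    def forward(self, img_a, img_b):
        feats = torch.cat([self.encoder(img_a), self.encoder(img_b)], dim=1)
        return self.head(feats)
```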
Serre traces the weakness to the architecture of these machine learning systems, which use “convolutional neural networks”: layers of connected processing units that loosely mimic networks of neurons in the brain. The problem is that information flows in only one direction through the layers of an artificial network, a significant difference from the recurrent connections within the human visual system. That back-and-forth feedback in the brain is believed to make it possible to attend to certain parts of the visual field and form a mental representation bound to a specific object before attention shifts to another object.
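A stripped-down example of that one-way flow, written in PyTorch: each layer feeds the next, and nothing feeds back. The layer widths and the 64×64 input are arbitrary choices for illustration.

```python
import torch.nn as nn

# A purely feedforward CNN: information moves strictly forward through the
# stack, with no feedback connections between layers. Layer sizes are
# arbitrary; the input is assumed to be a 1-channel 64x64 image.
feedforward_cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                 # 64x64 -> 32x32
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                 # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2),      # outputs: same vs. different
)
```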
“When both objects are represented in working memory, your visual system is able to make comparisons like same-or-different,” Serre said. It could be, he added, that making computer vision smarter will require neural networks that more closely approximate the recurrent nature of human visual processing.
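What such a network might look like remains an open question. As a purely speculative sketch, not the researchers' proposal, one could re-apply the same convolution over several timesteps, folding the previous state back in, loosely echoing the feedback loops Serre describes.

```python
import torch
import torch.nn as nn

class RecurrentVisionBlock(nn.Module):
    """Speculative sketch of recurrence: the same convolution is applied
    repeatedly, with each step's output fed back into the next step."""
    def __init__(self, channels=16, steps=4):
        super().__init__()
        self.steps = steps
        self.input_conv = nn.Conv2d(1, channels, 3, padding=1)
        self.recurrent_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        state = torch.relu(self.input_conv(x))
        for _ in range(self.steps):
            # Feedback: the current state is folded back into itself,
            # rather than flowing strictly forward to a new layer.
            state = torch.relu(self.recurrent_conv(state) + state)
        return state
```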
The research was presented last week at the annual meeting of the Cognitive Science Society.