Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system that can analyze a music video, isolate individual instruments and make them louder or quieter.
Researchers trained the deep learning system, dubbed PixelPlayer, on over 60 hours of video. The self-supervised AI can then view a previously unseen musical performance, identify instruments at pixel level and extract only the sounds associated with an individual instrument.
Potential applications for the system include aiding recording engineers, automating audio-quality processing, and training robots to recognize environmental sounds.
In a new paper, the CSAIL researchers demonstrate that PixelPlayer can identify the sounds of more than 20 commonly seen instruments. The system first locates the image regions that produce sound, then separates the input audio into a set of components representing the sound coming from each pixel. PixelPlayer also recognizes musical characteristics by analyzing the sound waves themselves: for example, it associates certain harmonic frequencies with the violin, and quick, pulse-like percussive patterns with the xylophone.
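The per-pixel separation described above can be illustrated with a minimal sketch. This is not the paper's architecture; it only shows the general mask-based idea, assuming a hypothetical setup in which a video branch scores each pixel against K sound components and an audio branch produces K spectrogram masks that are blended and applied to the mixture:

```python
import numpy as np

# Illustrative sketch only: shapes, the random "network outputs", and the
# soft-assignment mixing model are assumptions, not PixelPlayer's design.
rng = np.random.default_rng(0)

F, T = 64, 100                          # frequency bins, time frames
mixture = rng.random((F, T))            # magnitude spectrogram of the mixed audio

# Hypothetical video-branch output: per-pixel scores over K sound
# components (e.g. violin vs. xylophone) on a coarse H x W grid.
H, W, K = 4, 4, 2
pixel_scores = rng.random((H, W, K))

# Hypothetical audio-branch output: K spectrogram masks in [0, 1).
component_masks = rng.random((K, F, T))

def sound_at_pixel(y, x):
    """Estimate the spectrogram of the sound attributed to pixel (y, x):
    a score-weighted blend of component masks, applied to the mixture."""
    weights = pixel_scores[y, x]                  # shape (K,)
    weights = weights / weights.sum()             # soft assignment over components
    mask = np.tensordot(weights, component_masks, axes=1)  # shape (F, T)
    return mask * mixture

est = sound_at_pixel(0, 0)
print(est.shape)
```

Because the mask stays in [0, 1), the estimated per-pixel spectrogram never exceeds the mixture, which is what makes "turn this instrument up or down" a simple rescaling of that pixel's component.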
Lead author Hang Zhao says that more training data would improve PixelPlayer's scope and accuracy, though the system still has difficulty distinguishing between closely related instruments, such as alto and tenor saxophones.
While previous sound-separation systems have focused exclusively on audio, PixelPlayer adds vision to help distinguish between instruments. Vision eliminates the need for human labeling and is key to the system's self-supervised operation.
“We expected a best-case scenario where we could recognize which instruments make which kinds of sounds,” says Zhao, a Ph.D. student at CSAIL. “We were surprised that we could actually spatially locate the instruments at the pixel level. Being able to do that opens up a lot of possibilities, like being able to edit the audio of individual instruments by a single click on the video.”