Scientists and engineers have been working to develop a reliable and flexible voice-conversion technique. Most approaches rely on statistical models, with Gaussian mixture models considered the mainstream approach. Unfortunately, most of these voice-conversion methods require parallel data to train the system. This means that speech data from the source and target speakers must be aligned so that each frame of the source speaker’s data corresponds to a frame of the target speaker’s data. This reliance on parallel data poses problems that have prevented these techniques from achieving broad adoption.
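To make the frame-level correspondence concrete, here is a minimal Python sketch of one common way to align two recordings of the same sentence: dynamic time warping. The article does not say which alignment method parallel-trained systems use, so the choice of dynamic time warping, the function name `dtw_align`, and the toy one-dimensional frames are all assumptions made for illustration.

```python
from math import dist, inf


def dtw_align(source, target):
    """Align two frame sequences; return (source_index, target_index) pairs."""
    n, m = len(source), len(target)
    # cost[i][j] is the smallest cumulative distance for aligning
    # the first i source frames with the first j target frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Walk back from the last cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # both sequences advance
            (cost[i - 1][j], i - 1, j),          # only the source advances
            (cost[i][j - 1], i, j - 1),          # only the target advances
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


# Hypothetical toy frames: the target speaker holds the first sound longer.
source_frames = [(0.0,), (1.0,), (2.0,)]
target_frames = [(0.0,), (0.0,), (1.0,), (2.0,)]
pairs = dtw_align(source_frames, target_frames)
```

Each pair in `pairs` matches one source frame to one target frame. Alignment of this kind only makes sense when both speakers have recorded the same utterances, which is the requirement that makes parallel data hard to collect.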
A voice-conversion technique developed by researchers at the University of Electro-Communications in Japan, however, may offer a more viable alternative. The model created by Toru Nakashika and his colleagues uses an adaptive restricted Boltzmann machine, which does not require parallel data from the two speakers to train the system. Testing has shown that Nakashika’s approach can deconstruct and rebuild the source speaker’s speech, creating a voice that sounds like a different person.
The researchers based this voice-conversion model on the premise that the acoustic features of speech consist of two layers: neutral phonological information, which is not associated with a specific person, and speaker identity features, which make words sound like they come from a specific speaker. After training the system, the researchers found that the model’s performance was comparable to that of existing parallel-trained models, and that it offered a unique advantage: the system can generate new phonemic sounds for the target speaker, making it possible to generate speech in the target speaker’s voice in a different language.
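The two-layer premise can be illustrated with a deliberately simplified Python sketch. This is not the adaptive restricted Boltzmann machine the researchers built. It swaps in a plain additive model, where each frame is treated as speaker-neutral content plus a per-speaker offset, purely to show why such a split removes the need for parallel data. The function names `speaker_offset` and `convert` and the toy numbers are hypothetical.

```python
from statistics import fmean


def speaker_offset(frames):
    """Estimate a speaker's identity component as the per-dimension mean of their frames."""
    return [fmean(dim) for dim in zip(*frames)]


def convert(frames, source_offset, target_offset):
    """Remove the source speaker's offset from each frame and add the target's."""
    return [
        [x - s + t for x, s, t in zip(frame, source_offset, target_offset)]
        for frame in frames
    ]


# Non-parallel toy data: the speakers say different things, with different lengths.
source_frames = [[1.0, 2.0], [3.0, 4.0]]
target_frames = [[10.0, 10.0], [12.0, 14.0], [14.0, 12.0]]

# Each speaker's offset is estimated from that speaker's own data only.
src = speaker_offset(source_frames)
tgt = speaker_offset(target_frames)
converted = convert(source_frames, src, tgt)
```

Because each offset is learned from one speaker's recordings in isolation, the two speakers never need to utter the same sentence. A real system would need a far richer model of both layers than a simple mean shift.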
While this technology is still in the early stages of development, the simple, intuitive and flexible nature of the model promises to open the door to a number of applications. These include security functions, such as speaker identification and authentication, and interface control modalities, such as speech recognition.
