Text, voice and video chat company Skype has demonstrated real-time language translation between English and Spanish using technology supplied by Microsoft Research. Microsoft agreed to acquire Skype for about $8.5 billion in May 2011. Skype showed the technology in a video chat between Peterson School in Mexico City and Stafford Elementary School in Tacoma, Washington.
While the technology is still in a "preview" or beta mode, the company is encouraging users to sign up to help improve it and extend the number of languages it can handle.
Skype Translator relies on machine learning and recent improvements in speech recognition made possible by the introduction of deep neural networks, Skype said. The neural networks are used in conjunction with machine translation after first excising disfluencies – the ums and ahs typical of speech. By learning from training data during the preview stage, the software can learn to better recognize and translate the diversity of accents that people use and the specialized terms that may be relevant to various topics.
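The disfluency-removal step described above can be illustrated with a minimal Python sketch. It is not Skype's implementation: the filler list and the function name `strip_disfluencies` are hypothetical, and collapsing immediately repeated words is an added assumption about what counts as a disfluency. A production system would more likely learn these patterns from data than rely on a fixed word list.

```python
import re

# Hypothetical filler list, chosen for illustration only.
FILLERS = {"um", "uh", "ah", "er", "erm", "hmm"}


def _normalize(token: str) -> str:
    """Lower-case a token and strip punctuation so 'Um,' matches 'um'."""
    return re.sub(r"[^\w']", "", token).lower()


def strip_disfluencies(transcript: str) -> str:
    """Remove filler words and immediate word repetitions from recognized text."""
    cleaned = []
    for token in transcript.split():
        word = _normalize(token)
        if word in FILLERS:
            continue  # drop the "ums and ahs"
        if word and cleaned and _normalize(cleaned[-1]) == word:
            continue  # assumption: treat "I I think" as a stutter
        cleaned.append(token)
    return " ".join(cleaned)


print(strip_disfluencies("um I I think uh we should ah meet on on Monday"))
# I think we should meet on Monday
```

The cleaned text, rather than the raw transcript, would then be handed to the machine translation stage. Note that the repetition rule also removes legitimate doubled words such as "that that", which is one reason a learned model is preferable to fixed rules.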
The training data comes from Skype Translator conversations that have been recorded for analysis, as well as from translated web pages and captioned videos. Skype Translator participants are notified as the call begins that their conversation will be recorded and used by Microsoft to improve the software.
As well as speaking the translation aloud, the software produces a text transcript divided into sentences, complete with punctuation and capitalization.
ANNs on graphics cards
The technology is based on the adoption of artificial neural networks as an alternative to context-dependent Gaussian mixture model hidden Markov models (CD-GMM-HMM), as pioneered by Microsoft Research for speech transcription. This was proposed by Frank Seide, Gang Li and Dong Yu in a paper published at the Interspeech 2011 conference.
Part of the group's progress was down to modelling very short sub-phonetic units called senones, of which there can be thousands, rather than the relatively few phonemes. By modelling senones directly using deep neural networks it was possible to outperform conventional CD-GMM-HMM large-vocabulary speech recognition systems. This capability also arrived at a time when it was becoming possible to map neural network computation onto graphics processing units and graphics cards. The resulting speed-up contributed to the feasibility of the architectural model.
In addition, Skype and Microsoft Research have created a customized "bot" that controls the call experience, organizes the audio, text and data streams, and acts as a human translator would.
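To make the idea of modelling senones directly more concrete, the sketch below shows one common way a neural network's output is used inside a hidden Markov model recognizer: the network produces a posterior probability for each senone given an audio frame, and dividing by the senone's prior yields a scaled likelihood that the HMM decoder can use. The four-senone toy values, the priors and the function names are assumptions for illustration, not figures from the Microsoft work.

```python
import math


def softmax(logits):
    """Turn raw network outputs into senone posteriors P(senone | frame)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_log_likelihoods(logits, senone_priors):
    """Convert posteriors to log scaled likelihoods: log P(s | o) - log P(s)."""
    posteriors = softmax(logits)
    return [
        math.log(post) - math.log(prior)
        for post, prior in zip(posteriors, senone_priors)
    ]


# Toy example: one audio frame scored against four hypothetical senones.
# A real system has thousands of senones and a deep network producing the logits.
frame_logits = [2.0, 0.5, 0.1, -1.0]
priors = [0.4, 0.3, 0.2, 0.1]  # assumed relative frequencies in training data

scores = scaled_log_likelihoods(frame_logits, priors)
best_senone = max(range(len(scores)), key=scores.__getitem__)
print(best_senone, [round(s, 3) for s in scores])
```

The final layer over thousands of senones, repeated for every frame of audio, amounts to large matrix multiplications, which is the kind of workload that maps well onto graphics processors.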
The Skype Translator preview program is currently available for Spanish-speaking and English-speaking Skype customers who use Windows 8.1 or Windows 10 Technical Preview on their desktop or tablet computer.
Related links and articles:
Interspeech 2011 paper of Seide, Li and Yu