Giving Voice More Reach

As the Internet of Things (IoT) begins to take shape, it becomes clear that the 50 billion objects to be deployed by 2020 will require a new class of voice technology. Consumers have come to rely on voice-enabled virtual assistants to facilitate interactions with their smartphones. But emerging applications such as smart home appliances, wearable devices and smart automotive systems require natural language interfaces that can function over longer distances and contend with greater levels of ambient noise.

Smart home applications pose challenges for natural language interfaces not encountered in smartphones. Operating distances are measured in feet instead of centimeters, increasing the impact of ambient noise. (Image courtesy of Conexant.)

In these cases, current voice technology all too often falters, delivering inadequate performance.

To provide a foundation that will enable natural language interfaces to take on these new applications, Conexant has released two voice-processing systems—the CX20926 and the CX20924. These promise to provide clear communication and accurate speech recognition, sweeping aside some of the obstacles that have precluded broader application of voice technology.

Complicating Factors

The catch is that most existing voice-processing systems, such as those found in smartphones, have been tailored for near-field applications. Many of the “things,” or “objects,” expected to enter the market in the near future will be a different animal.

“The largest application for voice today is mobile phones,” says Vineet Ganju, executive marketing director for audio at Conexant. “In this application, the phone and its microphones are held within a few centimeters of the user’s mouth. In this use case, the loudness of the voice relative to the background noise is very high, so there is not a high demand for noise-suppression technology. For new use cases, such as smart watches, remote controls and smart speakers, the product can be an arm’s length away or even across the room. In these cases, the background noise or content from the product’s own speaker can be much louder relative to the user’s voice.”
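
Ganju’s point about voice loudness relative to background noise can be made concrete with a rough estimate. The sketch below assumes a point source in a free field (level falls 6 dB per doubling of distance) and a constant ambient noise level. The 85 dB and 55 dB figures are illustrative assumptions, not Conexant measurements, and real rooms add reverberation that this ignores.

```python
import math

def speech_level_db(level_at_ref_db, ref_distance_m, distance_m):
    """Free-field level of a point source: drops by 20*log10(d / d_ref) dB."""
    return level_at_ref_db - 20.0 * math.log10(distance_m / ref_distance_m)

def snr_db(level_at_ref_db, ref_distance_m, distance_m, noise_db):
    """Speech level at the microphone minus a constant ambient noise level."""
    return speech_level_db(level_at_ref_db, ref_distance_m, distance_m) - noise_db

# Illustrative assumptions: speech level measured 5 cm from the mouth,
# and a steady ambient noise level at the microphone.
SPEECH_DB_AT_5CM = 85.0
NOISE_DB = 55.0

for distance_m in (0.05, 0.5, 3.0):  # handset, arm's length, across the room
    snr = snr_db(SPEECH_DB_AT_5CM, 0.05, distance_m, NOISE_DB)
    print(f"{distance_m:4.2f} m: SNR = {snr:5.1f} dB")
```

Under these assumptions the signal-to-noise ratio falls by 20 dB between the handset position and arm’s length, and turns negative across the room.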

The expanded operating range brings a new set of factors into play that interfere with the reception and interpretation of speech. For one thing, the amount and variety of ambient noise grows. Internal operating noise from the device itself can compromise performance, and if a user is some distance from the device, the speech is further colored by reflections off surfaces in the user’s surroundings. Together, these effects degrade voice clarity and the performance of speech-recognition systems.

These factors can also prevent natural language interfaces and speech-recognition systems from accurately determining the orientation, distance and location of the voice issuing commands. Enabling robust performance under these conditions requires far-field speech processing and the means to mitigate the effects of more pronounced ambient noise. The question is: Can existing technologies overcome these challenges?

Technologies that Fall Short

Minimum variance distortionless-response beamforming experiences significant residual speech in the noise-estimation channel, causing speech distortion in the adaptive post-filter stage. The only way to overcome this problem is to add more microphones, which increases the cost of the system, making the approach less than ideal for consumer applications. (Image courtesy of Conexant.)

Most current mobile voice-processing systems rely on algorithms that use the level difference between two microphones, which works well when the microphones are next to the user’s face, with one microphone closer to the user’s mouth than the other. When the application extends the distance between the point of origin of the speech and the microphones by half a meter or more, these systems fall short.
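
A back-of-the-envelope calculation shows why the level-difference cue fades with distance. This sketch assumes a point source in a free field with both microphones in line with the talker; the 10 cm microphone spacing and the distances are illustrative assumptions, not figures from Conexant.

```python
import math

def level_difference_db(source_to_near_mic_m, mic_spacing_m):
    """Level difference between two in-line microphones for a point source."""
    far_mic_m = source_to_near_mic_m + mic_spacing_m
    return 20.0 * math.log10(far_mic_m / source_to_near_mic_m)

MIC_SPACING_M = 0.10  # assumed spacing between the two microphones

for distance_m in (0.05, 0.5, 3.0):  # handset, half a meter, across the room
    diff = level_difference_db(distance_m, MIC_SPACING_M)
    print(f"talker at {distance_m:4.2f} m: level difference = {diff:4.2f} dB")
```

With these numbers the difference is several decibels at handset range but only a fraction of a decibel across the room, which is about what distant ambient noise produces, so the cue no longer separates the talker from the noise.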

Other voice-processing systems apply a form of beamforming called minimum variance distortionless response.

Unfortunately, these systems can be ineffective when only two microphones are in play. To perform well, they require more microphones, which can make the approach too expensive for consumer devices. In addition, this technique cannot handle interference coming from the same direction as the spoken commands.
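
The same-direction limitation follows from the definition of the technique. The sketch below is a textbook narrowband minimum variance distortionless response (MVDR) beamformer for two microphones, w = R⁻¹a / (aᴴR⁻¹a), written in plain Python. It is not any vendor’s implementation, and the frequency, spacing and power values are illustrative assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s

def steering(theta_deg, freq_hz, spacing_m):
    """Far-field steering vector for two mics; angle measured from broadside."""
    delay = spacing_m * math.sin(math.radians(theta_deg)) / SPEED_OF_SOUND
    return [1 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay)]

def noise_covariance(a_int, interferer_power, sensor_noise_power):
    """R = p * a a^H + s * I for one directional interferer plus sensor noise."""
    return [[interferer_power * a_int[i] * a_int[j].conjugate()
             + (sensor_noise_power if i == j else 0.0)
             for j in range(2)] for i in range(2)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mvdr_weights(r, a_target):
    """w = R^-1 a / (a^H R^-1 a): unit gain toward the target, minimum noise power."""
    r_inv = inverse_2x2(r)
    v = [sum(r_inv[i][j] * a_target[j] for j in range(2)) for i in range(2)]
    denom = sum(a_target[i].conjugate() * v[i] for i in range(2))
    return [x / denom for x in v]

def response(w, a):
    """Magnitude of the beamformer gain w^H a toward steering vector a."""
    return abs(sum(w[i].conjugate() * a[i] for i in range(2)))

FREQ_HZ, SPACING_M = 1000.0, 0.05  # illustrative values
target = steering(0.0, FREQ_HZ, SPACING_M)

for interferer_deg in (60.0, 0.0):
    interferer = steering(interferer_deg, FREQ_HZ, SPACING_M)
    r = noise_covariance(interferer, interferer_power=100.0, sensor_noise_power=1.0)
    w = mvdr_weights(r, target)
    print(f"interferer at {interferer_deg:4.1f} deg: "
          f"target gain {response(w, target):.3f}, "
          f"interferer gain {response(w, interferer):.3f}")
```

When the interferer sits off to the side, the beamformer can attenuate it while keeping unit gain on the target. When it shares the target’s direction, the distortionless constraint forces unit gain on it as well.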

A New Approach

To avoid the limitations of both the two-microphone level-difference approach and beamforming techniques, Conexant has developed its Smart Source Pickup (SSP) noise-suppression algorithm. This blind source separation (BSS) process is part of the company’s AudioSmart family of voice pre-processing algorithms.

Conexant’s SSP noise suppression algorithm enhances the performance of two-microphone solutions, reducing the impact of noise in near-field and far-field applications. The blind source separation technology uses spatial representation of target speech and noise sources to reduce interference. (Image courtesy of Conexant.)

Essentially, BSS promises to convert a noisy, far-field voice signal into a clear, near-field voice signal to achieve greater automated speech-recognition (ASR) accuracy.

Traditional BSS approaches often come up short in the face of real-world operating conditions, but Conexant contends it has overcome these issues by applying constrained independent component analysis (ICA). Using this approach, the algorithm performs a dynamic acoustic scene analysis, which allows the system to estimate the number of acoustic sources and the direction of arrival, and classify the sources as interference or speech sources.

ICA can create a rich spatial representation of the target and noise sources, even when confronted with highly reverberant conditions, because the filtering implicitly models the reverberation. The system then uses the features and estimated spatial filters to control a statistically based spectral filter that enhances the signal. The enhanced signal can take the form of a true stereo output, preserving the spatial information in the desired signal(s) while removing unwanted signals from both channels. Alternatively, it can take the form of a true mono signal or a mono signal derived from the stereo signal through optional direct-sum beamforming. As a result, the system can cancel out stationary and dynamic noise, avoiding strict directional constraints and inter-channel level difference constraints.
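
For readers unfamiliar with ICA, the toy example below separates two synthetic signals from two simulated microphone mixtures. It is a deliberately simplified stand-in, not Conexant’s SSP algorithm: the mixing is instantaneous (no reverberation), the ICA is unconstrained, and the unmixing rotation is found by a brute-force search that maximizes non-Gaussianity (squared excess kurtosis). The signals, mixing gains and function names are all assumptions made for illustration.

```python
import math
import random

def center(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def whiten(x1, x2):
    """Decorrelate two centered signals and scale them to unit variance."""
    n = len(x1)
    a = sum(v * v for v in x1) / n
    c = sum(v * v for v in x2) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    phi = 0.5 * math.atan2(2 * b, a - c)           # principal-axis angle
    half = math.hypot((a - c) / 2, b)
    lam1, lam2 = (a + c) / 2 + half, (a + c) / 2 - half  # covariance eigenvalues
    cp, sp = math.cos(phi), math.sin(phi)
    z1 = [(cp * u + sp * v) / math.sqrt(lam1) for u, v in zip(x1, x2)]
    z2 = [(-sp * u + cp * v) / math.sqrt(lam2) for u, v in zip(x1, x2)]
    return z1, z2

def excess_kurtosis(y):
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    return m4 / (m2 * m2) - 3.0

def rotate(z1, z2, theta):
    ct, st = math.cos(theta), math.sin(theta)
    return ([ct * u + st * v for u, v in zip(z1, z2)],
            [-st * u + ct * v for u, v in zip(z1, z2)])

def separate(x1, x2, steps=90):
    """Whiten, then pick the rotation whose outputs are least Gaussian."""
    z1, z2 = whiten(center(x1), center(x2))
    best = max((math.pi / 2 * k / steps for k in range(steps)),
               key=lambda t: sum(excess_kurtosis(y) ** 2 for y in rotate(z1, z2, t)))
    return rotate(z1, z2, best)

def correlation(u, v):
    u, v = center(u), center(v)
    num = sum(p * q for p, q in zip(u, v))
    return num / math.sqrt(sum(p * p for p in u) * sum(q * q for q in v))

random.seed(0)
N = 3000
# Synthetic stand-ins: a spiky (Laplacian) "speech" source and a uniform "noise" source.
speech = [random.choice((-1, 1)) * random.expovariate(1.0) for _ in range(N)]
noise = [random.uniform(-1.0, 1.0) for _ in range(N)]
mic1 = [0.9 * s + 0.6 * n for s, n in zip(speech, noise)]  # assumed mixing gains
mic2 = [0.5 * s + 1.0 * n for s, n in zip(speech, noise)]

out1, out2 = separate(mic1, mic2)
for name, src in (("speech", speech), ("noise", noise)):
    best = max(abs(correlation(src, out1)), abs(correlation(src, out2)))
    print(f"{name}: best |correlation| with a separated output = {best:.3f}")
```

The separation is “blind” in the sense that `separate` sees only the two microphone signals, never the sources or the mixing gains. Handling reverberation requires convolutive models that are well beyond this sketch.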

Giving Battery-operated Devices a Voice

Building on its SSP technology, Conexant recently released the CX20926, a low-power audio/sensor system-on-chip (SoC) targeting applications such as wearables, TV remotes, tablets and laptops. Conexant has tailored the SoC to pre-process speech input, removing unwanted background noise. The chip aims to provide a clear voice signal to the ASR engine, enabling greater accuracy, especially in far-field applications.

To conserve battery power, the CX20926 includes a local word-detection engine, so it doesn’t have to go to the cloud for this functionality. When no voice is present, the chip operates at extremely low power levels.

As you might expect, this SoC shoehorns a lot of components into a small space, integrating motion sensors, one or two digital MEMS microphones, a digital signal processor (DSP) and an ARM Cortex-M0+ microcontroller. Conexant chose the M0+ because of its small size, low power requirements and easy programming. As with many other specialized SoCs, the CX20926 divides its computing workload between multiple processors to achieve greater efficiency. In this case, the M0+ manages the user interface, wireless connectivity and sensor processing, while the DSP handles all voice- and audio-processing tasks.

Using data provided by the motion sensor and microphones, the CX20926 achieves contextual awareness. For example, it can determine whether it is operating indoors or outdoors or if it is moving or still. The SoC uses contextual awareness to optimize voice pre-processing and to conserve power.

Over the next five years, the company plans to refine the CX20926. “The CX20926 will target even lower power levels and higher levels of intelligence to determine when, where and what to process,” says Ganju.

Going the Distance

The Smart Source Locator uses the voice signal captured by the CX20924’s four independent microphones to determine the location of the user. (Image courtesy of Conexant.)

As the IoT takes shape, a new class of applications has begun to emerge that demands natural language interfaces that support much greater operating ranges. To meet this challenge, Conexant has developed the CX20924, a four-microphone, far-field, voice pre-processing SoC. This device targets smart home and industrial robotics control applications.

The SoC combines four digital MEMS microphones, a system microprocessor and a DSP optimized for voice and audio processing. One of the features that sets the CX20924 apart from competing devices is its Smart Source Locator algorithm.

“The Smart Source Locator uses the voice signal captured by four independent microphones to determine the location of the user in any direction,” says Ganju. “Since the signal captured by each of the four microphones is slightly different, due to distance from the user, location, loudness, etc., the Smart Source Locator can intelligently use those differences to determine the location of the voice.”
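
One of the differences Ganju mentions, arrival time, can be illustrated with a generic time-difference-of-arrival estimate for a single microphone pair. This is a common textbook technique and not a description of how the Smart Source Locator works internally. The sample rate, microphone spacing and synthetic signals are assumptions.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s

def best_lag(ref, other, max_lag):
    """Lag in samples where `other` best matches `ref`; positive means `other` is later."""
    def score(lag):
        return sum(ref[i] * other[i + lag]
                   for i in range(max_lag, len(ref) - max_lag))
    return max(range(-max_lag, max_lag + 1), key=score)

def arrival_angle_deg(lag_samples, sample_rate_hz, spacing_m):
    """Far-field angle from broadside implied by a delay between two microphones."""
    s = SPEED_OF_SOUND * lag_samples / (sample_rate_hz * spacing_m)
    return math.degrees(math.asin(max(-1.0, min(1.0, s))))

random.seed(1)
SAMPLE_RATE_HZ, SPACING_M, TRUE_DELAY = 48000, 0.10, 7  # illustrative values

# A noise-like source; microphone B hears it TRUE_DELAY samples after microphone A.
source = [random.gauss(0.0, 1.0) for _ in range(2000 + TRUE_DELAY)]
mic_a = [s + random.gauss(0.0, 0.1) for s in source[TRUE_DELAY:]]
mic_b = [s + random.gauss(0.0, 0.1) for s in source[:-TRUE_DELAY]]

lag = best_lag(mic_a, mic_b, max_lag=14)
angle = arrival_angle_deg(lag, SAMPLE_RATE_HZ, SPACING_M)
print(f"estimated lag: {lag} samples, angle: {angle:.1f} degrees from broadside")
```

A single pair cannot tell front from back, since mirrored directions produce the same delay. Combining several pairs from a four-microphone array is one way to resolve a direction unambiguously.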

The algorithm also enhances noise suppression in scenarios involving multiple voice-like sounds in a room. For example, it promises to do a better job of suppressing noise from a TV or radio playing in the background while a user is talking to the device.

In terms of basic functions, this SoC performs the same pre-processing as the CX20926. The CX20924 connects directly to the system’s four microphones and processes their signals simultaneously with the AudioSmart algorithms, which include the Smart Source Locator. The system then passes a clear voice signal through a digital audio interface (I2S or USB) to the host processor.

Conexant’s algorithms enhance noise suppression in scenarios involving multiple voice-like sounds in a room, distinguishing between the speaker’s voice and extraneous noise. (Image courtesy of Conexant.)

Moving forward, Conexant plans to build on the programmable nature of the SoC’s DSP engine. “We will continue to integrate more intelligent acoustic processing to make the CX20924 more aware of the context in which it is being used,” says Ganju.

Market Demands

As natural language interfaces evolve to meet the demands of the IoT, developers will have to tailor systems for a broad spectrum of application scenarios. Applications supporting mobile and wearable devices must be able to deliver premium-level voice-processing performance at very low power levels to stay within constrained battery-power budgets.

Developers of systems targeting far-field speech applications will have to take a system-level approach. This will mean optimizing everything in the device for extended-range speech applications, from the choice of interfaces and the size and type of memory to the type of processor used.

The one design demand that all devices aimed at consumer markets will have to meet is affordability. All must achieve premium-level performance at consumer price points.
