Voice Control for Low Power Microcontrollers

November 3 2022, 09:10

Ido Gus, Deep Learning Senior Team Leader at CEVA’s Sensor and Audio Business Unit, writes about voice control deployment on low power and resource constrained microcontroller-units (MCUs). The article discusses what to consider for the development of products using voice as a primary human interface, and how speech recognition models using AI technology can support a wide variety of use cases and languages, without compromising on power requirements.

In this article, we will discuss the why and how of voice control deployment on low power and resource constrained microcontroller-units (MCUs) and its translation into real world applications.

But first, let’s define a few core concepts: Human Computer Interface (HCI), Voice User Interface (VUI), and Voice Control.

- Human Computer Interface (HCI) is a well-defined concept that can be described as the point of communication between a human user and a computer. The communication channel classification can be based on many of the human senses: vision, hearing, touch, and so on.

- Voice User Interface (VUI) makes it possible for humans to communicate with machines using voice. Machines may employ some form of speech recognition to translate human speech to commands and queries.

- Voice Control is an implementation of a VUI, allowing a human to use simple, concise commands to operate a device or appliance.

VUIs have been around for some years and have become very popular recently thanks to devices such as Amazon Echo, Google Home, and Apple HomePod, whose voice assistants are also deployed on smartphones, TVs, cars, and other devices. Most of these devices rely on complex, cloud-based speech recognition engines. These engines handle complex human speech, allowing users to interact with machines in natural language.

However, these abilities come with a price tag in several forms, starting with compromised user privacy: user queries are uploaded to the cloud for processing and are stored there for various lengths of time (from hours to months, depending on the service supplier). The device must also have a connection to the cloud to operate, and processing in the cloud is often slower and more energy consuming. The connectivity requirement in turn makes device BOM costs soar, as relatively complex connectivity hardware must be integrated into the device, often resulting in major design modifications.

The costs of a full-fledged cloud-based voice assistant can be alleviated, for many use cases, by deploying a small, task-optimized voice control engine on a battery-operated, resource-constrained, offline, MCU-enabled device. Voice control powered by a small dedicated VUI engine can be realized on a simple MCU-based hardware module serving as a drop-in replacement for existing controls (knobs, buttons, touch screens, etc.). Naturally, there are limitations to the capabilities of such a solution, but as we will shortly see, for many tasks and use cases, these limitations are outweighed by the benefits.

The major limitation of voice control implementations for MCUs is that they often support only a limited vocabulary – only a small set of words can be recognized, and the user must remember these words to operate the device properly. In other words, the user cannot use natural language, and instead must make their request using the supported words and commands. For example, “play the next song” might not be recognized by a system configured to detect the command “next song” or even just “next”.

This limitation has a plus side – simplicity. Using short, concise commands greatly reduces the risk of the device “misunderstanding” a command due to ambient noise or other interruptions. This becomes very evident when considering the tasks that voice control on an MCU is designed to handle.

Let’s review some use-cases.

Major Appliances
Many major appliances that have button, knob, or touch interfaces are also operated with dirty or wet hands (ovens, cooktops, washing machines, dishwashers). Voice control deployed on an MCU-powered hardware module can prove very useful in keeping the appliance clean and easy to operate (have you ever tried to operate a touch interface with wet fingers?). From a manufacturing standpoint, voice control deployed on a mass-produced, MCU-powered hardware module can serve as a drop-in replacement for existing buttons, knobs, and touch interfaces with minimal integration costs.

Robot Vacuum Cleaners
Robot Vacuum Cleaners (RVCs) can operate independently or via remote controls (which always get lost…). An MCU Voice Control module supporting just a few commands (“clean kitchen”, “stop”, “go charge”) can significantly improve user experience, with a small impact on BOM and costs, while performing better than a cloud-based voice assistant, which often has difficulties with noisy environments and short commands.

Public Kiosks and Vending Machines
With Covid-19, hygiene became a major concern, especially in the public domain. An MCU Voice Control module can provide an effective, low-cost option to upgrade existing machinery catering to public health. Supported commands can be displayed or printed on the device to compensate for the lack of natural language support while lowering error rates.

Wearables, Hearables, and other Tiny Devices (TWS and Hearing Aids)
This device class is characterized by a limited power supply (small batteries, rendering continuous cloud connection impractical), limited compute resources (rendering large vocabulary speech recognition engines impractical), and limited surface space (rendering buttons and tap interfaces inconvenient) – which makes MCU-powered voice control an ideal solution.

IR Remote Control with Voice Control (for TVs, Home Entertainment, and HVAC systems)
Remote control is the preferred interface for operating TVs, home entertainment systems, A/C systems, ceiling fans, and any device that is out of reach. Adding on-device VUI to remote controls allows better personalization (e.g., with speaker verification, smart TV apps such as Netflix can be made to start up with the user’s profile) and can also solve the “looking for the remote” hassle. After-market universal voice-controlled remote controls can offer an easy upgrade for older systems.

What Makes Up a Good Voice Control Solution?
An MCU-powered voice control solution must address some key challenges to be considered an efficient, effective, and reliable alternative to existing interfaces (knobs, buttons, touch):

Quality of Service – the probability that the voice control engine will “understand” (detect correctly) the uttered command or word. Two types of errors exist – False Accept (the engine triggers when no command was spoken) and False Reject (a spoken command goes undetected). User sensitivity to each type of error may vary with use case, and the voice control engine must be tuned accordingly. In general, users would expect a True Acceptance Rate of 95% or higher, and no more than 1 False Accept per 24 hours. In other words, VUI performance should be such that a user would not bother to reach for the remote or button.
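As a rough illustration of how the two targets above could be checked against test results, here is a minimal sketch. The function names, the way the False Accept count is normalised to a 24-hour window, and the sample counts are all assumptions made for this example; they are not taken from any particular engine or test procedure.

```python
def true_accept_rate(accepted: int, commands_spoken: int) -> float:
    """Fraction of genuinely spoken commands that the engine detected."""
    if commands_spoken <= 0:
        raise ValueError("commands_spoken must be positive")
    return accepted / commands_spoken


def false_accepts_per_24h(false_accepts: int, audio_hours: float) -> float:
    """False accepts normalised to 24 hours of audio containing no commands."""
    if audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    return false_accepts * 24.0 / audio_hours


def meets_targets(tar: float, fa_24h: float,
                  min_tar: float = 0.95, max_fa_24h: float = 1.0) -> bool:
    """Check the article's targets: TAR >= 95% and <= 1 False Accept per 24 h."""
    return tar >= min_tar and fa_24h <= max_fa_24h


# Hypothetical test campaign (numbers invented for illustration):
# 1000 spoken commands, 962 detected; 3 false triggers in 100 hours of audio.
tar = true_accept_rate(962, 1000)
fa_24h = false_accepts_per_24h(3, 100.0)
print(f"TAR = {tar:.1%}, False Accepts per 24 h = {fa_24h:.2f}, "
      f"pass = {meets_targets(tar, fa_24h)}")
```

Since user sensitivity to each error type varies with the use case, the two thresholds are parameters rather than fixed constants.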

Noise Robustness – the ability to provide high-quality detection in the noisy environments that all of the use cases reviewed earlier operate in (some of the devices are themselves the source of the noise). A good VUI implementation is expected to show a perceivable performance degradation only at SNR levels lower than 5 dB.
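To make the SNR figure concrete, the sketch below scales a noise recording so that, when added to a speech recording, the mix has a chosen signal-to-noise ratio. It uses the standard definition SNR = 10·log10(P_speech / P_noise) over mean signal power. The function names and the synthetic signals are assumptions for illustration only; a real evaluation would use recorded speech and noise.

```python
import math
import random


def mean_power(samples):
    """Mean of the squared samples."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Return (mixed signal, noise gain) so that speech-to-noise ratio is snr_db."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise signal is silent")
    # Noise power needed for the requested SNR, then the amplitude gain.
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    mixed = [s + gain * n for s, n in zip(speech, noise)]
    return mixed, gain


# Synthetic stand-ins: a 440 Hz tone as "speech" and uniform random noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(1600)]
mixed, gain = mix_at_snr(speech, noise, snr_db=5.0)
```

Running the same test set through the engine at several SNR levels is one way to find where detection quality starts to degrade.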

Power and Compute Requirements – these are critical in determining if the candidate implementation is suitable for the use case. For battery-operated implementations, power consumption should be in the milliwatt range. Such a VUI implementation should be able to run on a Cortex-M0+ or similar MCU, consuming less than 50 MCPS (million cycles per second) and 80 KB of memory.
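The budget above can be turned into per-frame numbers with simple arithmetic. The sketch below assumes a 10 ms frame hop, 1 KB = 1024 bytes, and an arbitrary 16 KB reserved for audio buffers and activations; none of these values come from the article, and they are only there to show the calculation.

```python
def cycles_per_frame(mcps: float, frame_hop_ms: float) -> int:
    """CPU cycles available for each audio frame at a given MCPS budget."""
    return int(mcps * 1_000_000 * frame_hop_ms / 1000)


def max_parameters(memory_kb: int, bytes_per_weight: int,
                   reserved_kb: int = 0) -> int:
    """Upper bound on model weights that fit after reserving working memory."""
    if reserved_kb > memory_kb:
        raise ValueError("reserved memory exceeds the total budget")
    return (memory_kb - reserved_kb) * 1024 // bytes_per_weight


# 50 MCPS with an assumed 10 ms frame hop.
print(cycles_per_frame(50, 10))                # 500000

# 80 KB total with an assumed 16 KB reserved for buffers.
print(max_parameters(80, 1, reserved_kb=16))   # 65536 weights at 8 bits each
print(max_parameters(80, 4, reserved_kb=16))   # 16384 weights at 32 bits each
```

Under these assumptions, storing weights in 8 bits instead of 32 bits allows four times as many parameters in the same memory.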

Security – an MCU voice control solution may be expected or required to respond selectively to commands issued by specific entities. This can be realized by speaker verification technology that can be integrated into the system.
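The article does not describe how speaker verification is implemented. One common approach, shown here only as a sketch, compares a fixed-length voice embedding of the incoming command against an embedding stored at enrolment, and accepts the command if their cosine similarity exceeds a threshold. The embedding vectors, the function names, and the 0.7 threshold are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embedding has zero length")
    return dot / (norm_a * norm_b)


def is_enrolled_speaker(enrolled, candidate, threshold=0.7):
    """Accept the command only if the voice is close enough to the enrolled one."""
    return cosine_similarity(enrolled, candidate) >= threshold


# Toy 3-dimensional embeddings (real embeddings are much longer).
enrolled = [1.0, 0.0, 0.0]
same_voice = [0.9, 0.1, 0.0]
other_voice = [0.0, 1.0, 0.0]
print(is_enrolled_speaker(enrolled, same_voice))
print(is_enrolled_speaker(enrolled, other_voice))
```

The threshold plays the same role as the tuning described under Quality of Service: raising it lowers false accepts of other speakers at the cost of rejecting the enrolled user more often.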

VUI for MCUs Implementation Challenges
Building a competitive VUI engine is a game of balancing multiple (and often opposing) constraints:
- Quality of service (True Acceptance Rate vs False Accepts per Hour)
- Robustness to noise
- Robustness to reverberation
- Extremely limited compute and memory resources
- Robustness to accents
- Data acquisition costs

In deep learning research, a common way to boost model performance involves increasing model complexity and the amount of training data. Such techniques are not applicable in the “real world” where the goal is building a model (VUI engine in this case) targeting MCUs that have very limited resources (model complexity must be kept to a bare minimum) in an economical fashion (data acquisition resources are limited).

The pressure of these constraints means that different model-size reduction techniques, along with advanced data engineering methods aimed at making the most of limited data acquisition resources, need to be analyzed. Techniques such as post-training quantization, quantization-aware training, structured and unstructured pruning, low-rank approximation, sparsity, and knowledge distillation can be deployed. While these techniques can reduce compute and memory footprints, reaching the required model performance still means evaluating:

- Multiple audio signal processing techniques
- Multiple feature extraction techniques
- Different model architectures from CNNs to RNNs and transformers
- A wide array of audio data engineering methods from effective and efficient data collection procedures to data augmentations and noise mixing parameters
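Of the size-reduction techniques named above, quantization after training is the simplest to illustrate. The sketch below applies symmetric, per-tensor 8-bit quantization to a list of weights: a single scale maps the largest absolute weight to 127, and each weight is rounded to the nearest integer step. This is a generic textbook scheme chosen for illustration, not a description of any specific engine, and the sample weights are invented.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [q * scale for q in quantized]


# Invented example weights.
weights = [0.42, -1.0, 0.07, 0.0, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, worst_error)
```

Each weight now needs one byte instead of four, and the rounding error per weight is bounded by half a quantization step (scale / 2).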

Finally, when satisfactory model architecture, data acquisition, and training recipes are realized, a number of implementation challenges still need to be overcome:

- Code portability and maintainability
- High performance and high accuracy fixed point arithmetic
- Multi-platform optimizations
- API simplicity and usability
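To give a flavour of the fixed-point arithmetic challenge listed above, here is a sketch of Q15 multiplication, a common 16-bit format in which an integer x represents the value x / 32768. The code is written in Python for readability; on an MCU the same steps would be done with integer instructions. The helper names are assumptions for this example.

```python
Q15_ONE = 1 << 15
INT16_MIN, INT16_MAX = -32768, 32767


def saturate16(x: int) -> int:
    """Clamp to the signed 16-bit range instead of wrapping around."""
    return max(INT16_MIN, min(INT16_MAX, x))


def float_to_q15(x: float) -> int:
    """Convert a value in [-1, 1) to Q15, saturating at the range limits."""
    return saturate16(round(x * Q15_ONE))


def q15_to_float(x: int) -> float:
    return x / Q15_ONE


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 numbers with rounding and saturation."""
    # The raw product has 30 fractional bits; add half an LSB, then shift
    # right by 15 to return to Q15.
    return saturate16((a * b + (1 << 14)) >> 15)


half = float_to_q15(0.5)
print(q15_to_float(q15_mul(half, half)))   # 0.25
# -1.0 * -1.0 would be +1.0, which Q15 cannot represent, so it saturates.
print(q15_mul(INT16_MIN, INT16_MIN))       # 32767
```

Rounding and saturation are where accuracy is usually won or lost: truncating instead of rounding biases results downward, and wrapping instead of saturating turns a small overflow into a large error.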

CEVA WhisPro is a Neural Network based speech recognition technology targeting the development of products using voice as a primary human interface. WhisPro extends CEVA's intelligent sound IP portfolio, offering developers a holistic solution for cloud-based or edge voice-controlled devices.

Conclusion
An effective VUI engine such as CEVA’s WhisPro voice control technology forms a key part of our ability to use voice as a primary human interface for intelligent cloud-based services and edge devices. Speech recognition models need to have a high recognition rate. The underlying AI technology should support a range of commands for a wide variety of use cases and languages, without compromising on power or compute requirements. Last, to stop unauthorized use of a voice-activated device, security features such as speaker verification are a must.

For further information about CEVA’s voice control solutions, visit www.ceva-dsp.com

About the Author
Ido Gus serves as CEVA’s Deep Learning Senior Team Leader at the Sensor and Audio Business Unit. He brings over 15 years of experience, spanning software development, algorithm optimization, deep learning algorithm research, and project management, and he specializes in the application of deep learning algorithms to audio and sound processing. Ido holds a B.Sc. in Information Systems Engineering from Ben Gurion University of the Negev, and an MBA from the Hebrew University of Jerusalem. He is passionate about leading cutting edge deep learning projects from research to optimized implementation on edge devices.

This article was originally published in The Audio Voice newsletter (#397), November 3, 2022.
