Voice Commands on The Edge: How Syntiant Is Solving the Power Problem

June 11 2020, 17:10
Alexa! OK Google! Hey Siri! Did we get your attention?

If you read that aloud, most of us would now have our favorite robot voice asking us what we want. Whether it is Amazon Alexa, Google Home, or another assistant, voice commands are the logical next step in user interfaces (UI). Market uptake for these devices has been strong. However, deployment has exposed two major sets of issues with the technology: technical issues, and privacy and security issues. The technical issues relate to accuracy, power consumption, and processing time, while the privacy issues are a mix of technical capability vs. ethics. Since the desire to talk to our devices is not going away, in this article I'm going to discuss the Syntiant NDP 10x processor and how it addresses these challenges.

The UI development timeline from keyboard to mouse to touch and now to voice has been a compelling progression that has driven a lot of exciting innovation in consumer electronics. For most of us, there are many use cases where voice control is extremely useful, such as dirty hands in the kitchen, sweaty hands in the gym, or wet hands while relaxing in a hot tub. The product is typically a Bluetooth (BT) speaker or headphone, or some other mobile device such as a wearable or mobile phone, that in these situations needs to be controlled without contact. This is a cool feature, but there are significant technical constraints on executing it on edge devices. The largest challenge is minimizing battery power consumption while maintaining accurate keyword spotting (KWS), or wake word detection, with low-latency processing.

Syntiant's neural network technology allows BT speakers, BT headphones, or almost any edge device to feature voice assistant interactions, even on the most power-constrained devices. The Neural Decision Processor, or NDP 10x, is a purpose-built processor based on a semiconductor architecture that runs deep learning algorithms with always-on capability, sending interrupts to the application processor when Alexa, OK Google, or other wake words are recognized. We will explain more about this later, but custom command words such as "volume up," "answer call," or "mute" are also possible. The key element is that the functions are triggered locally on the edge, all while consuming less than 140 µW of power. To put this into context, the Syntiant NDP100 is 200x more efficient than existing competitive solutions while offering higher frames per second (FPS) and considerably more neural network parameters. Further, the NDP had the lowest inference latency.
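To get a feel for what a 140 µW budget means in practice, here is a rough back-of-the-envelope sketch. Only the 140 µW figure comes from the article. The 50 mAh battery, the 3.7 V nominal cell voltage, and the reading of "200x more efficient" as "200x the power draw" are illustrative assumptions, and the calculation ignores every other drain on the battery.

```python
# Back-of-the-envelope standby time for an always-on wake word detector.
# Only NDP_DRAW_W is taken from the article; every other number is an
# assumption chosen for illustration.

def standby_hours(battery_mah: float, voltage_v: float, draw_watts: float) -> float:
    """Hours a battery could feed a constant load, ignoring all other drains."""
    energy_wh = battery_mah / 1000.0 * voltage_v
    return energy_wh / draw_watts

NDP_DRAW_W = 140e-6                       # upper bound quoted in the article
ASSUMED_OTHER_DRAW_W = 200 * NDP_DRAW_W   # assumption: "200x" read as 200x the draw

EARBUD_BATTERY_MAH = 50.0   # assumed small earbud cell
CELL_VOLTAGE_V = 3.7        # assumed nominal lithium cell voltage

ndp_hours = standby_hours(EARBUD_BATTERY_MAH, CELL_VOLTAGE_V, NDP_DRAW_W)
other_hours = standby_hours(EARBUD_BATTERY_MAH, CELL_VOLTAGE_V, ASSUMED_OTHER_DRAW_W)

print(f"Always-on listening at 140 uW: about {ndp_hours:.0f} hours")
print(f"Same battery at 200x the draw: about {other_hours:.1f} hours")
```

Under these assumptions, the listening function alone goes from a matter of hours to a matter of weeks, which is why the power figure matters so much for earbuds and wearables.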
 

We can explain this by comparing the compute path in the NDP to a gate logic function, best described as a classifier. The processor arrives at a yes or no decision on whether a keyword has been recognized and, if yes, sends an interrupt. In comparison, an MCU is burdened with the power cost of moving data in and out of memory to decide whether a keyword has been detected, while the Syntiant NDP exploits massive advantages in parallel processing. The next advantage for the Syntiant chip is its ability to be far more accurate at 8-bit or lower precision, while the equivalent in an MCU might need 128 or 256 bits to achieve the same accuracy. This reduces power consumption significantly. The metaphor of a 20-year veteran bartender vs. a new bartender is a good comparison. Both serve more or less the same drinks, but the veteran delivers faster, with higher accuracy, and expends less energy. The veteran is purpose-built for the task, while the new bartender is probably capable of many tasks and bartending is just one of them. This, in a nutshell, is the NDP vs. the MCU/DSP.
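The idea of a low-precision classifier that ends in a yes or no decision can be sketched in a few lines. This is a generic, hypothetical illustration of 8-bit quantized inference, not Syntiant's architecture or API. The function names, scale factors, weights, and threshold are all made up for the example.

```python
# Hypothetical sketch of an 8-bit keyword classifier. It illustrates
# low-precision inference in general and is not Syntiant's design.

def quantize(values, scale):
    """Map floats to signed 8-bit integers using a shared scale factor."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def int8_accumulate(features_q, weights_q, bias_q):
    """Integer multiply-accumulate, the core operation of a quantized layer."""
    return sum(f * w for f, w in zip(features_q, weights_q)) + bias_q

def keyword_detected(features, weights, bias, threshold,
                     f_scale=0.02, w_scale=0.01):
    """Return True (fire the interrupt) or False (stay asleep)."""
    features_q = quantize(features, f_scale)
    weights_q = quantize(weights, w_scale)
    # The bias lives in the accumulator's scale, f_scale * w_scale.
    bias_q = round(bias / (f_scale * w_scale))
    acc = int8_accumulate(features_q, weights_q, bias_q)
    score = acc * f_scale * w_scale  # dequantize only the final result
    return score > threshold

# Made-up weights and audio features for illustration.
WEIGHTS = [0.5, -0.3, 0.8]
BIAS = -0.5
THRESHOLD = 0.5

matching = keyword_detected([1.0, -0.5, 0.9], WEIGHTS, BIAS, THRESHOLD)
non_matching = keyword_detected([0.1, 0.9, -0.2], WEIGHTS, BIAS, THRESHOLD)
print(matching, non_matching)
```

All of the arithmetic inside `int8_accumulate` is small-integer math, which is the kind of work that maps well onto compact parallel hardware. The only output is a single boolean, which corresponds to the interrupt described above.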

For earbuds and mobile phones, which we can call "close-talk" devices, the precision of wake word detection is critical to a successful product. While Syntiant can develop libraries or models of wake words for customers, it has also formed partnerships with several key algorithm companies, including Sensory, to provide highly accurate voice recognition in at least 15 different languages. These algorithms are growing in precision, with interpolative and extrapolative functions based on the growing datasets. The roadmap to natural language processing (NLP) is critical to the adoption of voice command technology overall. We can imagine our BT speakers learning "turn it up" or "crank it up" or similar local commands on their own. This would be extremely useful, and the ease of use would further push adoption. Looking forward, improvements in the algorithms are complementary to the Syntiant hardware on which they run. Advancements in both processor and algorithms will make this category a must-have feature.

Turning our attention to the privacy and security issues, according to a recent article in Forbes, voice assistants are exposed to several pitfalls in this area. The possible exposure of corporate or personal data through devices that are always listening, or possibly recording, is a big concern. Syntiant provides a technical firewall to prevent Alexa, Google, or others from listening. Syntiant's Always on Voice (AoV) technology listens for wake words or keywords. Only when they are authenticated does the system send an interrupt to the application processor to access the cloud or perform a function. Put simply, systems using Syntiant do not permit always-on recording or data gathering.
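The gating behavior described above can be sketched as a small piece of logic that sits between the microphone and the rest of the system. This is a hypothetical illustration only: the class name, the `window_frames` parameter, and the stand-in detector are assumptions made for the example, not Syntiant's AoV implementation.

```python
from typing import Callable, List

# Hypothetical sketch of wake word gating; not Syntiant's AoV implementation.

class WakeWordGate:
    """Drops audio frames until a detector fires, then forwards a bounded window."""

    def __init__(self, detector: Callable[[bytes], bool], window_frames: int):
        self.detector = detector
        self.window_frames = window_frames
        self.remaining = 0
        # Stands in for whatever the application processor would receive.
        self.forwarded: List[bytes] = []

    def push(self, frame: bytes) -> None:
        if self.remaining > 0:
            # Gate is open: pass the frame along and count down the window.
            self.forwarded.append(frame)
            self.remaining -= 1
        elif self.detector(frame):
            # Equivalent of the interrupt: open the gate for the next frames only.
            self.remaining = self.window_frames
        # Otherwise the frame is discarded and never leaves the detector.

# Stand-in detector: treats one specific frame as the wake word.
gate = WakeWordGate(detector=lambda frame: frame == b"wake", window_frames=2)
for frame in [b"private", b"wake", b"volume", b"up", b"private again"]:
    gate.push(frame)

print(gate.forwarded)  # [b'volume', b'up']
```

In this sketch, nothing spoken before the wake word or after the window closes is ever stored or forwarded, which is the property that matters for privacy.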

This segues nicely into the close-talk devices with the highest utility for this technology, which are small and portable "edge computing" devices. What these devices have in common is small buttons and, typically, severe cost constraints. Pressing tiny buttons to answer calls, turn up the volume, or mute is a sloppy interface for users. It is too easy to have a bad experience. Some of the UIs in true wireless earbuds are quite cantankerous, with a double tap on the left ear to mute or a triple tap on the right ear to skip forward a track. Not to mention, the tapping often dislodges the earbud. It is therefore a somewhat obvious point that a voice control UI with the NDP 10x relieves this problem.
 

In sum, the Syntiant NDP 10x solves many of the hardware problems in voice command solutions for edge devices, particularly power and latency. In combination with a well-trained neural network, accuracy is very high, which should increase the adoption rate of the technology. We would all be happy using voice commands if they reacted perfectly without draining the batteries on our devices, so Syntiant looks very promising as it rolls out its purpose-built chips for voice commands.
For further information contact Dave Lindberg.

This article was originally published in The Audio Voice newsletter 281. 