Soniox Unveils AudioMind Advanced AI Transcription Model

March 13 2024, 00:35
California startup Soniox has unveiled AudioMind, its first AI model capable of deeply understanding audio and transform conversations and transcription into actionable information. Already leading in the English voice-to-text, natural language processing space, Soniox is taking its approach much further by developing AI that is able to fully understand the world and humans through audio, aiming to solve problems requiring real-world interaction.
 

Soniox describes AudioMind as the world's first AI model capable of comprehending the full richness of audio. But as always, generic and somewhat abstract definitions don't fully encapsulate what is essentially a very useful and desirable AI transcription application.

Founded in 2020 and currently headquartered in Foster City, CA, Soniox developed one of the best speech recognition engines in the market. The company currently offers one of the leading cloud-based transcription engines available commercially - the one that audioXpress has been successfully using for interviews and general speech-to-text conversion.

Focused on speech AI, Soniox introduced the world's first unsupervised learning approach for speech recognition in 2021. This innovation was essential to overcome the limitations that previously hindered speech system performance.

In 2023, Soniox started transitioning from speech AI to general AI, leveraging its unique expertise in unsupervised learning and deep understanding of building accurate, reliable and efficient AI technology. Most recently, Soniox announced they were working hard on their own large language model (LLM) and announced Soniox 7B. An LLM that supports English and code with 8K context, built on top of Mistral 7B the LLM created by French company Mistral.ai, enhanced with additional pre-training and fine-tuning for strong problem-solving capabilities. According to Soniox, its Soniox 7B large language model outperforms Mistral 7B on all benchmarks, and matches GPT-4 on some benchmarks. Now, Soniox applied lessons learned with Soniox 7B to create an actual AI agent.

"Today marks a significant milestone in AI: the release of AudioMind, the first AI model capable of deeply understanding audio, giving access to the full spectrum of auditory experiences," states Klemen Simonic, Soniox CEO. "Soniox's mission is to understand the world and humans through audio. We started by building the most accurate speech recognition AI. Today, we are introducing AudioMind, the world's first AI model capable of comprehending the full richness of audio."
 

AudioMind has been trained to listen and understand audio in a manner akin to human processing. It can recognize speech, identify speakers, discern tone, gender, emotions, and distinguish between environmental and human-made sounds. The model is capable of summarizing and creating custom-formatted documents directly from audio, which are not feasible with text-only methods. It supports English and can process audio files up to 60 minutes in length, and it processes 1h of audio in about 3 minutes.

"AudioMind represents a significant leap forward in harnessing the power of audio. We hope it will transform our interaction with the audio world, unlocking new possibilities and catalyzing a wave of innovative applications across various fields," Simonic adds.

On the company's website, Soniox released a series of examples generated directly by AudioMind without modification, that are intended to demonstrate the capabilities of AudioMind with Transcript Generation, Speaker Intelligence, Sound Intelligence, Audio Summarization, Audio Document Creation, Audio Q&A, and Voice Interaction. The examples showcase how AudioMind generates custom transcripts when prompted, understanding formatting instructions. In those examples we can see how AudioMind can recognize, identify and understand the state of the speaker from voice, as well as recognize sounds and comprehend their context within the overall audio environment.
 
AudioMind also enables custom audio summarization through user-provided instructions via a prompt, and can convert audio into custom-formatted documents, using all available audio information to ensure content is organized and formatted as specified in the prompt. The model is even able to answer complex questions about audio content, such as identifying the topics of conversations, attribute dialogue to specific speakers, and analyze emotional tones and sounds.

Audio can also serve as a prompting method. Instead of typing, prompts can be spoken and AudioMind can hear the user's voice with details.

Klemen Simonic, Sonix founder and CEO, brings 12+ years of diverse industry and academic experience in AI from his work at Facebook, Google, Stanford University and the University of Ljubljana in Slovenia. As one of the founding members of Facebook’s speech team, the speech technologies Klemen built include voice activity detection, language identification and automatic speech recognition systems.

Ambroz Bizjak, is Soniox co-founder, and Chief Architect. Ambroz met Klemen at the University of Ljubjana in Slovenia during his undergraduate studies in mathematics and computer science. After finishing his studies, Ambroz worked at Cosylab for 8 years, where he developed the core software for control systems for particle accelerators, fusion reactors and cancer therapy systems that are used all around the world. Engineers and scientists involved in these high-impact projects have lauded Ambroz as one of the world's most exceptional programmers.

AudioMind is in preview mode, and access can be requested here.
www.soniox.com
Page description
About Joao Martins
Since 2013, Joao Martins leads audioXpress as editor-in-chief of the US-based magazine and website, the leading audio electronics, audio product development and design publication, working also as international editor for Voice Coil, the leading periodical for... Read more

related items