A true tri-dimensional representation of sound in space? An effective way to translate virtual acoustics? A limitless soundstage for infinite creativity? What about sound “in your head”? Companies such as Embody are working hard to generate personalized spatial audio that targets a reference listening experience - meaning you get the feeling of sitting in a real studio environment, listening to a pair of studio monitors, even though you are wearing headphones. Not sound “in your head,” but real reference-room sound. And yet, most of the electronic music produced these days is optimized to sound as if it is “in your head,” because producers assume listeners use headphones - and “like” the headphone sound. The concept of two speakers in a room is increasingly alien to most people.
Benefiting from an implementation of head-tracking that actually works, Apple proposes that we use the AirPods Pro earbuds or the new AirPods Max headphones to listen to immersive audio - as if we were in a real home theater environment - including listening to a Dolby Atmos movie soundtrack, which requires ceiling speakers (typically 7.1.4 channels). And yet, Dolby itself is promoting Dolby Atmos music, which is produced in a studio surrounded by speakers, but is actually mixed for binaural rendering, a smart speaker, a soundbar, or even smartphone speakers. If we are mixing music, what exactly should be done with that? What should be the target? Grammy Awards are already being awarded for “Immersive Audio” works. But what will consumers recognize as immersive in a Morten Lindberg Grammy award-winning album, when listening on a smartphone that claims to support Dolby Atmos? Will they be able to tell that a Lady Gaga song on TIDAL was mixed for “spatial audio,” or will they just say “nah… this is stereo!” when it’s not?
And what should we do with existing stereo conventions that are firmly established in our collective memory of how music should sound - such as an orchestral piece with the violins on the left, the clarinets center-left, the cellos on the right, and the trombones on the far right of the stage? Where can you start being creative without “breaking the rules”? How far can we go in sound design for immersive audio? Are there any boundaries? Or is spatial audio just wild territory, free to explore?
I started to write about this topic with audio product developers in mind. But the questions equally apply to content creators, because basically we all have the same doubts. Audio developers designing new earphones are dealing with the signal processing and algorithms required by spatial audio, and with how those will be translated by (predominantly) two-channel systems. Others are attempting well-tuned and aligned multi-way transducer designs to translate some extra perception of tri-dimensionality. And will that sound the way the “musicians intended”? (If there even is such a thing in spatial audio…)
Apple engineers designing the new M1 iMac were told that the new desktop computer needed to be able to play Dolby Atmos movies in a convincing way - and they had to do their best within a thin, two-dimensional frame, with microspeakers placed relatively close to each other. In a way, not much different from designing a soundbar for immersive formats, just using smaller speaker drivers while benefiting from much more powerful “smart” DSP. But what did they tune the system for when it is playing music and not movies or games?
As spatial audio creation takes off for multiple applications, we are also seeing a diversity of creative approaches for immersive audio reproduction (also frequently marketed to consumers as “3D audio”). Things get more specific when we discuss audio formats, typically Dolby Atmos (the de-facto mainstream immersive audio standard) and MPEG-H (the open, broadcast immersive audio standard, also used in Sony’s 360 Reality Audio) - both object-based. But there are many more “formats,” which all explore a combination of object-based audio, channel-based content, and/or scene-based audio for real-time user interaction (such as in gaming and interactive experiences like virtual reality).
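To make the object-based idea mentioned above concrete, here is a minimal sketch in Python of what an audio “object” conceptually carries: its own audio plus time-varying position metadata, leaving the mapping to actual speakers (or headphones) to the renderer at playback time. The field names and keyframe scheme are purely illustrative assumptions, not taken from the Dolby Atmos or MPEG-H specifications.

```python
from dataclasses import dataclass, field

@dataclass
class AudioObject:
    """Illustrative audio object: mono audio plus position keyframes.

    positions holds (time_s, azimuth_deg, elevation_deg) tuples, sorted
    by time. A real format defines this metadata far more precisely.
    """
    name: str
    samples: list                       # mono audio for this object
    positions: list = field(default_factory=list)

    def position_at(self, t):
        """Return the most recent (azimuth, elevation) keyframe at time t."""
        current = (0.0, 0.0)            # default: front, ear level
        for time_s, az, el in self.positions:
            if time_s <= t:
                current = (az, el)
        return current

# A helicopter object that flies from upper-left to upper-right over 2 s.
heli = AudioObject("helicopter", samples=[0.0] * 48000,
                   positions=[(0.0, -90.0, 30.0), (2.0, 90.0, 45.0)])
```

The point of the sketch is the separation of concerns: the same `AudioObject` could be rendered to a 7.1.4 theater, a soundbar, or a binaural headphone feed, because position is data rather than a channel assignment.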
All that is extensively covered and summarized in Nuno Fonseca’s booklet “All You Need to Know About 3D Audio” - highly recommended. There are no simple explanations of what spatial or immersive audio entails, because things get complicated depending on whether you are looking at it from the content-creation perspective or from the actual implementation of playback systems.
Nuno Fonseca, a university professor and member of multiple Audio Engineering Society (AES) technical and standards committees on audio for cinema, audio for games, and spatial audio, is also the founder and CEO of Sound Particles, a fast-growing software company that offers tools for spatial audio creation - extensively adopted in Hollywood and by all major production houses (the list of Sound Particles’ users is an incredible Who’s Who of mainstream content production). The company’s software tools and approach to “particles” are unique precisely because they are both universal and agnostic to “formats.”
The questions remaining are: What can I do with spatial audio? How exactly should we conceptualize 3D sound?
The tools for “spatial audio” processing are now being created - and apart from ideas of moving sounds from back to front (great for flying helicopters - not so great for a guitar solo), there is no universally recognized idea of what it means. Particularly for music, everything remains relatively unfamiliar territory.
The live sound industry is now paving the way for experimentation. Immersive sound installations with multiple arrays and speakers surrounding auditoriums (no speakers from the sky/ceiling, unlike Dolby Atmos cinema installations) are an attractive proposition. They tend to sound extremely clean and less fatiguing, because the whole dynamics change, with less compression demanded by the need to “throw” sound to cover the audience from a frontal location. With multiple speaker arrays, sound engineers feel the extra headroom, allowing them the margin to play a bit more with sound dynamics, and even frequency-based timing/phase manipulation.
Ideal sound installations will surround the audience to precisely render any object-based sound creation, translated to whatever channel sound reproduction is in place - with hundreds, dozens or simply five channels, supported with a bit of virtual acoustics - as Meyer Sound, L-Acoustics, d&b audiotechnik, Yamaha, Spatial, FLUX, and many other companies are enabling.
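The “translated to whatever channel sound reproduction is in place” step above is, at its core, a panning law: the renderer distributes each object’s signal across the nearest loudspeakers. As a hedged illustration (not the algorithm any of the named companies actually ships), here is a constant-power pairwise panner for a 2D ring of speakers, in the spirit of VBAP:

```python
import math

def pan_object(azimuth_deg, speaker_azimuths_deg):
    """Constant-power pairwise panning sketch (2D, VBAP-like).

    speaker_azimuths_deg must be in ascending order. Returns one gain per
    loudspeaker; gains are zero except for the adjacent pair that brackets
    the object's azimuth, and the pair's gains satisfy g1^2 + g2^2 = 1
    (constant acoustic power as the object moves).
    """
    spk = speaker_azimuths_deg
    gains = [0.0] * len(spk)
    for i in range(len(spk) - 1):
        a, b = spk[i], spk[i + 1]
        if a <= azimuth_deg <= b:
            # Normalized position within the pair, mapped onto a
            # quarter-period sine/cosine crossfade.
            t = (azimuth_deg - a) / (b - a) if b != a else 0.0
            gains[i] = math.cos(t * math.pi / 2)
            gains[i + 1] = math.sin(t * math.pi / 2)
            break
    return gains

# Stereo layout at +/-30 degrees: a centered object lands equally on both.
g = pan_object(0.0, [-30.0, 30.0])
```

The same function works unchanged whether the layout has two speakers or dozens - which is exactly why object-based mixes can target “hundreds, dozens or simply five channels.”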
While professional audio is moving confidently ahead, the consumer electronics industry is currently busy combining its most powerful digital signal processors that can fit in low-power wearables, precisely to generate those spatial audio cues in binaural systems. In recent examples, CEVA is working with spatial audio pioneer VisiSonics to combine motion sensors for head-tracking with its powerful DSP family, in order to handle multichannel audio and 3D audio impressions rendered binaurally using generic or personalized head-related transfer function (HRTF) profiles. And Dirac offers a complete suite of spatial audio solutions for headphones, combining its patented Dynamic HRTF technology, magnitude response correction, impulse response correction, and digital signal enhancement, which can be ported to all major platforms from Qualcomm, Realtek, Airoha, and BES, with multiple SDKs available. In fact, like other DSP pioneers active in spatial audio, Dirac perfected all those technologies originally for automotive and home theater applications.
But all those development tools still need reference material in order to become relevant for consumers. As the late Bruce Swedien (1934–2020) and Al Schmitt (1930–2021) reminded us, music “production” is as much about experimentation and creativity as it is about meeting established conventions - particularly when addressing specific music genres. An experienced music producer knows exactly how to position the percussion and drum sounds relative to the bass track, and how to use dynamic processing to create the required tri-dimensionality between the voice and all the different bass and solo instruments. In their minds, the mixing process of those “aural cues” is defined “spatially” through a two-channel system, and the result can be extremely impressive even when translated to an actual “spatial audio” music mix. But what should we do with music created FOR spatial audio?
As Nuno Fonseca appropriately states, “Space is still one dimension that is not fully explored by musicians.”
This article was originally published in The Audio Voice newsletter, April 29, 2021 (#325).