Immersive Audio Explored - Really, What Is Immersive Audio?

October 3 2023, 14:10
Any form of audio and entertainment content today is produced and distributed as files, which are then streamed to consumers. audioXpress invited Emotion Systems, a company specializing in file-based audio automation tools, to share its perspective on working with immersive audio material.
 

If you asked people what immersive audio is, most would say that it is an audio-visual experience that enhances the enjoyment of the content. Some may even take it for granted and not differentiate between the image and the sound — it is just a combination of factors, the story, visuals, and audio!

But we can all agree that the sound and soundtrack bring content to life. Without immersive audio or with lower-quality audio, the enjoyment is significantly reduced.

Immersive audio is a generic term that describes a broad range of technologies designed to make the sound listening experience more lifelike and encompassing for the listener. Immersive audio aims to create a three-dimensional sound experience, which means listeners can perceive sounds coming from all around them, including from above and below, not just from left and right. This is achieved using various technologies and techniques, such as surround sound, object-based audio, and binaural audio.

So, What About Dolby Atmos?
Dolby Atmos is a specific immersive audio technology developed by Dolby Laboratories. Unlike traditional audio methods that distribute sound across different channels, it’s an object-based audio technology that breaks audio down into individual objects that move around in a three-dimensional space. With Dolby Atmos, sounds are precisely placed and moved anywhere in a room, including overhead, to make the entertainment experience sound more realistic and immersive.

In essence, Dolby Atmos is a form of immersive audio, but not all immersive audio systems use Dolby Atmos. Other formats, such as DTS:X, Auro-3D, and Sony 360 Reality Audio, provide immersive audio experiences but use different technologies or standards. The exact specifications and performance can vary depending on the technology used and how it’s implemented.

Therefore, while Dolby Atmos is a proprietary technology with specific standards and requirements, immersive audio is a broad term encompassing various sound technologies, including but not limited to Dolby Atmos.
 
Dolby Atmos Renderer audio production software, central to a Dolby Atmos mixing system, renders up to 128 inputs (including audio beds and objects with metadata) to any standard channel-based layouts and creates the Dolby Atmos master files, used for encoding for final distribution to streaming services or Blu-ray.

Spatialization
From the first phonographs to today’s immersive audio experiences, the evolution of audio technology has been a remarkable journey that’s changed how we experience sound.

In the beginning, there was monophonic sound (mono). This simple form of audio reproduction used a single channel, resulting in a sound that appeared to come from one place. There was no sense of feeling immersed in the sound. It was straightforward, with everything centered, as if all the audio emanated from a single point.

The advent of stereo sound marked a significant leap forward. It introduced a second channel, which allowed sound engineers to create a left and a right audio channel, providing a sense of depth and directionality to the sound. Listeners could now experience sounds panned to the left or right or split between the two channels, offering an impression of space and movement. In spatial terms, this adds movement along the X axis, from left to right, through L-R panning.

Following an era of important experiments with four (quad) or more channels, where much of the current understanding of “spatialization” originates, the next consequential advancement (from a consumer perspective) came with the introduction of surround sound. With multiple audio channels positioned around the listener, the sound could come from the front, sides, and behind. The most common system, 5.1 surround sound, offered five full-bandwidth channels (front left, center, and right speakers, or LCR, plus left surround (LS) and right surround (RS) speakers) and one low-frequency effects (LFE) subwoofer channel, creating a significantly more immersive experience for the audience. In spatial terms, audio perception in X and Y provides a depth pan around the listener.

Now, in the era of immersive or 3D sound, the experience of listening has never been more lifelike. Technologies like Dolby Atmos have ushered in this era, creating a fully three-dimensional sound field, with even height channels incorporated to provide a true sense of sounds moving around and above the listener.
 
Immersive audio formats encode audio channels, audio objects, or higher-order ambisonics (HOA) components, which can be used alone or in combination. These can be rendered to a number of standard or unique speaker configurations, or binaurally rendered for headphone listening.

Channels to Objects 
In this new age of immersive audio, instead of just channels, sound designers can work with “sound objects.” They can place these objects anywhere in a 3D space, and the system will interpret where to direct that sound based on the listener’s speaker setup. This results in a listening experience that can closely mimic how we hear sound in real life, enveloping us in the audio and creating a sense of total immersion. Spatially, we are now listening to content in X, Y, and Z (height).

In the realm of visual technology, we began with standard definition (SD), providing minimal resolution and quality. High definition (HD) soon followed, offering a significant step up in both picture resolution and detail, transforming the viewing experience on television screens and monitors.

Ultra-high definition (UHD or 4K) further improved upon HD by providing even greater resolution, rendering more nuanced and lifelike visuals. The introduction of high dynamic range (HDR) brought about an ability to display a broader and richer range of colors, brighter whites, and deeper, more detailed blacks. This dramatically enhances the contrast and color range of the image, creating a more visually dynamic and realistic appearance.

Recently, the concept of high frame rates (HFR) has been introduced, particularly in cinema. HFR involves capturing and displaying more frames per second, leading to smoother motion representation, less blurring, and a heightened sense of reality.

Overall, the audio and visual technology landscapes have followed similar trajectories, each seeking to enhance their sensory experiences. Both have moved from mono/standard definition toward more immersive and lifelike experiences (immersive audio/UHD, HDR), increasing the level of immersion and realism.

However, it’s important to note that while the evolution of visual technology often garners more attention due to the visually dominant nature of human perception, advancements in audio technology have been just as crucial in shaping our media experiences. A high-quality visual experience can often feel inferior without equally high-quality audio, underscoring the importance of continuous evolution in both fields. 

Another way of thinking about this is to compare the finishing process of the moving image with that of the audio content, that is, visual post-production versus the work on the audio elements: the original sound, foley, other sound FX, audio dubbing, and the soundtrack. Both focus on finessing and finishing.

Moving-image post-production focuses on the final edit (pace and storytelling), the look or color grade (emotional tone), and combined elements that enhance realism (compositing and VFX). Across the two processes, far more attention goes to the moving image than to the audio: 80% of the finessing is visual post-production and 20% is audio mixing and mastering.

The irony is that the audio enhances the storytelling experience exponentially due to the extra sensory perception of our ears compared to our eyes and visual system. 
 
An Audio Definition Model (ADM) immersive audio file enables interoperability with any playback system, acting as a universal container that uses metadata to inform the playback system the location from which a sound should appear to be coming.

What Are the Delivery Requirements for Immersive Audio?
As with mono and stereo, which are directly linked with the evolution of physical media throughout the 20th century, the digital transition enabled the adoption and commercial distribution of multichannel audio consumer formats. Digital television, DVD, and Blu-ray media contributed to the adoption of surround sound, even evolving from 5.1 to 7.1 surround speaker configurations. But unlike the success of surround sound in movies and TV production, attempts to convert music recording and distribution to multichannel were not as successful and were mostly abandoned with the demise of the two competing DVD-Audio (DVD-A) and Super Audio CD (SACD) physical disc formats.

In the 21st century, the adoption of file-based content production and distribution, followed by the introduction of streaming services, paved the way for the seamless adoption of multichannel audio and a renewed interest in immersive audio for music content.
 
Once an immersive audio file is created, dealing with critical processes such as loudness compliance, Dolby E encoding for distribution, downmixing or adding audio description requires understanding the file formats. That’s what Emotion Systems provides, adding automation tools for large quantities of files.

Immersive audio delivery is nuanced and relies on numerous factors, including the target platform, the distributor or exhibitor’s specifications, and the content type. A multifaceted approach is often required, keeping several considerations in mind.

First, the choice of audio format is critical. The audio must be delivered in an appropriate format, such as Dolby Atmos, Auro-3D, or DTS:X, as per the specific requirements of the project or platform. For television content, MPEG-H 3D Audio is an emerging audio coding standard with support for as many as 128 audio channels, audio objects, or higher-order ambisonics (HOA), which can be used to transmit immersive sound. MPEG-H is also the foundation for 360 Reality Audio, a music-specific immersive object-based format announced by Sony in 2019, and now available on TIDAL and Amazon streaming services, among other platforms.

Unlike the attempts to inspire the use of surround in music recording and distribution, immersive formats have been more widely embraced and Dolby Atmos Music and Sony 360 Reality Audio formats are currently available in mainstream music streaming platforms.

Second, the configuration of channels must align with the immersive audio format being used. For example, a 7.1.4 structure implies using seven traditional surround speakers, one subwoofer, and four overhead speakers. 
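The layout notation above can be read mechanically. The following is a minimal Python sketch, not taken from any vendor tool, that splits a label such as 7.1.4 into ear-level speakers, low-frequency channels, and overhead speakers. The names SpeakerLayout and parse_layout are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class SpeakerLayout(NamedTuple):
    """Speaker counts read from a label such as '7.1.4' (hypothetical helper type)."""
    ear_level: int  # traditional surround speakers around the listener
    lfe: int        # low-frequency effects (subwoofer) channels
    height: int     # overhead speakers; 0 when the label has no third field

    @property
    def total_channels(self) -> int:
        return self.ear_level + self.lfe + self.height


def parse_layout(label: str) -> SpeakerLayout:
    """Parse 'ear.lfe' or 'ear.lfe.height' notation into speaker counts."""
    parts = label.strip().split(".")
    if len(parts) not in (2, 3) or not all(p.isdigit() for p in parts):
        raise ValueError(f"unrecognized layout label: {label!r}")
    counts = [int(p) for p in parts]
    height = counts[2] if len(counts) == 3 else 0
    return SpeakerLayout(counts[0], counts[1], height)


if __name__ == "__main__":
    for label in ("5.1", "7.1.2", "7.1.4", "9.1.6"):
        layout = parse_layout(label)
        print(label, layout, "total:", layout.total_channels)
```

Read this way, 7.1.4 comes out as seven surround speakers, one subwoofer, and four overhead speakers, for 12 channels in total.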

Third, ensuring that the immersive audio delivered can function seamlessly with various playback systems is essential. This ranges from high-end theater setups to home theater systems and could extend to headphones, provided a binaural mix is supplied.

Next, correct coding and encoding for the chosen immersive sound format are fundamental. Additionally, the audio should comply with all relevant loudness and quality standards.
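As a rough illustration of what loudness compliance involves, the sketch below computes the static gain needed to move a program from a measured integrated loudness to a delivery target. It assumes the loudness has already been measured by a compliant meter, and the function names are hypothetical. The -23 LUFS target in the usage lines is the EBU R 128 program target and is used only as an example, since each distributor publishes its own specification.

```python
def loudness_correction_db(measured_lufs: float, target_lufs: float) -> float:
    """Gain in dB that moves the measured integrated loudness onto the target."""
    return target_lufs - measured_lufs


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in dB to the linear factor applied to each sample."""
    return 10.0 ** (gain_db / 20.0)


if __name__ == "__main__":
    measured = -19.5  # example reading from a loudness meter (assumed value)
    target = -23.0    # example target; real delivery specs vary
    gain_db = loudness_correction_db(measured, target)
    print(f"apply {gain_db:+.1f} dB (linear factor {db_to_linear(gain_db):.3f})")
```

A static gain change like this addresses only integrated loudness. True-peak limits and loudness range still need to be checked separately.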

The inclusion of essential metadata with the audio is another consideration. For instance, a Dolby Atmos mix is delivered as a traditional channel-based bed along with object-based metadata that outlines the movement and position of sound objects.

Consideration of legacy formats is also crucial when delivering immersive audio, as many users might not be able to play back immersive audio. Therefore, a high-quality stereo or surround downmix is often required.
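To make the idea of a downmix concrete, here is a minimal sketch of folding one 5.1 sample frame down to stereo. It uses the commonly cited coefficients from ITU-R BS.775 (center and surrounds attenuated by about 3 dB, LFE left out). It is an illustration under those assumptions, not the method of any product named in this article, and the function name and channel keys are hypothetical.

```python
import math
from typing import Mapping, Tuple

MINUS_3_DB = 1.0 / math.sqrt(2.0)  # about 0.7071


def downmix_51_to_stereo(
    frame: Mapping[str, float],
    center_gain: float = MINUS_3_DB,
    surround_gain: float = MINUS_3_DB,
) -> Tuple[float, float]:
    """Fold one 5.1 frame (keys L, R, C, LFE, Ls, Rs) down to (left, right).

    The LFE channel is deliberately left out, as in the basic BS.775 downmix.
    """
    left = frame["L"] + center_gain * frame["C"] + surround_gain * frame["Ls"]
    right = frame["R"] + center_gain * frame["C"] + surround_gain * frame["Rs"]
    return left, right


if __name__ == "__main__":
    frame = {"L": 0.2, "R": 0.1, "C": 0.5, "LFE": 0.3, "Ls": 0.05, "Rs": 0.0}
    print(downmix_51_to_stereo(frame))
```

The summed signal can exceed full scale, so a real workflow would follow this with level control. Delivery specifications may also call for different coefficients.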

Last, conducting exhaustive quality control checks is essential to ascertain that the immersive audio mix is free from technical issues and functions optimally on various systems. One of the most revolutionary aspects of these formats is their capability to serve as a universal master, translating seamlessly across multiple listening environments.

The main idea behind this is the transition from channel-based to object-based audio. In traditional audio formats such as stereo or surround sound, each audio element is assigned to a specific channel or set of channels. However, in immersive audio formats, each sound is treated as an individual “object” that can be placed and moved anywhere in the three-dimensional sound field.

A typical immersive audio file such as Audio Definition Model (ADM) can support up to 128 audio channels with 118 defined objects. The metadata accompanying each object informs the playback system where that sound should appear to be coming from.
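The two figures quoted above fit together if 10 of the 128 channels are reserved for a channel-based bed, which is what a 7.1.2 bed would use. That reading is an assumption made for this sketch, as are the class name, the function name, and the coordinate convention. The code is not a representation of the actual ADM schema.

```python
from dataclasses import dataclass

TOTAL_CHANNELS = 128  # figure quoted in the text


@dataclass
class AudioObject:
    """One sound object plus illustrative position metadata (hypothetical convention)."""
    name: str
    x: float  # left (-1.0) to right (+1.0)
    y: float  # back (-1.0) to front (+1.0)
    z: float  # ear level (0.0) to overhead (1.0)


def object_slots(bed_channels: int, total: int = TOTAL_CHANNELS) -> int:
    """Channels left for objects once the channel-based bed is accounted for."""
    if not 0 <= bed_channels <= total:
        raise ValueError("bed_channels must be between 0 and the channel total")
    return total - bed_channels


if __name__ == "__main__":
    bed_7_1_2 = 7 + 1 + 2  # assumed bed layout: 10 channels
    print("object slots:", object_slots(bed_7_1_2))
    flyover = AudioObject(name="helicopter", x=0.0, y=0.5, z=1.0)
    print(flyover)
```

With a 10-channel bed, 118 channels remain for objects, which matches the number in the text.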

This object-based approach allows immersive audio to act as a universal master. Regardless of the end user’s speaker setup, the system can use the metadata to best represent the three-dimensional audio field intended by the sound designer. For example, in a home theater setup with overhead speakers, the system could use those speakers to represent overhead sounds.

But even in a traditional stereo setup or a binaural headphone mix, the system can use metadata to create a sense of three-dimensional sound through techniques such as psychoacoustic panning and binaural processing.
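One simple way position metadata can be turned into speaker gains is a constant-power pan law, a generic textbook technique shown here only for the left-right axis of a stereo pair. This is a sketch under that assumption and does not describe how Dolby Atmos or any other named renderer works internally. The function name is hypothetical.

```python
import math
from typing import Tuple


def constant_power_pan(pan: float) -> Tuple[float, float]:
    """Return (left_gain, right_gain) for a pan position in [-1.0, 1.0].

    -1.0 is hard left, 0.0 is center, +1.0 is hard right. The squared gains
    always sum to 1, so perceived level stays roughly constant as a sound moves.
    """
    if not -1.0 <= pan <= 1.0:
        raise ValueError("pan must be between -1.0 (left) and 1.0 (right)")
    angle = (pan + 1.0) * math.pi / 4.0  # maps pan to the range 0 .. pi/2
    return math.cos(angle), math.sin(angle)


if __name__ == "__main__":
    for position in (-1.0, -0.5, 0.0, 0.5, 1.0):
        left, right = constant_power_pan(position)
        print(f"pan {position:+.1f}: L={left:.3f} R={right:.3f}")
```

At the center position both gains are about 0.707, or 3 dB down, which is why a centered sound does not get louder than one panned hard to either side.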

Essentially, instead of creating a separate mix for each potential playback format — a costly and time-consuming process — sound designers can create a single immersive audio mix that will translate effectively to all of them. This saves time and resources and ensures the audio experience is as close as possible to the original artistic intent, regardless of the listener’s setup. Thus, immersive audio provides a consistent, high-quality listening experience across platforms, making it a versatile and future-proof choice in audio content production and delivery.
 
Upmixing an original stereo mix to a more immersive mix of 5.1.2, 5.1.4, or even 7.1.2 is a common requirement that needs to be handled with attention to quality standards and format compliance.

How to Improve the Mastering and Versioning of Immersive Audio
The immersive audio experience has greatly revolutionized how we perceive audio-visual content. While this enriching experience brings us closer to reality, it poses several production and delivery challenges. Audio production and delivery technology have advanced in leaps and bounds to keep up with the increasing demand for quality content. Immersive audio adds a new layer of complexity to the audio post-production process.

This is where companies such as Emotion Systems provide significant efficiency gains by offering automated solutions for immersive audio processing and mastering.

Emotion Engine, an advanced automated audio processing and mastering system, is designed to handle the intricacies of immersive audio. The technology aims to streamline the audio post-production process and reduce the manual effort involved. Emotion Engine uses advanced audio algorithms to process and master immersive audio content.

Automation is crucial in managing the channels and objects involved in immersive audio content. Emotion Systems’ Engine automates the rendering and encoding of immersive audio mixes, whether they are 9.1.6, 7.1.4, or 7.1.2, enabling a seamless workflow from production to delivery.

One of the challenges with ADM is that it can carry up to 128 audio channels, which need to be rendered down to a channel-based format suitable for delivery. Fortunately, a 5.1.4 or 7.1.2 mix is included within the ADM file, allowing further downstream processing. 
 
Emotion Systems’ Engine automation workflow for the quality control of an immersive mix.

The Emotion Engine can process the 9.1.6 or 7.1.4 mix to create downmix versions for different delivery points, including 5.1.2, 5.1, Stereo and binaural stereo. This process allows for delivering immersive audio, even to setups that cannot handle object-based audio.

When working with immersive audio content, especially in theater presentations or high-production value broadcast situations, care and attention are required concerning what is played before and after the immersive experience. It is a common requirement to ensure no audio “dips” in the viewing experience, as this can be distracting and irritating or upsetting in the worst-case scenario.

Hence, balancing the immersive experience by upmixing the original stereo mix to a more immersive mix of 5.1.2, 5.1.4, or even 7.1.2 is paramount. Again, Engine has workflows for upmixing non-immersive audio content that reduce the distraction factor of experiencing non-immersive, immersive, and non-immersive audio in sequence.

The automation capability of Emotion Systems’ Engine also extends to encoding the rendered immersive audio. It can encode the rendered audio to Dolby Digital Plus, a popular codec for delivering immersive audio. This way, Engine can support an end-to-end workflow, from receiving the ADM file to encoding it for delivery.

In addition to rendering and encoding, Emotion Systems’ Engine offers automated solutions for the quality control of the immersive mix, another critical aspect of audio post-production. The platform uses advanced algorithms to identify issues and errors in the mix, reducing the need for time-consuming manual checks.

And Emotion Systems constantly reviews emerging technologies and workflows that require automation.

In essence, automation capabilities offer an effective and efficient solution to handle the complexities of immersive audio processing and mastering, making it an indispensable tool for content creators and distributors.

Immersive audio offers an unprecedented opportunity to dramatically increase the emotional response to enjoying entertainment in the cinema, personal listening, and TV. Emotion Systems’ production-proven tools significantly improve efficiency when processing large volumes of immersive audio content for multiple distribution outlets and devices, while maintaining quality and compliance.

This article was originally published in audioXpress, September 2023.

 
About MC Patel
MC Patel is an industry veteran with nearly 50 years in the film and television industry. A graduate of Essex University in the UK, his final year project was displaying peak program meters on a TV display. During his time in the television industry, MC has b...
