Speech Quality in Conferencing Scenarios: Understanding All Optimization Requirements

June 25 2020, 12:45

We benefit from being able to communicate anytime and anywhere, and we use all kinds of communication platforms to interact in virtual conferences without being physically present. At the same time, we experience significant variations in communication quality, especially in virtual meetings. The question is: What is communication quality, and which factors distinguish good communication quality from poor?

When defining communication quality, you will find that a variety of speech quality aspects need to be considered from the user's point of view: listening speech quality and associated artifacts, such as interrupted and distorted voice; the delay in transporting the speech signal from one end to the other, which hampers the conversational flow; echo and howling effects, especially in multi-party conferences; not being able to interrupt conversational partners; and more.

Additionally, the user's environment and the user's behavior play an important role. Adverse conditions, noise, and reverberation may impair the speech quality on all sides of a virtual meeting. User movements, non-optimum positioning of devices, and their effects on communication quality need to be understood. Perceptually, latency, speech intelligibility and listening effort, loudness, speech sound quality, talking effort and echo performance, double-talk performance, and localization performance are all relevant parameters deserving technical investigation.

These parameters are perceptually relevant and can be evaluated subjectively. However, for (objective) system test and product optimization, procedures are required to help engineers in a lab-type environment. The instrumental procedures used nowadays to qualify and optimize products are based on perceptual investigations and can be applied with a good degree of confidence by engineers in the lab.

Laboratory-Based Testing
Technically, the conversational quality is determined by the quality of every device used in a connection. Transparency and compatibility are key when devices are interconnected. The quality of the communication terminal, the network quality, and the quality of the conference bridge are of equal importance.

The minimum requirements found in existing standards, which device manufacturers typically apply, are insufficient and generally do not cover this complex scenario. For optimization, a variety of additional technical parameters can be measured and used.

The basic optimization process starts with the endpoints of the connection, the terminals - regardless of which type of terminal is used: handset or headset (wired and cordless), hands-free, conference system, smart speaker, PC, tablet or other smart devices.

Realistic Simulation of Real-Life Situations
In this context, it is also necessary to take into account user behavior as well as the user's environment. Such tests cannot practically be performed in real-life situations, but fortunately there are laboratory-based solutions available.

Background noise simulation is important to validate the performance of noise cancellation and voice enhancement strategies for a variety of background noises found in daily life scenarios. ETSI TS 103 224 [1] specifies the recording and setup procedures as well as pre-recorded background noise for those simulations. The HEAD acoustics 3PASS simulation system enables this simulation for all types of terminals with acoustical interfaces.
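As a rough illustration of one ingredient of such tests, the sketch below scales a noise signal so that a chosen speech-to-noise ratio is reached before mixing. It is not the sound field reproduction method of ETSI TS 103 224 [1], which works acoustically with loudspeakers around the device. The signals, sampling rate, target SNR and function names are all hypothetical stand-ins.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def noise_gain_for_snr(speech, noise, target_snr_db):
    """Gain for the noise so that speech RMS / noise RMS matches the target SNR."""
    return rms(speech) / (rms(noise) * 10 ** (target_snr_db / 20))


def measure_snr_db(speech, noise):
    """SNR in dB, computed from the separate speech and noise components."""
    return 20 * math.log10(rms(speech) / rms(noise))


# Hypothetical stand-ins for a speech recording and a background noise recording.
fs = 16000
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
noise = [rng.gauss(0.0, 0.1) for _ in range(fs)]

gain = noise_gain_for_snr(speech, noise, target_snr_db=10.0)
scaled_noise = [gain * n for n in noise]
# Purely electrical mix; a lab setup would reproduce the noise acoustically instead.
mixed = [s + n for s, n in zip(speech, scaled_noise)]

print(f"noise gain: {gain:.3f}, resulting SNR: {measure_snr_db(speech, scaled_noise):.2f} dB")
```

Repeating the mix over several target SNRs and noise types gives a simple way to see how a noise reduction algorithm behaves as conditions worsen.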

An identical setup can be used for the simulation of reverberation, which is of special importance in the sending direction. Reverberation may impair speech communication as well as speech recognition. A simulation method for lab-based testing with reverberation, according to ETSI TS 103 557 [2], is used for that type of simulation.
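The effect of reverberation on a send signal can be approximated offline by convolving a dry signal with a room impulse response. The sketch below assumes a synthetic impulse response, modeled as a noise burst with an exponential decay set by a reverberation time (RT60). This is a common simplification and not the reproduction method of ETSI TS 103 557 [2]. All names and parameter values are hypothetical.

```python
import math
import random


def synthetic_rir(rt60_s, fs, length_s, seed=0):
    """Noise burst whose envelope drops by 60 dB after rt60_s seconds."""
    rng = random.Random(seed)
    decay = math.log(1000.0) / rt60_s  # 60 dB in amplitude is a factor of 1000
    rir = [rng.gauss(0.0, 1.0) * math.exp(-decay * n / fs)
           for n in range(int(length_s * fs))]
    rir[0] = 1.0  # direct sound
    return rir


def convolve(signal, impulse_response):
    """Direct-form linear convolution (slow, but dependency-free)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, x in enumerate(signal):
        if x == 0.0:
            continue
        for j, h in enumerate(impulse_response):
            out[i + j] += x * h
    return out


fs = 8000
rir = synthetic_rir(rt60_s=0.4, fs=fs, length_s=0.1)
dry = [math.sin(2 * math.pi * 300 * n / fs) for n in range(400)]  # hypothetical dry tone
wet = convolve(dry, rir)  # reverberant version of the dry signal

print(len(dry), len(rir), len(wet))
```

For real measurements, a measured impulse response or the acoustic reproduction described in the standard would replace the synthetic one.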

User behavior can also be tested in different ways for different devices. A motorized handset positioner is used for testing positional robustness with all types of handheld devices. The efficiency of noise cancellation can be evaluated and optimized by simulating various user positions. The goal is to find a good balance between loudness, speech sound quality, and the amount of background noise reduction in various positions and for various noise and room conditions. In a similar way, the speech sound quality and the listening effort [3] can be optimized on the receiving side.

Testing Conversational Performance
Testing the robustness of echo cancellation requires a variable echo path during the tests. An almost soundless rotating reflector can be used in the near field for this purpose. The conversational performance of devices is determined by the delay introduced and by seamless behavior in all types of double-talk situations, with and without background noise. During double-talk, no artifacts from switching or echo should be observed, and performance should not degrade with background noise. The signal processing quality in a terminal largely determines all these parameters and can be optimized in the lab. A very good overview of testing and performance requirements can be found in the ETSI standard series [4] and [5]. A setup example is shown in Figure 1.

Figure 1: VoIP conferencing test setup with headsets.

Testing of headphones follows the same principle. Various positions typical for users are chosen for the tests as described above.
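The delay mentioned in this section can be estimated, in principle, by cross-correlating the signal sent into a connection with the signal received at the far end. The sketch below shows that general idea on synthetic data. It is not the delay measurement procedure of the ETSI standards cited above, and the signals, sampling rate and function name are hypothetical.

```python
import random


def estimate_delay_samples(reference, received, max_lag):
    """Return the lag (in samples) at which the cross-correlation is largest."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        overlap = min(len(reference), len(received) - lag)
        score = sum(reference[i] * received[i + lag] for i in range(overlap))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


fs = 8000
rng = random.Random(1)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # hypothetical test signal
true_delay = 120  # samples, i.e. 15 ms at 8 kHz
# Received signal: attenuated copy of the reference, shifted by the true delay.
received = [0.0] * true_delay + [0.5 * x for x in reference]

lag = estimate_delay_samples(reference, received, max_lag=400)
print(f"estimated delay: {1000 * lag / fs:.1f} ms")
```

A broadband test signal is assumed here because its correlation peak is sharp; with real speech and nonlinear processing in the path, the peak is usually less distinct.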

Measuring Speech Coding Performance
In speakerphone mode (e.g., with hands-free, laptop, tablet and handheld devices) and with conference devices, user behavior impact and the acoustical environment are even more critical. The acoustical coupling between loudspeaker(s) and microphone(s) is strong, leading to potential echo problems. User movements affect echo canceling performance more than with handheld devices. The distance between the talker and the microphone is much greater, and as a consequence, room noise and room reverberation have a much stronger impact on the performance than in handheld mode.
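A back-of-the-envelope calculation shows why distance matters. Assuming idealized free-field propagation from a point source, the direct speech level at the microphone falls by 20·log10 of the distance ratio, while room noise and the diffuse reverberant field stay roughly constant. The distances below are hypothetical examples.

```python
import math


def direct_level_change_db(near_m, far_m):
    """Change in direct-sound level when moving from near_m to far_m (free field)."""
    return -20 * math.log10(far_m / near_m)


# Hypothetical distances: 5 cm (handset-like) versus 50 cm (speakerphone-like).
change = direct_level_change_db(0.05, 0.5)
print(f"direct speech level change: {change:.1f} dB")
```

Under these assumptions, each doubling of the talker distance costs about 6 dB of direct speech level, and with it about 6 dB of speech-to-noise ratio at the microphone.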

However, testing strategies similar to those for handheld mode can be used: simulation of reverberation and background noise ([1], [2]), simulation of user movement with a rotating reflector, and a turntable to rotate the device under test. The metrics used for qualification are the same as in handheld mode, and the parameters determining the quality from the user's point of view are almost the same.

Evaluating Network Performance
Speech coding performance in conjunction with network performance is essential for all types of endpoints, including modems. Since most conferencing solutions are over-the-top (OTT) solutions, no bandwidth guarantee is given, and channel quality may vary depending on load conditions and routing. Lab-based testing is performed by applying different types of network jitter and packet loss simulations to the speech codec. The parameters tested are listening speech quality and jitter buffer handling. With jitter and packet loss, it is key to limit the increase of the jitter buffer size while simultaneously keeping the speech quality high. This can be achieved by optimizing the jitter buffer handling and the packet loss concealment, which today is typically part of the speech codec used.
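The trade-off between jitter buffer size and speech quality can be illustrated with a small simulation. The sketch below assumes a fixed (non-adaptive) jitter buffer and a hypothetical network whose delay is a constant base plus exponentially distributed jitter. It counts how many packets arrive too late for playout at several buffer sizes. Real jitter buffers adapt over time, and the delay model and all numbers here are assumptions for illustration only.

```python
import random


def late_loss_rate(network_delays_ms, buffer_ms):
    """Fraction of packets that miss a playout deadline of (minimum delay + buffer_ms)."""
    min_delay = min(network_delays_ms)
    late = sum(1 for d in network_delays_ms if d - min_delay > buffer_ms)
    return late / len(network_delays_ms)


rng = random.Random(42)
# Hypothetical network: 40 ms base delay plus exponential jitter with a 15 ms mean.
delays = [40.0 + rng.expovariate(1 / 15.0) for _ in range(5000)]

# A larger buffer adds delay but lets more packets arrive in time.
results = {buf: late_loss_rate(delays, buf) for buf in (20, 40, 60, 80)}
for buf, rate in results.items():
    print(f"buffer {buf:3d} ms -> late packets {100 * rate:5.2f} %")
```

Every late packet has to be covered by packet loss concealment, so a smaller buffer shifts the burden from delay to concealment quality.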

Help Is Available
All the procedures and setups described above have been key to HEAD acoustics' development work over the last few decades. Complete sets of automated test procedures and simulations are available for laboratory use, for all types of devices and configurations. In addition, setups for the optimization of the complete transmission chain are available. There is considerable potential for quality improvement if such advanced optimization strategies are used.

HEAD acoustics provides complete and automated test systems to bring the user's environment to the lab. The most advanced tests and simulation methods are available and can be used for device and system optimization. When successfully applied, they can deliver a much better user experience than is often found today.

HEAD acoustics developed ABLE (Assessment of Binaural Listening Effort), a method to predict perceived listening effort that evaluates the influence of background noise on speech signals.

References:
[1] ETSI TS 103 224: Speech and Multimedia Transmission Quality (STQ): A sound field reproduction method for terminal testing including a background noise database, ETSI 08/2019

[2] ETSI TS 103 557: Speech and Multimedia Transmission Quality (STQ): Methods for reproducing reverberation for communication device measurements, ETSI 08/2019

[3] Gierlich, H.W.; Reimes, J.: Objective Listening Effort Evaluation, audioXpress 04/2020

[4] ETSI ES 202 737 - ETSI ES 202 740 series: Speech and multimedia Transmission Quality (STQ); Transmission requirements for narrowband and wideband VoIP terminals (handset, headset & hands-free) from a QoS perspective as perceived by the user, ETSI 03/2020

[5] ETSI TS 102 925: Transmission requirements for Super-Wideband/Fullband hands-free and conferencing terminals from a QoS perspective as perceived by the user, ETSI 10/2018
