Subjective and objective assessment of sound quality: solutions and applications

Carlos Herrero
HUT, Telecommunications Software and Multimedia Laboratory
Carlos.Herrero@hut.fi

Abstract

The aim of this paper is to review current research projects and recommendations related to the subjective and objective assessment of sound quality. The paper describes the problems and limitations of subjective testing, shows the results of evaluating the ITU-R objective audio quality measurement method, and presents different application domains and recent research in this field.

Table of Contents

1 Introduction
2 Subjective assessment of sound quality
  2.1 Review of ITU-R recommendations related to subjective testing
    2.1.1 Recommendation ITU-R BS.1116 – Small impairments
    2.1.2 Recommendation ITU-R BS.1534 – Intermediate quality
  2.2 Limitations and problems of subjective tests
3 Objective measurement of sound quality
  3.1 Overview of ITU-R recommendations: PEAQ and PESQ
  3.2 Evaluation of PEAQ
  3.3 Application domains of objective measurement of sound quality
  3.4 Beyond PEAQ and PESQ
4 Conclusions

1 INTRODUCTION

Standards and recommendations on sound quality assessment are needed to compare the performance of different audio systems and hardware correctly. The main goal of this paper is to introduce and explain objective measurements and subjective assessments of audio signals. The first part of the paper is dedicated to the ITU-R recommendations related to subjective testing. It also discusses their problems and limitations, which serves as a bridge to the second part, where objective measurement methods are explained.

2 SUBJECTIVE ASSESSMENT OF SOUND QUALITY

The digital audio chain contains several stages and pieces of equipment: microphone, recording, coding, transmission, decoding and loudspeakers. Linear and nonlinear errors accumulate along the audio chain. For recording, coding, decoding and transmission systems the goal is that the audio signal that comes out of the system should sound exactly like the input. The representation of audio signals by Pulse Code Modulation (PCM) can be made arbitrarily good simply by increasing the word-length, and transmission and recording of the PCM signals can be made arbitrarily precise by using appropriate error correction. In these stages of the audio chain there is always a design trade-off between computational complexity and audio quality. The converters, however, located at the beginning and at the end of the process, have analog limitations. Quantizers are inherently non-linear, and sample rate conversion is needed to create consumer versions (at a 44.1 kHz sample rate) from professional recordings (48 kHz).
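As a rough illustration of the claim above that PCM quality improves with word-length, the theoretical signal-to-noise ratio of an ideal uniform quantizer driven by a full-scale sinusoid grows by about 6 dB per added bit. This is a standard textbook approximation, not a figure taken from the paper:

SNR_PCM ≈ 6.02 · N + 1.76 dB, with N the word-length in bits,

so 16-bit PCM gives roughly 98 dB, and each additional bit buys about another 6 dB of headroom over the quantization noise.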
Nowadays, the storage and transmission of music over the Internet depend increasingly on lossy audio compression algorithms, which exploit the properties of the human auditory system through psychoacoustic models. With sufficiently high bit rates it is possible to keep the resulting coding distortions below the threshold of hearing, but on many occasions those distortions can still be easily detected. One of the best ways to compare coding algorithms, recording and transmission systems, or microphones and loudspeakers, is to use standardized methods for the subjective assessment of audio signals. Those methods have historically been defined by the International Telecommunication Union (ITU), and different recommendations have been proposed for different purposes, as can be seen in the following section.

2.1 Review of ITU-R recommendations related to subjective testing

The methods used for the subjective assessment of audio quality itself, and of the performance of audio systems, depend somewhat on the intended purpose of the assessment. Some recommendations apply when audio signals are assessed together with pictures; when this is not the case, two main situations can occur. For small impairments, Recommendation ITU-R BS.1116 (ITU-R, 1997) is used. For evaluating audio signals of intermediate quality, Recommendation ITU-R BS.1534 (ITU-R, 2001), also known as MUSHRA, is the preferred method. These are the most commonly used recommendations and are described in detail in the following subsections.

2.1.1 Recommendation ITU-R BS.1116 – Methods for the subjective assessment of small impairments in audio systems, including multichannel sound systems

This method is intended for the assessment of systems which introduce small impairments. Those impairments can be so small that detecting them requires rigorous control of the experimental conditions and appropriate statistical analysis. If the analyzed systems introduce relatively large and easily detectable impairments, using Recommendation ITU-R BS.1116 leads to an excessive expenditure of time and effort, and the results may also be less reliable than those obtained with a simpler test method. This recommendation is a basic reference for the other subjective assessment recommendations, which may contain additional special conditions or relaxations of the ITU-R BS.1116 requirements. The result of a test conducted according to Recommendation ITU-R BS.1116 is the basic audio quality of the system under test.

During the test, the listener is free to listen to any of three audio sources, one of which is known to be the reference signal. The other two sources may be either the test signal or the reference signal again. Listeners must be extensively trained; they are asked to rate those two sources in relation to the known reference signal. One of the unknown sources is the hidden reference, and should therefore be indiscernible from the known reference, while the other may reveal impairments. A continuous five-grade impairment scale is defined, where 5.0 means imperceptible impairments, 4.0 perceptible but not annoying, 3.0 is given to slightly annoying impairments, 2.0 to annoying and 1.0 to very annoying impairments. For the statistical analysis, the listener's ratings are transformed into a single value, called the subjective difference grade (SDG), defined as the difference between the grade given to the signal under test and the grade given to the reference signal.
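Written out, this grading step amounts to a simple difference of the two grades on the continuous five-grade scale:

SDG = Grade(signal under test) − Grade(hidden reference),

so the SDG can range from 0 down to −4.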
The SDG is 0 when the signal under test contains imperceptible impairments, or no impairments at all, while an SDG of -4 indicates that the test signal contains very annoying impairments.

2.1.2 Recommendation ITU-R BS.1534 – Method for the subjective assessment of intermediate quality level of coding systems

A different recommendation is intended to cover the subjective assessment of coding systems with intermediate quality levels. The advent of Internet multimedia has stimulated the development of several advanced audio and video compression technologies. To be usable on the Internet, audio content must be coded at extremely low bit-rates, while preserving the subjective quality of the original signal as far as possible. Current Internet audio codecs show a large variation in the audio quality achieved at different bit-rates and for different audio signals. Subjective listening tests using a number of qualified listeners and a selection of audio sequences are still recognized as the most reliable way of assessing quality. However, the test method defined in ITU-R BS.1116, explained above, is not suitable for assessing such lower audio qualities; it is generally too sensitive, leading to a grouping of results at the bottom of the continuous five-grade impairment scale. Thus, the EBU Project Group B/AIM proposed a new test method, called MUSHRA (MUlti Stimulus test with Hidden Reference and Anchors). The method was designed to give a reliable and repeatable measure of the audio quality of intermediate-quality signals. It was later standardized by the ITU-R and has been frequently used since.

Whereas ITU-R BS.1116 uses a "double-blind triple-stimulus with hidden reference" test method, MUSHRA is a "double-blind multi-stimulus" test method with a hidden reference and hidden anchors. The first method is adequate when the test signal presents only small impairments and the assessor is asked to detect any perceptible annoyance caused by artifacts and distortions in the signal. When the test signal contains large impairments, the assessor has no difficulty detecting the artifacts, but must also grade the relative annoyance of the various artifacts, which is a more difficult task. The perceptual distance between the reference and the test items is expected to be relatively large. Thus, if each system is only compared with the reference, the differences between any two systems may be too small to discriminate between them. Consequently, MUSHRA uses not only a high-quality reference but also direct paired comparisons between the different systems. The assessor can switch at will between the reference signal and any of the systems under test. Because the assessors can directly compare the impaired signals, they can relatively easily detect differences between them and grade them accordingly. This feature permits a high degree of resolution in the grades given to the systems.

The grading scale used in the MUSHRA process is different from the one used in ITU-R BS.1116; it employs the scale traditionally used for the evaluation of picture quality, the five-interval Continuous Quality Scale (CQS). The intervals are described from top to bottom as Excellent, Good, Fair, Poor and Bad. The listeners record their assessments of the audio quality in a suitable form, for example using sliders on an electronic display.

Figure 1. User interface for a MUSHRA test. (Stoll, 2000)
Figure 1 shows the user interface which was used for MUSHRA tests during an evaluation of Internet audio codecs by the EBU B/AIM Project Group. The buttons represent the reference, which is displayed separately at the top left, and all the signals under test, including the hidden reference and two anchors, which are low-pass filtered versions (3.5 kHz and 7 kHz) of the unprocessed signal. Under each button, with the exception of the button for the reference, a slider is used to grade the quality of the test item according to the CQS. Sliders are typically 10 cm long or more, with an internal numerical representation in the range of 0 to 100. That is important because the statistical analysis of the obtained results is perhaps one of the most demanding tasks. The scores given by each listener are normalized and combined with the other listeners' scores. Averaging those scores results in the Mean Subjective Score (MSS) for that signal. While the SDG values obtained with ITU-R BS.1116 vary from 0 (excellent) to -4 (very annoying), the MSS values range from 0 to 100, where 0 corresponds to the bottom of the scale (bad quality).

Figure 2 depicts one example of MUSHRA test results. There were six signals under test: one hidden reference, which always gets the maximum score; two anchors, which get scores of 30 (3.5 kHz) and 55 (7 kHz); and, finally, three coded audio signals, which get better scores for higher bit-rates.

Figure 2. AMR and AAC codecs compared with a MUSHRA test. (Seppänen, 2004)

2.2 Limitations and problems of subjective tests

The use of the human as an acoustic measuring device has many well-known disadvantages, the most important among them being the variety and variability of listeners (Rothauser, 1966). Moreover, in order to obtain reliable data, formal subjective tests should be performed under optimal listening conditions, using careful experimental procedures and a sufficient number of expert listeners. Because of these constraints, many situations can arise where such listening tests are impractical (Treurniet, 2000).

In most cases the result of a listening test is presented as a statement of the mean value of the listeners' responses and of the variance of these responses. Even when these data are given, the general significance of the result is still unknown. Questions like the following have at least to be considered: What are the largest tolerable variances and the smallest number of listeners for which the given mean value can really be representative of a greater community of people? Were the listeners selected randomly, or with regard to some special considerations such as "students of HUT with normal hearing"? Was the group of listeners trained, and to what extent?
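As a minimal numerical sketch of the statistical treatment discussed above, the following Python snippet turns a matrix of listener grades into a per-item mean score and a simple t-based 95 % confidence interval. The function name, the data layout and the plain t-interval are illustrative assumptions; the recommendations prescribe their own normalization and statistical procedures, which are not reproduced here.

```python
import numpy as np
from scipy import stats

def item_statistics(grades, confidence=0.95):
    """Mean score and confidence-interval half-width per test item.

    grades: 2-D array of shape (n_listeners, n_items), e.g. MUSHRA scores
            on the 0-100 scale or BS.1116 difference grades.
    """
    grades = np.asarray(grades, dtype=float)
    n = grades.shape[0]                              # number of listeners
    mean = grades.mean(axis=0)                       # mean score per item
    sem = grades.std(axis=0, ddof=1) / np.sqrt(n)    # standard error of the mean
    t = stats.t.ppf(0.5 * (1 + confidence), df=n - 1)
    return mean, t * sem                             # CI is mean +/- half-width

# Example: five listeners grading three items on a 0-100 MUSHRA-style scale
scores = [[92, 35, 60], [88, 30, 55], [95, 40, 62], [90, 28, 58], [85, 33, 57]]
means, half_widths = item_statistics(scores)
print(means, half_widths)
```

The half-widths make explicit how the reliability of the reported mean depends on the number of listeners and the spread of their responses, which is exactly the concern raised by the questions above.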
Another fact to be taken into account is the ambiguity of the questioning: the interpretation of each subjective term is left to the listener. Besides, translating the questionnaire into different languages can cause the same test to give different results for different audiences. Standard and reproducible subjective measurement procedures have been defined in the MUSHRA and ITU-R BS.1116 recommendations. These minimize the risk of audience-dependent results, although they are very expensive in terms of cost and time. Because of the limitations and problems presented above, reliable methods for the objective measurement of perceived audio quality are highly desirable. Some of them are presented in the following section.

3 OBJECTIVE MEASUREMENT OF SOUND QUALITY

Devising a method for predicting an average subjective quality rating using only objective measurements of audio signal characteristics is a significant challenge. It must include an accurate model of psychoacoustic processes in order to predict the detectability of near-threshold stimuli in various audio contexts, and it must also include knowledge about the cognitive aspects of audio quality judgments.

Such methods were first devised for speech codecs, and later for wide-bandwidth signals. Several psychoacoustic models were proposed, for both narrow-band and wide-band audio, and the emergence of various approaches emphasized the need for standardized methods. First, in 1996, Recommendation P.861 (ITU-R, 1996) was presented, describing an objective quality assessment algorithm for telephone-band speech codecs; this line of work later evolved into PESQ (Perceptual Evaluation of Speech Quality). Later, in 1998, an algorithm for the objective measurement of wide-band audio signals was presented in Recommendation ITU-R BS.1387 (ITU-R, 1998), also called PEAQ (Perceptual Evaluation of Audio Quality). Both measurement systems are described below, together with a comparison of results obtained with subjective tests and objective measurements, which serves to evaluate the validity of PEAQ. The section finishes with some recent progress in this field, research that goes beyond the PESQ and PEAQ methods, as well as remaining challenges.

3.1 Overview of ITU-R recommendations related to objective measurement of sound quality: PEAQ and PESQ

The PEAQ and PESQ methods were not built from scratch, but by combining ideas from several proposed methods (Thiede, 2000). For example, the road to the final standardization of PEAQ was as follows. In 1994 the ITU-R initiated a process to identify and recommend a method for the objective measurement of perceived audio quality. The first task was to create a committee that should clarify the expected applications of such a method, examine the performance of existing methods, and describe the method selected or, if existing methods were found to be inadequate, the new method created to meet the performance requirements. A call for proposals resulted in responses from seven model proponents, and their performances were compared. No single model was significantly better than all of the others, so the original proponents collaborated to develop a new, improved model called PEAQ. Finally, two versions of the method were developed: the Basic Version of PEAQ is intended to be fast enough for real-time monitoring, whereas the Advanced Version requires more computational power to achieve higher reliability.

A high-level representation of the PEAQ model is shown in Figure 3. In general it compares a signal that has been processed in some way with the corresponding original signal. Concurrent frames of the original and processed signal are transformed into a time-frequency representation by the psychoacoustic model. Then a task-specific model of auditory cognition reduces these data to a number of model output variables (MOVs), and finally those scalar values are mapped to the desired quality measurement. The psychoacoustic model in the Basic Version uses a Discrete Fourier Transform (DFT) to transform the signal into a time-frequency representation; the Advanced Version uses both a DFT and a filter bank. The data from the DFT are mapped from the frequency scale to a pitch scale, the psychoacoustic equivalent of frequency.
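To make the frequency-to-pitch mapping concrete, one widely used approximation of the critical-band (Bark) scale is the Zwicker-Terhardt formula; PEAQ's internal mapping differs in its details, so this is only an illustration of the general idea:

z(f) ≈ 13 · arctan(0.00076 f) + 3.5 · arctan((f / 7500)^2)  [Bark], with f in Hz.

For example, 1 kHz maps to roughly 8.5 Bark while 10 kHz maps to roughly 22 Bark, reflecting the ear's decreasing frequency resolution toward high frequencies.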
For the filter bank, the frequency-to-pitch mapping is implicitly taken into account by the bandwidths and spacing of the bandpass filters.

Figure 3. High-level description of the model. (Treurniet, 2000)

The psychoacoustic model of PEAQ produces two different representations of the input signals. Those representations are compared by the cognitive model to calculate the MOV values, which summarize psychoacoustic activity over time. Important information for making the quality measurement is derived from the differences between the frequency- and pitch-domain representations of the reference and test signals. In the frequency domain, the spectral bandwidths of both signals are measured and the harmonic structure of the error is determined. In the pitch domain, error measures are derived from the excitation envelope modulations, the excitation magnitudes, and the excitation derived from the error signal calculated in the frequency domain. The MOVs are used by the model to predict the subjective quality rating that would be assigned to the processed signal in a formal listening test based on ITU-R BS.1116. This prediction of the SDG is called the objective difference grade (ODG), and it has the same meaning as the SDG value explained in section 2.1.1. The PEAQ quality measurement is based on eleven MOVs for the Basic Version and on five variables for the Advanced Version. The transformations from the MOVs to the ODG were optimized using data from previously conducted listening tests (Thiede, 2000).

Similarly, the model for the perceptual evaluation of speech quality, or PESQ, is based on an integration of two previous models: the perceptual speech quality measure, known as PSQM, and the perceptual analysis measurement system, PAMS. PESQ uses a psychophysical model of the human hearing system, as well as a cognitive model. The quality score is based on the average distance between the transforms of the reference and degraded signals. The quality score that PESQ produces is a prediction of perceived listening quality based on the absolute category rating (ACR) method. In this method, listeners hear a number of degraded recordings and are prompted to vote on each one according to an opinion scale such as the 5-point listening quality (LQ) scale, in which 1 means bad quality, 2 poor, 3 fair, 4 good and 5 excellent quality. The ACR method with the LQ opinion scale is the most commonly used method in telecommunications assessment, and was the primary focus during the development of PESQ (Rix, 2000).

3.2 Evaluation of PEAQ

The previous section described how the PEAQ and PESQ methods produce estimates of perceived audio or speech quality; this section discusses the validity of PEAQ results. Different evaluation tests were performed to assess the performance of the seven model proponents at the beginning of the PEAQ standardization process, comparing their results with the data available from multiple subjective listening tests. In order to compare the performance of the different models or model versions, a number of different criteria are relevant (Thiede, 2000):

- Tolerance scheme. A tolerance scheme was designed to weight deviations of the ODG values from the SDG values differently at the upper and lower ends of the impairment scale. This is because a difference of 0.5 grade does not have the same significance near the lower end of the quality scale as near the upper end.
A tolerance region is then created, which is related to the confidence intervals (CI) of the listening tests. The average distance from the ODGs outside the tolerance region to its boundaries is one criterion for evaluating measurement methods. As can be seen in Figure 4, errors need to be larger for lower-quality signals than for high-quality signals in order to have an effect on the average.

Figure 4. Tolerance region (minimum confidence interval = 0.25). (Thiede, 2000)

- Correlation. The correlation coefficient is often used to express the strength of the linear relationship between two variables. Furthermore, the squared correlation coefficient is a measure of the variance in one variable accounted for by the variance in the other. Since a linear relationship is expected between the SDG and ODG variables, the correlation coefficient should be a useful criterion. However, the magnitude of the correlation can be affected drastically by the presence of a few extreme outliers, so this criterion should not be used in isolation.

- Absolute Error Score. The absolute error score (AES) was introduced to relate the accuracy of a model to the accuracy of the listening test. The AES value is calculated in a similar way to the correlation, but it also depends on the confidence interval, which is different for each SDG value. Again, the AES gives useful hints, but it should not be used in isolation to measure overall performance.

- Number of outliers. The number-of-outliers criterion is based on the premise that any prediction error exceeding the tolerance region boundaries is as severe as any other, independent of the absolute value of the error. This method consists of simply counting all occurrences of errors larger than the SDG confidence interval, where the limits for the allowed error margin are normally asymmetric.

Some of the previous criteria describe how much an algorithm fails, whereas others describe how often it fails. The following figures show the relation between subjective quality and the signal-to-noise ratio (SNR), and between subjective quality and both versions of PEAQ (Thiede, 2000). The solid lines represent the tolerance region. From the figures we can conclude that the Advanced Version of PEAQ is superior to the Basic Version, and that the SNR is clearly not a viable measure of quality for audio signals.

Figure 5. Relation between SDG and SNR, Advanced and Basic PEAQ. (Thiede, 2000)

The following example (Treurniet, 2000) shows another evaluation experiment. In this case the performance assessment was divided into two parts: comparison by audio items and comparison by systems. In the first part, 21 expert listeners evaluated the quality of eight audio items processed by 17 systems, where a system is defined as a codec operating at a particular bit rate (six codecs were studied). By averaging over listeners, the subjective data set was reduced to 136 mean SDGs. The performance of the objective measurement method (PEAQ) was evaluated by predicting the mean subjective quality rating for each item-by-system condition. Figure 6 shows the relationship between the mean SDG and the ODG for the 136 items, i.e., the comparison by audio items. The linear correlation between these variables is 0.85, and the slope of the regression line is 0.79. Perfect correspondence between the objective measurements and the subjective quality ratings was not achieved, since not all of the data points fall on the diagonal. However, the objective measurements agree reasonably well with the subjective quality grades.
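To illustrate the kind of comparison described above, the following Python sketch computes the linear correlation, the regression slope, and a simple outlier count between arrays of SDG and ODG values. The fixed symmetric tolerance of 0.25 grade is a placeholder assumption, not the asymmetric, confidence-interval-based tolerance scheme used in the actual evaluations.

```python
import numpy as np

def compare_sdg_odg(sdg, odg, tolerance=0.25):
    """Schematic comparison of subjective (SDG) and objective (ODG) grades.

    sdg, odg: 1-D arrays of grades on the 0 .. -4 impairment scale.
    tolerance: allowed prediction error; real evaluations derive an
               asymmetric tolerance from the listening-test confidence intervals.
    """
    sdg, odg = np.asarray(sdg, float), np.asarray(odg, float)
    r = np.corrcoef(sdg, odg)[0, 1]              # linear correlation coefficient
    slope, intercept = np.polyfit(sdg, odg, 1)   # regression line ODG = slope*SDG + intercept
    errors = np.abs(odg - sdg)
    outliers = int(np.sum(errors > tolerance))   # predictions outside the tolerance
    return {"correlation": r, "slope": slope, "outliers": outliers}

# Example with made-up grades for a handful of items
print(compare_sdg_odg([-0.2, -1.0, -2.4, -3.1], [-0.3, -0.8, -2.9, -3.0]))
```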
Some noticeable outliers suggest that the accuracy of the objective measurement method may be influenced by the nature of the audio material. An investigation of the most severe outliers indicated that they are due to two codecs processing two audio items.

Figure 6. Correlation of mean-item ODG with SDG (r = 0.85). (Treurniet, 2000)

The overall subjective quality of a particular system was defined as the average of the mean SDGs for the eight items processed by that system. The corresponding overall objective quality measurement was obtained by averaging the ODGs for the same eight items. Figure 7 shows the relationship between the 17 overall mean SDGs and ODGs, i.e., the comparison by system. The linear correlation is 0.97 and the slope of the regression line is 0.95.

Figure 7. Correlation of system ODG and SDG (r = 0.97). (Treurniet, 2000)

The comparison by systems shows a much stronger correlation than the comparison of grades for individual audio items. This can be understood as a consequence of averaging over subsets of audio items. Figure 8 shows the difference between the overall mean SDG and ODG for each of the systems. A positive value indicates that PEAQ underestimated the quality rating for that system, whereas a negative value indicates the opposite. It can be seen from this figure that the absolute value of this difference is always less than 0.5. Another conclusion that can be drawn from this graph is that the overall objective qualities of codecs U and Z are somewhat lower than their subjective qualities. Such consistencies within codec families might be due to some unspecified types of distortions generated by the coding algorithms that are resolved suboptimally by the PEAQ method.

Figure 8. Difference between SDG and ODG per system. (Treurniet, 2000)

This section finishes with a performance assessment (Schmidmer, 2005) in which PEAQ is compared not with the results of an ITU-R BS.1116 based listening test, but with a MUSHRA based test. The experiment is again related to audio coding, and the seven audio codecs under test were in this case:

- Microsoft Windows Media 4
- MPEG-4 AAC (Fraunhofer)
- MP3 (Fraunhofer)
- Quicktime 4, Music-Codec 2 (Qdesign)
- Real Audio 5.0
- RealAudio G2
- MPEG-4 TwinVQ (Yamaha)

Figure 9. Differences between objective and subjective assessment of audio quality; the two panels ("48 kbps Stereo - DR" and "64 kbps Stereo - DR") plot subjective and objective scores per codec. (Schmidmer, 2005)

3.3 Application domains of objective measurement of sound quality

The performance of microphones, recording and transmission equipment, and loudspeakers has improved over time through successive incremental refinements. Present practice uses several kinds of measurements, dating from the very early days of audio, to characterize the linear and nonlinear errors that accumulate in the audio chain. For some new processes, specifically low bit-rate audio codecs, the measurement of these traditional audio parameters has never been strictly appropriate. Low bit-rate codecs introduce new kinds of errors that the traditional measurements were not designed to detect, and in fact these kinds of systems could even be designed to measure well even when they do not sound good.
The objective measurement recommendations presented in this section were developed to assess automatically the degradation of audio quality in different stages of the audio chain. The systems were designed to emulate the way human hearing distinguishes different sounds from one another. Sometimes the system has to work in real time, whereas for other applications non-real-time measurement is sufficient; this determines which version of PEAQ is used. Some of the possible application scenarios for objective measurement techniques are listed below (Thiede, 2000):

- Assessment of implementations. Procedure for characterizing different implementations of audio processing equipment, in many cases audio codecs.
- Perceptual quality lineup. Fast procedure that tests equipment or circuits before putting them into service.
- On-line monitoring. Continuous process to monitor audio transmissions in service.
- Equipment or connection status. Detailed analysis of a piece of equipment or a circuit.
- Codec identification. Procedure to identify the type and implementation of a particular codec.
- Codec development. Procedure characterizing the performance of a codec in as much detail as possible.
- Network planning. Procedure to optimize the cost and performance of a transmission network under given constraints.
- Aid to subjective assessment. Tool for identifying critical material to include in a subjective listening test.

As an example, we consider the results of an investigation (Benjamin, 2002) that used PEAQ to measure the audio degradation caused by sample rate conversion, Analog to Digital Converters (ADC) and Digital to Analog Converters (DAC). Sample rate conversion works by interpolating new samples to redefine the waveform that was described by the original samples. The interpolation error is likely to be greater at high frequencies than at low frequencies, because high-frequency waveforms change value more between samples than low-frequency waveforms do. In the study the input signal was a 20 kHz full-scale sine wave sampled at 44.1 kHz, which was sample rate converted up to 48 kHz. It was noticed that the sample rate conversion process introduced numerous artifacts into the output signal. PEAQ cannot directly compare programs at different sample rates, so the experiments were performed by doing sample rate conversions in pairs. In principle, the amount of distortion can be controlled by adjusting the length of the interpolation filter; on the other hand, the time allowed for the conversion limits the length of the filter. The material was converted twice, first from 44.1 kHz to 48 kHz and then back to 44.1 kHz, using several prototype and commercial sample rate conversion programs and devices. The twice-converted files were then assessed for quality using PEAQ, with the original files acting as the reference. Figure 10 shows the progressive degradation of audio quality as the number of conversions is increased.

Figure 10. Audio quality after tandem sample rate conversions. (Benjamin, 2002)

The length of the interpolation filter does not seem to be a strong determinant of audio quality until it is reduced to 33 or 17 taps. The degradation associated with longer interpolation filters could be entirely due to cumulative round-off error.
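As a rough sketch of the tandem-conversion setup described above (not the prototype converters used in the study), a 44.1 kHz signal can be resampled to 48 kHz and back with a polyphase resampler; the doubly converted signal would then be compared against the original with a PEAQ implementation, which is assumed to exist separately and is not shown here. A 1 kHz tone is used for simplicity, whereas the study used a 20 kHz full-scale tone.

```python
import numpy as np
from scipy.signal import resample_poly

fs_in = 44100
t = np.arange(fs_in) / fs_in                     # one second of test signal
x = 0.5 * np.sin(2 * np.pi * 1000 * t)           # 1 kHz tone (illustrative only)

def tandem_conversion(x, passes=1):
    """Convert 44.1 kHz -> 48 kHz -> 44.1 kHz repeatedly.

    44100 * 160 / 147 = 48000, so the rational factors are 160/147 and 147/160.
    Each pass accumulates the interpolation error of the two conversions.
    """
    y = x
    for _ in range(passes):
        y = resample_poly(y, up=160, down=147)   # 44.1 kHz -> 48 kHz
        y = resample_poly(y, up=147, down=160)   # 48 kHz -> 44.1 kHz
    return y

y = tandem_conversion(x, passes=4)
# The pair (x, y) would be handed to a PEAQ implementation to obtain an ODG.
print(len(x), len(y))
```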
The quality of both types of converters, ADC and DAC, is the subject of intensive effort and discussion. For example, the audiophile press claims that the PCM process and the converters associated with the Compact Disc format are the cause of substantial audio degradation. Several methods have been proposed for evaluating digital audio converters, e.g., looking at the spectrum of sine-wave stimuli, but they become more difficult to apply with complex signals such as music or speech. PEAQ is obviously not able to assess impairments in the analog domain. In order to measure the impairments associated with the conversion process it is necessary to measure the composite effect of the ADC and DAC. This has a clear disadvantage: the process cannot directly distinguish whether the artifacts are due to the ADC, to the DAC, or to both. Figure 11 shows how PEAQ was used to measure the audio quality of DACs and ADCs.

Figure 11. Block diagram of DAC/ADC evaluation. (Benjamin, 2002)

The original program material was played back in real time, either from a CD player, a DVD-Audio player, or a computer hard disk, in all cases using digital output interfaces. The digital output is sent to a pair of DACs and ADCs, and the twice-converted file is recorded at the same time as the original file. The two files, representing the original two-channel program and the degraded program, are then applied to the PEAQ process, which gives an ODG value. The degraded file can be put through the conversion process any number of times to increase the degradation caused by the conversion. Even if the combination of DAC and ADC is nearly transparent, some number of passes through the conversion process will cause audible degradation. The programs chosen for the test were all from the collection of recommended material for subjective quality assessment created by the EBU. The harpsichord arpeggio and castanets were chosen as representative of instruments with extended high-frequency spectral content. The results of the tests are shown in Figure 12.

Figure 12. Quality after repetitions of DAC/ADC conversion. (Benjamin, 2002)

The ODG shows a consistent decrease in quality as the number of conversions is increased. After only one pair of conversions the degradation is very small (-0.09), but after 50 conversions the quality has dropped to about -1.5, a score between "Perceptible, but not annoying" and "Slightly annoying". The figure shows an abrupt decrease in quality after conversion 16. Based on his own experience, the author of the investigation concluded that PEAQ does a very good job of predicting the audibility of errors, and that PEAQ can be used to measure very small changes in audio quality, even smaller than can be detected in listening tests.

3.4 Beyond PEAQ and PESQ

The estimation of audio quality is becoming more important, especially in telecommunication applications, where Quality of Service is one of the key considerations. Thus, there are currently many ongoing research projects and advances in this area, as well as remaining challenges. The IEEE Signal Processing Society has recently published a Call for Papers for a special issue of the IEEE Transactions on Speech and Audio Processing that will focus on the objective quality assessment of speech and audio. Contributions will be received until February 2006 and must be related to some of the following topics:

- Subjective basis for objective quality assessment
- Waveform models, based on waveforms of speech and audio
- Parametric models, based on telecommunication or broadcast network parameters
- Intrusive models
- Non-intrusive (single-ended or output-based) models
- Objective diagnosis of quality impairment
- Objective and subjective assessment of conversational quality
- Issues and applications relevant to real-world problems

The tentative publication date for this issue is January 2007, so anyone interested in the topic should pay attention to it. Besides that, this section discusses some research papers that have been presented more recently and that illustrate the future directions and applications of objective measurement of audio quality. The motivation for these investigations is usually that listening tests are reliable but very expensive, time consuming and sometimes impractical, while existing objective quality assessment methods require either the original audio signal or a complicated computational model, which makes some quality evaluation applications impossible.

Libin Cai and Jiying Zhao, working at the University of Ottawa, proposed using digital audio watermarking to evaluate the quality of speech (Cai, 2005) and audio signals (Cai, 2004). As shown in Figure 13, in order to measure the audio quality, the proposed scheme only needs the quantization scale and the watermarking key used in the embedding process.

Figure 13. Audio quality measurement based on watermarking. (Cai, 2004)

When a watermarked audio signal is distorted, the rate of correct watermark detection decreases accordingly. In an ideal measurement, the percentage of correctly extracted watermark bits would decrease by the same proportion for the same distortion, regardless of the audio material. However, different audio signals contain different frequencies and amplitudes, and hence have different robustness to the same distortion, which makes it difficult to measure audio quality with a fixed quantization step. The authors therefore employ an adaptive control method to obtain an optimized quantization step for different audio signals; with fixed quantization steps, the system produces the lowest percentage of correct watermark extraction. At the end of the process the extracted watermark is compared with the original watermark to obtain the Percentage of Correctly Extracted Watermark bits (PCEW). The signals were artificially attacked with additive noise, Gaussian noise and low-pass filters. The following figures show the average PCEW values that can be used to measure the audio quality.

Figure 14. Effects of additive noise, Gaussian noise and low-pass filtering. (Cai, 2004)

The evaluation of this method is presented more clearly in the paper on speech (Cai, 2005), which gives correlation coefficients between PCEW and PESQ MOS. The Absolute Residual Error (ARE) and correlation coefficients are shown in Figure 15.

Figure 15. Accuracy of the watermarking-based assessment method. (Cai, 2005)
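To make the watermark-based idea concrete, here is a toy Python sketch of quantization-based embedding and of the percentage of correctly extracted watermark bits after an additive-noise attack. The quantization step, the stand-in host signal and the attack are illustrative assumptions and do not reproduce the adaptive scheme of Cai and Zhao; the point is only that stronger distortion lowers the extraction percentage.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(samples, bits, step):
    """Quantization-based embedding: encode each bit in the parity of the bin."""
    q = np.round(samples / step).astype(int)
    q += (q + bits) % 2                      # force bin parity to match the bit
    return q * step

def extract(samples, step):
    """Recover bits from the parity of the nearest quantization bin."""
    return np.round(samples / step).astype(int) % 2

def pcew(original_bits, extracted_bits):
    """Percentage of Correctly Extracted Watermark bits."""
    return 100.0 * np.mean(original_bits == extracted_bits)

host = rng.normal(0.0, 0.1, 1000)            # stand-in for audio samples
bits = rng.integers(0, 2, 1000)
marked = embed(host, bits, step=0.02)

for noise_std in (0.0, 0.005, 0.02):         # increasing distortion strength
    attacked = marked + rng.normal(0.0, noise_std, marked.shape)
    print(noise_std, pcew(bits, extract(attacked, step=0.02)))
```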
Rahul Vanam and Charles D. Creusere (Vanam, 2005) demonstrated that the Advanced Version of PEAQ performs poorly compared with the previously developed Energy Equalization Approach (EEA) when evaluating the quality of low bit-rate scalable audio (supported, for example, in the MPEG-4 standard). They also created a modified version of PEAQ, adding an Energy Equalization parameter to the other MOVs, which improved performance significantly, even compared with EEA. Scalable audio compression means that the system encodes audio data at one bit-rate and can decode it at bit-rates less than or equal to the original bit-rate. Objective quality measurement of low bit-rate scalable audio using the Basic Version of PEAQ has been found to be poor. The Advanced Version, which was tested in the investigation (Vanam, 2005), also performs poorly, and EEA is superior to it, as can be seen in Figure 16.

Figure 16. Evaluation of both versions of PEAQ and of EEA for scalable audio codecs. (Vanam, 2005)

The corresponding correlation coefficients are 0.365 for the Basic Version of PEAQ, 0.325 for the Advanced Version, and 0.669 for the Energy Equalization Approach. Those values are far from acceptable, so a new modified version of PEAQ was proposed. The Advanced Version is modified and an additional MOV is used to calculate the ODG values. The correlation coefficient for the modified Advanced Version is found to be 0.8254, indicating that it has superior performance to EEA, as can be seen in Figure 17.

Figure 17. Evaluation of the modified version of PEAQ and EEA for scalable audio codecs. (Vanam, 2005)

While the previous research work is based on a more complex version of the PEAQ method, the last investigation mentioned in this paper is based on a simplified version of PESQ. The goal of S. Voran (Voran, 1998) was to simplify the algorithm while having minimal effect on its performance. The modified algorithms reduced the number of floating point operations by 64% with only a 3.5% decrease in the average correlation with listener opinions. Six components of the algorithm were removed or re-adjusted, creating six different modified versions. Figure 18 shows the elements under consideration for each version and their complexity-performance trade-off.

Figure 18. Description of six simplified versions of the algorithm and their performance. (Voran, 1998)

According to this study, it appears that a portion of the algorithm's complexity does not contribute much to the perceived speech quality estimation, at least for the seven subjective tests considered in the study. Using the proposed simplifications, the algorithm may be a candidate for inclusion in speech coders. It might provide feedback to parameter selection, excitation search, and bit-allocation algorithms to ensure that the highest possible signal quality is obtained at the lowest possible bit-rate.

4 CONCLUSIONS

This paper has presented two ITU recommendations for the subjective assessment of sound quality. By using them it is possible to compare the performance of different audio systems and devices in a reliable manner. As we have seen, the ITU-R BS.1116 recommendation is well suited to evaluating small impairments in audio signals, while ITU-R BS.1534 is intended for intermediate-quality signals. Thus, the first can be applied, for example, to compare the performance of Analog to Digital Converters, and the latter to compare audio codecs at low bit-rates. Listening tests are very reliable but also very expensive, time consuming and sometimes impractical. Because of that, recommendations for the objective measurement of sound quality have been proposed. The methods standardized by the ITU are PEAQ, for wideband audio signals, and PESQ, for speech signals. Objective methods try to imitate the way human listeners perceive sounds by using psychoacoustic and cognitive models. The PEAQ algorithm, for example, was created by combining the ideas of seven model proponents.
When the models were evaluated, their performances were not significantly different, so the original proponents were asked to collaborate in developing a new, improved model. Objective measurement algorithms are periodically evaluated by comparing their results with those of subjective assessments and examining the correlation coefficients. The PEAQ and PESQ algorithms appear to be highly reliable in many cases, but in some specific scenarios, such as evaluating the quality of low bit-rate scalable audio, they perform very poorly. This is one of the main motivations for further research in this field. Other investigations follow different approaches, for instance trying to achieve similar quality with simplified versions of PEAQ and PESQ, or even with alternative solutions such as the watermarking-based methods discussed above. At this moment objective measurements cannot be considered totally reliable and subjective assessments based on listening tests are still needed, but for many applications the objective methods offer sufficient quality, and in some cases they are more practical.

REFERENCES

Benjamin, E. 2002. Evaluating digital audio artifacts with PEAQ. AES Convention Paper, 113th AES Convention, Los Angeles, CA, October 2002.

Cai, L. and Zhao, J. 2004. Audio quality measurement by using digital watermarking. Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) 2004, Niagara Falls, Ontario, Canada, pp. 1159-1162, May 2-5, 2004.

Cai, L. and Zhao, J. 2005. Speech quality evaluation: a new application of digital watermarking. Proceedings of the 2005 IEEE Instrumentation and Measurement Technology Conference, Ottawa, Ontario, Canada, pp. 726-731, May 17-19, 2005.

ITU-R BS.1116. Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems. 1997.

ITU-R BS.1387. Method for objective measurement of perceived audio quality. 1998.

ITU-R BS.1534. Method for the subjective assessment of intermediate quality level of coding systems. 2001.

ITU-R P.861. Objective quality measurement of telephone-band speech codecs. 1996.

Rix, A., Beerends, J., Hollier, M. and Hekstra, P. 2000. PESQ – the new ITU standard for end-to-end speech quality assessment. AES Convention Paper, 109th AES Convention, Los Angeles, CA, September 2000.

Rothauser, H. and Urbanek, G. 1966. Some problems in subjective testing. AES Convention Paper, 31st AES Convention, New York, October 1966.

Seppänen, J. 2004. Mobile multimedia codecs and formats. Multimedia Seminar lecture, Fall 2004. Available at: http://www.tml.tkk.fi/Studies/T-111.550/

Schmidmer, C. 2005. Perceptual wideband audio quality assessments using PEAQ. 2nd Workshop on Wideband Speech Quality, Mainz, Germany, June 2005.

Stoll, G. and Kozamernik, F. 2000. EBU listening tests on Internet audio codecs. EBU Technical Review, June 2000.

Thiede, T., Treurniet, W. C., Bitto, R., Schmidmer, C., Sporer, T., Beerends, J. G., Colomes, C., Keyhl, M., Stoll, G., Brandenburg, K. and Feiten, B. 2000. PEAQ – the ITU standard for objective measurement of perceived audio quality. Journal of the Audio Engineering Society (AES), vol. 48, no. 1/2, Jan/Feb 2000.

Treurniet, W. C. and Soulodre, G. 2000. Evaluation of the ITU-R objective audio quality measurement method. Journal of the Audio Engineering Society (AES), vol. 48, no. 3, March 2000.

Vanam, R. and Creusere, C. 2005. Evaluating low bitrate scalable audio quality using advanced version of PEAQ and energy equalization approach.
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2005, vol. 3, pp. 189-192, Philadelphia, PA, March 18-23, 2005.

Voran, S. 1998. A simplified version of the ITU algorithm for objective measurement of speech codec quality. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 1998, vol. 1, pp. 537-540, Seattle, WA, May 12-15, 1998.