Topic: Potential Biases in MUSHRA Listening Tests?

Potential Biases in MUSHRA Listening Tests?

Did anyone attend the presentation or read this paper? I'm curious to hear comments. 20% seems like a lot, but it would be good to know under which conditions that occurs.
Quote
P3-8 Potential Biases in MUSHRA Listening Tests—Slawomir Zielinski, Philip Hardisty, Christopher Hummersone, Francis Rumsey, University of Surrey - Guildford, Surrey, UK
The method described in the ITU-R BS.1534-1 standard, commonly known as MUSHRA (MUltiple Stimulus with Hidden Reference and Anchors), is widely used for the evaluation of systems exhibiting intermediate quality levels, in particular low-bit rate codecs. This paper demonstrates that this method, despite its popularity, is not immune to biases. In two different experiments designed to investigate potential biases in the MUSHRA test, systematic discrepancies in the results were observed with a magnitude up to 20 percent. The data indicates that these discrepancies could be attributed to the stimulus spacing and range equalizing biases.
Convention Paper 7179
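
To get a feel for what a "range equalizing bias" could do, here is a toy model (my own illustration under a crude assumption, not the paper's actual model): suppose listeners stretch whatever quality range they hear in a trial onto the full 0-100 MUSHRA scale. The same codec then scores very differently depending on what it is grouped with.

```python
import numpy as np

# Toy model of a range-equalizing bias (illustration only, not the
# paper's model): listeners stretch the quality span they hear in a
# trial onto the full 0-100 MUSHRA scale.
def rated_scores(true_qualities):
    q = np.asarray(true_qualities, dtype=float)
    return 100.0 * (q - q.min()) / (q.max() - q.min())

codec = 60.0  # hypothetical "true" quality of the codec under test

# Trial A: the codec is heard alongside low-quality anchors.
trial_a = rated_scores([20.0, 35.0, codec, 100.0])
# Trial B: the same codec is heard alongside stronger competitors.
trial_b = rated_scores([55.0, 70.0, codec, 100.0])

print(round(trial_a[2], 1))  # ~50.0
print(round(trial_b[2], 1))  # ~11.1
```

Under this crude assumption the same signal moves by roughly 40 points between trials, which is at least the flavor of the stimulus-spacing discrepancy the abstract describes.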

Potential Biases in MUSHRA Listening Tests?

Reply #1
I imagine it touches on some topics which can be found in this paper:

www.surrey.ac.uk/soundrec/ias/papers/Zielinski.pdf

Regarding one point in the above paper, I have often wondered how valid it is to average scores across listeners when we do group tests.  The data is all there, I guess, for anybody who has an uncontrollable urge to look for multi-modal distributions and figure out where they might come from :-)
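
For anyone who does get that urge, a cheap first pass (a sketch with made-up scores, using Sarle's bimodality coefficient; values above roughly 0.555, the uniform-distribution benchmark, hint at more than one mode) might look like this:

```python
import numpy as np

def bimodality_coefficient(x):
    """Sarle's bimodality coefficient; values above ~0.555 (the value
    for a uniform distribution) hint at more than one mode."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std()
    g1 = np.mean(z**3)        # skewness (biased moment estimator)
    g2 = np.mean(z**4) - 3.0  # excess kurtosis (biased moment estimator)
    return (g1**2 + 1.0) / (g2 + 3.0 * (n - 1)**2 / ((n - 2) * (n - 3)))

rng = np.random.default_rng(0)
# Hypothetical panels: one that broadly agrees, one split into two camps.
agreed = rng.normal(70, 8, 40).clip(0, 100)
split = np.concatenate([rng.normal(45, 5, 20), rng.normal(85, 5, 20)]).clip(0, 100)

print(bimodality_coefficient(agreed))  # well below 0.555: one mode
print(bimodality_coefficient(split))   # well above 0.555: two camps
```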

ff123

Potential Biases in MUSHRA Listening Tests?

Reply #2
In short, what the paper pointed out is that anchoring is essential to the test, and that you can get rather surprising results when anchors are not in the positions expected.

Also, the scale was shown to stretch and contract with varying qualities of stimuli.

My problem is simpler.

It is well known and demonstrated that with four different signals, a may be preferred to b, b to c, c to d, and d to a. Subjective responses are not necessarily "transitive".

But MUSHRA ignores that.
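
To make the point concrete, a toy example (hypothetical preferences, not data from any real test): pairwise preferences can cycle, while any single-number score, MUSHRA's included, forces a transitive ordering.

```python
# Toy pairwise preferences (hypothetical, not from any real test).
# Each pair is (preferred, rejected); note the cycle a > b > c > d > a.
prefs = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]

def consistent_scores_exist(prefs):
    """True iff some scalar scoring respects every preference, i.e. the
    preference graph is acyclic (peel off items that lose no comparison)."""
    items = {x for pair in prefs for x in pair}
    edges = set(prefs)
    while items:
        # items never on the losing side can take the top remaining score
        top = {x for x in items if all(l != x for _, l in edges)}
        if not top:
            return False  # everything left loses to something: a cycle
        items -= top
        edges = {(w, l) for (w, l) in edges if w in items and l in items}
    return True

print(consistent_scores_exist(prefs))                     # False: the cycle above
print(consistent_scores_exist([("a", "b"), ("b", "c")]))  # True: transitive chain
```

No assignment of scalar scores can honor the cycle, yet a MUSHRA panel is asked to produce exactly such scores.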
-----
J. D. (jj) Johnston

Potential Biases in MUSHRA Listening Tests?

Reply #3
A conference paper by Zielinski fleshed out the ideas in the AES presentation, and is available online:

http://www.surrey.ac.uk/soundrec/ias/papers/Zielinski.pdf

The work was also published in JAES last year, as:

S. Zielinski, F. Rumsey, and S. Bech, "On Some Biases Encountered in Modern Audio Quality Listening Tests - A Review," J. Audio Eng. Soc., vol. 56, no. 6, pp. 427-451 (June 2008).

Potential Biases in MUSHRA Listening Tests?

Reply #4
Quote
A conference paper by Zielinski fleshed out the ideas in the AES presentation, and is available online:

http://www.surrey.ac.uk/soundrec/ias/papers/Zielinski.pdf

The work was also published in JAES last year, as:

S. Zielinski, F. Rumsey, and S. Bech, "On Some Biases Encountered in Modern Audio Quality Listening Tests - A Review," J. Audio Eng. Soc., vol. 56, no. 6, pp. 427-451 (June 2008).


An interesting and, it seems, insightful paper. This jumped out at me:

"It was shown that hedonic judgments (related to pleasantness) may introduce more bias to the results of audio quality listening tests than sensory judgments. Consequently, hedonic judgments should be avoided in audio listening tests if possible. For instance, the participants could be asked to evaluate sound character or audio fidelity (trueness with respect to a reference) rather than how much they like, dislike, prefer or desire certain audio stimuli."

Also:

"One may argue that the two currently most popular methods for evaluation of audio quality [8], [9] are free from the aforementioned biases, as they use an emotion-free definition of audio quality which is substantially different from the definitions quoted above. According to both standards, the basic audio quality is defined as a single, global attribute used to judge any and all detected differences between the reference and the object. This definition does not make any references to the “satisfaction”, “adequacy” or “desired nature” of a sound but to the perceptual “difference” between the audio reference and the object under evaluation. Since the perceptual “difference” can be considered as an emotion-free attribute, one could conclude that in these two standardised methods there is no place for any hedonic judgments. However, a close examination of the grading scales used in these standard techniques reveals that this conclusion is flawed. According to the ITU-R BS. 1116 recommendation, a 5-point impairment scale should be used in listening tests involving small audio quality impairments [8]. It can be seen in Fig. 2 that the two ends of the scale do not contain bipolar labels, as the top end of the scale is concerned with imperceptibility of impairments whereas the middle and bottom parts of the scale are used to represent different levels of annoyance. In other words, this scale can be described as a “hybrid”, combining two different perceptual constructs at two ends of the scale; perceivability at the top and annoyance at the bottom. Since the “annoyance” construct is directly related to disliking, it can be inferred that the middle and bottom part of the scale will involve a substantial proportion of hedonic judgments. Hence, all the biases discussed in the previous section can potentially affect the results obtained using the ITU-R BS. 1116 recommended method."

There is another, indirect kind of bias in hedonic judgments. Let's say I was comparing a 2-channel system to a 7.1-channel system. If the program material exploits the 7.1 system, then identifying each system is pretty trivial. How do we keep people from giving responses that are biased by this obvious identification? Isn't a blind test whose outcome is obvious subject to some of the same, or at least similar, biases as a sighted evaluation?

Potential Biases in MUSHRA Listening Tests?

Reply #5
An example from codec testing:

Tester 1:  "Hmm, I can hear the stereo narrowing; I know this codec must be Ogg Vorbis.  I'm going to mark this one down."
Tester 2:  "Ack, I notice the stereo narrowing on that one, but this one here really plays havoc with the high frequencies.  I'd recognize it a mile away as WMA.  Yuck, I really hate that."
Tester 3:  "I really can't hear anything except this slight smearing of transients.  I'd guess that to be mp3.  Better give it a lower rating."

Each of these testers' ratings gets averaged together to come up with one overall rating.  Seems like it's subject to listener bias, similar to sighted tests, as Arny points out (or as Monty pointed out long ago:  "None of my listening tests are really blind."), and seems like assigning a single rating is not quite the right thing to do.
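
A sketch of the problem with made-up numbers: averaging compresses three different failure modes into one number and discards the "why".

```python
# Hypothetical ratings from the three testers above, each penalizing a
# different artifact.
ratings = {
    "tester1": {"score": 55, "artifact": "stereo narrowing"},
    "tester2": {"score": 30, "artifact": "high-frequency damage"},
    "tester3": {"score": 70, "artifact": "transient smearing"},
}

# The published number: one mean over all listeners.
overall = sum(r["score"] for r in ratings.values()) / len(ratings)
print(f"overall rating: {overall:.1f}")  # 51.7

# What the mean throws away: no two listeners penalized the same thing.
for name, r in ratings.items():
    print(f"{name}: {r['score']} ({r['artifact']})")
```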

Potential Biases in MUSHRA Listening Tests?

Reply #6
seems like assigning a single rating is not quite the right thing to do.

Such is always the nature of complex judgements. Sometimes you need a single-figure judgement, but you need to know why you need it, and not assume that it tells you everything.

So he says, after a working life spent grading student work and being forced to express complex combinations of strengths and weaknesses in a single mark. But it shows in reports of codec tests: the appropriate overall grade for someone sensitive to pre-echo would differ from the appropriate grade for someone who isn't, and a rating might well be dependent on the listener's high-frequency hearing.

So, at the most general level, I'd think that's one of the things you have to constantly work with, rather than hoping you can eliminate it. You might, for instance, eliminate hedonic considerations from audio testing, but only at the cost of divorcing the test from the whole point of the exercise, which is producing sounds people like listening to.


Potential Biases in MUSHRA Listening Tests?

Reply #7
Quote
A conference paper by Zielinski fleshed out the ideas in the AES presentation, and is available online

An interesting and, it seems, insightful paper. This jumped out at me:

"It was shown that hedonic judgments (related to pleasantness) may introduce more bias to the results of audio quality listening tests than sensory judgments. Consequently, hedonic judgments should be avoided in audio listening tests if possible. For instance, the participants could be asked to evaluate sound character or audio fidelity (trueness with respect to a reference) rather than how much they like, dislike, prefer or desire certain audio stimuli."



The problem with this statement is that in the real world (i.e., not universities) humans routinely make hedonic judgments (i.e., preferences) when making purchase decisions. So I cannot avoid making preference measurements when doing competitive benchmarking of consumer, professional, or automotive audio products. Ideally, you should measure both preferences and the various sound quality attributes (spectral balance, spatial attributes, distortion, etc.) so that you understand the relationship between the two.

There is always potential bias in both preference AND non-hedonic attribute judgments. To say that you must avoid hedonic measurements is not very useful for scientists who have to use them when doing listening tests aimed at real-world applications. You just have to learn how to deal with the biases rather than trying to run away from them.
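
As a sketch of that last point (all numbers hypothetical, for illustration only), one could regress panel preference onto the measured attributes to see which ones actually drive it:

```python
import numpy as np

# Hypothetical benchmarking data for five products (made-up numbers).
# Attribute columns: spectral balance, spatial quality, distortion (0-10).
attributes = np.array([
    [8.0, 7.0, 2.0],
    [6.5, 8.0, 3.0],
    [5.0, 5.5, 5.0],
    [7.5, 6.0, 2.5],
    [4.0, 4.5, 6.5],
])
preference = np.array([8.2, 7.5, 4.8, 7.0, 3.9])  # mean preference ratings

# Least-squares fit: preference ~ intercept + weights . attributes
X = np.column_stack([np.ones(len(preference)), attributes])
w, *_ = np.linalg.lstsq(X, preference, rcond=None)
print("intercept and attribute weights:", np.round(w, 2))
```

The signs and sizes of the fitted weights are one crude view of how preference relates to the measured attributes; a real analysis would of course need far more products and listeners.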

Cheers
Sean Olive
Director of  Benchmarking & Acoustic Research
Harman International
My Audio Blog