Blind listening tests
Many experiments have shown that the differences listeners hear between audio sources are sometimes the product of imagination. These illusions can be strong, durable, shared by many listeners, and consistently associated with knowing which audio source is being listened to.
A double blind listening test (DBT) is a listening setup that makes it possible to confirm that a given audible difference is indeed caused by the audio sources, and not just by the listener's impressions.
In an ABX double blind listening test, the listener has access to three sources labeled A, B, and X. A and B are the references: the audio source with and without the tweak, for example the WAV file and the MP3 file. X is the mystery source. It can be either A or B. The listener must identify it by comparing it to A and B.
But if the listener says that X is A, and X actually is A, what does this prove?
Nothing, of course. If you flip a coin behind my back and I state that it's heads, and I'm right, it doesn't prove that I have parapsychic abilities that let me see what's behind my back. This is just luck, nothing more!
That's why a statistical analysis is necessary.
Let's imagine that after the listener has given his answer, the test is run 15 more times, choosing X at random each time. If the listener gives the correct answer all 16 times, what does it prove? Can it be luck?
Yes it can, and we can calculate the probability of it happening. For each trial, there is one chance in two of getting the right answer, and 16 independent trials are run. The probability of getting everything correct by chance is then 1/2 to the power 16, that is 1/65536. In other words, if no difference is audible, the listener will get everything correct one time out of 65536 on average.
We can thus choose the number of trials according to the tweak tested, the goal being to make the probability of a success by chance lower than the likelihood that the tweak actually has an audible effect.
For example, if we compare two pairs of speakers, it is likely that they won't sound the same. We can be content with doing the test 7 times. There will be 1 chance in 128 of getting a "false success". In statistics, a "false success" is called a "type I error". The more the test is repeated, the less likely type I errors are.
Now, if we put an amulet beside a CD player, there is no reason for it to change the sound. We can then repeat the test 40 times. The probability of success by chance will then be about one in a trillion (1 over 2 to the power 40). If it ever happens, there is necessarily an explanation: the listener hears the operator moving the amulet, or the operator always takes more time to start the playback once the amulet is away, or maybe the listener perceives a brightness difference through his eyelids if it is a big dark amulet, or he can smell it when it is close to the player...
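To make these numbers concrete, here is a minimal Python sketch of the calculation. The function names are mine, chosen for illustration, and the only assumption is that every trial is an independent 50/50 guess.

```python
from fractions import Fraction


def chance_of_perfect_score(trials: int) -> Fraction:
    """Probability of answering every ABX trial correctly by pure guessing."""
    return Fraction(1, 2) ** trials


def trials_needed(target_p: float) -> int:
    """Smallest number of all-correct trials whose chance probability is below target_p."""
    n = 0
    while chance_of_perfect_score(n) >= target_p:
        n += 1
    return n


print(chance_of_perfect_score(7))    # 1/128, the speaker example
print(chance_of_perfect_score(16))   # 1/65536
print(chance_of_perfect_score(40))   # 1/1099511627776, the amulet example
print(trials_needed(0.01))           # 7
```

Going the other way, trials_needed(1e-12) returns 40, which is where the 40 trials of the amulet example come from if one in a trillion is the target.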
Let p be the probability of getting a success by chance. It is generally accepted that a result whose p value is below 0.05 (one in 20) should be seriously considered, and that p < 0.01 (one in 100) is a very positive result. However, this must be weighed according to the context. We saw that for very dubious tweaks, like the amulet, it is necessary to get a very small p value, because between the expected probability that the amulet works (say one in a billion, for example) and the probability that the test succeeds by chance (1 in 100 is often chosen), the choice is obvious: it's the test that succeeded by chance!
Here's another example where numbers can fool us. If we test 20 cables, one by one, in order to know whether they have an effect on the sound, and if we consider that p < 0.05 is a success, then in the case where no cable has any actual effect on the sound, since we run 20 tests, we should still expect on average one accidental success among the 20 tests! In this case we absolutely cannot say that the cable affects the sound with a probability of 95%, even though p is below 5%, since this success was expected anyway. The test failed, that's all.
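The 20-cable example can be checked with a few lines of Python. This is only a sketch: it assumes the 20 tests are independent and that each one has exactly a 5% chance of an accidental success, and the function names are hypothetical.

```python
def expected_false_successes(tests: int, alpha: float) -> float:
    """Average number of accidental successes when nothing is audible."""
    return tests * alpha


def chance_of_at_least_one(tests: int, alpha: float) -> float:
    """Probability that at least one of the independent tests succeeds by chance."""
    return 1.0 - (1.0 - alpha) ** tests


print(expected_false_successes(20, 0.05))  # about 1
print(chance_of_at_least_one(20, 0.05))    # about 0.64
```

So with 20 ineffective cables, seeing at least one "success" at p < 0.05 is more likely than not.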
But statistical analysis is not limited to simple powers of 2. If, for example, we get 14 right answers out of 16, what happens? Well, it is perfectly possible to calculate the probability of this happening, but mind that what we need here is not the probability of getting exactly 14/16, but the probability of getting 16/16, plus that of getting 15/16, plus that of getting 14/16.
An Excel table, based on a binomial distribution, gives all the needed probabilities: http://www.kikeg.arrakis.es/winabx/bino_dist.zip
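For those who would rather not open a spreadsheet, the same sum can be sketched in Python. This is not the linked Excel table, just the binomial tail computed directly; abx_p_value is a name chosen for illustration, and each trial is assumed to be an independent 50/50 guess.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Probability of getting at least `correct` right answers out of `trials` by guessing."""
    # Count the outcomes with correct, correct + 1, ..., trials right answers,
    # then divide by the 2**trials equally likely outcomes.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


print(abx_p_value(16, 16))  # 1/65536
print(abx_p_value(14, 16))  # 137/65536, about 0.0021
```

For 14/16, the three terms are 120 + 16 + 1 = 137 favourable outcomes out of 65536.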
Now, how do we set up the listening test so that its result, if positive, is really convincing? There are rules to observe if you don't want, in case of success, to have all your opponents laugh at you.
1. It is impossible to prove that something doesn't exist. The burden of proof is on the one claiming that a difference can be heard.
If you believe that a codec changes the sound, it is up to you to prove it, by passing the test. Someone claiming that a codec is transparent can't prove anything.
2. The test should be performed under double blind conditions (*).
In hardware tests, this is the most difficult requirement to meet. Single blind means that you can't tell whether X is A or B other than by listening to it. Double blind means that nobody in the room or the immediate surroundings can know whether X is A or B, in order to avoid any influence, even unconscious, on the listener. This complicates the operations for hardware testing. A third person can lead the blindfolded listener out of the room while the hardware is switched. High quality electronic switches have been made for double blind listening tests (http://sound.westhost.com/abx-tester.htm): a chip chooses X at random, and a remote control lets the listener compare it to A and B at will.
Fortunately, in order to double blind test audio files on a computer, some ABX programs are freely available. You can find some in our FAQ (http://www.hydrogenaudio.org/forums/index.php?act=ST&f=5&t=7516#entry74066).
3. The p values given in the table linked above are valid only if the following two conditions are fulfilled:
- The listener must not know his results before the end of the test, except if the number of trials is decided before the test.
...otherwise, the listener would just have to look at his score after every answer, and decide to stop the test when, by chance, the p value gets low enough for him.
- The test is run for the first time. If that is not the case, all previous results must be added together in order to get the result.
Otherwise, one would just have to repeat the series of trials as many times as needed to get, by chance, a small enough p value.
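To see how much the first condition matters, here is a small Python sketch that computes exactly how often a listener who is only guessing, but who looks at his score after every answer and stops as soon as p < 0.05, ends up with a "success". The function names and the 0.05 threshold are assumptions made for the illustration; this is not taken from any ABX program.

```python
from math import comb


def p_value(correct: int, trials: int) -> float:
    """Probability of at least `correct` right answers out of `trials` by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials


def chance_of_false_success(max_trials: int, threshold: float = 0.05) -> float:
    """Chance that a guessing listener who peeks after every trial stops on a 'success'."""
    running = {0: 1.0}  # running[k]: probability of k right answers so far, test still going
    stopped = 0.0       # probability of having already stopped with p below the threshold
    for n in range(1, max_trials + 1):
        after_trial = {}
        for k, prob in running.items():
            for gain in (0, 1):  # wrong or right answer, each with probability 1/2
                after_trial[k + gain] = after_trial.get(k + gain, 0.0) + prob / 2
        running = {}
        for k, prob in after_trial.items():
            if p_value(k, n) < threshold:
                stopped += prob  # the listener declares victory here
            else:
                running[k] = prob
    return stopped


print(chance_of_false_success(5))  # 0.03125, only 5/5 qualifies
print(chance_of_false_success(8))  # 0.05078125, already above the nominal 0.05
```

With only 8 trials allowed, peeking already pushes the real chance of a false success (13/256) above the 5% that the final p value claims, and it can only grow as more trials are allowed.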
Corollary: only give answers of which you are absolutely certain! If you have the slightest doubt, don't answer anything. Take your time. Take breaks. You can stop the test and go on another day, but never try to guess by "intuition". If you make some mistakes, you will never have the chance to do the test again, because anyone will be able to accuse you of making the numbers say what you want by "starting again until it works".
Of course you can train as many times as you wish, provided that you firmly decide beforehand that it will be a training session. If you get a perfect 50/50 during training and then can't reproduce this result, too bad for you. The results of the training sessions must be thrown away whatever they are, and the results of the real test must be kept whatever they are.
Once again, if you take all the time needed, be it one week of effort for a single answer, in order to get a positive result at the first attempt, your success will be mathematically unquestionable! Only your hifi setup or your blind test conditions may be disputed. If, on the other hand, you rerun a test that once failed, because your hifi setup has been improved since then, or because there was too much noise the first time, you can be sure that someone will come along, relying on the laws of statistics, and question your result. You will have done all this work in vain.
4. The test must be reproducible.
Anyone can post fake results. For example, if someone sells thingies that improve the sound, like oil for CD jewel cases or cable sheaths, he can very well claim to have passed a double blind ABX test with p < 0.00001, so as to make people talk about his products.
If someone passes the test, others must check that this is possible, by passing the test in their turn.
We saw what an ABX test is, with the associated probability calculation, which is perfectly suited for testing the transparency of a codec or the validity of a hifi tweak. But this is only the ABC of statistical tests.
For example, in order to compare the quality of audio codecs like MP3 in larger scale tests, ABC/HR tests are used (see http://ff123.net/abchr/abchr.html), which are more sophisticated. Each listener has two sliders and three buttons for every audio codec tested. A and B are the original and the encoded file. The listener doesn't know which one is which. C is the original, which stands as a reference. Using the sliders, he must give a mark between 1 and 5 to A and B, the original getting 5 in theory.
A probability calculation then makes it possible not only to know whether the tested codec audibly alters the sound, but also to estimate the relative quality of the codecs for the set of listeners involved, still under double blind conditions, and with a probability calculation giving the significance of the result. Depending on the needs of the test, these calculations can be performed with the Friedman method (http://www.graphpad.com/articles/interpret/ANOVA/friedmans.htm), for example, which gives a ranking of the codecs, or with ANOVA (http://www.psychstat.smsu.edu/introbook/sbk27.htm), which gives an estimate of the subjective quality perceived by the listeners on the 1 to 5 scale.
Note that this kind of statistical analysis is mostly used in medicine. To get an authorization, any drug must prove its efficacy in double blind tests (neither the physicians nor the patients know whether the pill is a placebo or the medication) against placebo (the drug must prove not only that it works, but that it works better than a placebo, because a placebo alone works too), and the decision is based on mathematical analyses such as the ones we just saw. So these are not guidelines hastily made up for hifi tests. They are general testing methods used in scientific research, and they remain entirely valid for audio tests.
(*) The double blind setup may be replaced by a carefully arranged single blind setup. I saw two accounts of single blind listening tests that failed, proving that, when done carefully, a single blind setup is enough to fool the listener.
Interpretation of a blind listening test
Of course ABX tests are not infallible.
Chaudscisse gave an excellent summary of the drawbacks of ABX testing in a French forum: http://chaud7.forum-gratuit.com/viewtopic.php?t=17&postdays=0&postorder=asc&start=450#5543
However, since even for native French speakers the text is almost incomprehensible, I'll have to make a summary.
Most often, it is accepted that a result whose probability of occurring by chance is smaller than 1/20 is "statistically significant". There is no interpretation involved: this p value is the result of a mathematical calculation relying only on what has been observed. Former results from similar tests, the quality of the test, and other statistical calculations are not taken into account. However, these factors do influence the probability that the observed difference is real.
- Number of testers: studies made with a small number of listeners are more sensitive to mistakes occurring in the test setup (wrong stimulus presented, mistakes in copying the results, etc.). For this reason, when the result depends on one or two people, conclusions must be cautious.
- Predictability level: a success is more likely to turn up when N tests have been performed than when only one is performed. For example, if we want to test something that has no effect, the result that we get will be decided by chance only. Imagine that 20 people run independent tests. According to chance, on average, one of them should get a "false positive" result, since a positive result is by definition something that occurs by chance no more than one time in 20. The p calculation of each test does not take this into account.
- Multiple comparisons: if we select two groups in the population, using one criterion, there will be less than 1 chance in 20 of getting a "statistical difference" between the two. However, if we consider 20 independent criteria, the probability of getting a significant difference on at least one of them is much higher than 1/20.
For example, if people are asked to rate the "dynamics", "soundstage", and "coloration" of an encoder, the probability of getting a false positive is close to three times as high as with one criterion only (about 14% instead of 5% if the three ratings are independent), since there are three possibilities for the event to occur. Once again, the p value associated with each comparison is lower than the real probability of getting a false positive.
The original text is much longer, with some repetitions, and other ideas that I didn't translate, because they are not directly related to the reliability of ABX tests.
I would however like to add an important point: the interpretation of the p value.
It is accepted by convention that p < 5% is an interesting result, and p < 1% a very significant one. This does not take into account the tested hypothesis itself.
Suppose we are testing the existence of Superman, and get a positive answer, that is "Superman really exists because the probability of the null hypothesis is less than 5%". Must we accept the existence of Superman? Is it an infallible, scientific proof of his existence?
No, it's just chance. Getting an event whose probability is less than 5% is not uncommon.
However, when a listening test about MP3 at 96 kbps gives a similarly significant result, we accept the opposite conclusion: that it was not chance. Why?
Why should the same scientific result be interpreted in two opposite ways? This is because we always keep the most probable hypothesis. The conclusion of an ABX test is not the p value alone, it is its comparison with the subjective p value of the tested hypothesis.
Testing MP3 at 96 kbps, what do we expect? Anything. We start with the assumption that the odds of success are 1/2. The ABX result then tells us that the odds of failure are less than 1/20. Conclusion: success is the most probable hypothesis.
Testing the existence of Superman, what do we expect? That he does not exist. We start with the assumption that the odds of success are less than one in a million. The ABX result then tells us that the odds of failure are less than 1/20. Conclusion: failure is still the most probable hypothesis.
That's why, in addition to all the statistical biases already mentioned above, we should not always take 1/20 or 1/100 as a target final p value. This is correct for tests where we don't expect one result more than another, but for tests where scientific knowledge already gives some information, smaller values can be necessary.
Personally, in order to test the existence of Superman, I'd rather target p < 1/100,000,000.
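One possible way to put numbers on this comparison is Bayes' rule. The sketch below is my own formalization, not something taken from the original French text: posterior_real_effect is a hypothetical name, the prior is the subjective probability that the effect is real before the test, and by default it assumes (optimistically) that a real effect always passes the test.

```python
def posterior_real_effect(prior: float, p_value: float, power: float = 1.0) -> float:
    """Probability that the effect is real, given that the test was passed (Bayes' rule).

    prior   -- subjective probability that the effect is real, before the test
    p_value -- probability of passing the test by chance when there is no effect
    power   -- probability of passing the test when the effect is real (assumed 1 here)
    """
    real_and_passed = prior * power
    chance_and_passed = (1.0 - prior) * p_value
    return real_and_passed / (real_and_passed + chance_and_passed)


print(posterior_real_effect(0.5, 0.05))   # MP3 at 96 kbps: about 0.95
print(posterior_real_effect(1e-6, 0.05))  # Superman at p = 1/20: about 0.00002
print(posterior_real_effect(1e-6, 1e-8))  # Superman at p = 1/100,000,000: about 0.99
```

Under these assumptions, the same p value of 1/20 makes the MP3 difference very probably real and leaves Superman very probably nonexistent, and only a much smaller p value tips the balance the other way.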
Examples of false positive results :
Regular ABX, 12/13 right answers by chance. (http://www.hydrogenaudio.org/forums/index.php?showtopic=6651&st=25&p=70284&#entry70284)
Sequential ABX, many results with p < 0.01 (http://www.hydrogenaudio.org/forums/index.php?showtopic=15192&st=0&p=151930&#entry151930)
This topic can be discussed here: http://www.hydrogenaudio.org/forums/index.php?showtopic=43516&hl=