Should HA promote a more rigorous listening test protocol?
Reply #47 – 2012-11-28 19:08:15
If the contenders are statistically tied, changing the anchors isn't going to magically untie them. Also, having only a few listeners and a few samples doesn't make for very compelling results, especially when the listeners are untrained.

My point was echoing David's about the potential to compress the range of ratings given to codecs vastly superior to the low anchor, in order to score the low anchor sufficiently low. This compression can introduce more rounding error into the ratings and widen the error bars. Closer anchors offer no magical effect, just a reduction in statistical noise that might improve discrimination at the margin (or at least should make it no worse).

I was suggesting that before the main test (which still has a lot of testers and a lot of samples), appropriately close anchors could be chosen by a short test on only a few samples, ruling out candidate anchors that are vastly superior or vastly inferior to the codecs under test.

I don't think Woodinville believes it essential that the anchors lie strictly outside the range of the codecs under test (i.e. consistently lower and higher); they could instead sit fairly consistently towards the low end and fairly consistently towards the high end, assuming we used two anchors.

I think Woodinville mentioned the trickiest thing to get right. If we presume that the nature of low-pass filter degradation is too different from the nature of typical codec flaws (warbling, sparklies, tonal problems, transient smear, pre-echo, stereo image problems etc.), then we'd be looking for anchors instead among other encoders and settings not under test, or from consistent distortions of a similar nature. For example, we might choose a prior-generation codec, even at a slightly higher bitrate, as a lowish anchor: LAME -V7, perhaps, or l3enc at 160 kbps -hq or 192 kbps, or toolame at 128 kbps, or FhG fastenc MP3 at a setting with Intensity Stereo rather than safe joint stereo.
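The rounding-error point can be illustrated with a quick simulation (a minimal sketch: the 0.1-step rating grid, the scores, and the listener-noise spread are invented for illustration, not taken from any real test). Squeezing the same pair of codecs into the top of the scale makes their true gap small relative to both listener noise and the rating grid:

```python
import random

random.seed(0)

def simulated_ratings(true_score, spread, n, step=0.1):
    """n noisy listener ratings around true_score, clipped to the
    1-5 scale and rounded to the rating tool's grid."""
    out = []
    for _ in range(n):
        r = random.gauss(true_score, spread)
        r = max(1.0, min(5.0, r))       # scale limits
        out.append(round(r / step) * step)  # grid rounding
    return out

def mean(xs):
    return sum(xs) / len(xs)

# Two codecs 0.5 apart when the whole scale is in use...
wide_gap = (mean(simulated_ratings(4.0, 0.4, 50))
            - mean(simulated_ratings(3.5, 0.4, 50)))

# ...and the same pair compressed into the top of the scale because a
# far-inferior low anchor pushed everything else upward: the true gap
# (0.1) is now comparable to the grid step and swamped by noise.
narrow_gap = (mean(simulated_ratings(4.85, 0.4, 50))
              - mean(simulated_ratings(4.75, 0.4, 50)))

print(wide_gap, narrow_gap)
```

The compressed pair also runs into the scale ceiling, which shrinks the observed gap further still.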
Perhaps a high anchor could be a previous test winner at a slightly higher bitrate where some flaws are still evident (so that it still acts as a positive control, i.e. remains distinguishable from the original audio). Certain encoders are so badly flawed that some testers will immediately identify them, so I suppose old Xing with no short blocks, or BLADEenc, would not be good choices.

It also partly depends on our intention in using these close anchors. If it's to compare one listening test's quality scale to another's, while avoiding simple low-pass filters, we might wish to use a consistent set of anchors (same codec version and settings) over a number of years, even if one is a high anchor in one test and a low anchor in the next. This is especially helpful if at least some of the test samples feature in every listening test.

Another potential use of anchors is to calibrate and normalise the quality scales used by different listeners, though the validity of this is questionable: some people find pre-echo more annoying than tonal problems, or stereo collapse less objectionable than high-frequency sparklies, while others have the reverse preferences. These differing preferences are part of the reason that results can be intransitive.

Once or twice, anchors have also been used to address a common claim or myth (e.g. that WMA at 64 kbps is as good as MP3 at 128 kbps). Some of guruboolez's 80-96 kbps tests used LAME at about 128 kbps as an anchor to assess where the truth lay at the time, to his ears.

I would say, however, that the methods of all the recent public tests are pretty darned good and provide useful information about the state of the art at the time. These discussions might enable some more nuanced conclusions to be drawn, and some comparison between the results of one test and another where the same anchor on the same samples received a different rating.
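The per-listener calibration idea can be made concrete as a linear rescale through the two shared anchors (a hypothetical sketch: the reference points, listener scores, and codec names below are invented, and as noted the whole approach is questionable when listeners weight different artefact types differently):

```python
# Assumed common reference points the anchors are mapped onto.
LOW_REF, HIGH_REF = 2.0, 4.5

def calibrate(ratings, low_anchor, high_anchor):
    """Linearly rescale one listener's ratings so that their scores
    for the shared anchors land on LOW_REF and HIGH_REF.

    ratings: dict of codec -> score on this listener's personal scale.
    """
    lo, hi = ratings[low_anchor], ratings[high_anchor]
    if hi <= lo:
        raise ValueError("anchors not ordered for this listener")
    scale = (HIGH_REF - LOW_REF) / (hi - lo)
    return {c: LOW_REF + (s - lo) * scale for c, s in ratings.items()}

# One invented listener who uses a wide personal scale.
listener = {"low_anchor": 1.5, "codec_x": 3.0,
            "codec_y": 4.0, "high_anchor": 4.8}
adjusted = calibrate(listener, "low_anchor", "high_anchor")
```

After calibration every listener's anchors coincide, so the codecs under test can be compared on a common scale; but the rescale only corrects for scale usage, not for genuinely different artefact preferences.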
However, given the statistical error, there are still limits on what we can conclude. We need to weigh up whether a change of method would gain enough to be worth the additional effort. That may be an individual matter for the test organiser to decide, given how much valuable work they put in already and how they balance the number of codecs under test against other parameters.