Some statistical analysis by AMTuring about this listening test...
Oh please, can we keep the crackpot science out of this forum? This person wasn't banned here for no reason. Just juggling around scientific words doesn't magically make anything you say sensible, let alone correct.
It is well known that some songs are more difficult to encode than others, and they result in lower quality encoded files regardless of the encoder used. So the assumption of equal means amongst experiments is violated.
Bzzzt. This was a VBR test. Meaning, although the average was 128kbps (or slightly more), the codecs could spend as many bits as necessary to keep all clips at a constant quality. This means you cannot immediately assume the means aren't equal; in fact, it should be the opposite.
So, what happens if we actually look at the data? (Note that he provides many graphs to 'illustrate' his points, except for the ones where, well, the data doesn't support his claims at all.) The variance of the means of the samples is much smaller than the differences between the codecs themselves (excluding Shine, which is CBR).
In other words, VBR works. I would have thought that that was "well known" by now.
The following table shows the Tukey HSD applied to the ranks.
Say what? You cannot apply plain Tukey HSD to rank scores; it's a parametric test. Now, I'm willing to argue that we shouldn't use parametric analysis (because the top end of the results clips at 5.0, and you can see this by observing that the lower rated the codec, the higher the variance). However, if anything, parametric analysis gives stronger results. If you use rank scores, let's actually use the rank score version of Tukey HSD to analyze the results:
FRIEDMAN version 1.24 (Jan 17, 2002) [url=http://ff123.net/]http://ff123.net/[/url]
Nonparametric Tukey HSD analysis
Number of listeners: 18
Critical significance: 0.05
Nonparametric Tukey's HSD: 25.894
Ranksums:
Vorbis   iTunes   WMA      Nero     LAME
 73.00    62.50    49.50    48.50    36.50
-------------------------- Difference Matrix --------------------------
         iTunes   WMA      Nero     LAME
Vorbis   10.500   23.500   24.500   36.500*
iTunes            13.000   14.000   26.000*
WMA                         1.000   13.000
Nero                                12.000
-----------------------------------------------------------------------
Vorbis is better than LAME
iTunes is better than LAME
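For anyone who wants to check the output above by hand, here is a minimal sketch that takes the rank sums and the critical difference exactly as the tool printed them and flags the pairs whose difference exceeds it. The function name `significant_pairs` is made up for this illustration, and the sketch does not recompute the critical value itself; it only redoes the comparison step.

```python
# Rank sums and critical difference copied from the FRIEDMAN output above.
codecs = ["Vorbis", "iTunes", "WMA", "Nero", "LAME"]
rank_sums = [73.0, 62.5, 49.5, 48.5, 36.5]
critical_difference = 25.894  # "Nonparametric Tukey's HSD" as reported

def significant_pairs(names, sums, critical):
    """Return (better, worse, difference) for each pair whose rank-sum
    difference exceeds the critical difference. Hypothetical helper."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            diff = abs(sums[i] - sums[j])
            if diff > critical:
                if sums[i] > sums[j]:
                    pairs.append((names[i], names[j], diff))
                else:
                    pairs.append((names[j], names[i], diff))
    return pairs

result = significant_pairs(codecs, rank_sums, critical_difference)
for better, worse, diff in result:
    print(f"{better} is better than {worse} (difference {diff:.3f})")
```

Only the two pairs involving LAME clear the 25.894 threshold; every other difference in the matrix falls short of it.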
Gee, where did those "extra" conclusions go?
Let's compare this to the means with parametric Tukey HSD:
FRIEDMAN version 1.24 (Jan 17, 2002) [url=http://ff123.net/]http://ff123.net/[/url]
Tukey HSD analysis
Number of listeners: 18
Critical significance: 0.05
Tukey's HSD: 0.110
Means:
Vorbis   iTunes   WMA      Nero     LAME
  4.79     4.74     4.70     4.68     4.60
-------------------------- Difference Matrix --------------------------
         iTunes   WMA      Nero     LAME
Vorbis    0.049    0.090    0.106    0.193*
iTunes             0.041    0.056    0.143*
WMA                         0.016    0.103
Nero                                 0.087
-----------------------------------------------------------------------
Vorbis is better than LAME
iTunes is better than LAME
Coincidence? Hardly. If you derive the rank scores from the means, how can you expect a different conclusion? How would you expect throwing away information to increase the significance? It won't, unless you use a completely wrong analysis method.
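To make the "throwing away information" point concrete, here is a small sketch of turning one listener's scores into ranks, with ties sharing the average rank. The score lists are invented purely for illustration (they are not from the test), and `average_ranks` is a hypothetical helper, not code from the FRIEDMAN tool.

```python
def average_ranks(scores):
    """Rank scores so the lowest gets rank 1; tied scores share the
    average of the ranks they would occupy. Hypothetical helper."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

# Two made-up listeners: one hears tiny differences, one hears huge ones.
subtle_listener = [4.8, 4.7, 4.7, 4.5, 4.0]
harsh_listener = [5.0, 3.0, 3.0, 2.0, 1.0]

print(average_ranks(subtle_listener))
print(average_ranks(harsh_listener))
```

Both listeners end up with identical ranks even though the size of their score differences is completely different. The ranks keep the ordering and drop the magnitudes, which is why a correct rank-based analysis has no extra information to draw additional conclusions from.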