New Public Multiformat Listening Test (Jan 2014)
Reply #160 – 2013-12-13 22:40:24
If I interpret this correctly, instead of using 20 samples and a bunch (~14) of listeners, we could've used 65 samples with 2 listeners and gotten an equally accurate result (though one far less useful for the developers) with less than half the effort? That's pretty mind-blowing.

The point is that you get the same accuracy for the overall result (all samples taken together), while using only 2 listeners makes you lose significant information on a per-sample basis. But if the only question is "how can I minimize the error of the overall result?", i.e. find the best encoder on average, you can easily disregard that information.

So, semi-intuitively, this result seems understandable to me, but still mind-blowing indeed. That's statistics. :-)

That's why I advocate doing statistics only for each sample and leaving the interpretation towards overall quality to the user. I think this represents reality best, especially as the outcome for the various samples doesn't have the same meaning to every user. A person who is very sensitive to transients, for instance, will give those samples a much stronger weight than a person who is pretty insensitive to them.

I love the diagrams where the samples are shown on the x-axis and their average (and maybe further statistical) outcome on the y-axis, with the outcome for each encoder shown in a different color. It shows it all at a glance without any over-simplification. Even for readers who don't want to go into much detail, this diagram shows which encoders are attractive to use and which are not. Most important: this way, the information on per-sample performance is kept rather than aggregated into a single average plus additional statistical figures whose exact meaning is hardly understood by anybody, turning us all into believers.
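To make the variance argument concrete, here's a rough sketch assuming a simple random-effects model (rating = encoder quality + sample effect + listener effect + noise). The variance components are made-up illustrative numbers, not figures from this test:

```python
import math

# Illustrative variance components (assumed, not measured in the test)
var_sample   = 0.30   # sample-to-sample variance (typically the dominant term)
var_listener = 0.02   # listener-to-listener variance
var_noise    = 0.10   # residual per-rating noise

def se_overall_mean(n_samples, n_listeners):
    """Standard error of the grand mean across all samples and listeners."""
    return math.sqrt(var_sample / n_samples
                     + var_listener / n_listeners
                     + var_noise / (n_samples * n_listeners))

def se_per_sample_mean(n_listeners):
    """Standard error of a single sample's mean; depends only on listener count."""
    return math.sqrt((var_listener + var_noise) / n_listeners)

# 20 samples x 14 listeners (280 ratings) vs 65 samples x 2 listeners (130 ratings)
print(se_overall_mean(20, 14))   # ~0.13
print(se_overall_mean(65, 2))    # ~0.12  -> comparable overall accuracy, half the effort
print(se_per_sample_mean(14))    # ~0.09
print(se_per_sample_mean(2))     # ~0.24  -> per-sample information is largely lost
```

With numbers like these, the overall standard error barely changes, but the per-sample error roughly triples with only 2 listeners, which is exactly the trade-off described above.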
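And for the diagram: a minimal matplotlib sketch of that kind of plot, samples on the x-axis, per-sample mean rating on the y-axis, one color per encoder. The encoder names, means, and intervals below are invented placeholders, not results from the test:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = [f"sample{i:02d}" for i in range(1, 21)]
encoders = ["Encoder A", "Encoder B", "Encoder C"]

fig, ax = plt.subplots(figsize=(10, 4))
for enc in encoders:
    # Fake per-sample mean ratings on a 1..5 scale and fake confidence intervals
    means = np.clip(rng.normal(4.2, 0.4, len(samples)), 1, 5)
    ci = rng.uniform(0.1, 0.3, len(samples))
    ax.errorbar(samples, means, yerr=ci, marker="o", linestyle="-", label=enc)

ax.set_ylabel("mean rating (1-5)")
ax.set_ylim(1, 5)
ax.legend()
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```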