Concerning bias in listening tests due to variance of codec bitrates.
The table of codec bitrates for the previous HA@96 listening test shows that the resulting bitrates of the vbr encoders are not equal on the selected set of sound samples (the test set).
Nero CVBR TVBR FhG CT low_anchor
[per-sample bitrate rows missing; only the mean bitrates, in kbps, are shown]
------------------------------------------------------------
Mean 94.9 100.9 93.45 100.4 100.0 99.6
It looks like everybody understands that such inequality favors some codecs in the listening test. At least it is not a secret, and IgorC mentioned it here.
Let's define the issue more clearly. We have the table of codec per-sample bitrates (above) and the table of codec per-sample scores:
Nero CVBR TVBR FhG CT low_anchor
Sample01 3.64 4.22 4.69 4.23 3.71 1.60
Sample02 4.05 4.47 4.13 4.52 3.46 1.41
Sample03 3.30 3.51 3.24 3.34 3.20 1.60
Sample04 3.57 4.52 4.55 4.73 4.41 2.42
Sample05 4.04 4.53 4.54 3.97 4.43 1.33
Sample06 4.19 4.58 4.59 4.62 4.65 1.52
Sample07 3.65 4.10 4.32 4.53 3.85 1.47
Sample08 3.83 4.62 4.41 4.49 4.18 1.67
Sample09 3.62 4.27 4.26 4.72 3.91 1.60
Sample10 3.66 4.30 4.34 4.24 4.26 1.72
Sample11 3.82 4.28 4.21 3.96 4.13 1.58
Sample12 3.48 4.67 4.37 4.35 3.81 1.48
Sample13 4.13 4.54 4.64 4.08 4.24 1.50
Sample14 3.42 4.32 4.40 4.29 4.10 1.34
Sample15 3.60 4.54 4.72 4.18 3.69 1.51
Sample16 3.92 4.70 4.52 3.98 4.26 1.44
Sample17 3.85 4.41 4.55 4.49 4.57 1.32
Sample18 3.67 4.79 4.37 5.00 4.83 1.42
Sample19 3.08 4.26 3.78 4.11 3.96 1.25
Sample20 3.34 4.72 4.65 3.43 3.88 1.27
------------------------------------------------------------
Mean 3.69 4.42 4.36 4.26 4.08 1.52
For each sound sample we can calculate the correlation coefficient between the bitrates and the corresponding scores of the four vbr encoders (the first four columns). These twenty coefficients are below:
Sample01 0.6454
Sample02 0.6352
Sample03 0.7327
Sample04 0.2685
Sample05 -0.3851
Sample06 0.6219
Sample07 0.5927
Sample08 0.2423
Sample09 0.7509
Sample10 0.8660
Sample11 -0.4295
Sample12 0.6259
Sample13 0.6286
Sample14 0.7710
Sample15 0.5018
Sample16 0.1358
Sample17 -0.5315
Sample18 0.8167
Sample19 -0.4780
Sample20 0.2855
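For reference, a minimal Python sketch of this per-sample calculation. The scores are Sample01's row from the score table above; the bitrate values are hypothetical placeholders, since the per-sample bitrates are not reproduced here:

[code]import numpy as np

# Sample01 scores for the four vbr encoders (Nero, CVBR, TVBR, FhG),
# taken from the score table above.
scores = np.array([3.64, 4.22, 4.69, 4.23])

# Hypothetical per-sample bitrates (kbps) for the same four encoders;
# in the real calculation these come from the bitrate table.
bitrates = np.array([93.1, 101.5, 95.0, 99.8])

# Pearson correlation coefficient between bitrates and scores.
r = np.corrcoef(bitrates, scores)[0, 1]
print(f"Sample01: r = {r:.4f}")[/code]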
Bootstrapping the mean of these coefficients shows strong evidence of correlation between bitrates and scores: the bootstrapped means lie significantly far from zero. In simple words, the final scores depend on the resulting bitrates. This is a bias.
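A minimal sketch of such a bootstrap, using the twenty coefficients above; the 10,000 resamples and the 95% percentile interval are my own choices for illustration:

[code]import numpy as np

# The twenty per-sample correlation coefficients listed above.
r = np.array([0.6454, 0.6352, 0.7327, 0.2685, -0.3851, 0.6219, 0.5927,
              0.2423, 0.7509, 0.8660, -0.4295, 0.6259, 0.6286, 0.7710,
              0.5018, 0.1358, -0.5315, 0.8167, -0.4780, 0.2855])

rng = np.random.default_rng(0)
n_boot = 10_000

# Resample the coefficients with replacement and take the mean each time.
boot_means = np.array([rng.choice(r, size=r.size, replace=True).mean()
                       for _ in range(n_boot)])

# 95% percentile confidence interval for the mean coefficient.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean r = {r.mean():.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")[/code]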
Once again, it seems that people here are well aware of this dependence but prefer to think that the bias is acceptable and even justified by the “nature of vbr encoding”. The reasoning goes that target bitrates should be calculated over as big and varied a music library as possible, and that the inevitable inequality of bitrates on the test set is a consequence of the encoders' natural behavior and should be kept. So if a codec consumes more bits on this particular test set, it is presumably smart enough to spot problem samples and raise the bitrate for them to preserve the required quality.

That is a valid hypothesis, but there is an alternative one: the codec requires more bits than the other contenders on this test set because its vbr algorithm is less efficient. You cannot choose between these hypotheses until you get the perceptual quality scores. The variance of bitrates by itself (without scores) can be interpreted both ways: as the smart decision of an efficient vbr codec, or as the protective response of a poor one. In other words, the variation of bitrates carries no useful meaning on its own; it is just random variation that adds noise to the results of the test. The noise is so heavy (the maximum difference between mean bitrates is (100.9 − 93.45) / 93.45 ≈ 8%) that all the punctiliousness with the calculation of p-values looks rather funny.
Consequently, if we want to compare the efficiency of vbr codecs, their target bitrates on the test set should be set as close to each other as possible (s0). If this is not possible (due to discrete q-values), the goals of the listening test should be redefined, because the test no longer compares the efficiency of the vbr algorithms; it compares the perceived quality of particular encoder settings. Such a test can be very useful as well; the only question is how to choose those particular settings. Several options could be proposed:
[blockquote](s1) natural (integer) settings; the results are easy to interpret and use.
(s2) settings that produce equal bitrates on music of some genre (classic rock, for example) or on some predefined mix of genres; while one genre is acceptable to some extent, any mixture of genres makes the interpretation of results less clear.
(s3) settings that produce equal bitrates on the personal music library of Bob; the results are perfectly useful for Bob.
(s4) settings that produce equal bitrates on the combined personal music libraries of Bob and Alice; the results are less useful for both Bob and Alice, and increasing the number of participants worsens the usefulness further.
(s5) settings that produce equal bitrates for the whole population of music; the results are useful to nobody, because it is hard to tell how your particular music (the music you usually deal with) fits into that universe and how your particular bitrates relate to those “global” ones.[/blockquote]
Furthermore, the calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like: what size it has, what structure, how it changes over time, or how to get access to all of it. “The whole music universe” is an absolutely unscientific quantity; we can only guess at some of its properties. The one thing we can be sure of is that it is not homogeneous: it is structured by genres at least.

And here comes the main problem with the calculation of “global” bitrates. The calculation rests on the assumption that, as the amount of music material gradually increases, the final bitrate of a codec tends to some definite value. That would be solid ground if we could select tracks randomly from the population, but this is impossible in practice; it would take tons of research. In reality we calculate bitrates using some limited music material that a few people had at hand at the moment. If we add a good portion of classical music, the values will change; if we then add a proportional amount of space ambient, they will change again. With such restricted access to the population of music, this process is practically endless and does not converge to any final value. So the bitrates calculated this way can safely be considered random, because we cannot even estimate how far they are from the true “global” bitrates.
Anyway, even if we managed to accomplish this task and calculate those “global” bitrates, they would have no practical meaning at all, as already explained. Thus calculating such bitrates (and the corresponding encoder settings) from aggregated music material (even all of it) makes no practical sense. It is just a very sophisticated way of choosing a random bias for a listening test.
One more method should be mentioned for completeness (s6): the settings can be tuned for each sound sample so that every encoder produces the same bitrate on it. Such a test would be perfectly valid, as it would show how efficiently each encoder uses the same amount of bits on each sample. Unfortunately, this method is suitable only for encoders with a continuous q-parameter scale.
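For illustration, a minimal sketch of such per-sample tuning, assuming a continuous q-parameter and a bitrate that grows monotonically with q; encode_bitrate() is a hypothetical stand-in for a real encoder call, not any actual tool:

[code]def encode_bitrate(sample, q):
    # Hypothetical stand-in: encode `sample` at quality q and return the
    # resulting bitrate in kbps. Toy monotone model for demonstration only;
    # replace with a call to the real encoder.
    return 60.0 + 80.0 * q

def tune_q(sample, target_kbps, q_lo=0.0, q_hi=1.0, tol_kbps=0.1):
    # Bisect the continuous q-parameter until the bitrate for this
    # sample is within tol_kbps of the target.
    while q_hi - q_lo > 1e-6:
        q = (q_lo + q_hi) / 2.0
        kbps = encode_bitrate(sample, q)
        if abs(kbps - target_kbps) <= tol_kbps:
            return q
        if kbps < target_kbps:
            q_lo = q
        else:
            q_hi = q
    return (q_lo + q_hi) / 2.0

print(tune_q("Sample01", target_kbps=96.0))  # q that yields ~96 kbps[/code]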
My conclusions. There are only two reasonable ways of setting up vbr encoders for a listening test:
[blockquote](s0) settings that provide equal bitrates for all encoders on the selected test set; in this case the listening test compares the efficiency of the vbr algorithms, and the closer the bitrates, the more accurate the results (less noise due to the variance of bitrates).
(s1) natural (integer) settings; in this case the test compares particular (popular) settings of the encoders (in many cases the results can be bias-corrected afterwards; if so (this needs research), there is still a chance to make inferences about the efficiency of the encoders).[/blockquote]