Topic: New Public Multiformat Listening Test (Jan 2014)

New Public Multiformat Listening Test (Jan 2014)

Reply #300
Update

Well, Plan B (~100 kbps)?

New Public Multiformat Listening Test (Jan 2014)

Reply #301
IMO the average bitrate of an encoder setting should be taken from a test set. [...]

Why are you saying this now and not a few days ago?
It's contrary to what you said a few days ago:
I prefer using a test set of hopefully representative music for deciding upon which settings to use.

How do you expect people to follow your posts if you change your mind so quickly? Huh?

??? It was/is meant to be exactly the same thing. I must have expressed myself badly.
lame3995o -Q1.7 --lowpass 17

New Public Multiformat Listening Test (Jan 2014)

Reply #302
I split the results between the two past listening tests. The samples used in the 64 kbps multiformat test are more critical than the AAC 96 kbps samples.

2011 AAC 96kbps 20 samples
103112 34.5 qaac_2.32 --cvbr 96 -o %o %i
94985 34.1 qaac_2.32 --tvbr 45 -o %o %i
108696 35.5 ffmpeg_r59211 -i %i -c:a libfdk_aac -vbr 3 %o
94132 43.2 0.1.8-win32\opusenc --bitrate 88 %i %o
96191 43.3 0.1.8-win32\opusenc --bitrate 90 %i %o
98257 42.7 0.1.8-win32\opusenc --bitrate 92 %i %o
100366 40.4 0.1.8-win32\opusenc --bitrate 94 %i %o
102434 41.7 0.1.8-win32\opusenc --bitrate 96 %i %o
103480 41.8 0.1.8-win32\opusenc --bitrate 97 %i %o
90855 24.7 venc(aoTuV 6.03) -q1.99 %i %o
97229 24.4 venc(aoTuV 6.03) -q2 %i %o
98033 24.5 venc(aoTuV 6.03) -q2.1 %i %o
99510 24.3 venc(aoTuV 6.03) -q2.2 %i %o
101652 24.3 venc(aoTuV 6.03) -q2.4 %i %o

2011 Multiformat 64kbps 30 samples
104653 33.9 qaac_2.32 --cvbr 96 -o %o %i
101608 34.1 qaac_2.32 --tvbr 45 -o %o %i
115220 35.8 ffmpeg_r59211 -i %i -c:a libfdk_aac -vbr 3 %o
101793 41.9 0.1.8-win32\opusenc --bitrate 88 %i %o
104016 41.7 0.1.8-win32\opusenc --bitrate 90 %i %o
106244 41.8 0.1.8-win32\opusenc --bitrate 92 %i %o
108471 41.9 0.1.8-win32\opusenc --bitrate 94 %i %o
110690 41.9 0.1.8-win32\opusenc --bitrate 96 %i %o
111806 42.0 0.1.8-win32\opusenc --bitrate 97 %i %o
102545 24.5 venc(aoTuV 6.03) -q1.99 %i %o
110032 24.4 venc(aoTuV 6.03) -q2 %i %o
110863 24.4 venc(aoTuV 6.03) -q2.1 %i %o
112330 24.5 venc(aoTuV 6.03) -q2.2 %i %o
114934 24.6 venc(aoTuV 6.03) -q2.4 %i %o
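
For reference, a minimal sketch of how such averages can be computed (the encoder command, directory layout and per-file averaging convention are assumptions for illustration, not necessarily the exact procedure used for the tables above):

Code:
# Sketch: average bitrate of one encoder setting over a set of .wav samples.
import subprocess
import wave
from pathlib import Path

SAMPLES = Path("samples")                    # decoded .wav test samples (assumed layout)
ENCODER = ["opusenc", "--bitrate", "96"]     # any setting from the tables above

def duration_seconds(path: Path) -> float:
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()

rates = []
for wav in sorted(SAMPLES.glob("*.wav")):
    out = wav.with_suffix(".opus")
    subprocess.run(ENCODER + [str(wav), str(out)], check=True)
    # bitrate = encoded size in bits / sample duration, in kbps
    rates.append(out.stat().st_size * 8 / duration_seconds(wav) / 1000)

print(f"mean bitrate over the test set: {sum(rates) / len(rates):.1f} kbps")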

New Public Multiformat Listening Test (Jan 2014)

Reply #303
The bitrate verification is nearly complete.

A spreadsheet.

Thank you to the participants for their help. Well, it's pretty clear which settings to use.


The following settings will be used:
LAME 3.99.5 -V5
Apple AAC CVBR 96 - 101.5 kbps
Opus 1.1 --bitrate 96 kbps - 101.7 kbps
Vorbis aoTuV b6.03 -q2.2 - 101.5 kbps
+middle-low anchor  FAAC 96 kbps
+low anchor



Agenda
Now a discussion about test samples is open: how to choose them, quantity, etc. You can submit your own samples as well.
The holidays are near, but we will still have time to choose samples, until January 5 or so.

New Public Multiformat Listening Test (Jan 2014)

Reply #304
Great work, everyone. I look forward to participating.

Merry Christmas
Scott

New Public Multiformat Listening Test (Jan 2014)

Reply #305
This is what I believe to be a better version, but with the same conclusion:

New Public Multiformat Listening Test (Jan 2014)

Reply #306
Agenda
Now a discussion about test samples is open: how to choose them, quantity, etc. You can submit your own samples as well.
The holidays are near, but we will still have time to choose samples, until January 5 or so.


What length of sample should we target?  I think 30s is a pretty hard upper-limit, but I'd prefer things down around 10-12s.  I personally have a hard time comparing longer samples.

New Public Multiformat Listening Test (Jan 2014)

Reply #307
Yes, I would say no less than 8 seconds and no more than 10-12.

P.S. The first 1-2 seconds are cut. So 10 seconds should be fine.

P.P.S.: Anyway, it's open for discussion.

New Public Multiformat Listening Test (Jan 2014)

Reply #308
I think a length of 10 seconds would give too much of an advantage to Opus. The file would be around 120 KB, and the Vorbis header alone is a few kilobytes. Make it 20 or 30 seconds. Testers don't need to listen to the entire sample in every ABX session; sometimes the difference is obvious much sooner.
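
(Rough arithmetic behind that estimate: at ~100 kbps, a 10-second clip is about 100,000 bit/s × 10 s ÷ 8 ≈ 125 KB, so a header of, say, 4 KB is roughly 3% overhead, while for a 30-second clip the same header is closer to 1%.)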

New Public Multiformat Listening Test (Jan 2014)

Reply #309
It's possible to encode a 30-second sample and then indicate a trim to the first 10 seconds (an additional offset) in the ABC/HR Java program, or use any other application to cut the decoded .wav.
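
A minimal sketch of the second option, cutting the decoded .wav in Python (the soundfile package and the exact offsets are assumptions for illustration):

Code:
# Sketch: keep seconds 2-12 of a decoded sample (offsets are illustrative).
import soundfile as sf   # assumed third-party dependency

data, sr = sf.read("sample_decoded.wav")
trimmed = data[2 * sr : 12 * sr]     # skip the first 2 s, keep the next 10 s
sf.write("sample_trimmed.wav", trimmed, sr)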

New Public Multiformat Listening Test (Jan 2014)

Reply #310
During the previous public test a large list of samples was made. 
Then 20 samples were randomly picked.

Everybody is welcome to submit samples in the upload thread, Samples for a new multiformat public test.


Also, a few items to talk about:
Quantity of samples
Sample duration
What proportions of new samples and killer samples to include

New Public Multiformat Listening Test (Jan 2014)

Reply #311
During the previous public test a large list of samples was made. 
Then 20 samples were randomly picked.


I like the way the samples were chosen for the previous test. Specifically I like that there were buckets for the different types of music and a few were randomly selected from each bucket.  I don't think we should include speech buckets this time around as 100kbps seems an unlikely bitrate for speech encoding.  I would like to see a bucket for music without instrumentals though.  A single voice and multi-voice song / chant / a cappella would be nice.
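
For illustration, picking a couple of samples per bucket could look like the sketch below (bucket names, file names and counts are made up):

Code:
# Sketch: pick a fixed number of samples at random from each genre bucket.
import random

buckets = {
    "vocal_only": ["chant1.wav", "acapella2.wav", "choir3.wav"],
    "electronic": ["edm1.wav", "ambient2.wav", "idm3.wav"],
    "classical":  ["harpsichord1.wav", "strings2.wav", "piano3.wav"],
}

rng = random.Random(2014)   # fixed seed so the draw is reproducible
picked = [s for names in buckets.values() for s in rng.sample(names, 2)]
print(picked)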

Regarding the number of samples in total, I think we should aim high.  I suspect at 100kbps a large number of the samples will be indistinguishable.  I think having a large total number of samples will increase the chance that we still get usable data.  If we make it clear that not hearing a difference is an okay and expected result, then I don't think it unduly strains the listener to have a large number of test samples.

The area I don't have a good feeling for is the killer samples.  I absolutely want them included. I'd much rather use a codec that produces barely detectable differences 20% of the time instead of a codec that's indistinguishable 90+% of the time but has clearly audible problems when it does falter.

I just don't have a sense how to include the killer samples fairly.  If we include 4 mp3 killers and 1 opus killer, does that penalize mp3 4 times as much?  Or is that fair if mp3 runs into trouble 4 times as often?  I was thinking maybe we could have killer buckets that we pick from evenly (1 or 2 samples each): opus killers, mp3 killers, aac killers, vorbis killers. 

Do we have aac killers?  Sample #7 from the 2011 test gave both aac encoders trouble while opus and vorbis did well.  Sample #14 and #29 seemed to give Vorbis the most trouble while not causing as many problems for opus and aac.  The harpsichord sample #2 seemed to be opus 1.0's achilles heel.

New Public Multiformat Listening Test (Jan 2014)

Reply #312
It's possible to encode a 30-second sample and then indicate a trim to the first 10 seconds (an additional offset) in the ABC/HR Java program, or use any other application to cut the decoded .wav.

I'm not sure why we would not just trim the sample's wav before encoding. Is the concern that some encoders "come up to speed" over 1-2 s and would be unfairly penalized by including the first couple of seconds in the test samples? Also, I am assuming all the test clips will be converted back to wav (FLAC) after encoding, so headers and file sizes should have no effect.

I think we should take care to normalize the volume of the clips over the specific range that will be tested.
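
One simple way to do that, as a sketch (plain peak normalization; the organizers may well use a loudness-based method instead, and the target level and file layout here are assumptions):

Code:
# Sketch: peak-normalize a set of clips to a common level (-1 dBFS here).
import numpy as np
import soundfile as sf
from pathlib import Path

TARGET_PEAK = 10 ** (-1 / 20)        # -1 dBFS as a linear factor

for wav in Path("clips").glob("*.wav"):
    data, sr = sf.read(str(wav))
    peak = np.max(np.abs(data))
    if peak > 0:
        data = data * (TARGET_PEAK / peak)
    sf.write(str(wav.with_name(wav.stem + "_norm.wav")), data, sr)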

New Public Multiformat Listening Test (Jan 2014)

Reply #313
I would like to see:

64k, 96k and 128k for

  • Opus 1.1.
  • AAC HE/LC: Apple and Fraunhofer (FDK) codecs.
  • Ogg Vorbis.
  • Sony ATRAC3+ (yeah, I'm crazy).


Testing bitrates at or above 192k seems pointless, since all these codecs provide almost 100% transparency at high bitrates.

I don't want to see LAME MP3, since at bitrates below 128k it struggles to provide decent quality.

 

New Public Multiformat Listening Test (Jan 2014)

Reply #314
I don't think we should include speech buckets this time around as 100kbps seems an unlikely bitrate for speech encoding.  I would like to see a bucket for music without instrumentals though.  A single voice and multi-voice song / chant / a cappella would be nice.

Video streaming sites, e.g. YouTube, use 96 kbps for the default resolution (360p). In many cases that content is speech.
Also, people were interested to see how well the codecs did on speech during the last test.

I would like to see a bucket for music without instrumentals though.  A single voice and multi-voice song / chant / a cappella would be nice.

Agreed.
Good suggestion.

I just don't have a sense how to include the killer samples fairly.  If we include 4 mp3 killers and 1 opus killer, does that penalize mp3 4 times as much?  Or is that fair if mp3 runs into trouble 4 times as often?  I was thinking maybe we could have killer buckets that we pick from evenly (1 or 2 samples each): opus killers, mp3 killers, aac killers, vorbis killers. 

Do we have aac killers?  Sample #7 from the 2011 test gave both aac encoders trouble while opus and vorbis did well.  Sample #14 and #29 seemed to give Vorbis the most trouble while not causing as many problems for opus and aac.  The harpsichord sample #2 seemed to be opus 1.0's achilles heel.

Killer samples have some characteristics in common: they can contain sharp transients, pure tones, wide stereo separation, or any combination of these. So it's possible to identify them without targeting one particular codec.
What if, instead of submitting killer samples for particular codecs, we prepare two lists, one of somewhat-hard samples and one of killer samples? Then we randomly choose the test samples in a proportion of approximately 80/20 (?) somewhat-hard to killer samples.

I think we should take care to normalize the volume of the clips over the specific range that will be tested.

Yes, normalization is always done.


New Public Multiformat Listening Test (Jan 2014)

Reply #316
And we had only 2 speech samples last time: one male English and one female English (singing), samples 06 and 18. That doesn't hurt.


New Public Multiformat Listening Test (Jan 2014)

Reply #317
Concerning the bias of a listening test due to variance of codec bitrates.

The table of codec bitrates for the previous HA@96 listening test shows that the resulting bitrates of the VBR encoders are not equal for the selected test set of sound samples (the test set).

Code:
                Nero	CVBR	TVBR	FhG	CT	low_anchor
(per-sample bitrate rows omitted)
------------------------------------------------------------
Mean (kbps) 94.9 100.9 93.45 100.4 100.0 99.6
It looks like everybody understands that such inequality favors some codecs in the listening test. At least this is not a secret, and IgorC mentioned that here.

Let's define the issue more clearly. We have the table of codec per-sample bitrates (above) and the table of codec per-sample scores:

Code:
                Nero	CVBR	TVBR	FhG	CT	low_anchor
Sample01 3.64 4.22 4.69 4.23 3.71 1.60
Sample02 4.05 4.47 4.13 4.52 3.46 1.41
Sample03 3.30 3.51 3.24 3.34 3.20 1.60
Sample04 3.57 4.52 4.55 4.73 4.41 2.42
Sample05 4.04 4.53 4.54 3.97 4.43 1.33
Sample06 4.19 4.58 4.59 4.62 4.65 1.52
Sample07 3.65 4.10 4.32 4.53 3.85 1.47
Sample08 3.83 4.62 4.41 4.49 4.18 1.67
Sample09 3.62 4.27 4.26 4.72 3.91 1.60
Sample10 3.66 4.30 4.34 4.24 4.26 1.72
Sample11 3.82 4.28 4.21 3.96 4.13 1.58
Sample12 3.48 4.67 4.37 4.35 3.81 1.48
Sample13 4.13 4.54 4.64 4.08 4.24 1.50
Sample14 3.42 4.32 4.40 4.29 4.10 1.34
Sample15 3.60 4.54 4.72 4.18 3.69 1.51
Sample16 3.92 4.70 4.52 3.98 4.26 1.44
Sample17 3.85 4.41 4.55 4.49 4.57 1.32
Sample18 3.67 4.79 4.37 5.00 4.83 1.42
Sample19 3.08 4.26 3.78 4.11 3.96 1.25
Sample20 3.34 4.72 4.65 3.43 3.88 1.27
------------------------------------------------------------
Mean 3.69 4.42 4.36 4.26 4.08 1.52

For each sound sample and the four VBR encoders (first four columns) we can calculate the coefficient of correlation between the bitrates and the corresponding scores. These twenty coefficients are below:

Code:
Sample01    0.6454
Sample02    0.6352
Sample03    0.7327
Sample04    0.2685
Sample05  -0.3851
Sample06    0.6219
Sample07    0.5927
Sample08    0.2423
Sample09    0.7509
Sample10    0.8660
Sample11  -0.4295
Sample12    0.6259
Sample13    0.6286
Sample14    0.7710
Sample15    0.5018
Sample16    0.1358
Sample17  -0.5315
Sample18    0.8167
Sample19  -0.4780
Sample20    0.2855
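
For reference, a minimal sketch of that per-sample calculation (the bitrate rows are placeholders for illustration, since only the mean bitrates are reproduced above; the score rows are taken from the table):

Code:
# Sketch: per-sample Pearson correlation between the four VBR encoders'
# bitrates and their scores. The bitrate values are placeholders.
import numpy as np

# rows = samples, columns = Nero, CVBR, TVBR, FhG
bitrates = np.array([[94.0, 101.2, 92.8, 100.9],   # Sample01 (placeholder)
                     [96.3,  99.8, 94.1, 101.5]])  # Sample02 (placeholder)
scores   = np.array([[3.64, 4.22, 4.69, 4.23],     # Sample01
                     [4.05, 4.47, 4.13, 4.52]])    # Sample02

for i, (b, s) in enumerate(zip(bitrates, scores), start=1):
    r = np.corrcoef(b, s)[0, 1]      # Pearson r over the four encoders
    print(f"Sample{i:02d}  {r: .4f}")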

And here is the bootstrap mean of these coefficients:


We can see strong evidence of correlation between bitrates and scores (all means are significantly far from zero). In simple words, the final scores depend on the resulting bitrates. This is a bias.
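
A minimal sketch of such a bootstrap of the mean, using the twenty coefficients listed above (the number of resamples is an arbitrary choice):

Code:
# Sketch: bootstrap distribution of the mean of the twenty per-sample
# correlation coefficients. 10,000 resamples is an arbitrary choice.
import numpy as np

r = np.array([0.6454, 0.6352, 0.7327, 0.2685, -0.3851, 0.6219, 0.5927,
              0.2423, 0.7509, 0.8660, -0.4295, 0.6259, 0.6286, 0.7710,
              0.5018, 0.1358, -0.5315, 0.8167, -0.4780, 0.2855])

rng = np.random.default_rng(0)
boot_means = np.array([rng.choice(r, size=r.size, replace=True).mean()
                       for _ in range(10_000)])

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean r = {r.mean():.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")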

Once again, it seems that people here are well aware of this dependence but prefer to think that this bias is acceptable and even justified by the “nature of VBR encoding”. The view is that target bitrates should be calculated using as big and varied a music library as possible, and that the inevitable inequality of bitrates on the test set is a consequence of the encoders' natural behavior and should be kept. So if a codec consumes more bits on this particular test set, it is presumably considered smart enough to spot problem samples and increase the bitrate for them to preserve the required quality. That is a valid hypothesis, but there is an alternative one: the codec requires more bits than the other contenders on this test set because its VBR algorithm is less efficient. You can't choose which hypothesis is true until you get the scores of perceptual quality. The variance of bitrates by itself (without scores) can be interpreted both ways, as the smart decision of an efficient VBR codec or as the protective response of a poor one. In other words, the variation of bitrates by itself has no useful meaning; it is just random variation that introduces noise into the results of the test. The noise is so heavy (the maximum difference between bitrates is 8%) that all the punctiliousness about calculating p-values looks almost funny.
 
Consequently, if we want to compare the efficiency of VBR codecs, their target bitrates on the test set should be set as close to each other as possible (s0). If this is not possible (due to discrete q-values), the goals of the listening test should be redefined, because the test no longer compares the efficiency of their algorithms but rather the perceived quality of particular settings of the encoders. Such a test can be very useful as well; the only question is how to choose the particular settings. Several options could be proposed:
(s1) natural (integer) settings; results are easy to interpret and use.

(s2) settings that produce equal bitrates for music of some genre (classic rock, for example) or some predefined mix of genres; while one genre is acceptable to some extent, any mixture of them makes interpretation of the results less clear.

(s3) settings that produce equal bitrates for the personal music library of Bob; results are perfectly useful for Bob.

(s4) settings that produce equal bitrates for the combined personal music libraries of Bob and Alice; results are less useful for both Bob and Alice; increasing the number of participants worsens the usefulness further.

(s5) settings that produce equal bitrates for the whole population of music; results are useful for nobody, because it's hard to tell how your particular music (the music you usually deal with) relates to that universe and how your particular bitrates relate to those “global” ones.
Furthermore, calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like: what size it has, what structure, how it changes over time, and how to get access to all of it. “The whole music universe” is an absolutely unscientific quantity; we can only guess at some of its properties. The one thing we can be sure of is that it is not homogeneous; it is structured by genres at least. And here comes the main problem with calculating “global” bitrates. The calculation is based on the assumption that, as the amount of music material gradually increases, the final bitrate of a codec tends towards some certain value. That would be perfect ground if we could select tracks randomly from the population. But this is impossible in practice; it would take an enormous amount of research. In reality we calculate bitrates using some limited music material that a few people had at hand at the moment. If we add a good portion of classical music the values will change; if we add a proportional amount of space ambient the values will change again. With only restricted access to the population of music, this process is practically endless and does not lead to any final value. So the bitrates calculated this way can safely be considered random, because we can't even estimate how far they are from the true “global” bitrates.

Anyway, even if we could manage to accomplish this task and calculate those “global” bitrates, they would have no practical meaning at all, as already explained. Thus, calculating the bitrates (and the corresponding encoder settings) using aggregated music material (even all of it) makes no practical sense. It is just a very sophisticated way of choosing a random bias for a listening test.

One more method should be mentioned for completeness (s6): settings can be tuned for each sound sample to produce the same bitrate. Such a test would be perfectly valid, as it would show how efficiently each encoder uses the same amount of bits on each sample. Unfortunately this method is suitable only for encoders with a continuous q-parameter scale.
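
A minimal sketch of that per-sample tuning via bisection (the encoder invocation mirrors the venc commands in the tables earlier in the thread, but the q range, tolerance and monotonicity assumption are illustrative):

Code:
# Sketch of (s6): bisection search for the q value that hits a target bitrate
# on a single sample. Assumes the bitrate grows monotonically with q.
import os
import subprocess
import wave

def encoded_kbps(q: float, src: str, dst: str) -> float:
    subprocess.run(["venc", f"-q{q:.3f}", src, dst], check=True)
    with wave.open(src, "rb") as w:
        seconds = w.getnframes() / w.getframerate()
    return os.path.getsize(dst) * 8 / seconds / 1000

def tune_q(src: str, dst: str, target_kbps: float = 96.0,
           lo: float = 0.0, hi: float = 10.0, tol_kbps: float = 0.5) -> float:
    for _ in range(30):                  # plenty of iterations to converge
        q = (lo + hi) / 2
        rate = encoded_kbps(q, src, dst)
        if abs(rate - target_kbps) <= tol_kbps:
            break
        if rate < target_kbps:
            lo = q
        else:
            hi = q
    return q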

My conclusions: there are only two reasonable ways of setting up VBR encoders for a listening test:
(s0) settings that provide equal bitrates for all encoders on the selected test set; in this case the listening test compares the efficiency of the VBR algorithms; the closer the bitrates, the more accurate the results (less noise due to variance of bitrates).

(s1) natural (integer) settings; in this case the test compares particular (popular) settings of the encoders (in many cases the results can be bias-corrected afterwards; if so (this needs research), there is still a chance to make inferences about the efficiency of the encoders).
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #318
I think Serge's points have merit.

I think making sure every test sample is encoded at the exact same bitrate is an excellent idea for a 96kbps CBR listening test.
I think making sure each encoder averages the same bitrate over the test samples is an excellent idea for a 96kbps ABR listening test.
I think making sure each encoder averages the same bitrate over the superset of all music is an excellent idea for a 96kbps VBR listening test.

I think each of those has value.  The one that I'm most interested in is the unconstrained VBR listening test.

New Public Multiformat Listening Test (Jan 2014)

Reply #319
I think Serge's points have merit.

I think making sure every test sample is encoded at the exact same bitrate is an excellent idea for a 96kbps CBR listening test.
I think making sure each encoder averages the same bitrate over the test samples is an excellent idea for a 96kbps ABR listening test.
I think making sure each encoder averages the same bitrate over the superset of all music is an excellent idea for a 96kbps VBR listening test.

I think each of those has value.  The one that I'm most interested in is the unconstrained VBR listening test.

And what is your approach for comparing CBR and VBR? What if, for example, some codec uses CBR and VBR alternately? What if codec developers invent something completely different? Shouldn't we have a common procedure for testing the efficiency of codecs regardless of their internal mechanics, whatever they are? All the more so since most listening tests are organized exactly for comparing alternative coding algorithms. Efficiency, in the end, is a very simple concept: the ratio of allowed bits to resulting perceived quality.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #320
Serge,
Stop polluting the thread. If you have strong disagreements with a lot of people here about testing methodology, maybe you should open a separate thread. This disagreement has been going on for years, and jumping into the preparation discussion now, after a two-year break (since the last test), is inappropriate given that we have work to do in a short period of time.

New Public Multiformat Listening Test (Jan 2014)

Reply #321
Serge,
Stop polluting the thread. If you have strong disagreements with a lot of people here about testing methodology, maybe you should open a separate thread. This disagreement has been going on for years, and jumping into the preparation discussion now, after a two-year break (since the last test), is inappropriate given that we have work to do in a short period of time.

I see some flaws in your test setup that decrease the accuracy of the test results, so I am doing my best to describe them and provide arguments. You are the conductor of this test, so it's up to you whether to consider them or not. If you can't decide for yourself, ask the community for help. I just want this test to be properly organized, so that everybody understands what we are doing in this test and why, and what the goal of the test is. And the preparation discussion is the best place for such controversy, imho.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #322
I don't know if you have noticed, but most of the people who post here are actually listeners from previous tests. They have been involved for a while.
The problem is that we aren't sure about your approach. According to your tests, MP3 at 64 kbps / 22 kHz is ranked better than Vorbis at 64 kbps. That is a problem.

New Public Multiformat Listening Test (Jan 2014)

Reply #323
I think Serge's points have merit.

[...]
I think making sure each encoder averages the same bitrate over the superset of all music is an excellent idea for a 96kbps VBR listening test.

I think each of those has value.  The one that I'm most interested in is the unconstrained VBR listening test.


You are misunderstanding what Serge is saying. He says that despite a codec having a VBR setting that averages 96kbps over a superset of all music, we should use *another* bitrate/setting calibrated to produce 96kbps over the *test samples* because...VBR codecs that identify which samples need more bits produce better scores. This is somehow considered undesirable....because...don't ask me, it makes no sense.

I think the argument has been made before, and it was just as wrong back then as it is now. The point of VBR is that a codec can spend more bits where it needs to. Serge is now advocating that the working of VBR be "filtered" out of the test? If you want to do that, you do a CBR test.

There is no point in using a mode and then trying to disable exactly the effect of that mode. This is insanity.

Quote
So if a codec consumes more bits on this particular test set, it is presumably considered smart enough to spot problem samples and increase the bitrate for them to preserve the required quality. That is a valid hypothesis, but there is an alternative one: the codec requires more bits than the other contenders on this test set because its VBR algorithm is less efficient... as the smart decision of an efficient VBR codec or as the protective response of a poor one.


I don't even get what this is supposed to mean, or why it would matter. The codecs produce the expected VBR bitrates over a large corpus. Why does it matter for what reason they're varying their bitrates over the test set? I can't even make sense of what point your last sentence is supposed to make; as far as I can tell, you're making an artificial distinction so you can go on and fail to make any inference from it.

Quote
That would be perfect ground if we could select tracks randomly from the population. But this is impossible in practice; it would take an enormous amount of research.


I outlined such a method, which is not very complicated, earlier in this thread. The problem with the current method is that it is biased towards music that is more popular with our audience. The upshot of that flaw is that it makes the results more, not less, meaningful for our readers, although you're free to point out that the results are biased towards popular rather than unpopular music when discussing them.

The alternative is to not test (VBR) at all, which is even less useful.

Quote
(s0)...
(s5) settings that produce equal bitrates for the whole population of music; results are useful for nobody, because it's hard to tell how your particular music (the music you usually deal with) relates to that universe and how your particular bitrates relate to those “global” ones.


This reasoning is completely and utterly bogus. The result of the test is what an average listener can expect on the average song with the tested codec and settings. In the absence of more information, it's a very useful result to see which codec is best, because the odds are always higher that this codec is also best if you pick a specific sample (genre) and a listener.

If your reasoning were valid (and as just demonstrated, it isn't), then there would be no point in doing any tests, because the listeners *themselves* already vary.

Quote
I see some flaws in your test setup that decrease the accuracy of the test results, so I am doing my best to describe them and provide arguments.


Unfortunately I didn't find any valid argument regarding your stance on the VBR bitrates, and your proposal actively decreases the accuracy of a VBR test. The only argument you made that I consider valid is the one regarding sample selection, and that was already pointed out and discussed several times in the past few pages.

Quote
I just want this test to be properly organized, so that everybody understands what we are doing in this test and why, and what the goal of the test is.


Yes, thank you again for illustrating that the setup is as close to optimal as we can get for now. People who want to understand why the bitrates for the sample set don't average 96 kbps will now have even more pages to refer to.

New Public Multiformat Listening Test (Jan 2014)

Reply #324
Please let's stop arguing about Serge's testing methodology as well as the one used here.
As for the latter, Serge, you can see that average bitrates for the various test sets used in this thread don't vary much.
More important, the conclusions about fair settings for the participating encoders are exactly the same for every test set. And in case an encoder chooses a higher bitrate than usual on a problematic spot, it's quite natural that this encoder has a quality advantage there. Good detection of music that needs more bits should be rewarded, as long as the average bitrate over a test set of regular music isn't increased.
That's the idea behind the testing methodology here. There may be disadvantages to this approach, too, but this is the way we want to go here.
lame3995o -Q1.7 --lowpass 17