Topic: Multiformat listening test @ ~64kbps: Results

Multiformat listening test @ ~64kbps: Results

Reply #25
I figured ratings would vary between testers depending on which of pre-echo, lowpass, ringing, warble and grittiness is more objectionable. Furthermore, on the Bohemian Rhapsody sample, warbling in the source had me very confused for a while.

Multiformat listening test @ ~64kbps: Results

Reply #26
The #27 results are mine. I do not know if something went wrong, but I am definitely not a cheater.
Over a week ago I sent Igor some wave files he asked for, but he has not answered my email yet.


I think it's really unfortunate that Igor released a file with the word cheater in it. There are so many ways for a result to go weird that have nothing to do with "cheating".

Your results can be excluded purely on the basis of the previously published confused-reference criteria (2, 4, 9, 22, 30 invalid), so that should settle the question of whether excluding them was correct, and it should have been left at that. This can happen even to good and careful listeners, and it's nothing anyone should take too personally.

That said, your results are pretty weird: you ranked the reference fairly low (e.g. 3) on a couple of comparisons where many people found the reference and the codec indistinguishable. I think you also failed to reverse your preference on some samples where the other listeners changed theirs (behavior characteristic of a non-blind test?).

I don't mean to cause offense, but were you listening via speakers, or could you have far less HF sensitivity than most of the other listeners (if you are male and older than most participants, the answer might well be yes)? Any other ideas why your results might differ so much, both overall and on specific samples?


This was the first test of this kind I have done. I quickly realized that I couldn't hear much difference with my speakers, so I tested the samples with a good pair of ear plugs (in-ear phones). I "think" I can hear differences in HF quite well. By the way, I am male and 26 years old.
Yes, OK, there might be a special case: I can hear high frequencies better with my left ear, where I have a slight tinnitus (from loud fireworks).

Multiformat listening test @ ~64kbps: Results

Reply #27
I've done some more testing with headphones after this was finished and also realized that my speakers were limiting my initial impressions. I can pick up differences significantly more easily through headphones than speakers. I guess next time I'll have a more valuable contribution!

Multiformat listening test @ ~64kbps: Results

Reply #28
Yes, I was too strict. Sorry about that.

Some of the listeners prefer Nero over Vorbis or vice versa. Some of them rated Vorbis higher than the HE-AAC codecs.
Others preferred Apple HE-AAC over CELT on the second half of the samples. These variations are all fine.
Still, on average Opus/CELT was better for all listeners with enough results.
It was very strange that you ranked Opus as low as the low anchor (e.g. sample 10 and many others) where ALL other listeners scored it very well.
Your average scores (including the 5 invalid samples):
Vorbis - 3.53
Nero - 3.15
Apple - 3.51
CELT - 2.34


Maybe your hardware has some issues.

Earlier I also wrote to you asking you to re-run the whole test, because there were 5 invalid results and the whole set was discarded.


Hi Igor,
on sample 10 I voted this way because I found this part of the sample SUPER annoying:

http://dl.dropbox.com/u/745331/64kbs%20tes..._4_celt_cut.wav
http://dl.dropbox.com/u/745331/64kbs%20tes...e10_org_cut.wav

From this point on, the "glitch" gets less annoying but stays until the end of the sample.
Maybe it is only that annoying to me, or is it a decoding error? Can you please check this?

Thanks,
Christoph

Multiformat listening test @ ~64kbps: Results

Reply #29
Thanks for organizing the tests, guys! Sorry for being picky, but I'm not convinced about the analysis. To ease my mind, it would be great if you could comment on the following.

  • Please provide the number of valid results (i.e. listeners) per sample (excluding "27", see below).
  • How did you compute the overall average score of a codec and its confidence intervals? Taking the mean of all listeners' results? That would mean a sample with more listeners (i.e. probably sample01) has a greater influence than the last few samples (which still needed listeners shortly before the end of the test). This is probably not a good approach; weighting each sample equally in the overall score seems to be the way to go for me (it probably doesn't make a difference here, but still...).
  • Nothing personal, but if a listener like "27" consistently scores in opposite direction as the average (as shown by Igor), a thorough post-screening analysis (like Spearman rank correlation < some value) would - and has to - exclude such results.


Edit: Christoph, why are the samples you uploaded at 96 kHz? Did you do the test that way?

Chris
If I don't reply to your reply, it means I agree with you.

Multiformat listening test @ ~64kbps: Results

Reply #30
motion_blur,

You can download the results of all listeners and compare them with yours. http://listening-tests.hydrogenaudio.org/i...ous/results.zip
They are too different.

Also, why did you post samples with a sample rate of 96 kHz?



Hi Chris,
I also had a hard time understanding the bootstrap analysis.
Please wait for a detailed answer on it.

As for Christoph's results, all of them were excluded. http://www.hydrogenaudio.org/forums/index....st&p=751768

 

Multiformat listening test @ ~64kbps: Results

Reply #31
Here is the raw data for a bitrate table. The bitrates are calculated from the physical file sizes and exact durations of the lossless reference files. The container overhead is not taken into account, but the situation is the same for every contender. I can create the finished table if no one else volunteers, but perhaps not today. I have already spent too much time with this.

Code:
		bytes	duration	kbps
FOLDER .\Sample01\
FILE sample01.flac 742,802 8.029 740.12
FILE sample01_1.ogg 74,594 8.029 74.32
FILE sample01_2.m4a 62,553 8.029 62.33
FILE sample01_3.m4a 68,891 8.029 68.64
FILE sample01_4.oga 68,270 8.029 68.02
FILE sample01_5.m4a 54,640 8.029 54.44
FOLDER .\Sample02\
FILE sample02.flac 2,834,017 25.000 906.89
FILE sample02_1.ogg 232,073 25.000 74.26
FILE sample02_2.m4a 192,460 25.000 61.59
FILE sample02_3.m4a 211,283 25.000 67.61
FILE sample02_4.oga 210,511 25.000 67.36
FILE sample02_5.m4a 159,226 25.000 50.95
FOLDER .\Sample03\
FILE sample03.flac 960,531 16.717 459.67
FILE sample03_1.ogg 154,038 16.717 73.72
FILE sample03_2.m4a 103,701 16.717 49.63
FILE sample03_3.m4a 142,545 16.717 68.22
FILE sample03_4.oga 143,250 16.717 68.55
FILE sample03_5.m4a 108,151 16.717 51.76
FOLDER .\Sample04\
FILE sample04.flac 1,880,667 19.858 757.65
FILE sample04_1.ogg 162,906 19.858 65.63
FILE sample04_2.m4a 147,510 19.858 59.43
FILE sample04_3.m4a 171,758 19.858 69.19
FILE sample04_4.oga 170,527 19.858 68.70
FILE sample04_5.m4a 126,836 19.858 51.10
FOLDER .\Sample05\
FILE sample05.flac 2,405,162 29.323 656.18
FILE sample05_1.ogg 267,027 29.323 72.85
FILE sample05_2.m4a 258,347 29.323 70.48
FILE sample05_3.m4a 250,533 29.323 68.35
FILE sample05_4.oga 257,966 29.323 70.38
FILE sample05_5.m4a 185,075 29.323 50.49
FOLDER .\Sample06\
FILE sample06.flac 1,936,163 17.468 886.72
FILE sample06_1.ogg 128,628 17.468 58.91
FILE sample06_2.m4a 143,713 17.468 65.82
FILE sample06_3.m4a 152,934 17.468 70.04
FILE sample06_4.oga 148,598 17.468 68.05
FILE sample06_5.m4a 112,631 17.468 51.58
FOLDER .\Sample07\
FILE sample07.flac 1,725,279 25.838 534.18
FILE sample07_1.ogg 280,547 25.838 86.86
FILE sample07_2.m4a 196,327 25.838 60.79
FILE sample07_3.m4a 231,898 25.838 71.80
FILE sample07_4.oga 223,721 25.838 69.27
FILE sample07_5.m4a 163,560 25.838 50.64
FOLDER .\Sample08\
FILE sample08.flac 1,732,476 20.455 677.58
FILE sample08_1.ogg 159,867 20.455 62.52
FILE sample08_2.m4a 165,652 20.455 64.79
FILE sample08_3.m4a 172,542 20.455 67.48
FILE sample08_4.oga 171,391 20.455 67.03
FILE sample08_5.m4a 131,021 20.455 51.24
FOLDER .\Sample09\
FILE sample09.flac 3,588,564 27.481 1044.67
FILE sample09_1.ogg 281,690 27.481 82.00
FILE sample09_2.m4a 235,189 27.481 68.47
FILE sample09_3.m4a 250,493 27.481 72.92
FILE sample09_4.oga 236,652 27.481 68.89
FILE sample09_5.m4a 174,125 27.481 50.69
FOLDER .\Sample10\
FILE sample10.flac 3,176,903 29.207 870.18
FILE sample10_1.ogg 413,776 29.207 113.34
FILE sample10_2.m4a 255,898 29.207 70.09
FILE sample10_3.m4a 267,479 29.207 73.26
FILE sample10_4.oga 242,965 29.207 66.55
FILE sample10_5.m4a 184,898 29.207 50.64
FOLDER .\Sample11\
FILE sample11.flac 2,034,667 20.017 813.18
FILE sample11_1.ogg 183,494 20.017 73.34
FILE sample11_2.m4a 173,358 20.017 69.28
FILE sample11_3.m4a 181,262 20.017 72.44
FILE sample11_4.oga 173,385 20.017 69.30
FILE sample11_5.m4a 128,182 20.017 51.23
FOLDER .\Sample12\
FILE sample12.flac 1,369,056 15.001 730.11
FILE sample12_1.ogg 175,658 15.001 93.68
FILE sample12_2.m4a 145,147 15.001 77.41
FILE sample12_3.m4a 131,690 15.001 70.23
FILE sample12_4.oga 131,032 15.001 69.88
FILE sample12_5.m4a 97,925 15.001 52.22
FOLDER .\Sample13\
FILE sample13.flac 3,199,288 30.002 853.09
FILE sample13_1.ogg 267,568 30.002 71.35
FILE sample13_2.m4a 266,484 30.002 71.06
FILE sample13_3.m4a 268,730 30.002 71.66
FILE sample13_4.oga 253,476 30.002 67.59
FILE sample13_5.m4a 189,903 30.002 50.64
FOLDER .\Sample14\
FILE sample14.flac 3,244,477 24.494 1059.68
FILE sample14_1.ogg 236,053 24.494 77.10
FILE sample14_2.m4a 214,877 24.494 70.18
FILE sample14_3.m4a 209,514 24.494 68.43
FILE sample14_4.oga 207,971 24.494 67.93
FILE sample14_5.m4a 156,055 24.494 50.97
FOLDER .\Sample15\
FILE sample15.flac 2,332,219 29.543 631.55
FILE sample15_1.ogg 269,799 29.543 73.06
FILE sample15_2.m4a 217,455 29.543 58.89
FILE sample15_3.m4a 256,557 29.543 69.47
FILE sample15_4.oga 260,016 29.543 70.41
FILE sample15_5.m4a 186,255 29.543 50.44
FOLDER .\Sample16\
FILE sample16.flac 631,914 6.634 762.03
FILE sample16_1.ogg 71,240 6.634 85.91
FILE sample16_2.m4a 58,878 6.634 71.00
FILE sample16_3.m4a 57,764 6.634 69.66
FILE sample16_4.oga 56,862 6.634 68.57
FILE sample16_5.m4a 45,967 6.634 55.43
FOLDER .\Sample17\
FILE sample17.flac 1,794,257 15.472 927.74
FILE sample17_1.ogg 136,374 15.472 70.51
FILE sample17_2.m4a 126,772 15.472 65.55
FILE sample17_3.m4a 138,673 15.472 71.70
FILE sample17_4.oga 131,054 15.472 67.76
FILE sample17_5.m4a 100,027 15.472 51.72
FOLDER .\Sample18\
FILE sample18.flac 2,403,680 20.155 954.08
FILE sample18_1.ogg 164,209 20.155 65.18
FILE sample18_2.m4a 172,550 20.155 68.49
FILE sample18_3.m4a 180,669 20.155 71.71
FILE sample18_4.oga 173,027 20.155 68.68
FILE sample18_5.m4a 128,988 20.155 51.20
FOLDER .\Sample19\
FILE sample19.flac 2,473,098 25.271 782.90
FILE sample19_1.ogg 188,316 25.271 59.61
FILE sample19_2.m4a 203,905 25.271 64.55
FILE sample19_3.m4a 213,815 25.271 67.69
FILE sample19_4.oga 211,536 25.271 66.97
FILE sample19_5.m4a 159,900 25.271 50.62
FOLDER .\Sample20\
FILE sample20.flac 2,208,744 19.887 888.52
FILE sample20_1.ogg 137,666 19.887 55.38
FILE sample20_2.m4a 162,528 19.887 65.38
FILE sample20_3.m4a 171,667 19.887 69.06
FILE sample20_4.oga 167,556 19.887 67.40
FILE sample20_5.m4a 127,993 19.887 51.49
FOLDER .\Sample21\
FILE sample21.flac 2,401,753 19.908 965.14
FILE sample21_1.ogg 179,423 19.908 72.10
FILE sample21_2.m4a 180,686 19.908 72.61
FILE sample21_3.m4a 182,050 19.908 73.16
FILE sample21_4.oga 167,027 19.908 67.12
FILE sample21_5.m4a 127,049 19.908 51.05
FOLDER .\Sample22\
FILE sample22.flac 2,831,537 22.143 1023.00
FILE sample22_1.ogg 200,308 22.143 72.37
FILE sample22_2.m4a 200,216 22.143 72.34
FILE sample22_3.m4a 188,506 22.143 68.10
FILE sample22_4.oga 188,741 22.143 68.19
FILE sample22_5.m4a 140,889 22.143 50.90
FOLDER .\Sample23\
FILE sample23.flac 1,216,626 11.686 832.88
FILE sample23_1.ogg 121,623 11.686 83.26
FILE sample23_2.m4a 102,927 11.686 70.46
FILE sample23_3.m4a 119,684 11.686 81.93
FILE sample23_4.oga 106,219 11.686 72.72
FILE sample23_5.m4a 77,692 11.686 53.19
FOLDER .\Sample24\
FILE sample24.flac 1,870,069 17.025 878.74
FILE sample24_1.ogg 134,142 17.025 63.03
FILE sample24_2.m4a 135,416 17.025 63.63
FILE sample24_3.m4a 153,654 17.025 72.20
FILE sample24_4.oga 147,069 17.025 69.11
FILE sample24_5.m4a 110,437 17.025 51.89
FOLDER .\Sample25\
FILE sample25.flac 2,734,360 28.727 761.47
FILE sample25_1.ogg 281,634 28.727 78.43
FILE sample25_2.m4a 242,678 28.727 67.58
FILE sample25_3.m4a 252,085 28.727 70.20
FILE sample25_4.oga 243,928 28.727 67.93
FILE sample25_5.m4a 182,075 28.727 50.70
FOLDER .\Sample26\
FILE sample26.flac 2,599,998 22.092 941.52
FILE sample26_1.ogg 223,182 22.092 80.82
FILE sample26_2.m4a 180,466 22.092 65.35
FILE sample26_3.m4a 191,940 22.092 69.51
FILE sample26_4.oga 185,322 22.092 67.11
FILE sample26_5.m4a 141,355 22.092 51.19
FOLDER .\Sample27\
FILE sample27.flac 2,574,403 21.612 952.95
FILE sample27_1.ogg 200,562 21.612 74.24
FILE sample27_2.m4a 187,622 21.612 69.45
FILE sample27_3.m4a 193,290 21.612 71.55
FILE sample27_4.oga 178,160 21.612 65.95
FILE sample27_5.m4a 137,567 21.612 50.92
FOLDER .\Sample28\
FILE sample28.flac 1,739,752 19.144 727.02
FILE sample28_1.ogg 159,467 19.144 66.64
FILE sample28_2.m4a 162,526 19.144 67.92
FILE sample28_3.m4a 176,282 19.144 73.67
FILE sample28_4.oga 163,339 19.144 68.26
FILE sample28_5.m4a 123,291 19.144 51.52
FOLDER .\Sample29\
FILE sample29.flac 2,409,128 28.505 676.13
FILE sample29_1.ogg 215,868 28.505 60.58
FILE sample29_2.m4a 233,228 28.505 65.46
FILE sample29_3.m4a 258,227 28.505 72.47
FILE sample29_4.oga 239,592 28.505 67.24
FILE sample29_5.m4a 180,755 28.505 50.73
FOLDER .\Sample30\
FILE sample30.flac 2,648,660 30.000 706.31
FILE sample30_1.ogg 227,521 30.000 60.67
FILE sample30_2.m4a 247,638 30.000 66.04
FILE sample30_3.m4a 254,019 30.000 67.74
FILE sample30_4.oga 251,772 30.000 67.14
FILE sample30_5.m4a 189,944 30.000 50.65
The codecs:
_1. Vorbis
_2. Nero
_3. Apple
_4. Opus (CELT)
_5. low anchor

The FLAC bitrate may be somewhat interesting. It gives some indication of the sample's complexity.

The same data in Excel format is available here: http://www.hydrogenaudio.org/forums/index....showtopic=88033
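
For anyone checking the numbers, the kbps column above is simply the file size in bits divided by the clip duration. A minimal sketch of that calculation (the helper name is mine; the two example values are taken from the Sample01 rows above):

Code:
def kbps(size_bytes: int, duration_s: float) -> float:
    # kilobits per second = bytes * 8 bits, divided by duration, divided by 1000
    return size_bytes * 8 / duration_s / 1000

print(round(kbps(742_802, 8.029), 2))  # sample01.flac  -> 740.12
print(round(kbps(74_594, 8.029), 2))   # sample01_1.ogg -> 74.32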

Multiformat listening test @ ~64kbps: Results

Reply #32
Thanks for organizing the tests, guys! Sorry for being picky, but I'm not convinced about the analysis. To ease my mind, it would be great if you could comment on the following.

  • Please provide the number of valid results (i.e. listeners) per sample (excluding "27", see below).
  • How did you compute the overall average score of a codec and its confidence intervals? Taking the mean of all listeners' results? That would mean a sample with more listeners (i.e. probably sample01) has a greater influence than the last few samples (which still needed listeners shortly before the end of the test). This is probably not a good approach; weighting each sample equally in the overall score seems to be the way to go for me (it probably doesn't make a difference here, but still...).
  • Nothing personal, but if a listener like "27" consistently scores in opposite direction as the average (as shown by Igor), a thorough post-screening analysis (like Spearman rank correlation < some value) would - and has to - exclude such results.


Edit: Christoph, why are the samples you uploaded at 96 kHz? Did you do the test that way?

Chris


@Edit: Oh sorry, I just quickly cut it with Audacity and didn't notice it was still configured that way. But anyway, I hope you can hear what I mean.
Maybe most people only concentrated on the beginning of the samples? The part with the glitch is well into the sample.

Yes, I know that my results do not meet the criteria and are therefore excluded.
And even if I were included, I would just be one of the outliers and would not influence the median of the scoring much.
http://www.physics.csbsju.edu/stats/complex.box.defs.gif

But I want to know what I did different and what I can change next time.

Multiformat listening test @ ~64kbps: Results

Reply #33
Some presentation suggestions:
1. Codec versions and settings should be in the results or one clearly marked click away. I don't consider what is there now to be clearly marked.
2. Links to results of older tests would be welcome.
3. I can't wait for the bitrate table.

Multiformat listening test @ ~64kbps: Results

Reply #34
Thank you for your help, AlexB. If you can do the complete bitrate analysis, it will be useful.
I haven't had time to do the bitrate table these past few days.

Multiformat listening test @ ~64kbps: Results

Reply #35
Some presentation suggestions:
1. Codec versions and settings should be in the results or one clearly marked click away. I don't consider what is there now to be clearly marked.
2. Links to results of older tests would be welcome.
3. I can't wait for the bitrate table.


Thank you for the observations.

Multiformat listening test @ ~64kbps: Results

Reply #36
Christoph, do you mean the slightly washed out bass drum? To me (and probably most other listeners) the artifacts of the other codecs in the first 15 seconds appeared much more severe. I don't have the decoded items here. Can someone check if Christoph's CELT decodes match his/her own?

And, since you said this is your first listening test of this kind: did you do training sessions? Did you read e.g. this guideline? The way you choose your loops (especially length) has a great impact on your ability to identify artifacts.

Chris
If I don't reply to your reply, it means I agree with you.

Multiformat listening test @ ~64kbps: Results

Reply #37
I've checked. The decoder on Christoph's system is fine.

P.S. I've also pointed Christoph to this guide: http://ff123.net/64test/practice.html

Multiformat listening test @ ~64kbps: Results

Reply #38
I figured ratings would vary between testers depending on which of pre-echo, lowpass, ringing, warble and grittiness is more objectionable. Furthermore, on the Bohemian Rhapsody sample, warbling in the source had me very confused for a while.


The bigger difference just comes from which samples were tested.  A great many listeners only listened to the first few samples, so of course their preferences will be skewed by the correlation with the samples they tested.

If you look at the 10 listeners who had all 30 valid results (so no sample imbalance), you'll see that the overall preferences agree pretty strongly:

These are just the ranks of the averages (no comment on the significance):

Garf     Opus > Apple_HE-AAC > Nero_HE-AAC > Vorbis > AAC-LC@48k
hlm      Opus > Apple_HE-AAC > Nero_HE-AAC > Vorbis > AAC-LC@48k
IgorC    Opus > Apple_HE-AAC > Vorbis > Nero_HE-AAC > AAC-LC@48k
KW       Opus > Apple_HE-AAC > Nero_HE-AAC > Vorbis > AAC-LC@48k
04_anon  Opus > Apple_HE-AAC > Nero_HE-AAC > Vorbis > AAC-LC@48k
06_anon  Opus > Apple_HE-AAC > Nero_HE-AAC > Vorbis > AAC-LC@48k
14_anon  Opus > Apple_HE-AAC > Vorbis > Nero_HE-AAC > AAC-LC@48k
25_anon  Opus > Apple_HE-AAC > Vorbis > Nero_HE-AAC > AAC-LC@48k
26_anon  Apple_HE-AAC > Opus > Nero_HE-AAC > Vorbis > AAC-LC@48k
30_anon  Opus > Apple_HE-AAC > Nero_HE-AAC > Vorbis > AAC-LC@48k

The sample-to-sample variance in rank is a lot greater than the listener-to-listener variance in rank (scores might be another matter, but listeners don't score things the same way, and because the score scale is non-linear I don't know of any intuitively correct way to deal with that other than using ranks).



> d <- read.listener.file("comp_data.txt")
> aggregate(d$value, list(codec=d$codec,listener=d$listener),mean)
          codec listener        x
1    AAC-LC@48k  04_anon 1.033333
2  Apple_HE-AAC  04_anon 3.550000
3  Nero_HE-AAC  04_anon 3.453333
4          Opus  04_anon 3.900000
5        Vorbis  04_anon 3.310000
6    AAC-LC@48k  06_anon 1.793333
7  Apple_HE-AAC  06_anon 4.186667
8  Nero_HE-AAC  06_anon 3.820000
9          Opus  06_anon 4.460000
10      Vorbis  06_anon 3.603333
11  AAC-LC@48k  14_anon 1.050000
12 Apple_HE-AAC  14_anon 3.283333
13  Nero_HE-AAC  14_anon 2.666667
14        Opus  14_anon 3.600000
15      Vorbis  14_anon 3.110000
16  AAC-LC@48k  25_anon 1.293333
17 Apple_HE-AAC  25_anon 3.183333
18  Nero_HE-AAC  25_anon 2.500000
19        Opus  25_anon 3.503333
20      Vorbis  25_anon 2.960000
21  AAC-LC@48k  26_anon 1.800000
22 Apple_HE-AAC  26_anon 4.866667
23  Nero_HE-AAC  26_anon 4.666667
24        Opus  26_anon 4.766667
25      Vorbis  26_anon 4.573333
26  AAC-LC@48k  30_anon 1.086667
27 Apple_HE-AAC  30_anon 3.110000
28  Nero_HE-AAC  30_anon 2.656667
29        Opus  30_anon 3.333333
30      Vorbis  30_anon 2.476667
31  AAC-LC@48k    Garf 1.923333
32 Apple_HE-AAC    Garf 4.093333
33  Nero_HE-AAC    Garf 3.963333
34        Opus    Garf 4.203333
35      Vorbis    Garf 3.916667
36  AAC-LC@48k      hlm 1.533333
37 Apple_HE-AAC      hlm 3.476667
38  Nero_HE-AAC      hlm 3.113333
39        Opus      hlm 3.656667
40      Vorbis      hlm 2.616667
41  AAC-LC@48k    IgorC 1.056667
42 Apple_HE-AAC    IgorC 3.003333
43  Nero_HE-AAC    IgorC 2.753333
44        Opus    IgorC 3.583333
45      Vorbis    IgorC 2.940000
46  AAC-LC@48k      KW 1.376667
47 Apple_HE-AAC      KW 4.040000
48  Nero_HE-AAC      KW 3.816667
49        Opus      KW 4.236667
50      Vorbis      KW 3.190000
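
For readers more comfortable with Python than R, here is a minimal pandas sketch of the same aggregation. The tab-separated long format (columns listener, codec, value) and the file name are assumptions mirroring the R call above; the final loop prints each listener's codec ordering by mean score, which is how the rank lists above were derived.

Code:
import pandas as pd

# Assumed long-format results file: one row per rating,
# with columns listener, codec, value (tab-separated).
d = pd.read_csv("comp_data.txt", sep="\t")

# Mean score per (listener, codec), mirroring the R aggregate() call above.
means = d.groupby(["listener", "codec"])["value"].mean().reset_index()

# Each listener's codecs ordered by mean score, best first.
for listener, grp in means.groupby("listener"):
    order = grp.sort_values("value", ascending=False)["codec"]
    print(listener, " > ".join(order))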

Multiformat listening test @ ~64kbps: Results

Reply #39
2. Links to results of older tests would be welcome.

http://listeningtests.t35.com.

I have mirrored Roberto's and Sebastian's old test sites. Sebastian's tests are also available here: http://listening-tests.hydrogenaudio.org/sebastian/

Quote
3. I can't wait for the bitrate table.

Actually, a more useful presentation would be a comparison like this: http://www.hydrogenaudio.org/forums/index....st&p=593735
I.e. bitrates that represent real life usage, not the bitrates of these short test samples.

I am planning to do it, but the lack of application support for Opus (CELT) will make the process quite a bit more complex than before.

Multiformat listening test @ ~64kbps: Results

Reply #40
Christoph, do you mean the slightly washed out bass drum? To me (and probably most other listeners) the artifacts of the other codecs in the first 15 seconds appeared much more severe. I don't have the decoded items here. Can someone check if Christoph's CELT decodes match his/her own?

And, since you said this is your first listening test of this kind: did you do training sessions? Did you read e.g. this guideline? The way you choose your loops (especially length) has a great impact on your ability to identify artifacts.

Chris


Hi C.R.,

Yes, I read the guideline before the test, but I usually compared only 2-3 loops per sample.
It is interesting that you are not that annoyed by this part.
I can clearly hear it, and I just did a spectrum analysis where it is also visible.
http://dl.dropbox.com/u/745331/spectrum.png

Multiformat listening test @ ~64kbps: Results

Reply #41
Yes, I read the guideline before the test, but I usually compared only 2-3 loops per sample.

I wonder how many listeners did it like that. It seems there are a lot of things we should put in a checklist for everyone to read before a test, such as "listen to the entire sample" and "use headphones"... Maybe by coincidence you only listened to sections where CELT does a bit worse than the other codecs?

Quote
It is interesting that you are not that annoyed by this part.
I can clearly hear it, and I just did a spectrum analysis where it is also visible.
http://dl.dropbox.com/u/745331/spectrum.png

Weird, I don't even see this in my own spectrogram of the file you uploaded.    What frequency range is the highlighted part in? In other words: please label your axes!

Chris
If I don't reply to your reply, it means I agree with you.

Multiformat listening test @ ~64kbps: Results

Reply #42
Quote
Please provide the number of valid results (i.e. listeners) per sample (excluding "27", see below).


Will be addressed when per sample graphs are made. You can obtain this data yourself easily if you can't wait - the results are public.

  • How did you compute the overall average score of a codec and its confidence intervals? Taking the mean of all listeners' results? That would mean a sample with more listeners (i.e. probably sample01) has a greater influence than the last few samples (which still needed listeners shortly before the end of the test). This is probably not a good approach; weighting each sample equally in the overall score seems to be the way to go for me (it probably doesn't make a difference here, but still...).


This is already addressed and explained on the results page. Note that equal sample weighting, by only including complete results, does not change the results in the slightest.

That being said, the only solution to this is to put some infrastructure in place to force an equal number of listeners per sample in the next tests. Any kind of post-processing to equalize the sample weights is probably as controversial as not having them equal in the first place. The samples that weren't included in the test also had unequal weights compared to those that were, if you know what I mean.
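
To make the weighting question above concrete, here is a small sketch (my own illustration, not the script used for the results page) contrasting the pooled mean with an equal-sample-weight mean. The long-format file and its column names are assumptions.

Code:
import pandas as pd

# Assumed long-format results: columns sample, listener, codec, value.
d = pd.read_csv("results.txt", sep="\t")

# Pooled mean: every individual rating counts equally, so samples with
# more listeners carry more weight in the codec averages.
pooled = d.groupby("codec")["value"].mean()

# Equal-sample weighting: average within each (codec, sample) first,
# then average those per-sample means across samples.
equal_weight = (d.groupby(["codec", "sample"])["value"].mean()
                 .groupby(level="codec").mean())

print(pd.DataFrame({"pooled": pooled, "equal_sample": equal_weight}))

With fully balanced data, i.e. only complete results, the two columns coincide, which is the point made above.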


Quote
  • Nothing personal, but if a listener like "27" consistently scores in opposite direction as the average (as shown by Igor), a thorough post-screening analysis (like Spearman rank correlation < some value) would - and has to - exclude such results.


As explained in this thread, this listener was in fact screened.
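
For illustration, here is a sketch of the kind of rank-correlation post-screen Chris describes. The file name, column layout and the 0.3 cutoff are assumptions for the example, not what was actually used for this test.

Code:
import pandas as pd
from scipy.stats import spearmanr

# Assumed long-format results: columns sample, listener, codec, value.
d = pd.read_csv("results.txt", sep="\t")

# Consensus score per (sample, codec) over all listeners. A stricter screen
# would exclude the listener under test from their own consensus.
consensus = d.groupby(["sample", "codec"])["value"].mean().rename("consensus")

flagged = []
for listener, grp in d.groupby("listener"):
    merged = grp.merge(consensus.reset_index(), on=["sample", "codec"])
    rho, _ = spearmanr(merged["value"], merged["consensus"])
    if rho < 0.3:  # arbitrary cutoff, purely for illustration
        flagged.append((listener, round(rho, 2)))

print(flagged)  # listeners whose rankings barely track the consensus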

Multiformat listening test @ ~64kbps: Results

Reply #43
Sorry for being picky, but I'm not convinced about the analysis.


The paired statistical tests are pretty incontrovertible. I've since run the same analysis with a number of different balancing and post-filtering rules, and every time it has come out the same way.

If it's any consolation, Opus bombs considerably on the couple of cases where it does poorly (though its sample-by-sample variance is still not as large as the other codecs', it has stronger outliers). This is undoubtedly due to a mixture of encoder immaturity, not taking advantage of VBR, and the annoying tradeoffs that come from creating a low-latency codec. (The mode Opus was used in here has a total of 22.5 ms of latency, including the overlap but ignoring any serialization delay related to VBR.)

I've noticed that there seems to be some misunderstanding promoted around here related to confidence intervals. Even ignoring the issues with non-pairwise comparisons, assumptions of normality, etc., there seems to be a misapprehension that the confidence intervals must not overlap at all for the result to be deemed significant at whatever p-value was used to draw the bars. This is clearly incorrect.

For example, consider 5% error bars on the mean of codec A and 5% bars on the mean of codec B, where the lower bar of A is the same as the upper bar of B. Is there a 1/20 (p=0.05) chance that the difference in means arose from noise? NO. If we assume that the errors are independent, the chance of that is more like 1/400 (0.05^2). Of course, the errors are not completely independently distributed, but that fact also invalidates the assumptions used to set the error bars in the first place. Another approach would be to compare the mean of one codec with the error bars on the mean of the other and vice versa; this isn't ideal either, but it does avoid squaring the p-value used.
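
A rough Monte Carlo sketch of that point, under an assumption of normal, independent errors (my own illustration, not part of the test analysis): two 95% intervals around independent means fail to overlap by chance far less often than 1 time in 20.

Code:
import numpy as np

rng = np.random.default_rng(0)
n, sigma, trials = 30, 1.0, 200_000
halfwidth = 1.96 * sigma / np.sqrt(n)  # 95% CI half-width of each mean

# Under the null hypothesis both codecs share the same true mean.
mean_a = rng.normal(0.0, sigma, (trials, n)).mean(axis=1)
mean_b = rng.normal(0.0, sigma, (trials, n)).mean(axis=1)

# How often do the two 95% intervals fail to overlap by chance alone?
no_overlap = np.mean(np.abs(mean_a - mean_b) >= 2 * halfwidth)
print(no_overlap)  # around 0.005-0.006, far below the nominal 0.05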

Blocked pair-wise parametric tests are much better for this reason and others, but they don't result in pretty graphs.

Multiformat listening test @ ~64kbps: Results

Reply #44
Yes, I read the guideline before the test, but I usually compared only 2-3 loops per sample.

I wonder how many listeners did it like that. It seems there are a lot of things we should put in a checklist for all to read before a test. Such as "listen to the entire sample" and "use headphones"... Maybe by coincidence you only listened to sections where CELT does a bit worse than the other codecs?

Quote
It is interesting that you are not that annoyed by this part.
I can clearly hear it, and I just did a spectrum analysis where it is also visible.
http://dl.dropbox.com/u/745331/spectrum.png

Weird, I don't even see this in my own spectrogram of the file you uploaded.    What frequency range is the highlighted part in? In other words: please label your axes!

Chris


I did the spectrogram with foobar and a log scale, sadly with no labels. I looked at it with a linear scale and the gap goes approximately from 7 kHz to 9 kHz.
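
For anyone who wants to reproduce a labelled spectrogram of the clip, here is a minimal SciPy/matplotlib sketch. The file name is just the uploaded clip's name, and the frequency axis is in kHz, so a 7-9 kHz notch would be easy to point at.

Code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

# Local copy of the uploaded clip (name assumed from the link above).
rate, data = wavfile.read("sample10_4_celt_cut.wav")
if data.ndim > 1:  # mix stereo down to mono
    data = data.mean(axis=1)

f, t, sxx = spectrogram(data, fs=rate, nperseg=2048)
plt.pcolormesh(t, f / 1000.0, 10 * np.log10(sxx + 1e-12), shading="auto")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (kHz)")
plt.colorbar(label="Power (dB)")
plt.show()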

Multiformat listening test @ ~64kbps: Results

Reply #45
Sorry, Christoph, I can't reproduce it. What you describe must sound like a notch filter, i.e. a missing frequency band. I haven't noticed anything of that sort during or after the test. What OS are you using? 64-bit?

Thanks, Garf and NullC, for the explanations.

Note that equal sample weighting, by only including complete results, does not change the results in the slightest.

That's good to hear. Still, if you find some time, would you mind creating a closeup average-codec-score plot using only the complete results, just like the plot on the results page? 

Thanks,

Chris
If I don't reply to your reply, it means I agree with you.


Multiformat listening test @ ~64kbps: Results

Reply #47
Some presentation suggestions:
1. Codec versions and settings should be in the results or one clearly marked click away. I don't consider what is now to be clearly marked.
2. Links to results of older tests would be welcome.
3. I can't wait for the bitrate table.


I added the bitrate table (thanks AlexB!), but that's as far as I'll go. If people want nicer webpages they need to find someone who is actually skilled at making nice HTML/CSS.


Multiformat listening test @ ~64kbps: Results

Reply #49
Note that equal sample weighting, by only including complete results, does not change the results in the slightest.

That's good to hear. Still, if you find some time, would you mind creating a closeup average-codec-score plot using only the complete results, just like the plot on the results page? 

Thanks,

Chris


The 10 listeners that did all samples with all results valid (N=300):

[closeup plot of average codec scores for the 10 complete listeners]

The results are just as highly significant:

Code:
bootstrap.py v1.0 2011-02-03
Copyright (C) 2011 Gian-Carlo Pascutto <gcp@sjeng.org>
License Affero GPL version 3 or later <http://www.gnu.org/licenses/agpl.html>

Reading from: bs1.txt
Read 5 treatments, 300 samples => 10 comparisons
Means:
      Vorbis   Nero_HE-AAC  Apple_HE-AAC          Opus    AAC-LC@48k
       3.270         3.341         3.679         3.924         1.395

Unadjusted p-values:
          Nero_HE-AAC   Apple_HE-AAC  Opus          AAC-LC@48k  
Vorbis        0.297         0.000*        0.000*        0.000*      
Nero_HE-AAC   -             0.000*        0.000*        0.000*      
Apple_HE-AAC  -             -             0.000*        0.000*      
Opus          -             -             -             0.000*      

Apple_HE-AAC is better than Vorbis (p=0.000)
Apple_HE-AAC is better than Nero_HE-AAC (p=0.000)
Opus is better than Vorbis (p=0.000)
Opus is better than Nero_HE-AAC (p=0.000)
Opus is better than Apple_HE-AAC (p=0.000)
AAC-LC@48k is worse than Vorbis (p=0.000)
AAC-LC@48k is worse than Nero_HE-AAC (p=0.000)
AAC-LC@48k is worse than Apple_HE-AAC (p=0.000)
AAC-LC@48k is worse than Opus (p=0.000)

p-values adjusted for multiple comparison:
          Nero_HE-AAC   Apple_HE-AAC  Opus          AAC-LC@48k  
Vorbis        0.297         0.000*        0.000*        0.000*      
Nero_HE-AAC   -             0.000*        0.000*        0.000*      
Apple_HE-AAC  -             -             0.000*        0.000*      
Opus          -             -             -             0.000*      

Apple_HE-AAC is better than Vorbis (p=0.000)
Apple_HE-AAC is better than Nero_HE-AAC (p=0.000)
Opus is better than Vorbis (p=0.000)
Opus is better than Nero_HE-AAC (p=0.000)
Opus is better than Apple_HE-AAC (p=0.000)
AAC-LC@48k is worse than Vorbis (p=0.000)
AAC-LC@48k is worse than Nero_HE-AAC (p=0.000)
AAC-LC@48k is worse than Apple_HE-AAC (p=0.000)
AAC-LC@48k is worse than Opus (p=0.000)
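
For readers who, like Chris earlier in the thread, find the bootstrap analysis hard to follow, here is a toy sketch of a paired bootstrap test on the per-pair score differences between two codecs. It only conveys the general idea and is not Garf's bootstrap.py.

Code:
import numpy as np

def paired_bootstrap_p(a, b, n_boot=10_000, seed=0):
    # Two-sided bootstrap p-value for a difference in mean score between
    # two codecs rated on the same (listener, sample) pairs.
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, float) - np.asarray(b, float)
    observed = diff.mean()
    # Resample the paired differences, recentred to satisfy the null
    # hypothesis of zero mean difference.
    centred = diff - observed
    boot = rng.choice(centred, (n_boot, diff.size), replace=True).mean(axis=1)
    return float(np.mean(np.abs(boot) >= abs(observed)))

# Usage with hypothetical arrays of 300 paired scores:
# p = paired_bootstrap_p(opus_scores, vorbis_scores)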