
Hydrogenaudio Forum => General Audio => Topic started by: ff123 on 2001-10-29 06:57:26

Title: Second in the series of 128 tests
Post by: ff123 on 2001-10-29 06:57:26
Your participation is invited.

Test of various audio codecs which average about 128 kbit/s:
MP3, AAC, Ogg Vorbis, MPC, and WMA8

See http://ff123.net/128test/instruct.html for instructions.

I have posted RAR'd binaries under the title "128 kbit/s listening test" to alt.binaries.test.  There are also binaries available from my site.

Please do not discuss your results here!

ff123
Title: Second in the series of 128 tests
Post by: Jon Ingram on 2001-10-29 07:52:18
Nice instruction page...

I hope to participate as soon as I get writing a paper out of the way.  How long is this test scheduled to run?
Title: Second in the series of 128 tests
Post by: Nic on 2001-10-29 08:35:24
No Psytel again :-( ... I'm sure you have your reasons (as in Liquid is the best in its class), but I would have liked to see it there....

...I look forward to the results  (will they differ much compared to the last set, using dogies?)

Cheers,
-Nic
Title: Second in the series of 128 tests
Post by: PatchWorKs on 2001-10-29 09:56:03
Vorbis RC3 ???
Title: Second in the series of 128 tests
Post by: TrNSZ on 2001-10-29 10:43:31
[deleted]
Title: Second in the series of 128 tests
Post by: JohnV on 2001-10-29 11:50:18
Hmm, how's the RC3 alpha supposed to compare against RC3? Last time I checked, Monty was concentrating on bitrate control for the streaming modes. Hopefully that means the quality tweaks which will be implemented in RC3 are mostly done. (?)
Title: Second in the series of 128 tests
Post by: ff123 on 2001-10-29 14:43:53
Quote
Hmm, how's the RC3 alpha supposed to compare against RC3? Last time I checked, Monty was concentrating on bitrate control for the streaming modes. Hopefully that means the quality tweaks which will be implemented in RC3 are mostly done. (?)


This is my understanding of it.  Anyway, I have at least a week to complete the test before the real RC3 is out :-)

Quote
No Psytel again :-( ... I'm sure you have your reasons (as in Liquid is the best in its class), but I would have liked to see it there....


Ivan is running his own tests of Psytel at 128.

Quote
I hope to participate as soon as I get writing a paper out of the way.  How long is this test scheduled to run?


There's no hurry.  Probably at least a couple of weeks.

ff123
Title: Second in the series of 128 tests
Post by: JohnV on 2001-11-12 12:36:43
Sorry, FF, that I still haven't done the test; I have only just started, though. How long will it run, and how many people have taken part so far?

I hope people will help FF out by doing the test! Otherwise he will need a quantum computer to calculate results which have statistical significance. 

http://ff123.net/128test/instruct.html
Title: Second in the series of 128 tests
Post by: ff123 on 2001-11-12 15:41:19
The test can run indefinitely, but I'll probably release comments and individual ratings after a decent amount of time, or if I ever get 30 people to rate the files.

I updated the results at:

http://ff123.net/128test/interim.html

ff123
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-02 09:17:35
Just a note that Monty has tried his hand at the tests and has posted some interesting comments.  Participants have access to the comments pages, which are currently private.  For those who don't have access but are interested in what Monty had to say: take the tests, and I will give you the links.

ff123
Title: Second in the series of 128 tests
Post by: Delirium on 2002-01-04 07:43:50
I'm a bit confused as to how we're supposed to do the test.  It says to give a rating indicating whether we hear a difference from the original, but all I see for download are ZIPs, each containing six WAVs, presumably the output of the six different codecs.  Where do we get the original samples to compare with?
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-04 09:05:28
The originals are included along with the six encodes in the zip archives.

ff123
Title: Second in the series of 128 tests
Post by: Delirium on 2002-01-04 09:08:20
Quote
Originally posted by ff123
The originals are included along with the six encodes in the zip archives.


Well, I'm confused then, because I see exactly six samples in each archive, presumably the six encodes.  They're all labeled with what appear to be random numbers; none is labeled "original" or anything of that sort to distinguish it.  Perhaps I am just being dense at 3 AM, but I can't seem to find them...

FWIW I downloaded the WAV archives.
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-04 09:19:05
Ah, shit, I fucked those up (the plain zips were added a couple days ago by request from somebody).  I'll add the originals as a separate zip file.

ff123

Edit:  Ok, I've fixed it.  You'll have to download another 2 to 3 MB, depending on the sample.
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-06 04:48:19
Some lessons I learned from this test:

1.  I should have chosen more difficult samples.  Although the best listeners (for example, Monty) could reliably hear what was wrong with nearly every encoded file and describe what they heard in great detail (Monty said of the others' comments: "looking at these results, I have to wonder how many people bothered even listening"), the results from fossiles and rawhide indicate that even at 128 kbit/s, many samples will be essentially transparent to many people.  More difficult samples may be less representative of normal music, but the results will be more reliable.

2.  I erred on some of the settings.  For AAC, I should have chosen "transparent 128" (VBR), but lowpassed at 16 kHz.  Liquid Audio is probably still the AAC codec to beat at 128, but I would perform some pretests vs. Psytel -internet to find out for sure for the next test.  Also, I'm beginning to think that FastEnc would be a better choice for the "good" mp3.  And it might be worthwhile investigating what RealAudio can do.

3.  The next group test I organize will use ABC/HR.  More listeners don't necessarily mean better results.  One listener, if he is good enough, can yield better results than twenty untrained listeners.  Hopefully ABC/HR can help to identify and remove noisy listeners from the data set.

4.  I would like to have had at least a dozen different samples to test.  But this is highly unrealistic for a web-based test of different formats.  FTP/web space and bandwidth is one issue.  Download time is another.  I don't know of a good way around this difficulty.

ff123
Title: Second in the series of 128 tests
Post by: tangent on 2002-01-06 10:58:26
Sounds great. But erm.. can you explain ABC/HR to us? Thanks.
Title: Second in the series of 128 tests
Post by: YinYang on 2002-01-06 11:49:26
Quote
Originally posted by ff123
4.  I would like to have had at least a dozen different samples to test.  But this is highly unrealistic for a web-based test of different formats.  FTP/web space and bandwidth is one issue.  Download time is another.  I don't know of a good way around this difficulty.

ff123


Having the different test samples hosted by different people? It might not be as reliable regarding availability, but it's better than nothing, I gather.
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-06 16:43:47
Quote
Sounds great. But erm.. can you explain ABC/HR to us? Thanks.


See this thread:

http://www.hydrogenaudio.org/forums/showthread.php?s=&threadid=633

A method for post-screening noisy listeners is discussed in ITU-R BS.1116-1.

ff123
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-08 01:06:09
I'm going to close the test on 1-12-02.

Looks like the results are stable, if not significant on two of the samples.  And the pre-RC3 test is past its time, now that some fixes have been incorporated into the official RC3.

ff123
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-13 05:47:49
The second test is now closed, and the comments are linked from the main test page:

http://ff123.net/128test/instruct.html

ff123
Title: Second in the series of 128 tests
Post by: Garf on 2002-01-13 11:47:19
Minor nitpick: Your page still states 'Ogg Vorbis RC3 has not yet been released'.

Maybe also clarify that the released RC3 may improve the quality of the encoded files.

Edit: on the interim results page:

The next test I organize will hopefully use a tool better suited to post-screening, such that results from listeners who consistently rate the original better than encoded files will be discarded.

Didn't you mean it the other way around?

--
GCP
Title: Second in the series of 128 tests
Post by: Ruse on 2002-01-13 13:29:52
I guess the results for wayitis (piano-heavy) confirm what was a generally held perception:

ogg, mpc, and aac are better than wma8, with 95% confidence that the results are not due to chance alone.

Stick that up your jumper, MS!
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-13 19:06:25
About the weird results of listener 28 on wayitis.wav:

I have plotted some of the error function curves I made for dogies.wav.  I ranked the listeners by sensitivity to artifacts, assuming that the lowest total score indicated the most sensitive listener.  Then I plotted the curves of ratings vs. ranked listener.  You can see these graphs at:

http://ff123.net/128test/outlier.html

The third most sensitive listener is listener 28 in the raw data.  (BTW, xiphmont is the most sensitive listener for this sample.)  You can see, especially for Xing, WMA8, and MPC, that this listener's results are highly at odds with the trend.

I don't know what formal statistical test could detect such outliers, or, even if I did find them, what I should do with them.  But it's interesting that this person should have such opposing preferences, when I found the quality ranking to be so clear-cut (as did many others).
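
For illustration, curves like the ones described above can be generated with a few lines of Python.  This is only a sketch with made-up ratings; the codec labels, listener count, and numbers are hypothetical, not the actual test data:

Code: [Select]
import numpy as np
import matplotlib.pyplot as plt

codecs = ["mpc", "ogg", "lame", "aac", "wma8", "xing"]  # hypothetical labels
rng = np.random.default_rng(0)
# Hypothetical ratings matrix: one row per listener, one column per codec
ratings = np.clip(rng.normal([4.5, 4.0, 3.6, 3.4, 2.1, 2.0], 1.0, (20, 6)), 1.0, 5.0)

# Rank listeners by sensitivity, taking the lowest total score to
# indicate the most sensitive listener
order = np.argsort(ratings.sum(axis=1))

# Plot each codec's ratings against the ranked listeners
for j, name in enumerate(codecs):
    plt.plot(np.arange(1, len(order) + 1), ratings[order, j], label=name)
plt.xlabel("listener (1 = most sensitive)")
plt.ylabel("rating")
plt.legend()
plt.show()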

ff123
Title: Second in the series of 128 tests
Post by: Ruse on 2002-01-14 06:49:44
Why don't you analyse and publish the results without listener 28, for comparison purposes? There must be some statistically valid basis for excluding "wonky" data. I think the plots you have shown above indicate that listener 28 is an "outlier".

Can't you just exclude him on the basis of being more than 2 standard deviations from the mean?
Title: Second in the series of 128 tests
Post by: Delirium on 2002-01-14 07:19:31
Well, I think listener 28's data is a valid ranking - it's quite possible that he is sensitive to certain artifacts that most people are not, and not sensitive to those that most people are.  I'm not sure how, statistically, one should go about averaging the data; perhaps it would be useful to do some sort of post-screening to break people down into groups based on hearing and preference (i.e. "most sensitive to pre-echo", "most sensitive to treble distortion", "most sensitive to bass scratching", and so on).  Then you'd get results like "for people most sensitive to pre-echo, Ogg RC3 is best", rather than blanket preference claims that might not be true for everyone.
Title: Second in the series of 128 tests
Post by: Ruse on 2002-01-14 09:17:34
In biological systems, I suppose it is possible to get unusual sensitivities, freak performances, and critical failings. I have read of a human hearing defect where a person hears a different pitch in each ear; using that subject to develop an audio coding system wouldn't be useful.

It is more useful to look at attributes and responses that can be categorised as the standard subject response. To do otherwise would be to study atypical human perception and disease.

For developing perceptual audio coding systems, one should be able to identify & categorise artefacts that "typical" listeners will recognise and dislike. I think that ff123 has identified that most of his listening group responded in a similar fashion to the artefacts produced by the codecs. This must represent the standard response to artefacts by the human ear/brain system. There will be some who respond differently, but they would be better pulled from the testing group on the basis of outlier performance.
Title: Second in the series of 128 tests
Post by: Garf on 2002-01-14 10:12:58
Quote
Originally posted by Ruse
Why don't you analyse and publish the results without listener 28, for comparison purposes? There must be some statistically valid basis for excluding "wonky" data. I think the plots you have shown above indicate that listener 28 is an "outlier".

Can't you just exclude him on the basis of being more than 2 standard deviations from the mean?


No. The analysis that was used doesn't have a concept of 'standard deviation' anyway, and 'removing' data is always a very tricky thing to do; it isn't even generally accepted as possible in a statistically valid way.

Note that this guy would have passed even if post-screening had been used. He is a valid data point. Us not liking what the data says doesn't change that.

--
GCP
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-21 20:21:29
I've been getting some help from Rich Ulrich in sci.stat.math in identifying outliers, and it appears that the statistic to use is the "corrected item-total correlation," or the (Pearson) correlation of each rater with the average for all the other raters.

For example, using this statistic, Monty has a correlation coefficient of 0.86, and Joerg (listener 28) has a value of -0.81.

A large, negative value (near -1.0) indicates a preference that runs highly counter to the general trend.
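
A minimal sketch of how this statistic can be computed, assuming the ratings are arranged as a matrix with one row per listener and one column per codec (the function name and layout are illustrative, not taken from the actual analysis tool):

Code: [Select]
import numpy as np

def corrected_item_total_correlation(ratings):
    # ratings: 2-D array, one row per listener, one column per codec.
    # For each listener, take the Pearson correlation of his ratings
    # with the mean ratings of all the *other* listeners.
    ratings = np.asarray(ratings, dtype=float)
    r = np.empty(ratings.shape[0])
    for i in range(ratings.shape[0]):
        others = np.delete(ratings, i, axis=0).mean(axis=0)
        r[i] = np.corrcoef(ratings[i], others)[0, 1]
    return r

# A value near +1 tracks the group consensus; a large negative value
# flags a preference running counter to the trend.  A subanalysis can
# then keep only the well-correlated listeners, e.g.:
#   keep = ratings[corrected_item_total_correlation(ratings) > 0.7]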

I will be performing a sub-analysis in the near future for those listeners (there are 9 of them) who are highly and positively correlated.

ff123
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-22 04:30:04
Subanalysis based on the nine listeners who were highly correlated with each other (r > 0.7).  They were the following:

Code: [Select]
listener    r
   1       0.86
   2       0.95
   6       0.80
  10       0.86
  14       0.84
  18       0.82
  19       0.96
  23       0.86
  27       0.92


Resampling analysis as follows:

Code: [Select]
Means:

mpc      ogg      lame     aac      wma8     xing
 4.63     4.09     3.61     3.36     2.11     2.04

                           Unadjusted p-values
        ogg      lame     aac      wma8     xing
mpc      0.022*   0.000*   0.000*   0.000*   0.000*
ogg        -      0.043*   0.003*   0.000*   0.000*
lame       -        -      0.270    0.000*   0.000*
aac        -        -        -      0.000*   0.000*
wma8       -        -        -        -      0.772

Each '.' is 1,000 resamples.  Each '+' is 10,000 resamples
.........+

                            Adjusted p-values
        ogg      lame     aac      wma8     xing
mpc      0.077    0.001*   0.000*   0.000*   0.000*
ogg        -      0.114    0.011*   0.000*   0.000*
lame       -        -      0.465    0.000*   0.000*
aac        -        -        -      0.000*   0.000*
wma8       -        -        -        -      0.773


ff123
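
For readers who want to reproduce this kind of table, here is a rough sketch of a paired resampling test in Python.  The dict of per-listener rating arrays and the function names are illustrative, and the adjustment shown is a Holm step-down correction, a simpler stand-in for the resampling-based adjustment used in the actual analysis:

Code: [Select]
import itertools
import numpy as np

rng = np.random.default_rng(0)

def paired_resample_p(a, b, n_resamples=10000):
    # Two-sided sign-flip resampling test on paired ratings: randomly
    # flip the sign of each listener's (a - b) difference and count how
    # often the resampled mean difference beats the observed one.
    d = np.asarray(a, float) - np.asarray(b, float)
    observed = abs(d.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, d.size))
    resampled = np.abs((signs * d).mean(axis=1))
    return (1 + np.sum(resampled >= observed)) / (n_resamples + 1)

def pairwise_p_values(ratings):
    # ratings: dict of codec name -> array of per-listener scores.
    pairs = list(itertools.combinations(ratings, 2))
    p = np.array([paired_resample_p(ratings[x], ratings[y]) for x, y in pairs])
    # Holm step-down adjustment for the multiple pairwise comparisons.
    order = np.argsort(p)
    adjusted = np.empty_like(p)
    running = 0.0
    for rank, idx in enumerate(order):
        running = max(running, (len(p) - rank) * p[idx])
        adjusted[idx] = min(running, 1.0)
    return {pair: (p[i], adjusted[i]) for i, pair in enumerate(pairs)}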
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-22 07:39:49
Going back to dogies.wav, the listeners' corrected item-total correlations were:

1: 0.63
2: 0.70
3: 0.72
4: 0.71
5: 0.70
6: 0.76
7: 0.69
8: 0.74
9: 0.71
10: 0.70
11: 0.71
12: 0.81
13: 0.73
14: 0.71

All the listeners in this data set were fairly well correlated.

ff123
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-22 16:37:16
Added the subanalysis to the report, though maybe not in time for the latest Slashdot discussion.

http://ff123.net/128test/interim.html

ff123
Title: Second in the series of 128 tests
Post by: mithrandir on 2002-01-22 16:39:47
Quote
Code: [Select]
Means:

mpc      ogg      lame     aac      wma8     xing
 4.63     4.09     3.61     3.36     2.11     2.04

These results correlate rather closely with my experience with these codecs overall.
Title: Second in the series of 128 tests
Post by: Jon Ingram on 2002-01-22 17:12:32
This is all very interesting, and this method of outlier removal seems to be exactly what you would want for developing audio codecs -- what you want is to develop something which sounds best to the normal listener.

FF123, what happens to the significance information when you perform the same procedure on the other samples in your test?
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-22 22:30:10
Quote
FF123, what happens to the significance information when you perform the same procedure on the other samples in your test?


Unfortunately, this procedure doesn't work for rawhide.wav.  This is kind of strange, because I know that at one time rawhide.wav had significant results.  I'd guess some sort of factor analysis is needed to pull a cluster of like preferences out of the noise.  I'll post the corrected item-total correlations later today for rawhide.wav and fossiles.wav.

ff123
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-23 04:58:57
Oops.  It does work for rawhide.wav; I made a mistake when calculating the statistic for that file.  The correlation coefficients are listed below.  If I use the same standard as for wayitis and choose only those listeners satisfying 0.7 < r < 1.0, I would be left with only two listeners.  To get a decent group of listeners, I would have to change the standard and include the weakly correlated listeners as well (0.3 < r < 0.7).


 1.  -0.33
 2.   0.36
 4.   0.75
 5.   0.61
 6.   0.49
 7.   0.38
 8.   0.94
10.   0.54
13.  -0.36
14.   0.51
16.   0.06
17.   0.43
18.   0.27
19.   0.54
20.   0.23
21.  -0.01
22.   0.18
23.  -0.40
24.  -0.33
25.   0.01
26.  -0.48

If I include all listeners with 0.3 < r < 1.0, the analysis comes out as follows:

Code: [Select]
Read 6 treatments, 10 samples

                           Unadjusted p-values
        ogg      wma8     mpc      lame     xing
aac      0.679    0.384    0.007*   0.006*   0.000*
ogg        -      0.646    0.020*   0.018*   0.001*
wma8       -        -      0.058    0.053    0.002*
mpc        -        -        -      0.963    0.201
lame       -        -        -        -      0.218

Each '.' is 1,000 resamples.  Each '+' is 10,000 resamples
.........+

                            Adjusted p-values
        ogg      wma8     mpc      lame     xing
aac      0.951    0.791    0.053    0.048*   0.001*
ogg        -      0.951    0.126    0.120    0.004*
wma8       -        -      0.281    0.278    0.018*
mpc        -        -        -      0.960    0.648
lame       -        -        -        -      0.648


ff123
Title: Second in the series of 128 tests
Post by: Delirium on 2002-01-23 05:03:46
ff123: I'm not sure if I'm reading your statistics correctly; do the wayitis results indicate that with a reasonable degree of certainty aac, ogg, and wma all outperformed both mpc and lame on this sample?  Seems a lot different than the results for the other samples, but plausible.
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-23 05:14:52
Quote
ff123: I'm not sure if I'm reading your statistics correctly; do the wayitis results indicate that with a reasonable degree of certainty aac, ogg, and wma all outperformed both mpc and lame on this sample? Seems a lot different than the results for the other samples, but plausible.


For wayitis, for the nine highly correlated listeners, after adjustment for multiple comparisons:

mpc is better than xing
ogg is better than xing
lame is better than xing
aac is better than xing
mpc is better than wma8
ogg is better than wma8
lame is better than wma8
aac is better than wma8
mpc is better than aac
ogg is better than aac
mpc is better than lame

with 95% confidence
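
Each line above corresponds to an adjusted p-value below 0.05 in the wayitis table.  Reading that table mechanically (a small Python sketch; the pairs are keyed with the higher-mean codec first, and the values are transcribed from the subanalysis above):

Code: [Select]
# Adjusted p-values from the wayitis subanalysis, keyed as
# (higher-mean codec, lower-mean codec).
adj_p = {("mpc", "ogg"): 0.077,  ("mpc", "lame"): 0.001, ("mpc", "aac"): 0.000,
         ("mpc", "wma8"): 0.000, ("mpc", "xing"): 0.000, ("ogg", "lame"): 0.114,
         ("ogg", "aac"): 0.011,  ("ogg", "wma8"): 0.000, ("ogg", "xing"): 0.000,
         ("lame", "aac"): 0.465, ("lame", "wma8"): 0.000, ("lame", "xing"): 0.000,
         ("aac", "wma8"): 0.000, ("aac", "xing"): 0.000,  ("wma8", "xing"): 0.773}

for (better, worse), p in adj_p.items():
    if p < 0.05:  # 95% confidence after adjustment
        print(better, "is better than", worse)

This prints exactly the eleven comparisons listed above.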

ff123
Title: Second in the series of 128 tests
Post by: tangent on 2002-01-23 09:42:45
ff123, what happens if you consider only the rawhide results from the 9 listeners who "passed" the wayitis results?
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-23 15:34:41
Quote
what happens if you consider only the rawhide results from the 9 listeners who "passed" the wayitis results?


The results wouldn't be as significant as what I posted above.  For example, xiphmont has a negative correlation on rawhide.  Actually, I'm a bit leery of digging out groups of people this way.  Grouping together a bunch of strongly correlated people is one thing (r > 0.7).  It's another to pull in weakly correlated people as well.

ff123
Title: Second in the series of 128 tests
Post by: tangent on 2002-01-28 05:14:35
What about using this technique for AQ1 results?
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-28 05:47:56
I thought about that, but I need to automate the process before I apply it to AQ1.  I did the others by hand.

ff123
Title: Second in the series of 128 tests
Post by: ff123 on 2002-01-28 06:53:13
Ah, what the heck.  I was curious.

I found the following correlations by listener, sorted from most to least correlated (I am listener 6):

Code: [Select]
listener    r
   6         0.87
  20         0.79
  17         0.74
   1         0.71
  34         0.67
  13         0.67
   7         0.63
  30         0.60
  15         0.58
  37         0.56
  11         0.54
  41         0.54
  35         0.45
   9         0.43
  16         0.42
  10         0.38
   4         0.30
  18         0.29
  39         0.08
   2         0.06
  14         0.05
  38         0.02
  25        -0.01
  23        -0.07
  36        -0.12
  29        -0.17
  32        -0.56
  28        -0.56


If I choose only the 18 listeners with at least weak positive correlation (including listener 18), I get the following results:

Code: [Select]
mpc      dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
 4.76     4.63     4.49     4.38     4.36     4.29     4.27     3.81

                           Unadjusted p-values
        dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.379    0.068    0.010*   0.007*   0.002*   0.001*   0.000*
dm-std     -      0.339    0.087    0.062    0.021*   0.015*   0.000*
dm-xtrm    -        -      0.444    0.359    0.169    0.137    0.000*
dm-ins     -        -        -      0.878    0.540    0.467    0.000*
cbr256     -        -        -        -      0.646    0.566    0.000*
abr224     -        -        -        -        -      0.908    0.001*
r3mix      -        -        -        -        -        -      0.002*

Each '.' is 1,000 resamples.  Each '+' is 10,000 resamples
.........+

                            Adjusted p-values
        dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.924    0.459    0.120    0.087    0.025*   0.020*   0.000*
dm-std     -      0.931    0.522    0.445    0.203    0.166    0.000*
dm-xtrm    -        -      0.922    0.922    0.724    0.660    0.000*
dm-ins     -        -        -      0.985    0.922    0.922    0.003*
cbr256     -        -        -        -      0.941    0.922    0.005*
abr224     -        -        -        -        -      0.985    0.021*
r3mix      -        -        -        -        -        -      0.027*


ff123
Title: Second in the series of 128 tests
Post by: Delirium on 2002-01-28 08:21:51
Again I seem to have trouble reading these charts, but would it be correct then to say that this analysis does not show any statistically significant difference between MPC, dm-std, and dm-xtrm (on the high end)?  Also interesting that the average for dm-std seems to be higher than that for dm-xtrm, though again there's no statistically significant difference (I think?).
Title: Second in the series of 128 tests
Post by: Jon Ingram on 2002-01-28 09:39:50
Quote
Again I seem to have trouble reading these charts

The only statistically significant results (after resampling) were:
*everything* is better than cbr192
*mpc* is also better than r3mix and abr224.