Second in the series of 128 tests

Reply #25
In biological systems, I suppose it is possible to get unusual sensitivities, freak performances and critical failings. I have read of a human hearing defect where a person hears a different pitch in each ear; using such a subject to develop an audio coding system wouldn't be useful.

It is more useful to look at attributes and responses that can be categorised as standard subject response. To do otherwise would be to study atypical human perception and disease.

For developing perceptual audio coding systems, one should be able to identify and categorise artifacts that "typical" listeners will recognise and dislike. I think that ff123 has identified that most of his listening group responded in a similar fashion to the artifacts produced by the codecs. This must represent the standard response to artifacts by the human ear/brain system. There will be some who respond differently, but they would be better pulled from the testing group on the basis of outlier performance.
Ruse
____________________________
Don't let the uncertainty turn you around,
Go out and make a joyful sound.

Reply #26
Quote
Originally posted by Ruse
Why don't you analyse and publish the results without listener 28 for comparison purposes? There must be a statistical validity of some type for excluding "wonky" data. I think the plots you have shown above indicate that listener 28 is an "outlier".

Can't you just exclude him on the basis of being more than 2 standard deviations from the mean?


No. The analysis that was used doesn't have a concept of 'standard deviation' anyway, and 'removing' data is always a very tricky thing to do, and not even generally accepted as possible in a statistically valid way.

Note that this guy would have passed even if post-screening had been used. He is a valid data point. Us not liking what the data says doesn't change that.

--
GCP

Reply #27
I've been getting some help from Rich Ulrich in sci.stat.math in identifying outliers, and it appears that the statistic to use is the "corrected item-total correlation," or the (Pearson) correlation of each rater with the average for all the other raters.

For example, using this statistic, Monty has a correlation coefficient of 0.86, and Joerg (listener 28) has a value of -0.81.

A large, negative value (near -1.0) indicates a preference that runs highly counter to the general trend.
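For anyone who wants to try the statistic themselves, here is a minimal sketch of the corrected item-total correlation as described above: each listener's ratings are correlated with the mean of everyone else's ratings. The function names and the ratings are made up for illustration; they are not the test data, and this is not necessarily how the numbers in this thread were computed.

```python
import math

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corrected_item_total(ratings):
    """ratings: one row per listener, one column per codec.
    Returns, for each listener, the correlation between that listener's
    ratings and the average ratings of all the OTHER listeners."""
    n = len(ratings)
    result = []
    for i, own in enumerate(ratings):
        others = [row for j, row in enumerate(ratings) if j != i]
        others_mean = [sum(col) / (n - 1) for col in zip(*others)]
        result.append(pearson(own, others_mean))
    return result

# Hypothetical ratings (NOT the real test data): three listeners who
# broadly agree, and a fourth whose preferences run the other way.
ratings = [
    [5.0, 4.0, 3.5, 3.0, 2.0, 1.5],
    [4.5, 4.5, 3.0, 3.5, 2.5, 2.0],
    [5.0, 3.5, 4.0, 3.0, 1.5, 2.0],
    [1.5, 2.0, 3.0, 3.5, 4.5, 5.0],
]
r = corrected_item_total(ratings)
for listener, value in enumerate(r, start=1):
    print(f"listener {listener}: r = {value:+.2f}")
```

The "corrected" part is simply that a listener's own ratings are left out of the average they are compared against, so a listener cannot inflate their own correlation.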

I will be performing a sub-analysis in the near future for those listeners (there are 9 of them) who are highly and positively correlated.

ff123

Reply #28
Subanalysis based on the nine listeners who were highly correlated with each other (r > 0.7).  These were the following:

Code:
listener    r
  1       0.86
  2       0.95
  6       0.80
 10       0.86
 14       0.84
 18       0.82
 19       0.96
 23       0.86
 27       0.92


Resampling analysis as follows:

Code:
Means:

mpc      ogg      lame     aac      wma8     xing
 4.63     4.09     3.61     3.36     2.11     2.04

                           Unadjusted p-values
        ogg      lame     aac      wma8     xing
mpc      0.022*   0.000*   0.000*   0.000*   0.000*
ogg        -      0.043*   0.003*   0.000*   0.000*
lame       -        -      0.270    0.000*   0.000*
aac        -        -        -      0.000*   0.000*
wma8       -        -        -        -      0.772

Each '.' is 1,000 resamples.  Each '+' is 10,000 resamples
.........+

                            Adjusted p-values
        ogg      lame     aac      wma8     xing
mpc      0.077    0.001*   0.000*   0.000*   0.000*
ogg        -      0.114    0.011*   0.000*   0.000*
lame       -        -      0.465    0.000*   0.000*
aac        -        -        -      0.000*   0.000*
wma8       -        -        -        -      0.773

ff123

Reply #29
Going back to dogies.wav, the listener corrected item-total correlations were:

1: 0.63
2: 0.70
3: 0.72
4: 0.71
5: 0.70
6: 0.76
7: 0.69
8: 0.74
9: 0.71
10: 0.70
11: 0.71
12: 0.81
13: 0.73
14: 0.71

All the listeners on this data set were fairly well correlated.

ff123

 

Reply #30
Added the subanalysis to the report, maybe not in time for the latest slashdot discussion, though.

http://ff123.net/128test/interim.html

ff123

Reply #31
Quote
Code:
Means:

mpc      ogg      lame     aac      wma8     xing
 4.63     4.09     3.61     3.36     2.11     2.04

These results correlate rather closely to my experience with these codecs overall.

Reply #32
This is all very interesting, and this way of outlier removal seems exactly what you would want for developing audio codecs -- what you want to do is to develop something which sounds the best for the normal listener.

FF123, what happens to the significance information when you perform the same procedure on the other samples in your test?

Reply #33
Quote
FF123, what happens to the significance information when you perform the same procedure on the other samples in your test?


Unfortunately, this procedure doesn't work for rawhide.wav.  This is kind of strange because I know that at one time rawhide.wav had significant results.  I'd guess some sort of factor analysis is needed to pull a cluster of like-preferences out of the noise.  I'll post the corrected item-total correlations later today for rawhide.wav and fossiles.wav.

ff123

Reply #34
Oops.  It does work for rawhide.wav.  I made a mistake when calculating the statistic for that file.  The correlation coefficients are listed below.  If I use the same standard as wayitis, and choose only those listeners satisfying 0.7 < r < 1.0, that would leave me with only two listeners.  To get a decent group of listeners, I would have to change the standard and include weakly correlated listeners as well (0.3 < r < 0.7).


 1.  -0.33
 2.   0.36
 4.   0.75
 5.   0.61
 6.   0.49
 7.   0.38
 8.   0.94
10.   0.54
13.  -0.36
14.   0.51
16.   0.06
17.   0.43
18.   0.27
19.   0.54
20.   0.23
21.  -0.01
22.   0.18
23.  -0.40
24.  -0.33
25.   0.01
26.  -0.48
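As a quick check of how the choice of threshold changes the group size, here is a small sketch that applies both cut-offs to the rawhide.wav correlations listed above. The helper name `select` is made up for illustration; the correlation values are copied from the list.

```python
# Corrected item-total correlations for rawhide.wav, keyed by listener
# number, copied from the list above.
rawhide_r = {
    1: -0.33, 2: 0.36, 4: 0.75, 5: 0.61, 6: 0.49, 7: 0.38, 8: 0.94,
    10: 0.54, 13: -0.36, 14: 0.51, 16: 0.06, 17: 0.43, 18: 0.27,
    19: 0.54, 20: 0.23, 21: -0.01, 22: 0.18, 23: -0.40, 24: -0.33,
    25: 0.01, 26: -0.48,
}

def select(correlations, low, high=1.0):
    """Listeners whose correlation satisfies low < r < high."""
    return sorted(l for l, r in correlations.items() if low < r < high)

strong = select(rawhide_r, 0.7)         # strongly correlated only
at_least_weak = select(rawhide_r, 0.3)  # weakly correlated included

print(len(strong), strong)
print(len(at_least_weak), at_least_weak)
```

The strict standard leaves listeners 4 and 8 only, while the relaxed one keeps ten listeners.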

If I include all listeners with 0.3 < r < 1.0, I get the following analysis:

Code:
Read 6 treatments, 10 samples

                           Unadjusted p-values
        ogg      wma8     mpc      lame     xing
aac      0.679    0.384    0.007*   0.006*   0.000*
ogg        -      0.646    0.020*   0.018*   0.001*
wma8       -        -      0.058    0.053    0.002*
mpc        -        -        -      0.963    0.201
lame       -        -        -        -      0.218

Each '.' is 1,000 resamples.  Each '+' is 10,000 resamples
.........+

                            Adjusted p-values
        ogg      wma8     mpc      lame     xing
aac      0.951    0.791    0.053    0.048*   0.001*
ogg        -      0.951    0.126    0.120    0.004*
wma8       -        -      0.281    0.278    0.018*
mpc        -        -        -      0.960    0.648
lame       -        -        -        -      0.648


ff123

Reply #35
ff123: I'm not sure if I'm reading your statistics correctly; do the wayitis results indicate that with a reasonable degree of certainty aac, ogg, and wma all outperformed both mpc and lame on this sample?  Seems a lot different than the results for the other samples, but plausible.

Reply #36
Quote
ff123: I'm not sure if I'm reading your statistics correctly; do the wayitis results indicate that with a reasonable degree of certainty aac, ogg, and wma all outperformed both mpc and lame on this sample? Seems a lot different than the results for the other samples, but plausible.


For wayitis, for the nine highly correlated listeners, after adjustment for multiple comparisons,

mpc is better than xing
ogg is better than xing
lame is better than xing
aac is better than xing
mpc is better than wma8
ogg is better than wma8
lame is better than wma8
aac is better than wma8
mpc is better than aac
ogg is better than aac
mpc is better than lame

with 95% confidence
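For anyone else having trouble reading the tables, the list above can be reproduced mechanically: take the adjusted p-values for wayitis, keep every pair with p < 0.05, and read the pair as "row codec is better than column codec", since the codecs are ordered by mean rating from highest to lowest. A small sketch follows; the variable names are made up and the p-values are copied from the adjusted table for the nine listeners.

```python
# Adjusted p-values for wayitis (nine highly correlated listeners).
# Keys are (higher-rated codec, lower-rated codec).
adjusted_p = {
    ("mpc", "ogg"): 0.077, ("mpc", "lame"): 0.001, ("mpc", "aac"): 0.000,
    ("mpc", "wma8"): 0.000, ("mpc", "xing"): 0.000,
    ("ogg", "lame"): 0.114, ("ogg", "aac"): 0.011,
    ("ogg", "wma8"): 0.000, ("ogg", "xing"): 0.000,
    ("lame", "aac"): 0.465, ("lame", "wma8"): 0.000, ("lame", "xing"): 0.000,
    ("aac", "wma8"): 0.000, ("aac", "xing"): 0.000,
    ("wma8", "xing"): 0.773,
}

ALPHA = 0.05  # 95% confidence

significant = [pair for pair, p in adjusted_p.items() if p < ALPHA]
not_significant = [pair for pair, p in adjusted_p.items() if p >= ALPHA]

for better, worse in significant:
    print(f"{better} is better than {worse}")
print("no significant difference:", not_significant)
```

Eleven of the fifteen pairs pass; the four that do not are mpc/ogg, ogg/lame, lame/aac and wma8/xing.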

ff123

Reply #37
ff123, what happens if you consider only the rawhide results from the 9 listeners who "passed" the wayitis results?

Reply #38
Quote
what happens if you consider only the rawhide results from the 9 listeners who "passed" the wayitis results?


The results wouldn't be as significant as what I posted above.  For example, xiphmont has a negative correlation on rawhide.  Actually, I'm a bit leery of digging out groups of people this way.  Grouping together a bunch of strongly correlated people is one thing (r > 0.7).  It's another to pull in weakly correlated people as well.

ff123

Reply #39
What about using this technique for AQ1 results?

Reply #40
I thought about that, but I need to automate the process before I apply it to AQ1.  I did the others by hand.

ff123

Reply #41
Ah, what the heck.  I was curious.

I found the following correlations by listener, sorted from most to least correlated (I am listener 6):

Code:
listener    r
   6         0.87
  20         0.79
  17         0.74
   1         0.71
  34         0.67
  13         0.67
   7         0.63
  30         0.60
  15         0.58
  37         0.56
  11         0.54
  41         0.54
  35         0.45
   9         0.43
  16         0.42
  10         0.38
   4         0.30
  18         0.29
  39         0.08
   2         0.06
  14         0.05
  38         0.02
  25        -0.01
  23        -0.07
  36        -0.12
  29        -0.17
  32        -0.56
  28        -0.56


If I choose only the 18 listeners with at least weak positive correlation (including listener 18), I get the following results:

Code:
mpc      dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
 4.76     4.63     4.49     4.38     4.36     4.29     4.27     3.81

                           Unadjusted p-values
        dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.379    0.068    0.010*   0.007*   0.002*   0.001*   0.000*
dm-std     -      0.339    0.087    0.062    0.021*   0.015*   0.000*
dm-xtrm    -        -      0.444    0.359    0.169    0.137    0.000*
dm-ins     -        -        -      0.878    0.540    0.467    0.000*
cbr256     -        -        -        -      0.646    0.566    0.000*
abr224     -        -        -        -        -      0.908    0.001*
r3mix      -        -        -        -        -        -      0.002*

Each '.' is 1,000 resamples.  Each '+' is 10,000 resamples
.........+

                            Adjusted p-values
        dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.924    0.459    0.120    0.087    0.025*   0.020*   0.000*
dm-std     -      0.931    0.522    0.445    0.203    0.166    0.000*
dm-xtrm    -        -      0.922    0.922    0.724    0.660    0.000*
dm-ins     -        -        -      0.985    0.922    0.922    0.003*
cbr256     -        -        -        -      0.941    0.922    0.005*
abr224     -        -        -        -        -      0.985    0.021*
r3mix      -        -        -        -        -        -      0.027*


ff123

Reply #42
Again I seem to have trouble reading these charts, but would it be correct then to say that this analysis does not show any statistically significant difference between MPC, dm-std, and dm-xtrm (on the high end)?  Also interesting that the average for dm-std seems to be higher than that for dm-xtrm, though again there's no statistically significant difference (I think?).

Reply #43
Quote
Again I seem to have trouble reading these charts

The only statistically significant results (after resampling) were:
*everything* is better than cbr192
*mpc* is also better than r3mix and abr224.