Topic: Statistical Methods for Listening Tests(splitted R3mix VBR s

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #25
The analysis tool I wrote can now be run from the web:

http://ff123.net/friedman/stats.html

Have fun!

ff123

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #26
Quote
Originally posted by ff123

Yes. I am thinking of looking through a book by Hollander and Wolfe, which concentrates on non-parametric methods, to see if they cover the Waller-Duncan Bayes LSD.


Hollander, Myles & Wolfe, Douglas A.: Nonparametric Statistical Methods. New York: Wiley, 1999. XIV, 787 p.

Dey, Dipak, Peter Muller & Debajyoti Sinha (eds.): Practical Nonparametric and Semiparametric Bayesian Statistics. New York: Springer, 1998. XVI, 369 p., ill.

I'll try to get them Tuesday or Wednesday. Library opening hours suck for me, though.

--
GCP

 

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #27
I found something else that may be interesting.

I had been wondering for a while whether, since we have a pretty good idea of what the actual distribution looks like (approximately normal, with clipping), there is a way to make use of that instead of a pure nonparametric test. If you can bring more knowledge about the distribution to bear, you should be able to get more sensitive tests.

Something like this already seems to exist, and it's called bootstrapping (it seems to be a fairly new technique, too).

Basically, starting from your sample, you use a large number of simulations to determine the distribution function of the actual population.

I think that once you are able to determine the distribution function, it should be possible to create an appropriate test for inequality of means. I've only skimmed quickly through the book I have, but one method seems to be via simulations again.
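To make the idea concrete, here is a minimal sketch of the resampling step in C (the scores are made-up numbers, purely for illustration): you draw from your own sample, with replacement, many times, and collect the statistic each time.

/* Minimal bootstrap sketch (illustrative only, made-up data):
 * resample the sample itself, with replacement, many times to
 * approximate the distribution of a statistic (here, the mean). */
#include <stdlib.h>

#define TRIALS 10000

static double bootstrap_mean(const double *data, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += data[rand() % n];    /* draw with replacement */
    return sum / n;
}

int main(void)
{
    double scores[] = { 3.5, 4.0, 4.5, 2.5, 5.0, 3.0, 4.0, 4.5 };
    int n = sizeof(scores) / sizeof(scores[0]);
    static double dist[TRIALS];

    for (int t = 0; t < TRIALS; t++)
        dist[t] = bootstrap_mean(scores, n);

    /* dist[] now approximates the sampling distribution of the
     * mean; confidence intervals etc. can be read off from it. */
    return 0;
}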

On a related note, this book says that the Wilcoxon signed rank test needs symmetric distributions. If that's the case, then it's not applicable to the AQ test results, I think.

With some luck the Hollander/Wolfe book will answer these questions. I should be able to get it tomorrow afternoon.

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #28
This is a fascinating topic which I had no idea existed.  My elementary statistics book is dated 1973, and apparently the field was opened up by Dr. Efron in 1977.  The general technique is called resampling, and there are actually four different types of resampling methods:

1. the bootstrap, invented by Bradley Efron;
2. the jackknife, invented by Maurice Quenouille and later developed by John W. Tukey;
3. cross-validation, developed by Seymour Geisser, Mervyn Stone, and Grace G. Wahba;
4. balanced repeated replication, developed by Philip J. McCarthy.

This page was informative, with some references:

http://ericae.net/pare/getvn.asp?v=3&n=5

Here's a conceptually simple example taken from that page:

****
For simplicity, let's assume that a district has 13 voucher students and 39 non-voucher students, and the mean difference is 10 standard score units. To empirically construct the distribution, we'd follow these steps:

1.  Create a database with all the student grades.
2.  Randomly sort the database.
3.  Compute the mean for the first 13 students.
4.  Compute the mean for the other 39 students.
5.  Record the test statistic--the absolute value of the mean difference.

Then repeat steps 2 through 5 many times.  That way, we'd get the distribution of mean differences when we randomly select students. The probability of observing a mean difference of 10 when everything is random is the proportion of experimental test statistics in step 5 that are greater than 10.
****

It's very simple conceptually, and it assumes neither that the data are normal nor that the sample is randomly drawn from some population.  Very nice.
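Coded up, the example might look something like this (just a sketch; grades[] would have to be filled with the actual pooled scores, and the cutoff of 10 is the observed difference from the example):

/* Sketch of the permutation test described above. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N_VOUCHER 13
#define N_TOTAL   52        /* 13 voucher + 39 non-voucher */
#define TRIALS    10000

static void shuffle(double *a, int n)       /* Fisher-Yates, step 2 */
{
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        double tmp = a[i]; a[i] = a[j]; a[j] = tmp;
    }
}

static double mean_diff(const double *a)    /* steps 3-5 */
{
    double s1 = 0.0, s2 = 0.0;
    for (int i = 0; i < N_VOUCHER; i++) s1 += a[i];
    for (int i = N_VOUCHER; i < N_TOTAL; i++) s2 += a[i];
    return fabs(s1 / N_VOUCHER - s2 / (N_TOTAL - N_VOUCHER));
}

int main(void)
{
    double grades[N_TOTAL] = { 0 };  /* step 1: fill with real data */
    double observed = 10.0;          /* observed mean difference    */
    int exceed = 0;

    for (int t = 0; t < TRIALS; t++) {
        shuffle(grades, N_TOTAL);
        if (mean_diff(grades) >= observed)
            exceed++;
    }
    printf("p = %g\n", (double)exceed / TRIALS);
    return 0;
}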

ff123

Edit:  another web reference lists the four major types of resampling methods as follows:

(http://seamonkey.ed.asu.edu/~alex/teaching...resampling.html)


****
There are four major types of resampling:

Randomization exact test: It is also known as the permutation test. Surprise! It was developed by R. A. Fisher, the founder of classical statistical testing. However, in his later years Fisher lost interest in the permutation method because there were no computers in his day to automate such a laborious method.

Cross-validation: It was developed by Seymour Geisser, Mervyn Stone, and Grace G. Wahba.

Jacknife: It is also known as Jackknife and Quenouille-Tukey Jackknife. It was invented by Maurice Quenouille (1949) and later developed by John W. Tukey (1958). The name "jacknife" was coined by Tukey to imply that the method is an all-purpose statistical tool.

Bootstrap: It was invented by Bradley Efron and further developed by Efron & Tibshirani (1993). It means that one available sample gives rise to many others by resampling (pulling yourself up by your own bootstraps).

Among the four methods, the first and the last are the more useful. The principles of cross-validation, Jacknife, and bootstrap are very similar, but bootstrap overshadows the others, for it is a more thorough procedure. Indeed, Jacknife is largely of historical interest today (Mooney & Duval, 1993). (Nevertheless, Jacknife is still useful in EDA for assessing how each subsample affects the model.)
****

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #29
Here is my untutored guess on how a resampled analysis method would work on, for example, the AQ1 test data (using rank scale):

1. Convert to ranks instead of ratings
2. For each listener, randomize the order of the ranking, then add up each column (codec setting).
3. Calculate the difference between all pairs of columns.  For 8 codecs, there will be 28 pairs of columns.
4. Repeat 1000 times (or however many times you want).  At the end, one should have 28 distributions of differences.
5. Compare the actual difference in ranksums to the simulated distributions to come up with a p-value for each of the 28 pair comparisons.

Is it really that simple?

ff123

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #30
Quote
Originally posted by ff123
4. Repeat 1000 times (or however many times you want).  At the end, one should have 28 distributions of differences.
5. Compare the actual difference in ranksums to the simulated distributions to come up with a p-value for each of the 28 pair comparisons.


You don't even need the distribution. During the simulation, you just count how many times the difference in ranksums in the trials exceeded the one you've got.

At the end, you divide that by the number of trials. Voilà.
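Putting your steps and this shortcut together, a rough sketch in C (assuming ranks[][] already holds the per-listener ranks from step 1, with the 8 codecs and 42 listeners of the AQ1 data; re-shuffling a row never changes which ranks it contains, so the rows can be permuted in place):

/* Rough sketch of steps 2-5 plus the counting shortcut. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NCODECS    8
#define NLISTENERS 42
#define TRIALS     1000

double ranks[NLISTENERS][NCODECS];      /* step 1 output goes here */

static void shuffle_row(double *row)    /* step 2: one listener */
{
    for (int i = NCODECS - 1; i > 0; i--) {   /* Fisher-Yates */
        int j = rand() % (i + 1);
        double tmp = row[i]; row[i] = row[j]; row[j] = tmp;
    }
}

static void ranksums(double sums[NCODECS])  /* step 2: column sums */
{
    for (int c = 0; c < NCODECS; c++) {
        sums[c] = 0.0;
        for (int l = 0; l < NLISTENERS; l++)
            sums[c] += ranks[l][c];
    }
}

int main(void)
{
    double obs[NCODECS], sim[NCODECS];
    static int exceed[NCODECS][NCODECS];    /* zero-initialized */

    ranksums(obs);          /* observed ranksums, before any shuffling */

    for (int t = 0; t < TRIALS; t++) {
        for (int l = 0; l < NLISTENERS; l++)
            shuffle_row(ranks[l]);              /* step 2 */
        ranksums(sim);
        for (int a = 0; a < NCODECS; a++)       /* steps 3-5: 28 pairs */
            for (int b = a + 1; b < NCODECS; b++)
                if (fabs(sim[a] - sim[b]) >= fabs(obs[a] - obs[b]))
                    exceed[a][b]++;     /* trial beat the real data */
    }

    for (int a = 0; a < NCODECS; a++)
        for (int b = a + 1; b < NCODECS; b++)
            printf("codecs %d vs %d: p = %g\n",
                   a, b, (double)exceed[a][b] / TRIALS);
    return 0;
}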

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #31
Man, that sounds sweet.  It shouldn't be too hard to code up given that my current program has almost all the bits and pieces needed.

So does this sidestep the problem of doing multiple pairwise comparisons?  It seems like it, to me.

ff123

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #32
Quote
Originally posted by ff123
So does this sidestep the problem of doing multiple pairwise comparisons?  It seems like it, to me.


I don't immediately see how. Any suggestions?

Edit3: Deleted Edit 1 and 2

If you can get pairwise results with a very high significance, then it's interesting anyway, as the overall result may still be significant even with 28 comparisons.

I got the Hollander & Wolfe book, but they gave me an edition from 1972. It doesn't have anything about bootstrapping, nor about the Bayesian LSD. It does have a simultaneous comparison method, but I think it's just the nonparametric Tukey HSD (it gives the same results, too...)

The second book was lent out... to my stats professor from last year.

If possible, I'll go back tomorrow and try to get my hands on the 1999 edition.

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #33
Quote
If you can get pairwise results with a very high significance, then it's interesting anyway, as the overall result may still be significant even with 28 comparisons.


That's what I meant -- that the overall result may still be significant even with 28 pairwise comparisons. So we should both be able to agree on the results.

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #34
This page:

http://www.uvm.edu/~dhowell/StatPages/Resa...ndomOneway.html

implies that resampling methods for multiple means are not as simple as outlined in the previous messages. It still talks about Bonferroni adjustments, for example.

Book reference:

Westfall, P. H. & Young, S. S. (1993). Resampling-Based Multiple Testing. New York: John Wiley & Sons.

ff123

Edit:  another page discussing p-value adjustments and resampling:

http://www.rz.tu-clausthal.de/sashtml/stat/chap43/sect14.htm

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #35
Quote
Originally posted by ff123
implies that resampling methods for multiple means are not as simple as outlined in the previous messages. It still talks about Bonferroni adjustments, for example.


You will always have that problem if you use pairwise tests to do a multiple comparison. The issue is that _if_ the bootstrap comparison is strong enough to give very significant pairwise results, there may be more results that are still significant even _after_ the Bonferroni correction.

The problem with Bonferroni is that it's so crude, and throws away a lot of results that may have been correct.

The second link is very interesting because it gives correction methods that are less crude.

Ironically, one of them is bootstrapping. It works by checking how many times you would incorrectly have concluded that one of the p-values was significant (incorrectly, because the bootstrap uses random data). More specifically, it determines the lowest significance level that random data would give across the comparisons, and the proportion of trials in which that minimum is lower than a significance level you got on your test becomes your new, corrected significance level.

I just _love_ this method. I can actually understand how it works.
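In rough C, the correction might look like this (a sketch only; permute_data() and pairwise_pvalues() are assumed helpers, not real code, and the resampling inside pairwise_pvalues() is what makes the whole thing so expensive):

/* Sketch of the resampling-based correction described above
 * (the minP idea from Westfall & Young). */
#define NPAIRS 28
#define OUTER  10000

void permute_data(void);            /* make a random (null) dataset */
void pairwise_pvalues(double p[]);  /* p-values for all 28 pairs,   */
                                    /* itself done by resampling    */

void minp_adjust(const double observed_p[NPAIRS],
                 double adjusted[NPAIRS])
{
    int count[NPAIRS] = { 0 };

    for (int b = 0; b < OUTER; b++) {
        double p[NPAIRS];
        permute_data();
        pairwise_pvalues(p);

        double minp = 1.0;          /* lowest p random data produced */
        for (int j = 0; j < NPAIRS; j++)
            if (p[j] < minp) minp = p[j];

        for (int i = 0; i < NPAIRS; i++)
            if (minp <= observed_p[i])
                count[i]++;         /* random data beat this result */
    }

    for (int i = 0; i < NPAIRS; i++)
        adjusted[i] = (double)count[i] / OUTER;  /* corrected alpha */
}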

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #36
I'm looking for volunteers to carry out (or possibly translate) tests on the various audio formats.
Contact me.

http://www.patchworks.it

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #37
I've written a crude utility based on your code. It doesn't do the simultaneous comparison correction yet.

(and may have bugs)

After 1 000 000 simulations:

Input file : aq1.txt
Read 8 treatments, 42 samples

[FUBAR formatted table snipped]

Resampling...

cbr192 is worse than abr224 (0.01576)
cbr192 is worse than dm-xtrm (0.00030)
cbr192 is worse than mpc (0.00001)
cbr192 is worse than dm-ins (0.01259)
cbr192 is worse than cbr256 (0.01840)
cbr192 is worse than dm-std (0.00216)
abr224 is worse than mpc (0.01117)
r3mix is worse than dm-xtrm (0.01595)
r3mix is worse than mpc (0.00082)
mpc is better than dm-ins (0.01446)
mpc is better than cbr256 (0.00959)

After Bonferroni correction the alpha level is 0.001831, so that leaves:

mpc > cbr192, r3mix
dmxtrm > cbr192
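(With 28 comparisons at an overall 0.05 level: 1 - (1 - 0.05)^(1/28) ≈ 0.00183, the Šidák variant of the correction; plain 0.05/28 would be ≈ 0.00179.)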

For fun, I'm going to check whether testing vs the means gives more or less power.

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #38
Using means:

cbr192 is worse than abr224 (0.02259)
cbr192 is worse than r3mix (0.04574)
cbr192 is worse than dm-xtrm (0.00050)
cbr192 is worse than mpc (0.00000)
cbr192 is worse than dm-ins (0.00475)
cbr192 is worse than cbr256 (0.01107)
cbr192 is worse than dm-std (0.00016)
abr224 is worse than mpc (0.00149)
abr224 is worse than dm-std (0.03514)
r3mix is worse than dm-xtrm (0.04195)
r3mix is worse than mpc (0.00056)
r3mix is worse than dm-std (0.01679)
dm-xtrm is worse than mpc (0.04614)
mpc is better than dm-ins (0.00782)
mpc is better than cbr256 (0.00324)

So that would give:

mpc > r3mix, abr224, cbr192
dmstd > cbr192
dmxtrm > cbr192

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #39
Garf, very nice.

I've got Westfall and Young on order from barnesandnoble.com, should be an entertaining read.

ff123

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #40
I've got the simultaneous bootstrap correction implemented too, but darn, this thing needs horsepower!

You need at least 10 000 iterations to more or less converge on the alpha values, and then, for each of those, another 10 000 to check the alpha values.

10 000 * 10 000 = 100 000 000 tests!

Yikes!

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #41
lol!

How long does it take to run 100 million trials?  What kind of computer do you have and how fast is it?

ff123

Edit:  And what were the results?

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #42
It took about 1-2 minutes for a test with 1000 * 1000 trials (and it didn't converge very well for the really small values, so more is definitely needed). One with 10 000 * 10 000 should take about 100-200 minutes.

I have an Athlon 1000. Optimizing my code would probably make things faster though. I didn't exactly code for efficiency.

It's up at http://sjeng.org/ftp/bootstrap.c

If possible, could you proofread it for bugs? Note that I switched it to medians instead of rank scores. If I understand things correctly, you can use whatever gives the most power. Means could be very interesting.

Edit: We will know the results in about 100 minutes

Edit2: I confused medians and means...I think.

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #43
I took a quick look at the code.  I checked the random number generation implementation, which looked correct.  If I understand the resampling algorithm correctly, the code randomly chooses a listener, then for that listener, it shuffles the codecs.  Then it randomly chooses another listener, but it could choose the same listener (I think).  It does that N times, where N is the number of listeners.

Why not just shuffle the codecs for each listener?  BTW, this was just a 5 minute glance, so I may have interpreted the code wrong (I have a hard enough time looking over my own code sometimes).

Also, I noticed that your code limits the number of listeners to MAXSAMPLE, which you set to 50.  Can it be implemented to not care how many listeners are in the data input?

ff123

Edit:  There's at least one other way I can think of to resample, besides the way it appears you did it, and the way I just described:  Pool all the rankings together from all listeners, then randomly grab rankings out of that pool one at a time (replacing rankings each time) to reconstruct a new matrix.  Which is the correct way to do it?

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #44
Quote
Originally posted by ff123
I took a quick look at the code.  I checked the random number generation implementation, which looked correct.


Perhaps a simple improvement is to pick a faster, higher-quality random number generator. I think I have some of those still lying around.

Quote
If I understand the resampling algorithm correctly, the code randomly chooses a listener, then for that listener, it shuffles the codecs.  Then it randomly chooses another listener, but it could choose the same listener (I think).  It does that N times, where N is the number of listeners.
Why not just shuffle the codecs for each listener? 


I think you're right on this. I read up a bit more, and sampling with or without replacement is one of the differences between a bootstrap and a randomization method. Since what we want is a randomization method, there should be no replacement. That said, the two are so closely related that I expect no differences in the results.
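In code, the distinction is a single line, something like this sketch (everything else in the procedure stays the same):

#include <stdlib.h>

/* bootstrap-style resample: draw n values WITH replacement */
void resample(const double *data, double *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = data[rand() % n];
}

/* randomization-style resample: permute, i.e. WITHOUT replacement */
void permute(double *data, int n)
{
    for (int i = n - 1; i > 0; i--) {   /* Fisher-Yates */
        int j = rand() % (i + 1);
        double tmp = data[i]; data[i] = data[j]; data[j] = tmp;
    }
}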

Quote
Also, I noticed that your code limits the number of listeners to MAXSAMPLE, which you set to 50.  Can it be implemented to not care how many listeners are in the data input?


Sure, just set MAXSAMPLE higher.

Quote
Edit:  There's at least one other way I can think of to resample, besides the way it appears you did it, and the way I just described:  Pool all the rankings together from all listeners, then randomly grab rankings out of that pool one at a time (replacing rankings each time) to reconstruct a new matrix.  Which is the correct way to do it?


Under the null hypothesis, each setting has an equal chance of getting a certain score from the range of values that a certain listener uses. But the scores from different listeners are not comparable measurements. So, still under the null hypothesis, it does not seem true that a certain sample has an equal chance of getting a certain score from the full range of values that all listeners use.

So, I do not think you can also randomize the listeners. (But if you use ranks, it doesn't matter at all.)

Edit: One of the websites you linked to describes this under 'Repeated measures'

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #45
cbr192 is worse than abr224 (0.02360 vs 0.48990)
cbr192 is worse than r3mix (0.04510 vs 0.69920)
cbr192 is worse than dm-xtrm (0.00050 vs 0.01380)
cbr192 is worse than mpc (0.00010 vs 0.00190)
cbr192 is worse than dm-ins (0.00440 vs 0.12540)
cbr192 is worse than cbr256 (0.01160 vs 0.29520)
cbr192 is worse than dm-std (0.00020 vs 0.00370)
abr224 is worse than mpc (0.00190 vs 0.05520)
abr224 is worse than dm-std (0.03310 vs 0.59550)
r3mix is worse than dm-xtrm (0.03970 vs 0.65770)
r3mix is worse than mpc (0.00040 vs 0.00810)
r3mix is worse than dm-std (0.01550 vs 0.36640)
mpc is better than dm-ins (0.00980 vs 0.26020)
mpc is better than cbr256 (0.00280 vs 0.07850)

Note that these have errors of _at least_ 0.0001 (with 10 000 trials, the resolution of a simulated p-value is 1/10 000), and are based on only 10 000 trials (which amounts to 100M actual tests).

If you compare the first column (the pairwise alphas) with the values after 1M trials, you will see that in at least one case (abr224/mpc) the error is large enough to change the result (the simultaneous alpha in the second column is just above 5%).

Edit: Just to make it clear: the first value in parentheses is the pairwise alpha, the second is the alpha after correction for the simultaneous test. The second one should be smaller than 0.05 for a truly significant result.

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #46
So to compare the rank data, the resampling method yields:

cbr192 is worse than dm-xtrm (0.00050 vs 0.01380)
cbr192 is worse than mpc (0.00010 vs 0.00190)
cbr192 is worse than dm-std (0.00020 vs 0.00370)
r3mix is worse than mpc (0.00040 vs 0.00810)

And Friedman/Fisher LSD yields:

mpc is better than r3mix, cbr192
dm-xtrm is better than cbr192
dm-std is better than cbr192

It seems they yield the same results!

ff123

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #47
By the way, on an unrelated note, one can change the results a bit by eliminating listener number 16, who was quite severe with overall ratings (in fact, he is the most severe rater), but who rated dm-std as a 5.0.  If you do that, the ranked data yields ranksums which put dm-xtrm before dm-std, just like the ANOVA does.

ff123

Edit:  Oops, I meant that the parametric analysis is changed to look like the ranked method, where dm-xtrm is better than dm-std.

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #48
The difference is that the resampling results are guaranteed to hold for all comparisons at the same time with > 95% certainty.

You can add abr224 < mpc to the resampling results, BTW. I checked: it fell through because of a bad estimate of the alpha value after only 10 000 trials, and I am running a test with 25 000 trials now (it will take half a day). It was already confirmed to hold with the Bonferroni correction, which is safe (and even overconservative).

But hey, it's always nice to see things confirm each other.

--
GCP

Statistical Methods for Listening Tests(splitted R3mix VBR s

Reply #49
Quote
Originally posted by ff123
By the way, on an unrelated note, one can change the results a bit by eliminating listener number 16, who was quite severe with overall ratings (in fact, he is the most severe rater), but who rated dm-std as a 5.0.


Hmm, that's not acceptable for doing actual analysis, though.

One thing I think I _can_ do is simply eliminate everybody who gave all 5's. After resampling, those results are not changed anyway, and they do not affect the differences between the means.

That would speed up the analysis quite a bit, but I want to cross-check that it really does not affect any results.

Edit: Hmm, it may make a small difference anyway, so I'm going to keep them in just to be sure.

--
GCP