# HydrogenAudio

## Hydrogenaudio Forum => General Audio => Topic started by: ff123 on 2001-10-03 03:47:05

Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-03 03:47:05
Quote
Hmm, the results of that test are still under discussion (actually, I'm waiting for ff123 to finish his analysis tool with the nonparametric Tukey HSD test )

Well, you don't have to wait for me to finish coding to know what the non-parametric Tukey HSD value is -- I calculated that in Excel.  It's 64.  The Fisher LSD was 44.  So, you can see that Tukey is quite a bit more conservative.

The ranksums (for reference) were:

cbr192 = 151.5
r3mix = 172.0
abr224 = 186.5
dm-ins = 188.0
cbr256 = 185.5
dm-std = 198.0
dm-xtrm = 207.0
mpc = 223.5

So basically all the Tukey HSD says (experiment-wise confidence level is 95%) is that mpc is better than cbr192!

ff123

Edit:  I discovered my Excel spreadsheet had a mistake in it.  The non-parametric Tukey's HSD should be 68.1.  I was debugging my code and had to resolve the discrepancy (the code was correct).  The conclusion remains the same.
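
The two critical differences can be roughly reproduced from the rank sums alone. What follows is a minimal sketch, assuming the usual large-sample formulas for Friedman rank sums; the thread does not say which formulas the spreadsheet or the program actually use. The constants `Q_CRIT` and `Z_CRIT` are table values supplied here as assumptions, and the helper `separated` is purely illustrative.

```python
import math
from itertools import combinations

# Rank sums as posted above.
rank_sums = {
    "cbr192": 151.5, "r3mix": 172.0, "abr224": 186.5, "dm-ins": 188.0,
    "cbr256": 185.5, "dm-std": 198.0, "dm-xtrm": 207.0, "mpc": 223.5,
}
k = len(rank_sums)

# Each listener's ranks 1..k add up to k(k+1)/2, so the grand total reveals n.
n = sum(rank_sums.values()) / (k * (k + 1) / 2)

# Assumed large-sample critical differences (not confirmed by the thread).
Q_CRIT = 4.286  # studentized range, alpha = 0.05, k = 8, infinite df (table value)
Z_CRIT = 1.960  # two-sided normal quantile, alpha = 0.05
hsd = Q_CRIT * math.sqrt(n * k * (k + 1) / 12)
lsd = Z_CRIT * math.sqrt(n * k * (k + 1) / 6)

def separated(critical):
    """Pairs whose rank sums differ by more than the critical difference."""
    return [(a, b) for a, b in combinations(rank_sums, 2)
            if abs(rank_sums[a] - rank_sums[b]) > critical]

print(f"n = {n:.0f}, HSD = {hsd:.1f}, LSD = {lsd:.1f}")
print("HSD:", separated(hsd))
print("LSD:", separated(lsd))
```

Under these assumptions the rank sums imply 42 listeners, and the critical differences come out close to the 68.1 and 44 quoted above. Only the cbr192/mpc pair clears the HSD, while four pairs clear the LSD.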
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-03 17:26:11
Quote
Originally posted by ff123

Well, you don't have to wait for me to finish coding to know what the non-parametric Tukey HSD value is -- I calculated that in Excel.  It's 64.  The Fisher LSD was 44.  So, you can see that Tukey is quite a bit more conservative.

But the Fisher LSD isn't simultaneous is it? Or was it based on a normal distribution?

(I remember that we talked about it and I  concluded that it wasn't reliable/applicable, but I don't remember why)

I wanted a statistically 'sound' conclusion from this test. I wouldn't call soundness conservative.

For an idea of the individual results the Wilcoxon S-R test was enough. (From a look at the values, its sensitivity seems to be even better than the Fisher LSD?) But presenting a result and having to say "there's a >50% chance that one of the things we concluded is incorrect" isn't very nice, is it?

(btw. Wilcoxon+Bonferroni correction gave in the end the same results as the nonparam Tukey HSD!)

Quote

So basically all the Tukey HSD says (experiment-wise confidence level is 95%) is that mpc is better than cbr192!

Hmm, in the next test we will have to set in advance what we want to test I guess. And preferably that should only be like 4 or 5 pairs or so.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-03 17:50:14
Quote
But the Fisher LSD isn't simultaneous is it? Or was it based on a normal distribution?

The Fisher LSD I use for the Friedman analysis is a non-parametric version (which doesn't assume normal distribution).  There is a different Fisher LSD I use for blocked ANOVA.

Both are one-at-a-time multiple comparison techniques.  I guess that seems like an oxymoron, but I believe the reason why it's used (as opposed to the Wilcoxon) is that once you've gone to the trouble of calculating the rank sums for the Friedman test, you might as well use those values to perform the Fisher test.  And the reason the Friedman or ANOVA tests are performed first instead of going straight to the Wilcoxon is to make sure that there is at least one significant difference of means somewhere in the experiment.  It'd be a waste of time to perform all those Wilcoxons and find out after the fact that ANOVA or Friedman says that the difference in means was just statistical noise.
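
The "Friedman first, pairwise comparisons second" order described above can be sketched as follows. This is an illustration only: the ratings are hypothetical, the statistic is the textbook chi-square form without a tie correction, and `CHI2_CRIT` is a table value supplied as an assumption. It is not the code of the analysis tool.

```python
def within_block_ranks(scores):
    """Rank one listener's scores 1..k, giving tied scores their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of sorted positions i..j, 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def friedman(table):
    """Return (rank sums per codec, Friedman chi-square statistic)."""
    n, k = len(table), len(table[0])
    ranked = [within_block_ranks(row) for row in table]
    rank_sums = [sum(r[j] for r in ranked) for j in range(k)]
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return rank_sums, chi2

# Hypothetical ratings: rows = listeners, columns = codecs A, B, C.
ratings = [
    [3.0, 4.0, 5.0],
    [2.0, 4.5, 4.0],
    [3.5, 3.0, 5.0],
    [2.5, 4.0, 4.5],
]
rank_sums, chi2 = friedman(ratings)
CHI2_CRIT = 5.991  # chi-square critical value, alpha = 0.05, df = k - 1 = 2
print(rank_sums, chi2, "run LSD" if chi2 > CHI2_CRIT else "stop: no overall difference")
```

With so few listeners the chi-square approximation is rough, but the flow is the point: the rank sums needed for the pairwise LSD fall out of the Friedman calculation, and the pairwise step is skipped when the overall test is not significant.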

So my question would be:  For one-at-a-time comparisons, is it preferable to use Wilcoxon or to use the Fisher LSD?  If the only rationale for using the Fisher LSD is convenience of calculation, but the Wilcoxon is more sensitive, then I'd rather use the latter -- let the software take care of laborious calculations.  And for simultaneous comparisons, is it preferable to use Bonferroni-corrected Wilcoxon, Bonferroni-corrected Fisher LSD, or Tukey's HSD?

I think you're saying, Garf, that the Wilcoxon might be the way to go for one-at-a-time tests, but perhaps the Tukey HSD would be best for simultaneous tests.

Oh, and I agree that the objectives of a test should be clearly stated up front, *before* the test is performed, and that if any relationships are not of interest, they should be excluded.  Maybe the best way to do this is to perform two types of experiments:  exploratory ones and confirmatory ones.  The exploratory ones could give a general idea of what all the relationships look like, and the confirmatory ones would test specific ones, for example dm-ins versus dm-xtrm.  The implication is that the finer the distinction is you want to make, the fewer codecs should be involved.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-03 18:20:16
Quote
Originally posted by ff123

Both are one-at-a-time multiple comparison techniques. I guess that seems like an oxymoron,

Yes. I didn't understand it at first. (Now I do, thanks to your explanation)

Quote

but I believe the reason why it's used (as opposed to the Wilcoxon) is that once you've gone to the trouble of calculating the rank sums for the Friedman test, you might as well use those values to perform the Fisher test.

This seems very plausible, given that most of these methods predate computers

Quote

And the reason the Friedman or ANOVA tests are performed first instead of going straight to the Wilcoxon is to make sure that there is at least one significant difference of means somewhere in the experiment.  It'd be a waste of time to perform all those Wilcoxons and find out after the fact that ANOVA or Friedman says that the difference in means was just statistical noise.

Actually, I would expect the Wilcoxon+Bonf corr/Fisher+Bonf corr/Tukey tests all to give nothing if the Friedman test fails. (Wouldn't there be a contradiction otherwise?)

Quote

So my question would be:  For one-at-a-time comparisons, is it preferable to use Wilcoxon or to use the Fisher LSD?  If the only rationale for using the Fisher LSD is convenience of calculation, but the Wilcoxon is more sensitive, then I'd rather use the latter -- let the software take care of laborious calculations.

I honestly wouldn't know. I'm a bit biased towards the Wilcoxon because the statisticians told me it was good for our purposes, so I know it's good, whereas I don't know the Fisher LSD. I think you might be right that the Fisher LSD is used for convenience of calculation.

On the other hand, you've already written the app, so perhaps you can just use the Fisher LSD results vs the Wilcoxon results and check which one is more sensitive? We can just use that one then. The SPSS output is still on my page : http://home.planetinternet.be/~pascutto/AQT/OUTPUT.HTM (http://home.planetinternet.be/~pascutto/AQT/OUTPUT.HTM)

Also, there shouldn't be any contradictions between the two.

Quote

And for simultaneous comparisons, is it preferable to use Bonferroni-corrected Wilcoxon, Bonferroni-corrected Fisher LSD, or Tukey's HSD?

Tukey HSD, no question. It should _always_ be more sensitive than the other methods. It basically does a smarter 'correction'  than the very conservative Bonferroni.

Quote

I think you're saying, Garf, that the Wilcoxon might be the way to go for one-at-a-time tests, but perhaps the Tukey HSD would be best for simultaneous tests.

Yes. (But I'm not sure which one of Fisher LSD/Wilcoxon is best for one-at-a-time)

Quote

Oh, and I agree that the objectives of a test should be clearly stated up front, *before* the test is performed, and that if any relationships are not of interest, they should be excluded.

Right. Also, if possible, decide which one you expect to do better in the comparison (that also halves the significance needed due to one-tail/two-tail)

Quote

Maybe the best way to do this is to perform two types of experiments:  exploratory ones and confirmatory ones.  The exploratory ones could give a general idea of what all the relationships look like, and the confirmatory ones would test specific ones, for example dm-ins versus dm-xtrm.  The implication is that the finer the distinction is you want to make, the fewer codecs should be involved.

Yep. This is why the first AQ test results are of good use: we know what to test for next time

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-03 20:24:12
It seems the worth of Bonferroni adjustments (perhaps even the very idea of simultaneous testing of null hypotheses) is not universally accepted in all statistical circles.

http://www.bmj.com/cgi/content/full/316/71...ch=&FIRSTINDEX= (http://www.bmj.com/cgi/content/full/316/7139/1236?maxtoshow=&HITS=10&hits=10&RESULTFORMAT=&titleabstract=bonferroni&searchid=QID_NOT_SET&stored_search=&FIRSTINDEX=)

with summary points as follows:

Adjusting statistical significance for the number of tests that have been performed on study data -- the Bonferroni method -- creates more problems than it solves.

The Bonferroni method is concerned with the general null hypothesis (that all null hypotheses are true simultaneously), which is rarely of interest or use to researchers.

The main weakness is that the interpretation of a finding depends on the number of other tests performed.

The likelihood of type II errors is also increased, so that truly important differences are deemed non-significant.

Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-03 21:06:13
And another link, this one from SISA, a site where one can perform free statistical tests using a web browser.

http://home.clara.net/sisa/bonhlp.htm (http://home.clara.net/sisa/bonhlp.htm)

This website writes:

Quote
Scenario three concerns the situation when not predefined hypothesis are pursued using many tests, one test for each hypothesis. Basically this concerns the situation of data 'dredging' or 'fishing', many among us will recognize correlation variables=all or t-test groups=sex(2) variables=all. Above all, this should not be done. Bonferroni correction is difficult in this situation as the alpha level should be lowered very considerably in situations of such wealth (potentially with a factor of r*(r-1)/2, whereby r is the number of variables), and most standard statistical packages are not able to provide small enough p-value's to do it. SISA's advice is, if you want to go ahead with it anyway, to test at the 0.05 level for each test. After a relationship has been found, and this relationship is theoretically meaningful, the relationship should be confirmed in a separate study. This can be done after new data is collection or in the same study, by using the 'split sample' method. The sample is split in two, one half is used to do the 'dredging', the other half is used to confirm the relationships found. The disadvantage of the split sample method is that you loose power (use the procedure power to estimate how much). A Bayesian method can be used if you want to formally incorporate the result of the original study or dredging in the confirmation process. But don't put too high a value on your original finding.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-03 21:45:09

http://149.170.199.144/resdesgn/multrang.htm (http://149.170.199.144/resdesgn/multrang.htm)

Quote
Multiple range tests can be placed into two categories.

1. Constant LSD. In these a single LSD is found and used to compare all pairs of means. Tests differ in the algorithm used to calculate the LSD. Examples : Fisher's LSD, Tukey's HSD, Sheffé's LSD and Waller-Duncan's LSD.

2. Variable LSD. In these tests the means are ranked and the magnitude of the LSD is determined by the number of intervening means, between the two being compared. Examples: Newman-Keul's test, Duncan's multiple range test.
The second group appear to be generally less accepted and recommended than the former. The following notes about the first group are based on comments by Swallow (1984).

a. Tukey's HSD and Sheffé's LSD are too conservative, type II errors are favoured.

b. Fisher's LSD is prone to type I errors, although this is not too serious when used after rejecting an analysis of variance Null hypothesis (i.e. when it is a protected test).

c. Waller-Duncan's LSD has few faults but the statistic is complex and tables are generally unavailable.

If you require more information about multiple range tests the following are recommended: Swallow (1984), Chew (1980) and Day and Quinn (1989).

So I am getting the impression that Fisher's LSD (which I am using as a protected test) is a good approach.  However, I should remove the option from my program to allow the user to adjust the critical significance of just the LSD.  If anything it should adjust *both* the critical significance values of the Friedman/ANOVA and the corresponding LSD tests.

Waller-Duncan's LSD might be interesting as a side study, but Fisher's LSD is very easy to calculate once a Friedman or ANOVA has been performed.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-03 22:06:40
Situations in which Fisher's LSD is weak:

from:

http://davidmlane.com/hyperstat/B96288.html (http://davidmlane.com/hyperstat/B96288.html)

Quote
An approach suggested by the statistician R. A. Fisher (called the "least significant difference method" or Fisher's LSD) is to first test the null hypothesis that all the population means are equal (the omnibus null hypothesis) with an analysis of variance. If the analysis of variance is not significant, then neither the omnibus null hypothesis nor any other null hypothesis about differences among means can be rejected. If the analysis of variance is significant, then each mean is compared with each other mean using a t-test. The advantage of this approach is that there is some control over the EER. If the omnibus null hypothesis is true, then the EER is equal to whatever significance level was used in the analysis of variance. In the example with the six groups of subjects given in the section on t-tests, if the .01 level were used in the analysis of variance, then the EER would be .01. The problem with this approach is that it can lead to a high EER if most population means are equal but one or two are different.

next page:

http://davidmlane.com/hyperstat/B94854.html (http://davidmlane.com/hyperstat/B94854.html)

Quote
In the example, if a seventh treatment condition were included and the population mean for the seventh condition were very different from the other six population means, an analysis of variance would be likely to reject the omnibus null hypothesis. So far, so good, since the omnibus null hypothesis is false. However, the probability of a Type I error in one or more of the 15 t-tests computed among the six treatments with equal population means is about 0.10. Therefore, the LSD method provides only minimal protection against a high EER.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: YouriP on 2001-10-03 22:52:33
ff123, in the future when you want to post amendments to your previous posts before a reply has been made, could you just edit the last post instead of posting 3 or 4 replies? Thanks.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-03 23:27:51
Quote
Originally posted by ff123

Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons.

This should hardly be a surprise...

The problem is presenting those results in a way that a general public without statistical background can understand what the implication of the multiple tests really is.

For some reason I feel that many people have a problem with: 'these are our results, but keep in mind that there's a 70% chance something here is incorrect'. It doesn't look very scientific, though it's perfectly ok.

Note that the contesting of the Bonferroni correction is due to the conservativeness. For us, this doesn't actually matter so much. But if you are testing if a new medicine has effect, you don't want to take a risk of incorrectly rejecting the hypothesis that it works. The mathematics behind it are sound.

Let it be clear that I prefer a simultaneous test over multiple 2-sample tests + correction. But I don't agree with doing a 2-sample test _without_ correction.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-03 23:34:20
Quote
Originally posted by ff123
And another link, this one from SISA, a site where one can perform free statistical tests using a web browser.

http://home.clara.net/sisa/bonhlp.htm (http://home.clara.net/sisa/bonhlp.htm)

This website writes:
ff123

Hmm, nothing new here either.

Make a test to see if there are trends.

Do another test to test those trends.

This is what you suggested just earlier.

The comment about Bonferroni is also in line with what we saw. The alpha level in the AQ test gets as low as 0.0017. That's at the limit of accuracy SPSS uses for its results.
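
For reference, the per-comparison alpha mentioned above follows from a plain Bonferroni split. This is a minimal sketch, assuming all pairs among the 8 codecs are compared at a family-wise level of 0.05; the variable names are illustrative.

```python
from math import comb

codecs = 8
alpha_family = 0.05

pairs = comb(codecs, 2)             # number of distinct codec pairs
alpha_each = alpha_family / pairs   # Bonferroni: split alpha evenly over all pairs

print(pairs, f"{alpha_each:.5f}")
```

Testing fewer, pre-selected pairs raises the per-comparison alpha in direct proportion, which is the practical argument for fixing the comparisons of interest before the test.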

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-03 23:39:19
Quote
Originally posted by ff123

http://149.170.199.144/resdesgn/multrang.htm (http://149.170.199.144/resdesgn/multrang.htm)
So I am getting the impression that Fisher's LSD (which I am using as a protected test) is a good approach.

Hmm, I'm not convinced. I'd agree if we were talking about a small number of variables, but we've got 8.

I have this doubt because the Friedman test just says ' there is a difference between the samples '. This provides little protection if you are making 28 comparisons, though it obviously helps a lot if you only make 3 or so. It gets too easy to see false differences (aka Type 1 errors)

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-03 23:43:50
Quote
Originally posted by ff123

In the example, if a seventh treatment condition were included and the population mean for the seventh condition were very different from the other six population means, an analysis of variance would be likely to reject the omnibus null hypothesis. So far, so good, since the omnibus null hypothesis is false. However, the probability of a Type I error in one or more of the 15 t-tests computed among the six treatments with equal population means is about 0.10. Therefore, the LSD method provides only minimal protection against a high EER.
ff123

(What's EER?)

I think this is basically saying what I said in my prev post, namely that when you make a lot of comparisons the fact that you know that 'there is a difference between samples' is not enough protection to prevent you from seeing differences where there aren't any.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-04 00:41:46
Youri, I'll modify posts into one if they haven't been replied to yet.  What is the purpose of this, though?  Am I bumping this thread each time I post a new message, which doesn't happen if I just modify an older one?

Garf,

EER = Experiment-wise error rate.

Basically, the difference between using a simultaneous vs. a one-at-a-time method is the difference between trying to control a type I error (false difference in codecs is identified) vs. a type II error (true difference in codecs is not identified).  That's also what I mean by being "conservative" or "aggressive" in analyzing the data.  If you're looking for an airtight conclusion (mpc is better than cbr192), Tukey's HSD will give you one, but it probably won't be very useful.  On the other hand, if you're looking for some insight and are willing to accept some risk of a type I error, Fisher's protected LSD is much more sensitive.

This seems to be an area of controversy in statistics, just like there's a minor controversy over whether one-tailed tests of significance should be used (some conservative statisticians say that a two-tailed test should always be used, even in a confirmatory study, because if you're bothering to perform a test there must be some uncertainty about the outcome).

Perhaps a compromise solution that could accommodate us both would be to use Waller-Duncan's k-ratio t test, which, unlike Tukey's test, doesn't operate on the principle of controlling type I error.  Instead, it compares the type I and type II error rates based on Bayesian principles.  The only problem, I think, is that with the limited net search I've made so far, I haven't seen whether there is a non-parametric version of this.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Dibrom on 2001-10-04 00:47:56
Quote
Originally posted by ff123
Youri, I'll modify posts into one if they haven't been replied to yet.  What is the purpose of this, though?  Am I bumping this thread each time I post a new message, which doesn't happen if I just modify an older one?

Well for the record I don't really think there is anything wrong with posting multiple replies as long as it doesn't become redundant.  Multiple replies would bump the thread multiple times too, but again I don't see much of a problem.  I do see benefit in trying to keep all the posts consolidated if possible, but if the discussion is moving right along then it seems fine to me.

Just my 2 cents.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-04 03:15:41
Found this powerpoint slideshow on the net:

Here are some relevant quotes:

Quote
The winner among winner pickers -- Cramer and Swanson (1973) conducted a computer simulation study involving 88,000 differences.  They compared LSD, FPLSD, HSD, SNK, and BLSD.  Both FPLSD and BLSD were better in their ability to protect against type I error and also in their power to detect real differences when they exist.  None of the other methods came close.

LSD = Fisher's LSD, without using an F test first
FPLSD = Fisher's protected LSD, only run if F test proves significant
HSD = Tukey's HSD
SNK = Student Newman Keuls test
BLSD = Bayes LSD (also known as Waller-Duncan's protected LSD)

Quote
The edge goes to BLSD... -- BLSD is preferred by some because it is a single value and therefore easy to use; it is larger when F indicates that the means are homogeneous and smaller when the means appear to be heterogeneous.  But the necessary tables may not be available, so FPLSD is quite acceptable.

I'd like to get my hands on the Cramer and Swanson paper and also on the book which has the BLSD tables.  I wonder which book has them?  If I can get a hold of the tables, I can probably brute force the calculations by table lookup in my program.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: CiTay on 2001-10-04 04:21:42
ff, maybe you want to mail this person who searched for a similar thing a while ago:

Maybe she found some info in the meantime.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-04 05:49:54
Thanks Citay, but I did some digging, and I think the following papers are relevant to the Bayes LSD:

Waller, R.A. and Duncan, D.B. (1969) "A Bayes Rule for the Symmetric Multiple Comparison Problem", Journal of the American Statistical Association 64, pp. 184-199

Waller, R.A. and Kemp, K.E. (1975) "Computations of Bayesian t-Values for Multiple Comparisons", Journal of Statistical Computation and Simulation (Vol 4, no. 3), pp. 169-172

Swallow, W. H. 1984. "Those overworked and oft-misused mean separation procedures - Duncan's, LSD, etc."  Plant Disease, 68: 919-921.

And a couple of books:

An Introduction to Statistical Methods and Data Analysis, 5th Ed., 2000, R. Lyman Ott, Duxbury Press, Belmont CA

Principles and Procedures of Statistics: A Biometrical Approach, 3rd Ed., 1996, Robert Steel and James Torrie

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-04 10:46:42
Thanks ff123. I'll have a look through the university library and check if they happen to have any of the relevant material.

If you know of anything that discusses the link between the Friedman protection and a high number of comparisons, please let us know. I'm a bit worried about it.

Edit: Hmm, also, aren't most of the methods discussed versions for the normal distribution?

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-04 16:53:39
Quote
If you know of anything that discusses the link between the Friedman protection and a high number of comparisons, please let us know. I'm a bit worried about it.

The SAS website has this:

Quote
It has been suggested that the experimentwise error rate can be held to the α level by performing the overall ANOVA F-test at the α level and making further comparisons only if the F-test is significant, as in Fisher's protected LSD. This assertion is false if there are more than three means (Einot and Gabriel 1975). Consider again the situation with ten means. Suppose that one population mean differs from the others by such a sufficiently large amount that the power (probability of correctly rejecting the null hypothesis) of the F-test is near 1 but that all the other population means are equal to each other. There will be 9(9 - 1)/2=36 t tests of true null hypotheses, with an upper limit of 0.84 on the probability of at least one type 1 error. Thus, you must distinguish between the experimentwise error rate under the complete null hypothesis, in which all population means are equal, and the experimentwise error rate under a partial null hypothesis, in which some means are equal but others differ.

So this supports the position that Fisher's protected LSD is not so protected for the case where there are a lot of means close to each other but one or two which are very different, as pointed out earlier.
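
The 0.84 figure in the SAS quote can be checked with the standard bound for several tests each run at the same level. This is a rough sketch that treats the tests as independent, which pairwise comparisons sharing the same data are not; the function name is illustrative.

```python
def familywise_error(alpha, m):
    """Chance of at least one false positive among m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

# 36 true null hypotheses, as in the SAS example with ten means.
print(round(familywise_error(0.05, 36), 2))

# All 28 pairs among 8 codecs, each tested at 0.05 without correction.
print(round(familywise_error(0.05, 28), 2))
```

The second figure is in line with the chances of "something here is incorrect" mentioned earlier in the thread for uncorrected pairwise tests on 8 codecs.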

Quote
Edit: Hmm, also, aren't most of the methods discussed versions for the normal distribution?

yes.  I am thinking of looking through a book by Hollander and Wolfe, which concentrates on non-parametric methods, to see if they cover Waller Duncan Bayes LSD.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: YouriP on 2001-10-06 02:14:04
Quote
Well for the record I don't really think there is anything wrong with posting multiple replies as long as it doesn't become redundant. Multiple replies would bump the thread multiple times too, but again I don't see much of a problem. I do see benefit in trying to keep all the posts consolidated if possible, but if the discussion is moving right along then it seems fine to me.
Yeah, the reason it's normally not allowed is to keep people from bumping their own threads all the time, or adding to the reply count just to make their thread look popular (yes, some people worry about that, apparently). It's probably just a pet peeve of mine that I developed from visiting a lot of übermoderated fora.  Actually, it's mainly meant to prevent posts like:

"Hi, I'm Youri! How are you all doing?"
"Oh, I'm fine btw!"

In this case, the response could simply be edited into the original post. That's why I was only speaking of amendments - if a reply to your post has already been made and you still want to make an amendment, it's usually better to post a reply instead of editing your original post, because otherwise people may not notice it.

But I'm making a bigger problem out of it than it is, so carry on.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-08 08:00:05
I have completed the code to perform an optional Tukey's HSD (either parametric or non-parametric).  Version 1.20 of friedman.exe with source is at:

http://ff123.net/friedman/friedman120.zip (http://ff123.net/friedman/friedman120.zip)

This version also outputs an ANOVA table, if that option is specified, and generates a matrix of difference values to show how the means or ranksums are separated.
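
As an illustration of what a matrix of difference values looks like, here is a sketch built from the rank sums posted at the start of the thread. The layout and the helper `difference_matrix` are assumptions for this example, not the program's actual output format.

```python
rank_sums = {
    "cbr192": 151.5, "r3mix": 172.0, "abr224": 186.5, "dm-ins": 188.0,
    "cbr256": 185.5, "dm-std": 198.0, "dm-xtrm": 207.0, "mpc": 223.5,
}

def difference_matrix(sums):
    """Upper-triangular table: matrix[a][b] = sums[b] - sums[a] for b listed after a."""
    names = list(sums)
    return {a: {b: sums[b] - sums[a] for b in names[i + 1:]}
            for i, a in enumerate(names)}

matrix = difference_matrix(rank_sums)
names = list(rank_sums)

# Header row, then one row per codec; cells below the diagonal are left blank.
print(" " * 8 + "".join(f"{b:>9}" for b in names[1:]))
for a in names[:-1]:
    cells = "".join(f"{matrix[a][b]:9.1f}" if b in matrix[a] else " " * 9
                    for b in names[1:])
    print(f"{a:<8}{cells}")
```

Any cell whose absolute value exceeds the chosen critical difference (LSD or HSD) marks a pair that the corresponding test separates.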

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-08 10:05:25
Very cool programming work!

I have some more problems:

a) What can we do with data that is partially normal? For example, in the 128kbps test most data seems normal with the possible exception of the mpc and xing results, which 'bump up' against the ends of the rating scale. Is ANOVA permissible here?

b) What happens if we transform the data relative to mpc? (i.e. subtract mpc score from everything)

b1) does it change any results?

b2) does it make the data 'more' normal?

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-08 15:07:35
Quote
a) What can we do with data that is partially normal? For example, in the 128kbps test most data seems normal with the possible exception of the mpc and xing results, who 'bump up' to the ends of the rating scale? Is ANOVA permissible here?

For the dogies test it doesn't matter if you choose ANOVA or Friedman, as long as the Fisher LSD is used.

Here is a good page on how to choose a statistical test:

A couple of quotes of interest:

"Remember, what matters is the distribution of the overall population, not the distribution of your sample. In deciding whether a population is Gaussian, look at all available data, not just data in the current experiment."

and:

"When in doubt, some people choose a parametric test (because they aren't sure the Gaussian assumption is violated), and others choose a nonparametric test (because they aren't sure the Gaussian assumption is met)."

Quote
b) What happens if we transform the data relative to mpc? (i.e. subtract mpc score from everything)

Nonparametric results should remain the same as long as the relative rankings are not changed.  I don't know how the ANOVA results change.
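
The rank argument can be checked directly. This is a minimal sketch with hypothetical ratings (the codec labels and numbers are made up for illustration), assuming ranks are taken within each listener as in the Friedman test and that no ties occur.

```python
def ranks(values):
    """Rank values 1..k within one listener (no ties assumed in this sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

# Hypothetical ratings: rows = listeners, columns = (xing, lame, mpc).
ratings = [
    [1.5, 3.0, 4.5],
    [2.0, 4.0, 5.0],
    [1.0, 3.5, 3.0],
]
MPC = 2  # column index of the reference codec

# Subtract each listener's mpc score from all of that listener's scores.
shifted = [[score - row[MPC] for score in row] for row in ratings]

before = [ranks(row) for row in ratings]
after = [ranks(row) for row in shifted]
print(before == after)
```

Subtracting a per-listener constant shifts every score in that row by the same amount, so the within-listener order cannot change. A test that ranks across listeners would not enjoy the same guarantee.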

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-08 17:01:27
Quote
Originally posted by ff123

For the dogies test it doesn't matter if you choose ANOVA or Friedman, as long as the Fisher LSD is used.

Here is a good page on how to choose a statistical test:

A couple of quotes of interest:

"Remember, what matters is the distribution of the overall population, not the distribution of your sample. In deciding whether a population is Gaussian, look at all available data, not just data in the current experiment."

Hmm yeah, but I think the non-normal look of the Xing/MPC results will stay even if we add more listeners.

The distribution we have looks normal but it has a 'pile up' effect on the sides of the hardest and lowest samples. The AQ test has this too as it consists entirely of hard samples.

Although it fails a normality test, your comment above has me in doubt. Is this 'clipping' effect described somewhere?

If it would turn out that although a normality test fails we can still use methods based on a normal distribution, that would be a major help...

Choosing between parametric and nonparametric tests is sometimes easy. You should definitely choose a parametric test if you are sure that your data are sampled from a population that follows a Gaussian distribution (at least approximately). You should definitely select a nonparametric test in three situations:

• The outcome is a rank or a score and the population is clearly not Gaussian. Examples include class ranking of students, the Apgar score for the health of newborn babies (measured on a scale of 0 to 10 and where all scores are integers), the visual analogue score for pain (measured on a continuous scale where 0 is no pain and 10 is unbearable pain), and the star scale commonly used by movie and restaurant critics (* is OK, ***** is fantastic).

'the visual analogue scale for pain' ... doesn't this apply to the Xing scores?

• The data are measurements, and you are sure that the population is not distributed in a Gaussian manner. If the data are not sampled from a Gaussian distribution, consider whether you can transform the values to make the distribution become Gaussian. For example, you might take the logarithm or reciprocal of all values. There are often biological or chemical reasons (as well as statistical ones) for performing a particular transform.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-12 07:22:54
The analysis tool I wrote can now be run from the web:

http://ff123.net/friedman/stats.html (http://ff123.net/friedman/stats.html)

Have fun!

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-12 13:12:48
Quote
Originally posted by ff123

yes.  I am thinking of looking through a book by Hollander and Wolfe, which concentrates on non-parametric methods, to see if they cover Waller Duncan Bayes LSD.

Hollander, Myles: Nonparametric statistical methods / Myles Hollander, Douglas A. Wolfe. New York (N.Y.) : Wiley, 1999. XIV, 787 p..

Practical nonparametric and semiparametric Bayesian statistics / Dipak Dey, Peter Muller, Debajyoti Sinha (eds.).. New York (N.Y.) : Springer, 1998. XVI, 369 p. : ill..

I'll try to get them Tuesday or Wednesday. Library opening hours suck for me though.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-16 00:32:43
I found something else that may be interesting.

I had been wondering for a while that, since we have a pretty good idea of how the actual distribution looks (+- normal with clipping), if there's no way to make use of that instead of a pure nonparametric test. If you are able to make use of more knowledge about the distribution, you should be able to get more sensitive tests.

Something like this already seems to exist and it's called Bootstrapping (seems to be a fairly new technique too).

Basically, starting from your sample you use a large number of simulations to determine the distribution function of your actual population.

I think that once you are able to determine the distribution function, it should be possible to create an appropriate test for inequality of means. I've only skimmed through the book I have quickly, but one method seems to be via simulations again.
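
As a rough illustration of the idea (a minimal sketch, not taken from the book; the ratings and the name `bootstrap_means` are made up for the example), resampling a sample with replacement many times gives an empirical picture of how a statistic such as the mean varies:

```python
import random

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Return the means of n_resamples bootstrap resamples of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Draw n values WITH replacement from the original sample.
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(sum(resample) / n)
    return means

# Hypothetical ratings on a 1-5 scale, piled up against the top of the scale.
ratings = [5.0, 5.0, 4.5, 5.0, 4.0, 5.0, 3.5, 5.0, 4.8, 5.0]
means = sorted(bootstrap_means(ratings))

# Rough 95% percentile interval for the mean rating.
low = means[int(0.025 * len(means))]
high = means[int(0.975 * len(means)) - 1]
print(low, high)
```

The percentile interval is only one of several ways to read off the simulated distribution; it is used here because it needs no assumption about the shape of the population.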

On a related note, this book says that the Wilcoxon Signed Rank Test needs symmetric distributions. If that's the case, then it's not applicable to the AQ test results I think.

With some luck the Hollander/Wolfe book will answer these questions. I should be able to get it tomorrow afternoon.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-16 04:27:18
This is a fascinating topic which I had no idea existed.  My elementary statistics book is dated 1973, and apparently the field was opened up by Dr. Efron in 1977.  The general technique is called resampling, and there are actually four different types of resampling methods:

1. the bootstrap, invented by Bradley Efron;
2. the jacknife, invented by Maurice Quenouille and later developed by John W. Tukey;
3. cross-validation, developed by Seymour Geisser, Mervyn Stone, and Grace G. Wahba
4. balanced repeated replication, developed by Philip J. McCarthy.

http://ericae.net/pare/getvn.asp?v=3&n=5 (http://ericae.net/pare/getvn.asp?v=3&n=5)

Here's a conceptually simple example taken from that page:

****
For simplicity, let's assume that a district has 13 voucher students and 39 non-voucher students, and the mean difference is 10 standard score units. To empirically construct the distribution, we'd follow these steps:

1.  Create a data base with all the student grades.
2.  Randomly sort the data base.
3.  Compute the mean for the first 13 students.
4.  Compute the mean for the other 39 students.
5.  Record the test statistic--the absolute value of the mean difference.

Then repeat steps 2 through 5 many times.  That way, we'd get the distribution of mean differences when we randomly select students. The probability of observing a mean difference of 10 when everything is random is the proportion of experimental test statistics in step 5 that are greater than 10.
****

It's very simple conceptually, and does not assume either that data is normal or that the sample is randomly drawn from some population.  Very nice.
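
The five steps quoted above can be sketched roughly as follows. This is only an illustration under my own assumptions: the scores are invented, the function name `permutation_p_value` is hypothetical, and the count uses "greater than or equal to" rather than the strict "greater than" in the quote.

```python
import random

def permutation_p_value(group_a, group_b, n_shuffles=10000, seed=0):
    """Share of random relabelings whose mean difference is at least as
    large (in absolute value) as the observed one."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)      # step 1: one data base
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                     # step 2: random sort
        first, rest = pooled[:n_a], pooled[n_a:]
        # steps 3-5: means of the two parts and their absolute difference
        stat = abs(sum(first) / len(first) - sum(rest) / len(rest))
        if stat >= observed:
            hits += 1
    return hits / n_shuffles

# Invented scores: 13 "voucher" and 39 "non-voucher" students whose
# means differ by 10 units.
voucher = [60 + i for i in range(13)]
non_voucher = [50 + (i % 13) for i in range(39)]
p = permutation_p_value(voucher, non_voucher)
print(p)
```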

ff123

Edit:  another web reference lists the four major types of resampling methods as follows:

(http://seamonkey.ed.asu.edu/~alex/teaching...resampling.html (http://seamonkey.ed.asu.edu/~alex/teaching/WBI/resampling.html))

****
There are four major types of resampling:

Randomization exact test: It is also known as the permutation test. Surprise! It was developed by R. A. Fisher, the founder of classical statistical testing. However, in his later years Fisher lost interest in the permutation method because there were no computers in his days to automate such a laborious method.

Cross-validation: It was developed by Seymour Geisser, Mervyn Stone, and Grace G. Wahba.

Jacknife: It is also known as Jackknife and Quenouille-Tukey Jackknife. It was invented by Maurice Quenouille (1949) and later developed by John W. Tukey (1958). The name "jacknife" was coined by Tukey to imply that the method is an all-purpose statistical tool.

Bootstrap: It was invented by Bradley Efron and further developed by Efron & Tibshirani (1993). It means that one available sample gives rise to many others by resampling (pulling yourself by your own bootstrap).

Among the four methods, the first and the last ones are more useful. The principles of cross-validation, Jacknife, and bootstrap are very similar but bootstrap overshadows the others for it is a more thorough procedure. Indeed, Jacknife is of largely historical interest today (Mooney & Duval, 1993) (Nevertheless, Jacknife is still useful in EDA for assessing how each subsample affects the model).
****
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-16 07:21:27
Here is my untutored guess on how a resampled analysis method would work on, for example, the AQ1 test data (using rank scale):

1. Convert to ranks instead of ratings
2. For each listener, randomize the order of the ranking, then add up each column (codec setting).
3. Calculate the difference between all pairs of columns.  For 8 codecs, there will be 28 pairs of columns.
4. Repeat 1000 times (or however many times you want).  At the end, one should have 28 distributions of differences.
5. Compare the actual difference in ranksums to the simulated distributions to come up with a p-value for each of the 28 pair comparisons.

Is it really that simple?

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-16 08:47:09
Quote
Originally posted by ff123
4. Repeat 1000 times (or however many times you want).  At the end, one should have 28 distributions of differences.
5. Compare the actual difference in ranksums to the simulated distributions to come up with a p-value for each of the 28 pair comparisons.

You don't even need the distribution. During the simulation, you just count how many times the difference in ranksums in the trials exceeded the one you've got.

At the end, you divide that by the number of trials. Voila
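
A minimal sketch of that counting approach, assuming the procedure is exactly as outlined in the quoted steps (rank each listener's ratings, shuffle the ranks within each listener, compare rank-sum differences). The ratings matrix and the names `rank_row` and `pairwise_p_values` are invented for the example; this is not ff123's program.

```python
import random
from itertools import combinations

def rank_row(scores):
    """Convert one listener's ratings to ranks (1 = lowest), averaging ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pairwise_p_values(ratings, n_trials=1000, seed=0):
    """One p-value per pair of codecs (28 pairs when there are 8 codecs)."""
    rng = random.Random(seed)
    ranks = [rank_row(row) for row in ratings]
    n_codecs = len(ranks[0])
    pairs = list(combinations(range(n_codecs), 2))

    def abs_diffs(matrix):
        sums = [sum(row[c] for row in matrix) for c in range(n_codecs)]
        return [abs(sums[a] - sums[b]) for a, b in pairs]

    observed = abs_diffs(ranks)
    exceed = [0] * len(pairs)
    for _ in range(n_trials):
        # Shuffle each listener's ranks independently (no replacement).
        shuffled = [rng.sample(row, len(row)) for row in ranks]
        for k, d in enumerate(abs_diffs(shuffled)):
            if d >= observed[k]:       # count, no need to store the distribution
                exceed[k] += 1
    return {pair: count / n_trials for pair, count in zip(pairs, exceed)}

# Invented ratings: 6 listeners (rows) x 3 codec settings (columns).
ratings = [
    [2.0, 3.5, 5.0], [1.0, 3.0, 4.0], [3.0, 4.0, 4.5],
    [2.5, 3.0, 5.0], [1.5, 2.0, 3.0], [2.0, 4.0, 4.5],
]
p_values = pairwise_p_values(ratings)
print(p_values)
```

These are still uncorrected pairwise p-values; nothing here accounts for running many comparisons at once.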

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-16 17:12:45
Man, that sounds sweet.  It shouldn't be too hard to code up given that my current program has almost all the bits and pieces needed.

So does this sidestep the problem of doing multiple pairwise comparisons?  It seems like it, to me.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-16 17:20:34
Quote
Originally posted by ff123
So does this sidestep the problem of doing multiple pairwise comparisons?  It seems like it, to me.

I don't really see how immediately? Any suggestion?

Edit3: Deleted Edit 1 and 2

If you can get pairwise results with a very high significance then it's interesting anyway as the overall result may still be significant even with 28 comparisons.

I got the Hollander & Wolfe book, but they gave me an edition from 1972. It doesn't have anything about bootstrapping, nor about the Bayesian LSD. It does have a simultaneous comparison method, but I think it's just the Nonparametric Tukey HSD (it gives the same results too...)

The second book was lent out...to my stats professor of last year.

If possible, I'll go back tomorrow and try to get my hands on the 1999 edition.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-16 17:52:21
Quote
If you can get pairwise results with a very high significance then it's interesting anyway as the overall result may still be significant even with 28 comparisons.

That's what I meant -- that the overall result may still be significant even with 28 pairwise comparisons.  So that we should both be able to agree on the results.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-17 01:24:26

http://www.uvm.edu/~dhowell/StatPages/Resa...ndomOneway.html (http://www.uvm.edu/~dhowell/StatPages/Resampling/RandomOneway/RandomOneway.html)

implies that resampling methods for multiple means are not as simple as outlined in previous messages.  It still talks about Bonferroni adjustments, for example.

Book reference:

Westfall, P. H. & Young, S. S. (1993) Resampling-based multiple testing. New York: John Wiley & Sons.

ff123

Edit:  another page discussing p-value adjustments and resampling:

http://www.rz.tu-clausthal.de/sashtml/stat/chap43/sect14.htm (http://www.rz.tu-clausthal.de/sashtml/stat/chap43/sect14.htm)
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-17 10:41:42
Quote
Originally posted by ff123
implies that resampling methods for multiple means are not as simple as outlined in previous messages.  It still talks about Bonferroni adjustments, for example.

You will always have that problem if you use pairwise tests to do a multiple comparison. The issue is that _if_ the bootstrap comparison is strong enough to give very significant pairwise results, there may be more results that are still significant even _after_ the Bonferroni correction.

The problem with Bonferroni is that it's so crude, and throws away a lot of results that may have been correct.

The second link is very interesting because it gives correction methods that are less crude.

Ironically, one of them is bootstrapping. It works by checking how many times you would incorrectly (because the bootstrapping uses random data) have concluded one of the p values was significant when it wasn't. More specifically, for each set of random data it determines the lowest p-value found across all of the comparisons. The proportion of random data sets in which that lowest value is smaller than the p-value you got for a given comparison on your real test is the new (adjusted) significance level for that comparison.
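
A rough sketch of that min-p adjustment, under one loudly stated assumption: to avoid a nested simulation it reuses a single set of resampled statistics both as the reference distribution for the p-values and as the source of the "random data" minima. That shortcut is my simplification, not something the linked page prescribes, and the toy numbers and the name `min_p_adjust` are invented.

```python
import random
from bisect import bisect_left

def min_p_adjust(observed, perm_stats):
    """Single-step min-p adjustment.

    observed   -- m observed test statistics (larger = more extreme)
    perm_stats -- B rows, each holding the m statistics of one resample
    Returns (raw p-values, adjusted p-values).
    """
    n_resamples = len(perm_stats)
    m = len(observed)
    columns = [sorted(row[j] for row in perm_stats) for j in range(m)]

    def p_value(x, j):
        # Share of resampled statistics for comparison j that are >= x.
        return (n_resamples - bisect_left(columns[j], x)) / n_resamples

    raw = [p_value(observed[j], j) for j in range(m)]
    # Lowest p-value each random resample produces across all comparisons.
    min_p = [min(p_value(row[j], j) for j in range(m)) for row in perm_stats]
    # Adjusted p: how often random data beats (or ties) the observed p-value.
    adjusted = [sum(1 for q in min_p if q <= p) / n_resamples for p in raw]
    return raw, adjusted

# Toy stand-in for resampled statistics: 500 resamples x 4 comparisons.
rng = random.Random(0)
perm_stats = [[abs(rng.gauss(0, 1)) for _ in range(4)] for _ in range(500)]
observed = [3.5, 2.0, 1.0, 0.2]
raw, adjusted = min_p_adjust(observed, perm_stats)
print(raw, adjusted)
```

The adjusted value can never be smaller than the raw one, which is the price paid for a statement that holds for all comparisons at once.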

I just _love_ this method. I can actually understand how it works

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: PatchWorKs on 2001-10-17 11:47:24
I am looking for volunteers to carry out (or possibly translate) tests on the various audio formats.
Contact me.

http://www.patchworks.it (http://www.patchworks.it)
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-17 14:21:13
I've written a crude utility based on your code. It doesn't do the simultaneous comparison correction yet.

(and may have bugs)

After 1 000 000 simulations:

Input file : aq1.txt

[FUBAR formatted table snipped]

Resampling..........................................................................................
..........

cbr192 is worse than abr224 (0.01576)
cbr192 is worse than dm-xtrm (0.00030)
cbr192 is worse than mpc (0.00001)
cbr192 is worse than dm-ins (0.01259)
cbr192 is worse than cbr256 (0.01840)
cbr192 is worse than dm-std (0.00216)
abr224 is worse than mpc (0.01117)
r3mix is worse than dm-xtrm (0.01595)
r3mix is worse than mpc (0.00082)
mpc is better than dm-ins (0.01446)
mpc is better than cbr256 (0.00959)

After Bonferroni correction the alpha level is 0.001831, so that leaves:

mpc > cbr192, r3mix
dmxtrm > cbr192
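
For reference, the per-comparison threshold for 28 comparisons can be computed as below. This is only a sketch: the p-values are copied from the list above, while the variable names are mine. As a matter of arithmetic, the plain Bonferroni form 0.05/28 is about 0.00179, and the quoted 0.001831 is within rounding of the closely related Sidak form 1 - 0.95^(1/28); the post itself does not say which form the utility uses, and here both leave the same survivors.

```python
n_codecs = 8
n_pairs = n_codecs * (n_codecs - 1) // 2     # 28 pairwise comparisons
alpha = 0.05

bonferroni = alpha / n_pairs                 # plain Bonferroni threshold
sidak = 1 - (1 - alpha) ** (1 / n_pairs)     # Sidak threshold

# Pairwise p-values from the 1 000 000 simulation run (rank-based).
p_values = {
    ("cbr192", "abr224"): 0.01576,
    ("cbr192", "dm-xtrm"): 0.00030,
    ("cbr192", "mpc"): 0.00001,
    ("cbr192", "dm-ins"): 0.01259,
    ("cbr192", "cbr256"): 0.01840,
    ("cbr192", "dm-std"): 0.00216,
    ("abr224", "mpc"): 0.01117,
    ("r3mix", "dm-xtrm"): 0.01595,
    ("r3mix", "mpc"): 0.00082,
    ("mpc", "dm-ins"): 0.01446,
    ("mpc", "cbr256"): 0.00959,
}

survivors_bonferroni = {pair for pair, p in p_values.items() if p < bonferroni}
survivors_sidak = {pair for pair, p in p_values.items() if p < sidak}
print(bonferroni, sidak)
print(sorted(survivors_sidak))
```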

For fun, I'm going to check whether testing vs the means gives more or less power.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-17 15:13:50
Using means:

cbr192 is worse than abr224 (0.02259)
cbr192 is worse than r3mix (0.04574)
cbr192 is worse than dm-xtrm (0.00050)
cbr192 is worse than mpc (0.00000)
cbr192 is worse than dm-ins (0.00475)
cbr192 is worse than cbr256 (0.01107)
cbr192 is worse than dm-std (0.00016)
abr224 is worse than mpc (0.00149)
abr224 is worse than dm-std (0.03514)
r3mix is worse than dm-xtrm (0.04195)
r3mix is worse than mpc (0.00056)
r3mix is worse than dm-std (0.01679)
dm-xtrm is worse than mpc (0.04614)
mpc is better than dm-ins (0.00782)
mpc is better than cbr256 (0.00324)

So that would give:

mpc > r3mix, abr224, cbr192
dmstd > cbr192
dmxtrm > cbr192

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-17 16:05:40
Garf, very nice.

I've got Westfall and Young on order from barnesandnoble.com, should be an entertaining read.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-17 18:16:24
I've got the simultaneous bootstrap correction implemented too, but darn, this thing needs horsepower!

You need at least 10000 iterations to more or less converge on the alpha values, and then each time 10000 to check the alpha values.

10 000 * 10 000 = 100 000 000 tests!

Yikes!

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-17 19:37:25
lol!

How long does it take to run 100 million trials?  What kind of computer do you have and how fast is it?

ff123

Edit:  And what were the results?
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-17 19:50:27
It took about 1-2 minutes for a test with 1000 * 1000 trials (and it didn't converge very well for the really small values, so more is definitely needed). One with 10 000 * 10 000 should take about 100-200 minutes.

I have an Athlon 1000. Optimizing my code would probably make things faster though. I didn't exactly code for efficiency.

It's up at http://sjeng.org/ftp/bootstrap.c (http://sjeng.org/ftp/bootstrap.c)

If possible, could you proofread it for bugs? Note that I switched it to medians instead of rank scores. If I understand things correctly, you can use whatever gives the most power. Means could be very interesting.

Edit: We will know the results in about 100 minutes

Edit2: I confused medians and means...I think.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-17 21:06:13
I took a quick look at the code.  I checked the random number generation implementation, which looked correct.  If I understand the resampling algorithm correctly, the code randomly chooses a listener, then for that listener, it shuffles the codecs.  Then it randomly chooses another listener, but it could choose the same listener (I think).  It does that N times, where N is the number of listeners.

Why not just shuffle the codecs for each listener?  BTW, this was just a 5 minute glance, so I may have interpreted the code wrong (I have a hard enough time looking over my own code sometimes).

Also, I noticed that your code limits the number of listeners to MAXSAMPLE, which you set to 50.  Can it be implemented to not care how many listeners are in the data input?

ff123

Edit:  There's at least one other way I can think of to resample, besides the way it appears you did it, and the way I just described:  Pool all the rankings together from all listeners, then randomly grab rankings out of that pool one at a time (replacing rankings each time) to reconstruct a new matrix.  Which is the correct way to do it?
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-17 23:38:38
Quote
Originally posted by ff123
I took a quick look at the code.  I checked the random number generation implementation, which looked correct.

Perhaps a simple improvement is to pick a faster, higher-quality random number generator. I think I have some of those still lying around

Quote
If I understand the resampling algorithm correctly, the code randomly chooses a listener, then for that listener, it shuffles the codecs.  Then it randomly chooses another listener, but it could choose the same listener (I think).  It does that N times, where N is the number of listeners.
Why not just shuffle the codecs for each listener?

I think you're right on this. I read up a bit more and the replacement/not replacement is one of the differences between a bootstrap and a randomization method. Since what we want is a randomization method, there should be no replacement. That said, the two are so closely related I expect no differences in the results.

Quote
Also, I noticed that your code limits the number of listeners to MAXSAMPLE, which you set to 50.  Can it be implemented to not care how many listeners are in the data input?

Sure, just set MAXSAMPLE higher

Quote
Edit:  There's at least one other way I can think of to resample, besides the way it appears you did it, and the way I just described:  Pool all the rankings together from all listeners, then randomly grab rankings out of that pool one at a time (replacing rankings each time) to reconstruct a new matrix.  Which is the correct way to do it?

Under the null hypothesis each setting has an equal chance of getting a certain score from the range of values that a certain listener uses. But the scores between different listeners are not comparable measurements. Still under the null hypothesis, it does not seem true that a certain sample has an equal chance of getting a certain score from the full range of values all listeners use.

So, I do not think you can also randomize the listeners. (But if you use ranks, it doesn't matter at all.)
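
To make the point concrete, here is a small sketch (invented ratings, hypothetical helper names, not the code in bootstrap.c) of randomizing raw ratings within each listener only. Every listener keeps exactly the set of scores they gave; only the assignment of scores to codec settings changes.

```python
import random

def shuffle_within_listeners(ratings, rng):
    """Permute the codec labels separately for each listener (no replacement)."""
    return [rng.sample(row, len(row)) for row in ratings]

def column_means(ratings):
    """Mean rating per codec setting (column)."""
    n = len(ratings)
    return [sum(row[c] for row in ratings) / n for c in range(len(ratings[0]))]

rng = random.Random(0)
# Invented raw ratings: listener 0 is lenient, listener 1 is severe.
ratings = [
    [5.0, 5.0, 4.5],
    [2.0, 3.0, 1.5],
    [4.0, 4.5, 3.0],
]
resampled = shuffle_within_listeners(ratings, rng)

# Each listener's own scores are preserved; a severe listener stays severe.
rows_preserved = all(sorted(a) == sorted(b) for a, b in zip(ratings, resampled))
print(rows_preserved)
print(column_means(ratings), column_means(resampled))
```

Pooling all scores across listeners would break that property, since a lenient listener's 5.0 could land in a severe listener's row.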

Edit: One of the websites you linked to describes this under 'Repeated measures'

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-17 23:44:34
cbr192 is worse than abr224 (0.02360 vs 0.48990)
cbr192 is worse than r3mix (0.04510 vs 0.69920)
cbr192 is worse than dm-xtrm (0.00050 vs 0.01380)
cbr192 is worse than mpc (0.00010 vs 0.00190)
cbr192 is worse than dm-ins (0.00440 vs 0.12540)
cbr192 is worse than cbr256 (0.01160 vs 0.29520)
cbr192 is worse than dm-std (0.00020 vs 0.00370)
abr224 is worse than mpc (0.00190 vs 0.05520)
abr224 is worse than dm-std (0.03310 vs 0.59550)
r3mix is worse than dm-xtrm (0.03970 vs 0.65770)
r3mix is worse than mpc (0.00040 vs 0.00810)
r3mix is worse than dm-std (0.01550 vs 0.36640)
mpc is better than dm-ins (0.00980 vs 0.26020)
mpc is better than cbr256 (0.00280 vs 0.07850)

Note that these have errors of _at least_ 0.0001, and are based on only 10 000 trials (which amounts to 100M of actual tests).

If you compare the first column (pairwise alphas) with the values after 1M trials, you will see that at least in one case (abr224/mpc) the error is large enough to change the result (simultaneous alpha in second column just above 5%).

Edit: Just to make it clear: the first value between brackets is the pairwise alpha, the second one is the alpha after correction for the simultaneous test. The second one should be smaller than 0.05 for a truly significant result.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-18 05:46:27
So to compare the rank data, the resampling method yields:

cbr192 is worse than dm-xtrm (0.00050 vs 0.01380)
cbr192 is worse than mpc (0.00010 vs 0.00190)
cbr192 is worse than dm-std (0.00020 vs 0.00370)
r3mix is worse than mpc (0.00040 vs 0.00810)

And Friedman/Fisher LSD yields:

mpc is better than r3mix, cbr192
dm-xtrm is better than cbr192
dm-std is better than cbr192

It seems they yield the same results!

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-18 05:56:28
By the way, on an unrelated note, one can change the results a bit by eliminating listener number 16, who was quite severe with overall ratings (in fact, he is the most severe rater), but who rated dm-std as a 5.0.  If you do that, the ranked data yields ranksums which put dm-xtrm before dm-std, just like the ANOVA does.

ff123

Edit:  Oops, I meant that the parametric analysis is changed to look like the ranked method, where dm-xtrm is better than dm-std.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-18 07:54:34
The difference being the resampling results hold guaranteed for all comparisons at the same time with > 95% certainty.

You can add abr224<mpc to the resampling results BTW. I checked: it fell through because of a bad estimate of the alpha value after only 10000 trials, and I am running a test with 25000 trials now (will take half a day). It was already confirmed to hold with the Bonferroni correction, which is safe (and even overconservative).

But hey, it's always nice to see things confirm each other

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-18 08:01:55
Quote
Originally posted by ff123
By the way, on an unrelated note, one can change the results a bit by eliminating listener number 16, who was quite severe with overall ratings (in fact, he is the most severe rater), but who rated dm-std as a 5.0.

Hmm, that's not acceptable for doing actual analysis on though

One thing I think  I _can_ do is to simply eliminate everybody who gave all-5's. After resampling those results are not changed anyway, and they do not affect the differences between the means.

That would speed up the analysis quite a bit, but I want to crosscheck it really does not affect any results.

Edit: Hmm, it may make a small difference anyway, so I'm going to keep them in just to be sure.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-18 08:04:57
BTW, The p-Value adjustments page I linked to says that the p-adjusted resampling algorithm can be made even more sensitive while still controlling familywise error rate by using a stepdown method.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-18 08:13:43
Quote
Hmm, that's not acceptable for doing actual analysis on though

Post screening is not a-priori ruled out, depending on how the data was collected.  BS-1116-1 has this to say:

Quote
Post-screening methods can be roughly separated into at least two classes; one is based on inconsistencies compared with the mean result and another relies on the ability of the subject to make correct identifications. The first class is never justifiable. Whenever a subjective listening test is performed with the test method recommended here, the required information for the second class of post-screening is automatically available. A suggested statistical method for doing this is described in Appendix 1.

The methods are primarily used to eliminate subjects who cannot make the appropriate discriminations. The application of a post-screening method may clarify the tendencies in a test result. However, bearing in mind the variability of subjects’ sensitivities to different artefacts, caution should be exercised.

So, if ABC/HR is used to collect the data (reference is rated each time a sample is rated), post-screening can be used as described in Appendix 1 of that document.  It is too long to paste here, but supposedly BS 1116-1 can be had for free these days.  See one of 2Bdecided's posts on the r3mix forum.

I agree that post screening as I describe is not appropriate for the AQ1 test.  I was just commenting on why dm-std and dm-xtrm seemed to be swapped depending on whether a parametric or non-parametric method is used.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-18 09:45:27
I think we may get lucky on the effects of the nonrandom order. Even though we don't know how much exactly it applied, if at all, we can check where it would have applied, if present.

The two more extreme settings in this test were cbr192 (very low) and mpc (very high). If the effect plays, one would expect the codec(s) just after cbr192 to be rated higher than they should be, and the one(s) after mpc to be rated lower than they should be.

However, we reached a downward conclusion so far for the codecs after cbr192 (abr224<mpc, r3mix < mpc), so I think we can already say that, if the effect plays, it did not endanger the conclusions there.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Dibrom on 2001-10-18 19:04:36
Since Garf is having trouble accessing the website here today, he asked me to post this for him:

Garf Wrote:
Quote
Results after 25000^2 resamples (10+ hours cpu time):

cbr192 is worse than abr224 (0.02208 vs 0.48052)
cbr192 is worse than r3mix (0.04712 vs 0.71088)
cbr192 is worse than dm-xtrm (0.00024 vs 0.01144)
cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-ins (0.00316 vs 0.11308)
cbr192 is worse than cbr256 (0.01032 vs 0.28664)
cbr192 is worse than dm-std (0.00012 vs 0.00588)
abr224 is worse than dm-xtrm (0.07908 vs 0.85528)
abr224 is worse than mpc (0.00080 vs 0.03504)
abr224 is worse than dm-std (0.03364 vs 0.60492)
r3mix is worse than dm-xtrm (0.04024 vs 0.66340)
r3mix is worse than mpc (0.00040 vs 0.01824)
r3mix is worse than dm-std (0.01436 vs 0.36392)
dm-xtrm is worse than mpc (0.04596 vs 0.70132)
mpc is better than dm-ins (0.00808 vs 0.23876)
mpc is better than cbr256 (0.00304 vs 0.10972)
cbr256 is worse than dm-std (0.06392 vs 0.80176)

He said you would know how to interpret them ff123.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-18 19:30:09
With 95% confidence for the entire experiment, not just for individual pair comparisons, one can state that:

1. mpc is better than abr224, r3mix, and cbr192
2. dm-xtrm is better than cbr192
3. dm-std is better than cbr192

I assume this was done using the rank data (not the raw ratings data), and that the figure of merit was the means of the ranks (same as using the rank sums).

This result shows that the resampling method is even more sensitive than using the Friedman / Fisher LSD, while affording greater confidence in the result to boot.

It should be even more interesting to see if the results change when a stepdown technique is incorporated to further adjust the p-values.  This promises to increase sensitivity even further.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Dibrom on 2001-10-18 19:51:39

Quote

<Garf> I used raw ratings data, and means.
<Garf> (which is just as well, and even more powerful)
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-18 20:01:28
<Garf> I used raw ratings data, and means.
<Garf> (which is just as well, and even more powerful)

Ok, then the classical analog would be the blocked ANOVA / Fisher LSD.

Garf, are you sure that the way you randomize (choose listeners with replacement) is not significantly different from choosing listeners without replacement?  I would be interested to see a comparison in the methods.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Dibrom on 2001-10-18 20:09:22
Quote

<Garf> I do not use replacement
<Garf> I changed that after [your] first comment

Garf says you can check it with the utility, he uploaded the last version here:

sjeng.org/ftp/bootstrap.c
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-18 20:20:05
Ok, cool!

The blocked ANOVA /  Fisher LSD seems to be implying that further sensitivity is possible, although for all we know, the results are incorrect because of the assumptions that are ignored and because familywise error is not controlled well.

Can you figure out how the stepdown is supposed to work?  I haven't looked at it very carefully.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-19 00:07:39
Hmmm,

If I understand the procedure correctly, essentially all the simulation work is already done, and the stepdown is extremely painless.

So after the stepdown correction, the adjusted p-values would be:

1. mpc > cbr192:  padj = 28 * 0.00000 = 0.00000
2. dm-std > cbr192: padj = max(0.00000, 27 * 0.00012) = 0.00324
3. dm-xtrm > cbr192: padj = max(0.00324, 26 * 0.00024) = 0.00624
4. mpc > r3mix: padj = max(0.00624, 25 * 0.00040) = 0.01
5. mpc > abr224: padj = max(0.01, 24 * 0.00080) = 0.0192

6. mpc > cbr256: padj = max(0.0192, 23 * 0.00304) = 0.06992

So, after stepdown correction, the mpc > cbr256 is closer to meeting the critical significance of 0.05, but no cigar.

ff123

Edit:  something tells me I didn't do this correctly, because I basically ignored the adjusted p-values which were obtained by the bootstrap adjustment.  If I did that, why didn't I just start with the base 25,000 trial run?

Edit2:  ok, let's try this again.  When calculating the ordinary bootstrap p-value adjustments for the AQ1 data set, there are 28 p-value counters, one counter for each pairwise comparison.  The counters are loaded with new values after each block of 25,000 trials.  Any particular counter is incremented if one or more of the new 28 block p-values is less than or equal to the actual p value.  The adjusted p-values are the proportion of counts after 25,000 blocks of 25,000 trials each are run.

To calculate the stepdown p-value adjustments, the most extreme p-value counter (mpc vs. cbr192) is incremented after each 25,000 trial block as described above.  However, the next most extreme p-value counter (dm-std vs. cbr192) is incremented or not only after excluding the value for the most extreme p-value counter.  The third most extreme p-value counter excludes the first two counters, etc.

I think I have that correct, now.
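
A sketch of the counter logic described in Edit2, assuming the block p-values are already available as a matrix (one row per block of trials, one column per comparison). The uniform random numbers below merely stand in for real block p-values, the name `stepdown_min_p` is hypothetical, and the running maximum that keeps the adjusted values in order is a common safeguard that the description above does not mention.

```python
import random

def stepdown_min_p(observed_p, resampled_p):
    """Step-down resampling adjustment.

    observed_p  -- the m actual pairwise p-values
    resampled_p -- one row of m p-values per block of trials
    """
    m = len(observed_p)
    n_blocks = len(resampled_p)
    order = sorted(range(m), key=lambda j: observed_p[j])  # most extreme first
    adjusted = [0.0] * m
    previous = 0.0
    for k, j in enumerate(order):
        remaining = order[k:]   # exclude comparisons more extreme than this one
        hits = sum(
            1 for row in resampled_p
            if min(row[i] for i in remaining) <= observed_p[j]
        )
        previous = max(previous, hits / n_blocks)  # keep adjusted values ordered
        adjusted[j] = previous
    return adjusted

# Toy stand-in data: 200 blocks x 4 comparisons.
rng = random.Random(0)
resampled_p = [[rng.random() for _ in range(4)] for _ in range(200)]
observed_p = [0.001, 0.02, 0.04, 0.5]
adjusted = stepdown_min_p(observed_p, resampled_p)
print(adjusted)
```

Because each later counter looks at fewer comparisons, its adjusted p-value can only be equal to or smaller than the ordinary (single-step) one, which is where the extra sensitivity comes from.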

Edit 3:  The initial stepdown calculation I made was actually a Bonferroni stepdown adjustment.  It is still valid; the advantage is that one doesn't have to run 25,000^2 trials, just the one 25,000 trial block.
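
The Bonferroni stepdown arithmetic from the numbered list earlier in this post can be reproduced with a few lines. A minimal sketch: the function name `bonferroni_stepdown` and the cap at 1.0 are mine, and the inputs are the six smallest pairwise p-values from the 25,000 trial run.

```python
def bonferroni_stepdown(p_sorted, n_comparisons):
    """Step-down Bonferroni adjustment for p-values sorted ascending.

    p_sorted may hold only the smallest few p-values, as long as
    n_comparisons is the total number of comparisons in the family.
    """
    adjusted = []
    previous = 0.0
    for rank, p in enumerate(p_sorted):
        # Multiplier shrinks by one at each step: 28, 27, 26, ...
        candidate = min(1.0, (n_comparisons - rank) * p)
        previous = max(previous, candidate)   # never drop below an earlier value
        adjusted.append(previous)
    return adjusted

# mpc>cbr192, dm-std>cbr192, dm-xtrm>cbr192, mpc>r3mix, mpc>abr224, mpc>cbr256
smallest = [0.00000, 0.00012, 0.00024, 0.00040, 0.00080, 0.00304]
stepdown = bonferroni_stepdown(smallest, 28)
for p_adj in stepdown:
    print(f"{p_adj:.5f}")
```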
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-19 17:58:01
Quote
Originally posted by ff123

I think I have that correct, now.

I agree.

I'd add medians and stepdown correction, but I'm rather busy with other things right now, so feel free to....

Edit: where you say:

However, the next most extreme p-value counter (dm-std vs. cbr192) is incremented or not only after excluding the value for the most extreme p-value counter

Don't you mean 'excluding the most extreme p-value'?

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-19 18:44:11
Quote
I agree.

I'd add medians and stepdown correction, but I'm rather busy with other things right now, so feel free to....

I won't be around this weekend to play, so I guess this will have to wait.  BTW, my celeron 800 does a 1000 x 1000 simulation in about 2 minutes 20 sec, quite a bit slower than your Athlon 1Gig.  Using MSVC 6 instead of djgpp doesn't really help much.  A 25,000 x 25,000 simulation would take about 24 hours.  I can see why resampling techniques have taken so long to come into their own.

Quote
However, the next most extreme p-value counter (dm-std vs. cbr192) is incremented or not only after excluding the value for the most extreme p-value counter

Don't you mean 'excluding the most extreme p-value'?

Yes.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-20 23:37:39
medians (1/2):

(only ran 1000^2 trials this time)

cbr192 is worse than abr224 (0.01600 vs 0.15700)
cbr192 is worse than r3mix (0.01800 vs 0.16400)
cbr192 is worse than dm-xtrm (0.00000 vs 0.00000)
cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-ins (0.00000 vs 0.00000)
cbr192 is worse than cbr256 (0.00100 vs 0.01600)
cbr192 is worse than dm-std (0.00000 vs 0.00000)

First quartile (1/4):

cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-std (0.03800 vs 0.34800)
abr224 is worse than mpc (0.00000 vs 0.00000)
abr224 is worse than dm-std (0.03600 vs 0.26900)
r3mix is worse than mpc (0.00000 vs 0.00000)
r3mix is worse than dm-std (0.03900 vs 0.41400)
mpc is better than dm-ins (0.00100 vs 0.02500)
mpc is better than cbr256 (0.00000 vs 0.00000)
cbr256 is worse than dm-std (0.04000 vs 0.41400)

1/3 :

cbr192 is worse than r3mix (0.00800 vs 0.15400)
cbr192 is worse than dm-xtrm (0.00000 vs 0.00000)
cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-ins (0.01200 vs 0.23400)
cbr192 is worse than cbr256 (0.00900 vs 0.15400)
cbr192 is worse than dm-std (0.00300 vs 0.11300)
abr224 is worse than dm-xtrm (0.00900 vs 0.15400)
abr224 is worse than mpc (0.00000 vs 0.00000)
abr224 is worse than dm-std (0.01100 vs 0.19100)
r3mix is worse than mpc (0.02200 vs 0.35200)
mpc is better than dm-ins (0.02000 vs 0.34300)
mpc is better than cbr256 (0.02900 vs 0.41000)

Neither of these is as sensitive as plain means. Additionally, it's harder to give a meaning to the results. (With means you can say: people graded x more on average. I can't think of something similar for either of the test statistics above)
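
For reference, a quantile-type test statistic such as the median (1/2), first quartile (1/4) or 1/3 point could be computed along these lines. This is a hedged sketch: the function name and the choice of a plain order statistic without interpolation are assumptions, and the program used for the runs above may define its quantiles differently.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles, ascending */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Return the value found at fraction frac (0..1) of the sorted sample,
   e.g. frac = 0.5 for a median-like statistic, 0.25 for the first
   quartile.  Assumption: a simple order statistic, no interpolation.
   Returns 0.0 if n <= 0 or if memory cannot be allocated. */
double quantile_stat(const double *data, int n, double frac)
{
    double *tmp, q;
    int idx;
    if (n <= 0) return 0.0;
    tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL) return 0.0;
    memcpy(tmp, data, (size_t)n * sizeof *tmp);   /* leave input untouched */
    qsort(tmp, (size_t)n, sizeof *tmp, cmp_double);
    idx = (int)(frac * (n - 1));                  /* truncates toward zero */
    q = tmp[idx];
    free(tmp);
    return q;
}
```

In a resampling run, the difference between two codecs' quantile_stat values would take the place of the difference of means as the test statistic.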

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-24 01:32:46
Ok, I believe I have modified your source correctly to implement stepdown using resampling.  Here is a run of 2000 x 2000:

cbr192 is worse than dm-xtrm (0.00000 vs 0.00000)
cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-std (0.00000 vs 0.00000)
abr224 is worse than mpc (0.00000 vs 0.00000)
r3mix is worse than mpc (0.00050 vs 0.02200)

mpc is better than cbr256 (0.00350 vs 0.11400)
cbr192 is worse than dm-ins (0.00450 vs 0.13450)
mpc is better than dm-ins (0.00750 vs 0.20550)
cbr192 is worse than cbr256 (0.00800 vs 0.21150)
r3mix is worse than dm-std (0.01800 vs 0.38100)
cbr192 is worse than abr224 (0.02150 vs 0.41050)
abr224 is worse than dm-std (0.03450 vs 0.51900)
r3mix is worse than dm-xtrm (0.04050 vs 0.56300)
cbr192 is worse than r3mix (0.04500 vs 0.58900)
dm-xtrm is worse than mpc (0.04700 vs 0.56600)

The first five conclusions, taken together, are significant with 95% confidence.

I've placed the modified source at:

http://ff123.net/export/bootstrap.c (http://ff123.net/export/bootstrap.c)

The changes aren't necessarily pretty, but I think it works.

ff123

Edit:  10,000 x 10,000 run:

cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-xtrm (0.00010 vs 0.00440)
cbr192 is worse than dm-std (0.00010 vs 0.00400)
r3mix is worse than mpc (0.00040 vs 0.01490)
abr224 is worse than mpc (0.00050 vs 0.01740)

mpc is better than cbr256 (0.00230 vs 0.07020)
cbr192 is worse than dm-ins (0.00370 vs 0.10790)
mpc is better than dm-ins (0.00760 vs 0.18750)
cbr192 is worse than cbr256 (0.01030 vs 0.23310)
r3mix is worse than dm-std (0.01570 vs 0.31730)
cbr192 is worse than abr224 (0.02170 vs 0.39190)
abr224 is worse than dm-std (0.03590 vs 0.52930)
r3mix is worse than dm-xtrm (0.04160 vs 0.56070)
cbr192 is worse than r3mix (0.04330 vs 0.56710)
dm-xtrm is worse than mpc (0.04350 vs 0.53840)
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-24 07:37:25
Just for kicks, I thought I'd try it for the dogies.wav test data, and I got for 3000 x 3000 trials:

MPC is better than XING (0.00000 vs 0.00000)
AAC is better than XING (0.00000 vs 0.00000)

AAC is better than LAME (0.00233 vs 0.05367)
MPC is better than WMA (0.00233 vs 0.05100)
MPC is better than LAME (0.00333 vs 0.06333)
OGG is better than XING (0.00333 vs 0.05867)
AAC is better than WMA (0.00433 vs 0.06867)
MPC is better than OGG (0.00533 vs 0.07700)
LAME is better than XING (0.00767 vs 0.09167)
WMA is better than XING (0.00967 vs 0.10167)
AAC is better than OGG (0.01267 vs 0.11700)

So from the looks of it, this method is still quite conservative when compared with Friedman or ANOVA with Fisher's LSD.

Either that, or I did the stepdown incorrectly.

ff123

Edit:  in fact, Tukey's HSD is less conservative than this!
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Garf on 2001-10-24 10:07:16
Resampling methods are generally considered 'debatable'  for samples of size 20-30, and only generally accepted for samples > 30.

Using them with a sample size of 12 is probably going to kill you. No need for it either, as the 128kbps data looked normal enough that parametric methods will work.

--
GCP
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-10-24 18:25:03
I'm going to go over bootstrap.c with a fine tooth comb tonight.  I already see that it has some errors in it related to floating-point comparisons, which should always include the "DELTA" fuzz.  This should be minor, though.
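
To illustrate what the "DELTA" fuzz means for floating-point comparisons, here is a small sketch. The tolerance value and the helper names are hypothetical; the actual DELTA used in bootstrap.c is not given in this thread.

```c
#define DELTA 1e-9   /* hypothetical tolerance; bootstrap.c may use another value */

/* a <= b, allowing for accumulated rounding error */
int le_fuzzy(double a, double b)
{
    return a <= b + DELTA;
}

/* a == b to within the tolerance */
int eq_fuzzy(double a, double b)
{
    double d = a - b;
    return d <= DELTA && d >= -DELTA;
}
```

Without the fuzz, two means that are equal on paper can compare as unequal because the sums were accumulated in a different order, which matters when counting how often a resampled statistic is "at least as extreme" as the actual one.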

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-03 00:32:50
Ok,

Combing finished.  Changed some code to use long variables instead of float.  This sidesteps some issues I was having using the DELTA fuzz thingy.  The results are the same as far as I can tell, at least at the 1000 x 1000 level.

http://ff123.net/export/bootstrap.c (http://ff123.net/export/bootstrap.c)

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-04 00:09:11
In reading Resampling-Based Multiple Testing (Westfall & Young), I note that there is a way to step down that is potentially even more powerful than the method Garf and I are using in bootstrap.c.  The book makes a distinction between free step down and restricted step down.  The idea is to restrict hypotheses to those whose simultaneous truth does not contradict.

In a free stepdown, the multipliers for a 6-treatment test (15 pairwise comparisons) would be:  15, 14, 13, 12, ..., 1.

In a restricted stepdown for the same number of treatments, a conservative adjustment (not quite optimal, but conveniently available in a table) yields the multipliers:  15, 10, 10, 10, 10, 10, 7, 7, 7, 6, 4, 4, 3, 2, 1.
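
As a rough illustration of how such multipliers are applied, here is a sketch of a Bonferroni-style step-down adjustment. Note that this is the simple multiplier version, not the resampling-based adjustment, and the function name is an assumption made for the example.

```c
/* Bonferroni-style step-down adjustment (sketch).
   p[]    : unadjusted p-values sorted ascending (most significant first)
   mult[] : step-down multipliers, same length as p[]
   adj[]  : receives the adjusted p-values */
void stepdown_adjust(const double *p, const int *mult, int n, double *adj)
{
    double prev = 0.0;
    int k;
    for (k = 0; k < n; k++) {
        double a = p[k] * mult[k];
        if (a > 1.0) a = 1.0;      /* a p-value cannot exceed 1 */
        if (a < prev) a = prev;    /* keep adjusted values non-decreasing */
        adj[k] = a;
        prev = a;
    }
}
```

Passing 15, 14, ..., 1 as mult[] gives the free step-down; passing the restricted list gives smaller adjusted p-values from the second comparison onward, which is where the gain in power comes from.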

This is a substantial improvement over the free stepdown.  It would be interesting to implement this improvement in bootstrap.c (which should really be called resampling.c).

I will scan in the relevant pages and send this to Garf.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-04 20:32:34
Hmmm,

It appears that permutation (what the current program does) is not the optimal resampling method to use with restricted step down.  For example, see:

http://www.sas.com/service/library/periodi...s/obswww23/#s05 (http://www.sas.com/service/library/periodicals/obs/obswww23/#s05)

In Resampling-Based Multiple Testing, Westfall and Young voice the same concerns (not surprisingly, since Westfall is involved in SAS/STAT).

So perhaps it is time to backtrack a bit and get bootstrap resampling working in the program.  However, there need to be a few adjustments, because just comparing the means of the treatments directly is not adequate for bootstrap resampling.  Instead, a t statistic should be calculated which uses a "shift" and "pivot" method.

Also, I may be missing some information to make the restricted step down calculation easier:  Westfall appears to have done some work in 1997 and "devised a minP-based method for any set of linear contrasts that respects the collapsing in the closure tree, as well as intercorrelations among the variables."

But first, I'd like to get bootstrap resampling working and giving the same results as the permutation resampling.

ff123

Edit:  There is a PDF paper by Westfall which mentions minP here:

http://www.unibas.ch/psycho/BBS/stuff/westfall.pdf (http://www.unibas.ch/psycho/BBS/stuff/westfall.pdf)

and Reference paper (which I don't have):
Westfall, P.H. (1997) Multiple testing of general contrasts using logical constraints and correlations, Journal of the American Statistical Association, 92, 299-306.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-05 23:58:12
Quote
and Reference paper (which I don't have):
Westfall, P.H. (1997) Multiple testing of general contrasts using logical constraints and correlations, Journal of the American Statistical Association, 92, 299-306.

I copied this paper from my local community college library, and it's very interesting, although I think it will take some time for me to fully absorb.  It's definitely not plug and play.  But in short, I believe a simple and efficient algorithm is presented which allows one to take advantage of logical constraints when performing stepdown adjustments that will make the (bootstrap) resampling analysis more powerful.

Too bad a piece of (free) code doesn't already exist somewhere that performs this type of analysis.  There's SAS/STAT, but they don't even list a price (you have to call them), so I figure the price must be exorbitant.

ff123

Edit:  Hmm.  Peter Westfall seems to have made a piece of code available here:

http://lib.stat.cmu.edu/jasasoftware/mtest (http://lib.stat.cmu.edu/jasasoftware/mtest)
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-07 21:29:59
I don't know who's actually reading these posts, but for me, they're a kind of logbook.  I'm working on version 0.3 of bootstrap.c, which will deprecate the permutation resampling in favor of bootstrap resampling.  I think I finally understand how the simplest bootstrap algorithm works (I'm talking single step, not even the free step down, much less the restricted step down!), and the book has a good example which I should be able to replicate to test the program.

Running the current permutation resampling code on the example, it's clear that the program has a lot of room for improvement (read: potential for increase in power).

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: Jon Ingram on 2001-11-08 10:17:04
Quote
I don't know who's actually reading these posts...

I don't have any time to contribute to audio tests/methodology at the moment (work + real life getting in the way), but I'm finding what you are writing very interesting, if a little out of my sphere of understanding -- I've previously only met bootstrapping in the context of evolutionary phylogenetics from aligned DNA sequence data.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-09 10:08:49
Finished the single-step bootstrap and verified that it gives the same results as the example in the book.  I am able to tweak it even further by assuming that resampling values for each listener are restricted to the values given by that listener.  It is the same idea, I think, as the blocked ANOVA vs. a regular ANOVA.
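
The per-listener restriction can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from bootstrap.c: the function names and the row-by-row data layout are made up for the example, and whether the program resamples raw ratings or mean-centered ones is not stated here.

```c
#include <stdlib.h>

/* Draw k values with replacement from one listener's own k ratings.
   row[] : that listener's ratings (one per codec)
   out[] : receives the resampled row
   Uses rand(); call srand() once beforehand to seed it.
   (rand() % k has a slight modulo bias, acceptable for a sketch.) */
void resample_within_listener(const double *row, int k, double *out)
{
    int j;
    for (j = 0; j < k; j++)
        out[j] = row[rand() % k];
}

/* Resample a whole listeners x codecs table stored row by row.
   Each listener's row is resampled only from that listener's values. */
void resample_blocked(const double *data, int listeners, int k, double *out)
{
    int i;
    for (i = 0; i < listeners; i++)
        resample_within_listener(data + i * k, k, out + i * k);
}
```

An unblocked bootstrap would instead draw every value from the pooled table, so a harsh listener's low ratings could land in a generous listener's row; keeping the draws within each row removes that listener-to-listener variation, much as blocking does in ANOVA.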

The results for the AQ1 data are:

Code: [Select]
```
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.4216   0.2928   0.1263   0.0822   0.0539   0.0342*  0.0016*
dm-std      --    0.8035   0.4672   0.3486   0.2592   0.1871   0.0181*
dm-xtrm     --       --    0.6323   0.4909   0.3788   0.2842   0.0342*
dm-ins      --       --       --    0.8332   0.6877   0.5530   0.1004
cbr256      --       --       --       --    0.8482   0.7019   0.1517
abr224      --       --       --       --       --    0.8482   0.2139
r3mix       --       --       --       --       --       --    0.2928

         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.8494   0.5887   0.1399   0.0569   0.0226*  0.0081*  0.0000*
dm-std      --    0.9999   0.9044   0.7191   0.4983   0.2943   0.0018*
dm-xtrm     --       --    0.9899   0.9257   0.7783   0.5657   0.0081*
dm-ins      --       --       --    0.9999   0.9964   0.9657   0.0871
cbr256      --       --       --       --    0.9999   0.9973   0.2004
abr224      --       --       --       --       --    0.9999   0.3696
r3mix       --       --       --       --       --       --    0.5887
```

The top table shows the unadjusted p-values, calculated assuming a normal distribution.  The bottom table shows the adjusted p-values after 100,000 bootstrap trials.  Notice that the smaller p-values decrease.  This is because of my tweak.  Here is what it would look like without that tweak -- only one comparison is significant this way!

Code: [Select]
```
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.9929   0.9657   0.7902   0.6596   0.5266   0.3970   0.0314*
dm-std      --    1.0000   0.9962   0.9823   0.9506   0.8912   0.2533
dm-xtrm     --       --    0.9997   0.9974   0.9878   0.9621   0.3970
dm-ins      --       --       --    1.0000   0.9999   0.9989   0.7222
cbr256      --       --       --       --    1.0000   0.9999   0.8398
abr224      --       --       --       --       --    1.0000   0.9185
r3mix       --       --       --       --       --       --    0.9657
```

I will clean up the code slightly, and post it tomorrow.  Next up:  free step down.

Edit:  BTW, this method is far superior in terms of speed: 100,000 trials takes only 40 seconds.  That's because I'm using a calculated starting point for the unadjusted p-values.
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: tangent on 2001-11-09 19:25:54
FastForward:

Fixed the qsort_longsamples() function. Sorry about that, it seems there was a bug with repeated numbers in the reference I used.

Code: [Select]
```
/* Quicksort of sortedp->data[first..last] in ascending order,
   carrying sortedp->num along so each value keeps its original index. */
void qsort_longsamples(longsamples_t *sortedp, int first, int last) {
  int i, j;
  long pivot;
  struct {
    long data;   /* data to be sorted */
    int num;     /* numbering of data */
  } temp;

  if (first < last) {
    pivot = sortedp->data[first];
    i = first+1;
    j = last;
    while (i <= j) {
      /* check the index bounds before reading the element */
      while ((i <= last) && (sortedp->data[i] <= pivot)) i++;
      while ((first < j) && (sortedp->data[j] > pivot)) j--;
      if (i < j) {
        temp.data = sortedp->data[i];
        temp.num = sortedp->num[i];
        sortedp->data[i] = sortedp->data[j];
        sortedp->num[i] = sortedp->num[j];
        sortedp->data[j] = temp.data;
        sortedp->num[j] = temp.num;
      }
    }
    /* move the pivot into its final position j */
    temp.data = sortedp->data[j];
    temp.num = sortedp->num[j];
    sortedp->data[j] = sortedp->data[first];
    sortedp->num[j] = sortedp->num[first];
    sortedp->data[first] = temp.data;
    sortedp->num[first] = temp.num;
    qsort_longsamples(sortedp, first, j-1);
    qsort_longsamples(sortedp, j+1, last);
  }
}
```
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-09 21:35:21
tangent,

Thanks.  I'll get that into version 0.4.  Version 0.3 is at:

http://ff123.net/bootstrap/bootstrap03.zip (http://ff123.net/bootstrap/bootstrap03.zip)

Here are some improvements I'd like to schedule for version 0.4:

1. An improved rerandomization (permutation) algorithm, which will be much faster.  Also, since step-down is not strictly valid with permutation resampling, I will revert to single-step for this.

2. Bootstrap step-down, for improved power.

3. Resampling to arrive at unadjusted p-values.  This isn't really needed, because I don't care too much about the unadjusted p-values, but it should be nice to see in place of the current normal model (calculated unadjusted p-values).

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-11 05:02:13
Interestingly, I note that by using my bootstrap resampling technique, I can arrive at resampled, unadjusted p-values which are almost identical with the blocked ANOVA model, which is good, because it means that what I am doing is exactly what I want.

I.e., here is the blocked ANOVA p-value table for the AQ1 data:

Code: [Select]
```
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.174    0.075    0.010*   0.003*   0.001*   0.000*   0.000*
dm-std            0.673    0.218    0.113    0.056    0.026*   0.000*
dm-xtrm                    0.418    0.243    0.136    0.070    0.000*
dm-ins                              0.721    0.496    0.315    0.006*
cbr256                                       0.746    0.517    0.015*
abr224                                                0.746    0.036*
r3mix                                                          0.075
```

and here is the tweaked bootstrap resampled version of the same thing with 100,000 trials (p-values are not adjusted for multiplicity):

Code: [Select]
```
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.160    0.068    0.009*   0.003*   0.001*   0.000*   0.000*
dm-std      -     0.663    0.206    0.104    0.051    0.023*   0.000*
dm-xtrm     -        -     0.398    0.229    0.125    0.064    0.000*
dm-ins      -        -        -     0.712    0.480    0.299    0.005*
cbr256      -        -        -        -     0.737    0.503    0.014*
abr224      -        -        -        -        -     0.737    0.032*
r3mix       -        -        -        -        -        -     0.067
```

Notice a similarity?

BTW, this also means that the blocked ANOVA using a protected Fisher's LSD does not control experiment-wise error!

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-12 02:09:50
Version 0.4 is complete and up at:

http://ff123.net/bootstrap/bootstrap04.zip (http://ff123.net/bootstrap/bootstrap04.zip)

It implements bootstrap free step-down p-value adjustment.  Just type:

bootstrap aq1.txt

This will run 10,000 bootstrap trials of the AQ1 data, and the results will be:

Code: [Select]
```
BOOTSTRAP version 0.4, Nov 10, 2001
Input file : aq1.txt
Read 8 treatments, 42 samples

                            Unadjusted p-values
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.174    0.075    0.010*   0.003*   0.001*   0.000*   0.000*
dm-std     -      0.673    0.218    0.113    0.056    0.026*   0.000*
dm-xtrm    -        -      0.418    0.243    0.136    0.070    0.000*
dm-ins     -        -        -      0.721    0.496    0.315    0.006*
cbr256     -        -        -        -      0.746    0.517    0.015*
abr224     -        -        -        -        -      0.746    0.036*
r3mix      -        -        -        -        -        -      0.075

Each '.' is 1,000 resamples.  Each '+' is 10,000 resamples.........+

                             Adjusted p-values
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.684    0.442    0.124    0.059    0.024*   0.010*   0.000*
dm-std     -      0.980    0.719    0.545    0.395    0.245    0.002*
dm-xtrm    -        -      0.906    0.739    0.603    0.450    0.010*
dm-ins     -        -        -      0.966    0.935    0.829    0.085
cbr256     -        -        -        -      0.738    0.935    0.174
abr224     -        -        -        -        -      0.919    0.303
r3mix      -        -        -        -        -        -      0.470
```

Sorry I didn't include your code, tangent, but I really only sort in one or two places and it wasn't going to save a lot of time to implement your quicksort.  Plus I changed the sort routine a little to make it able to sort either from min to max or max to min.

ff123

Edit:  It doesn't actually take too long to run a million trials now, so I did:

Code: [Select]
```
                             Adjusted p-values
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.690    0.452    0.130    0.061    0.027*   0.011*   0.000*
dm-std     -      0.982    0.726    0.554    0.404    0.248    0.003*
dm-xtrm    -        -      0.906    0.747    0.608    0.456    0.012*
dm-ins     -        -        -      0.968    0.936    0.831    0.088
cbr256     -        -        -        -      0.742    0.934    0.180
abr224     -        -        -        -        -      0.922    0.311
r3mix      -        -        -        -        -        -      0.477
```
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-12 21:54:58
Well, Westfall's paper, "Multiple Testing of General Contrasts Using Logical Constraints and Correlations" (1997) is damn near impenetrable to me.  I think a big part of the problem is that I'm not familiar with the mathematical notation.  However, if I stare at it long enough, I may start to catch on.  Improving on the free step-down bootstrap is a tempting carrot.

Just as an amusing anecdote, I called the SAS institute to find out how much they charged for their software.  It's on the order of about \$2600 for the required base package for the first year (about \$1300 for a yearly renewal), plus \$1100 for the optional subpackage, which I presume runs the types of tests I would be interested in (about half that for the yearly renewal).

That's a little out of my reach :-)

While I'm trying to decipher Westfall's paper, I will probably implement the rest of the rerandomization code (so that it can shuffle the whole pool of values) in order to verify that it gives the same results as the example in the 1993 Westfall/Young book.  Also, I will probably add an option to convert the raw data into ranked data.
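
A possible shape for the raw-to-rank conversion is sketched below. It assumes that ranking is done separately within each listener's row, as in a Friedman-style analysis, and that ties receive the average of the ranks they span; neither detail is confirmed above, and the function name is hypothetical.

```c
/* Replace each value in row[] by its rank among the k values (1 = lowest).
   Tied values share the average of the ranks they span (midranks).
   row[]  : one listener's ratings
   rank[] : receives the ranks, in the same order as row[] */
void rank_row(const double *row, int k, double *rank)
{
    int i, j;
    for (i = 0; i < k; i++) {
        int less = 0, equal = 0;
        for (j = 0; j < k; j++) {
            if (row[j] < row[i]) less++;
            else if (row[j] == row[i]) equal++;   /* counts i itself too */
        }
        /* the tied group occupies ranks less+1 .. less+equal; use the mean */
        rank[i] = less + (equal + 1) / 2.0;
    }
}
```

For example, the ratings 3, 1, 3, 2 would become the ranks 3.5, 1, 3.5, 2.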

Then I'll probably integrate it into the current web-based analysis tool.  However, I don't want to load down my server's CPU and get myself into trouble, so I'll probably limit that to 1000 trials.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-17 21:25:21
Yay!

After a couple days of frustration, I finally found the bug in the blocked bootstrap which was causing it to lose power.  Now, with free stepdown, for my dogies.wav data, adjusted p-values are (100,000 trials):

Code: [Select]
```
         AAC      OGG      LAME     WMA      XING
MPC      0.943    0.006*   0.002*   0.001*   0.000*
AAC        -      0.015*   0.005*   0.004*   0.000*
OGG        -        -      0.943    0.933    0.002*
LAME       -        -        -      0.943    0.007*
WMA        -        -        -        -      0.008*
```

And AQ1 results are (100,000 trials):

Code: [Select]
```
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.737    0.524    0.130    0.053    0.019*   0.006*   0.000*
dm-std     -      0.986    0.777    0.612    0.452    0.271    0.001*
dm-xtrm    -        -      0.929    0.792    0.664    0.509    0.006*
dm-ins     -        -        -      0.986    0.947    0.861    0.081
cbr256     -        -        -        -      0.986    0.947    0.186
abr224     -        -        -        -        -      0.986    0.339
r3mix      -        -        -        -        -        -      0.509
```

And I have a pretty good idea now of how to implement restricted step-down now as well.  So I'll release version 0.5 (bug fix) shortly.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: RD on 2001-11-17 21:55:43
ff123,

What is the correct interpretation of the last chart you posted, namely:

Code:
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192
mpc      0.737    0.524    0.130    0.053    0.019*   0.006*   0.000*
dm-std     -      0.986    0.777    0.612    0.452    0.271    0.001*
dm-xtrm    -        -      0.929    0.792    0.664    0.509    0.006*
dm-ins     -        -        -      0.986    0.947    0.861    0.081
cbr256     -        -        -        -      0.986    0.947    0.186
abr224     -        -        -        -        -      0.986    0.339
r3mix      -        -        -        -        -        -      0.509

Has it finally been proven that dm-preset standard beat --r3mix in that test?

Curious....
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-17 22:29:12
No.  It will probably never be shown that dm-std beats r3mix with 95% confidence in this data set, even after I implement restricted step-down, which will be more powerful than free step-down (but we'll see).  The problem is that there were just too many samples.  Each additional sample adds more statistical noise.

On the other hand, you can say that dm-std is better than cbr192, while you can't say the same thing with r3mix.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-18 08:19:32
Version 0.5 is up, as well as a page for it at:

http://ff123.net/bootstrap/ (http://ff123.net/bootstrap/)

I have also placed it under LGPL.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-11-22 04:18:10
Ok, with the help of Matlab, I believe I finally understand how to find the subsets required to perform restricted step down.  The reason it took me so long is because there is apparently a typo in the critical formula!

But after looking at the SAS/IML code which Westfall generated, I was able to sort it out.

So now I just need to generate a few routines to do stuff like multiply matrices and take the inverse of matrices (or snag such routines off the web).

I'm getting close now, I can see the finish line.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-12-03 01:36:54
It's becoming clear that finding the subsets for the restricted stepdown is going to dominate the computing time.  We're talking an extremely long time to calculate the subsets for 8 codecs.  I haven't even finished yet, but the current computations for 6 codecs already take 11 seconds.  It's likely to grow further.  Scaling to 8 codecs, that would mean finding all the subsets would take 25 hours!  It's the difference between 2^(15-2) versus 2^(28-2).

It can be done, of course, but that's not the sort of thing I'd like to do every day.  7 codecs is about the practical max for this sort of analysis on my 800MHz Celeron, clocking in at 12 minutes, given the current time estimates.
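
The time estimates above follow from simple scaling, which can be checked with a few lines of C. This sketch assumes, as the 2^(15-2) versus 2^(28-2) comparison implies, that the work doubles with each additional pairwise comparison; the function names are made up for the example.

```c
/* Number of pairwise comparisons among k codecs. */
int num_pairs(int k)
{
    return k * (k - 1) / 2;
}

/* Scale a measured time from base_k codecs to k codecs, assuming the
   subset search doubles in cost with each extra pairwise comparison. */
double scaled_seconds(double base_seconds, int base_k, int k)
{
    int extra = num_pairs(k) - num_pairs(base_k);
    double t = base_seconds;
    while (extra > 0) { t *= 2.0; extra--; }
    while (extra < 0) { t /= 2.0; extra++; }
    return t;
}
```

With the measured 11 seconds for 6 codecs (15 pairs), 7 codecs (21 pairs) come out at 11 x 2^6 = 704 seconds, or about 12 minutes, and 8 codecs (28 pairs) at 11 x 2^13 = 90,112 seconds, or about 25 hours.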

I'm probably about a week away from finishing up the code.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: tangent on 2001-12-03 15:52:47
I guess it's time to write a distributed version of bootstrap.... call it bootstrap@home
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-12-04 08:00:55
I hit yet another roadblock:  singular or near-singular matrices.  A couple of test cases I tried in Matlab give ambiguous results because of the matrix inversion involved.  I have tried a request for help in sci.stat.math with no reply, so I have finally written to Dr. Westfall himself in search of assistance.  I hope he is amenable to spending time on a person with a hobby.

ff123
Title: Statistical Methods for Listening Tests(splitted R3mix VBR s
Post by: ff123 on 2001-12-04 16:04:11
Dr Westfall replied.  I should be using the generalized inverse, which is what superscript "-" means (superscript "-1" means the normal inverse).  So his formula didn't contain a typo.  I just didn't know enough to know what it meant!  Doh.

He also informed me that some German scientists are coding the algorithm (finding subsets for restricted step-down, I gather) into R.

ff123