Topic: Statistical Methods for Listening Tests(splitted R3mix VBR s (Read 22725 times)


Quote
Hmm, the results of that test are still under discussion (actually, I'm waiting for ff123 to finish his analysis tool with the nonparametric Tukey HSD test)


Well, you don't have to wait for me to finish coding to know what the non-parametric Tukey HSD value is -- I calculated that in Excel.  It's 64.  The Fisher LSD was 44.  So, you can see that Tukey is quite a bit more conservative.

The ranksums (for reference) were:

cbr192 = 151.5
r3mix = 172.0
abr224 = 186.5
dm-ins = 188.0
cbr256 = 185.5
dm-std = 198.0
dm-xtrm = 207.0
mpc = 223.5

So basically all the Tukey HSD says (experiment-wise confidence level is 95%) is that mpc is better than cbr192!
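As a sanity check, the critical difference behind that conclusion can be reproduced from the rank sums above. Here is a sketch in Python, assuming the usual rank-sum form of the Tukey HSD, CD = q * sqrt(N * k * (k+1) / 12), with the tabulated studentized-range value q(0.05; k=8, df=inf) ≈ 4.29; the number of blocks N is inferred from the rank sums rather than taken from the source:

```python
import math

# Rank sums from the post; 8 codecs (k).  The sums imply
# N = (sum of rank sums) / (k*(k+1)/2) = 1512 / 36 = 42 listener/sample blocks.
ranksums = {
    "cbr192": 151.5, "r3mix": 172.0, "abr224": 186.5, "dm-ins": 188.0,
    "cbr256": 185.5, "dm-std": 198.0, "dm-xtrm": 207.0, "mpc": 223.5,
}
k = len(ranksums)                                     # 8 treatments
N = int(sum(ranksums.values()) / (k * (k + 1) / 2))   # 42 blocks

# Studentized-range critical value q(0.05; k=8, df=inf) from standard tables.
q_crit = 4.29

# Nonparametric Tukey HSD critical difference on Friedman rank sums.
cd = q_crit * math.sqrt(N * k * (k + 1) / 12.0)
print(round(cd, 1))          # 68.1 -- matching the corrected value in the edit

# The only pair whose rank-sum difference exceeds the critical difference:
diff = ranksums["mpc"] - ranksums["cbr192"]
print(diff, diff > cd)
```

Only mpc vs. cbr192 (a rank-sum difference of 72.0) clears the critical difference, which is exactly the conclusion stated above.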

ff123

Edit:  I discovered my Excel spreadsheet had a mistake in it.  The non-parametric Tukey's HSD should be 68.1.  I was debugging my code and had to resolve the discrepancy (the code was correct).  The conclusion remains the same.

Reply #1
Quote
Originally posted by ff123

Well, you don't have to wait for me to finish coding to know what the non-parametric Tukey HSD value is -- I calculated that in Excel.  It's 64.  The Fisher LSD was 44.  So, you can see that Tukey is quite a bit more conservative.


But the Fisher LSD isn't simultaneous, is it? Or was it based on a normal distribution?

(I remember that we talked about it and I concluded that it wasn't reliable/applicable, but I don't remember why.)

I wanted a statistically 'sound' conclusion from this test. I wouldn't call soundness conservative.

For an idea of the individual results the Wilcoxon signed-rank test was enough. (From a look at the values, its sensitivity seems to be even better than the Fisher LSD?) But presenting a result and having to say "there's a >50% chance that one of the things we concluded is incorrect" isn't very nice, is it?

(Btw, Wilcoxon + Bonferroni correction gave in the end the same results as the nonparametric Tukey HSD!)
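A minimal sketch of that Wilcoxon + Bonferroni approach using SciPy; the listener scores below are made-up placeholders for illustration, not the actual test data:

```python
from itertools import combinations
from scipy.stats import wilcoxon

# Hypothetical listener ratings (one value per listener) for three codecs.
scores = {
    "cbr192": [3.0, 2.5, 3.5, 2.0, 3.0, 2.5, 3.0, 2.0],
    "dm-std": [4.0, 3.5, 4.0, 3.0, 4.5, 3.5, 4.0, 3.0],
    "mpc":    [4.5, 4.0, 4.5, 4.0, 5.0, 4.5, 4.5, 4.0],
}

pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)   # Bonferroni: split the 0.05 over all comparisons

results = {}
for a, b in pairs:
    _, p = wilcoxon(scores[a], scores[b])   # paired signed-rank test
    results[(a, b)] = p
    print(f"{a} vs {b}: p={p:.4f}  significant at corrected level: {p < alpha}")
```

With 28 comparisons instead of 3, the corrected per-test level drops to 0.05/28, which is why the correction bites so hard in the full test.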


Quote

So basically all the Tukey HSD says (experiment-wise confidence level is 95%) is that mpc is better than cbr192!


Hmm, in the next test we will have to decide in advance what we want to test, I guess. And preferably that should only be 4 or 5 pairs or so.

--
GCP

Reply #2
Quote
But the Fisher LSD isn't simultaneous is it? Or was it based on a normal distribution?


The Fisher LSD I use for the Friedman analysis is a non-parametric version (which doesn't assume a normal distribution).  There is a different Fisher LSD I use for blocked ANOVA.

Both are one-at-a-time multiple comparison techniques.  I guess that seems like an oxymoron, but I believe the reason why it's used (as opposed to the Wilcoxon) is that once you've gone to the trouble of calculating the rank sums for the Friedman test, you might as well use those values to perform the Fisher test.  And the reason the Friedman or ANOVA tests are performed first instead of going straight to the Wilcoxon is to make sure that there is at least one significant difference of means somewhere in the experiment.  It'd be a waste of time to perform all those Wilcoxons and find out after the fact that ANOVA or Friedman says that the difference in means was just statistical noise.
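The protected workflow described above -- an omnibus test first, pairwise comparisons only if it rejects -- might look like this in SciPy (the ratings are hypothetical placeholders):

```python
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical ratings for three codecs, one value per listener.
cbr192 = [2.0, 2.5, 3.0, 2.0, 2.5, 3.0, 2.5, 2.0]
dm_std = [3.5, 4.0, 3.5, 3.0, 4.0, 3.5, 4.0, 3.0]
mpc    = [4.5, 4.5, 4.5, 5.0, 4.25, 4.5, 5.0, 4.5]

# Step 1: omnibus Friedman test -- "is there any difference at all?"
stat, p_omnibus = friedmanchisquare(cbr192, dm_std, mpc)
print(f"Friedman: chi2={stat:.2f}, p={p_omnibus:.4f}")

# Step 2: pairwise tests are 'protected' by only running them when the
# omnibus test rejects at the 0.05 level.
pairwise = {}
if p_omnibus < 0.05:
    for name, (x, y) in [("cbr192/dm_std", (cbr192, dm_std)),
                         ("cbr192/mpc", (cbr192, mpc)),
                         ("dm_std/mpc", (dm_std, mpc))]:
        _, p = wilcoxon(x, y)
        pairwise[name] = p
        print(f"{name}: p={p:.4f}")
else:
    print("omnibus test not significant; stop here")
```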

So my question would be:  For one-at-a-time comparisons, is it preferable to use Wilcoxon or to use the Fisher LSD?  If the only rationale for using the Fisher LSD is convenience of calculation, but the Wilcoxon is more sensitive, then I'd rather use the latter -- let the software take care of laborious calculations.  And for simultaneous comparisons, is it preferable to use Bonferroni-corrected Wilcoxon, Bonferroni-corrected Fisher LSD, or Tukey's HSD?

I think you're saying, Garf, that the Wilcoxon might be the way to go for one-at-a-time tests, but perhaps the Tukey HSD would be best for simultaneous tests.

Oh, and I agree that the objectives of a test should be clearly stated up front, *before* the test is performed, and that if any relationships are not of interest, they should be excluded.  Maybe the best way to do this is to perform two types of experiments:  exploratory ones and confirmatory ones.  The exploratory ones could give a general idea of what all the relationships look like, and the confirmatory ones would test specific ones, for example dm-ins versus dm-xtrm.  The implication is that the finer the distinction you want to make, the fewer codecs should be involved.

ff123

Reply #3
Quote
Originally posted by ff123

Both are one-at-a-time multiple comparison techniques. I guess that seems like an oxymoron, 


Yes. I didn't understand it at first. (Now I do, thanks to your explanation)

Quote

but I believe the reason why it's used (as opposed to the Wilcoxon) is that once you've gone to the trouble of calculating the rank sums for the Friedman test, you might as well use those values to perform the Fisher test.


This seems very plausible, given that most of these methods predate computers.

Quote

And the reason the Friedman or ANOVA tests are performed first instead of going straight to the Wilcoxon is to make sure that there is at least one significant difference of means somewhere in the experiment.  It'd be a waste of time to perform all those Wilcoxons and find out after the fact that ANOVA or Friedman says that the difference in means was just statistical noise.


Actually, I would expect the Wilcoxon + Bonferroni / Fisher + Bonferroni / Tukey tests all to give nothing if the Friedman test fails. (Wouldn't there be a contradiction otherwise?)

Quote

So my question would be:  For one-at-a-time comparisons, is it preferable to use Wilcoxon or to use the Fisher LSD?  If the only rationale for using the Fisher LSD is convenience of calculation, but the Wilcoxon is more sensitive, then I'd rather use the latter -- let the software take care of laborious calculations.


I honestly wouldn't know. I'm a bit biased towards the Wilcoxon because the statisticians told me it was good for our purposes, so I know it's good, whereas I don't know the Fisher LSD. I think you might be right that the Fisher LSD is used for convenience of calculation.

On the other hand, you've already written the app, so perhaps you can just compare the Fisher LSD results with the Wilcoxon results and check which one is more sensitive? We can then use that one. The SPSS output is still on my page: http://home.planetinternet.be/~pascutto/AQT/OUTPUT.HTM

Also, there shouldn't be any contradictions between the two.

Quote

And for simultaneous comparisons, is it preferable to use Bonferroni-corrected Wilcoxon, Bonferroni-corrected Fisher LSD, or Tukey's HSD?


Tukey HSD, no question. It should _always_ be more sensitive than the other methods. It basically does a smarter 'correction' than the very conservative Bonferroni.
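One way to see why Tukey's HSD beats Bonferroni for all-pairs comparisons is to put both critical values on a common per-comparison scale. A sketch under a normal approximation, with the studentized-range table value q(0.05; 8, inf) ≈ 4.29 taken as an assumption:

```python
import math
from scipy.stats import norm

k = 8                      # codecs
m = k * (k - 1) // 2       # 28 pairwise comparisons
alpha = 0.05

# Bonferroni: two-sided per-comparison level alpha/m -> normal critical value.
z_bonf = norm.ppf(1 - alpha / (2 * m))

# Tukey: studentized-range critical value q(0.05; k=8, df=inf) from tables,
# divided by sqrt(2) to put it on the same per-comparison z scale.
q_crit = 4.29
z_tukey = q_crit / math.sqrt(2)

print(round(z_bonf, 3), round(z_tukey, 3))
```

Tukey's per-comparison cutoff (about 3.03) comes out below Bonferroni's (about 3.12), i.e. it declares differences more readily while still controlling the experimentwise rate.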

Quote

I think you're saying, Garf, that the Wilcoxon might be the way to go for one-at-a-time tests, but perhaps the Tukey HSD would be best for simultaneous tests.


Yes. (But I'm not sure which one of Fisher LSD/Wilcoxon is best for one-at-a-time)

Quote

Oh, and I agree that the objectives of a test should be clearly stated up front, *before* the test is performed, and that if any relationships are not of interest, they should be excluded.


Right. Also, if possible, decide in advance which one you expect to do better in the comparison (that also halves the significance level needed, due to one-tailed vs. two-tailed testing).
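That halving can be demonstrated directly: when the observed effect lies in the direction predicted in advance, the one-sided p-value is half the two-sided one. A sketch with made-up paired ratings:

```python
from scipy.stats import wilcoxon

# Hypothetical paired ratings; suppose we predicted in advance that
# codec B would score higher than codec A.
a = [3.0, 2.5, 3.5, 2.0, 3.0, 2.5, 3.5, 2.0, 3.0]
b = [3.5, 3.0, 4.0, 3.0, 3.5, 3.0, 4.5, 2.5, 3.5]

_, p_two = wilcoxon(a, b, alternative="two-sided")
_, p_one = wilcoxon(a, b, alternative="less")  # H1: a below b, as predicted
print(f"two-sided p={p_two:.4f}, one-sided p={p_one:.4f}")
```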

Quote

Maybe the best way to do this is to perform two types of experiments:  exploratory ones and confirmatory ones.  The exploratory ones could give a general idea of what all the relationships look like, and the confirmatory ones would test specific ones, for example dm-ins versus dm-xtrm.  The implication is that the finer the distinction is you want to make, the fewer codecs should be involved.


Yep. This is why the first AQ test results are of good use: we know what to test for next time.

--
GCP

Reply #4
It seems the worth of Bonferroni adjustments (perhaps even the very idea of simultaneous testing of null hypotheses) is not universally accepted in all statistical circles.

For example, this page:

http://www.bmj.com/cgi/content/full/316/71...ch=&FIRSTINDEX=

with summary points as follows:

Adjusting statistical significance for the number of tests that have been performed on study data -- the Bonferroni method -- creates more problems than it solves.

The Bonferroni method is concerned with the general null hypothesis (that all null hypotheses are true simultaneously), which is rarely of interest or use to researchers.

The main weakness is that the interpretation of a finding depends on the number of other tests performed.

The likelihood of type II errors is also increased, so that truly important differences are deemed non-significant.

Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons.

ff123

Reply #5
And another link, this one from SISA, a site where one can perform free statistical tests using a web browser.

http://home.clara.net/sisa/bonhlp.htm

This website writes:

Quote
Scenario three concerns the situation when non-predefined hypotheses are pursued using many tests, one test for each hypothesis. Basically this concerns the situation of data 'dredging' or 'fishing'; many among us will recognize correlation variables=all or t-test groups=sex(2) variables=all. Above all, this should not be done. Bonferroni correction is difficult in this situation, as the alpha level should be lowered very considerably in situations of such wealth (potentially by a factor of r*(r-1)/2, whereby r is the number of variables), and most standard statistical packages are not able to provide small enough p-values to do it. SISA's advice is, if you want to go ahead with it anyway, to test at the 0.05 level for each test. After a relationship has been found, and this relationship is theoretically meaningful, the relationship should be confirmed in a separate study. This can be done after new data is collected, or in the same study by using the 'split sample' method. The sample is split in two: one half is used to do the 'dredging', the other half is used to confirm the relationships found. The disadvantage of the split sample method is that you lose power (use the procedure power to estimate how much). A Bayesian method can be used if you want to formally incorporate the result of the original study or dredging in the confirmation process. But don't put too high a value on your original finding.


ff123

Reply #6
And one more link here:

http://149.170.199.144/resdesgn/multrang.htm

Quote
Multiple range tests can be placed into two categories. 

1. Constant LSD. In these a single LSD is found and used to compare all pairs of means. Tests differ in the algorithm used to calculate the LSD. Examples: Fisher's LSD, Tukey's HSD, Scheffé's LSD and Waller-Duncan's LSD.

2. Variable LSD. In these tests the means are ranked and the magnitude of the LSD is determined by the number of intervening means between the two being compared. Examples: the Newman-Keuls test, Duncan's multiple range test.

The second group appears to be generally less accepted and recommended than the former. The following notes about the first group are based on comments by Swallow (1984).

a. Tukey's HSD and Scheffé's LSD are too conservative; type II errors are favoured.

b. Fisher's LSD is prone to type I errors, although this is not too serious when used after rejecting an analysis of variance null hypothesis (i.e. when it is a protected test).

c. Waller-Duncan's LSD has few faults but the statistic is complex and tables are generally unavailable. 

If you require more information about multiple range tests the following are recommended: Swallow (1984), Chew (1980) and Day and Quinn (1989).


So I am getting the impression that Fisher's LSD (which I am using as a protected test) is a good approach.  However, I should remove the option in my program that allows the user to adjust the critical significance of just the LSD.  If anything, it should adjust *both* the critical significance values of the Friedman/ANOVA and the corresponding LSD tests.

Waller-Duncan's LSD might be interesting as a side study, but Fisher's LSD is very easy to calculate once a Friedman or ANOVA has been performed.

ff123

Reply #7
Situations in which Fisher's LSD is weak:

from:

http://davidmlane.com/hyperstat/B96288.html

Quote
An approach suggested by the statistician R. A. Fisher (called the "least significant difference method" or Fisher's LSD) is to first test the null hypothesis that all the population means are equal (the omnibus null hypothesis) with an analysis of variance. If the analysis of variance is not significant, then neither the omnibus null hypothesis nor any other null hypothesis about differences among means can be rejected. If the analysis of variance is significant, then each mean is compared with each other mean using a t-test. The advantage of this approach is that there is some control over the EER. If the omnibus null hypothesis is true, then the EER is equal to whatever significance level was used in the analysis of variance. In the example with the six groups of subjects given in the section on t-tests, if the .01 level were used in the analysis of variance, then the EER would be .01. The problem with this approach is that it can lead to a high EER if most population means are equal but one or two are different.


next page:

http://davidmlane.com/hyperstat/B94854.html

Quote
In the example, if a seventh treatment condition were included and the population mean for the seventh condition were very different from the other six population means, an analysis of variance would be likely to reject the omnibus null hypothesis. So far, so good, since the omnibus null hypothesis is false. However, the probability of a Type I error in one or more of the 15 t-tests computed among the six treatments with equal population means is about 0.10. Therefore, the LSD method provides only minimal protection against a high EER.


ff123

Reply #8
ff123, in the future when you want to post amendments to your previous posts before a reply has been made, could you just edit the last post instead of posting 3 or 4 replies? Thanks.

Reply #9
Quote
Originally posted by ff123

Simply describing what tests of significance have been performed, and why, is generally the best way of dealing with multiple comparisons.


This should hardly be a surprise...

The problem is presenting those results in a way that a general public without statistical background can understand what the implication of the multiple tests really is.

For some reason I feel that many people have a problem with: 'these are our results, but keep in mind that there's a 70% chance something here is incorrect'. It doesn't look very scientific, though it's perfectly OK.

Note that the contesting of the Bonferroni correction is due to its conservativeness. For us, this doesn't actually matter so much. But if you are testing whether a new medicine has an effect, you don't want to take the risk of incorrectly rejecting the hypothesis that it works. The mathematics behind it are sound.

Let it be clear that I prefer a simultaneous test over multiple 2-sample tests + correction. But I don't agree with doing a 2-sample test _without_ correction.

--
GCP

Reply #10
Quote
Originally posted by ff123
And another link, this one from SISA, a site where one can perform free statistical tests using a web browser.

http://home.clara.net/sisa/bonhlp.htm

This website writes:
ff123


Hmm, nothing new here either.

Make a test to see if there are trends.

Do another test to test those trends.

This is what you suggested just earlier.

The comment about Bonferroni is also in line with what we saw. The alpha level in the AQ test gets as low as 0.0017. That's at the limit of the accuracy SPSS uses for its results.
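That figure is just the Bonferroni split of 0.05 over the 28 pairwise comparisons among 8 codecs:

```python
k = 8
m = k * (k - 1) // 2   # 28 pairwise comparisons
alpha = 0.05 / m
print(m, alpha)        # per-test level ~0.00179 (0.0017 above is this, truncated)
```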

--
GCP

Reply #11
Quote
Originally posted by ff123
And one more link here:

http://149.170.199.144/resdesgn/multrang.htm
So I am getting the impression that Fisher's LSD (which I am using as a protected test) is a good approach. 


Hmm, I'm not convinced. I'd agree if we were talking about a small number of variables, but we've got 8.

I have this doubt because the Friedman test just says 'there is a difference between the samples'. This provides little protection if you are making 28 comparisons, though it obviously helps a lot if you only make 3 or so. It gets too easy to see false differences (i.e. Type I errors).

--
GCP

Reply #12
Quote
Originally posted by ff123

In the example, if a seventh treatment condition were included and the population mean for the seventh condition were very different from the other six population means, an analysis of variance would be likely to reject the omnibus null hypothesis. So far, so good, since the omnibus null hypothesis is false. However, the probability of a Type I error in one or more of the 15 t-tests computed among the six treatments with equal population means is about 0.10. Therefore, the LSD method provides only minimal protection against a high EER.
ff123


(What's EER?)

I think this is basically saying what I said in my prev post, namely that when you make a lot of comparisons the fact that you know that 'there is a difference between samples' is not enough protection to prevent you from seeing differences where there aren't any.

--
GCP

Reply #13
Youri, I'll modify posts into one if they haven't been replied to yet.  What is the purpose of this, though?  Am I bumping this thread each time I post a new message, which doesn't happen if I just modify an older one?

Garf,

EER = Experiment-wise error rate.

Basically, the difference between using a simultaneous vs. a one-at-a-time method is the difference between trying to control a Type I error (a false difference between codecs is identified) vs. a Type II error (a true difference between codecs is not identified).  That's also what I mean by being "conservative" or "aggressive" about analyzing the data.  If you're looking for an airtight conclusion (mpc is better than cbr192), Tukey's HSD will give you one, but it probably won't be very useful.  On the other hand, if you're looking for some insight and are willing to accept some risk of a Type I error, Fisher's protected LSD is much more sensitive.

This seems to be an area of controversy in statistics, just like there's a minor controversy over whether one-tailed tests of significance should be used (some conservative statisticians say that a two-tailed test should always be used, even in a confirmatory study, because if you're bothering to perform a test there must be some uncertainty about the outcome).

Perhaps a compromise solution that could accommodate us both would be to use Waller-Duncan's k-ratio t test, which, unlike Tukey's test, doesn't operate on the principle of controlling Type I error.  Instead, it weighs the Type I and Type II error rates against each other based on Bayesian principles.  The only problem, I think, is that in the limited net search I've made so far, I haven't seen whether there is a non-parametric version of this.

ff123

Reply #14
Quote
Originally posted by ff123
Youri, I'll modify posts into one if they haven't been replied to yet.  What is the purpose of this, though?  Am I bumping this thread each time I post a new message, which doesn't happen if I just modify an older one?


Well for the record I don't really think there is anything wrong with posting multiple replies as long as it doesn't become redundant.  Multiple replies would bump the thread multiple times too, but again I don't see much of a problem.  I do see benefit in trying to keep all the posts consolidated if possible, but if the discussion is moving right along then it seems fine to me.

Just my 2 cents.

Reply #15
Found this powerpoint slideshow on the net:

http://www.css.orst.edu/TEACHING/Graduate/...ures/CHAP5A.PPT

Here are some relevant quotes:

Quote
The winner among winner pickers -- Cramer and Swanson (1973) conducted a computer simulation study involving 88,000 differences. They compared LSD, FPLSD, HSD, SNK, and BLSD. Both FPLSD and BLSD were better in their ability to protect against type I error and also in their power to detect real differences when they exist; none of the other methods came close.


LSD = Fisher's LSD, without using an F test first
FPLSD = Fisher's protected LSD, only run if F test proves significant
HSD = Tukey's HSD
SNK = Student Newman Keuls test
BLSD = Bayes LSD (also known as Waller-Duncan's protected LSD)

Quote
The edge goes to BLSD... -- BLSD is preferred by some because it is a single value and therefore easy to use: it is larger when F indicates that the means are homogeneous, and smaller when the means appear to be heterogeneous.  But the necessary tables may not be available, so FPLSD is quite acceptable.


I'd like to get my hands on the Cramer and Swanson paper and also on the book which has the BLSD tables.  I wonder which book has them?  If I can get a hold of the tables, I can probably brute force the calculations by table lookup in my program.

ff123


Reply #17
Thanks Citay, but I did some digging, and I think the following papers are relevant to the Bayes LSD:

Waller, R.A. and Duncan, D.B. (1969) "A Bayes Rule for the Symmetric Multiple Comparison Problem", Journal of the American Statistical Association 64, pp. 184-199

Waller, R.A. and Kemp, K.E. (1975) "Computations of Bayesian t-Values for Multiple Comparisons", Journal of Statistical Computation and Simulation (Vol 4, no. 3), pp. 169-172

Swallow, W.H. (1984) "Those Overworked and Oft-Misused Mean Separation Procedures - Duncan's, LSD, etc.", Plant Disease 68, pp. 919-921

And a couple of books:

An Introduction to Statistical Methods and Data Analysis, 5th Ed., 2000, R. Lyman Ott, Duxbury Press, Belmont CA
Amazon link:  http://www.amazon.com/exec/obidos/ASIN/053...4136099-6862928

Principles and Procedures of Statistics: A Biometrical Approach, 3rd Ed., 1996, Robert Steel and James Torrie
Amazon link:  http://www.amazon.com/exec/obidos/ASIN/007...4136099-6862928

ff123

Reply #18
Thanks ff123. I'll have a look through the university library and check if they happen to have any of the relevant material.

If you know of anything that discusses the link between the Friedman protection and a high number of comparisons, please let us know. I'm a bit worried about it.

Edit: Hmm, also, aren't most of the methods discussed versions for the normal distribution?

--
GCP

Reply #19
Quote
If you know of anything that discusses the link between the Friedman protection and a high number of comparisons, please let us know. I'm a bit worried about it.


The SAS website has this:

Quote
It has been suggested that the experimentwise error rate can be held to the α level by performing the overall ANOVA F-test at the α level and making further comparisons only if the F-test is significant, as in Fisher's protected LSD. This assertion is false if there are more than three means (Einot and Gabriel 1975). Consider again the situation with ten means. Suppose that one population mean differs from the others by such a sufficiently large amount that the power (probability of correctly rejecting the null hypothesis) of the F-test is near 1, but that all the other population means are equal to each other. There will be 9(9 - 1)/2 = 36 t tests of true null hypotheses, with an upper limit of 0.84 on the probability of at least one Type I error. Thus, you must distinguish between the experimentwise error rate under the complete null hypothesis, in which all population means are equal, and the experimentwise error rate under a partial null hypothesis, in which some means are equal but others differ.


So this supports the position that Fisher's protected LSD is not so protected for the case where there are a lot of means close to each other but one or two which are very different, as pointed out earlier.
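The 0.84 upper limit in the SAS quote follows from the standard bound for independent tests at a given level:

```python
# Upper bound on the familywise (experimentwise) error rate for m independent
# tests, each at level alpha: P(at least one Type I error) <= 1 - (1 - alpha)^m.
alpha = 0.05
m = 9 * 8 // 2             # 36 pairwise tests among the 9 equal means
p_any = 1 - (1 - alpha) ** m
print(m, round(p_any, 2))  # 36 0.84 -- the upper limit quoted above
```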

Quote
Edit: Hmm, also, aren't most of the methods discussed versions for the normal distribution?


Yes.  I am thinking of looking through a book by Hollander and Wolfe, which concentrates on non-parametric methods, to see if they cover the Waller-Duncan Bayes LSD.

ff123

Reply #20
Quote
Well for the record I don't really think there is anything wrong with posting multiple replies as long as it doesn't become redundant. Multiple replies would bump the thread multiple times too, but again I don't see much of a problem. I do see benefit in trying to keep all the posts consolidated if possible, but if the discussion is moving right along then it seems fine to me.
Yeah, the reason it's normally not allowed is to keep people from bumping their own threads all the time, or adding to the reply count just to make their thread look popular (yes, some people apparently worry about that). It's probably just a pet peeve of mine, developed from visiting a lot of übermoderated fora.  Actually, it's mainly meant to prevent posts like:

"Hi, I'm Youri! How are you all doing?"
"Oh, I'm fine btw!"

In this case, the response could simply be edited into the original posts. That's why I was only speaking of amendments - if a reply to your post has already been made and you still want to make an amendment, it's usually better to post a reply instead of editing your original post, because otherwise people may not notice it.

But I'm making a bigger problem out of it than it is, so carry on.

Reply #21
I have completed the code to perform an optional Tukey's HSD (either parametric or non-parametric).  Version 1.20 of friedman.exe with source is at:

http://ff123.net/friedman/friedman120.zip

This version also outputs an ANOVA table, if that option is specified, and generates a matrix of difference values to show how the means or ranksums are separated.

ff123

Reply #22
Very cool programming work!

I have some more problems:

a) What can we do with data that is partially normal? For example, in the 128kbps test most data seems normal, with the possible exception of the mpc and Xing results, which 'bump up' against the ends of the rating scale. Is ANOVA permissible here?

b) What happens if we transform the data relative to mpc? (i.e. subtract the mpc score from everything)

b1) does it change any results?

b2) does it make the data 'more' normal?

--
GCP

Reply #23
Quote
a) What can we do with data that is partially normal? For example, in the 128kbps test most data seems normal with the possible exception of the mpc and xing results, who 'bump up' to the ends of the rating scale? Is ANOVA permissible here?


For the dogies test it doesn't matter if you choose ANOVA or Friedman, as long as the Fisher LSD is used.

Here is a good page on how to choose a statistical test:

http://www.graphpad.com/www/book/Choose.htm

A couple of quotes of interest:

"Remember, what matters is the distribution of the overall population, not the distribution of your sample. In deciding whether a population is Gaussian, look at all available data, not just data in the current experiment."

and:

"When in doubt, some people choose a parametric test (because they aren't sure the Gaussian assumption is violated), and others choose a nonparametric test (because they aren't sure the Gaussian assumption is met)."

Quote
b) What happens if we tranform the data relative to mpc? (i.e. subtract mpc score from everything)


Nonparametric results should remain the same as long as the relative rankings are not changed.  I don't know how the ANOVA results change.
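The nonparametric invariance is easy to demonstrate: subtracting each listener's own mpc score is a constant shift within that listener's row, so within-listener rankings, and hence the Friedman statistic, are unchanged. A sketch with placeholder scores:

```python
from scipy.stats import friedmanchisquare

# Hypothetical scores: rows = listeners, columns = (mpc, dm_std, cbr192).
rows = [
    (4.5, 4.0, 3.0),
    (4.0, 3.5, 2.5),
    (5.0, 4.0, 3.5),
    (4.5, 3.5, 3.0),
    (4.0, 4.5, 2.5),
    (4.5, 4.0, 2.0),
]
mpc, dm_std, cbr192 = zip(*rows)
stat_raw, _ = friedmanchisquare(mpc, dm_std, cbr192)

# Subtract each listener's own mpc score from that listener's three ratings.
# A per-row constant shift leaves the within-listener rankings unchanged.
shifted = [(a - a, b - a, c - a) for a, b, c in rows]
mpc2, dm2, cbr2 = zip(*shifted)
stat_shift, _ = friedmanchisquare(mpc2, dm2, cbr2)

print(stat_raw == stat_shift)  # True: the Friedman statistic is rank-based
```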

ff123

Reply #24
Quote
Originally posted by ff123


For the dogies test it doesn't matter if you choose ANOVA or Friedman, as long as the Fisher LSD is used.

Here is a good page on how to choose a statistical test:

http://www.graphpad.com/www/book/Choose.htm

A couple of quotes of interest:

"Remember, what matters is the distribution of the overall population, not the distribution of your sample. In deciding whether a population is Gaussian, look at all available data, not just data in the current experiment."


Hmm, yeah, but I think the non-normal look of the Xing/mpc results will stay even if we add more listeners.

The distribution we have looks normal, but it has a 'pile-up' effect at the ends of the scale for the hardest and lowest-scoring samples. The AQ test has this too, as it consists entirely of hard samples.

Although the data fails a normality test, your comment above has me in doubt. Is this 'clipping' effect described somewhere?

If it turned out that we can still use methods based on the normal distribution even though a normality test fails, that would be a major help...

Edit: Hmm, interesting link:

Choosing between parametric and nonparametric tests is sometimes easy. You should definitely choose a parametric test if you are sure that your data are sampled from a population that follows a Gaussian distribution (at least approximately). You should definitely select a nonparametric test in three situations:

• The outcome is a rank or a score and the population is clearly not Gaussian. Examples include class ranking of students, the Apgar score for the health of newborn babies (measured on a scale of 0 to 10, where all scores are integers), the visual analogue score for pain (measured on a continuous scale where 0 is no pain and 10 is unbearable pain), and the star scale commonly used by movie and restaurant critics (* is OK, ***** is fantastic).

'the visual analogue scale for pain' ... doesn't this apply to the Xing scores? 

• The data are measurements, and you are sure that the population is not distributed in a Gaussian manner. If the data are not sampled from a Gaussian distribution, consider whether you can transform the values to make the distribution become Gaussian. For example, you might take the logarithm or reciprocal of all values. There are often biological or chemical reasons (as well as statistical ones) for performing a particular transform.

Interesting...I need to think about this.

--
GCP

 