
Statistical Methods for Listening Tests (split R3mix VBR s...)

Reply #50
BTW, the p-value adjustments page I linked to says that the resampling-based p-value adjustment can be made even more sensitive, while still controlling the familywise error rate, by using a stepdown method.

ff123

Reply #51
Quote
Hmm, that's not acceptable for doing actual analysis on though


Post-screening is not a priori ruled out; it depends on how the data was collected.  BS.1116-1 has this to say:

Quote
Post-screening methods can be roughly separated into at least two classes; one is based on inconsistencies compared with the mean result and another relies on the ability of the subject to make correct identifications. The first class is never justifiable. Whenever a subjective listening test is performed with the test method recommended here, the required information for the second class of post-screening is automatically available. A suggested statistical method for doing this is described in Appendix 1.

The methods are primarily used to eliminate subjects who cannot make the appropriate discriminations. The application of a post-screening method may clarify the tendencies in a test result. However, bearing in mind the variability of subjects’ sensitivities to different artefacts, caution should be exercised.


So, if ABC/HR is used to collect the data (the reference is rated each time a sample is rated), post-screening can be used as described in Appendix 1 of that document.  It is too long to paste here, but supposedly BS.1116-1 can be had for free these days.  See one of 2Bdecided's posts on the r3mix forum.

I agree that post-screening as I describe it is not appropriate for the AQ1 test.  I was just commenting on why dm-std and dm-xtrm seemed to swap places depending on whether a parametric or non-parametric method is used.

ff123

Reply #52
I think we may get lucky regarding the effects of the nonrandom order.  Even though we don't know exactly how much it applied, if at all, we can check where it would have applied, if present.

The two most extreme settings in this test were cbr192 (rated very low) and mpc (rated very high).  If the effect is present, one would expect the codec(s) tested just after cbr192 to be rated higher than they should be, and the one(s) tested just after mpc to be rated lower than they should be.

However, the conclusions we have reached so far for the codecs after cbr192 point downward (abr224 < mpc, r3mix < mpc), meaning any inflation of their ratings would only have worked against those conclusions.  So I think we can already say that if the effect is present, it did not endanger the conclusions there.

--
GCP

Reply #53
Since Garf is having trouble accessing the website here today, he asked me to post this for him:

Garf Wrote:
Quote
Results after 25000^2 resamples (10+ hours of CPU time):

cbr192 is worse than abr224 (0.02208 vs 0.48052)
cbr192 is worse than r3mix (0.04712 vs 0.71088)
cbr192 is worse than dm-xtrm (0.00024 vs 0.01144)
cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-ins (0.00316 vs 0.11308)
cbr192 is worse than cbr256 (0.01032 vs 0.28664)
cbr192 is worse than dm-std (0.00012 vs 0.00588)
abr224 is worse than dm-xtrm (0.07908 vs 0.85528)
abr224 is worse than mpc (0.00080 vs 0.03504)
abr224 is worse than dm-std (0.03364 vs 0.60492)
r3mix is worse than dm-xtrm (0.04024 vs 0.66340)
r3mix is worse than mpc (0.00040 vs 0.01824)
r3mix is worse than dm-std (0.01436 vs 0.36392)
dm-xtrm is worse than mpc (0.04596 vs 0.70132)
mpc is better than dm-ins (0.00808 vs 0.23876)
mpc is better than cbr256 (0.00304 vs 0.10972)
cbr256 is worse than dm-std (0.06392 vs 0.80176)


He said you would know how to interpret them, ff123.

Reply #54
With 95% confidence for the entire experiment, not just for individual pair comparisons, one can state that:

1. mpc is better than abr224, r3mix, and cbr192
2. dm-xtrm is better than cbr192
3. dm-std is better than cbr192

I assume this was done using the rank data (not the raw ratings data), and that the figure of merit was the mean of the ranks (equivalent to using the rank sums).

This result shows that the resampling method is even more sensitive than the Friedman / Fisher LSD, while affording greater confidence in the result to boot.

It will be even more interesting to see whether the results change when a stepdown technique is incorporated to further adjust the p-values.  That promises to increase the sensitivity still further.

ff123

Reply #55
Ok, Garf wanted me to post some more info.

Quote

<Garf> I used raw ratings data, and means.
<Garf> (which is just as well, and even more powerful)

Reply #56
<Garf> I used raw ratings data, and means.
<Garf> (which is just as well, and even more powerful)

Ok, then the classical analog would be the blocked ANOVA / Fisher LSD.

Garf, are you sure that the way you randomize (choosing listeners with replacement) is not significantly different from choosing listeners without replacement?  I would be interested to see a comparison of the two methods.
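
To be concrete about the difference, a minimal sketch (these function names are mine, not anything in bootstrap.c; requires <stdlib.h> for rand()):

Code:
/* Draw n listener indices WITH replacement (bootstrap style):
   the same listener can be picked several times in one resample. */
void draw_with_replacement(int idx[], int n) {
    int i;
    for (i = 0; i < n; i++)
        idx[i] = rand() % n;
}

/* Draw n listener indices WITHOUT replacement (permutation style):
   a Fisher-Yates shuffle, so each listener appears exactly once. */
void draw_without_replacement(int idx[], int n) {
    int i, j, t;
    for (i = 0; i < n; i++)
        idx[i] = i;
    for (i = n - 1; i > 0; i--) {
        j = rand() % (i + 1);
        t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
}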

ff123

Reply #57
Quote

<Garf> I do not use replacement
<Garf> I changed that after [your] first comment


Garf says you can check it with the utility; he uploaded the latest version here:

sjeng.org/ftp/bootstrap.c

Reply #58
Ok, cool!

The blocked ANOVA / Fisher LSD seems to imply that further sensitivity is possible, although for all we know those results are incorrect, both because the method's assumptions are being ignored and because familywise error is not well controlled.

Can you figure out how the stepdown is supposed to work?  I haven't looked at it very carefully.

ff123

Reply #59
Hmmm,

If I understand the procedure correctly, essentially all the simulation work is already done, and the stepdown is extremely painless.

So after the stepdown correction, the adjusted p-values would be:

1. mpc > cbr192:  padj = 28 * 0.00000 = 0.00000
2. dm-std > cbr192: padj = max(0.00000, 27 * 0.00012) = 0.00324
3. dm-xtrm > cbr192: padj = max(0.00324, 26 * 0.00024) = 0.00624
4. mpc > r3mix: padj = max(0.00624, 25 * 0.00040) = 0.01000
5. mpc > abr224: padj = max(0.01000, 24 * 0.00080) = 0.01920

6. mpc > cbr256: padj = max(0.01920, 23 * 0.00304) = 0.06992

So after the stepdown correction, the mpc > cbr256 comparison comes closer to meeting the 0.05 significance criterion, but no cigar.
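
For reference, the whole correction above is just one pass over the sorted p-values; a minimal sketch (the function name is mine) that reproduces the numbers above when called with the six smallest p-values and m = 28:

Code:
/* Bonferroni (Holm) step-down adjustment.  p_sorted[] holds the n
   smallest p-values in ascending order; m is the total number of
   pairwise comparisons (28 here).  The running max mirrors the
   max(...) terms in the hand calculation above. */
void bonferroni_stepdown(const double p_sorted[], double p_adj[],
                         int n, int m) {
    int i;
    double prev = 0.0, adj;
    for (i = 0; i < n; i++) {
        adj = (m - i) * p_sorted[i];
        if (adj > 1.0) adj = 1.0;
        if (adj < prev) adj = prev;   /* enforce monotonicity */
        p_adj[i] = adj;
        prev = adj;
    }
}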

ff123

Edit:  Something tells me I didn't do this correctly, because I basically ignored the adjusted p-values obtained from the bootstrap adjustment.  If I was going to do that, why didn't I just start with the basic 25,000-trial run?

Edit 2:  OK, let's try this again.  When calculating the ordinary bootstrap p-value adjustments for the AQ1 data set, there are 28 p-value counters, one for each pairwise comparison.  A new set of 28 block p-values is computed after each block of 25,000 trials.  A particular counter is incremented if one or more of those 28 block p-values is less than or equal to that comparison's actual p-value.  The adjusted p-values are the proportion of counts after 25,000 blocks of 25,000 trials each are run.

To calculate the stepdown p-value adjustments, the most extreme p-value counter (mpc vs. cbr192) is incremented after each 25,000-trial block as described above.  However, the next most extreme p-value counter (dm-std vs. cbr192) is incremented or not only after excluding the value for the most extreme p-value counter.  The third most extreme p-value counter excludes the first two counters, etc.

I think I have that correct, now.
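
In code, the counting rule I just described might look something like this (a sketch with invented names; both arrays are ordered by the extremity of the actual comparisons, so index 0 is mpc vs. cbr192; the single-step adjustment would instead use the overall minimum for every counter):

Code:
#define NCOMP 28   /* pairwise comparisons in the AQ1 test */

/* Called once per block of 25,000 trials.  p_actual[] holds the
   actual p-values, p_block[] this block's resampled p-values.
   Walking backward keeps a running minimum, so counter k sees
   only comparisons k and beyond -- the more extreme comparisons
   are excluded.  Adjusted p = count[k] / (number of blocks). */
void update_stepdown_counts(const double p_actual[NCOMP],
                            const double p_block[NCOMP],
                            long count[NCOMP]) {
    int k;
    double min_p = 1.0;
    for (k = NCOMP - 1; k >= 0; k--) {
        if (p_block[k] < min_p)
            min_p = p_block[k];
        if (min_p <= p_actual[k])
            count[k]++;
    }
}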

Edit 3:  The initial stepdown calculation I made above was actually a Bonferroni stepdown adjustment.  It is still valid; the advantage is that one doesn't have to run 25,000^2 trials, just the single 25,000-trial block.

Reply #60
Quote
Originally posted by ff123

I think I have that correct, now.


I agree.

I'd add medians and stepdown correction, but I'm rather busy with other things right now, so feel free to....

Edit: where you say:

However, the next most extreme p-value counter (dm-std vs. cbr192) is incremented or not only after excluding the value for the most extreme p-value counter

Don't you mean 'excluding the most extreme p-value'?

--
GCP

Reply #61
Quote
I agree. 

I'd add medians and stepdown correction, but I'm rather busy with other things right now, so feel free to....


I won't be around this weekend to play, so I guess this will have to wait.  BTW, my Celeron 800 does a 1000 x 1000 simulation in about 2 minutes 20 seconds, quite a bit slower than your 1 GHz Athlon.  Using MSVC 6 instead of djgpp doesn't really help much.  A 25,000 x 25,000 simulation would take about 24 hours.  I can see why resampling techniques have taken so long to come into their own.

Quote
However, the next most extreme p-value counter (dm-std vs. cbr192) is incremented or not only after excluding the value for the most extreme p-value counter 

Don't you mean 'excluding the most extreme p-value'?


Yes.

Reply #62
medians (1/2):

(only ran 1000^2 trials this time)

cbr192 is worse than abr224 (0.01600 vs 0.15700)
cbr192 is worse than r3mix (0.01800 vs 0.16400)
cbr192 is worse than dm-xtrm (0.00000 vs 0.00000)
cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-ins (0.00000 vs 0.00000)
cbr192 is worse than cbr256 (0.00100 vs 0.01600)
cbr192 is worse than dm-std (0.00000 vs 0.00000)

First quartile (1/4):

cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-std (0.03800 vs 0.34800)
abr224 is worse than mpc (0.00000 vs 0.00000)
abr224 is worse than dm-std (0.03600 vs 0.26900)
r3mix is worse than mpc (0.00000 vs 0.00000)
r3mix is worse than dm-std (0.03900 vs 0.41400)
mpc is better than dm-ins (0.00100 vs 0.02500)
mpc is better than cbr256 (0.00000 vs 0.00000)
cbr256 is worse than dm-std (0.04000 vs 0.41400)

1/3:

cbr192 is worse than r3mix (0.00800 vs 0.15400)
cbr192 is worse than dm-xtrm (0.00000 vs 0.00000)
cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-ins (0.01200 vs 0.23400)
cbr192 is worse than cbr256 (0.00900 vs 0.15400)
cbr192 is worse than dm-std (0.00300 vs 0.11300)
abr224 is worse than dm-xtrm (0.00900 vs 0.15400)
abr224 is worse than mpc (0.00000 vs 0.00000)
abr224 is worse than dm-std (0.01100 vs 0.19100)
r3mix is worse than mpc (0.02200 vs 0.35200)
mpc is better than dm-ins (0.02000 vs 0.34300)
mpc is better than cbr256 (0.02900 vs 0.41000)

Neither of these is as sensitive as plain means.  Additionally, it's harder to attach a meaning to the results.  (With means you can say: people graded x higher on average.  I can't think of anything similar for either of the test statistics above.)
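
For concreteness, here's a sketch of how such a quantile test statistic could be computed (function names are mine; the floor-index rule is one of several possible conventions):

Code:
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* q-quantile of x[0..n-1]: q = 0.5 gives the median, 0.25 the
   first quartile, 1.0/3.0 the "1/3" statistic above.
   Sorts x in place; pass a scratch copy if order matters. */
double quantile(double x[], int n, double q) {
    qsort(x, n, sizeof(double), cmp_double);
    return x[(int)(q * (n - 1))];   /* simple floor-index rule */
}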

--
GCP

Reply #63
Ok, I believe I have modified your source correctly to implement stepdown using resampling.  Here is a run of 2000 x 2000:

cbr192 is worse than dm-xtrm (0.00000 vs 0.00000)
cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-std (0.00000 vs 0.00000)
abr224 is worse than mpc (0.00000 vs 0.00000)
r3mix is worse than mpc (0.00050 vs 0.02200)

mpc is better than cbr256 (0.00350 vs 0.11400)
cbr192 is worse than dm-ins (0.00450 vs 0.13450)
mpc is better than dm-ins (0.00750 vs 0.20550)
cbr192 is worse than cbr256 (0.00800 vs 0.21150)
r3mix is worse than dm-std (0.01800 vs 0.38100)
cbr192 is worse than abr224 (0.02150 vs 0.41050)
abr224 is worse than dm-std (0.03450 vs 0.51900)
r3mix is worse than dm-xtrm (0.04050 vs 0.56300)
cbr192 is worse than r3mix (0.04500 vs 0.58900)
dm-xtrm is worse than mpc (0.04700 vs 0.56600)

The first five conclusions, taken together, are significant with 95% confidence.

I've placed the modified source at:

http://ff123.net/export/bootstrap.c

The changes aren't necessarily pretty, but I think it works.

ff123

Edit:  10,000 x 10,000 run:

cbr192 is worse than mpc (0.00000 vs 0.00000)
cbr192 is worse than dm-xtrm (0.00010 vs 0.00440)
cbr192 is worse than dm-std (0.00010 vs 0.00400)
r3mix is worse than mpc (0.00040 vs 0.01490)
abr224 is worse than mpc (0.00050 vs 0.01740)

mpc is better than cbr256 (0.00230 vs 0.07020)
cbr192 is worse than dm-ins (0.00370 vs 0.10790)
mpc is better than dm-ins (0.00760 vs 0.18750)
cbr192 is worse than cbr256 (0.01030 vs 0.23310)
r3mix is worse than dm-std (0.01570 vs 0.31730)
cbr192 is worse than abr224 (0.02170 vs 0.39190)
abr224 is worse than dm-std (0.03590 vs 0.52930)
r3mix is worse than dm-xtrm (0.04160 vs 0.56070)
cbr192 is worse than r3mix (0.04330 vs 0.56710)
dm-xtrm is worse than mpc (0.04350 vs 0.53840)

Reply #64
Just for kicks, I thought I'd try it on the dogies.wav test data, and for 3000 x 3000 trials I got:

MPC is better than XING (0.00000 vs 0.00000)
AAC is better than XING (0.00000 vs 0.00000)

AAC is better than LAME (0.00233 vs 0.05367)
MPC is better than WMA (0.00233 vs 0.05100)
MPC is better than LAME (0.00333 vs 0.06333)
OGG is better than XING (0.00333 vs 0.05867)
AAC is better than WMA (0.00433 vs 0.06867)
MPC is better than OGG (0.00533 vs 0.07700)
LAME is better than XING (0.00767 vs 0.09167)
WMA is better than XING (0.00967 vs 0.10167)
AAC is better than OGG (0.01267 vs 0.11700)

So from the looks of it, this method is still quite conservative when compared with Friedman or ANOVA with Fisher's LSD.

Either that, or I did the stepdown incorrectly.

ff123

Edit:  in fact, Tukey's HSD is less conservative than this!

Reply #65
Resampling methods are generally considered 'debatable' for sample sizes of 20-30, and are only generally accepted for samples larger than 30.

Using them with a sample size of 12 is probably going to kill you.  There's no need for it either, as the 128 kbps data looked normal enough that parametric methods will work.

--
GCP

Reply #66
I'm going to go over bootstrap.c with a fine-tooth comb tonight.  I can already see that it has some errors related to floating-point comparisons, which should always include the "DELTA" fuzz.  These should be minor, though.
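
That is, instead of raw == or <= on floats, comparisons along these lines (the DELTA value here is illustrative, not the one in bootstrap.c):

Code:
#include <math.h>

#define DELTA 1e-9   /* illustrative tolerance only */

/* Two floating-point results within DELTA are treated as equal,
   so accumulated rounding error can't flip a comparison. */
int fuzzy_eq(double a, double b)  { return fabs(a - b) < DELTA; }
int fuzzy_leq(double a, double b) { return a <= b + DELTA; }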

ff123

Reply #67
Ok,

Combing finished.  I changed some code to use long variables instead of float, which sidesteps some issues I was having with the DELTA fuzz thingy.  The results are the same as far as I can tell, at least at the 1000 x 1000 level.
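
The idea, roughly: if the ratings carry a fixed number of decimal places, they can be scaled to exact integers and no fuzz is needed at all.  A sketch (the scale factor of 10, i.e. one decimal place, is an assumption, as is the function name):

Code:
/* Convert a rating such as 3.7 to the exact long 37, so sums and
   comparisons are exact integer arithmetic with no DELTA fuzz.
   Assumes one decimal place of precision in the ratings. */
long rating_to_long(double rating) {
    return (long)(rating * 10.0 + 0.5);   /* round, don't truncate */
}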

http://ff123.net/export/bootstrap.c

ff123

Reply #68
In reading Resampling-Based Multiple Testing (Westfall & Young), I note that there is a way to step down that is potentially even more powerful than the method Garf and I are using in bootstrap.c.  The book makes a distinction between free step down and restricted step down.  The idea is to restrict the hypotheses considered to those sets whose simultaneous truth is not contradictory.

In a free stepdown for a 6-treatment test, the multipliers would be:  15, 14, 13, 12, ..., 1.

In a restricted stepdown for the same number of treatments, a conservative adjustment (not quite optimal, but conveniently available in a table) yields the multipliers:  15, 10, 10, 10, 10, 10, 7, 7, 7, 6, 4, 4, 3, 2, 1.

This is a substantial improvement over the free stepdown.  It would be interesting to implement it in bootstrap.c (which should really be called resampling.c).
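
Both variants are the same loop with a different multiplier table; a sketch (names are mine) for the 6-treatment case above:

Code:
/* Step-down adjustment with an arbitrary multiplier table.
   p_sorted[] is ascending; the running max keeps the adjusted
   values monotone, as in the free (Holm-style) stepdown. */
void stepdown_with_table(const double p_sorted[], const int mult[],
                         double p_adj[], int n) {
    int i;
    double prev = 0.0, adj;
    for (i = 0; i < n; i++) {
        adj = mult[i] * p_sorted[i];
        if (adj > 1.0) adj = 1.0;
        if (adj < prev) adj = prev;
        p_adj[i] = adj;
        prev = adj;
    }
}

/* 6 treatments -> 15 pairwise comparisons */
static const int free_mult[15] =
    { 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };
static const int restricted_mult[15] =   /* table from the book */
    { 15, 10, 10, 10, 10, 10, 7, 7, 7, 6, 4, 4, 3, 2, 1 };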

I will scan in the relevant pages and send this to Garf.

ff123

Reply #69
Hmmm,

It appears that permutation (what the current program does) is not the optimal resampling method to use with restricted step down.  For example, see:

http://www.sas.com/service/library/periodi...s/obswww23/#s05

In Resampling-Based Multiple Testing, Westfall and Young voice the same concerns (not surprisingly, since Westfall is involved in SAS/STAT).

So perhaps it is time to backtrack a bit and get bootstrap resampling working in the program.  A few adjustments are needed, however, because just comparing the treatment means directly is not adequate for bootstrap resampling.  Instead, a t statistic should be calculated using a "shift" and "pivot" method.
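
As I understand the shift-and-pivot idea: the resampling is done from data that has first been centred on each treatment's mean (the "shift", which makes the null hypothesis true in the resampling world), and each comparison is then judged by a t-type statistic rather than a raw mean difference (the "pivot").  A sketch for one pairwise comparison (names are mine):

Code:
#include <math.h>

/* t-type statistic for the difference in means of two bootstrap
   samples x[] and y[] of size n, drawn from the mean-centred data.
   Dividing by the standard error makes the statistic pivotal. */
double boot_t(const double x[], const double y[], int n) {
    int i;
    double mx = 0.0, my = 0.0, vx = 0.0, vy = 0.0;
    for (i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n;  my /= n;
    for (i = 0; i < n; i++) {
        vx += (x[i] - mx) * (x[i] - mx);
        vy += (y[i] - my) * (y[i] - my);
    }
    vx /= (n - 1);  vy /= (n - 1);
    return (mx - my) / sqrt(vx / n + vy / n);
}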

Also, I may be missing some information that would make the restricted step down calculation easier:  Westfall appears to have done some work in 1997 and "devised a minP-based method for any set of linear contrasts that respects the collapsing in the closure tree, as well as intercorrelations among the variables."

But first, I'd like to get bootstrap resampling working and giving the same results as the permutation resampling.

ff123

Edit:  There is a PDF paper by Westfall which mentions minP here:

http://www.unibas.ch/psycho/BBS/stuff/westfall.pdf

and the reference paper (which I don't have):
Westfall, P.H. (1997). Multiple testing of general contrasts using logical constraints and correlations. Journal of the American Statistical Association, 92, 299-306.

 

Reply #70
Quote
and the reference paper (which I don't have):
Westfall, P.H. (1997). Multiple testing of general contrasts using logical constraints and correlations. Journal of the American Statistical Association, 92, 299-306.


I copied this paper from my local community college library, and it's very interesting, although I think it will take some time for me to fully absorb.  It's definitely not plug-and-play.  But in short, I believe it presents a simple and efficient algorithm that allows one to take advantage of logical constraints when performing stepdown adjustments, which should make the (bootstrap) resampling analysis more powerful.

Too bad a piece of (free) code that performs this type of analysis doesn't already exist somewhere.  There's SAS/STAT, but they don't even list a price (you have to call them), so I figure it must be exorbitant.

ff123

Edit:  Hmm.  Peter Westfall seems to have made a piece of code available here:

http://lib.stat.cmu.edu/jasasoftware/mtest

Reply #71
I don't know who's actually reading these posts, but for me they're a kind of logbook.  I'm working on version 0.3 of bootstrap.c, which will deprecate the permutation resampling in favor of bootstrap resampling.  I think I finally understand how the simplest bootstrap algorithm works (I'm talking single step, not even the free step down, much less the restricted step down!), and the book has a good example which I should be able to replicate to test the program.

Running the current permutation resampling code on that example, it's clear that the program has a lot of room for improvement (read: potential for an increase in power).

ff123

Reply #72
Quote
I don't know who's actually reading these posts...


Please, carry on posting them.
I don't have any time to contribute to audio tests/methodology at the moment (work + real life getting in the way), but I'm finding what you are writing very interesting, if a little outside my sphere of understanding -- I've only met bootstrapping before in the context of evolutionary phylogenetics from aligned DNA sequence data.

Reply #73
Finished the single-step bootstrap and verified that it gives the same results as the example in the book.  I am able to tweak it even further by restricting the resampling values for each listener to the values given by that listener.  It is the same idea, I think, as a blocked ANOVA vs. a regular ANOVA.
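
A minimal sketch of that within-listener restriction (array and function names are mine, not from the actual program):

Code:
#include <stdlib.h>

/* ratings[i][j]: rating given by listener i to treatment j.
   Each bootstrap draw for listener i picks only from row i, so
   between-listener level differences never enter the resampled
   data -- the same effect blocking has in an ANOVA. */
void resample_within_listener(double **ratings, double **resampled,
                              int listeners, int treatments) {
    int i, j;
    for (i = 0; i < listeners; i++)
        for (j = 0; j < treatments; j++)
            resampled[i][j] = ratings[i][rand() % treatments];
}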

The results for the AQ1 data are:

Code:
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192

mpc      0.4216   0.2928   0.1263   0.0822   0.0539   0.0342*  0.0016*
dm-std      --    0.8035   0.4672   0.3486   0.2592   0.1871   0.0181*
dm-xtrm     --       --    0.6323   0.4909   0.3788   0.2842   0.0342*
dm-ins      --       --       --    0.8332   0.6877   0.5530   0.1004
cbr256      --       --       --       --    0.8482   0.7019   0.1517
abr224      --       --       --       --       --    0.8482   0.2139
r3mix       --       --       --       --       --       --    0.2928


         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192

mpc      0.8494   0.5887   0.1399   0.0569   0.0226*  0.0081*  0.0000*
dm-std      --    0.9999   0.9044   0.7191   0.4983   0.2943   0.0018*
dm-xtrm     --       --    0.9899   0.9257   0.7783   0.5657   0.0081*
dm-ins      --       --       --    0.9999   0.9964   0.9657   0.0871
cbr256      --       --       --       --    0.9999   0.9973   0.2004
abr224      --       --       --       --       --    0.9999   0.3696
r3mix       --       --       --       --       --       --    0.5887


The top table shows the unadjusted p-values, calculated assuming a normal distribution.  The bottom table shows the adjusted p-values after 100,000 bootstrap trials.  Notice that some of the adjusted p-values actually decrease relative to the unadjusted ones; this is because of my tweak.  Here is what it would look like without that tweak -- only one comparison is significant this way!

Code:
         dm-std   dm-xtrm  dm-ins   cbr256   abr224   r3mix    cbr192

mpc      0.9929   0.9657   0.7902   0.6596   0.5266   0.3970   0.0314*
dm-std      --    1.0000   0.9962   0.9823   0.9506   0.8912   0.2533
dm-xtrm     --       --    0.9997   0.9974   0.9878   0.9621   0.3970
dm-ins      --       --       --    1.0000   0.9999   0.9989   0.7222
cbr256      --       --       --       --    1.0000   0.9999   0.8398
abr224      --       --       --       --       --    1.0000   0.9185
r3mix       --       --       --       --       --       --    0.9657


I will clean up the code slightly and post it tomorrow.  Next up:  free step down.

Edit:  BTW, this method is far superior in terms of speed: a 100,000-trial run takes only 40 seconds.  That's because I'm using a calculated starting point for the unadjusted p-values.

Reply #74
FastForward:

Fixed the qsort_longsamples() function.  Sorry about that; it seems there was a bug with repeated numbers in the reference I used.

Code:
/* Quicksort on parallel arrays: sorts sortedp->data[first..last]
   ascending while keeping sortedp->num[] (the original numbering
   of each datum) in step with it. */
void qsort_longsamples(longsamples_t *sortedp, int first, int last) {

  int i, j;

  long pivot;

  struct {
    long data;   /* data to be sorted */
    int num;     /* numbering of data */
  } temp;

  if (first < last) {

    pivot = sortedp->data[first];
    i = first + 1;
    j = last;

    while (i <= j) {

      /* check the index bound before dereferencing */
      while ((i <= last) && (sortedp->data[i] <= pivot)) i++;
      while ((first < j) && (sortedp->data[j] > pivot)) j--;

      if (i < j) {
        temp.data = sortedp->data[i];
        temp.num = sortedp->num[i];
        sortedp->data[i] = sortedp->data[j];
        sortedp->num[i] = sortedp->num[j];
        sortedp->data[j] = temp.data;
        sortedp->num[j] = temp.num;
      }

    }

    /* move the pivot into its final position */
    temp.data = sortedp->data[j];
    temp.num = sortedp->num[j];
    sortedp->data[j] = sortedp->data[first];
    sortedp->num[j] = sortedp->num[first];
    sortedp->data[first] = temp.data;
    sortedp->num[first] = temp.num;

    qsort_longsamples(sortedp, first, j - 1);
    qsort_longsamples(sortedp, j + 1, last);

  }

}