## Topic: Statistics For Abx

• shday
Statistics For Abx
##### Reply #25 – 29 August, 2002, 05:21:34 PM
Quote
Quote
One more thing:  I'm not sure if it hurts to look at the progress of the first 5 trials for the 28-trial profile.

If you want the tool to be statistically sound, then don't let the listener see the progress at all, even at the look points. What does it add to the test anyhow? You've now taken steps to ensure that the listener isn't wasting time (very nice solution btw). As far as I can tell, saving wasted time was the only valid reason for allowing the listener to watch the progress in the first place. As I've said before, knowing the progress of the test compromises the independence of the trials and should be avoided if possible.

hmm, I'm quoting myself here because I'd like to restate my comment. Basically, what I'm trying to say is that the look points solve the problem of wasting time, in a statistically sound manner. The listener doesn't have to "look" for there to be look points because the program takes care of it. It is therefore unnecessary for the listener to know how he is performing *during* the test... other than to satisfy his curiosity (perhaps this point is debatable).

It is true that, if the listener cannot hear a difference, his knowing the progress will not change anything. But on the other hand, what if he does hear a difference (p>0.5), but fails the first few trials? I think his attitude toward the test could change. This is why I think seeing the results in progress may make the test less statistically sound. It just seems that keeping the listener as "blinded" as possible is the best thing to do.

So my question now is: how does allowing the listener to track his progress help the test? (curiosity is one reason, but are there others that I'm missing?)

edit : obviously if the listener reaches the desired confidence at a look point the test would terminate and he would be able to "look" at the results then

• shday
Statistics For Abx
##### Reply #26 – 29 August, 2002, 06:31:47 PM
Quote
Are there any other profiles that might be useful?

Perhaps a traditional 12/16 test with a look point at 6/6. This still gives 95% confidence. It could also terminate if more than 4 incorrect choices were made.
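
The "still gives 95% confidence" claim can be checked by hand. The sketch below is only an illustration, not code from ABC/HR or from the spreadsheet, and it assumes a pure guesser (p = 1/2 on every trial). It adds the chance of passing at the 6/6 look to the chance of passing at 12/16, then subtracts the overlap so that runs passing both ways are not counted twice. The helper name `tail` is made up for this example.

```python
from fractions import Fraction
from math import comb

def tail(n, k):
    """P(at least k correct out of n) for a pure guesser (p = 1/2)."""
    return Fraction(sum(comb(n, i) for i in range(k, n + 1)), 2 ** n)

# Pass at the look point: 6 of 6.
p_look = tail(6, 6)
# Pass at the end: at least 12 of 16.
p_final = tail(16, 12)
# Counted twice above: 6 of 6 first, then at least 6 of the remaining 10.
p_both = tail(6, 6) * tail(10, 6)

overall_alpha = p_look + p_final - p_both
print(overall_alpha, float(overall_alpha))  # 3155/65536, about 0.0481
```

Stopping after more than 4 wrong answers does not change this figure: with 5 wrong, 12 of 16 is already out of reach, so that rule only saves time.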

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #27 – 29 August, 2002, 07:20:25 PM
Quote
Quote

Perhaps a traditional 12/16 test with a look point at 6/6. This still gives 95% confidence. It could also terminate if more than 4 incorrect choices were made.

That's a possibility, although the difference between this and the 28-trial profile seems kind of small.  On the 28-trial profile, one gets a look at 18 with 4 allowable wrong guesses.  The minimum number of trials to achieve a significant result is still 6 of 6.  Also, in a 28-trial profile, there's the distinct advantage of being able to go all the way up to 28 trials if necessary, whereas a 16-trial profile will terminate the test at 16 no matter what.

To answer your other point, there is no statistical advantage to being able to look at progress.  This is purely driven by convenience and time savings.  However, if I can make it easier and faster to perform ABX trials with only a slight cost in power, I think that's a good tradeoff.

ff123

• shday
Statistics For Abx
##### Reply #28 – 29 August, 2002, 07:51:06 PM
Quote
To answer your other point, there is no statistical advantage to being able to look at progress.  This is purely driven by convenience and time savings.  However, if I can make it easier and faster to perform ABX trials with only a slight cost in power, I think that's a good tradeoff.

Sorry if I seem to be pressing this... but what are the time savings of allowing the listener to track his progress? How does it make the test easier and faster? (I'm not talking about the automated look points here; they are a good trade-off. I'm just referring to the listener being able to see his score all, or part of, the time. I think this introduces potential, though probably not very serious, problems.)

About the 6/6, 12/16 design... I guess it could be useful if one were not interested in going beyond 16 trials... for whatever reason. It's also kind of familiar territory.

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #29 – 29 August, 2002, 08:46:39 PM
Quote
Sorry if I seem to be pressing this... but what are the time savings of allowing the listener to track his progress? How does it make the test easier and faster? (I'm not talking about the automated look points here; they are a good trade-off. I'm just referring to the listener being able to see his score all, or part of, the time. I think this introduces potential, though probably not very serious, problems.)

The time savings arise because at each look point you can decide whether or not to stop the test early.  With a strictly fixed test, the listener isn't allowed to know the results until all the trials have been completed.

I understand the concern you have about looking at the progress.  The listener may decide to stop before he has a chance to pick out differences, based on his estimate on how the test is likely to continue.  That's what the Bayesian method of sequential testing (the double lines method) was supposed to do, except more rigorously.  That method doesn't put a cap on the number of trials.

However, there is an unspoken assumption I'm making about the ABX test:  I don't really care if the listener decides to stop early when he could continue on.  I really only care about the listener not claiming to hear a difference when there isn't one.  I.e., I am completely ignoring type II errors.  This is probably not the right approach for a generic ABX test, though, in which the listener may be interested in testing for similarity, not just difference.

Maybe the Bayesian method could still be appropriate for my purposes (i.e., my goal is to make it as easy as possible for a listener to perform a valid ABX test for differences) if I can refine the lines into something which allows me to get significant results at something less than 9 trials minimum.  I haven't really looked closely at this method, though.

The method I am proposing to use is the frequentist method, and in order to calculate (simulate) it, I need to know the max number of trials allowed.  This has the significant advantage of pushing the minimum number of trials necessary to get significant results down to 6.

ff123

• shday
Statistics For Abx
##### Reply #30 – 29 August, 2002, 09:35:04 PM
I guess most of the time a tester wouldn't be influenced in the way I fear. I'm probably placing more emphasis on it than it warrants. The proof of this can be found in the way I sometimes do an ABX test. I move the window somewhere so that the running score is hidden! Now it looks like I'll be able to continue doing this while not wasting as much time, thanks to ABX/HR.

Quote
The time savings arise because at each look point you can decide whether or not to stop the test early.  With a strictly fixed test, the listener isn't allowed to know the results until all the trials have been completed.

I'd still argue that the same time savings could be achieved by allowing the software to deal with the look points automatically. Once a look point was reached the test would either terminate (because the desired confidence was reached) or it would go on as if nothing had happened. The listener would not have to know his exact score. This is indeed different from a strictly fixed test where the listener isn't allowed to know the results until all the trials have been completed... but not by much.

Maybe it could be an option?

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #31 – 29 August, 2002, 10:20:33 PM
Quote
I'd still argue that the same time savings could be achieved by allowing the software to deal with the look points automatically. Once a look point was reached the test would either terminate (because the desired confidence was reached) or it would go on as if nothing had happened. The listener would not have to know his exact score. This is indeed different from a strictly fixed test where the listener isn't allowed to know the results until all the trials have been completed... but not by much.

It doesn't matter at all whether the decision to terminate early is made automatically or by the listener.  The simulation I wrote always terminates at a look point when it is appropriate to do so, just like an automated process would do!

It's purely the fact that early termination is allowable at all which affects the overall pval.  What happens is a subtle form of "cherry picking."  That is, the program doesn't stop at a random point, but instead only when it's advantageous to do so.  That's what causes the problem.

ff123

• Continuum
Statistics For Abx
##### Reply #32 – 30 August, 2002, 04:40:34 AM
Here's a new version of the Excel sheet. It lets you specify look points and, for each one, the required p-value (nominal alpha). The result is the total alpha.
http://www.freewebz.com/aleph/CorrPVal2.xls

Quote
Here is the lookup table I would use for the 28-trial profile:

*0 wrong: at least 6 of 6 (can't have fewer than 6 trials with 0 wrong)
1 wrong: at least 9 of 10 (can't have fewer than 10 trials with 1 wrong)
*2 wrong: at least 10 of 12
3 wrong: at least 13 of 16
*4 wrong: at least 14 of 18
5 wrong: at least 17 of 22
*6 wrong: at least 17 of 23
7 wrong: at least 19 of 26
*8 wrong: at least 20 of 28

Notes:
* = look points
1. overall test significance is 0.05

The accurate value appears to be 0.05080.. (according to my program/calculation).

Quote
2. listener is not allowed to perform ABX trials past the max of 28.
3. listener is allowed to see trials 1 through 5 in addition to the early-decision look points
4. ABX is terminated if listener gets 9 or more trials wrong.
5. listener can terminate at any time, with overall results taken from the above table.

The last point is a little dubious to me. But it shouldn't affect the results too much.
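
For readers who want to check a total alpha without the spreadsheet or the simulator, here is a minimal exact calculation. It is a sketch under stated assumptions, not a port of the spreadsheet macros or of seqsim: the listener is a pure guesser (p = 1/2), the test stops with a pass at the first look point where the score meets the requirement, and only the starred look points count. The names `total_alpha` and `profile_28` are invented for this example.

```python
from fractions import Fraction

def total_alpha(looks):
    """Exact chance that a pure guesser (p = 1/2) passes at any look point.

    `looks` is a list of (trial, min_correct) pairs. The test stops with
    a pass at the first look where the number correct >= min_correct.
    """
    need = dict(looks)
    last = max(need)
    # alive[k] = probability of k correct so far without having passed yet
    alive = {0: Fraction(1)}
    passed = Fraction(0)
    for trial in range(1, last + 1):
        nxt = {}
        for k, prob in alive.items():
            for gain in (0, 1):  # wrong or right, each with probability 1/2
                nxt[k + gain] = nxt.get(k + gain, Fraction(0)) + prob / 2
        if trial in need:
            for k in [k for k in nxt if k >= need[trial]]:
                passed += nxt.pop(k)  # these runs stop here with a pass
        alive = nxt
    return passed

# Starred look points of the 28-trial profile quoted above.
profile_28 = [(6, 6), (12, 10), (18, 14), (23, 17), (28, 20)]
print(float(total_alpha(profile_28)))
```

Because the arithmetic uses `Fraction`, there is no roundoff to argue about. Extra stopping points, such as a listener quitting between looks, can be modelled by adding more `(trial, min_correct)` pairs to the list.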

• Continuum
Statistics For Abx
##### Reply #33 – 30 August, 2002, 07:52:28 AM
Quote
It is true that, if the listener cannot hear a difference, his knowing the progress will not change anything.

Of course, you are only talking about the first 5 trials?

Quote
So my question now is: how does allowing the listener to track his progress help the test? (curiosity is one reason, but are there others that I'm missing?)

On difficult samples, I like to know if my efforts are enough. If my score is not as good as it should be, I can try to listen more carefully (but causing more fatigue). For me this information is very useful!

• Continuum
Statistics For Abx
##### Reply #34 – 30 August, 2002, 07:54:26 AM
This should be a mode that allows 5/5 with total significance = 0.049567:
at least  5 of  5
at least  10 of  12
at least  15 of  19
at least  17 of  22
Not that different from the 28 profile (though shorter), but with 5/5 possibility. Might be good for finding obvious differences.

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #35 – 30 August, 2002, 08:27:41 AM
Quote
This should be a mode that allows 5/5 with total significance = 0.049567:
at least  5 of  5
at least  10 of  12
at least  15 of  19
at least  17 of  22
Not that different from the 28 profile (though shorter), but with 5/5 possibility. Might be good for finding obvious differences.

Thanks for the spreadsheet.  Hopefully I can incorporate the calculations into abchr instead of running a mini-simulation each time I perform a look.

The 22-trial version has an interesting property:  The nominal alphas are not spread evenly, but get tighter as the test progresses:

5 of 5:  0.031
10 of 12:  0.019
15 of 19:  0.010
17 of 22:  0.008

How about something like the following, where the last look point is also spaced 6 trials from the next-to-last look point, instead of only 3 trials.

5 of 5:  0.031
10 of 12:  0.019
15 of 19: 0.010
19 of 25: 0.007

overall p: 0.049

What are the implications of having a test which gets stricter as it progresses?

By comparison, the alpha spreading for the 28-trial version is more even:

6 of 6: 0.016
10 of 12:  0.019
14 of 18:  0.015
17 of 23:  0.017
20 of 28:  0.018

ff123
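
The nominal alphas listed in this post are plain one-sided binomial tails, so they are easy to recompute. The snippet below is an illustrative sketch that assumes a guessing listener (p = 1/2); `nominal_alpha` is a name made up here, not a function from abchr.

```python
from math import comb

def nominal_alpha(correct, trials):
    """One-sided binomial tail: P(at least `correct` of `trials`) when guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

profiles = [
    ("22-trial", [(5, 5), (10, 12), (15, 19), (17, 22)]),
    ("25-trial", [(5, 5), (10, 12), (15, 19), (19, 25)]),
    ("28-trial", [(6, 6), (10, 12), (14, 18), (17, 23), (20, 28)]),
]
for name, looks in profiles:
    print(name)
    for correct, trials in looks:
        print(f"  {correct} of {trials}: {nominal_alpha(correct, trials):.3f}")
```

Rounded to three decimals, these agree with the figures quoted above. Keep in mind that they are per-look values, and the overall alpha of a profile is not simply their sum.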

• shday
Statistics For Abx
##### Reply #36 – 30 August, 2002, 11:04:00 AM
Quote
On difficult samples, I like to know if my efforts are enough. If my score is not as good as it should be, I can try to listen more carefully (but causing more fatigue). For me this information is very useful!

Now I see the point. Being able to see your results can increase your chances of passing the test because you can try harder if needed.  At first I was averse to this because you are effectively manipulating (attempting to increase) p during the test. But upon further reflection this seems to be irrelevant to the statistics.

Perhaps some users would not treat the running score as you do, resulting in an effective lowering of p… which would be a problem. I suspect that most users would be knowledgeable enough to avoid this so that in practice it should not be an issue.

• shday
Statistics For Abx
##### Reply #37 – 30 August, 2002, 11:12:56 AM
Quote
Quote
It is true that, if the listener cannot hear a difference, his knowing the progress will not change anything.

Of course, you are only talking about the first 5 trials?

If the true value of p is 0.5, then it doesn't matter how much the listener knows; he will always be guessing. That's all I was trying to say.

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #38 – 30 August, 2002, 12:19:48 PM
Quote
The accurate value appears to be 0.05080.. (according to my program/calculation).

Hmm, I can't verify this using my simulator.  I made the total alpha precise to 4 digits and increased the simulations to 1 million, but come up with 0.0496.

ff123

• Continuum
Statistics For Abx
##### Reply #39 – 30 August, 2002, 12:36:10 PM
Quote
How about something like the following, where the last look point is also spaced 6 trials from the next-to-last look point, instead of only 3 trials.

5 of 5: 0.031
10 of 12: 0.019
15 of 19: 0.010
19 of 25: 0.007

Yes, this is better. I missed that one.

Quote
Hmm, I can't verify this using my simulator. I made the total alpha precise to 4 digits and increased the simulations to 1 million, but come up with 0.0496.

Hmmm. There seems to be a little inaccuracy somewhere. Can you post the relevant part of your source, so that we can see if the programs are based on slightly different assumptions?
Or maybe there is a little mistake somewhere, although I'm quite sure that the idea behind it is correct.
Is the Excel code readable for you, or should I explain it a bit more?

• Continuum
Statistics For Abx
##### Reply #40 – 30 August, 2002, 12:40:44 PM
Quote
Quote
Quote
It is true that, if the listener cannot hear a difference, his knowing the progress will not change anything.

Of course, you are only talking about the first 5 trials?

If the true value of p is 0.5, then it doesn't matter how much the listener knows; he will always be guessing. That's all I was trying to say.

It does matter, if he can stop the test when it's advantageous to him. In fact, a guessing listener who takes enough trials could pass any traditional ABX test with probability 1.

But maybe I understood you wrong?
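
The "probability 1" remark can be illustrated numerically. The sketch below rests on an assumption about what "traditional" means here: the listener sees the ordinary fixed-length p-value after every trial and stops the moment it reaches 0.05 or less. It then computes, exactly, the chance that a pure guesser has managed this within a given number of trials. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def tail(n, k):
    """P(at least k correct out of n) for a pure guesser (p = 1/2)."""
    return Fraction(sum(comb(n, i) for i in range(k, n + 1)), 2 ** n)

def pass_probability(max_trials, alpha=Fraction(1, 20)):
    """Chance that a guesser's running p-value reaches alpha or below at
    some trial up to max_trials, if he may stop the moment it does."""
    alive = {0: Fraction(1)}  # k correct so far -> probability, not yet passed
    passed = Fraction(0)
    for n in range(1, max_trials + 1):
        nxt = {}
        for k, prob in alive.items():
            for gain in (0, 1):  # wrong or right, each with probability 1/2
                nxt[k + gain] = nxt.get(k + gain, Fraction(0)) + prob / 2
        for k in [k for k in nxt if tail(n, k) <= alpha]:
            passed += nxt.pop(k)  # listener stops here and claims a pass
        alive = nxt
    return passed

if __name__ == "__main__":
    for n in (8, 16, 50, 100):
        print(n, float(pass_probability(n)))
```

The printed probabilities keep climbing as the allowed number of trials grows, and the nominal 0.05 is already exceeded by trial 8. That is the reason the profiles in this thread cap the number of trials and fix the look points in advance.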

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #41 – 30 August, 2002, 12:52:03 PM
I have uploaded an updated binary to:
http://ff123.net/export/seqsim.zip

and the source code to:
http://ff123.net/export/seqsimsource.zip

The relevant portion of the code is in seqsimDlg.cpp in the function called OnRunsim().  But in a nutshell, I run N number of simulations of a 28 total-trial ABX session.  At each look point, including the 28th trial, I count the number of times that the number of correct answers equals or exceeds the specified entry at that look point.  I call this a "hit."  If I get a hit at a look point, I terminate and go on to the next simulation run.  Then I count all the hits and divide by the number of simulations to get the total alpha.

I might need some explaining on the macros in your spreadsheet.  The only thing I can think of right now is that there is a rounding error in the calculation (there are a lot of sums in the calculation).  From this standpoint, the simulation should be more accurate.

I also verified that the simulation gives close (but not exact!) agreement with my binomial calculations if I only have one look point.

Any thoughts on the non-even spreading of the alpha error?

ff123
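
The procedure described in this post can be sketched in a few lines. This is not the code from seqsimDlg.cpp or `OnRunsim()`; it is a minimal re-creation of the description, assuming a guessing listener and using Python's standard `random` module in place of whatever generator seqsim uses. The names `simulate_total_alpha` and `profile_28` are invented for the example.

```python
import random

def simulate_total_alpha(looks, n_sims, seed=0):
    """Estimate the chance that a guesser scores a "hit" at any look point.

    `looks` is a list of (trial, min_correct) pairs. Each simulated
    session ends at the first look point where the score is high enough.
    """
    rng = random.Random(seed)
    need = dict(looks)  # trial -> minimum number correct at that look
    last = max(need)
    hits = 0
    for _ in range(n_sims):
        correct = 0
        for trial in range(1, last + 1):
            correct += rng.getrandbits(1)  # coin-flip answer: 1 = correct
            if trial in need and correct >= need[trial]:
                hits += 1  # hit: terminate this session early
                break
    return hits / n_sims

if __name__ == "__main__":
    profile_28 = [(6, 6), (12, 10), (18, 14), (23, 17), (28, 20)]
    print(simulate_total_alpha(profile_28, 100_000))
```

With a single look point the estimate should land close to the plain binomial tail, which makes a convenient sanity check on the random number generator. The standard error shrinks only with the square root of the number of sessions, so the fourth decimal place needs millions of runs.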

• shday
Statistics For Abx
##### Reply #42 – 30 August, 2002, 12:57:41 PM
Quote
It does matter, if he can stop the test when it's advantageous to him. In fact, a guessing listener who takes enough trials could pass any traditional ABX test with probability 1.

Agreed!

One thing that this discussion reinforces for me is the caution one should use when interpreting p-values, for any statistical test, not just ABX.

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #43 – 30 August, 2002, 02:54:46 PM
I've been comparing the simulation vs. the binomial calculation and I see a difference that's not coming from roundoff error (I changed all doubles to long doubles, or 80 bits), and I set the simulation size to 10 million trials.  I also removed an approximation I was making with the binomial calculation (not summing values if they were less than 0.0001).

Here is a graph of the difference, and the absolute value of the difference in the resulting pvalues for a 20 trial ABX session:

This is pretty weird, and I can't explain what's going on.

ff123

Edit:  anyway, there doesn't seem to be any reason to believe that the simulation would produce an oscillating effect like that, so I have to think that this is an artifact of the binomial calculation!

Edit2:  This was an artifact of the random number generator and/or calculation I was using.  I fixed this problem

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #44 – 30 August, 2002, 04:42:00 PM
Quote
Quote
5. listener can terminate at any time, with overall results taken from the above table.

The last point is a little dubious to me. But it shouldn't affect the results too much.

Yes, it would seem that the most advantageous places to stop an ABX test would be at the look points, and that the most advantageous look points to stop at would be the ones with the highest nominal alpha risks.  In the 28-trial, the best look-point to stop at would be the one at trial 12.  The worst stopping point would be an in-between early termination at trial 22, where the listener is required to get 17 correct (nominal alpha = 0.0085).

However, I'm thinking of the listener again.  If he wants to stop in between look points, it should be fine, but he's going to pay a small penalty for that.

In your 22 or 25-trial version, the best stopping look point is the first one (trial 5).  From there, it gets progressively harder to achieve a significant result.

• Continuum
Statistics For Abx
##### Reply #45 – 30 August, 2002, 04:44:43 PM
Quote
But in a nutshell, I run N number of simulations of a 28 total-trial ABX session. At each look point, including the 28th trial, I count the number of times that the number of correct answers equals or exceeds the specified entry at that look point. I call this a "hit." If I get a hit at a look point, I terminate and go on to the next simulation run. Then I count all the hits and divide by the number of simulations to get the total alpha.

Exactly what it should be. The randomization routine is beyond doubt, I guess?

Quote
The only thing I can think of right now is that there is a rounding error in the calculation (there are a lot of sums in the calculation). From this standpoint, the simulation should be more accurate.

Yes, this would explain why the results are close, but not the same.

Quote
Any thoughts on the non-even spreading of the alpha error?

Theoretically, it shouldn't be a problem. The calculated/simulated total alpha is significant. Intuitively, it takes into account that a listener that was wrong in the first trials is less to be trusted. So to speak, it gives the unknown/beginning user a little bonus.

Quote

I'll write more commentary later.

Quote
Edit: anyway, there doesn't seem to be any reason to believe that the simulation would produce an oscillating effect like that, so I have to think that this is an artifact of the binomial calculation!

What do you mean?!
Here are accurate values of alphas (again from Maple):
Code:
```
>alpha:=(correct,trials)->evalf[25](sum(binomial(trials,k)*1/2^trials,k=correct..trials));
>for i from 0 to 20 do
>   alpha(i,20);
> end do;
1.
.9999990463256835937500000
.9999799728393554687500000
.9997987747192382812500000
.9987115859985351562500000
.9940910339355468750000000
.9793052673339843750000000
.9423408508300781250000000
.8684120178222656250000000
.7482776641845703125000000
.5880985260009765625000000
.4119014739990234375000000
.2517223358154296875000000
.1315879821777343750000000
.05765914916992187500000000
.02069473266601562500000000
.005908966064453125000000000
.001288414001464843750000000
.0002012252807617187500000000
.00002002716064453125000000000
.9536743164062500000000000*10^-6
```

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #46 – 30 August, 2002, 05:10:20 PM
Quote
Quote
Edit: anyway, there doesn't seem to be any reason to believe that the simulation would produce an oscillating effect like that, so I have to think that this is an artifact of the binomial calculation!

What do you mean?!
Here are accurate values of alphas (again from Maple):
`>alpha:=(correct,trials)->evalf[25](sum(binomial(trials,k)*1/2^trials,k=correct..trials));`

I don't doubt the precision of the calculation (after all, I used 80 bits to represent a floating point number).  But if there is little or no roundoff error in the binomial calculation, and the simulation error is small enough (should be with 10 million trials), then I trust the simulation over the calculation as the more accurate one.

As I said, the oscillation of the difference between the simulation and the calculation is very suspicious.  And I don't see how it could have come from the simulation.

ff123

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #47 – 30 August, 2002, 05:17:37 PM
I've been thinking some more about the in-between-look terminations.  Since the listener cannot make a decision to continue the test after he terminates it, I think I have calculated things wrong.  For example, if the listener gets a look at trial 6, but then stops at trial 8, then all the other looks at trial 12, 18, 23, and 28 should not be counted towards the overall alpha.

Also, I think I need to rethink the looks at trials 1 through 5.

ff123

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #48 – 30 August, 2002, 05:51:19 PM
Ok, here is my corrected 28-trials profile

Code:
```
10 million simulations using the corrected random number generator

                 total    look
trial  correct   alpha    point?   looks taken so far
    5      5     0.0313    no      no looks
    6      6     0.0491    yes     look at trial 6
    7      7     0.0156    no      look at trial 6
    8      7     0.0390    no      look at trial 6
    9      8     0.0273    no      look at trial 6
   10      9     0.0214    no      look at trial 6
   11      9     0.0406    no      look at trial 6
   12     10     0.0491    yes     look at trial 6, 12
   13     11     0.0295    no      look at trial 6, 12
   14     11     0.0417    no      look at trial 6, 12
   15     12     0.0356    no      look at trial 6, 12
   16     13     0.0326    no      look at trial 6, 12
   17     13     0.0424    no      look at trial 6, 12
   18     14     0.0491    yes     look at trial 6, 12, 18
   19     14     0.0495    no      look at trial 6, 12, 18
   20     15     0.0430    no      look at trial 6, 12, 18
   21     16     0.0399    no      look at trial 6, 12, 18
   22     16     0.0487    no      look at trial 6, 12, 18
   23     17     0.0491    yes     look at trial 6, 12, 18, 23
   24     18     0.0435    no      look at trial 6, 12, 18, 23
   25     18     0.0490    no      look at trial 6, 12, 18, 23
   26     19     0.0462    no      look at trial 6, 12, 18, 23
   27     20     0.0449    no      look at trial 6, 12, 18, 23
   28     20     0.0491    no      look at trial 6, 12, 18, 23
```

Notes:
1.  No looks allowed for trials 1 through 5

• ff123
• Developer (Donating)
Statistics For Abx
##### Reply #49 – 30 August, 2002, 06:37:33 PM
I updated seqsimsource.zip and seqsim.zip on my website with the same random number generator used by bootstrap.exe.  This seems to yield the same results as plain old rand(), though.

ff123

Eek!  I take it back.  The results for 10 of 20 now agree to 4 decimal places using the new random number generator and 10 million trials!

But going back to the 28-trial case, I still get 0.0491.  So there was something wrong with my random numbers, but there must still be something wrong with your calculation (could still be roundoff errors).