HydrogenAudio

Hydrogenaudio Forum => General Audio => Topic started by: ff123 on 2002-08-27 16:58:11

Title: Statistics For Abx
Post by: ff123 on 2002-08-27 16:58:11
Hopefully in the near future, I can implement an indicator of whether or not a listener should continue to perform ABX testing based on certain specified parameters:

alpha: probability of stating that a difference exists when it does not, i.e. a false positive or type I error (this is the parameter we are typically concerned with, and it is usually set to 0.05)

beta:  probability of stating that no difference exists when it does, i.e. a false negative or type II error

p0: the expected proportion of correct decisions when the samples are identical (0.5 for ABX)

p1: the expected proportion of correct decisions when the listener really can detect the odd sample (rather than guessing).

We have historically not concerned ourselves with beta and p1, but I think it would be advantageous to do so for tests of very subtle differences.
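
To make the four parameters concrete, here is a minimal Python sketch for a fixed-length ABX test. The function names and the value p1 = 0.7 are assumptions chosen for illustration only; they do not come from abchr.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def fixed_test(n, alpha=0.05, p0=0.5, p1=0.7):
    """Return (critical count, actual alpha, beta) for an n-trial test.

    The critical count is the smallest score whose chance under guessing
    (p0) is at most alpha. beta is the chance that a listener with true
    hit rate p1 (an assumed value) scores below that count.
    Returns None when no score can reach the requested alpha.
    """
    for k in range(n + 1):
        actual_alpha = upper_tail(n, k, p0)
        if actual_alpha <= alpha:
            return k, actual_alpha, 1 - upper_tail(n, k, p1)
    return None

k, a, b = fixed_test(16)
print(f"16 trials: need {k} correct, alpha = {a:.4f}, beta = {b:.4f}")
```

The point of the sketch is that beta can be large even when alpha is small: a short test protects against false positives but can easily miss a listener who only hears a subtle difference.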

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-08-27 18:51:36
Is this related to my question concerning guessing probability?

http://www.audio-illumination.org/forums/index.php?act=ST&f=1&t=2753
Title: Statistics For Abx
Post by: ff123 on 2002-08-27 20:57:23
Ideally, I would pop up a graph with the y axis showing the number of correct responses and the x axis showing the number of total trials.  There would be two lines, one of which shows how many correct responses there need to be for any particular number of total trials to achieve 0.05 significance for hearing a difference.  The other line would show at what point the listener should just give up, because the chance of getting a false negative is below what the chosen beta and p1 indicate.  I'll post the formula later tonight.

ff123
Title: Statistics For Abx
Post by: Guest on 2002-08-27 21:52:36
ff123,

You may want to consider expressing the running (and final) result of an ABX test using confidence intervals. It is different from the more commonly used “hypothesis testing” method, but it could be a nice complement here.

For example:

In another thread you reported the results of a test where you scored 52/82 giving a p-value of 0.010.

You could also report that, during the test, your probability of choosing the correct sample was p = 0.634 (i.e., 52/82) with a 99% confidence interval of (+/- 0.137) (i.e., p = 0.634 +/- 0.137).

Notice that the lower bound of the confidence interval just overlaps p=0.5, which jibes with what the p-value indicates.

For comparing subtle differences in ABX tests I think this could be quite useful as compared to using a p-value alone, which gives no information about the magnitude of a perceived difference.

The caveat with this method is that the binomial distribution becomes non-normal at the extreme edges (where p is close to 1 in the case of ABX). Calculating the confidence interval becomes less than trivial in this case (otherwise it is quite simple).

… as far as your software is concerned, it sounds like you have some good ideas… but it also looks a little complicated. I’ll have to think about it.
Title: Statistics For Abx
Post by: shday on 2002-08-27 21:54:14
... hmm, I can post when not logged in???

Title: Statistics For Abx
Post by: ff123 on 2002-08-28 07:28:20
Well, it turns out I don't understand the statistics quite well enough to be confident about adding it to abchr.

The type of analysis is called a sequential test.  For example:

http://home.clara.net/sisa/sprthlp.htm
and
http://education.indiana.edu/~frick/decide/intro.html

The formula for the lower line (below which similarity is declared and the test is stopped) is:

d0 = { log(beta) - log(1-alpha) - n*log(1-p1) + n*log(1-p0) } /
{ log(p1) - log(p0) - log(1-p1) + log(1-p0) }

The formula for the upper line (above which a difference is declared and the test is stopped) is:

d1 = { log(1-beta) - log(alpha) - n*log(1-p1) + n*log(1-p0) } /
{ log(p1) - log(p0) - log(1-p1) + log(1-p0) }

alpha, beta, p0, and p1 are as described in my first post.

Basically, alpha, beta, p0, and p1 are decided upon prior to a test, and the test continues until the number of correct trials exceeds the upper line or goes below the lower line.
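
The two stopping lines can be sketched in Python as follows. This is only an illustration of the formulas above; the default values alpha = beta = 0.05 and p1 = 0.7 are assumptions and are not necessarily the ones used for the chart.

```python
from math import log

def sprt_lines(n, alpha=0.05, beta=0.05, p0=0.5, p1=0.7):
    """Return (d0, d1): lower and upper stopping lines after n trials."""
    denom = log(p1) - log(p0) - log(1 - p1) + log(1 - p0)
    # Term shared by both lines; it grows linearly with n.
    drift = n * (log(1 - p0) - log(1 - p1))
    d0 = (log(beta) - log(1 - alpha) + drift) / denom
    d1 = (log(1 - beta) - log(alpha) + drift) / denom
    return d0, d1

def sprt_decision(correct, n, **params):
    """Classify a running score of `correct` out of `n` trials."""
    d0, d1 = sprt_lines(n, **params)
    if correct >= d1:
        return "difference"
    if correct <= d0:
        return "no difference"
    return "continue"

# Running score after each trial of a perfect run.
for n in range(1, 10):
    print(n, sprt_decision(n, n))
```

With these assumed parameters a perfect run first reaches the upper line at 9 consecutive correct trials. Both lines share the same slope, so the "continue" band keeps a constant width as n grows.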

When I entered this into an Excel spreadsheet, though, and put in typical values for the test parameters, it invariably resulted in having to get more correct trials per n than what the binomial distribution would give.  So I need to understand why that is.

Here is a chart showing what this test is all about.  In this example, let's say that someone scored the first 4 trials correct, but then got the next 6 trials incorrect.  At that point, the sequential test would tell him to stop any further trials, because it would just be a waste of time.  Of course, the counter to this savings in time is that it now takes 9 consecutive correct trials before the test is stopped on the other side.  Straight binomial distribution only requires 5 consecutive correct trials to declare a difference at 95% confidence.

(http://ff123.net/export/sequential.gif)

ff123
Title: Statistics For Abx
Post by: Delirium on 2002-08-28 10:24:27
I don't have my statistics books with me and it's a bit late to read a ton on it at 4am anyway, but from a quick perusal (primarily of your 2nd link) it seems like they're taking something additional into account to make sure that "the test is really safe to stop now".  I.e. with a normal binomial distribution after n trials you say "p < 0.05, so it's significant", but with this method, they seem to be saying "if we were to stop right now and do a normal analysis, p < 0.05, but are we reasonably confident that p will stay below 0.05 as the test continues?"

Of course my reading of it could be completely wrong, as 4am is not the best time to do statistical analysis. =]

But I'd expect there to be something different about the lines, or else there would be no need for a separate method of sequential analysis -- you'd just do the normal analysis after each test, and stop when p < 0.05.
Title: Statistics For Abx
Post by: ff123 on 2002-08-28 14:50:35
I guess what I'm having trouble understanding is why wouldn't someone just calculate the alpha and beta risks directly for each situation (like I currently do for alpha), and then stop the test when either alpha or beta dips below some pre-specified level?

Another website:

http://www.uib.no/isf/medseq.htm

As near as I can tell, if one is allowed to see the results of an ABX test in progress, then one needs to set the level of significance stricter than 0.05.  See Table 1 of the website reference above.

Also, apparently the method of using the double lines to stop a sequential test is pretty old and hoary.  The web page mentions Repeated Statistical Tests, replacing the borderlines with repeated t-tests.

ff123

Edit: using intuitive arguments:  basing a decision to stop a test on knowledge of the progress (as is currently the case with abchr) is subject to a bias.  I.e., I always stop the test when it's advantageous for me to do so.  That would seem to imply that a stricter stopping criterion is needed to be equivalent to a test where the number of trials is pre-determined.

Question has been submitted to sci.math.stat, as this appears to be a rather large issue which needs to be resolved.
Title: Statistics For Abx
Post by: ff123 on 2002-08-28 16:49:44
One more thing, Table 1 on the page I listed above also seems to imply (just like the lines in the chart I drew) that for a sequential test, one must achieve a "nominal" significance of 0.01 for 9 total trials to be equivalent to a fixed test significance of 0.05.  That is, one should perform at least 9 out of 9 trials on a sequential test (results known after every trial) before stopping an ABX test.  Current fixed test stopping point is 5 out of 5.

Probably a conservative approach would be to use the double-line method for now until I can figure out how to refine it using simulation.

ff123
Title: Statistics For Abx
Post by: shday on 2002-08-28 16:49:56
Quote
I guess what I'm having trouble understanding is why wouldn't someone just calculate the alpha and beta risks directly for each situation (like I currently do for alpha), and then stop the test when either alpha or beta dips below some pre-specified level?

I think the bottom line is that obtaining a certain p-value is not quite the same as obtaining the corresponding level of confidence in a test where the significance level is chosen a priori. This is what Continuum was talking about in his previous thread.

As an example, this means that when you score 12/16 on an ABX test, the probability that you obtained the score by chance is actually greater than what the p-value indicates (by how much I don't know). This may seem at first absurd but it has to be true.

Calculating alpha and beta values for each situation (during the test) undermines the a priori part of significance testing and should be done with caution.
Title: Statistics For Abx
Post by: ff123 on 2002-08-28 17:15:48
I didn't understand the previous thread then, but I understand the problem now, I think.  And there is apparently at least one way to do a rough adjustment (the double line method), and as usual, several ways to refine it, the most accurate way probably being some sort of simulation.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-08-28 20:16:32
Quote
... hmm, I can post when not logged in???

ff123,

You may want to consider expressing the running (and final) result of an ABX test using confidence intervals. It is different from the more commonly used “hypothesis testing” method, but it could be a nice complement here.

For example:

In another thread you reported the results of a test where you scored 52/82 giving a p-value of 0.010.

You could also report that, during the test, your probability of choosing the correct sample was p = 0.634 (i.e., 52/82) with a 99% confidence interval of (+/- 0.137) (i.e., p = 0.634 +/- 0.137).

Notice that the lower bound of the confidence interval just overlaps p=0.5, which jibes with what the p-value indicates.

For comparing subtle differences in ABX tests I think this could be quite useful as compared to using a p-value alone, which gives no information about the magnitude of a perceived difference.

The caveat with this method is that the binomial distribution becomes non-normal at the extreme edges (where p is close to 1 in the case of ABX). Calculating the confidence interval becomes less than trivial in this case (otherwise it is quite simple).

… as far as your software is concerned, it sounds like you have some good ideas… but it also looks a little complicated. I’ll have to think about it.

shday,
How did you calculate the confidence interval? With what formula? (I suppose you are using something like: 0.99 = NORMALDIST( (x-np)/sqrt(npq) ) where n=82, p=q=1/2, but that's a wild guess. )

I don't really understand what the interpretation of this interval is. Which probability are we investigating?


For comparison, here's a graph of the traditional p-val calculation: the x-coordinate of each point is the total number of trials, and the y-coordinate is the number of correct ABX trials required to achieve a confidence (1 - p-val) greater than 0.99 (black points) or 0.95 (red points).
http://www.freewebz.com/aleph/095-graph.png
Title: Statistics For Abx
Post by: Continuum on 2002-08-28 20:23:49
Situation 1: the number of trials is fixed in advance. The calculated p-val (the probability of getting the same or a better result by guessing) depends only on the listener's answers.

Situation 2: a level of confidence (1 - p-val, calculated as above) is required, and the number of trials used (the test length) is left open.

Obviously the two situations are quite different, which is our current problem. The question is: what is the probability of reaching a certain level of confidence by guessing, if one is allowed an unlimited number of trials? It is clear that the probability of achieving 0.99 confidence by guessing is greater than 0.01. But by how much?

(This reminds me of the theory of the "simple random walk", where you reach a certain point with probability=1 but infinite expected moves...)
Title: Statistics For Abx
Post by: ff123 on 2002-08-28 23:45:51
Here is a web page of the clearest explanation I have seen so far of the problem:

http://www3.mdanderson.org/depts/biostatistics/people/dberry/seqstatmethods.html

----------------
An excerpt:

Consider another sequential design, one of a type of group-sequential designs commonly used in clinical trials. The experimental plan is to stop at 17 tries if 13 or more are successes or 13 or more are failures, and hence the experiment is stopped on target. But if after 17 tries the number of successes is between 5 and 12 then the experiment continues to a total of 44 tries. If at that time, 29 or more are successes or 29 or more are failures then the null hypothesis is rejected. To set the context, suppose the experiment is nonsequential, with sample size fixed at 44 and no possibility of stopping at 17; then the exact significance level is again 0.049. When using a sequential design, one must consider all possible ways of rejecting the null hypothesis in calculating a significance level. In the group-sequential design there are more ways to reject than in the nonsequential design with the sample size fixed at 17 (or fixed at 44). The overall probability of rejecting is greater than 0.049 but is somewhat less than 0.049 + 0.049 because some sample paths that reject the null hypothesis at sample size 17 also reject it at sample size 44. The total probability of rejecting the null hypothesis for this design is actually 0.080. Therefore, even though the results beyond the first 17 observations are never observed, the fact that they might have been observed makes 13 successes of 17 no longer statistically significant (since 0.08 is greater than 0.05).

To preserve a 0.05 significance level in group-sequential or fully sequential designs, investigators must adopt more stringent requirements for stopping and rejecting the null hypothesis. That is, they must include fewer observations in the region where the null hypothesis is rejected. For example, the investigator in the above study might drop 13 successes or failures in 17 tries and 29 successes or failures in 44 tries from the rejection region. The investigator would stop and claim significance only if there are at least 14 successes or at least 14 failures in the first 17 tries, and claim significance after 44 tries only if there are at least 30 successes or at least 30 failures. The nominal significance levels (those appropriate had the experiment been nonsequential) at n=17 and n=44 are 0.013 and 0.027, and the overall (or adjusted) significance level of rejecting the null hypothesis is 0.032. (No symmetric rejection regions containing more observations allow the significance level to be greater than this but still smaller than 0.05.) With this design, 13 successes of 17 is not statistically significant (as indicated above) because this data point is not in the rejection region.
-----------------
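
The arithmetic in the excerpt can be checked by exact enumeration. The Python sketch below is one way to do that; the function names are made up for illustration, and it assumes the two-stage, two-sided design exactly as the excerpt describes it.

```python
from math import comb

def nominal_two_sided(n, c):
    """Two-sided nominal level: P(successes >= c or failures >= c), c > n/2."""
    return 2 * sum(comb(n, k) for k in range(c, n + 1)) / 2**n

def two_stage_alpha(n1, c1, n2, c2):
    """Overall chance of rejecting H0 (p = 0.5) under pure guessing.

    Stage 1: stop and reject after n1 tries if successes >= c1 or
    failures >= c1. Otherwise continue to n2 tries and reject if
    successes >= c2 or failures >= c2.
    """
    total = 0.0
    extra = n2 - n1
    for s1 in range(n1 + 1):
        p_s1 = comb(n1, s1) / 2**n1
        if s1 >= c1 or n1 - s1 >= c1:
            total += p_s1          # rejected at the first look
            continue
        for s2 in range(extra + 1):
            s = s1 + s2
            if s >= c2 or n2 - s >= c2:
                total += p_s1 * comb(extra, s2) / 2**extra
    return total

print("nominal at 17 tries:", round(nominal_two_sided(17, 13), 3))
print("overall, 13/17 then 29/44:", round(two_stage_alpha(17, 13, 44, 29), 3))
print("overall, 14/17 then 30/44:", round(two_stage_alpha(17, 14, 44, 30), 3))
```

The overall level always lies between the nominal level of a single look and the sum of the nominal levels of both looks, which is the point the excerpt makes about overlapping sample paths.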

The best way to minimize this problem would be to take fewer "looks" at the results in progress.  For example, suppose I looked at the results after 7 trials, then 14, then 21, and 28, where 28 would be the maximum allowable trials before I stop the test altogether.  4 looks would mean that the nominal significance at each look would have to be about 0.016 to achieve an overall significance of 0.05 (according to table 1 at the MEDSEQ website).

It's possible that I could move the first look down to trial 6, and still be able to keep the nominal significance at 0.016 (although I'd have to simulate to make sure).  That would be the best case, because it would allow a forced ABX test to take place in which the minimum number of trials to achieve significance is set to something as low as reasonably possible.

ff123
Title: Statistics For Abx
Post by: shday on 2002-08-29 00:45:17
Quote
shday,
How did you calculate the confidence interval? With what formula? (I suppose you are using something like: 0.99 = NORMALDIST( (x-np)/sqrt(npq) ) where n=82, p=q=1/2, but that's a wild guess. )

I don't really understand what the interpretation of this interval is. Which probability are we investigating?

The CI was calculated from the standard deviation of the observed proportion of successes:

standard deviation = sigma = sqrt(pq/n) where p = 52/82, q = 1 - p, and n = 82

Then the CI was calculated assuming a normal distribution:

CI = p +/- 2.58 * sigma  (the 2.58 comes from the 99% confidence level. In Excel, NORMSINV(0.995) = 2.5758...)

CI = 0.634 +/- 2.58*0.053 = (0.497, 0.771)

As a rule of thumb, the assumption of a normal distribution can be considered adequate if:

(1/sqrt(n)) * |sqrt(q/p) - sqrt(p/q)| < 0.3

There are tables that give exact CI's for binomial distributions. Unlike the above approximation, they are never centred exactly at p (except when p = 0.5). If one were interested, there should be a way to calculate the exact CI's (no normal distribution assumption).

The 99% confidence interval given for 52/82 should be interpreted as follows: upon repeated tests, 99% of the intervals calculated this way will include the true value of p. This also means that, if the interval does not include p=0.5, there is a >99% probability the listener heard a difference (sort of).
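
The calculation above can be reproduced with a few lines of Python. This is only a sketch of the normal approximation described in this post; the function names are made up, and the rule-of-thumb check takes the absolute value of the expression, which is an assumption about how the rule is meant to be read.

```python
from math import sqrt
from statistics import NormalDist

def normal_ci(correct, n, confidence=0.99):
    """Normal-approximation confidence interval for the proportion correct."""
    p = correct / n
    q = 1 - p
    sigma = sqrt(p * q / n)
    # Two-sided interval: put half of (1 - confidence) in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p - z * sigma, p + z * sigma

def normal_approx_ok(correct, n):
    """Rule of thumb: (1/sqrt(n)) * |sqrt(q/p) - sqrt(p/q)| < 0.3."""
    if correct == 0 or correct == n:
        return False  # p or q is zero, so the approximation breaks down
    p = correct / n
    q = 1 - p
    return abs(sqrt(q / p) - sqrt(p / q)) / sqrt(n) < 0.3

low, high = normal_ci(52, 82)
print(f"52/82: 99% CI = ({low:.3f}, {high:.3f}), rule of thumb ok: {normal_approx_ok(52, 82)}")
```

For a perfect or near-perfect score the rule of thumb fails, and an exact binomial interval would be needed instead.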

Most of this stuff is new ground for me. It comes from "Statistics for Experimenters" by Box, Hunter and Hunter (1978).
Title: Statistics For Abx
Post by: shday on 2002-08-29 02:01:08
Quote
The best way to minimize this problem would be to take fewer "looks" at the results in progress.  For example, suppose I looked at the results after 7 trials, then 14, then 21, and 28, where 28 would be the maximum allowable trials before I stop the test altogether.  4 looks would mean that the nominal significance at each look would have to be about 0.016 to achieve an overall significance of 0.05 (according to table 1 at the MEDSEQ website).

It's possible that I could move the first look down to trial 6, and still be able to keep the nominal significance at 0.016 (although I'd have to simulate to make sure).  That would be the best case, because it would allow a forced ABX test to take place in which the minimum number of trials to achieve significance is set to something as low as reasonably possible.

ff123

IMO this seems like a simple and rigorous way to improve your tool. Also, the results may as well be kept hidden from the listener until the predetermined points. This has the additional advantage of keeping the trials more independent.
Title: Statistics For Abx
Post by: Continuum on 2002-08-29 08:22:20
I just did a quick calculation using a modified Pascal triangle and the 0.95 confidence points from my chart above. Basically I assumed a simplified version of situation 2: ABX trials are attempted by guessing, and the test is stopped when a confidence level of 0.95 is reached (using the traditional p-val method) or when 16 trials are completed.
The result: the probability to pass this ABX test is 0.08755, i.e. significantly more than 0.05. (If there's no mistake on my side, this is an exact value)
Title: Statistics For Abx
Post by: ff123 on 2002-08-29 08:59:49
I hacked a quick and dirty sequential ABX simulator with the ability to perform up to 6 "looks."  Here is the screen shot:

(http://ff123.net/export/seqsim.gif)

Each of the 5 looks in this example (max of 30 trials are allowed) has a nominal p less than 0.05, but the total p for the test is about 0.05.  You can download this simulator (don't increase Num Sims too much or you'll hang your computer!) at:

http://ff123.net/export/seqsim.zip

ff123

Edit:  The following looks and number correct distribute the alphas a little more evenly.

look = 6, numcorrect = 6, nomalpha = 0.016
look = 12, numcorrect = 10, nomalpha = 0.019
look = 18, numcorrect = 14, nomalpha = 0.015
look = 23, numcorrect = 17, nomalpha = 0.017
look = 28, numcorrect = 20, nomalpha = 0.018

total alpha = 0.050
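
As a cross-check on the simulation, the overall alpha of a look schedule like the one above can also be computed exactly by enumeration. The Python sketch below is not the simulator's code; the function names are hypothetical, and it assumes a guessing listener who passes at the first look where the required number correct is reached.

```python
from math import comb

def nominal_alpha(n, need):
    """One-sided chance of at least `need` correct in n guessed trials."""
    return sum(comb(n, k) for k in range(need, n + 1)) / 2**n

def look_schedule_alpha(looks):
    """Overall chance of passing at any look when guessing.

    looks: list of (trial_number, min_correct) with increasing trial numbers.
    """
    alive = {0: 1.0}   # correct count -> probability, for tests not yet passed
    done = 0
    total = 0.0
    for n, need in looks:
        step = n - done
        nxt = {}
        for c, pr in alive.items():
            for extra in range(step + 1):
                gain = pr * comb(step, extra) / 2**step
                nxt[c + extra] = nxt.get(c + extra, 0.0) + gain
        # Scores at or above the requirement pass now and leave the pool.
        total += sum(pr for c, pr in nxt.items() if c >= need)
        alive = {c: pr for c, pr in nxt.items() if c < need}
        done = n
    return total

looks = [(6, 6), (12, 10), (18, 14), (23, 17), (28, 20)]
for n, need in looks:
    print(n, need, round(nominal_alpha(n, need), 3))
print("overall alpha:", round(look_schedule_alpha(looks), 4))
```

Because the enumeration is exact, it can replace a long simulation run when tuning the look points.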
Title: Statistics For Abx
Post by: Continuum on 2002-08-29 12:14:55
I've written a program for evaluating the corrected p-val for situation 2. It takes the maximum allowed number of ABX trials and the required confidence (i.e. 1 minus the traditional p-val) as arguments and returns the probability of passing the test by guessing.

Example:
(4, 0.95) -> 0 (impossible)
(5, 0.95) -> 0.03125 (5/5 -> confidence = 1 - pval = .96875)
(6, 0.95) -> 0.03125 (either the test is won at 5/5 or lost)
(16, 0.95) -> 0.08755493164
(100, 0.95) -> 0.2020580977 (!!!)
(16, 0.99) -> 0.01422119141

The idea is:
Construct a pascal triangle up to a certain level.
Code: [Select]
    1    
    1    1
    1    2    1
    1    3    3    1
    1    4    6    4    1
    1    5    10    10    5    1
    ....

Now we can read it as follows: A(row=trials+1, column=correct+1) / 2^trials = P(abx=correct/trials | trials used) (e.g. the probability to score 2 times correct out of 3 is A(4, 3)/2^3 = 3/8)
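The confidence (1 - pval) is the sum of the probabilities of all items to the left of the chosen item in that row; the pval itself is the sum of the chosen item and everything to its right.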

My program does the following: look for the earliest win condition (e.g. 5/5), calculate its probability (e.g. 0.03125), set the corresponding item in the Pascal triangle to 0, and recalculate the following rows, e.g.:
   1   5   10   10   5   0
   1   6   15   20   15   5   0
   ....
Zeroing the item makes sure that nothing is counted twice. Now start again: the next win condition with nonzero probability is 7/8, with remaining probability P(abx=7/8 | not 5/5) = 0.0195..., and so on.

Here's the Maple source code, I hope it's readable (# starts a comment):
Code: [Select]
CorrPVal:=proc(n,reqConfidence,Prob)
local k, Trial, LastResult, Result, Confidence:

 Result:=array([1,seq(0,i=1..n)]):  # initialize [1,0,...,0]
 Prob:=0:

 for Trial from 1 to n do           # create new line of triangle
   LastResult:=copy(Result):        # only the last lines is required, so nothing
                                    # more is stored / copy to help variable
   Confidence:=0:
   Result[1]:=1:                    # set first element to 1

   k:=1:                            # now the rest
   Confidence:=Confidence+binomial(Trial,0)*1/2^Trial:

   while Confidence < eval(reqConfidence) and k <= Trial do
# check if target confidence is reached or all trials has been attempted
     Result[k+1]:=LastResult[k]+LastResult[k+1]:
# calculate new element of Pascal triangle
     Confidence:=Confidence+binomial(Trial,k)*1/2^Trial:
# increase Confidence (->more trials were correct)
     k:=k+1:
   end do:

   if k<=Trial then                 # winning condition
     Prob:=evalf(eval(Prob)+(LastResult[k]+LastResult[k+1])/2^Trial):
# add to sum of all winning probabilities
   end if:
 end do:
end proc:

# now follows the execution
# the result is stored in variable 'prob'
CorrPVal(16,0.95,prob):
prob;
# the result is displayed
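
For readers without Maple, the same idea can be sketched in Python. This is a re-implementation of the approach described above, not a line-by-line port, and the function name is made up. It keeps one row of the modified Pascal triangle and removes every item as soon as it wins.

```python
from math import comb

def corrected_p(max_trials, required_confidence):
    """Chance that a pure guesser ever reaches `required_confidence`
    (1 - traditional p-val) when the score is checked after every trial."""
    alpha = 1 - required_confidence
    alive = [1]      # alive[c]: guess sequences with c correct, not yet won
    passed = 0.0
    for n in range(1, max_trials + 1):
        # New row of the modified triangle: each survivor adds one
        # wrong answer (same c) or one right answer (c - 1 -> c).
        alive = [
            (alive[c] if c < n else 0) + (alive[c - 1] if c > 0 else 0)
            for c in range(n + 1)
        ]
        tail = 0
        for c in range(n, -1, -1):       # walk down from a perfect score
            tail += comb(n, c)           # sequences with at least c correct
            if tail / 2**n > alpha:
                break                    # this score no longer wins
            passed += alive[c] / 2**n    # first-time winners at trial n
            alive[c] = 0                 # remove them so nothing counts twice
    return passed

for args in [(4, 0.95), (5, 0.95), (6, 0.95), (16, 0.95), (16, 0.99)]:
    print(args, corrected_p(*args))
```

The counts are kept as integers and only divided by 2^n when a win is recorded, so the small cases reproduce the exact fractions from the example list.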
Title: Statistics For Abx
Post by: ff123 on 2002-08-29 15:43:14
I think I'll probably create several different typical ABX "profiles" for a listener to choose from.  For example, one of the profiles will be the 28 max trials case, using 5 looks into the progress (4 of them before the end) as shown in my last edited message.  This gives the listener 4 decision points to terminate early, but still meet the overall p.

The only problem I haven't figured out yet is what to do if the listener terminates the test in between the look points.

ff123

Edit:  Ok, I believe I can create an entire profile by using the simulator to pick enough points that I can create a sort of look up table, so that if the listener terminates in between look points, I can still tell if the overall p was met.  But basically, the in-between termination becomes an extra look point.  I'll construct such a table later.
Title: Statistics For Abx
Post by: ff123 on 2002-08-29 16:35:18
One more thing:  I'm not sure if it hurts to look at the progress of the first 5 trials for the 28-trial profile.

For example, outcome 1:  all 5 trials are correct.  The listener cannot terminate early and still have the overall p be less than 0.05.  Outcome 2:  all 5 trials are incorrect.  The listener still has a chance of getting 17 out of 23 or 20 out of 28 to pass the test.

But the reason I'm not sure is because the listener may form an estimate of his chances of succeeding and decide to terminate because of this information.

Oh, for the 28-trial profile, I probably ought to tell the listener that he's wasting his time if he gets more than 8 trials incorrect.

ff123
Title: Statistics For Abx
Post by: shday on 2002-08-29 17:10:07
Quote
One more thing:  I'm not sure if it hurts to look at the progress of the first 5 trials for the 28-trial profile.

If you want the tool to be statistically sound then don’t let the listener see the progress at all, even at the look points. What does it add to the test anyhow? You’ve now taken steps to ensure that the listener isn’t wasting time (very nice solution btw). As far as I can tell, saving wasted time was the only valid reason for allowing the listener to watch the progress in the first place. As I’ve said before, knowing the progress of the test compromises the independence of the trials and should be avoided if possible.

Quote
The only problem I haven't figured out yet is what to do if the listener terminates the test in between the look points.


You probably shouldn’t be too concerned about the listener quitting early. If someone does a test they should be encouraged to stick it out to the end rather than quitting. I see no harm in allowing the test to terminate when the listener makes 9 incorrect choices. This way the listener will have no good reason to quit. (If they do quit, then give them the p-val, with all its caveats, and be done with it!)
Title: Statistics For Abx
Post by: ff123 on 2002-08-29 18:27:11
Quote
If you want the tool to be statistically sound then don’t let the listener see the progress at all, even at the look points.


That could be another profile:  for example, mandatory 16 trials, no quitting early, only get results at the end.  BTW, the profile scheme with look points is just as sound statistically as the no-look profile, provided that the overall p comes out at 0.05 or below.

Quote
What does it add to the test anyhow? You’ve now taken steps to ensure that the listener isn’t wasting time (very nice solution btw). As far as I can tell, saving wasted time was the only valid reason for allowing the listener to watch the progress in the first place.


Yes, that's the whole idea of this exercise.  I want to make it as convenient as possible for the listener to complete a valid ABX session.  A no-look ABX test can be a huge time-waster.

I think I have the "terminate-in-between-look-points" problem solved, and I think I'll allow the listener to see the first 5 trials in the 28-trial profile (in addition to the look-point at trial 6).

Are there any other profiles that might be useful?  Perhaps a very large profile, like 60 max trials?  Although I'm not sure who would actually use such a profile.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-08-29 20:11:44
Quote
The best way to minimize this problem would be to take fewer "looks" at the results in progress. For example, suppose I looked at the results after 7 trials, then 14, then 21, and 28, where 28 would be the maximum allowable trials before I stop the test altogether. 4 looks would mean that the nominal significance at each look would have to be about 0.016 to achieve an overall significance of 0.05 (according to table 1 at the MEDSEQ website).

Until now I was busy calculating the corrected p-vals where the user is allowed to look at his results after every trial, but it shouldn't be too difficult to modify my source code to incorporate certain look-points (which probably is the best solution).
This way, we could calculate exact values. Furthermore, the algorithm is acceptably fast (polynomial), so I think it's quite possible to integrate a real-time calculation into your program.
Quote
Edit: Ok, I believe I can create an entire profile by using the simulator to pick enough points that I can create a sort of look up table, so that if the listener terminates in between look points, I can still tell if the overall p was met. But basically, the in-between termination becomes an extra look point. I'll construct such a table later.


BTW: Here is the same program as above in Excel VBA (version 95), for those infidels: http://www.freewebz.com/aleph/CorrPVal.xls

Quote
The only problem I haven't figured out yet is what to do if the listener terminates the test in between the look points.

This might be a very difficult question. Allowing the listener to choose the time of termination will increase complexity.

Quote
One more thing: I'm not sure if it hurts to look at the progress of the first 5 trials for the 28-trial profile.

Theoretically, it shouldn't hurt, as the gained information is of no value to a guessing test person.

Quote
But the reason I'm not sure is because the listener may form an estimate of his chances of succeeding and decide to terminate because of this information.

Quote
You probably shouldn’t be too concerned about the listener quitting early.

I agree with shday here. If the listener still believes he can hear a difference, he will continue the test anyway. If not, I don't think he would be able to ABX something he can't hear.

Quote
That could be another profile: for example, mandatory 16 trials, no quitting early, only get results at the end. BTW, the profile scheme with look points is just as sound statistically as the no-look profile, provided that the overall p comes out at 0.05 or below.

Yes, profiles seem to be a good way to satisfy everyone.

Quote
Are there any other profiles that might be useful? Perhaps a very large profile, like 60 max trials? Although I'm not sure who would actually use such a profile.

Garf, for example. And I myself used long runs before.
Title: Statistics For Abx
Post by: ff123 on 2002-08-29 20:25:17
Here is the lookup table I would use for the 28-trial profile:

*0 wrong: at least 6 of 6 (can't have fewer than 6 trials with 0 wrong)
1 wrong: at least 9 of 10 (can't have fewer than 10 trials with 1 wrong)
*2 wrong: at least 10 of 12
3 wrong: at least 13 of 16
*4 wrong: at least 14 of 18
5 wrong: at least 17 of 22
*6 wrong: at least 17 of 23
7 wrong: at least 19 of 26
*8 wrong: at least 20 of 28

Notes:
* = look points
1. overall test significance is 0.05
2. listener is not allowed to perform ABX trials past the max of 28.
3. listener is allowed to see trials 1 through 5 in addition to the early-decision look points
4. ABX is terminated if listener gets 9 or more trials wrong.
5. listener can terminate at any time, with overall results taken from the above table.

ff123
Title: Statistics For Abx
Post by: shday on 2002-08-29 22:21:34
Quote
Quote
One more thing:  I'm not sure if it hurts to look at the progress of the first 5 trials for the 28-trial profile.

If you want the tool to be statistically sound then don’t let the listener see the progress at all, even at the look points. What does it add to the test anyhow? You’ve now taken steps to ensure that the listener isn’t wasting time (very nice solution btw). As far as I can tell, saving wasted time was the only valid reason for allowing the listener to watch the progress in the first place. As I’ve said before, knowing the progress of the test compromises the independence of the trials and should be avoided if possible.


Hmm, I'm quoting myself here because I'd like to restate my comment. Basically, what I'm trying to say is that the look points solve the problem of wasting time, in a statistically sound manner. The listener doesn't have to "look" for there to be look points because the program takes care of it. It is therefore unnecessary for the listener to know how he is performing *during* the test... other than to satisfy his curiosity (perhaps this point is debatable).

It is true that, if the listener cannot hear a difference, his knowing the progress will not change anything. But on the other hand, what if he does hear a difference (p>0.5), but fails the first few trials? I think his attitude toward the test could change. This is why I think seeing the results in progress may make the test less statistically sound. It just seems that keeping the listener as "blinded" as possible is the best thing to do.

So my question now is: how does allowing the listener to track his progress help the test? (curiosity is one reason, but are there others that I'm missing?)

edit : obviously if the listener reaches the desired confidence at a look point the test would terminate and he would be able to "look" at the results then
Title: Statistics For Abx
Post by: shday on 2002-08-29 23:31:47
Quote
Are there any other profiles that might be useful?

Perhaps a traditional 12/16 test with a look point at 6/6. This still gives 95% confidence. It could also terminate if more than 4 incorrect choices were made.
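
The 95% figure can be checked by conditioning on what happened at the first look. What follows is only a minimal sketch, assuming a pure guesser (p = 0.5) and a test that stops as soon as 6 of 6 is reached; the helper name two_look_alpha is illustrative and does not come from any ABX tool.

```python
from fractions import Fraction
from math import comb

def two_look_alpha(n1, k1, n2, k2):
    """Chance that a guesser passes at look 1 (>= k1 of n1) or, failing that,
    at look 2 (>= k2 of n2). Illustrative helper only."""
    # Sequences that already pass at the first look.
    first = Fraction(sum(comb(n1, c) for c in range(k1, n1 + 1)), 2 ** n1)
    # Sequences that miss at look 1 (c1 < k1) but still reach k2 by look 2.
    later = 0
    for c1 in range(k1):
        for extra in range(max(k2 - c1, 0), n2 - n1 + 1):
            later += comb(n1, c1) * comb(n2 - n1, extra)
    return first + Fraction(later, 2 ** n2)

alpha = two_look_alpha(6, 6, 16, 12)
print(alpha, float(alpha))  # 3155/65536, about 0.0481
```

For comparison, a single look at 12 of 16 has an alpha of 2517/65536, about 0.0384, so under these assumptions the extra look at 6/6 costs roughly one percentage point and the design stays below 0.05.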
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 00:20:25
Quote
Quote

Perhaps a traditional 12/16 test with a look point at 6/6. This still gives 95% confidence. It could also terminate if more than 4 incorrect choices were made.


That's a possibility, although the difference between this and the 28-trial profile seems kind of small.  On the 28-trial profile, one gets a look at 18 with 4 allowable wrong guesses.  The minimum number of trials to achieve a significant result is still 6 of 6.  Also, in a 28-trial profile, there's the distinct advantage of being able to go all the way up to 28 trials if necessary, whereas a 16-trial profile will terminate the test at 16 no matter what.

To answer your other point, there is no statistical advantage to being able to look at progress.  This is purely driven by convenience and time savings.  However, if I can make it easier and faster to perform ABX trials with only a slight cost in power, I think that's a good tradeoff.

ff123
Title: Statistics For Abx
Post by: shday on 2002-08-30 00:51:06
Quote
To answer your other point, there is no statistical advantage to being able to look at progress.  This is purely driven by convenience and time savings.  However, if I can make it easier and faster to perform ABX trials with only a slight cost in power, I think that's a good tradeoff.

Sorry if I seem to be pressing this... but what are the time savings of allowing the listener to track his progress? How does it make the test easier and faster? (I'm not talking about the automated look points here, they are a good trade-off. I'm just referring to the listener being able to see his score all, or part of, the time. I think this introduces potential, though probably not very serious, problems).

About the 6/6, 12/16 design... I guess it could be useful if one were not interested in going beyond 16 trials... for whatever reason. It's also kind of familiar territory.
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 01:46:39
Quote
Sorry if I seem to be pressing this... but what are the time savings of allowing the listener to track his progress? How does it make the test easier and faster? (I'm not talking about the automated look points here, they are a good trade-off. I'm just referring to the listener being able to see his score all, or part of, the time. I think this introduces potential, though probably not very serious, problems).

The time savings arise because at each look point you can decide whether or not to stop the test early.  With a strictly fixed test, the listener isn't allowed to know the results until all the trials have been completed.

I understand the concern you have about looking at the progress.  The listener may decide to stop before he has a chance to pick out differences, based on his estimate on how the test is likely to continue.  That's what the Bayesian method of sequential testing (the double lines method) was supposed to do, except more rigorously.  That method doesn't put a cap on the number of trials.

However, there is an unspoken assumption I'm making about the ABX test:  I don't really care if the listener decides to stop early when he could continue on.  I really only care about the listener not claiming to hear a difference when there isn't one.  I.e., I am completely ignoring type II errors.  This is probably not the right approach for a generic ABX test, though, in which the listener may be interested in testing for similarity, not just difference.

Maybe the Bayesian method could still be appropriate for my purposes (ie., my goal is to make it as easy as possible for a listener to perform a valid ABX test for differences) if I can refine the lines into something which allows me to get significant results at something less than 9 trials minimum.  I haven't really looked closely at this method, though.

The method I am proposing to use is the frequentist method, and in order to calculate (simulate) it, I need to know the max number of trials allowed.  This has the significant advantage of pushing the minimum number of trials necessary to get significant results down to 6.

ff123
Title: Statistics For Abx
Post by: shday on 2002-08-30 02:35:04
I guess most of the time a tester wouldn't be influenced in the way I fear. I'm probably placing more emphasis on it than it warrants. The proof of this can be found in the way I sometimes do an ABX test. I move the window somewhere so that the running score is hidden! Now it looks like I'll be able to continue doing this while not wasting as much time, thanks to ABX/HR

Quote
The time savings arise because at each look point you can decide whether or not to stop the test early.  With a strictly fixed test, the listener isn't allowed to know the results until all the trials have been completed.


I'd still argue that the same time savings could be achieved by allowing the software to deal with the look points automatically. Once a look point was reached the test would either terminate (because the desired confidence was reached) or it would go on as if nothing had happened. The listener would not have to know his exact score. This is indeed different from a strictly fixed test where the listener isn't allowed to know the results until all the trials have been completed... but not by much.

Maybe it could be an option?
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 03:20:33
Quote
I'd still argue that the same time savings could be achieved by allowing the software to deal with the look points automatically. Once a look point was reached the test would either terminate (because the desired confidence was reached) or it would go on as if nothing had happened. The listener would not have to know his exact score. This is indeed different from a strictly fixed test where the listener isn't allowed to know the results until all the trials have been completed... but not by much.

It doesn't matter at all whether the decision to terminate early is made automatically or by the listener.  The simulation I wrote always terminates at a look point when it is appropriate to do so, just like an automated process would do!

It's purely the fact that early termination is allowable at all which affects the overall pval.  What happens is a subtle form of "cherry picking."  That is, the program doesn't stop at a random point, but instead only when it's advantageous to do so.  That's what causes the problem.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-08-30 09:40:34
Here's a new version of the Excel-sheet. It allows specifying look-up points, and to each point a corresponding required p-val (nominal alpha). Result is the total alpha.
http://www.freewebz.com/aleph/CorrPVal2.xls (http://www.freewebz.com/aleph/CorrPVal2.xls)

Quote
Here is the lookup table I would use for the 28-trial profile:

*0 wrong: at least 6 of 6 (can't have fewer than 6 trials with 0 wrong)
1 wrong: at least 9 of 10 (can't have fewer than 10 trials with 1 wrong)
*2 wrong: at least 10 of 12
3 wrong: at least 13 of 16
*4 wrong: at least 14 of 18
5 wrong: at least 17 of 22
*6 wrong: at least 17 of 23
7 wrong: at least 19 of 26
*8 wrong: at least 20 of 28

Notes:
* = look points
1. overall test significance is 0.05

The accurate value appears to be 0.05080.. (according to my program/calculation).

Quote
2. listener is not allowed to perform ABX trials past the max of 28.
3. listener is allowed to see trials 1 through 5 in addition to the early-decision look points
4. ABX is terminated if listener gets 9 or more trials wrong.
5. listener can terminate at any time, with overall results taken from the above table.

The last point is a little dubious to me. But it shouldn't affect the results too much.
Title: Statistics For Abx
Post by: Continuum on 2002-08-30 12:52:28
Quote
It is true that, if the listener cannot hear a difference, his knowing the progress will not change anything.

Of course, you are only talking about the first 5 trials?

Quote
So my question now is: how does allowing the listener to track his progress help the test? (curiosity is one reason, but are there others that I'm missing?)

On difficult samples, I like to know if my efforts are enough. If my score is not as good as it should be, I can try to listen more carefully (but causing more fatigue). For me this information is very useful!
Title: Statistics For Abx
Post by: Continuum on 2002-08-30 12:54:26
This should be a mode that allows 5/5 with total significance = 0.049567:
  at least  5 of  5
  at least  10 of  12
  at least  15 of  19
  at least  17 of  22
Not that different from the 28 profile (though shorter), but with 5/5 possibility. Might be good for finding obvious differences.
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 13:27:41
Quote
This should be a mode that allows 5/5 with total significance = 0.049567:
 at least  5 of  5
 at least  10 of  12
 at least  15 of  19
 at least  17 of  22
Not that different from the 28 profile (though shorter), but with 5/5 possibility. Might be good for finding obvious differences.

Thanks for the spreadsheet.  Hopefully I can incorporate the calculations into abchr instead of running a mini-simulation each time I perform a look.

The 22-trial version has an interesting property:  The nominal alphas are not spread evenly, but get tighter as the test progresses:

5 of 5:  0.031
10 of 12:  0.019
15 of 19:  0.010
17 of 22:  0.008

How about something like the following, where the last look point is also spaced 6 trials from the next-to-last look point, instead of only 3 trials.

5 of 5:  0.031
10 of 12:  0.019
15 of 19: 0.010
19 of 25: 0.007

overall p: 0.049

What are the implications of having a test which gets stricter as it progresses?

By comparison, the alpha spreading for the 28-trial version is more even:

6 of 6: 0.016
10 of 12:  0.019
14 of 18:  0.015
17 of 23:  0.017
20 of 28:  0.018
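
The nominal alphas listed above are plain one-sided binomial tails for a guessing listener. As a rough check, here is a short sketch (the function name nominal_alpha is just a label chosen here; math.comb needs Python 3.8 or later):

```python
from math import comb

def nominal_alpha(correct, trials):
    """One-sided binomial tail: P(at least `correct` of `trials` right by guessing)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Look points of the 22/25-trial proposals, then those of the 28-trial profile.
for correct, trials in [(5, 5), (10, 12), (15, 19), (17, 22), (19, 25),
                        (6, 6), (14, 18), (17, 23), (20, 28)]:
    print(f"{correct} of {trials}: {nominal_alpha(correct, trials):.3f}")
```

These are per-look values only; the overall alpha of a profile is smaller than their sum, because a session that already passed at an early look cannot pass again later.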

ff123
Title: Statistics For Abx
Post by: shday on 2002-08-30 16:04:00
Quote
On difficult samples, I like to know if my efforts are enough. If my score is not as good as it should be, I can try to listen more carefully (but causing more fatigue). For me this information is very useful!

Now I see the point. Being able to see your results can increase your chances of passing the test because you can try harder if needed.  At first I was averse to this because you are effectively manipulating (attempting to increase) p during the test. But upon further reflection this seems to be irrelevant to the statistics.

Perhaps some users would not treat the running score as you do, resulting in an effective lowering of p… which would be a problem. I suspect that most users would be knowledgeable enough to avoid this so that in practice it should not be an issue.
Title: Statistics For Abx
Post by: shday on 2002-08-30 16:12:56
Quote
Quote
It is true that, if the listener cannot hear a difference, his knowing the progress will not change anything.

Of course, you are only talking about the first 5 trials?

If the true value is p=0.5, then it doesn't matter how much the listener knows; he will always be guessing. That's all I was trying to say.
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 17:19:48
Quote
The accurate value appears to be 0.05080.. (according to my program/calculation).

Hmm, I can't verify this using my simulator.  I made the total alpha precise to 4 digits and increased the simulations to 1 million, but come up with 0.0496.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-08-30 17:36:10
Quote
How about something like the following, where the last look point is also spaced 6 trials from the next-to-last look point, instead of only 3 trials.

5 of 5: 0.031
10 of 12: 0.019
15 of 19: 0.010
19 of 25: 0.007

Yes, this is better. I missed that one.

Quote
Hmm, I can't verify this using my simulator. I made the total alpha precise to 4 digits and increased the simulations to 1 million, but come up with 0.0496.

Hmmm. There seems to be a little inaccuracy somewhere. Can you post the relevant part of your source, so that we can see if the programs are based on slightly different assumptions?
Or maybe there is a little mistake somewhere, although I'm quite sure that the idea behind it is correct.
Is the Excel code readable for you, or should I explain it a bit more?
Title: Statistics For Abx
Post by: Continuum on 2002-08-30 17:40:44
Quote
Quote
Quote
It is true that, if the listener cannot hear a difference, his knowing the progress will not change anything.

Of course, you are only talking about the first 5 trials?

If the true value is p=0.5, then it doesn't matter how much the listener knows; he will always be guessing. That's all I was trying to say.

It does matter if he can stop the test when it's advantageous to him. In fact, a guessing listener could pass any traditional ABX test with probability 1 if he takes enough trials.

But maybe I understood you wrong?
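
To put rough numbers on the stop-when-it's-advantageous effect, here is a small sketch. It assumes a guesser who checks the ordinary one-sided binomial p-value after every trial and stops the first time it is at or below 0.05; the function names are made up for this illustration.

```python
from fractions import Fraction
from math import comb

def pass_threshold(trials, alpha):
    """Smallest score whose one-sided p-value is <= alpha, or None if none is."""
    tail, best = 0, None
    for k in range(trials, -1, -1):
        tail += comb(trials, k)
        if Fraction(tail, 2 ** trials) > alpha:
            break
        best = k
    return best

def prob_ever_passing(max_trials, alpha=Fraction(1, 20)):
    """Chance that a guesser sees p <= alpha at least once within max_trials."""
    row, prob = [1], Fraction(0)   # row[c] = paths with c correct, not yet stopped
    for trial in range(1, max_trials + 1):
        new_row = [0] * (trial + 1)
        for c, paths in enumerate(row):
            new_row[c] += paths        # next answer wrong
            new_row[c + 1] += paths    # next answer right
        k = pass_threshold(trial, alpha)
        if k is not None:
            for c in range(k, trial + 1):
                prob += Fraction(new_row[c], 2 ** trial)
                new_row[c] = 0         # the listener stops here and claims success
        row = new_row
    return prob

for n in (5, 8, 16, 50, 100):
    print(n, float(prob_ever_passing(n)))
```

With a cap of 5 trials the chance is 1/32; by 8 trials it is already 13/256, just above 0.05, and it can only grow as the cap is raised.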
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 17:52:03
I have uploaded an updated binary to:
http://ff123.net/export/seqsim.zip (http://ff123.net/export/seqsim.zip)

and the source code to:
http://ff123.net/export/seqsimsource.zip (http://ff123.net/export/seqsimsource.zip)

The relevant portion of the code is in seqsimDlg.cpp in the function called OnRunsim().  But in a nutshell, I run N number of simulations of a 28 total-trial ABX session.  At each look point, including the 28th trial, I count the number of times that the number of correct answers equals or exceeds the specified entry at that look point.  I call this a "hit."  If I get a hit at a look point, I terminate and go on to the next simulation run.  Then I count all the hits and divide by the number of simulations to get the total alpha.
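
The procedure described above can be sketched in a few lines of Python. This is not the code from seqsimDlg.cpp, only an illustration of the same idea; the function name, the (trial, required correct) tuple format and the use of Python's random module are all assumptions made here.

```python
import random

def simulate_total_alpha(look_points, runs, seed=None):
    """Fraction of simulated guessing sessions that score a "hit" at any look point.

    look_points: list of (trial, required_correct) pairs, e.g. (12, 10).
    """
    rng = random.Random(seed)
    looks = sorted(look_points)
    hits = 0
    for _ in range(runs):
        correct = 0
        trial = 0
        for look_trial, needed in looks:
            while trial < look_trial:          # guess up to the next look point
                correct += rng.getrandbits(1)  # 1 = correct, 0 = wrong
                trial += 1
            if correct >= needed:              # hit: stop this session early
                hits += 1
                break
    return hits / runs

profile_28 = [(6, 6), (12, 10), (18, 14), (23, 17), (28, 20)]
print(simulate_total_alpha(profile_28, runs=100_000, seed=123))
```

With a single look point the estimate should approach the ordinary binomial tail, for example 1/64 for 6 of 6, which gives a simple sanity check on the generator.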

I might need some explaining on the macros in your spreadsheet.  The only thing I can think of right now is that there is a rounding error in the calculation (there are a lot of sums in the calculation).  From this standpoint, the simulation should be more accurate.

I also verified that the simulation gives close (but not exact!) agreement with my binomial calculations if I only have one look point.

Any thoughts on the non-even spreading of the alpha error?

ff123
Title: Statistics For Abx
Post by: shday on 2002-08-30 17:57:41
Quote
It does matter if he can stop the test when it's advantageous to him. In fact, a guessing listener could pass any traditional ABX test with probability 1 if he takes enough trials.

Agreed!

One thing that this discussion reinforces for me is the caution one should use when interpreting p-values, for any statistical test, not just ABX.
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 19:54:46
I've been comparing the simulation vs. the binomial calculation and I see a difference that's not coming from roundoff error (I changed all doubles to long doubles, or 80 bits), and I set the simulation size to 10 million trials.  I also removed an approximation I was making with the binomial calculation (not summing values if they were less than 0.0001).

Here is a graph of the difference, and the absolute value of the difference in the resulting pvalues for a 20 trial ABX session:

(http://ff123.net/export/simvsbinomial.gif)

This is pretty weird, and I can't explain what's going on.

ff123

Edit:  anyway, there doesn't seem to be any reason to believe that the simulation would produce an oscillating effect like that, so I have to think that this is an artifact of the binomial calculation!

Edit2:  This was an artifact of the random number generator and/or calculation I was using.  I fixed this problem
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 21:42:00
Quote
Quote
5. listener can terminate at any time, with overall results taken from the above table.

The last point is a little dubious to me. But it shouldn't affect the results too much.

Yes, it would seem that the most advantageous places to stop an ABX test would be at the look points, and that the most advantageous look points to stop at would be the ones with the highest nominal alpha risks.  In the 28-trial, the best look-point to stop at would be the one at trial 12.  The worst stopping point would be an in-between early termination at trial 22, where the listener is required to get 17 correct (nominal alpha = 0.0085).

However, I'm thinking of the listener again.  If he wants to stop in between look points, it should be fine, but he's going to pay a small penalty for that.

In your 22 or 25-trial version, the best stopping look point is the first one (trial 5).  From there, it gets progressively harder to achieve a significant result.
Title: Statistics For Abx
Post by: Continuum on 2002-08-30 21:44:43
Quote
But in a nutshell, I run N number of simulations of a 28 total-trial ABX session. At each look point, including the 28th trial, I count the number of times that the number of correct answers equals or exceeds the specified entry at that look point. I call this a "hit." If I get a hit at a look point, I terminate and go on to the next simulation run. Then I count all the hits and divide by the number of simulations to get the total alpha.

Exactly what it should be. The randomization routine is beyond doubt, I guess?

Quote
The only thing I can think of right now is that there is a rounding error in the calculation (there are a lot of sums in the calculation). From this standpoint, the simulation should be more accurate.

Yes, this would explain why the results are close, but not the same.

Quote
Any thoughts on the non-even spreading of the alpha error?

Theoretically, it shouldn't be a problem. The calculated/simulated total alpha is significant. Intuitively, it takes into account that a listener that was wrong in the first trials is less to be trusted. So to speak, it gives the unknown/beginning user a little bonus.

Quote
I might need some explaining on the macros in your spreadsheet.

I'll write more commentary later.

Quote
Edit: anyway, there doesn't seem to be any reason to believe that the simulation would produce an oscillating effect like that, so I have to think that this is an artifact of the binomial calculation!

??? What do you mean?!
Here are accurate values of alphas (again from Maple):
Code: [Select]
>alpha:=(correct,trials)->evalf[25](sum(binomial(trials,k)*1/2^trials,k=correct..trials));
>for i from 0 to 20 do
>   alpha(i,20);
> end do;
1.
.9999990463256835937500000
.9999799728393554687500000
.9997987747192382812500000
.9987115859985351562500000
.9940910339355468750000000
.9793052673339843750000000
.9423408508300781250000000
.8684120178222656250000000
.7482776641845703125000000
.5880985260009765625000000
.4119014739990234375000000
.2517223358154296875000000
.1315879821777343750000000
.05765914916992187500000000
.02069473266601562500000000
.005908966064453125000000000
.001288414001464843750000000
.0002012252807617187500000000
.00002002716064453125000000000
.9536743164062500000000000*10^-6
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 22:10:20
Quote
Quote
Edit: anyway, there doesn't seem to be any reason to believe that the simulation would produce an oscillating effect like that, so I have to think that this is an artifact of the binomial calculation!

??? What do you mean?!
Here are accurate values of alphas (again from Maple):
Code: [Select]
>alpha:=(correct,trials)->evalf[25](sum(binomial(trials,k)*1/2^trials,k=correct..trials));

I don't doubt the precision of the calculation (after all, I used 80 bits to represent a floating point number).  But if there is little or no roundoff error in the binomial calculation, and the simulation error is small enough (should be with 10 million trials), then I trust the simulation over the calculation as the more accurate one.

As I said, the oscillation of the difference between the simulation and the calculation is very suspicious.  And I don't see how it could have come from the simulation.

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 22:17:37
I've been thinking some more about the in-between-look terminations.  Since the listener cannot make a decision to continue the test after he terminates it, I think I have calculated things wrong.  For example, if the listener gets a look at trial 6, but then stops at trial 8, then all the other looks at trial 12, 18, 23, and 28 should not be counted towards the overall alpha.

Also, I think I need to rethink the looks at trials 1 through 5.

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 22:51:19
Ok, here is my corrected 28-trials profile

Code: [Select]
10 million simulations using the corrected random number generator
trial  required  total    look
       correct   alpha    point?  looks taken
 5      5        0.0313   no      no looks
 6      6        0.0491   yes     look at trial 6
 7      7        0.0156   no      look at trial 6
 8      7        0.0390   no      look at trial 6
 9      8        0.0273   no      look at trial 6
10      9        0.0214   no      look at trial 6
11      9        0.0406   no      look at trial 6
12     10        0.0491   yes     look at trial 6, 12
13     11        0.0295   no      look at trial 6, 12
14     11        0.0417   no      look at trial 6, 12
15     12        0.0356   no      look at trial 6, 12
16     13        0.0326   no      look at trial 6, 12
17     13        0.0424   no      look at trial 6, 12
18     14        0.0491   yes     look at trial 6, 12, 18
19     14        0.0495   no      look at trial 6, 12, 18
20     15        0.0430   no      look at trial 6, 12, 18
21     16        0.0399   no      look at trial 6, 12, 18
22     16        0.0487   no      look at trial 6, 12, 18
23     17        0.0491   yes     look at trial 6, 12, 18, 23
24     18        0.0435   no      look at trial 6, 12, 18, 23
25     18        0.0490   no      look at trial 6, 12, 18, 23
26     19        0.0462   no      look at trial 6, 12, 18, 23
27     20        0.0449   no      look at trial 6, 12, 18, 23
28     20        0.0491   no      look at trial 6, 12, 18, 23


Notes:
1.  No looks allowed for trials 1 through 5
Title: Statistics For Abx
Post by: ff123 on 2002-08-30 23:37:33
I updated seqsimsource.zip and seqsim.zip on my website with the same random number generator used by bootstrap.exe.  This seems to yield the same results as plain old rand(), though.

ff123

Eek!  I take it back.  The results for 10 of 20 now agree to 4 decimal places using the new random number generator and 10 million trials!

But going back to the 28-trial case, I still get 0.0491.  So there was something wrong with my random numbers, but there must still be something wrong with your calculation (could still be roundoff errors).
Title: Statistics For Abx
Post by: shday on 2002-08-31 00:01:25
Quote
As I said, the oscillation of the difference between the simulation and the calculation is very suspicious.  And I don't see how it could have come from the simulation.

The calculated numbers shown above by Continuum are the right values. (It would be extremely unlikely for Maple to be wrong here anyhow).

This may be a stupid question. But when you run the simulation multiple times what is the variation? Could it account for the differences?
Title: Statistics For Abx
Post by: ff123 on 2002-08-31 00:31:25
Quote
Quote
As I said, the oscillation of the difference between the simulation and the calculation is very suspicious.  And I don't see how it could have come from the simulation.

The calculated numbers shown above by continuum are the right values. (It would be extremely unlikely for Maple to be wrong here anyhow).

This may be a stupid question. But when you run the simulation multiple times what is the variation? Could it account for the differences?

I corrected the oscillation (it was the simulation!) using a different random number generator.

I ran 19 cases of the 28-trial profile, using 1 million trials each:

Average = 0.04918
std dev = 0.0002
std error of mean = 0.00005
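
These spreads are about what binomial sampling alone predicts, which bears on the question of whether run-to-run variation could account for the gap to the calculated 0.0508. A back-of-the-envelope sketch, using only the figures quoted in this thread:

```python
from math import sqrt

p_hat = 0.04918        # average over the 19 runs quoted above
n_per_run = 1_000_000  # simulated sessions per run
n_runs = 19

sd_run = sqrt(p_hat * (1 - p_hat) / n_per_run)  # expected spread of a single run
sem = sd_run / sqrt(n_runs)                     # standard error of the 19-run mean
gap = 0.05080 - p_hat                           # distance to the calculated value

print(f"expected sd per run : {sd_run:.5f}")
print(f"expected std error  : {sem:.6f}")
print(f"gap in std errors   : {gap / sem:.0f}")
```

The predicted values are close to the observed 0.0002 and 0.00005, and the gap comes out at roughly 30 standard errors, so sampling noise in the simulation is an unlikely explanation for the disagreement.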

Just to be sure we're still on the same page regarding the 28-trial profile, here it is again:

required correct at look points:

6 of 6
10 of 12
14 of 18
17 of 23
20 of 28
Title: Statistics For Abx
Post by: shday on 2002-08-31 02:24:00
Quote
Any thoughts on the non-even spreading of the alpha error?

I can't see any problem with it. Maybe there are some subtle effects on the chances of type-2 errors. You could run some simulations to check that out.

My thought is that you shouldn't limit the design to evenly spread alpha errors. Arguably, the overall Pr(type-1 error) is all you really need to constitute a valid significance test anyhow.
Title: Statistics For Abx
Post by: ff123 on 2002-08-31 02:40:51
Quote
Quote
Any thoughts on the non-even spreading of the alpha error?

I can't see any problem with it. Maybe there are some subtle effects on the chances of type-2 errors. You could run some simulations to check that out.

My thought is that you shouldn't limit the design to evenly spread alpha errors. Arguably, the overall Pr(type-1 error) is all you really need to constitute a valid significance test anyhow.

The problem I see with a non-even alpha spreading is that the listener will probably not be aware of what's going on, i.e., why the test gets more difficult (in the case of Continuum's proposed profile) as it progresses.

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-08-31 03:16:51
Investigating the rand() error a little further, the problem isn't in the rand() function itself, but in the way I used it.

To get a random 0 or a 1, one proper way to do it is:

response = (int)((2)*(rand()/(RAND_MAX+1.0)));

But I coded it as:

response = rand() % 2;

It's interesting that the latter method doesn't work, because it would seem like there's an equal number of 0's and 1's.  Oh well, chalk it up to yet another thing I don't understand.

ff123

Edit:  the only thing I can think of is that Microsoft's rand() doesn't generate an equal number of even and odd numbers!  Well, another lesson learned.
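
One possible explanation, which nothing in this thread confirms: if rand() is a linear congruential generator with a power-of-two modulus, its low-order output bit repeats with a short period even though zeros and ones are perfectly balanced. The sketch below uses a toy generator. The constants 214013 and 2531011 and the 15-bit output taken from bits 16..30 are assumptions modelled on what is commonly reported for Microsoft's C runtime, not something taken from the seqsim source.

```python
def toy_rand_stream(seed, count):
    """Toy LCG mod 2**32 returning bits 16..30 of the state (assumed layout)."""
    state = seed
    out = []
    for _ in range(count):
        state = (state * 214013 + 2531011) % 2 ** 32
        out.append((state >> 16) & 0x7FFF)
    return out

period = 2 ** 17
values = toy_rand_stream(seed=1, count=2 * period)
low_bits = [v % 2 for v in values]   # what rand() % 2 would use

# Zeros and ones are exactly balanced over one cycle...
print(sum(low_bits[:period]), "ones in", period, "draws")  # 65536 ones in 131072 draws
# ...but the whole 0/1 sequence repeats after 2**17 draws.
print(low_bits[:period] == low_bits[period:])              # True
```

Under that assumption, a run of millions of simulated sessions can contain at most 131072 distinct sessions (one per starting offset in the cycle), so the estimate stops improving long before the nominal sample size is reached. Scaling rand() by RAND_MAX+1.0 relies on the high-order bits instead, which in such a generator have a much longer period.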
Title: Statistics For Abx
Post by: Continuum on 2002-08-31 09:40:14
Quote
But going back to the 28-trial case, I still get 0.0491. So there was something wrong with my random numbers, but there must still be something wrong with your calculation (could still be roundoff errors).
There are definitely no rounding errors. I executed the LookPVal algorithm in Maple and got the exact result:
Code: [Select]
>LookPVal(28, array([0.95, 0.95, 0.98, 0.98,0.98]), array([6, 12, 18, 23,28]));
> evalf[40](%);
                              1704631
                              --------
                              33554432

             .05080196261405944824218750000000000000000

(Excel was
Code: [Select]
at least  6 of  6
at least  10 of  12
at least  14 of  18
at least  17 of  23
at least  20 of  28
5,08019626140594E-02
)
Still, there could be a logical flaw somewhere. (Then again, I would be surprised by the close results.)


Quote
1. No looks allowed for trials 1 through 5

Why not? Statistically, they are irrelevant, but they are helpful for people like me (see my previous post).
Title: Statistics For Abx
Post by: Continuum on 2002-08-31 09:44:33
First Step: Interpret the Pascal triangle as ABX results

Definition: The Pascal triangle A (up to degree n) is an n*n matrix where the upper triangle is filled with 0's, whereas the first column is filled with 1's. The remaining elements are calculated as follows: A_i,j := A_i-1,j-1 + A_i-1,j (i for row, j for column).

1  0  0  0  0  0
1  1  0  0  0  0
1  2  1  0  0  0
1  3  3  1  0  0
1  4  6  4  1  0
1  5 10 10  5  1
.............


Now consider an ABX test: With one trial, there are two possible outcomes, 0/1 and 1/1. Each has a probability of 1/2.
With two trials there are four results (correct-correct, correct-false, false-correct, false-false). These can be simplified to three cases: 0/2, 1/2 and 2/2. The corresponding probabilities are 1/4, 1/2 and 1/4, or in other words, 1/2^2, 2/2^2 and 1/2^2.

Theorem: The probability of scoring exactly k correct results out of n ABX trials when guessing, written P(k/n), is A_n+1,k+1 / 2^n.
Proof (by induction): The statement is correct for n=1.
Assume the statement holds for n-1 ABX trials. Then
P(k/n) = P(k/(n-1))*P(0/1)      +  P((k-1)/(n-1))*P(1/1) =
       = A_n,k+1 /2^(n-1) * 1/2  +  A_n,k /2^(n-1) * 1/2 =
       = A_n+1,k+1 /2^n
holds as well. qed.


Second Step: Add look-points

Next we'll add look-points, e.g. let's say 4/4, 4/5 and 5/5 are winning conditions. Furthermore,
P(4/5)=P(4/4)/2 + P(3/4)/2 and P(5/5)=P(4/4)/2 is true.
The probability of reaching a winning condition, P(4/4 or 4/5 or 5/5), is not P(4/4) + P(4/5) + P(5/5), but
  P(4/4) + P(4/5 and not 4/4) + P(5/5 and not 4/4 and not 4/5) =
  (because scoring 5/5 after less than 4/4 is impossible)
= P(4/4) + P(4/5 and 3/4) =
= P(4/4) + P(4/5 | 3/4) * P(3/4) =
= 1/2^4  + P(4/5 | 3/4) * P(3/4) =
= 1/16  +          1/2 * 4/16 =
= 1/16  + 2/16 = 3/16 = 0.1875
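
The 3/16 can also be confirmed by brute force, since five trials give only 32 equally likely guess sequences. A small sketch (the helper name wins is chosen just for this example):

```python
from itertools import product

def wins(sequence, winning):
    """True if the running score ever equals one of the (correct, trials) pairs."""
    correct = 0
    for trial, answer in enumerate(sequence, start=1):
        correct += answer
        if (correct, trial) in winning:
            return True
    return False

winning = {(4, 4), (4, 5), (5, 5)}
sequences = list(product([0, 1], repeat=5))   # 1 = correct guess, 0 = wrong
count = sum(wins(s, winning) for s in sequences)
print(count, "of", len(sequences))            # 6 of 32, i.e. 3/16
```

Two of the six winning sequences start with four correct answers; the other four have exactly one miss among the first four trials followed by a correct fifth.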


Third Step: The Pascal triangle approach

An easier way to calculate this would be to recalculate the Pascal triangle, line after line, but remove (set to 0) the corresponding value for each winning condition (note that even the 1's in the main diagonal line are changed):
5th line (=4th trial):
1  0  0  0  0  0
1  1  0  0  0  0
1  2  1  0  0  0
1  3  3  1  0  0
1  4  6  4  0  0
gives us a changed 6th line (=5th trial):
1  5 10 10  4  0
where we remove the 4/5 and 5/5 (note that 5/5 has probability 0, as calculated above):
1  5 10 10  0  0
the 7th line would be:
1  6 15 20 10  0  0

The sought-after probability is the sum of the removed values, each divided by 2^trialnumber:
P(4/4 or 4/5 or 5/5) = 1 / 2^4  +  4 / 2^5  +  0 / 2^5 = 3/16

But what have we done? By removing the "1" in the 5th line, we changed the probability of P(4/5) to P(4/5 and not 4/4), because the "thread" 4/4->4/5 is taken away.
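
The row-by-row procedure can be sketched in a few lines of Python. This is not the spreadsheet's VBA; the function name look_alpha and the dictionary format (look trial mapped to the minimum number correct) are assumptions made for illustration. Exact fractions are used so the result can be compared directly with the hand calculation.

```python
from fractions import Fraction

def look_alpha(looks):
    """Probability that a guesser hits any winning condition.

    looks maps a look trial to the minimum number correct needed there.
    """
    row = [1]            # row[k] = guess paths with k correct, not yet stopped
    prob = Fraction(0)
    for trial in range(1, max(looks) + 1):
        new = [0] * (trial + 1)
        for k, paths in enumerate(row):
            new[k] += paths        # next answer wrong
            new[k + 1] += paths    # next answer correct
        row = new
        if trial in looks:
            for k in range(looks[trial], trial + 1):
                prob += Fraction(row[k], 2 ** trial)
                row[k] = 0         # remove the winning entry from the triangle
    return prob

print(look_alpha({4: 4, 5: 4}))
```

For the example above (4/4, 4/5 and 5/5 as winning conditions) this reproduces 3/16.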


Fourth Step: Implementing the algorithm

To calculate a new line (variable name: Result) of the modified Pascal triangle we need only the last line, which is stored in the variable LastResult. To sum up all the probabilities of winning conditions, I've used the variable Prob, which is set to 0 at the beginning and increased as winning situations occur.

For Trial = 1 To n                      'Run through all required lines of the triangle
  If NextLook <= UBound(LookTimes) Then  'Is a look-time left?
  ...
    If Trial = LookTimes(NextLook) Then  'Is this trial a lookup-point?

The second if-clause checks whether the current trial has a winning condition.

Note: For my program I used a slightly different approach: winning conditions are not specified directly (like 8/10), but calculated indirectly by passing a requested confidence for each trial which is a look-point.
By adding
...
        Wend
       
        Debug.Print ("at least " & Str(k) & " of " & Str(Trial))
        'lists the included winning conditions

        If k <= Trial Then
...

you can review the winning conditions used (in the debug window).

I hope this explains what my program is doing.

Edit: Corrected wrong indices in proof. Removed smilie.
(This board screwed my double-spaces...)
Edit 2: more corrections
Edit 3: Yet another correction: I erroneously used conditional probabilities on a few occasions.
Title: Statistics For Abx
Post by: ff123 on 2002-08-31 18:56:25
I still don't know why the simulation doesn't agree with the calculation.  Let's try a very simple one.  Can you calculate the overall alpha for the following lookpoints, maximum 6 trials:

2 of 2
3 of 4
4 of 6

The exact answer is 0.453125

My program yields 0.4530 with 10 million simulations

ff123
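
For reference, a simulation of this kind can be sketched as follows. This is a minimal Python version written under assumptions (function name, seed and number of runs are arbitrary, and it is not ff123's program): each simulated listener guesses with probability 0.5 and stops as soon as a look point is passed.

```python
import random

def simulate_alpha(looks, n_sims=200_000, seed=1):
    """Monte Carlo estimate of the chance that a guesser passes any look."""
    rng = random.Random(seed)
    last = max(looks)
    passes = 0
    for _ in range(n_sims):
        correct = 0
        for trial in range(1, last + 1):
            correct += rng.random() < 0.5      # one coin-flip guess
            if trial in looks and correct >= looks[trial]:
                passes += 1                    # stop at the first passed look
                break
    return passes / n_sims

estimate = simulate_alpha({2: 2, 4: 3, 6: 4})
print(estimate)
```

With far fewer runs than the 10 million quoted above, the estimate should only be expected to land near 0.453, not to match it to four digits.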
Title: Statistics For Abx
Post by: Continuum on 2002-08-31 20:57:35
Quote
I still don't know why the simulation doesn't agree with the calculation.  Let's try a very simple one.  Can you calculate the overall alpha for the following lookpoints, maximum 6 trials:

2 of 2
3 of 4
4 of 6

The exact answer is 0.453125

My program yields 0.4530 with 10 million simulations

ff123

at least  2 of  2
at least  3 of  4
at least  4 of  6
0,453125 -> exact
Title: Statistics For Abx
Post by: ff123 on 2002-08-31 23:48:33
Quote
at least  2 of  2
at least  3 of  4
at least  4 of  6
0,453125 -> exact

This is really stumping me.

Can you try two more easy tests?

at least 2 of 2
at least 3 of 6

exact answer is 0.671875
simulation yields: 0.6715 after 10 million sims.

Also:

at least 2 of 4
at least 3 of 6

exact answer is 0.75
simulation yields: 0.7501 after 10 million sims.

The Pascal triangle method is really interesting.  I'm trying to verify that it works using excel right now.

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-09-01 03:54:16
Yay!

I made up my own spreadsheet using the Pascal triangle method, and trying to get the simple examples to agree.  I finally ended up with an answer of 0.049155 for the 28-trials profile

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-01 09:21:08
So there is a mistake in the code. If you can find it, tell me.

Edit: ARGH! Found it! I forgot to reset the values to 0 for winning conditions. This problem couldn't occur with my first version, but now happens under certain circumstances.

I had to add a line:
          For k = k To Trial
            Prob = Prob + (LastResult(k) + LastResult(k + 1)) / 2 ^ Trial
            Result(k + 1) = 0                'THIS LINE WAS MISSING

          Next k
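
To illustrate why the reset matters, here is a Python sketch (not a reconstruction of the VBA above; look_alpha and its reset flag are assumptions). Without zeroing the winning entries, paths that already passed at an earlier look are counted again at later looks, so the sum overstates the probability.

```python
from fractions import Fraction

def look_alpha(looks, reset=True):
    """Sum the winning entries of the triangle; optionally skip the reset."""
    row = [1]
    prob = Fraction(0)
    for trial in range(1, max(looks) + 1):
        new = [0] * (trial + 1)
        for k, paths in enumerate(row):
            new[k] += paths
            new[k + 1] += paths
        row = new
        if trial in looks:
            for k in range(looks[trial], trial + 1):
                prob += Fraction(row[k], 2 ** trial)
                if reset:
                    row[k] = 0     # the step that was missing
    return prob

looks = {2: 2, 4: 3, 6: 4}
print(float(look_alpha(looks)), float(look_alpha(looks, reset=False)))
```

For the 2/2, 3/4, 4/6 profile the version with the reset gives 0.453125, while the version without it simply adds up the three unadjusted tail probabilities.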


Here is the corrected version: http://www.freewebz.com/aleph/CorrPVal3.xls

I have added a new version, which makes setting up tests easier. (LookPval2)
Title: Statistics For Abx
Post by: ff123 on 2002-09-01 16:53:47
Great!  The code looks like it should be very easy to incorporate into abchr, and it will be a lot faster than simulation

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-01 18:36:34
Another thing to consider is that this test allows no obvious conclusions beyond the given 0.95 or 0.99 confidence, because the test is either passed or failed. Not that it was much different with old ABX tests, but a 16/16 result allowed claiming a difference with confidence >0.9998. Maybe some extremely high confidence test modes should be added. (At least some people think it's necessary, otherwise we wouldn't see that many 16/16 or 29/30 results.)

Just an idea.
Title: Statistics For Abx
Post by: shday on 2002-09-01 20:33:21
I think this is a major improvement to ABX testing. Great job!
Title: Statistics For Abx
Post by: Continuum on 2002-09-02 07:43:33
Quote
I've been thinking some more about the in-between-look terminations. Since the listener cannot make a decision to continue the test after he terminates it, I think I have calculated things wrong. For example, if the listener gets a look at trial 6, but then stops at trial 8, then all the other looks at trial 12, 18, 23, and 28 should not be counted towards the overall alpha.


I think there are three options:
1. Show the progress to the listener, but do not allow him to quit the test here (or with worst case for the next look-point).
2. As (1), but don't show anything. This is the strictest option.
3. Don't show the progress, but allow the listener to quit with his current result (at the point-in-between). This might lead to statistical problems* (alpha might be higher).

Edit:
*) I have checked this and it is true. Let's say the listener achieved 13/18 at the third look point in the 28 profile, which is insufficient to pass the test.
If he continues after option 1 or 2, the probability to succeed is the same as passing a test with look-points 4/5 and 7/10 (= (17-13)/(23-18) and (20-13)/(28-18) ), i.e. 0.25586.
But if option three is applied, the user can stop at 1/1 (which he cannot see though). So if he stops after the next trial (after the 19th total trial) he succeeds with probability 0.5!

Therefore, option 3 is statistically unusable (or everything would have to be recalculated, which would be very difficult).
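
The two numbers in the 13/18 example can be reproduced with the modified-triangle calculation. In this Python sketch (look_alpha is an assumed name, not the LookPVal2 macro), the remaining look points are expressed relative to trial 18, exactly as in the text: 4 of the next 5 and 7 of the next 10.

```python
from fractions import Fraction

def look_alpha(looks):
    """Chance that a guesser passes any look (trial -> minimum correct)."""
    row = [1]
    prob = Fraction(0)
    for trial in range(1, max(looks) + 1):
        new = [0] * (trial + 1)
        for k, paths in enumerate(row):
            new[k] += paths
            new[k + 1] += paths
        row = new
        if trial in looks:
            for k in range(looks[trial], trial + 1):
                prob += Fraction(row[k], 2 ** trial)
                row[k] = 0
    return prob

finish_profile = look_alpha({5: 4, 10: 7})   # continue to trials 23 and 28
stop_at_19 = look_alpha({1: 1})              # gamble on the very next trial
print(float(finish_profile), float(stop_at_19))
```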
Title: Statistics For Abx
Post by: ff123 on 2002-09-02 16:36:51
I'm having difficulty understanding your point.  Are you saying that my in-between values are correct or not?

My thinking came about because I was trying to decide what should be done if the listener performs without knowing progress up to trial 5, and then terminates.  What should the program show if he got all 5 correct?  It should show an unadjusted alpha of 0.031.  In other words, if the listener cannot make a decision to continue or not based on information he has seen, there should be no adjustment.

BTW, I will probably go ahead and show progress for trials 1 through 4.  Yes, the listener can perform a Bayesian analysis and decide to stop if he gets all 4 wrong, but since this test is mainly interested in type 1 errors, that should not be a problem.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-02 17:29:22
Quote
I'm having difficulty understanding your point. Are you saying that my in-between values are correct or not?
I'm not sure how you calculated them. Could you explain it a little bit more? But I suspect you would run into the same problem.

Quote
My thinking came about because I was trying to decide what should be done if the listener performs without knowing progress up to trial 5, and then terminates. What should the program show if he got all 5 correct? It should show an unadjusted alpha of 0.031. In other words, if the listener cannot make a decision to continue or not based on information he has seen, there should be no adjustment.
Yes, without any information the user should be allowed to stop at 5/5. (Which would theoretically be the same as if he had chosen this test method -- 5/5 -- right from the beginning.)
But if more information is available in later trials, a guessing listener might use it to his advantage; see the example in my last post, where the user cannot be allowed to stop after 19 trials (with his real score).

Quote
BTW, I will probably go ahead and show progress for trials 1 through 4. Yes, the listener can perform a Bayesian analysis and decide to stop if he gets all 4 wrong, but since this test is mainly interested in type 1 errors, that should not be a problem.
Personally, I like the idea of seeing the results for the first trials. But then the listener can't be allowed to stop at 5/5, as this would obviously increase the total alpha -- like adding a look point at 5/5.
Code:
at least  6 of  6
at least  10 of  12
at least  14 of  18
at least  17 of  23
at least  20 of  28
4,91552352905273E-02
at least  2 of  1
at least  3 of  2
at least  4 of  3
at least  5 of  4
at least  6 of  5
at least  6 of  6
at least  10 of  12
at least  14 of  18
at least  17 of  23
at least  20 of  28
4,91552352905273E-02
at least  2 of  1
at least  3 of  2
at least  4 of  3
at least  5 of  4
at least  5 of  5
at least  6 of  6
at least  10 of  12
at least  14 of  18
at least  17 of  23
at least  20 of  28
6,20170868933201E-02
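
The first and last totals in the listing can be cross-checked with a short Python sketch (look_alpha and the variable names are assumptions; the entries such as "at least 2 of 1" are simply trials with no reachable winning condition and are omitted here). It compares the plain 28-trial profile with the same profile plus a look point at 5/5.

```python
from fractions import Fraction

def look_alpha(looks):
    """Chance that a guesser passes any look (trial -> minimum correct)."""
    row = [1]
    prob = Fraction(0)
    for trial in range(1, max(looks) + 1):
        new = [0] * (trial + 1)
        for k, paths in enumerate(row):
            new[k] += paths
            new[k + 1] += paths
        row = new
        if trial in looks:
            for k in range(looks[trial], trial + 1):
                prob += Fraction(row[k], 2 ** trial)
                row[k] = 0
    return prob

profile_28 = {6: 6, 12: 10, 18: 14, 23: 17, 28: 20}
with_5_of_5 = {5: 5, **profile_28}

alpha_28 = float(look_alpha(profile_28))
alpha_with_5 = float(look_alpha(with_5_of_5))
print(alpha_28, alpha_with_5)
```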
Title: Statistics For Abx
Post by: ff123 on 2002-09-02 17:58:32
Here's how I was calculating the in-between points:

For Trial 13, for example, I would include the look points at trials 6 and 12 in the total alpha calculation, but not the look points at trials 18 and 23.  In this respect, it would be just like calculating the overall alpha after stopping at trial 28.  So in my simulator, I would enter

6 of 6
10 of 12
11 of 13

to get a total alpha of 0.0295

Ok, I'm reviewing your 13/18 case.  Using my current table, if the listener is allowed to stop at 19, he does have a 50% chance of randomly getting the next one right and passing the test overall.  But what's wrong with that?  To get 13 of 18, he had to be pretty close to an overall significance of 0.05 in the first place (sim says about 0.062).

ff123

Edit:  still thinking about seeing trials 1 - 4
Title: Statistics For Abx
Post by: Continuum on 2002-09-02 18:19:21
Quote
Ok, I'm reviewing your 13/18 case. Using my current table, if the listener is allowed to stop at 19, he does have a 50% chance of randomly getting the next one right and passing the test overall. But what's wrong with that? To get 13 of 18, he had to be pretty close to an overall significance of 0.05 in the first place (sim says about 0.062).
Yes, but nominally, without the option to stop at 19 his chances are far lower, i.e. in a strict 28-trial look point test, they are only about 0.25586. (see above)

The other in-between points might have the same problem.
Title: Statistics For Abx
Post by: ff123 on 2002-09-02 18:19:42
Regarding trials 1-4, I am thinking about how I have simulated it.  Right now, if I enter 1 of 1 as a look point, then the simulation assumes the listener terminates if he gets a 1 of 1.  But that isn't how it would really work.  In real life, the listener should never terminate, no matter what results he gets on trials 1-4.  So I think allowing the listener to see those first 4 trials is ok (trial 5 should be blinded).

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-09-02 18:25:52
Quote
Yes, but nominally, without the option to stop at 19 his chances are far lower, i.e. in a strict 28-trial look point test, they are only about 0.25586. (see above)

The other in-between points might have the same problem.

I'm considering things purely from a simulation point of view right now.  I.e., what does the simulation say the overall probability of getting 14 of 19 is when he is allowed to look at trials 6, 12 and 18 (and terminate early) and then allowed to stop at trial 19?

The simulator says 0.0495 probability of terminating at any of the look points or at trial 19 with an adequate score.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-02 18:59:06
Quote
I'm considering things purely from a simulation point of view right now. I.e., what does the simulation say the overall probability of getting 14 of 19 is when he is allowed to look at trials 6, 12 and 18 (and terminate early) and then allowed to stop at trial 19?

The simulator says 0.0495 probability of terminating at any of the look points or at trial 19 with an adequate score.

But what if things add up? The problem is that the listener can choose whether or not to abort the test early, and therefore has an advantage. It is clear to me that the total alpha would increase, maybe not above 0.05, but it definitely would be higher than what we calculated earlier.
Title: Statistics For Abx
Post by: ff123 on 2002-09-02 19:40:37
How should one modify the simulation?

Right now the simulation says that the listener always terminates at a look point if the total alpha is 0.05 or less.  This is as much to the listener's advantage as possible.

But there is another approach to all this.  So far we have investigated the "frequentist" approach.  The Bayesian approach could be just as interesting.  Say that the listener based his decision on whether or not to continue based on his past performance.  What should be his decision at each look point?

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-02 20:07:08
Quote
How should one modify the simulation?
If you mean, to account for in-between termination, I have no idea right now. It would be possible to calculate the best strategy at each (unsuccessful) look-point, but rather difficult.

Quote
Right now the simulation says that the listener always terminates at a look point if the total alpha is 0.05 or less. This is as much to the listener's advantage as possible.
Yes, and this is good. But this only includes the look-points. Allowing the user to terminate in-between causes nontrivial problems (at least for me).

Quote
But there is another approach to all this. So far we have investigated the "frequentist" approach. The Bayesian approach could be just as interesting. Say that the listener based his decision on whether or not to continue based on his past performance. What should be his decision at each look point?
When either option 1 or 2 from my previous post is used, it doesn't matter, I think, because his decision is obvious: stop when the target is reached, else continue. I don't see what he should do differently.
Title: Statistics For Abx
Post by: ff123 on 2002-09-02 20:21:41
I still don't see the problem.

The listener can decide to stop at any in-between point, and true, there is a problem that he may be able to optimize his strategy and choose the best in-between point to stop at.  However, the overall type 1 risk can never be greater than 0.05.

Right now, stopping at trial 19 is the only time the overall alpha is greater (0.0495) than stopping at a look point (0.0492).  So if I were a betting man and knew that all my guesses were random, I'd stop on trial 19.  One can eliminate this problem by eliminating trial 19 as a stopping point, though.  And then the look points become the most advantageous places to stop.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-02 20:42:45
Quote
The listener can decide to stop at any in-between point, and true, there is a problem that he may be able to optimize his strategy and choose the best in-between point to stop at. However, the overall type 1 risk can never be greater than 0.05.
Sorry, my statistical background is very limited: What is "the overall type 1 risk"?

Quote
Right now, stopping at trial 19 is the only time the overall alpha is greater (0.0495) than stopping at a look point (0.0492). So if I were a betting man and knew that all my guesses were random, I'd stop on trial 19. One can eliminate this problem by eliminating trial 19 as a stopping point, though. And then the look points become the most advantageous places to stop.
These numbers (0.0492 and 0.0495) are not what I'm talking about. They assume that the listener terminates the test at the corresponding time, but none of them is his best choice. His optimal strategy would be to choose his stop point dynamically, depending on his previous results.

Example: User has scored 13/18. His best choice is to stop at trial 19 (winning probability LookPVal2(Array(1), Array(1))=0.5) instead of completing the whole 28-test (LookPVal2(Array(4, 7), Array(5, 10))=0.255859375).
On the other hand, if his score is 10/18, he is forced to complete the whole test, as this is his only possibility to win.

What we have calculated before is only true, when the listener is forced to take the trials up to next look-point.
Title: Statistics For Abx
Post by: ff123 on 2002-09-02 20:54:54
Overall type 1 risk = overall alpha.  The probability that a person could achieve a certain score given that he is randomly choosing X, and further given that he chooses to stop the test at a look point if the number correct are as shown, and further given that he chooses to stop at the in-between point in question.

So the probability that a listener could achieve 14 of 19 given all of the above is 0.0495.

The current optimum strategy for a listener is to:

1. Always stop at look points 6, 12, and 18 if his scores meet the overall alpha.

2. Stop at trial 19 if he has 13 of 18

3. Otherwise, continue and stop at lookpoint 23

4. If he can't stop at lookpoint 23, continue to the end (trial 28).

ff123

Edit:  If the listener is not allowed to stop at trial 19, then the optimum strategy becomes:

1. Always stop at a lookpoint if you can, or when forced to stop at trial 28.
Title: Statistics For Abx
Post by: Continuum on 2002-09-02 21:29:44
Quote
The current optimum strategy for a listener is to:

1. Always stop at look points 6, 12, and 18 if his scores meet the overall alpha.

2. Stop at trial 19 if he has 13 of 18

3. Otherwise, continue and stop at lookpoint 23

4. If he can't stop at lookpoint 23, continue to the end (trial 28).
I'm afraid it's not that simple.

Another example: 16/23. His best choice is to stop at trial 25 (winning probability LookPVal2(Array(2), Array(2))=0.25) instead of completing the test (LookPVal2(Array(4), Array(5))=0.1875)

There might be more, I don't know.
Title: Statistics For Abx
Post by: ff123 on 2002-09-02 22:22:49
Looks like

If one has 12/18, it is optimum to stop at trial 22.
If one has 9 of 12, it is optimum to stop at trial 14.
If one has 8 of 12, it is optimum to stop at trial 17.
If one has 5 of 6, it is optimum to stop at trial 8.
If one has 4 of 6, it is optimum to stop at trial 11.

But even if the listener chooses the non-optimum strategy and stops at different in-between points, the overall alpha remains < 0.05.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-03 06:45:14
Quote
If one has 12/18, it is optimum to stop at trial 22.
If one has 9 of 12, it is optimum to stop at trial 14.
If one has 8 of 12, it is optimum to stop at trial 17.
If one has 5 of 6, it is optimum to stop at trial 8.
If one has 4 of 6, it is optimum to stop at trial 11.
Have you calculated each step? (I mean, have you compared all reasonable strategies at each of those trials?) Then we have already 7 problematic in-between looks.

Quote
But even if the listener chooses the non-optimum strategy and stops at different in-between points, the overall alpha remains < 0.05.
What is your non-optimum strategy? The problem is, what happens, if he uses an optimal strategy, i.e. stops at all points listed above? He plays stronger than 0.049155, maybe even stronger than 0.05.
Title: Statistics For Abx
Post by: ff123 on 2002-09-03 15:10:31
Quote
What is your non-optimum strategy? The problem is, what happens, if he uses an optimal strategy, i.e. stops at all points listed above? He plays stronger than 0.049155, maybe even stronger than 0.05.

Ah, I think I finally see what you're getting at.  The corrected simulation would have a listener with 13 of 18 stop at 14 of 19, for example, rather than waiting until trial 23 to stop.  That's an added wrinkle.

Well, that would take some time for me to code up.

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-09-03 15:28:32
Probably the best solution is to disallow stop points if the optimum strategy does not lead to stopping at a look point.

So that would mean eliminating:

trials 1-4, 8, 11, 14, 17, and 22 as stop points.

I need to verify, though.

ff123

Edit:  Oh no, that's probably not enough.  I need to eliminate all suboptimal stop points as well, which are still better than stopping at a look point.  I'll look at this tonight, then.
Title: Statistics For Abx
Post by: Continuum on 2002-09-03 15:44:05
I think the easiest and safest method would be to disallow in-between stops generally. It wouldn't seem very logical to the user to be allowed to stop at some points but not at others.
If you find a way, though, to keep the total below 0.05, it could be added. But I don't think it's possible. (or maybe for certain profiles only)

The best (easiest) way for all in-between points probably is:

Quote
1. Show the progress to the listener, but do not allow him to quit the test here (or with worst case for the next look-point).
Title: Statistics For Abx
Post by: ff123 on 2002-09-03 15:49:17
Quote
I think the easiest and safest method would be to disallow in-between stops generally. It wouldn't seem very logical to the user to be allowed to stop at some points but not at others.
If you find a way, though, to keep the total below 0.05, it could be added. But I don't think it's possible. (or maybe for certain profiles only)

The best (easiest) way for all in-between points probably is:

Quote
1. Show the progress to the listener, but do not allow him to quit the test here (or with worst case for the next look-point).

Showing progress (while disallowing stopping in between) does seem more attractive.  Especially if the listener is not allowed to stop at important places, such as trials 7 through 11.  Let me work through all the places where I should eliminate stop points, and then reconsider.  There could even be a hybrid solution.  For example:  don't show progress on trials 1 through 4, but allow a stop point at trial 5.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-03 15:58:52
Hybrid version - sounds interesting!

Especially the 5/5 possibility could be useful, although personally, I still prefer to know my progress at each trial...
Title: Statistics For Abx
Post by: shday on 2002-09-03 22:57:09
Allowing in-between stops increases the chances of type-2 errors (failing the test when a difference was heard).

For example, if someone hears a difference and chooses to stop at trial 5, there is a chance they may fail the test (this is possible because they didn't know the results from trials 1 to 4). This can happen even though the full test may (very likely) result in a passing score. This means the Pr(type-2 error) has increased, versus a test without in-between stops.

The look points, on the other hand, actually decrease the chances of type-2 errors because they only terminate the test early in the event of a pass.

Edit: Actually, there are scenarios where the in-between stop points could result in an avoided instance of a type-2 error. But I still think the net result is that the Pr(type-2 error) is increased.  One thing is for sure... allowing the in-between stop points adds a lot of complexity.
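
The effect of look points on type-2 errors can be made concrete by running the same triangle recursion with a success probability other than 0.5. In this Python sketch the function name pass_probability and the value p1 = 0.8 are assumptions chosen for illustration; it compares the 28-trial profile with a fixed test that only checks 20/28 at the end.

```python
def pass_probability(looks, p):
    """Chance of passing any look when each trial is correct with prob. p."""
    row = [1.0]          # row[k] = P(k correct so far and not yet stopped)
    passed = 0.0
    for trial in range(1, max(looks) + 1):
        new = [0.0] * (trial + 1)
        for k, prob in enumerate(row):
            new[k] += prob * (1 - p)
            new[k + 1] += prob * p
        row = new
        if trial in looks:
            for k in range(looks[trial], trial + 1):
                passed += row[k]
                row[k] = 0.0
    return passed

profile_28 = {6: 6, 12: 10, 18: 14, 23: 17, 28: 20}
fixed_28 = {28: 20}
p1 = 0.8                 # hypothetical listener who is right 80% of the time

beta_profile = 1 - pass_probability(profile_28, p1)
beta_fixed = 1 - pass_probability(fixed_28, p1)
print(beta_profile, beta_fixed)
```

Every sequence that ends with at least 20 of 28 also passes the profile, so the profile's beta cannot be larger than that of the fixed test; with p = 0.5 the same function returns the overall alpha.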
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 02:03:09
Good point about the type 2 error, although the 28-trial profile is not particularly concerned with type 2 errors in the first place.  I think I'm reaching the conclusion that, especially for the benefit of people not familiar with ABX testing, the results for trials 1 through 4 should be visible, and that means not allowing a stop before trial 6.

So to summarize:

1. number correct displayed for every trial.  There will be a table showing the stopping points and the number correct required to pass.
2. overall alpha value also displayed at every stopping point.
3. test can only be terminated at trials 6, 12, 18, 23 and 28 (with the number correct as specified previously to get a "passing" score).
4. 9 wrong terminates the test.
5. the listener can choose to continue with the test even if he achieves a passing score, but then runs the risk of failing at a later stopping point.

The final values for the overall (passing) alphas are in the following table:

(http://ff123.net/export/28trials.gif)
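
Rules 3 and 4 of the summary can be sketched as a small decision function. This is an illustrative Python sketch, not abchr code; the names LOOKS, MAX_WRONG and status are assumptions. Nine wrong answers end the test because even the last look point (20 of 28) tolerates only eight.

```python
LOOKS = {6: 6, 12: 10, 18: 14, 23: 17, 28: 20}   # trial -> correct needed
MAX_WRONG = 9                                     # 28 - 20 + 1

def status(correct, trials):
    """Return 'passed', 'failed' or 'continue' after the given trial."""
    if trials - correct >= MAX_WRONG:
        return "failed"
    if trials in LOOKS and correct >= LOOKS[trials]:
        return "passed"
    return "continue"

print(status(6, 6), status(5, 6), status(13, 18), status(9, 18))
```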
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 04:26:00
The table brings up an interesting point.  Should the profile be designed to keep a constant nominal alpha, or a constant overall alpha?

ff123

Edit:  never mind -- it's impossible to keep a constant overall alpha!
Title: Statistics For Abx
Post by: Continuum on 2002-09-04 06:42:15
Quote
2. overall alpha value also displayed at every stopping point.
What overall alpha? Depending on the results? I'm not sure how this could be calculated.

Quote
5. the listener can choose to continue with the test even if he achieves a passing score, but then runs the risk of failing at a later stopping point.
Hmm, couldn't that lead to incorrect conclusions by the listener? I think he can't increase his confidence with this, because this test is of a very strict passed-failed type.
Problem: which score is better, 6/6 or 14/18?

I think there should be a different profile for people who want high confidence results, because the 28-profile can't say more than passed with confidence 0.95 or failed.
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 07:00:52
Quote
Hmm, couldn't that lead to incorrect conclusions by the listener? I think he can't increase his confidence with this, because this test is of a very strict passed-failed type.
Problem: which score is better, 6/6 or 14/18?


6 of 6 is better than 14 of 18 if the listener always follows the procedure of quitting at the earliest stopping point whenever it is possible.  Otherwise, I don't know.  Point taken.  The program should terminate automatically.

However, the converse (program forces the listener to continue) is not possible.  The listener could decide to terminate at any time (by not continuing the test).  The program only displays the overall alpha at stopping points, though.  For example:

5 of 6:  0.109
13 of 18: 0.058

It is possible to calculate an overall alpha only when the listener is forced to stop at the earliest possible time.

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-04 07:13:43
Quote
However, the converse (program forces the listener to continue) is not possible. The listener could decide to terminate at any time (by not continuing the test). The program only displays the overall alpha at stopping points, though. For example:

5 of 6: 0.109
13 of 18: 0.058

It is possible to calculate an overall alpha only when the listener is forced to stop at the earliest possible time.
You could calculate a worst case scenario for the next look point, e.g. the listener's score is 5/6, 9/12, 13/18 and 17/22 -> passed. 16/21 -> failed.
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 07:45:31
I think it is enough to display what the overall alpha would be at any particular lookpoint assuming the listener were to stop at that point.

so 5 of 6:  0.109
and 6 of 6: 0.0156

9 of 12: 0.078
10 of 12: 0.0295
Title: Statistics For Abx
Post by: Continuum on 2002-09-04 08:02:01
Quote
I think it is enough to display what the overall alpha would be at any particular lookpoint assuming the listener were to stop at that point.
For what? Wouldn't it be possible to achieve a score below 0.05 but still fail the 28-profile test?

What correct conclusions could be drawn from the displayed information?
Title: Statistics For Abx
Post by: shday on 2002-09-04 14:05:47
I think the test should be strictly pass or fail. I think the statistics become shaky if we try to go beyond that. How would we interpret an overall alpha?

What about 10/12 versus 11/12? Both are possible with the current plan. They are both passing scores but the calculated overall alphas are different (right?). I think distinguishing these scores with overall alphas would be tricky (i.e., how would one interpret this?).

Another possibility is to terminate the test once 10/11 is reached because 10/12 (and therefore a pass) has already been achieved. The same would apply to 14/17, 17/22, and 20/27.
Title: Statistics For Abx
Post by: Continuum on 2002-09-04 14:40:18
Quote
I think the test should be strictly pass or fail. I think the statistics become shaky if we try to go beyond that. How would we interpret an overall alpha?

What about 10/12 versus 11/12? Both are possible with the current plan. They are both passing scores but the calculated overall alphas are different (right?). I think distinguishing these scores with overall alphas would be tricky (i.e., how would one interpret this?).

I agree. It would be possible to claim differences between different results in a fixed-length test, e.g. 12/16 or 15/16, because the probability to score the same or a better result (=alpha!!), which can be calculated easily in this case, is different.
But with the 28-profile (or any test with look-points) it's not trivial to determine which scores are better than a given one.

What we can calculate is the probability to pass the entire test by guessing. Nothing more should be shown to the user.
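
For the fixed-length case the calculation really is simple: alpha is the binomial tail. A minimal Python sketch (tail_p is an assumed name):

```python
from math import comb

def tail_p(k, n):
    """Probability of k or more correct out of n trials when guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

for k in (12, 15, 16):
    print(k, tail_p(k, 16))
```

The value for 16/16 is 1/65536, which is where a confidence above 0.9998 for a 16/16 result comes from.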

Quote
Another possibility is to terminate the test once 10/11 is reached because 10/12 (and therefore a pass) has already been achieved. The same would apply to 14/17, 17/22, and 20/27.

True. But maybe this would be too confusing?
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 15:26:51
I still don't see the problem with calculating and displaying an overall alpha.

10 of 12 is 0.0295
11 of 12 is 0.0171
12 of 12 is impossible

This is given that a listener must have terminated if he achieved 6 of 6.  The procedure (and therefore the exact odds of getting to any particular point) is completely prescribed now.

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 15:29:30
Quote
Wouldn't it be possible to achieve a score below 0.05 but still fail the 28-profile test?

No.  Once a score of 0.05 is achieved at a stop point, the test is forced to stop.

ff123

Also, if a listener refuses to continue to a look-point, his overall alpha is not displayed, and he is not considered to have passed the ABX even if he stops at a score like 10 out of 11.  This is one disadvantage of displaying the number correct at every trial.  Misunderstandings about what constitutes a passing score might sometimes arise if the listener thinks he can stop at any time.  If no score is displayed, the listener knows that he must get to a stop point to see both the number correct and the overall alpha.

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 17:01:16
I think shday's idea is a good one.  I think what you lose is the ability to get a lower overall alpha at the look point, but what you gain, of course, is a time savings.  That might be worth it.  The other advantage is that there is then no conflict over showing the number correct for every trial.  Nobody gets unfairly penalized for not continuing a test at 10 of 11, because the program will automatically stop and count this as a success.

Let me change the simulation at home tonight to take into account an early termination and see what pops out, but I think it should be fine.

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 18:57:15
Damn, I hope this is the last time I summarize what's going to happen!

1. The test will automatically stop if the following points are reached:

6 of 6
10 of 11
10 of 12
14 of 17
14 of 18
17 of 22
17 of 23
20 of 27
20 of 28

2. The program will display overall alpha values after each of the above stop points has been achieved.  Also, the overall alpha values will be displayed regardless of whether the test stops or not at the following (look) points:  trials 6, 12, 18, 23, and 28.

(The earlier the test is terminated when the listener passes, the lower the overall alpha is.)

3. The program will display the number correct after each trial is completed.

4. The test will automatically stop if 9 incorrect responses are reached, since 20 of 28 is then no longer attainable.

Have I missed anything?  This has sure been one humbling exercise.
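
The four rules above can be sketched as an exact calculation rather than a simulation. This is only an illustration under stated assumptions: the listener guesses with probability `p` per trial, the test stops as soon as a listed (correct, trial) pair is hit, and it stops as a failure at 9 incorrect. The names `STOP_POINTS` and `stop_probabilities` are made up for the example.

```python
from collections import defaultdict

# (correct, trial) pairs at which the test stops and counts as passed
STOP_POINTS = [(6, 6), (10, 11), (10, 12), (14, 17), (14, 18),
               (17, 22), (17, 23), (20, 27), (20, 28)]
MAX_WRONG = 9
MAX_TRIALS = 28

def stop_probabilities(p=0.5):
    """Return {(correct, trial): probability of passing exactly at that point}."""
    stops = set(STOP_POINTS)
    alive = {0: 1.0}      # correct count -> probability of still being in the test
    passed = {}
    for trial in range(1, MAX_TRIALS + 1):
        nxt = defaultdict(float)
        for correct, prob in alive.items():
            nxt[correct + 1] += prob * p          # this trial answered correctly
            nxt[correct] += prob * (1 - p)        # this trial answered wrongly
        alive = {}
        for correct, prob in nxt.items():
            if (correct, trial) in stops:
                passed[(correct, trial)] = prob   # forced stop: pass
            elif trial - correct < MAX_WRONG:
                alive[correct] = prob             # keep testing
            # otherwise 9 wrong answers: forced stop, fail
    return passed

probs = stop_probabilities()
running = 0.0
for point in STOP_POINTS:
    running += probs[point]
    print(point, round(running, 4))               # overall alpha so far
```

The running total is the overall alpha at each stop point. The first three values work out to 1/64, 19/1024 and 121/4096, the last of which is the 0.0295 quoted earlier for 10 of 12.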

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-04 19:46:15
Quote
I still don't see the problem with calculating and displaying an overall alpha.

10 of 12 is 0.0295
11 of 12 is 0.0171
12 of 12 is impossible

Given that a listener must have terminated if he achieved 6 of 6. The procedure (and therefore the exact odds of getting to any particular point) are completely prescribed now.

Ahh! I think I finally understand your total alpha calculation.
Example: alpha(10/12) = P(6/6) + P(10/12 and not 6/6) + P(11/12 and not 6/6) + P(12/12 and not 6/6)
Is this correct?

Well, then it is possible to calculate a total alpha at each step; still, I'd be very careful with conclusions drawn from this value.
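
For what it's worth, the example can be worked term by term under pure guessing, counting sequences out of 2^12 = 4096. "Not 6/6" means at least one miss in the first six trials, so for 10/12 the count is all ways to place two misses minus those with both misses in trials 7-12. This is a sketch of the arithmetic, not something taken from the spreadsheet.

```latex
\begin{aligned}
\alpha(10/12) &= P(6/6) + P(10/12 \wedge \neg\,6/6) + P(11/12 \wedge \neg\,6/6) + P(12/12 \wedge \neg\,6/6) \\
              &= \frac{64}{4096} + \frac{\binom{12}{2} - \binom{6}{2}}{4096} + \frac{6}{4096} + 0 \\
              &= \frac{64 + 51 + 6}{4096} = \frac{121}{4096} \approx 0.0295
\end{aligned}
```

Dropping the 10/12 term leaves 70/4096, about 0.0171, which is the value quoted for 11 of 12.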
Title: Statistics For Abx
Post by: ff123 on 2002-09-04 21:36:51
Quote
Ahh! I think I finally understand your total alpha calculation.
Example: alpha(10/12) = P(6/6) + P(10/12 and not 6/6) + P(11/12 and not 6/6) + P(12/12 and not 6/6)
Is this correct?

Well, then it is possible to calculate a total alpha at each step; still, I'd be very careful with conclusions drawn from this value.

Here is how I think of it (in simulation terms):

alpha (10/12):  the probability that a listener will end up with a score of 6/6, 10/12, 11/12, or 12/12 (impossible), given that the listener must stop after achieving a score of 6/6.  Clearly, to get to the 10/12 and 11/12, the listener cannot have scored 6/6.  12/12 is not achievable using this scheme.  So yes, your formula looks like how the simulation works.

You can also use your spreadsheet to calculate these numbers.  If you do, you'll see that as the number of trials goes up, the total alpha increases at each stopping point until it reaches a value just under 0.05 at 20/28.  This is consistent with the idea of allowing stopping points:  basically, the listener gets multiple chances to pass the test, with his chances of passing getting better as the test progresses.

ff123
Title: Statistics For Abx
Post by: shday on 2002-09-05 00:41:07
OK, so let's say someone passes the test with a score of 10/11. He would have two pieces of objective information from which to assess the "confidence" that he really heard a difference:

1) the a priori confidence of the test as a whole, ~95% (this would be true for any passing score)

2) the overall alpha (maybe better called the p-value in this context?), ~0.018

Clearly one may be tempted to declare they had passed with 98.2% confidence... but, as Continuum showed in his previous thread, this would not be an accurate statement. As long as this misinterpretation is not made, the overall alpha should help to differentiate passing scores, if one wishes to do so.
Title: Statistics For Abx
Post by: ff123 on 2002-09-05 01:56:34
Quote
OK, so let's say someone passes the test with a score of 10/11. He would have two pieces of objective information from which to assess the "confidence" that he really heard a difference:

1) the a priori confidence of the test as a whole, ~95% (this would be true for any passing score)

2) the overall alpha (maybe better called the p-value in this context?), ~0.018

Clearly one may be tempted to declare they had passed with 98.2% confidence... but, as Continuum showed in his previous thread, this would not be an accurate statement. As long as this misinterpretation is not made, the overall alpha should help to differentiate passing scores, if one wishes to do so.

Wait, I don't think it's a misinterpretation to say that someone has passed with 98.2% confidence if he scores 10 of 11.  For the same reason I don't think it's a misinterpretation to say that someone who scores 6/6 passes with 98.4% confidence.

Near the beginning of the test, the confidence is higher because there are fewer ways to achieve a passing score, but as you get more chances to pass as you attempt more trials, the confidence decreases, until at the end, if you score 20 of 28, you have a 95.1% confidence.

This is just another way of saying:  how probable is it that one has achieved a passing score if he gets 10 of 11, given all the possible ways of getting to this score (including passing with a score of 6/6)?

ff123
Title: Statistics For Abx
Post by: shday on 2002-09-05 03:03:45
Perhaps this is just a matter of semantics. But it is important to note the following (using the notation from that previous thread):

P(G | 10/11) != P(10/11 or better | G) = 0.018

The 6/6 score may be a special case because there is no "or better" part.

Anyhow, I'm really not sure what the "right" wording would be for the interpretation of the p-val. From my stats book it looks like you could safely say that the result was statistically significant "at the 0.018 level of probability". (Intentionally vague I think).

Maybe you could also say that you had passed the 98.2% confidence test... but somehow that doesn't seem right (you just happened to get a result right on the edge of the corresponding "rejection region" for the null hypothesis).

I am, however, quite sure you would be perfectly accurate in saying that you had passed the 95% confidence test.
Title: Statistics For Abx
Post by: ff123 on 2002-09-05 04:43:28
Quote
Maybe you could also say that you had passed the 98.2% confidence test... but somehow that doesn't seem right (you just happened to get a result right on the edge of the corresponding "rejection region" for the null hypothesis).

I don't think of 10/11 as the edge of the rejection region.  The fact that one is allowed to continue the test if 10/12 is not achieved pushes the true edge of rejection out to trial 28.

The high confidence required early in the test is what makes it possible to offer the option of stopping early and still have 95% confidence if trials continue out to 28.

ff123
Title: Statistics For Abx
Post by: shday on 2002-09-05 05:16:48
Quote
Quote
Maybe you could also say that you had passed the 98.2% confidence test... but somehow that doesn't seem right (you just happened to get a result right on the edge of the corresponding "rejection region" for the null hypothesis).

I don't think of 10/11 as the edge of the rejection region.  The fact that one is allowed to continue the test if 10/12 is not achieved pushes the true edge of rejection out to trial 28.

10/11 is on the edge of the rejection region for 98.2% confidence, clearly not for 95% confidence (it's well inside).

btw, I agree that passing in the earlier trials gives a higher confidence that a type-1 error hasn't occurred. I just think the exact interpretation of the p-value is not trivial. In particular, I would say that a score of 10/11 does not mean the probability that the listener was guessing is exactly 0.018. This is what the equation above says also... although the validity of that hasn't actually been proved or disproved anywhere here, yet (but I'm quite certain it's true) 
Title: Statistics For Abx
Post by: Delirium on 2002-09-05 06:14:17
I may have missed something somewhere in this thread, but what was the reason for not using the sequential data analysis methods discussed at the beginning of the thread?  It would seem that since they're explicitly designed for sequential data analysis that they'd avoid most of the problems with look windows and such, and allow termination at any point with a robust calculation of confidence levels (or at least robust to the extent that the authors of the methods proved them to be so).
Title: Statistics For Abx
Post by: ff123 on 2002-09-05 06:44:20
Quote
btw, I agree that passing in the earlier trials gives a higher confidence that a type-1 error hasn't occurred. I just think the exact interpretation of the p-value is not trivial. In particular, I would say that a score of 10/11 does not mean the probability that the listener was guessing is exactly 0.018. This is what the equation above says also... although the validity of that hasn't actually been proved or disproved anywhere here, yet (but I'm quite certain it's true)

Suppose that the probability of getting a trial correct is 0.6 instead of 0.5.  Then the probability of getting 10 of 11 is 0.061 instead of 0.018.

So yes, I would say that given the way the test is performed, the probability that the listener was guessing with a score of 10/11 is exactly 0.018.
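
A short sketch of where the two numbers appear to come from, assuming they are the chance of stopping at 6 of 6 or at 10 of 11 for a given per-trial success rate. The function name is illustrative only.

```python
def prob_stop_by_10_of_11(p):
    """Chance of stopping at 6/6 or at 10/11 when each trial is right with probability p."""
    six_of_six = p ** 6
    # 10 of 11 without an opening 6/6: the single miss must fall in trials 1-6.
    ten_of_eleven = 6 * p ** 10 * (1 - p)
    return six_of_six + ten_of_eleven

print(round(prob_stop_by_10_of_11(0.6), 3))  # 0.061
print(round(prob_stop_by_10_of_11(0.5), 4))  # 0.0186
```

At 0.5 the exact value is 19/1024, which is the figure quoted in this thread as 0.018.  Both results are probabilities of a score given an assumed per-trial success rate.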

ff123
Title: Statistics For Abx
Post by: ff123 on 2002-09-05 06:47:17
Quote
I may have missed something somewhere in this thread, but what was the reason for not using the sequential data analysis methods discussed at the beginning of the thread?  It would seem that since they're explicitly designed for sequential data analysis that they'd avoid most of the problems with look windows and such, and allow termination at any point with a robust calculation of confidence levels (or at least robust to the extent that the authors of the methods proved them to be so).

I haven't looked at it extensively, but after a bit of fiddling with the formulas, I found I couldn't get the minimum number of trials down below 9 before the test is declared to have been passed.  This is a big disadvantage compared with the 28 trial profile, which allows one to stop at trial 6.

ff123

Edit:  But maybe by properly choosing beta and p1 values, I could make the Wald test more palatable as far as minimum trials go.  I probably need to read up on this more.
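
A rough sketch of how beta and p1 affect the earliest possible pass in a Wald-style sequential test, using Wald's usual approximation for the upper boundary, (1 - beta) / alpha. The default values of beta and p1 are assumptions chosen for illustration, not the values actually tried here, and the function name is made up.

```python
import math

def wald_min_trials_to_pass(alpha=0.05, beta=0.2, p0=0.5, p1=0.7):
    """Smallest run of all-correct trials that crosses Wald's upper boundary."""
    upper = math.log((1 - beta) / alpha)   # log of the approximate pass boundary
    step = math.log(p1 / p0)               # log-likelihood gained per correct answer
    return math.ceil(upper / step)

print(wald_min_trials_to_pass())           # 9
print(wald_min_trials_to_pass(p1=0.9))     # 5
```

With these assumed inputs, a larger p1 lowers the minimum, which is in line with the idea that the choice of beta and p1 decides how early the test can be passed.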
Title: Statistics For Abx
Post by: Continuum on 2002-09-05 07:40:08
Quote
I may have missed something somewhere in this thread, but what was the reason for not using the sequential data analysis methods discussed at the beginning of the thread?  It would seem that since they're explicitly designed for sequential data analysis that they'd avoid most of the problems with look windows and such, and allow termination at any point with a robust calculation of confidence levels (or at least robust to the extent that the authors of the methods proved them to be so).

If you want the option to stop after every trial, you have to increase the number of minimum trials (as ff123 points out) or the required "traditional" confidence at each point.

Example: The probability of passing a "traditional" 0.95-test by guessing when one is allowed to stop at every point up to trial 30 is 0.129! (you can test this with my Excel-sheet from above)

That's why the look profiles (like the 28-test) are a good compromise between information, early termination and high confidence.
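
For anyone without the spreadsheet, the stop-at-any-trial example can be sketched as follows. It assumes the listener stops and claims a pass the first time the ordinary one-sided binomial p-value for the current score is at or below the chosen level. If the spreadsheet uses a slightly different pass criterion, the result need not match 0.129 to the digit. Function names are illustrative.

```python
from math import comb

def binom_tail(k, n):
    """P(at least k correct out of n) under pure guessing."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def pass_by_guessing(max_trials=30, level=0.05):
    """Chance that a guesser passes at some trial when allowed to stop at any trial."""
    alive = {0: 1.0}     # correct count -> probability of still being in the test
    passed = 0.0
    for n in range(1, max_trials + 1):
        nxt = {}
        for k, prob in alive.items():
            for k2 in (k, k + 1):                 # wrong or correct on trial n
                nxt[k2] = nxt.get(k2, 0.0) + prob / 2
        alive = {}
        for k, prob in nxt.items():
            if binom_tail(k, n) <= level:
                passed += prob                    # stops here and claims a pass
            else:
                alive[k] = prob
    return passed

print(pass_by_guessing(30))
```

With only 8 trials allowed the result is already 13/256, just over 0.05, which shows how quickly repeated looks inflate the overall error rate.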
Title: Statistics For Abx
Post by: Continuum on 2002-09-05 07:46:41
Quote
Quote
btw, I agree that passing in the earlier trials gives a higher confidence that a type-1 error hasn't occurred. I just think the exact interpretation of the p-value is not trivial. In particular, I would say that a score of 10/11 does not mean the probability that the listener was guessing is exactly 0.018. This is what the equation above says also... although the validity of that hasn't actually been proved or disproved anywhere here, yet (but I'm quite certain it's true)

Suppose that the probability of getting a trial correct is 0.6 instead of 0.5.  Then the probability of getting 10 of 11 is 0.061 instead of 0.018.

So yes, I would say that given the way the test is performed, the probability that the listener was guessing with a score of 10/11 is exactly 0.018.

I think shday is correct here. We have no proof whatsoever that the probability that the listener was guessing is the same as the calculated p-val.
Example: If you compare two identical files, the probability that the listener is guessing is 1, while his p-val (probability of scoring the same or a better result) will be lower in most cases. The p-val gives only an indication.

But this is purely semantics and interpretation.

BTW: this is exactly what the previous thread (http://www.audio-illumination.org/forums/index.php?act=ST&f=1&t=2753) was about.
Title: Statistics For Abx
Post by: ff123 on 2002-09-05 08:05:32
Ok, I think I finally understand the distinction you guys are making:

It's the difference between asking:

"What is the probability that the listener was guessing, given his score?" vs. "What is the probability that a listener gets a certain score, given that he is guessing?"

The value I plan to pop out for the stop points and the look points is the answer to the latter question.  Probably I should change the text in the ABX dialog box to be semantically correct.
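
To make the difference between the two questions concrete, here is a small sketch using Bayes' rule. Answering the first question needs two things the ABX program does not know: a prior probability that the listener is guessing, and the chance of passing when he really can hear something. All numbers below are assumptions picked for illustration.

```python
def prob_guessing_given_pass(prior_guessing, pass_if_guessing, pass_if_hearing):
    """Bayes' rule: P(guessing | pass)."""
    guess_and_pass = prior_guessing * pass_if_guessing
    hear_and_pass = (1 - prior_guessing) * pass_if_hearing
    return guess_and_pass / (guess_and_pass + hear_and_pass)

# Identical files: the listener is certainly guessing, whatever the score says.
print(prob_guessing_given_pass(1.0, 0.0186, 0.0612))            # 1.0

# Assumed 50/50 prior, with a "hearing" listener at 0.6 per trial.
print(round(prob_guessing_given_pass(0.5, 0.0186, 0.0612), 2))  # 0.23
```

The value from the second question (0.0186 in this sketch) is only one input to the first.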

ff123
Title: Statistics For Abx
Post by: Continuum on 2002-09-05 08:31:28
Yes, that's it! 
Title: Statistics For Abx
Post by: ff123 on 2002-09-05 14:53:36
I think what this thread has really taught me is to pay close attention to what the other guy is saying because he has something valuable to say.

ff123