Topic: Probability of passing a sequencial ABX test (Read 39092 times)

  • schnofler
  • [*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #25
Just a few thoughts on this whole discussion: In general I think there's no need to be all that dogmatic about the issue of "when can a test be considered meaningful?". You see, that's why all those programs compute the p-val instead of just saying "test passed!" or "test failed!". The p-val tells you exactly one thing: What is the chance of achieving this result (or an even better one) by simply guessing. Now, if someone writes "I did an ABX test, and I received a p-val of 0.1", then it's up to the reader to decide whether he considers this good enough or not.
You might very well say "Damn, if I hand a few headphones to 10 deaf monkeys, I have a nice 65% chance that at least one of them receives that result, so how the hell can this be meaningful?" or you might just say "Well, if I really had been guessing, there's a 90% chance to do worse, that should be enough".
Of course, that depends on the circumstances. So if you just do some quick private tests (this might apply to guruboolez's question), and you're pretty sure you hear something, and you don't need perfect results anyway, then even a p-val of 0.1 might be enough for you. On the other hand, if you're trying to prove in public that flac sounds worse than wav (or something like that), you'd better make sure you can support that with a strong p-val if you expect people to believe your claims.
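To make the definition of the p-val concrete, here is a minimal Python sketch. The name `p_value` is made up for this example and is not taken from any ABX tool; it simply assumes a fair 50/50 guess on every trial. The second figure reproduces the "10 deaf monkeys" estimate under the assumption that each guesser has exactly a 10% chance of reaching a p-val of 0.1.

```python
from math import comb

def p_value(correct, trials):
    """Chance of getting at least `correct` of `trials` right by pure guessing."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# Chance that at least one of 10 independent guessers reaches p-val <= 0.1,
# assuming each of them has exactly a 10% chance of doing so.
at_least_one = 1 - (1 - 0.1) ** 10

print(round(p_value(7, 8), 3))   # 7 of 8 correct -> 0.035
print(round(at_least_one, 3))    # -> 0.651
```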

Now to some of the more recent posts:
Quote
Still continuing the experiment:
54 of 85, p = 0.008
(still random)
....

72 of 114, p = 0.003
...

78 of 127, p = 0.006

That is incredible: I can randomly generate meaningful results.

2 possibilities:
*I am gifted and am able to do some divination
*we can not trust the current results of abc/hr

Quote
....
119 of 200, p = 0.004

...

Well, I'd have to put my money on the first possibility. All the p-vals are correct. If you did this test systematically (like always saying "X is A"), then there might of course be the third possibility, that the random number generator favors A over B.
Quote
On ABX tests, if you make one error every three trials, then you will have:

6/9 = 0.250
10/15 = 0.150 (15%)
20/30 = 0.049 (<5%)
30/45 = 0.018 (<2%)

more trials = more significant results.

This is correct.
Quote
It's good to know that. If you try to ABX something difficult, and to prove that you're right, better 50 trials than 16 ;-)

This is not. Actually these results make perfect sense. By guessing, you might very well guess two thirds of the trials correctly if you only do a few. But it's extremely improbable that you can maintain this two-thirds-streak for like 100 trials, if you really are only guessing. Conversely, if someone really manages to get two thirds right for 100 trials, you can be pretty sure he heard a difference.
So, if you can't hear any difference but you just really want to have a great ABX result, you should really do quite the opposite of what guruboolez suggests. If your result is already good at 16 trials, then by no means continue to 50; you'll only mess it up.
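The effect of holding a two-thirds hit rate over more and more trials can be checked with a few lines of Python. This is only a sketch under the assumption of fair 50/50 guessing; `p_value` is an illustrative name, and the 67/100 row is added here to match the "two thirds right for 100 trials" case.

```python
from math import comb

def p_value(correct, trials):
    """One-sided p-val: chance of at least `correct` hits in `trials` fair guesses."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Same hit rate (about two thirds), increasing number of trials.
for correct, trials in [(6, 9), (10, 15), (20, 30), (30, 45), (67, 100)]:
    print(f"{correct}/{trials}: p = {p_value(correct, trials):.4f}")
```

The p-val shrinks steadily as the number of trials grows, which is why a guesser is unlikely to keep such a ratio for long.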
Quote
I'd very much appreciate an option (in ABC/HR and its Java counterpart) to clear ABX results after changing selected time,
as I like to use ABX to find differences as misses make the score go bad before I find the part I feel I'm able to ABX.
Maybe an option to clear the results? That would help to reduce warm-up effect.
(you can do the test any number of times before recording the results)

I don't think that's a good idea. It would make the results much less meaningful. If someone gets a p-val of 0.05 with one test, this is a pretty reliable result. But if he restarts the test 15 times, chances are he will get a p-val of 0.05 at least once (supposing he fixed the number of trials).
Also, I don't think the lack of a restart function poses much of a problem. You don't need to get a "perfect" score of 8/8 every time. If you messed up some trials in the beginning, but after that you can hear the difference reliably, you can just do some more trials and the p-val will decrease rapidly. A short example: you started your test, and you can't hear a difference in the beginning. On top of that, you have some serious bad luck, so you get only 2/8 correct (p-val of 0.96). But after that you can hear a difference very reliably, so you do some more trials (which will probably be much quicker than in the beginning), and you manage to get 15/16 correct. Summed up that's 17/24 with a respectable p-val of 0.03.
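Both points can be put into numbers with a short Python sketch. It assumes fair guessing and, for the restart case, that every restarted run independently has exactly a 5% chance of ending at a p-val of 0.05 or better; `p_value` is an illustrative helper, not part of any ABX program.

```python
from math import comb

def p_value(correct, trials):
    """One-sided p-val: chance of at least `correct` hits in `trials` fair guesses."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Restarting: chance that at least one of 15 guessing runs "passes" at 0.05.
restart_chance = 1 - 0.95 ** 15
print(f"at least one pass in 15 restarts: {restart_chance:.3f}")

# Continuing instead: a bad start (2/8) followed by a reliable stretch (15/16).
print(f"2/8:   p = {p_value(2, 8):.3f}")
print(f"17/24: p = {p_value(2 + 15, 8 + 16):.3f}")
```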
  • Last Edit: 10 November, 2003, 01:41:42 PM by schnofler

  • Moneo
  • [*][*][*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #26
The randomizer in WinABX does seem to be deficient. By repeatedly choosing a-b-a-b-a-b-... I've got 114/202 (pval=0.039). With foobar2000's ABX component (which uses the Mersenne Twister to generate random numbers) I only got 106/202 with that strategy, which corresponds to a pval of ~0.25.

Edit: One might wonder why I did 202 trials and not 200... well, I simply didn't stop in time.
  • Last Edit: 10 November, 2003, 01:38:10 PM by Moneo

  • guruboolez
  • [*][*][*][*][*]
  • Members (Donating)
Probability of passing a sequencial ABX test
Reply #27
Quote
This is not. Actually these results make perfect sense. By guessing, you might very well guess two thirds of the trials correctly if you only do a few. But it's extremely improbable that you can maintain this two-thirds-streak for like 100 trials, if you really are only guessing. Conversely, if someone really manages to get two thirds right for 100 trials, you can be pretty sure he heard a difference.

Good point. But it clearly means that we have to be careful with pval. For example, when KikeG said that he wouldn't trust (too much) a pval > 0.05, this means that if people want to convince him, it's better to send him a 30/45 than a 10/15. Or, put differently, if you have difficulty maintaining good concentration and achieving a good ABX score over 16 trials, then rather than performing another test, you should resume the first one and go on to 45...50 trials. This supposes, of course, that the tester is able to keep two thirds right over 50 trials. I'm sure that I could do it with some difficult samples: when 16/16 is strictly impossible, 30/45 isn't too difficult (not for ABXing Flac & PCM of course ;-)). I have often "failed" ABX tests: I did three, four or five different sessions of 16 trials, and all were 11/16 or 12/16. If I had decided to merge the small tests into one big 60-trial test, the conclusion would have changed from "failed" to "succeeded".

I agree with your first comment ("there's no need to be all that dogmatic about the issue of "when can a test be considered meaningful?""). ABX scores are nothing without precise comments about the conditions of the test. For example, I have often had 10/12 on anchor-like encodings, but 12/12 on high quality lossy encodings. The first is so easy that I need 30 seconds for 12 trials (and make stupid mistakes, sometimes with keyboard shortcuts), and the second is so hard that I need 15 minutes to perform it, taking "breaks" in order to keep fresh ears.
  • Last Edit: 10 November, 2003, 02:15:38 PM by guruboolez

  • tigre
  • [*][*][*][*][*]
Probability of passing a sequencial ABX test
Reply #28
Quote
Quote
Does it mean that 5% is a complete useless value? Or does it mean that with 5-15%, there are still some (serious) presumptions about an audible difference?
I'm really interested about it.

Then you could read the Statistics For Abx-thread (long!).

But to give you an idea how much the results are affected: Think of a guessing tester who stops the test as soon as he reaches 0.95 confidence or the maximal length ( =: m) of the test. The probability for him to pass the test is:

m=10 => p-val = 0.0508
m=20 => p-val = 0.0987
m=30 => p-val = 0.1295
m=50 => p-val = 0.1579
m=100 => p-val = 0.2021

See this excel sheet for reference.

Thanks. This is exactly the answer to my question. IMO this should be integrated in ABX utilities: You would have to enter the confidence you want to reach before and if you want to perform a fixed number of trials or to stop after a certain confidence / a maximum number of trials is reached.

Do you know how these "corrected" values are calculated?
Let's suppose that rain washes out a picnic. Who is feeling negative? The rain? Or YOU? What's causing the negative feeling? The rain or your reaction? - Anthony De Mello

  • Continuum
  • [*][*][*][*]
Probability of passing a sequencial ABX test
Reply #29
Quote
Do you know how these "corrected" values are calculated?

I wrote the sheet so I hope I know it! 

You can read the source (comes with macros) and try to figure out what means what.

Or look at this post, for a detailed (and hopefully more understandable) explanation.
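The sheet and the linked post are not reproduced here, so the following Python sketch is only a guess at the kind of calculation involved, not the sheet's actual macro code. It assumes fair guessing and a tester who stops the moment the running one-sided p-val drops to 0.05 or below; `p_value` and `pass_probability` are made-up names.

```python
from math import comb

def p_value(correct, trials):
    """One-sided p-val: chance of at least `correct` hits in `trials` fair guesses."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def pass_probability(max_trials, alpha=0.05):
    """Chance that a guesser reaches p-val <= alpha at some point within max_trials."""
    alive = {0: 1.0}   # alive[k]: probability of k hits so far without having stopped
    passed = 0.0
    for n in range(1, max_trials + 1):
        nxt = {}
        for k, prob in alive.items():
            for hits in (k, k + 1):            # a miss or a hit, each with chance 1/2
                nxt[hits] = nxt.get(hits, 0.0) + prob / 2
        alive = {}
        for k, prob in nxt.items():
            if p_value(k, n) <= alpha:
                passed += prob                 # tester stops here and "passes"
            else:
                alive[k] = prob
    return passed

for m in (10, 20, 30, 50, 100):
    print(f"m={m}: {pass_probability(m):.4f}")
```

For m=10 this gives 13/256, about 0.0508, which agrees with the first row of the table quoted above.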

  • Mac
  • [*][*][*][*][*]
Probability of passing a sequencial ABX test
Reply #30
Quote
The randomizer in WinABX does seem to be deficient, I've got 114/202. With foobar2000's ABX component I only got 106/202 with that strategy.

So with Foobar you had 52.5% correct guesses, and with WinABX you got 56.4% correct?

Unless my 1 minute google search was wrong, both of these are within the +/- 7.1% standard deviation you would expect in a correct/wrong scenario with 202 tests. 

I think your claims about the deficiency of WinABXs randomness are unfounded.
< w o g o n e . c o m / l o l >

Probability of passing a sequencial ABX test
Reply #31
Heh, 116/200 is nearly 35% chance of missing according to my calculator.

P-val calculator is certainly wrong.
  • Last Edit: 10 November, 2003, 03:31:02 PM by AstralStorm
ruxvilti'a

  • Pio2001
  • [*][*][*][*][*]
  • Global Moderator
Probability of passing a sequencial ABX test
Reply #32
The problem is obvious : in his random test, Gabriel is always more right than wrong ! There is no way for this to happen by chance. If the generated sequence is truly random, you should get sometimes more right than wrong, and sometimes more wrong than right.

People seem to consider high confidence to be common when the number of trials rises. No way ! High confidence is high confidence, and by definition, a common result has a low confidence ! This is the definition of "common" and "confidence".
The example assumes that a good result is obtained two times out of three. It is nearly impossible to maintain this just by chance. Sooner or later you'll get two wrong results out of three, and the confidence will collapse.

Maybe it would be interesting to see the logs with every choice of the program and the user. One hypothesis is that the random generator is bad, and there is a correlation between the user choices and the program choices. Note that even if the user chooses random answers, there can be a correlation, because people have a very bad idea of randomness, and when asked to guess at random, usually generate a uniform distribution of answers, rather than a random one. A human list doesn't fluctuate in the long term. A random list does. But note also that as long as the program is really random, all correlation must disappear, because comparing a random list with a non random one must lead to another random list.
The other hypothesis is that the total of successes recorded by the program is wrong. Maybe if we check each answer we'll find 50 right answers out of 100 while the program counts 70 of them. The last hypothesis would be that in the final results, the program records a different list than the one it actually generated. Example : X is A. The user says X is B, the program records "Program : B user : B, right answer".

Mac, my probability courses are far away, but if I'm not mistaken, the probability to be outside the standard deviation is 2 %, which is OK, since we got here a 4 % probability (114 out of 202) for something inside. Can someone confirm this ?

  • Continuum
  • [*][*][*][*]
Probability of passing a sequencial ABX test
Reply #33
Quote
Quote
The randomizer in WinABX does seem to be deficient, I've got 114/202. With foobar2000's ABX component I only got 106/202 with that strategy.

So with Foobar you had 52.5% correct guesses, and with WinABX you got 56.4% correct?

Unless my 1 minute google search was wrong, both of these are within the +/- 7.1% standard deviation you would expect in a correct/wrong scenario with 202 tests. 

I think your claims about the deficiency of WinABXs randomness are unfounded.

I'm not sure what link exactly you are referring to. Anyway, the confidence 0.9608 for 114/202 is an exact value (which approximated with the normal distribution returns 0.954).

Maybe your "+/-"-interval is considering a 2/10 and an 8/10 result as equally important? This, however, is not how it's done in our case.

  • Continuum
  • [*][*][*][*]
Probability of passing a sequencial ABX test
Reply #34
Quote
Heh, 116/200 is nearly 35% chance of missing according to my calculator.

P-val calculator is certainly wrong.

???

The correct confidence value is 0.98593!

What are you calculating?

  • Pio2001
  • [*][*][*][*][*]
  • Global Moderator
Probability of passing a sequencial ABX test
Reply #35
Quote
Quote
Heh, 116/200 is nearly 35% chance of missing according to my calculator.

P-val calculator is certainly wrong.

???

The correct confidence value is 0.98593!

My calculator agrees :


  • Mac
  • [*][*][*][*][*]
Probability of passing a sequencial ABX test
Reply #36
I was going on the standard deviation of a 202 trial binomial distribution as being 7.1, meaning any number of correct guesses between 94 and 108 is dead on target, and anything between 87 and 115 isn't completely unexpected. As 106 (Foobar) and 114 (WinABX) both fell into this, I saw no problem with that. I admit I forgot all my statistics work the day after the exam on it, so I could be wrong.
< w o g o n e . c o m / l o l >

  • schnofler
  • [*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #37
I think one little suggestion is necessary here: Please don't jump to conclusions. Two or three examples are not enough to conclude that some random number generator is faulty. Especially, if you don't do your tests carefully. Continuum's comments show that it's all too easy to "prove" that some program is faulty: Just press the buttons long enough, and it's pretty likely you dip below pval=0.05 at least once.

Quote
The problem is obvious : in his random test, Gabriel is always more right than wrong ! There is no way for this to happen by chance.

How did you conclude that? Maybe I am missing something here, but if I understand it correctly, Gabriel posted 4 intermediate results, out of 200! Certainly we can't conclude that he was always more right than wrong. And it's overwhelmingly probable that you will be more right than wrong 4 times in a 200-trial test.

  • Pio2001
  • [*][*][*][*][*]
  • Global Moderator
Probability of passing a sequencial ABX test
Reply #38
You're right.

I'd like to see a graph of p (probability) versus n (number of trials). Does p decrease ? Does it constantly fluctuate and sometimes (not often) reach low values ?

...I forgot one thing about confidence. The confidence level that is needed must depend on the number of tests performed by someone. For example if I perform one test per day, accept a 5% result as valid, and pass one test out of two...
After 40 days, I get 20 successes, but 5 % is one chance out of 20 !
Thus it is very probable that one of my 20 correct results is flawed !

  • Moneo
  • [*][*][*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #39
Quote
Unless my 1 minute google search was wrong, both of these are within the +/- 7.1% standard deviation you would expect in a correct/wrong scenario with 202 tests. 

Standard deviation alone does not tell you whether the behaviour I encountered was abnormal. Instead, you need to perform a statistical test.

The basis of my statement that there seems to be a deficiency (note the 'seems', as when formally evaluated, my test would only be valid at ~92% confidence, which isn't generally considered high enough) is the following.

The probability of getting 113 or fewer trials correct by guessing is 0.960839995. Thus, the probability of getting 114 or more of them correct is less than 0.04.

Now, since I didn't expect the number to be higher or lower than the mean value of 101 beforehand, I must also include the event that I get 88 or fewer correct answers in the critical interval, making my statement valid only at 92% confidence.
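The step from the one-sided to the two-sided figure can be sketched in Python as follows. This is an illustration assuming fair 50/50 guessing, not a formal write-up of the test; `upper_tail` is a made-up helper name.

```python
from math import comb

def upper_tail(correct, trials):
    """Chance of at least `correct` hits in `trials` fair guesses."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

trials, correct = 202, 114
one_sided = upper_tail(correct, trials)
# Fair guessing is symmetric around the mean, so the equally extreme
# low scores have the same probability as the high ones.
two_sided = 2 * one_sided

print(f"one-sided p-val: {one_sided:.4f}")
print(f"two-sided p-val: {two_sided:.4f}")
print(f"confidence:      {1 - two_sided:.1%}")
```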
Quote
I think your claims about the deficiency of WinABXs randomness are unfounded.

Well, you could help debunking them by performing a simple test.

Following the a-b-a-b-a-b-... strategy, do ~200 trials and post your results.

If you still doubt my methodology, I can write a formal mathematical description of my test.
  • Last Edit: 10 November, 2003, 04:19:21 PM by Moneo

  • Continuum
  • [*][*][*][*]
Probability of passing a sequencial ABX test
Reply #40
Quote
both of these are within the +/- 7.1% standard deviation you would expect in a correct/wrong scenario with 202 tests. 

I think I finally understand your calculation (blame my poor continuous probability knowledge). I believe there are two things wrong:
1. +/- is uninteresting. We are only concerned about good results.
2. 7.1 (=sqrt(0.5*0.5*202)) is not a percentage but an absolute number, so 114 is well outside it.
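A quick Python check of point 2, as a sketch assuming fair guessing: the standard deviation is computed in units of correct answers, and each score is then expressed as a distance from the mean of 101 in those units.

```python
from math import sqrt

trials = 202
mean = 0.5 * trials               # expected number of correct guesses
sd = sqrt(0.5 * 0.5 * trials)     # standard deviation, in correct answers (not percent)

for correct in (106, 114):
    z = (correct - mean) / sd     # distance from the mean in standard deviations
    print(f"{correct}/{trials}: {z:.2f} standard deviations above the mean")
```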

Probability of passing a sequencial ABX test
Reply #41
Check with what probability you can get this result with a random generator (PRNG will probably suffice).
If you get the result in ~5% of the half of the guesses (in this case 101), then they're random (p=~0.5)

It's not that the test gets harder at 100th try than at 20th. (of course given the results so far at 10/20 or 50/100)
  • Last Edit: 10 November, 2003, 04:42:46 PM by AstralStorm
ruxvilti'a

  • schnofler
  • [*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #42
Quote
Well, you could help debunking them by performing a simple test.

Following the a-b-a-b-a-b-... strategy, do ~200 trials and post your results.

OK, I tried it once, decided that it's too much work (having to click 400 times for each result), wrote a little program which remote-controls WinABX, and here's a few results:

1. 88/200, pval=96.1%
2. 110/200, pval=8.9%
3. 92/200, pval=88.5%
4. 77/200, pval=99.9%
5. 120/200, pval=0.3%
6. 92/200, pval=88.5%
7. 104/200, pval=31%
8. 91/200, pval=91%
9. 99/200, pval=58.4%
10. 104/200, pval=31%

edit: I did a few more tests, using different strategies (choosing always A or always B), and they seem to indicate that there's no problem with WinABX's RNG.
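For comparison, the same kind of experiment can be simulated without WinABX. The sketch below is not the remote-control program mentioned above; it uses Python's own `random` module (a Mersenne Twister generator) to decide whether X is A or B and always answers "A". The seed and the function names are arbitrary choices for this example.

```python
import random
from math import comb

def p_value(correct, trials):
    """One-sided p-val: chance of at least `correct` hits in `trials` fair guesses."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def always_a_score(trials, rng):
    """Count hits when X is A or B with equal chance and the answer is always A."""
    return sum(1 for _ in range(trials) if rng.random() < 0.5)

rng = random.Random(2003)   # arbitrary seed, only for repeatability
for run in range(1, 11):
    score = always_a_score(200, rng)
    print(f"{run}. {score}/200, pval={p_value(score, 200):.1%}")
```

With a sound generator the scores should scatter around 100/200 in both directions, much like the list above.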
  • Last Edit: 10 November, 2003, 05:01:09 PM by schnofler

  • Mac
  • [*][*][*][*][*]
Probability of passing a sequencial ABX test
Reply #43
Erg, I mixed myself up a little. When saying +/- I meant that a result of 80 out of 200 is identical to a result of 120 out of 200, as the likelihood of success and failure is equal. By 7.1% I meant 2 standard deviations away from the mean was 14.2, or 7.1%.

It seems schnofler beat me to the test, but here are my 2 results from WinABX:

Choosing all A: 98/200, p=63.8
Choosing all B: 101/200 p=47.2

The P value may be totally screwed, but I see no problems with the randomness of it, hence I stick with saying your claim is unfounded.
< w o g o n e . c o m / l o l >

  • schnofler
  • [*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #44
Quote
Choosing all A: 98/200, p=63.8
Choosing all B: 101/200 p=47.2

The P value may be totally screwed

No, it's not.

  • Pio2001
  • [*][*][*][*][*]
  • Global Moderator
Probability of passing a sequencial ABX test
Reply #45
Here's the graph for the Pval of my 200 answers :


  • Pio2001
  • [*][*][*][*][*]
  • Global Moderator
Probability of passing a sequencial ABX test
Reply #46
Schnofler, could you post your program, or the log file for 2,000 trials (if the probabilities are not too long to be computed, or don't overflow) ? I'd like to plot a larger graph...

Edit : 2,000 should be enough
  • Last Edit: 10 November, 2003, 05:24:35 PM by Pio2001

  • Gabriel
  • [*][*][*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #47
another test:
15 of  24, p = 0.154
16 of  25, p = 0.115
17 of  26, p = 0.084
18 of  27, p = 0.061
19 of  28, p = 0.044
20 of  29, p = 0.031
21 of  30, p = 0.021

another one:
27 of  44, p = 0.087


I tried a 140-choice set, and only 25 times during the test was my p-value .5 or higher. If it was random, shouldn't it be moving around .5?

  • schnofler
  • [*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #48
Quote
Schnofler, could you post your program, or the log file for 2,000 trials (if the probabilities are not too long to be computed, or don't overflow) ? I'd like to plot a larger graph...


I don't really understand what you mean, so here's a log file with 10 tests of 200 trials each, and this and this one are log files with two giant tests of 2000 trials.

edit: I'm not so keen on posting the program itself, because that would mean I'd have to make it usable for anyone but myself. (It's an absolutely awful hack. I wrote another program years ago, which does something similar, and I just replaced some parts to make it control WinABX).
  • Last Edit: 10 November, 2003, 05:49:30 PM by schnofler

  • Moneo
  • [*][*][*][*][*]
  • Developer
Probability of passing a sequencial ABX test
Reply #49
Quote
OK, I tried it once, decided that it's too much work (having to click 400 times for each result), wrote a little program which remote-controls WinABX, and here's a few results:

Nice work!

Quote
4. 77/200, pval=99.9%
5. 120/200, pval=0.3%


I'd say these two are a good indication that something is wrong with the PRNG.

Given that we wanted to test for an abnormal probability of a test returning pval of less than 1% or more than 99%, at a confidence level of 98% it can be claimed that it's bigger than 2% (which it should be equal to).

However, for the results to be statistically valid this value should have been set before the test...
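The 98% figure can be reproduced with a short Python sketch. It assumes that each 200-trial run independently has a 2% chance of ending with a pval below 1% or above 99% (for a discrete test the true chance is slightly lower), and asks how likely it is that 2 or more of 10 runs do so; `at_least` is a made-up helper name.

```python
from math import comb

def at_least(hits, runs, chance):
    """P(at least `hits` of `runs` independent runs land in an event of probability `chance`)."""
    return sum(comb(runs, k) * chance ** k * (1 - chance) ** (runs - k)
               for k in range(hits, runs + 1))

p = at_least(2, 10, 0.02)
print(f"chance of 2 or more extreme runs out of 10: {p:.4f}")
print(f"corresponding confidence: {1 - p:.1%}")
```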