Topic: Meaningful probability in ABX test?

Meaningful probability in ABX test?

I'm trying out the foobar ABX tool for the first time.  I'd like to compare my sound files in a meaningful way (double blind etc) to see if I can really hear the difference.

I've read the sticky saying that a 1% result on a 16-trial test would be very significant, so I set my test to 20 trials just for good measure and did my test.  I got 0.1%.  So obviously I can hear the difference, right?  Only problem was that I was purely guessing.  I didn't even listen to the files!  I did it once for score (after two practice runs to figure out how the program worked) and I set the trial length beforehand at 20.  My result was 17/20, which foobar said was a 0.1% probability of guessing.  This is 10x better than what the sticky claimed would be considered a significant result!

I'm obviously new at this, so am hoping someone can help me understand the significance of all this and what it means from the viewpoint of getting meaningful answers?  I have a screenshot of my results but can't see how to attach the file to this message.  Anyway, I'm pretty sure I did everything correctly.

Feedback from you statisticians and experienced testers would be very helpful.

Thanks!

Meaningful probability in ABX test?

Reply #1
I'm obviously new at this, so am hoping someone can help me understand the significance of all this and what it means from the viewpoint of getting meaningful answers?


Interpretation: Assume you were really guessing at random, with a 50/50 chance.  Then in some cases you would -- by chance -- score as well as 17 or above.  "Some" cases means very few, obviously, but it can happen.

The confidence levels are not measuring how good you are at telling differences.  Someone with a hit rate of only 51% could end up with a "0.1%" confidence figure if the test is repeated sufficiently many times.  What we are primarily interested in is how sure ("confident") we can be that you are better than flipping the coin -- not how much better you are.


Of course, "how much better?" is still an interesting question. 17 out of 20 means (*looking up a calculator*) that we are 95% confident that your hit rate lies between 62.1% and 96.8%. But the starting question is: are you really better than the coin? If you are right in 2 out of 2, then of course that a perfect 100% hit, but it will happen all too often by chance to be confident that you are doing better.



(Heavily edited.)

Meaningful probability in ABX test?

Reply #2
I'm obviously new at this, so am hoping someone can help me understand the significance of all this and what it means from the viewpoint of getting meaningful answers?


What sort of feedback would you expect? The low probability of guessing that is reported simply reflects the fact that it is very unlikely to "get it right" 17 out of 20 times if simply guessing (which you surely agree is true!).

It's a bit like flipping a coin 10 times and getting heads all 10; the probability for this to happen is about 0.1%, and yet of course it may.  So I flip a coin 10 times, and it comes up heads 10 times.  What am I to conclude from this?  That the 0.1% figure is wrong?  No, only that one or more of the following is true:
a) an unlikely event has occurred,
b) I am cheating,
c) I made a mistake when calculating the probability.
I can repeat the experiment a few more times; if the streak does not keep happening, (b) and (c) are ruled out and it was just (a), an unlikely event.  Actually, if I do it a few billion times, it should occur in roughly 1 out of every thousand attempts, which is what the 0.1% probability means.  If it occurs more often than that, either I miscalculated the probability or I am cheating and not telling myself!
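If you want to see this in action rather than take my word for it, here is a quick simulation (plain Python, nothing beyond the standard library):

Code:
import random

runs, streaks = 1_000_000, 0
for _ in range(runs):
    # Flip a fair coin 10 times; count the runs that come up all heads.
    if all(random.random() < 0.5 for _ in range(10)):
        streaks += 1

print(f"10 heads in a row: {streaks} of {runs} runs "
      f"(~{streaks / runs:.3%}, expected ~{0.5**10:.3%})")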

I am not sure if that's an answer to your question because I am not sure I actually understood your question, though...

Meaningful probability in ABX test?

Reply #3
My null hypothesis about such questions is that one needs to have the distinction between hit probability and confidence explained.  "Like, why does it report 0.1% when 17/20 is 85%, not 99.9%?"

I'd say this is hypothesis testing FAQ#1.

Meaningful probability in ABX test?

Reply #4
My null hypothesis about such questions is that one needs to have the distinction between hit probability and confidence explained.  "Like, why does it report 0.1% when 17/20 is 85%, not 99.9%?"

I'd say this is hypothesis testing FAQ#1.


Yes, if that was the question!  I think if the question is "hang on, it's a 0.1% probability but it still happened, how come?" that's different (and much easier to explain) than hypothesis testing.  It's just probability.

This is why I said I was not sure I understood the OP's question.

Meaningful probability in ABX test?

Reply #5
I was hoping ABX testing would be a reliable indicator for whether I can actually detect differences in my files.   

Based on the guidelines, I need to test as follows:

1)  Do at least 16 trials and fix the number at the start (I chose 20).
2)  Do one run for score.  Keep the result.
3)  Achieve a probability value of less than 1% to "pass".  I.e., this would indicate that one is very likely actually hearing a difference (as opposed to achieving that outcome by chance).

However I scored 10x better than this criterion simply by guessing like a monkey.

How then do I conclude anything useful when actually listening to my files?  Even achieving 0.1% doesn't seem to be enough to conclude that I can actually hear a difference.

So is ABX testing useless for me?  If not, what do I have to do to make it useful (i.e., I'm trying to reach a conclusion...and preferably a correct one!)?  The above guidelines don't seem to be sufficient.




Meaningful probability in ABX test?

Reply #6
Even achieving 0.1% doesn't seem to be enough to conclude that I can actually hear a difference.


Yes it would be, more than enough.  The thing is, unlikely results show up from time to time, and therefore conclusions which are essentially wrong are drawn from time to time.  This is statistics.  The fraction of wrong conclusions is low, but of course, if we keep on rolling dice an enormous number of times, there will -- most probably -- in some round be an "unbelievable" streak of Yatzys.


(Heavily edited.)

Meaningful probability in ABX test?

Reply #7
However I scored 10x better than this criterion simply by guessing like a monkey.
I dare you to try to do it again.

How then do I conclude anything useful when actually listening to my files?  Even achieving 0.1% doesn't seem to be enough to conclude that I can actually hear a difference.

It does 99.9% of the time.
elevatorladylevitateme

Meaningful probability in ABX test?

Reply #8
to OP:

Do the test again, on another sample. Then you would have greater confidence.

The thing to focus on, apart from the fact that improbable events do happen from time to time and someone wins the lottery every week, is your statement that you were guessing.  Did you really mean that?  Or did you mean that you had a hunch, but couldn't name exactly what it was that gave you that hunch?  In the latter case, you would really be hearing a difference, but wouldn't have learned how to identify and describe the nature of the difference.  You could learn to do that, or not, depending on whether you want to stay happy with your present set-up.

Meaningful probability in ABX test?

Reply #9
Next time, if you want to go random, literally flip a coin.

Meaningful probability in ABX test?

Reply #10
My statistics chops may be a little stale but I'm calculating a 5% chance of getting 17 of 20 right by guessing: 0.5^17 x 20 x 19 x 18.

I'm quite good with percentages. 5% is one-in-20: a believable report. 0.1% is one-in-1000: much more difficult to believe.

Meaningful probability in ABX test?

Reply #11
Thanks for the responses everyone.  It helps add some perspective to the results.  As per one of the suggestions, I tried 10 more times the same way (just clicking randomly) and got the following:

58.8%
41.2%
58.8%
13.2%
2.1%
13%
74%
41%
5.8%
5.8%

I'd get carpal tunnel clicking enough times to repeat my 0.1% result (or maybe it will be the very next try! lol).  So I can accept I was unlucky to get that unbelievable result on my very first ABX run.

The 10 runs seem skewed towards low p values but I suppose that's just statistics too and the sample size of 10 is too low for a good distribution.

Do people usually use 1% as a good indicator?  Three times during the runs I was at 1% on 16 tries, so for me it is too common if only 16 tries are used.  Although, from the above, you can see that by 20 tries I never achieved 1%.  So I'll probably use 1% at 20 tries as my standard trial unless there are good reasons otherwise.
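For reference, here is a small sketch (Python; assuming the score is a plain binomial, which is how the foobar tool appears to compute its figure) of how many hits a fixed-length run needs before the guessing probability drops below 1%:

Code:
from math import comb

def min_hits(n, alpha=0.01):
    """Smallest k with P(at least k correct out of n by guessing) < alpha."""
    for k in range(n + 1):
        if sum(comb(n, i) for i in range(k, n + 1)) / 2**n < alpha:
            return k

for n in (16, 20):
    k = min_hits(n)
    p = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    print(f"{n} trials: need {k}/{n} (p = {p:.3%})")
# 16 trials: need 14/16 (p = 0.209%)
# 20 trials: need 16/20 (p = 0.591%)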

I'm still interested to know how to reduce the chance of a "bogus" result.  No-one has directly addressed that.  Is there a standard approach?  Intuitively I could increase the number of tries in a trial, but 20 already seems a long time to concentrate.  Reducing the p value to 0.1% might work, but how good one is at detecting differences might become a problem (hit rate versus confidence, as one poster discussed earlier).  I could also see doing multiple runs as above, but then I'm not sure how one combines them into a valid statistical answer.  Doing multiple runs makes the most sense to me, since intuitively if I can really hear a difference it should be repeatable.

Is there a corollary to trying to reduce the "false positive"?  Intuitively I would think the harder one makes this, the more likely it is to suppress a genuine positive result.  I think this stuff can get very complicated quickly, such that only a trained statistician can make sense of things.  But mainly I want to remove my own expectation bias when comparing my files, so ABX testing seems to do that even if some interpretation is still required.

Thanks again for the responses.

Meaningful probability in ABX test?

Reply #12
Quote
Is there a corollary to trying to reduce the "false positive"? Intuitively I would think the harder one makes this the more likely it is to suppress a genuine positive result.


Yes.  The whole process falls under what is known as Statistical Power.  Unfortunately it can be quite a PITA to reduce the various errors, and so I suspect the easiest way is to just repeat the ABX test.  The key is to give yourself time between tests (even if it is days) to recover your concentration and interest.

Scoring <1% twice is quite a bit of evidence that you can actually ABX something.  As you've found, doing it twice by guessing is not likely at all.
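To put rough numbers on that (a sketch, assuming independent runs and an exact 1% pass criterion; it also shows why cherry-picking the best of many runs is dangerous):

Code:
alpha = 0.01  # per-run pass criterion

# Two independent runs both passing by pure luck:
print(f"both of 2 runs pass: {alpha**2:.4%}")  # 0.0100%

# ...versus at least one lucky pass if you keep retrying and report the best:
for m in (1, 5, 20, 100):
    print(f"at least 1 pass in {m:>3} tries: {1 - (1 - alpha)**m:.1%}")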




Meaningful probability in ABX test?

Reply #13
Mmm....

It seems it is not too difficult to get (relatively) good results.

Got one 12/16 (3.8% probability of guessing) just now by randomly clicking on the buttons.

But getting a good result twice by chance is definitely a lot harder.  My usual results are correctly identified as random guessing (50-80% probability of guessing).

Meaningful probability in ABX test?

Reply #14
Do people usually use 1% as a good indicator?

The two most common thresholds are 1% and 5%.  (And then there is the discussion on testing one-sided vs. two-sided: You would not expect to guess worse than the coin, except by coincidence, right?  Well, assume that you were not testing against the coin, but against ... e.g., me.  Then the hypothesis to be tested is not "you are better", but "we are not equally good".  Testing at a 1% level would then be done with a 0.5% threshold at each end.)


Three times during the runs I was at 1% on 16 tries, so for me it is too common if only 16 tries are used.

Well, this is something different: now you are retrospectively singling out the best intermediate result, rather than the final one.

(Assume you flip the coin, and count a score S as the difference correct minus false.  A 50/50 chance means: the expected score in a fixed run of N flips is zero.  But if you plot S against N for a running N, you will find that it fluctuates quite a bit -- see e.g. http://en.wikipedia.org/wiki/Law_of_the_iterated_logarithm -- and so if you have the option to stop at will, you will with 100% probability be able to get any S you want, just by flipping over and over and over.)
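Here is a small simulation of that effect (a Python sketch; the 100-trial cap and the 1% level are arbitrary choices of mine) -- a pure guesser who may quit the moment the running figure looks good "passes" far more often than the nominal level suggests:

Code:
from math import comb
import random

def p_at_least(k, n):
    # One-sided guessing probability: P(at least k correct out of n).
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

def guesser_passes(n_max=100, alpha=0.01):
    # Pure guesser, allowed to stop as soon as the running p-value < alpha.
    hits = 0
    for n in range(1, n_max + 1):
        hits += random.random() < 0.5
        if p_at_least(hits, n) < alpha:
            return True
    return False

runs = 2000
passes = sum(guesser_passes() for _ in range(runs))
print(f"{passes / runs:.1%} of pure guessers 'pass' at the 1% level "
      f"when allowed to stop at will (vs ~1% with a fixed trial count)")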



I'm still interested to know how to reduce the chance of a "bogus" result.  No-one has directly addressed that.  Is there a standard approach?

Yep, and you followed it.  You got a bogus result.  With a million monkeys doing the procedure, there will be a lot of bogus results.  Just like winning the lottery.


Doing multiple runs makes the most sense to me, since intuitively if I can really hear a difference it should be repeatable.

Is there a corollary to trying to reduce the "false positive"?


Yep -- and there are various reasons why we are careful about not accepting false positives. For example:
- The burden of proof is on the seller.  Assume you want a medicine approved.  Come on, convince us!  No, not "51%" -- convince us!
- Changing your mind is maybe not costly, but acting on the consequences is.  For example, consider the hypothesis "green light on top prevents accidents more effectively than red light on top".  We don't want to go through the procedure of changing every single traffic light just because we're 51% certain.


Then of course, as you mention, the cost is that one will frequently "suppress a genuine positive".  But in many cases, this is easy to avoid: do more experiments!  In other cases, e.g. in many social sciences, you don't do experiments, you gather available data -- then this is much harder.  But for listening purposes: experiment again.

Meaningful probability in ABX test?

Reply #15
Oh, by the way: This "0.1%" confidence level, etc., is the inference from the data and the data alone. (In this particular case, you had a fairly heavy piece of information apart from the data -- you knew that you were guessing.)

But "these data, and these data alone" yields a counterintuitive but yet natural result: getting the same result as you based your previous conclusion upon, might lead you to reverse it! Consider:

- 5 statisticians meet. They have tested the same coin for fairness, and obtained the same result: 7 heads in 10 tosses. The null hypothesis is "50/50", and the alternative hypothesis is "not 50/50".
- Before the meeting, each has concluded "do not reject the null".
- Yet when they meet and pool their data (35 heads in 50 tosses), even though each has the same result and the same "not reject" outcome, they all change their conclusion to "reject the null".
- Why? Because their preliminary conclusion was really: "these data are not enough to reject the null".


(Two-sided test, 1% level, from an online calculator.)
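The arithmetic behind that story, as I understand it (a Python sketch using a doubled-tail two-sided binomial p-value; an online calculator may differ slightly in convention):

Code:
from math import comb

def two_sided_p(k, n):
    # Two-sided p-value against a fair coin: double the larger tail.
    tail = sum(comb(n, i) for i in range(max(k, n - k), n + 1)) / 2**n
    return min(1.0, 2 * tail)

print(f"each alone: 7/10  -> p = {two_sided_p(7, 10):.3f}")   # ~0.344, keep the null at 1%
print(f"pooled:     35/50 -> p = {two_sided_p(35, 50):.4f}")  # ~0.0066, reject at 1%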

Meaningful probability in ABX test?

Reply #16
Oh, by the way: This "0.1%" confidence level, etc., is the inference from the data and the data alone. (In this particular case, you had a fairly heavy piece of information apart from the data -- you knew that you were guessing.)


Guessing is a much bigger deal in statistics than many people think it is!  "Modern" models used for analyzing multiple-choice questions (an ABX trial is a two-choice item) actually have a guessing parameter.  Exams like the GRE use it.

As Porcus has pointed out, there is a difference between randomly pressing buttons (you know you are guessing) and actually trying an ABX and trying to pick the matching sample.  In the former you are consciously guessing.  In the latter you are not, especially if you support the claim that you may be able to subconsciously match samples in a way your conscious mind cannot.

If you want to increase accuracy you can also use an ABXY.  I strongly suspect this will increase the difficulty of guessing your way to a significant result, but I haven't run the numbers myself.

Quote
In other cases, e.g. in many social sciences, you don't do experiments, you gather available data


Even when you do conduct experiments in the social sciences there are a number of issues that can greatly reduce the practical significance of a statistically significant result.  One successful ABX can mean something.  In the social sciences, if you want to support any hypothesis purely on the basis of statistical significance many people will dismiss your argument.

I don't know if the above contributes to this discussion...

Meaningful probability in ABX test?

Reply #17
I was hoping ABX testing would be a reliable indicator for whether I can actually detect differences in my files.   

Based on the guidelines, I need to test as follows:

1)  Do at least 16 trials and fix the number at the start (I chose 20).
2)  Do one run for score.  Keep the result.
3)  Achieve a probability value of less than 1% to "pass".  I.e., this would indicate that one is very likely actually hearing a difference (as opposed to achieving that outcome by chance).

However I scored 10x better than this criterion simply by guessing like a monkey.


Please provide more details of this.

How many runs did you throw away?

How are you sure that your guesses were random?

Meaningful probability in ABX test?

Reply #18
How many runs did you throw away?

How are you sure that your guesses were random?


Well:
Quote from: OP
I didn't even listen to the files! I did it once for score (after two practice runs to figure out how the program worked) and I set the trial length beforehand at 20.

Meaningful probability in ABX test?

Reply #19
How many runs did you throw away?

How are you sure that your guesses were random?


Well:
Quote from: OP
I didn't even listen to the files! I did it once for score (after two practice runs to figure out how the program worked) and I set the trial length beforehand at 20.



More details, please.

If you were just flipping coins, it is highly improbable that you would do "10 times better" than the 1% criterion.  There would thus have to be some difference between what you did and flipping coins.

Meaningful probability in ABX test?

Reply #20
He's not been able to reproduce the 0.1% result, but let's assume that it did happen as he said.  My math says there's a 5% chance of getting at least 17 of 20 right by guessing (see post 11).  Foobar ABX says 0.1%.  If my math is correct, the report seems quite plausible.  If Foobar is correct, then yeah, I want to hear more details too.

Meaningful probability in ABX test?

Reply #21
He's not been able to reproduce the 0.1% result, but let's assume that it did happen as he said.  My math says there's a 5% chance of getting at least 17 of 20 right by guessing (see post 11).  Foobar ABX says 0.1%.  If my math is correct, the report seems quite plausible.  If Foobar is correct, then yeah, I want to hear more details too.


Calculation fixed: the probability is 0.5^20*(20!/(17!*3!))=0.001087188.

 

Meaningful probability in ABX test?

Reply #22
Foobar ABX says 0.1%

which is correct, up to roundoff error.

2^20 cases total, of which:
20 successes: 1
19 successes: 20
18 successes: 20!/(2! 18!)
17 successes: 20!/(3! 17!)

Sum: (1 + 20 + 19 * 20 / 2 + 18 * 19 * 20 / 6) = 1351

2^20 is 1024^2, or approx. 1 million.  About a thousand in a million = one in a thousand.
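A one-liner to double-check (Python):

Code:
from math import comb
hits = sum(comb(20, k) for k in range(17, 21))  # 1140 + 190 + 20 + 1
print(hits, hits / 2**20)                       # 1351  0.00128...  (~0.13%)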


Edit  @ unekdoud: 
We want the probability of 17 or better. Difference small, though.