Topic: Bayes Factors for ABX tests

Bayes Factors for ABX tests 2016-03-30 22:49:51
In light of the ASA statement on the use of p-values, I'd like to discuss some alternative statistical approaches to simple ABX tests.

As we all know, the frequentist approach to hypothesis testing calculates a p-value: we assume the null hypothesis (H0) to be true and calculate the probability of obtaining a result at least as extreme as the one observed.
For X ~ B(n, p):
H0: p = 0.5
H1: p > 0.5
P(X >= x | H0)
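
As a quick check of the tail probability above, here is a minimal Python sketch. The function name abx_p_value is made up for illustration; it simply sums the upper binomial tail under p = 0.5.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided p-value P(X >= correct | p = 0.5) for X ~ B(trials, 0.5)."""
    # Count the outcomes with at least `correct` hits among all 2**trials.
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials

print(round(abx_p_value(9, 10), 4))  # 0.0107
```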

Example for an ABX test with a 9/10 result: P(X >= 9 | p=0.5) = 0.0107, which is a p-value < 5% (a commonly chosen significance level) and therefore considered "statistically significant".

The Bayesian approach uses Bayes' theorem to turn this around:
P(H | data) = P(data | H) * P(H) / P(data) = P(data | H) * P(H) / ( P(data | H) * P(H) + P(data | ¬H) * P(¬H) )

It is the basis of Bayesian hypothesis testing, which can be used to compare different models, for example M0 vs M1:
M0: P(X = x | p=0.5)
M1: 2 * ∫_{0.5}^{1} P(X = x | p) dp

Then we pit the models against each other and get a Bayes Factor:
BF01 = P(data | M0) / P(data | M1), with values >1 supporting M0
BF10 = P(data | M1) / P(data | M0), with values >1 supporting M1

Now we can answer the question: how well, relative to each other, do the hypotheses explain the data?
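
To make the two marginal likelihoods concrete, here is a rough Python sketch. It evaluates the M1 integral numerically with composite Simpson's rule, which is my choice for illustration and not necessarily how the numbers in this post were produced. All function names are hypothetical.

```python
from math import comb, log10

def binom_pmf(x, n, p):
    """P(X = x | p) for X ~ B(n, p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def marginal_m0(x, n):
    """P(x | M0): p fixed at 0.5."""
    return binom_pmf(x, n, 0.5)

def marginal_m1(x, n, steps=2000):
    """P(x | M1) = 2 * integral of P(X = x | p) over [0.5, 1].

    Composite Simpson's rule; `steps` must be even.
    """
    a, b = 0.5, 1.0
    h = (b - a) / steps
    total = binom_pmf(x, n, a) + binom_pmf(x, n, b)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * binom_pmf(x, n, a + i * h)
    return 2 * total * h / 3

def log10_bf10(x, n):
    """log10 of BF10 = P(x | M1) / P(x | M0)."""
    return log10(marginal_m1(x, n) / marginal_m0(x, n))

print(round(log10_bf10(9, 10), 3))   # 1.267
print(round(log10_bf10(18, 20), 3))  # 2.721
```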

I use log10(BF) so that results with negative evidence are easier to read. The categories* I will use are:
= 0: no support
< 0.5: not worth more than a bare mention
< 1: moderate
< 1.5: strong
< 2: very strong
>= 2: decisive
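
The thresholds above translate into a small lookup. This is only a sketch: the function name is invented, and the short labels "negative" and "barely" are taken from how the result tables in this post abbreviate the categories.

```python
def evidence_category(log10_bf: float) -> str:
    """Map log10(BF10) to the category labels used in this post."""
    if log10_bf < 0:
        return "negative"      # the data favour M0 instead
    if log10_bf == 0:
        return "no support"
    if log10_bf < 0.5:
        return "barely"        # "not worth more than a bare mention"
    if log10_bf < 1:
        return "moderate"
    if log10_bf < 1.5:
        return "strong"
    if log10_bf < 2:
        return "very strong"
    return "decisive"

print(evidence_category(1.267))  # strong
```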

Here are results for some common ABX trial counts including interpretation (according to Jeffreys 1961, Appendix B):

10 trials
Correct      5            6            7            8            9            10
P(x|M0)      2.461E-01    2.051E-01    1.172E-01    4.395E-02    9.766E-03    9.766E-04
P(x|M1)      9.091E-02    1.319E-01    1.612E-01    1.759E-01    1.808E-01    1.817E-01
log10(BF10)  -0.432       -0.192       0.139        0.602        1.267        2.270
Evidence     negative     negative     barely       moderate     strong       decisive

12 trials
Correct      6            7            8            9            10           11           12
P(x|M0)      2.256E-01    1.934E-01    1.208E-01    5.371E-02    1.611E-02    2.930E-03    2.441E-04
P(x|M1)      7.692E-02    1.091E-01    1.333E-01    1.467E-01    1.521E-01    1.536E-01    1.538E-01
log10(BF10)  -0.467       -0.248       0.043        0.437        0.975        1.720        2.799
Evidence     negative     negative     barely       barely       moderate     very strong  decisive

14 trials
Correct      7            8            9            10           11           12           13           14
P(x|M0)      2.095E-01    1.833E-01    1.222E-01    6.110E-02    2.222E-02    5.554E-03    8.545E-04    6.104E-05
P(x|M1)      6.667E-02    9.285E-02    1.132E-01    1.254E-01    1.310E-01    1.328E-01    1.333E-01    1.333E-01
log10(BF10)  -0.497       -0.295       -0.033       0.312        0.771        1.379        2.193        3.339
Evidence     negative     negative     negative     barely       moderate     strong       decisive     decisive

16 trials
Correct      8            9            10           11           12           13           14           15           16
P(x|M0)      1.964E-01    1.746E-01    1.222E-01    6.665E-02    2.777E-02    8.545E-03    1.831E-03    2.441E-04    1.526E-05
P(x|M1)      5.882E-02    8.064E-02    9.810E-02    1.092E-01    1.148E-01    1.169E-01    1.175E-01    1.176E-01    1.176E-01
log10(BF10)  -0.524       -0.335       -0.095       0.214        0.616        1.136        1.807        2.683        3.887
Evidence     negative     negative     negative     barely       moderate     strong       very strong  decisive     decisive

20 trials
Correct      10           11           12           13           14           15           16           17           18           19           20
P(x|M0)      1.762E-01    1.602E-01    1.201E-01    7.393E-02    3.696E-02    1.479E-02    4.621E-03    1.087E-03    1.812E-04    1.907E-05    9.537E-07
P(x|M1)      4.762E-02    6.364E-02    7.699E-02    8.623E-02    9.151E-02    9.397E-02    9.490E-02    9.517E-02    9.523E-02    9.524E-02    9.524E-02
log10(BF10)  -0.568       -0.401       -0.193       0.067        0.394        0.803        1.313        1.942        2.721        3.698        4.999
Evidence     negative     negative     negative     barely       barely       moderate     strong       very strong  decisive     decisive     decisive

*) The above categories may seem somewhat arbitrary, similar to significance levels. They are not needed, however, since we can just look at the odds directly:
Posterior Odds = Bayes Factor * Prior Odds

Example:
We have prior data on two files which tells us that about one in ten people can distinguish them.
Prior Odds = 0.1 / (1 - 0.1) = 0.111...
A person scores 9/10 in an ABX test, which gives us a Bayes Factor of 10^1.267 = 18.5.
Posterior Odds = 2.056
So the odds for this person doing better than chance (M1 over M0) are about 2:1.

Let's say the person does another 9/10, so 18/20 in total for a Bayes Factor of 525.5, resulting in odds of about 58:1.
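
The arithmetic of this example fits in a few lines of Python. The helper names are made up, and the odds-to-probability conversion at the end is an extra step not used in the post itself.

```python
def posterior_odds(bayes_factor: float, prior_probability: float) -> float:
    """Posterior Odds = Bayes Factor * Prior Odds."""
    prior_odds = prior_probability / (1 - prior_probability)
    return bayes_factor * prior_odds

def odds_to_probability(odds: float) -> float:
    """Convert odds of o:1 into a probability."""
    return odds / (1 + odds)

print(round(posterior_odds(18.5, 0.1), 3))   # 2.056  (9/10)
print(round(posterior_odds(525.5, 0.1), 1))  # 58.4   (18/20)
print(round(odds_to_probability(posterior_odds(18.5, 0.1)), 2))  # 0.67
```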

Please consider that a high BF does not guarantee that a difference was heard. Again, we all know that various problems can creep into such a test that will make the results meaningless.
For example, Evett (1991) has argued for a BF of at least 1000 against innocence in a criminal trial for forensic evidence alone. Also, even a BF of 1000 can still be too low to provide enough evidence for an extraordinary claim.

My 2¢ on this is that we want to see strong evidence or better for simple ABX tests. (Whether to take results seriously depends on much more than just this single number however.) Especially with higher trial counts this turns out to be more demanding than a 5% significance level.
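
To see how the two criteria compare, the sketch below finds the smallest number of correct answers that reaches p < 0.05 and the smallest that reaches "strong" evidence, which I take to mean log10(BF10) >= 1 per the category list. It uses a closed form for the M1 integral (a standard identity between the incomplete beta function and the binomial tail) instead of numerical integration; the function names are invented for this example.

```python
from math import comb, log10

def p_value(x, n):
    """One-sided p-value P(X >= x | p = 0.5)."""
    return sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n

def log10_bf10(x, n):
    """log10(BF10) for M1 = uniform p on [0.5, 1] vs M0 = p fixed at 0.5."""
    # The integral of C(n,x) p^x (1-p)^(n-x) over [0.5, 1] equals
    # P(Bin(n + 1, 0.5) <= x) / (n + 1); M1 doubles it.
    tail = sum(comb(n + 1, k) for k in range(x + 1)) / 2 ** (n + 1)
    m1 = 2 * tail / (n + 1)
    m0 = comb(n, x) / 2 ** n
    return log10(m1 / m0)

def min_correct(n, passes):
    """Smallest number of correct answers out of n that meets the criterion."""
    return next(x for x in range(n + 1) if passes(x, n))

for n in (10, 12, 14, 16, 20):
    by_p = min_correct(n, lambda x, m: p_value(x, m) < 0.05)
    by_bf = min_correct(n, lambda x, m: log10_bf10(x, m) >= 1)
    print(n, by_p, by_bf)
# 10 9 9
# 12 10 11
# 14 11 12
# 16 12 13
# 20 15 16
```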

edit2: tables updated, added odds and example
edit3: fixed Bayes factor definitions
"I hear it when I see it."

Re: Bayes Factors for ABX tests
I'm glad someone else read that link. Wonder when we'll see Bayesian statistics incorporated into common ABX tools...

Re: Bayes Factors for ABX tests
I was familiar with some problems of p-values before, but it's nice to see some "official" statements and this gave me an excuse to finally dive a bit deeper into Bayesian statistics.

"I hear it when I see it."

Re: Bayes Factors for ABX tests
This does look incredibly interesting. Do you know of any books on this approach to statistics that are outside of any specific context and make an effort to substantiate it mathematically?

Re: Bayes Factors for ABX tests
I can only suggest Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan by John Kruschke.

There's another neat thing. The width of the posterior probability distribution reflects our uncertainty about the true value of θ: the narrower it is, the less uncertain we are. (θ = 0.5 corresponds to the M0 model from above.)
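
A small sketch of that idea, assuming a flat Beta(1, 1) prior on θ over the whole interval [0, 1] (not the M1 prior restricted to [0.5, 1]), so that the posterior is Beta(1 + correct, 1 + wrong). The function name and the use of the standard deviation as the measure of width are my own choices.

```python
from math import sqrt

def posterior_sd(correct, trials, a=1.0, b=1.0):
    """Standard deviation of the Beta(a + correct, b + wrong) posterior."""
    a_post = a + correct
    b_post = b + (trials - correct)
    total = a_post + b_post
    return sqrt(a_post * b_post / (total ** 2 * (total + 1)))

print(round(posterior_sd(9, 10), 3))   # 0.103
print(round(posterior_sd(18, 20), 3))  # 0.072
```

Same proportion correct, but twice the trials gives a visibly narrower posterior.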
"I hear it when I see it."

Re: Bayes Factors for ABX tests
Thanks for a very interesting post. It's a shame I only saw this recently.

It is my opinion too that posterior odds are a more intuitive way to interpret ABX test results. These days I find that the more traditional p-value approach makes less sense to me the more I think about it.

Re: Bayes Factors for ABX tests
It's not just about the odds. You can convert any probability into odds if you like.

The important difference is that in the frequentist approach you calculate the probability of the data given/assuming that no difference was heard (the null hypothesis H0).
If the data are sufficiently improbable under that assumption, we reject the very hypothesis that we assumed to be true for the calculation in the first place.

But in the Bayesian approach you calculate the probability that no difference was heard given the data, or that a difference was correctly identified given the data.
Furthermore, the Bayes Factors tell you how strongly the data (the test results) support one model or hypothesis over another.

And lastly it incorporates prior knowledge: extraordinary claims require extraordinary evidence.
A 10/10 result for the claim that a losslessly compressed file sounds different from an uncompressed one will not convince anyone and no-one should believe that claim based on that evidence alone.
But if the claim is that there's an audible difference between a lossless and 64 kbps mp3 then there's no real need for further evidence. (Although more data is always nice...)
"I hear it when I see it."

Re: Bayes Factors for ABX tests
Quote: "It's not just about the odds. You can convert any probability into odds if you like."
Yes. But with Bayes factors you can actually have a probabilistic statement about whether you were able to tell a difference.

When the Foobar ABX comparator says "probability you were guessing is x", this is not technically correct. As far as I can tell Foobar does not return the Bayesian "probability you were guessing", but rather a frequentist p-value. In a frequentist setting you cannot assign probabilities to fixed unknowns.
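
To illustrate the difference, here is a sketch that puts the p-value next to a posterior probability of guessing for a 9/10 result. It assumes the M0/M1 models from the first post and 1:1 prior odds, which is purely an assumption for this example; the function names are made up.

```python
from math import comb

def p_value(x, n):
    """Frequentist tail probability P(X >= x | p = 0.5)."""
    return sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n

def bf10(x, n):
    """Bayes factor for M1 (uniform p on [0.5, 1]) over M0 (p = 0.5)."""
    # Closed form of the M1 integral via the binomial tail identity.
    m1 = 2 * sum(comb(n + 1, k) for k in range(x + 1)) / 2 ** (n + 1) / (n + 1)
    m0 = comb(n, x) / 2 ** n
    return m1 / m0

def prob_guessing(x, n, prior_odds_m1=1.0):
    """Posterior probability of M0, given prior odds for M1 over M0."""
    posterior_odds_m1 = bf10(x, n) * prior_odds_m1
    return 1 / (1 + posterior_odds_m1)

print(round(p_value(9, 10), 4))        # 0.0107
print(round(prob_guessing(9, 10), 4))  # 0.0513
```

With these assumptions the posterior probability of guessing is roughly five times the p-value, and it moves with the prior odds while the p-value does not.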

The Bayesian takes the simpler view that all unknowns are random. Therefore any unknown quantity can be assigned a probability.

Personally, I find posterior odds (or probabilities) easier to interpret than p-values.

Re: Bayes Factors for ABX tests
Remember that a high Bayes Factor for a model alone doesn't tell you that the model is true. It just tells you that it's a much better fit for the data than the model it was compared against.
But there's the simple Posterior Odds = BF * Prior Odds.

In the case of total ignorance (which is realistically almost never the case) you'd start with 1:1 prior odds and then feed the resulting odds as prior odds into the same formula again for each repetition of the test.
(If you plotted the probability distributions you'd also see a decrease in uncertainty for each repetition.)
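
A sketch of such a repetition for two 9/10 runs, using the M0/M1 models from the first post. One detail it makes visible: because M1 averages over p, the Bayes factor of the second run has to be computed with M1's distribution over p updated by the first run. Only then does the product match the pooled 18/20 result; reusing 18.5 twice gives roughly 343 instead of 525.5. Simpson's rule and all names are my own choices for illustration.

```python
from math import comb

def pmf(x, n, p):
    """Binomial probability P(X = x | p)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def bf10_first(x, n):
    """BF10 with M1's initial prior: p uniform on [0.5, 1]."""
    m1 = 2 * simpson(lambda p: pmf(x, n, p), 0.5, 1.0)
    return m1 / pmf(x, n, 0.5)

def bf10_next(x_new, n_new, x_old, n_old):
    """BF10 for a new run, with M1's prior over p updated by the old run."""
    joint = simpson(lambda p: pmf(x_new, n_new, p) * pmf(x_old, n_old, p), 0.5, 1.0)
    evidence_old = simpson(lambda p: pmf(x_old, n_old, p), 0.5, 1.0)
    return (joint / evidence_old) / pmf(x_new, n_new, 0.5)

first = bf10_first(9, 10)           # about 18.5
second = bf10_next(9, 10, 9, 10)    # second run, given the first
pooled = bf10_first(18, 20)         # about 525.5

print(round(first * second, 1), round(pooled, 1))  # 525.5 525.5
print(round(first * first, 1))                     # naive reuse, about 343
```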

You're right that the ABX comparator's wording isn't correct. There's another thread about it somewhere.
What it reports is the probability of getting the same or a more extreme result given that no difference was heard, i.e. that the choices were made with random fair coin flips.
"I hear it when I see it."

Re: Bayes Factors for ABX tests
If you want to be really (pedantically) Bayesian you can say that there is no objective truth when dealing with the unknowable and hence there is no underlying "true model". This is getting into troll territory though.

I'm guessing you know the concept of prior sample size. With the uniform prior in your post you have a prior sample size of two: one success and one failure. You can even have a prior sample size of one if you use an appropriately scaled Beta(0.5,0.5) prior. The Beta(0.5,0.5) prior can be shown to minimize the influence of the prior on the posterior distribution. In this way you can maximize the objectivity of the test if you want to.
https://en.wikipedia.org/wiki/Beta_distribution
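
A minimal sketch of the prior sample size idea, assuming a Beta(a, b) prior over the whole interval [0, 1] so that the posterior after x correct out of n is Beta(a + x, b + n - x). The prior then acts like a + b extra trials. The function name is hypothetical.

```python
def posterior_mean(correct, trials, a, b):
    """Mean of the Beta(a + correct, b + wrong) posterior.

    The prior behaves like a + b pseudo-trials with a pseudo-successes.
    """
    return (a + correct) / (a + b + trials)

# 9/10 correct under the two priors discussed here
print(round(posterior_mean(9, 10, 1.0, 1.0), 3))  # 0.833, prior sample size 2
print(round(posterior_mean(9, 10, 0.5, 0.5), 3))  # 0.864, prior sample size 1
```

The smaller prior sample size pulls the estimate less strongly away from the observed 0.9.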

Re: Bayes Factors for ABX tests
Quote: "A 10/10 result for the claim that a losslessly compressed file sounds different from an uncompressed one will not convince anyone and no-one should believe that claim based on that evidence alone."
In this case the first thing I will do is to redigitize the analog output of the playback device and examine them.

Re: Bayes Factors for ABX tests
EekWit, pedantically we can never arrive at truth using such tests and evaluations. But we can get to odds "beyond reasonable doubt" either way.
That's just life ... where we cannot prove things as easily as in axiomatic systems.

On the uniform prior: yeah. I was not precise enough when I spoke about "total ignorance". The flat or Beta(1, 1) prior contains the knowledge that a trial can both fail and succeed. I think that's a very reasonable and sensible assumption for an ABX test.
It also follows the principle of indifference: from .45 to .55 you get the same probability as from .9 to 1 which is - big surprise - 10%.

Beta(0, 0) could be interpreted as: either the trials always fail or they always succeed.
This would make more sense e.g. for testing whether a chemical reaction happens or not.
0% everywhere except for the 100% at both extremes.

Beta(0.5, 0.5) could be interpreted as: we don't know that it's possible for trials to both fail and succeed.
But that gets you 6% from .45 to .55 and 20% from .9 to 1.

This prior could make sense in a situation where you didn't know what kind of proportion between 0 and 1 you're dealing with (could be linear, could be logarithmic ...) and try to minimize the effects of the prior.
And there are many other attempts at "objective" or "uninformative" or "diffuse" priors, since the Jeffreys prior is not without problems and can even lead to inconsistent results, but that's a complex topic.
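
The interval probabilities quoted above for the flat and the Beta(0.5, 0.5) prior can be checked with a few lines of Python. The sketch relies on the fact that Beta(0.5, 0.5) is the arcsine distribution, whose CDF is (2/π)·arcsin(√x); the function names are invented.

```python
from math import asin, pi, sqrt

def uniform_mass(lo, hi):
    """Prior probability of [lo, hi] under the flat Beta(1, 1) prior."""
    return hi - lo

def arcsine_cdf(x):
    """CDF of Beta(0.5, 0.5), also known as the arcsine distribution."""
    return 2 / pi * asin(sqrt(x))

def beta_half_mass(lo, hi):
    """Prior probability of [lo, hi] under Beta(0.5, 0.5)."""
    return arcsine_cdf(hi) - arcsine_cdf(lo)

print(round(uniform_mass(0.45, 0.55), 3), round(uniform_mass(0.9, 1.0), 3))      # 0.1 0.1
print(round(beta_half_mass(0.45, 0.55), 3), round(beta_half_mass(0.9, 1.0), 3))  # 0.064 0.205
```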
"I hear it when I see it."

Re: Bayes Factors for ABX tests
Thank you. This is an excellent explanation of how to interpret "uninformative" beta priors. I agree that for ABX a uniform prior makes more sense, because we know it is always possible to respond correctly or incorrectly.

Re: Bayes Factors for ABX tests
There are tons of books on Bayesian statistics out there. One of the most mathematically rigorous ones is Theory of Statistics by Mark Schervish. But it is not easy reading.