In the light of the ASA statement on the use of p-values, I'd like to discuss some alternative statistical approaches to simple ABX tests.

As we all know, the **frequentist approach** to hypothesis testing calculates a p-value: we assume the null hypothesis (H_{0}) to be true and calculate the probability of obtaining a result as extreme as the one observed, or more extreme.

For X ~ B(n, p):

H_{0}: p = 0.5
H_{1}: p > 0.5

p-value = P(X >= x | H_{0})

Example for an ABX test with a 9/10 result: P(X >= 9 | p=0.5) = 0.0107, a p-value below 5% (a commonly chosen significance level) and therefore considered "statistically significant".
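This p-value is just a binomial tail sum; here is a minimal sketch in Python (standard library only, function name is my own):

```python
from math import comb

def abx_p_value(correct: int, trials: int, p: float = 0.5) -> float:
    """One-sided p-value: P(X >= correct) for X ~ B(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

print(round(abx_p_value(9, 10), 4))  # → 0.0107 (exactly 11/1024)
```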

The **Bayesian approach** uses Bayes' theorem to turn this around:

P(H | data) = P(data | H) * P(H) / P(data) = P(data | H) * P(H) / ( P(data | H) * P(H) + P(data | ¬H) * P(¬H) )

It is the basis of Bayesian hypothesis testing, which can be used to compare different models, for example M_{0} vs M_{1}:

M_{0}: P(X = x | p = 0.5)

M_{1}: 2 * ∫_{0.5}^{1} P(X = x | p) dp

Then we pit the models against each other and get a Bayes factor:

BF_{01} = P(data | M_{0}) / P(data | M_{1}), with values > 1 supporting M_{0}
BF_{10} = P(data | M_{1}) / P(data | M_{0}), with values > 1 supporting M_{1}

Now we can answer the question: how well, relative to each other, do the hypotheses explain the data?
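To make this concrete, here is a sketch (Python, standard library only) that evaluates both marginal likelihoods and log10(BF_{10}); the integral for M_{1} is approximated with a simple midpoint rule, which is accurate enough here to reproduce the table values below:

```python
from math import comb, log10

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def log10_bf10(x: int, n: int, steps: int = 100_000) -> float:
    """log10 Bayes factor of M1 (p uniform on (0.5, 1]) vs M0 (p = 0.5)."""
    m0 = binom_pmf(x, n, 0.5)
    # M1: 2 * integral from 0.5 to 1 of P(X = x | p) dp, midpoint rule
    width = 0.5 / steps
    m1 = 2 * sum(binom_pmf(x, n, 0.5 + (i + 0.5) * width) * width
                 for i in range(steps))
    return log10(m1 / m0)

print(round(log10_bf10(9, 10), 3))  # → 1.267, matching the 10-trial table
```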

I use log10(BF) so that negative-evidence results are easier to read. The **categories*** I will use are:

= 0: no support
0 to 0.5: not worth more than a bare mention
0.5 to 1: moderate
1 to 1.5: strong
1.5 to 2: very strong
>= 2: decisive
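These cutoffs can be wrapped in a small helper (a sketch; the function name is my own, and negative values get the "negative" label used in the tables below):

```python
def evidence_category(log10_bf: float) -> str:
    """Map log10(BF10) onto the Jeffreys-style labels used above."""
    if log10_bf < 0:
        return "negative"          # evidence favours M0
    if log10_bf == 0:
        return "no support"
    if log10_bf < 0.5:
        return "not worth more than a bare mention"
    if log10_bf < 1:
        return "moderate"
    if log10_bf < 1.5:
        return "strong"
    if log10_bf < 2:
        return "very strong"
    return "decisive"

print(evidence_category(1.267))  # → strong (the 9/10 result)
```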

Here are results for some common ABX trial counts including interpretation (according to Jeffreys 1961, Appendix B):

**10 trials**

| **Correct** | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- |
| **P(x\|M_{0})** | 2.461E-01 | 2.051E-01 | 1.172E-01 | 4.395E-02 | 9.766E-03 | 9.766E-04 |
| **P(x\|M_{1})** | 9.091E-02 | 1.319E-01 | 1.612E-01 | 1.759E-01 | 1.808E-01 | 1.817E-01 |
| **log10(BF_{10})** | -0.432 | -0.192 | 0.139 | 0.602 | 1.267 | 2.270 |
| **Interpretation** | negative | negative | barely | moderate | strong | decisive |

**12 trials**

| **Correct** | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **P(x\|M_{0})** | 2.256E-01 | 1.934E-01 | 1.208E-01 | 5.371E-02 | 1.611E-02 | 2.930E-03 | 2.441E-04 |
| **P(x\|M_{1})** | 7.692E-02 | 1.091E-01 | 1.333E-01 | 1.467E-01 | 1.521E-01 | 1.536E-01 | 1.538E-01 |
| **log10(BF_{10})** | -0.467 | -0.248 | 0.043 | 0.437 | 0.975 | 1.720 | 2.799 |
| **Interpretation** | negative | negative | barely | barely | moderate | very strong | decisive |

**14 trials**

| **Correct** | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **P(x\|M_{0})** | 2.095E-01 | 1.833E-01 | 1.222E-01 | 6.110E-02 | 2.222E-02 | 5.554E-03 | 8.545E-04 | 6.104E-05 |
| **P(x\|M_{1})** | 6.667E-02 | 9.285E-02 | 1.132E-01 | 1.254E-01 | 1.310E-01 | 1.328E-01 | 1.333E-01 | 1.333E-01 |
| **log10(BF_{10})** | -0.497 | -0.295 | -0.033 | 0.312 | 0.771 | 1.379 | 2.193 | 3.339 |
| **Interpretation** | negative | negative | negative | barely | moderate | strong | decisive | decisive |

**16 trials**

| **Correct** | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **P(x\|M_{0})** | 1.964E-01 | 1.746E-01 | 1.222E-01 | 6.665E-02 | 2.777E-02 | 8.545E-03 | 1.831E-03 | 2.441E-04 | 1.526E-05 |
| **P(x\|M_{1})** | 5.882E-02 | 8.064E-02 | 9.810E-02 | 1.092E-01 | 1.148E-01 | 1.169E-01 | 1.175E-01 | 1.176E-01 | 1.176E-01 |
| **log10(BF_{10})** | -0.524 | -0.335 | -0.095 | 0.214 | 0.616 | 1.136 | 1.807 | 2.683 | 3.887 |
| **Interpretation** | negative | negative | negative | barely | moderate | strong | very strong | decisive | decisive |

**20 trials**

| **Correct** | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **P(x\|M_{0})** | 1.762E-01 | 1.602E-01 | 1.201E-01 | 7.393E-02 | 3.696E-02 | 1.479E-02 | 4.621E-03 | 1.087E-03 | 1.812E-04 | 1.907E-05 | 9.537E-07 |
| **P(x\|M_{1})** | 4.762E-02 | 6.364E-02 | 7.699E-02 | 8.623E-02 | 9.151E-02 | 9.397E-02 | 9.490E-02 | 9.517E-02 | 9.523E-02 | 9.524E-02 | 9.524E-02 |
| **log10(BF_{10})** | -0.568 | -0.401 | -0.193 | 0.067 | 0.394 | 0.803 | 1.313 | 1.942 | 2.721 | 3.698 | 4.999 |
| **Interpretation** | negative | negative | negative | barely | barely | moderate | strong | very strong | decisive | decisive | decisive |

*) The above categories may seem somewhat arbitrary, similar to significance levels. They are not strictly needed, however, since we can look at the odds directly:

**Posterior Odds = Bayes Factor * Prior Odds**

Example:

We have two files for which prior data tells us that about one in ten people can distinguish them.

Prior Odds = 0.1 / (1 - 0.1) = 0.111...

A person scores 9/10 in an ABX test, which gives us a Bayes Factor of 10^1.267 = 18.5.

Posterior Odds = 18.5 * 0.111... = 2.056

So the odds for this person doing better than chance (M_{1} over M_{0}) are about 2:1.

Let's say the person scores another 9/10, so 18/20 in total, for a Bayes factor of 525.5, resulting in odds of about 58:1.
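The arithmetic of this example as a quick sketch (the one-in-ten prior and the Bayes factors are taken from the text above):

```python
prior_p = 0.1                      # prior: ~1 in 10 can distinguish the files
prior_odds = prior_p / (1 - prior_p)

bf_9_of_10 = 10 ** 1.267           # Bayes factor for a 9/10 score
print(bf_9_of_10 * prior_odds)     # about 2:1

bf_18_of_20 = 10 ** 2.721          # after a second 9/10, 18/20 in total
print(bf_18_of_20 * prior_odds)    # about 58:1
```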

Please consider that a high BF does not guarantee that a difference was heard. Again, we all know that various problems can creep into such a test and render the results meaningless.

For example, Evett (1991) has argued for a BF of at least 1000 against innocence in a criminal trial for forensic evidence alone. Also, even a BF of 1000 can still be too low to provide enough evidence for an extraordinary claim.

My 2¢ on this: we want to see strong evidence or better for simple ABX tests. (Whether to take results seriously depends on much more than just this single number, however.) Especially with higher trial counts this turns out to be *more* demanding than a 5% significance level.

edit1: graphs added

edit2: tables updated, added odds and example

edit3: fixed Bayes factor definitions