## Questions about pd

##### Reply #5

1) On using ABX for testing non-inferiority of a codec vs. the original: this **cannot** be an experiment in which the subject and the experimenter are the same person. If the subject knows what's being tested, the experiment is already compromised with respect to statistical significance. "What's being tested" means here simply that the subject knows what A and B may be, regardless of whether (s)he knows X.

This may seem contradictory with a normal ABX test, but it is not. Here is why: when you do a non-inferiority test, due to the different null hypothesis, the "direction" of the test is reversed. So type I errors in a normal/difference ABX test become type II, and vice versa, and alpha and beta (as used in the ABX program) also change roles, i.e. significance is beta and power is 1-alpha.

For a normal/difference ABX test, knowing A & B only decreases power but doesn't affect significance: the subject may not "try hard enough" to find the difference(s), which contributes to type II errors. For a non-inferiority ABX test, knowledge of A and B affects significance directly, because the resulting bias produces type I errors for this test.

To do something like this in practice would require that the subjects have no idea whatsoever of the "devices" being tested.

1.1) You have correctly observed that one cannot test for 100% transparency. If p1 = p2 and the margin delta = 0, the required number of samples (n) becomes infinite in the Farrington-Manning sample size formula. This is "proof", if you like, that "showing there's absolutely no difference" is impossible in statistical terms.
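
The blow-up is easy to see numerically. A rough per-group sample size for a non-inferiority test of two proportions (a standard normal-approximation formula; the exact F-M sizing differs in detail) is n = (z_alpha + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2 - delta)^2, which diverges as delta -> 0 when p1 = p2. A sketch, stdlib only:

```python
from statistics import NormalDist

def noninferiority_n(p1, p2, delta, alpha=0.05, beta=0.20):
    """Approximate per-group sample size for showing p2 - p1 < delta.

    Normal-approximation formula; diverges as delta -> 0 when p1 == p2,
    i.e. you can never collect enough trials to show delta = 0.
    """
    z = NormalDist()
    za, zb = z.inv_cdf(1 - alpha), z.inv_cdf(1 - beta)
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return (za + zb) ** 2 * var / (p1 - p2 - delta) ** 2

# With p1 == p2 ("both transparent"), n grows without bound as delta shrinks:
for delta in (0.2, 0.1, 0.05, 0.01):
    print(delta, round(noninferiority_n(0.6, 0.6, delta)))
```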

2) On codec equivalence/non-inferiority testing. First, it should be obvious from (1) that you cannot ABX test the difference between transparent codecs when subject and experimenter are the same. Second, since you cannot show that something is totally transparent (1.1), you cannot show that two totally transparent things are equivalent.

You can use ABX trials to test the equivalence/non-inferiority of two codecs that aren't exactly transparent, which is the case for the LAME examples I gave. Let C1 be lame3903 --aps. It is only "99% transparent", i.e. there is a set of samples, call it D1, on which some people can tell the difference from the original, so "transparent" is an abuse of language here.

Consider a contestant, codec C2 (say lame3961 --aps). Again, some people can tell the difference from the original on some samples from another set D2, which neither includes D1 nor is included in it (but may intersect it). In simple terms: "on some samples it does better, on some it does worse" than codec C1. This is a scenario in which one can do a non-inferiority subject=experimenter ABX-based test.

In practice you'd need a blocked design like ABC/HR, with multiple samples and multiple subjects, to make a "Hydrogenaudio recommended" change. For the sake of simplicity, let's consider a test with only one sample and one subject. For each ABX trial, the subject must not know whether he's doing C1 vs. the original or C2 vs. the original, so you can't just use a normal ABX procedure. You need a program that on each trial randomly asks you to do a C1 vs. O (original) or a C2 vs. O comparison.
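
A minimal sketch of such a trial-runner's bookkeeping (the per-trial guess rates are hypothetical stand-ins just to make the simulation run; a real program would present audio and record the subject's actual answers):

```python
import random

def run_randomized_abx(n_trials, p_guess_c1, p_guess_c2, seed=0):
    """Each trial randomly picks a C1-vs-O or C2-vs-O comparison, so the
    subject never knows which codec a given trial tests.
    Returns (S1, N1, S2, N2) success/trial counts per codec.
    p_guess_c1 / p_guess_c2 are hypothetical per-trial success rates."""
    rng = random.Random(seed)
    s = {1: 0, 2: 0}
    n = {1: 0, 2: 0}
    for _ in range(n_trials):
        codec = rng.choice((1, 2))       # which codec this trial secretly uses
        p = p_guess_c1 if codec == 1 else p_guess_c2
        correct = rng.random() < p       # stand-in for the subject's answer
        n[codec] += 1
        s[codec] += correct
    return s[1], n[1], s[2], n[2]

s1, n1, s2, n2 = run_randomized_abx(500, 0.58, 0.62)
print(f"P1 = {s1}/{n1} = {s1/n1:.4f}, P2 = {s2}/{n2} = {s2/n2:.4f}")
```

Note that N1 and N2 come out unequal in general, exactly as described below for the randomized design.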

In the end you'll have two ratios of successful "guesses" for each codec, P1 = S1/N1 and P2 = S2/N2, where S is the number of successes and N the number of trials. In general N1 != N2 due to randomization. If C1 is the reference and you want to show that C2 is not inferior, you must test whether P2 - P1 < delta. The difference can be negative if P2 has a lower "guess rate" than P1, which intuitively means that C2 may be better. Equivalence means non-inferiority both ways, i.e. |P1 - P2| < delta. In practice, if N1 ~ N2, whichever of P1 - P2 and P2 - P1 is positive will be the harder one to prove less than delta.

Given P1 & P2, you can first test H1: P2 < P1, which is basically "can you (implicitly) tell the difference between C1 and C2?". There are many ways to test this; one is Fisher's exact test. If this test succeeds, then you can stop, because you've shown that there is a difference, and that C2 is actually better. You can also test H1: P1 < P2, which would mean C2 is worse; again, stop.
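
With the counts arranged as a 2x2 table of successes and failures, SciPy's `fisher_exact` runs both one-sided tests. A sketch using the counts from example (i) below (the post reports one-sided p-values of 0.7960 and 0.2577 for these data):

```python
from scipy.stats import fisher_exact

# 2x2 table: rows are codecs, columns are (successes, failures).
# Counts from example (i): P1 = 152/259, P2 = 170/275.
table = [[152, 259 - 152],   # C1: successes, failures
         [170, 275 - 170]]   # C2: successes, failures

# One-sided tests on the odds ratio of row 1 (C1) vs row 2 (C2):
_, p_c1_lower = fisher_exact(table, alternative="less")     # H1: P1 < P2
_, p_c1_higher = fisher_exact(table, alternative="greater")  # H1: P1 > P2
print(p_c1_lower, p_c1_higher)  # neither direction significant here
```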

If the above tests fail, you are to choose a delta and test whether P2 < P1 + delta. There are several tests for this too; one is (you guessed it) Farrington-Manning. Because we couldn't prove that P2 < P1, we won't be able to make delta = 0 here. The significance level and delta are obviously related: the more you tighten delta, the worse the significance level becomes.
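
For reference, here is a sketch of the F-M score statistic for H0: pt - pr = delta vs. H1: pt - pr < delta (treatment pt = P2, reference pr = P1), using the closed-form cubic solution for the restricted MLEs from Farrington & Manning (1990). Variable names are mine; the p-value comes from the standard normal CDF, since the score is asymptotically standard normal:

```python
import math
from statistics import NormalDist

def fm_score(xt, nt, xr, nr, delta):
    """Farrington-Manning score statistic for H0: pt - pr = delta
    vs H1: pt - pr < delta. Negative scores favor non-inferiority;
    one-sided p-value = Phi(z). Restricted MLEs (p-tilde) come from
    F&M's closed-form solution of the constrained-likelihood cubic."""
    pt_hat, pr_hat = xt / nt, xr / nr
    theta = nr / nt
    a = 1 + theta
    b = -(1 + theta + pt_hat + theta * pr_hat + delta * (theta + 2))
    c = delta ** 2 + delta * (2 * pt_hat + theta + 1) + pt_hat + theta * pr_hat
    d = -pt_hat * delta * (1 + delta)
    v = b ** 3 / (27 * a ** 3) - b * c / (6 * a ** 2) + d / (2 * a)
    u = math.copysign(math.sqrt(b ** 2 / (9 * a ** 2) - c / (3 * a)), v)
    w = (math.pi + math.acos(max(-1.0, min(1.0, v / u ** 3)))) / 3
    pt_tilde = 2 * u * math.cos(w) - b / (3 * a)  # restricted MLE, treatment
    pr_tilde = pt_tilde - delta                   # restricted MLE, reference
    se = math.sqrt(pt_tilde * (1 - pt_tilde) / nt
                   + pr_tilde * (1 - pr_tilde) / nr)
    return (pt_hat - pr_hat - delta) / se

# Example (i) below: reference C1 = 152/259, contestant C2 = 170/275, delta = 0.15
z = fm_score(170, 275, 152, 259, 0.15)
p = NormalDist().cdf(z)
print(round(z, 3), round(p, 4))  # the post reports z = -2.825, level 0.0024
```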

Examples:

i) P1 = 152/259 ~ 0.5869, P2 = 170/275 ~ 0.6182. Intuitively, the new codec C2 is worse than C1. Let's see if this is significant. Fisher's exact test gives 0.7960 in one direction and 0.2577 in the other, so we can claim neither P1 < P2 nor P2 < P1 with any reasonable significance. What about non-inferiority H1: P2 - P1 < delta? For delta = 0.2, H1: P2 - P1 < 0.2 gets an F-M score of -4.046, which is uber-significant (the F-M score is asymptotically standard normal, so levels come from the normal CDF). What about delta = 0.15? An F-M score of -2.825, which gives a level of 0.0024. Can we make delta = 0.10? Not really: the F-M score is -1.627, so p-value = 0.0519.

ii) P1 = 170/275 ~ 0.6182, P2 = 152/259 ~ 0.5869. This is the opposite of the case above. Intuitively, it should do better, because the observed value for P2 is lower than P1. As you'd expect, Fisher's exact test gives the same values as before (but in the reverse directions), so again we can claim neither P1 < P2 nor P2 < P1 with any reasonable significance. What has changed about non-inferiority H1: P2 - P1 < delta? For delta = 0.2, H1: P2 - P1 < 0.2 gets an F-M score of -5.568, which is better than we got before (-4.046). Can we make delta = 0.10 this time? Yes: the F-M score is -3.116, for a level of 0.0009. We can make delta = 0.05 and get an F-M score of -1.922 and a level of 0.0273. We can't make it 0.01, however: the F-M score is -0.975, for a level of 0.1647.

So what can we say about equivalence? You have to combine results from (i) and (ii), taking the least favorable significance number. So we can claim equivalence for delta = 0.15 at significance level of 0.0024.
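
Combining the two directions is just a two-one-sided-tests (TOST) scheme. A simplified sketch using an unpooled Wald z-statistic in place of the F-M restricted-MLE statistic (so the numbers come out close to, but not identical with, the F-M levels quoted above):

```python
import math
from statistics import NormalDist

def equivalence_p(x1, n1, x2, n2, delta):
    """TOST for equivalence |p1 - p2| < delta: run both one-sided
    non-inferiority tests and return the least favorable (largest)
    one-sided p-value. Wald z used here as a stand-in for F-M."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_lo = (p2 - p1 - delta) / se   # H1: p2 - p1 < delta, as in (i)
    z_hi = (p1 - p2 - delta) / se   # H1: p1 - p2 < delta, as in (ii)
    phi = NormalDist().cdf
    return max(phi(z_lo), phi(z_hi))

# Examples (i) + (ii) combined: C1 = 152/259, C2 = 170/275, delta = 0.15
p_eq = equivalence_p(152, 259, 170, 275, 0.15)
print(round(p_eq, 4))  # close to the 0.0024 quoted above for F-M
```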

iii) What if we had more data points with roughly the same P ratios as in (ii)? P1 = 1993/3215 ~ 0.6199, P2 = 1789/3033 ~ 0.5898. Fisher's exact test for P1 > P2 gives 0.0081, so we can conclude that C2 is simply better than C1. The F-M test "agrees" with this result for delta = 0: we get a score of -2.430 with a level of 0.0076 (a bit off, because F-M is an approximation). The moral of the story is that we needed roughly a 10x increase in sample size to show this for delta = 0. And it worked because P1 != P2, i.e. there was a difference to show.

The best way to check these examples is to download the 30-day trial version of NCSS. Go to Analysis, then choose Proportions – Two Independent. You need to input the data as a 2x2 table with the success and failure counts, i.e. S1 and N1-S1, etc. Note that P1 and P2 are reversed with respect to our discussion above, because P2 is the control group (reference) and P1 is the "treatment", so the difference tested is P1-P2.