Topic: Questions about pd
##### 2004-11-04 05:52:49
I'm trying to wrap my head around the value of pd, aka p2, aka the probability distinguisher, referenced in the literature (and ff123's posts) as being the probability that a listener is actually detecting a difference in an ABX test.

My two questions:

FIRST: What I find most troubling about pd is that it is explicitly a context-dependent quantity - at the very least, potentially varying considerably between individuals - but more importantly, varying drastically between samples. For transparent encoders, most samples are going to be almost completely indistinguishable, implying a pd between 0 and 0.1, and a few problem samples are very distinguishable (close to 1). In general, every time the context of the listening environment changes in the course of the test, theta is rendered invalid.

How does the notion of a proportion of distinguishers handle that? One potential solution is to do what John Corbett outlined in the ff123-vs-Arny usenet thread, and treat problem samples as members of a "population" with varying pd values for each. The idea that half a population can hear something and the other half can't, implying split values of pd and therefore a more complicated hypothesis-testing scheme, seems very analogous to a certain proportion of test samples being problem samples with high pd values.

For instance: say that for most samples, the transparency holds and you'll only notice a difference maybe 10% of the time, but for problem samples, which might be 1% of the theoretical "set" of samples, the difference is glaringly obvious and pd is more like 0.9 or so. That results in a pmax of... drumroll, please... 0.55. That's really, really low, but it does make a good bit of sense. The question is, is it correct?
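A quick way to check the 0.55 figure, as a sketch only: it assumes the conversion theta = (pd + 1) / 2 (a distinguisher is always right, everyone else guesses at 50%), which is the spreadsheet formula mentioned later in the thread. The function names are made up for illustration.

```python
def mixture_pd(classes):
    """Average pd over sample classes given as (weight, pd) pairs."""
    total_weight = sum(w for w, _ in classes)
    return sum(w * pd for w, pd in classes) / total_weight


def pd_to_theta(pd):
    """Expected fraction of correct ABX answers if non-detections are coin flips."""
    return (pd + 1.0) / 2.0


# 99% ordinary samples with pd = 0.1, 1% problem samples with pd = 0.9
classes = [(0.99, 0.1), (0.01, 0.9)]
pd = mixture_pd(classes)
theta = pd_to_theta(pd)
print(f"pd = {pd:.3f}, theta = {theta:.3f}")  # pd = 0.108, theta = 0.554
```

So under these assumptions the blended pd is about 0.108 and the expected proportion of correct answers is about 0.55, dominated almost entirely by the ordinary samples.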

SECOND: How exactly are you supposed to come up with pd-values (or theta-values) anyway? Is there a rhyme or reason to it, or do I just guess? ##### Reply #1 – 2004-11-04 20:46:14
In more typical usage, the pd is a requirement set by the test administrator.  For example, Coke switches to high-fructose corn syrup and wants to know if it still tastes the same.  So they choose an acceptable proportion of distinguishers, say 15% of the population, along with the type I and II error risks.  They plug it into their spreadsheet, and N, the minimum number of testers needed to fulfill the test requirements, pops out.

I agree that things are not really 100% satisfactory, though.  For testing codecs, some samples are more difficult than others, complicating the picture.  The proportion of distinguishers changes with sample difficulty.

Maybe a better way to approach things statistically is using latent trait (Rasch) analysis, which at least one group of people have done:

http://www.moultonlabs.com/slides/subject/

The idea is to create two scales:  a listener severity scale and a program intolerance scale.  Then if you are able to map these scales into the general population, you can say much more useful things because you don't have to assume that the results apply only to a certain test group or for a particular difficulty of samples.

However, this type of testing may be beyond the means of Internet testing.  You need a lot of people of different ability and a lot of samples of varying difficulty.  And each listener has to test every sample.

ff123 ##### Reply #2 – 2004-11-04 22:19:04
Quote
In more typical usage, the pd is a requirement set by the test administrator.  For example, Coke switches to high-fructose corn syrup and wants to know if it still tastes the same.  So they choose an acceptable proportion of distinguishers, say 15% of the population, along with the type I and II error risks.  They plug it into their spreadsheet, and N, the minimum number of testers needed to fulfill the test requirements, pops out.

I agree that things are not really 100% satisfactory, though.  For testing codecs, some samples are more difficult than others, complicating the picture.  The proportion of distinguishers changes with sample difficulty.

Not only that, but I suspect the individual theta changes with sample difficulty. In your spreadsheet you assume individual theta=1, and compute population theta as (pd+1)/2.

From this thread for instance, you can detect a -.5dB drop with 14/16 frequency, but a -.3dB drop only with 27/40. If we extrapolate, your theta would be .875 vs .675.

Anyway, fixing individual theta to 1 (rather unrealistic), for a pd of .05, you need to test >4300 individuals, and for a pd of .01 you need over 100,000 individuals! And how many "elite audiophiles" are there? I suspect that 1% of the population is actually an overestimate. So, even if you prove "no difference" with pd = .01 over the entire population, you won't convince the "elite audiophile" that (s)he cannot possibly hear a difference. Maybe if you test 100,000 "elite audiophiles", then (s)he may be convinced. But how do you determine if someone is an "elite audiophile" in the first place? And then someone may believe (s)he belongs to the "elite of the elite".
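For anyone who wants to reproduce the order of magnitude of those numbers, here is a rough sketch. It assumes one trial per tester, a one-sided test of theta = 0.5 against theta = (pd + 1) / 2, the usual normal approximation to the binomial, and type I and type II risks of 0.05 each. The risks are my assumption, since the post does not state them, and `abx_sample_size` is just an illustrative name.

```python
from math import ceil, sqrt
from statistics import NormalDist


def abx_sample_size(pd, alpha=0.05, beta=0.05):
    """Approximate number of trials needed to tell theta = (pd + 1) / 2 from 0.5.

    One-sided test of H0: theta = 0.5, normal approximation to the binomial.
    alpha = type I risk, beta = type II risk (both assumed, not from the post).
    """
    p0 = 0.5
    p1 = (pd + 1.0) / 2.0
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    z_beta = NormalDist().inv_cdf(1.0 - beta)
    numerator = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((numerator / (p1 - p0)) ** 2)


for pd in (0.15, 0.05, 0.01):
    print(pd, abx_sample_size(pd))
```

With these settings pd = .05 comes out a little above 4,300 and pd = .01 a little above 100,000, in line with the figures quoted above. Different risk levels or an exact binomial calculation will shift the numbers somewhat.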

Quote
Maybe a better way to approach things statistically is using latent trait (Rasch) analysis, which at least one group of people have done:

http://www.moultonlabs.com/slides/subject/

The idea is to create two scales:  a listener severity scale and a program intolerance scale.  Then if you are able to map these scales into the general population, you can say much more useful things because you don't have to assume that the results apply only to a certain test group or for a particular difficulty of samples.

However, this type of testing may be beyond the means of Internet testing.  You need a lot of people of different ability and a lot of samples of varying difficulty.  And each listener has to test every sample.

ff123

The main pitfall of the Rasch model for audio is that it requires unidimensionality. This is usually ok for not-so-subtle differences in DUTs, when ordering on a dominant dimension implies almost the same ordering on all other dimensions, but when it comes to subtle differences between DUTs, this is probably not true. One encoded sample has pre-echo and another has missing high freqs. Which one is better? One amp has "better" bass, but "worse" highs, etc.

(For those who don't want to RTFA: DUT = device under test.)

The dimension tested was transparency, which is ok if you assume the subjects can mentally compute a function like transparency = 6 - max_{all-defects-types}(defect-score), where defect-score is on a scale from 1 to 5.

It is interesting to note that experienced listeners found a difference between two DUTs, while the same two DUTs sounded "the same" to inexperienced ones.
The earth is round (P < 0.05).  -- Cohen J., 1994 ##### Reply #3 – 2004-11-05 14:18:11
Axon, if you are comparing only one codec for transparency, you don't have much of a choice, just pick a pd.

You can however do an equivalence test for transparency for a pair of codecs, using for instance the ratio of success rates. To calculate the sample size of such a test, use the Farrington-Manning formula. Here is an on-line calculator.

Ratio is by no means the "established" function. You can use difference, odds ratio, etc. These have different formulas for estimating size. There is a more general result that lets you pick from a larger family of functions (instead of using the ratio), see this paper.

Btw, isn't equivalence testing a good idea for a listening test for switching from lame 3.90.3 --aps to something newer? Since it doesn't seem likely (based on history) that a new version would better 3.90.3 on all samples at the same file size, wouldn't it suffice that the newer is not worse "on average", given that it has other advantages (speed)? It seems a textbook case for equivalence testing (2-sided, since it can actually be better).
The earth is round (P < 0.05).  -- Cohen J., 1994 ##### Reply #4 – 2004-11-08 07:44:18
Quote
Axon, if you are comparing only one codec for transparency, you don't have much of a choice, just pick a pd.

You can however do an equivalence test for transparency for a pair of codecs, using for instance the ratio of success rates. To calculate the sample size of such a test, use the Farrington-Manning formula. Here is an on-line calculator.

I'm not quite sure how to apply this calculator, because typing in all the required values sort of seems like overspecifying the problem - I mean, isn't p2 supposed to be an unknown?

Setting p1=0.5, alpha=0.05, beta=0.05, n2:n1=1... I assume that s0 means the interval beyond which the null hypothesis fails, and p2 is the assumed ratio if the null hypothesis fails, so I set s0=0.05 and p2=0.6. I get n1=n2=237.

Quote
Ratio is by no means the "established" function. You can use difference, odds ratio, etc. These have different formulas for estimating size. There is a more general result that lets you pick from a larger family of functions (instead of using the ratio), see this paper.

An interesting link indeed, but it seems like this would only be useful for noninferiority tests of nontransparent encoders - ie, for transparency testing, p1 is always going to equal 0.5, so what difference does it make if we take a ratio or an offset to get, say, 0.6?
Quote
Btw, isn't equivalence testing a good idea for a listening test for switching from lame 3.90.3 --aps to something newer? Since it doesn't seem likely (based on history) that a new version would better 3.90.3 on all samples at the same file size, wouldn't it suffice that the newer is not worse "on average", given that it has other advantages (speed)? It seems a textbook case for equivalence testing (2-sided, since it can actually be better).

I'm still a little unclear on all the terminology, so I'm not really sure if any possible test to determine if an encoder performs as well as 3.90.3 APS already is an equivalence test. I mean, nearly the only thing which can't be used as an equivalence test is a straight-up ABX test, right? ##### Reply #5 – 2004-11-09 21:45:44
1) On using ABX for testing non-inferiority of a codec vs. original: this cannot be an experiment in which the subject and experimenter are the same person. If the subject knows what's being tested, the experiment is already compromised wrt statistical significance. "What's being tested" means here just that the subject knows what A and B may be, regardless of whether (s)he knows or doesn't know X.

This may seem contradictory with a normal ABX test, but it is not. Here is why: when you do a non-inferiority test, due to the different null hypothesis, the "direction" of the test is reversed. So type I errors in a normal/difference ABX test become type II, and vice versa, and alpha and beta (as used in the ABX program) also change roles, i.e. significance is beta and power is 1-alpha.

For a normal/difference ABX test, knowing A & B only decreases power but doesn't affect significance. The subject may not "try hard enough" to find the difference(s) which contributes to type II errors. For a non-inferiority ABX test, knowledge of A and B affects significance directly, because it is a type I error for this test.

To do something like this in practice would require that the subjects have no idea whatsoever of the "devices" being tested.

1.1) You have observed correctly that one cannot test for 100% transparency. If p1=p2 and s0=0, the required number of samples (n) becomes infinite in the F-M size formula. This is "proof" if you like that "showing there's absolutely no difference" is impossible in statistical terms.

2) On codec equivalence/non-inferiority testing. First, it should be obvious from (1) that you cannot ABX test the difference between transparent codecs when subject and experimenter are the same. Second, since you cannot show that something is totally transparent (1.1), you cannot show that two totally transparent things are equivalent.

You can use ABX trials to test the equivalence/non-inferiority of two codecs that aren't exactly transparent, which is the case for the lame examples I gave. Let C1 be lame3903 --aps. It is only "99% transparent", i.e. there are samples, from a set S1, on which some people can tell the difference from the original, so "transparent" is an abuse of language here.

Consider a contestant, codec C2 (say lame3961 --aps). Again some people can tell the difference from the original on some samples from another set S2, which does not include S1 and is not included in it (but may intersect it). In simple terms: "on some samples it does better, on some it does worse" than codec C1. This is a scenario in which one can do a non-inferiority subject=experimenter ABX-based test.

In practice you'd need a blocked design like ABC/HR with multiple samples and multiple subjects to make a "HA recommended" change. For the sake of simplicity let's consider a test where only one sample and one subject are tested. For each ABX trial, the subject must not know if he's doing C1 vs the original, or C2 vs the original, so you can’t just use a normal ABX procedure. You need a program that on each trial randomly asks you to do a C1 vs O (original) or C2 vs O.

In the end you’ll have two ratios, P1=S1/N1 and P2=S2/N2, of successful “guesses” for each codec (here S1 and S2 are success counts, not the sample sets mentioned earlier). In general N1 != N2 due to randomization. If C1 is the reference, and you want to show that C2 is not inferior, you must test whether P2 – P1 < delta. The difference can be negative if C2 has a lower “guess rate” than C1, which intuitively means that C2 may be better. Equivalence means non-inferiority both ways, i.e. |P1-P2| < delta. In practice if N1 ~ N2, whichever of P1-P2 and P2-P1 is positive will be the harder one to prove less than delta.
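The randomized procedure described above could be sketched roughly like this. Everything here is hypothetical scaffolding: `ask_listener` stands in for whatever actually plays the trial and collects the answer, and the codec labels are just dictionary keys.

```python
import random


def run_interleaved_abx(n_trials, ask_listener, rng=None):
    """Run ABX trials where each trial is randomly C1-vs-O or C2-vs-O.

    ask_listener(codec, x_is_original) is a hypothetical callback: it plays
    the trial without revealing `codec` and returns the listener's guess
    (True if they think X is the original).
    Returns {codec: (successes, trials)}.
    """
    rng = rng or random.Random()
    tally = {"C1": [0, 0], "C2": [0, 0]}
    for _ in range(n_trials):
        codec = rng.choice(["C1", "C2"])      # hidden from the listener
        x_is_original = rng.random() < 0.5    # hidden X assignment
        guess = ask_listener(codec, x_is_original)
        tally[codec][1] += 1
        if guess == x_is_original:
            tally[codec][0] += 1
    return {codec: tuple(counts) for codec, counts in tally.items()}


# Simulated listener who cannot hear anything and guesses at random
guesser = random.Random(0)
results = run_interleaved_abx(500, lambda codec, x: guesser.random() < 0.5)
print(results)
```

The two (successes, trials) pairs are exactly the S1/N1 and S2/N2 needed for the tests discussed in this post, and N1 and N2 will generally differ because the codec is drawn at random on every trial.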

Given P1 & P2, you can first test H1: P2 < P1, which is basically “can you (implicitly) tell the difference between C1 and C2”. There are many ways to test this, one is Fisher’s exact test. If this test succeeds, then you can stop, because you’ve shown that there is a difference, and that C2 is actually better. You can also test H1: P1 < P2, which means, C2 is worse, stop.
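A stdlib-only sketch of the one-sided Fisher exact test, in case you don't have a stats package handy. The counts are those of example (i) further down; `fisher_one_sided` is an illustrative name, not a library function.

```python
from math import comb


def fisher_one_sided(s1, n1, s2, n2):
    """One-sided Fisher exact p-values for two success counts.

    Returns (p_low, p_high): the probability, with all table margins fixed,
    that group 1 has at most / at least s1 successes.
    """
    total_success = s1 + s2
    denom = comb(n1 + n2, total_success)
    k_min = max(0, total_success - n2)
    k_max = min(n1, total_success)

    def pmf(k):
        # hypergeometric probability of k successes landing in group 1
        return comb(n1, k) * comb(n2, total_success - k) / denom

    p_low = sum(pmf(k) for k in range(k_min, s1 + 1))
    p_high = sum(pmf(k) for k in range(s1, k_max + 1))
    return p_low, p_high


p_low, p_high = fisher_one_sided(152, 259, 170, 275)
print(f"evidence for P1 < P2: {p_low:.4f}   evidence for P1 > P2: {p_high:.4f}")
```

A small `p_low` supports P1 < P2 and a small `p_high` supports P1 > P2. For these counts neither is small, and the two values should land near the 0.2577 and 0.7960 reported in example (i).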

If the above tests fail, then you are to choose a delta, and test if P2 < P1 + delta. There are several tests for this too. One is (you guessed) Farrington Manning. Because we couldn’t prove that P2 < P1, we won’t be able to make delta = 0 in here. The significance level and delta are obviously related. The more you tighten delta, the worse the significance level will become.

Examples:

i) P1 = 152/259 ~ 0.5869, P2 = 170/275 ~ 0.6182. Intuitively, the new codec C2 is worse than C1. Let’s see if this is significant. Fisher's exact test gives 0.7960 in one direction, and 0.2577 in the other. So we can claim neither P1 < P2 nor P2 < P1 with any reasonable significance. What about non-inferiority H1: P2 - P1 < delta? For delta = 0.2, H1: P2 - P1 < 0.2 gets a F-M score of -4.046, which is uber-significant (the F-M score is asymptotically standard normal, so the level is read off the cumulative normal distribution). What about delta=0.15? F-M score of -2.825 which gives a level of 0.0024. Can we make delta=0.10? Not really, F-M score is -1.627 so p-value=0.0519.

ii) P1 = 170/275 ~ 0.6182. P2 = 152/259 ~ 0.5869. This is the opposite case of the above. Intuitively, C2 should look better here, because the observed value for P2 is lower than P1. As you’d expect, Fisher's exact test gives the same values as before (but in reverse direction), so again we can claim neither P1 < P2 nor P2 < P1 with any reasonable significance. What has changed about non-inferiority H1: P2 – P1 < delta? For delta = 0.2, H1: P2 - P1 < 0.2 gets a F-M score of -5.568, which is better than we got before (-4.046). Can we make delta=0.10 this time? Yes, F-M score is -3.116 for a level of 0.0009. We can make delta=0.05 and get F-M score -1.922 and level 0.0273. We can’t make it 0.01 however, F-M score is -0.975 for a level of 0.1647.

So what can we say about equivalence? You have to combine results from (i) and (ii), taking the least favorable significance number. So we can claim equivalence for delta = 0.15 at significance level of 0.0024.
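If you'd rather not install NCSS, the score test can be sketched in plain Python. This is my own reading of the Farrington-Manning statistic for a difference of proportions: the variance is evaluated at the maximum-likelihood estimates restricted to P2 - P1 = delta, which I find here by a simple ternary search instead of the closed-form solution. Function names and the argument order (reference first) are my own choices.

```python
from math import log, sqrt
from statistics import NormalDist


def fm_noninferiority(s_ref, n_ref, s_new, n_new, delta):
    """Score test of H0: p_new - p_ref >= delta vs H1: p_new - p_ref < delta.

    The variance uses the maximum-likelihood estimates restricted to
    p_new - p_ref = delta, found by ternary search on the log-likelihood.
    Returns (z, one-sided p-value).
    """
    def loglik(p_ref):
        p_new = p_ref + delta
        return (s_ref * log(p_ref) + (n_ref - s_ref) * log(1 - p_ref)
                + s_new * log(p_new) + (n_new - s_new) * log(1 - p_new))

    eps = 1e-9
    lo = max(0.0, -delta) + eps
    hi = min(1.0, 1.0 - delta) - eps
    for _ in range(200):                 # loglik is concave in p_ref
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    p_ref = (lo + hi) / 2
    p_new = p_ref + delta
    se = sqrt(p_ref * (1 - p_ref) / n_ref + p_new * (1 - p_new) / n_new)
    z = (s_new / n_new - s_ref / n_ref - delta) / se
    return z, NormalDist().cdf(z)


def equivalence_p(s1, n1, s2, n2, delta):
    """Two one-sided tests; report the less favorable p-value."""
    _, p_a = fm_noninferiority(s1, n1, s2, n2, delta)
    _, p_b = fm_noninferiority(s2, n2, s1, n1, delta)
    return max(p_a, p_b)


for delta in (0.2, 0.15, 0.10):
    z, p = fm_noninferiority(152, 259, 170, 275, delta)   # example (i)
    print(f"delta={delta:.2f}  z={z:.3f}  p={p:.4f}")
print("equivalence, delta=0.15:", round(equivalence_p(152, 259, 170, 275, 0.15), 4))
```

For example (i) the scores should come out close to the -4.046, -2.825 and -1.627 quoted above, and swapping the two groups gives example (ii). The equivalence helper simply takes the worse of the two directions, as described in this post.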

iii) What if we had more data points for roughly the same P ratios as in (ii)? P1 = 1993/3215 ~ 0.6199. P2 = 1789/3033 ~ 0.5898. Fisher’s exact test for P1 > P2 is 0.0081, so we can admit that C2 is simply better than C1. F-M test "agrees" with this result, for delta=0: we get a score of -2.430 with a level of 0.0076 (which is a bit off because F-M is an approximation). The moral of the story is that we needed roughly a 10x increase in sample size to show this for delta = 0. And it worked because P1 != P2 i.e. there was a difference to show.

The best way to check these examples is to download the 30-day trial version of NCSS. Go to Analysis then choose Proportions – Two Independent. You need to input the data as a 2x2 table with the success and failure counts, i.e. S1 N1-S1 etc. Note that P1 and P2 are reversed wrt our discussion above, because in NCSS P2 is the control group (reference) and P1 is the “treatment”, so the difference tested is P1-P2.
The earth is round (P < 0.05).  -- Cohen J., 1994 ##### Reply #6 – 2004-11-10 01:47:02
Quote
SECOND: How exactly are you supposed to come up with pd-values (or theta-values) anyway? Is there a rhyme or reason to it, or do I just guess?

Well, this is a major problem in pharmacology. Unlike clinical trials, where you can use some value that's considered too low of an improvement to bother with in the field, in pharmacology you don't know that (yet), because there are no clinical studies done. The "standard" solution (which seems very applicable to audio) is to use the previously known confidence interval for the reference.

In the 2xlame example I gave above, you can use a previous ABX ratio confidence interval. Mind you, even if you fail to ABX at 55/100, the 95% confidence interval is [0.447280; 0.649680]. The more trials, the tighter the interval is likely to get.
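The interval above looks like the exact (Clopper-Pearson) one. The post doesn't name the method, so that is an assumption on my part. Under that assumption it can be reproduced with nothing but the standard library by bisecting on the binomial CDF:

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def clopper_pearson(k, n, conf=0.95):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    alpha = 1.0 - conf

    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) == target
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # upper bound: P(X <= k | p) = alpha/2
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


lower, upper = clopper_pearson(55, 100)
print(f"95% CI for 55/100: [{lower:.4f}; {upper:.4f}]")
```

Rerunning it with more trials at the same ratio (say 550/1000) shows how the interval tightens, which is the point being made about using a previous ABX result as the reference.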
The earth is round (P < 0.05).  -- Cohen J., 1994 ##### Reply #7 – 2004-11-10 05:46:55
I'm going to warn you in advance that most of what I'm about to say is ad hoc and brainstorming, but since the discussion has veered onto the central topic of how to conduct a noninferiority transparency test, I believe it is appropriate.

Quote
1) On using ABX for testing non-inferiority of a codec vs. original: this cannot be an experiment in which the subject and experimenter are the same person. If the subject knows what's being tested, the experiment is already compromised wrt statistical significance. "What's being tested" means here just that the subject knows what A and B may be, regardless of whether (s)he knows or doesn't know X.

<snip>

To do something like this in practice would require that the subjects have no idea whatsoever of the "devices" being tested.

Well, obviously if we all had access to clinical testing labs we could get some pretty good listening results.

I don't quite buy this argument. Reasoning: If the DUT is in fact known to the listener, the only bias which affects the result is, as I understand, expectation bias. And in the context of lossy encoder testing, and in the context of testing for statistical transparency of a single codec with no comparisons to any other codecs, the expectation is always that the codec is not transparent. I find it hard to believe that the type I error of a noninferiority test could thusly be affected. Yes, I did raise a point which sort of contradicts this in another thread... but that argument only applies to comparing two different codecs.
Also, can the significance reduction be quantified?

I specifically see a few ways this could be worked around.

- Testing with a multitude of encoders all at once. As more encoders are added, the listener will have progressively less of a clue of what might be tested. I mean, "no idea whatsoever of the devices being used" is equivalent to an ABC test with an infinite number of "codecs", right? The sample ratios could also be fidgeted with (N2/N1 != 1) to accomplish a similar result. This may not quite quantify the reduction but it at least reduces it.

- Run a blind vs nonblind encoder ABX comparison, determine the reduction in significance, and apply that to the results of a nonblind encoder ABX. This only makes sense if the listeners stay the same and the samples in the above comparison are significantly less than transparent.

Quote
1.1) You have observed correctly that one cannot test for 100% transparency. If p1=p2 and s0=0, the required number of samples (n) becomes infinite in the F-M size formula. This is "proof" if you like that "showing there's absolutely no difference" is impossible in statistical terms.

Transparency is sort of non-falsifiable - ie, not proving non-transparency is never going to be a 100% proof of transparency. However, I believe it is statistically reasonable to use prior knowledge of the encoder under test to test the known limits of the algorithm, where by "prior knowledge" I mean "problem samples". If you can provide good technical proof that the flaws of an encoder are reproducible with certain classes of signals, and all other signals will be categorically more transparent than these problem samples, then you can find samples which isolate those flaws, and you can thus limit the required audio sample size drastically.

This leaves unresolved potentially requiring an infinite number of listeners. Again, I think that this is not a problem once prior knowledge is taken into account, and the experiment, as usual, abused beyond reason. I'm going to assert - and again, correct me if I'm obviously wrong here - that virtually any listener can be given superhuman hearing of artifacts by applying certain processing effects. High-frequency distortion and preringing can be made audible by slowing down the signal; low-frequency distortion by speeding up the signal. Equalization effects may be useful for other tasks. If we can show that all potentially audible artifacts can be made audible for most if not all listeners, then suddenly we've strengthened the significance of the test to where we can be very confident of transparency under normal listening situations, just like how using problem samples strengthens the test against less stringent material.

Quote
You can use ABX trials to test the equivalence/non-inferiority of two codecs that aren't exactly transparent, which is the case for the lame examples I gave. Let C1 be lame3903 --aps. It is only "99% transparent", i.e. there are samples, from a set S1, on which some people can tell the difference from the original, so "transparent" is an abuse of language here.

<snip>

The best way to check these examples is to download the 30-day trial version of NCSS. Go to Analysis then choose Proportions – Two Independent. You need to input the data as a 2x2 table with the success and failure counts, i.e. S1 N1-S1 etc. Note that P1 and P2 are reversed wrt our discussion above, because in NCSS P2 is the control group (reference) and P1 is the “treatment”, so the difference tested is P1-P2.

Thanks for the pointers.

BTW, the specific standard for transparency I'm personally shooting for is "audible only 5% of the time for any set of samples and any set of listeners" - IOW, theta=0.525, problem samples only, prescreened ears only. Yes, wishful thinking, but I suspect the proportions can be relaxed a bit with problem samples. Say, type I=0.1, type II=0.2, theta=0.6, although I have no idea how these numbers would relate to a noninferiority test. From what I understand, the only codecs that would come close to this standard would be MPC --insane and AAC at 320kbps or so. LAME has too many problem samples at 320 to qualify.

And the point of all of this? LAME APS is the closest thing online music stores have to a transparent audio format, which isn't really saying much. I'd like to state with some reasonable certainty that a codec is going to be transparent all the time, not just most of the time, so I can stop buying CDs once and for all.