Title: **That old chestnut... Type II error & ABX**

Post by:**Foobar3030** on **03 July, 2017, 03:57:54 PM**

Our objective in listening tests is to establish whether a perceptible difference exists between samples. By convention, alpha is set at 0.05, which in a 16-trial ABX test equates to correctly identifying at least 12 of the 16 samples.
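As a quick check of the 12-out-of-16 figure, here is a minimal sketch that computes the one-sided binomial tail probability under pure guessing (p = 0.5). The function name `binomial_tail` is just an illustrative helper, not something taken from any ABX tool.

```python
from math import comb

def binomial_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more correct answers."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability of reaching each score by guessing alone in a 16-trial ABX test
p_12 = binomial_tail(16, 12)
p_11 = binomial_tail(16, 11)

print(f"P(>= 12 of 16 by guessing) = {p_12:.4f}")
print(f"P(>= 11 of 16 by guessing) = {p_11:.4f}")
```

Under these assumptions 12 of 16 comes out at roughly 0.038, while 11 of 16 is roughly 0.105, so 12 is the smallest score that gets under 0.05.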

Beta is a slightly trickier matter. I know this has been debated *ad nauseam* here, but the discussion has often been in the context of comparing the virtues of blinded vs. sighted testing. I don't want to re-litigate that, and I suspect that most people here, like me, accept that sighted tests are unacceptable and inappropriate for scientific studies. My intention is certainly not to 'talk down' blinded testing, but to focus on the statistical approaches people adopt.

In the case of comparing Redbook to higher-frequency formats, the existence of a perceptible difference is widely understood to be implausible. The 'effect' of higher frequencies (such as it exists) must have an extremely low effect size, but for the purposes of statistical assessment, that effect size is *theoretically* > 0. This in and of itself produces a higher beta. Accordingly, high beta appears to be a natural consequence of comparing two things which are functionally identical.

With this in mind, my question... **Is the existence of low power even a concern for us, or does it merely confirm our point about samples being indistinct?**

My thinking: If we simply seek to establish whether perceptible difference exists, then the inability of individuals to demonstrate difference (i.e. 0.05 criterion not met) is consistent with high beta, because the effect is so small (or non-existent) as to be imperceptible.

Thoughts?

**Edit:** I should add, I realise many ABX tools now provide indications of statistical power. My question relates more to whether low power is even a legitimate concern in the first place, given that we're dealing with samples which are near-impossible to distinguish, and which necessarily display high beta anyway.
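To put rough numbers on the "high beta" point, here is a small sketch of how power (1 - beta) behaves for the 12-of-16 criterion as the listener's true probability of a correct answer moves away from 0.5. The list of `p_true` values is purely hypothetical and chosen for illustration; nothing here is measured data.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

N_TRIALS = 16
K_CRIT = 12  # score needed to meet the 0.05 criterion with 16 trials

# Hypothetical true per-trial probabilities of answering correctly
p_true_values = [0.50, 0.55, 0.60, 0.75, 0.90]

powers = {}
for p_true in p_true_values:
    power = binomial_tail(N_TRIALS, K_CRIT, p_true)  # chance of passing the test
    powers[p_true] = power
    print(f"p_true = {p_true:.2f}  power = {power:.3f}  beta = {1 - power:.3f}")
```

When `p_true` is exactly 0.5, the "power" is simply alpha, so beta sits at about 0.96. It only drops substantially once the per-trial hit rate is well above chance, which is the sense in which near-identical samples necessarily go with high beta at a fixed number of trials.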

Title: **Re: That old chestnut... Type II error & ABX**

Post by:**saratoga** on **03 July, 2017, 04:30:08 PM**

ABX tests set out to prove a difference is audible. Beta isn't applicable, since if you fail, the test is inconclusive.

At least that is my interpretation.

Title: **Re: That old chestnut... Type II error & ABX**

Post by:**Arnold B. Krueger** on **14 July, 2017, 07:07:34 AM**

> ABX tests set out to prove a difference is audible. Beta isn't applicable, since if you fail, the test is inconclusive.
>
> At least that is my interpretation.

FWIW I think that is what happens when you try to apply a modern philosophy of science to ABX testing.

"Negative hypotheses are difficult or impossible to prove."

But you can collect a ton of negative evidence that is very convincing, particularly to the people who work hard to try not to collect negative evidence but end up collecting it anyway.

Fact is, positive evidence can be fairly convincing in a negative way if you step back and look at how hard weak positive evidence can be to collect.

The question at hand is usually:

"Can I sell this as a good sounding or superior sounding product?"

So, let us say that you huff and you puff and after what seems like days of listener training, program material selection and monitoring system tuning, you can repeatedly and reliably, if only barely, detect an audible difference.

Riddle me this: would a real-world potential customer try this hard to hear such a small difference?

Heck, we can't even get most audio reviewers, who on some level seem to be the best prospects for this kind of stuff, to do their first serious ABX test.

Until they try it and get a little good at it, many don't believe that ABX tests can be too sensitive for the real world.

That's how much placebophile illusion dominates the world of audio.

Title: **Re: That old chestnut... Type II error & ABX**

Post by:**Foobar3030** on **16 July, 2017, 06:01:36 PM**

Here's a follow up question: Do non-binary listening tests have higher or lower Type II error than ABX?

For example, the probability of guessing correctly in a triangle test is 1/3. Does this have any implications for Type II error, or just for the number of trials to reach 5% alpha (necessarily fewer need to be correct than under ABX)?
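One way to explore this numerically is sketched below. It finds the smallest passing score at alpha = 0.05 for each chance level, then compares power under a simple guessing model. That model is an assumption made for the sake of the example: a listener is taken to genuinely hear the difference on a fraction `d` of trials and to guess on the rest, so the per-trial hit rate is `d + (1 - d) * p_guess`. The value `d = 0.5` and the helper names are hypothetical, and a different perceptual model could change the comparison.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_count(n, p_guess, alpha=0.05):
    """Smallest k with P(X >= k | pure guessing) <= alpha, or None if no k qualifies."""
    for k in range(n + 1):
        if binomial_tail(n, k, p_guess) <= alpha:
            return k
    return None

def p_correct(d, p_guess):
    # ASSUMED guessing model: hears the difference on a fraction d of trials,
    # guesses (with success probability p_guess) on the remaining trials.
    return d + (1 - d) * p_guess

n = 16
d = 0.5  # hypothetical fraction of trials on which the difference is truly heard

results = {}
for name, p_guess in (("ABX", 1 / 2), ("triangle", 1 / 3)):
    k = critical_count(n, p_guess)
    power = binomial_tail(n, k, p_correct(d, p_guess))
    results[name] = (k, power)
    print(f"{name:8s} chance = {p_guess:.3f}  need >= {k} of {n}  power = {power:.3f}")
```

With 16 trials the passing score comes out at 9 for the triangle test versus 12 for ABX. Whether that translates into lower Type II error depends entirely on how the per-trial hit rate is modelled, so the power figures should be read as illustrative only.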

Title: **Re: That old chestnut... Type II error & ABX**

Post by:**Arnold B. Krueger** on **17 July, 2017, 08:43:23 AM**

> Here's a follow up question: Do non-binary listening tests have higher or lower Type II error than ABX?
>
> For example, the probability of guessing correctly in a triangle test is 1/3. Does this have any implications for Type II error, or just for the number of trials to reach 5% alpha (necessarily fewer need to be correct than under ABX)?

There are a number of ways of looking at this, and two come quickly to mind:

(1) Statistical view - the experimental procedure is fixed, and different methods of analyzing the resulting data are compared.

(2) Operational view - the experimental procedure itself is varied.

From an operational view, I submit that current knowledge about how we hear strongly suggests that our ability to hear sameness and difference accurately and reliably is maximized when:

(a) The sounds are as alike as possible

(b) The sounds are compared to each other in as close temporal proximity as possible. We now know that delays between presentation of the sonic alternatives much greater than 1 second or so can greatly diminish the ability to identify small or subtle differences.

In my own limited mind, I put those two ideas together and I conclude that either binary comparisons are best, or maybe they are even the only kind of comparison we can actually do with sound.