## Topic: Understanding ABX Test Confidence Statistics

• Speedskater
Understanding ABX Test Confidence Statistics
##### 17 January, 2015, 06:31:33 PM
A discussion of an article in Wide Screen Review magazine written by Amir/amirm:
"Understanding ABX Test Confidence Statistics"

Kevin Graf :: aka Speedskater

• xnor
• Developer
Understanding ABX Test Confidence Statistics
##### Reply #1 – 17 January, 2015, 08:36:49 PM
So it seems like amirm has finally read up on statistics. (edit: after reading the whole post, I guess not)

Quote
this is an important topic for which there is next to nothing online.

I'm not sure if amirm has access to the same Internet as the rest of the world, but there are loads of resources, including online courses, about statistics.

Quote
as I explain below here the threshold of confidence in the results was just 56% right answers. This has caused many to dismiss the results as its results being little better than "chance." That is completely wrong.

Here's the relevant sentence from the paper:
"One-sided t-tests were performed for each condition to test the null hypothesis that the mean score was not significantly different from 56.25% correct"

Here's a generic visualization of such a test:
(Bam! seems to be the sound of several heads hitting desks.)
160 = 8 persons * (12 trials - 2 trials discarded) * 2 blocks per condition.

Besides, amirm's numbers are wrong and do not reach >=95% confidence. Here are the correct minimum numbers of right answers (trials: number right) for a binomial distribution:
10: 9
20: 15
40: 26
80: 48
160: 91 or ~57%
1000: 527
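
The thresholds in this list can be checked with an exact binomial tail sum. What follows is only a minimal sketch, assuming a one-sided test against pure guessing (p = 0.5) and a 5% significance level; the names `tail_prob` and `min_correct` are made up for this illustration and nothing beyond the Python standard library is used.

```python
from math import comb

def tail_prob(n, k, p=0.5):
    """Exact P(X >= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_correct(n, alpha=0.05, p=0.5):
    """Smallest k with P(X >= k) <= alpha, or None if no k qualifies."""
    tail = 0.0
    best = None
    # Walk down from a perfect score, accumulating the upper tail.
    for k in range(n, -1, -1):
        tail += comb(n, k) * p**k * (1 - p)**(n - k)
        if tail > alpha:
            break
        best = k
    return best

# Trial counts taken from the list above.
for n in (10, 20, 40, 80, 160, 1000):
    print(n, min_correct(n))
```

Accumulating the tail from the top keeps the work linear in the number of trials, which matters a little for the 1000-trial case.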

For obvious reasons (except if you cannot find any information on this on the Internet), a binomial distribution would not be the best choice for the test from the paper, which is why they didn't use it.

Let's be stupid for a moment and assume each person achieved not 56%, not 57%, not even 60% but close to 70% correct. Each person did 20 trials, so that would be 13-14 correct, short of the 15 needed for 20 trials. That's a failed ABX test for everyone.

Quote
To really blow your mind, we only need 95 right answers out of 160 to achieve 99% confidence the results are not due to chance! This only represents 59% right answers!!!

No, you'd need 96 coin flips with the same outcome in a binomial distribution...
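
The 99% case can be checked the same way. This is a sketch under the same assumptions (one-sided test, guessing probability 0.5); `p_value` is a hypothetical helper name, and it uses integer arithmetic so that no rounding happens until the final division.

```python
from math import comb

def p_value(n, k):
    """Exact one-sided P(X >= k) under guessing, X ~ B(n, 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# Compare the two candidate thresholds for 160 trials at the 1% level.
for k in (95, 96):
    print(k, p_value(160, k), p_value(160, k) < 0.01)
```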

Quote
Again, what I just explained is purely from statistical theory and math. It cannot be debated or second guessed.

Almost everything is factually wrong, but I guess most people here are used to it, especially the "I cannot be wrong" attitude.
"I hear it when I see it."

• mzil
Understanding ABX Test Confidence Statistics
##### Reply #2 – 17 January, 2015, 11:56:41 PM
A discussion of an article in Wide Screen Review magazine written by Amir/amirm: "Understanding ABX Test Confidence Statistics"  http://www.whatsbestforum.com/showthread.p...ence-Statistics

Amir is a flat out liar. I go by "m. zillch" on AVS forums and I never said the following, at all, which Amir's post from the OP's link said I did, or at the very least implies I meant, by taking my words out of context, as I shall soon explain:

Quote
"As zillch says, this makes no sense, right? I mean 50% correct answers would be "pure chance" and the listener guessing. How on earth can getting just 6% more right answers gets us to 95% confidence? "
Like this forum, the AVSforum has an "ignore member" feature which allows me to distance myself from Amir as best I can, since I consider his posts detestable, designed by a self-appointed, incorrigible lobbyist for the snake oil peddling part of the high end audio industry, incessantly spewing lies, propaganda, anti-science, anti-DBT, quote mined distortions, half-truths, strawman arguments, rewriting history, putting words in other's mouths, goal post moving, belittling/condescending ad homs, etc., so I no longer respond to him, nor do I even see his posts, but even when I catch these little glimpses of what he's written about me there, or now in his "What's Best Forum", such as this lie, it turns my stomach.

Notice Amir doesn't provide a link to my original post [unless he makes an edit after I post this], so people can't easily examine what I actually wrote, themselves, in full, or in what context I wrote it. Here's my full post #931 he quote mined from, including the immediately preceding post which it directly references:

Post#930 of "AVS/AIX High-Resolution Audio Test: Take 2":
"Yeah, I was thinking about your [CharlesJ's] post(s) earlier and agree that the fact that none of the test subjects in the test conducted by Bobby, my main man, [that's a joke, people, a form of manipulation implying I know him well] could hear a statistically significant difference, on their own, using \$4,600 speakers, oops, I mean \$46,000 speakers, in a special room, using doctored "special" material created just for the test, speaks volumes as to how trivial the difference is, if it even exists." [see P.S. on "if it even exists"]

Post#931:
"^Imagine the sales pitch: "Ladies and Gentlemen, come one, come all, wait till you hear our incredible new Hi-Re\$ sound system that blows CD away! True, you may not be able to hear the difference on your own, as an individual, but merely invite seven of your closest friends over, listen through my necessary \$46,000 speakers*, use my specially prepared samples only, cast your votes over several listening trials, sum your totals, and then finally examine the results in aggregate form, AND BINGO! [>]56% correct responses don't lie (instead of a random coin flip's 50% results) and it conclusively shows, with statistical significance, that yes, you made the right decision to only buy THE BEST!" - not a real quote

*- "audio chain used for such experiments must be capable of high-fidelity reproduction" [This is a real quote. It's the last line of the paper's abstract, protecting him from any subsequent failed attempts to replicate his findings, by others who may discredit his paper: https://secure.aes.org/forum/pubs/conventions/?ID=416] "But your test setup didn't use \$46K speakers, now did it?!" He'll protest.
- "high fidelity" defined by me, or authorized agents of Meridian Audio, details not specified nor provided upon request
- offer not valid under test supervision by a disinterested third party
- must use exact, unpublished, unreleased down converted samples held in my possession
- alternate forms of conversion, music, or the use of superior dither disallowed
- any attempt to measure the down converted sample's level match to the original, possibly showing a minor mismatch, as was found in the initial AIX records' down conversion samples for the AVSforum tests, for example, is disallowed."

My joke's point is who on earth would spend \$46,000+ on speakers, etc., which the BS paper implies are actually necessary to reach "high fidelity reproduction" and that lesser setups, say half quality \$23,000 speakers, might not fit the bill, all for an audio improvement which is SO incredibly subtle that one can't even pass a test showing an ability to discriminate the difference, with good statistical significance, as an individual listener taking the test over several trials, but ONLY when the scores of many other test subjects (listeners) are all pooled together and examined in toto does there emerge a level where the aggregate score of correct IDs, when it exceeds 56% in the case of 160 total trials conducted, can there be deemed a statistical significance?! Is that REALLY something any rational person could justify spending +\$46K on? "No, I admit, I can't prove I can hear a difference myself, using my new \$46K speakers I just bought, on my own, but when you pool my test results with a bunch of my friends together, over more than a hundred trials, then I can prove the difference exists." Are there people who would actually brag about having spent big money to achieve THAT goal?!

By deliberately taking what I said out of context he completely twisted my words. His manipulation implies I was asking
Quote
How on earth can getting just 6% more right answers gets us to 95% confidence?
I never asked that nor meant that.

P.S. By "if it even exists" in my first AVS post I meant there are still many lingering questions and concerns over how the BS paper's conclusions came to be, including the form of dither used which was not "best practice", but I in no way suggested that the findings themselves were NOT statistically significant, in fact I said they WERE, in aggregate form.

• mzil
Understanding ABX Test Confidence Statistics
##### Reply #3 – 18 January, 2015, 01:20:38 AM
Correction: I wrote he was anti-DBT in my last post, I meant actually that I have seen him spout anti-ABX baloney, parroting people like BS, not anti-DBT. oops. I also seem to recall that he level matches in his "many" tests "by ear", "not at all", "it isn't necessary", "turning up the lesser one" (?!) or something similar, I forget exactly; AJ or Arny may recall.

• mzil
Understanding ABX Test Confidence Statistics
##### Reply #4 – 18 January, 2015, 03:57:39 AM
I found it. This from Mr. "objectivist" (his word!), "long time", "trained", expert, not that long ago really, 2009, on level matching, from several posts in one AVS thread I found, devoted to how he "handles" it. It boggles the mind how so many people follow him as an "expert":

"I did not level match anything. However, once I found one source was worse than the other, I would then turn up the volume to counter any effect there. Indeed, doing so would close the gap some but it never changed the outcome. Note that the elevated level clearly made that source sound louder than the other. So the advantage was put on the losing side"

Then scientific reality gets explained to him by other(s) and he balks:

" Did you see the part where I said that I compensated for that by increasing the volume higher for the lower performing source? If so, how is it again that with its level higher, it would still underperform?"

Later:
"No I don't. First let's be clear on why people want to level match. It is because louder is preceived to be better. Matching levels eliminates this factor. My technique involves using the exact same principal in reverse. That is, I make a worse sounding source louder than the other. If the conclusion of the test does not change, then level had nothing to do with the quality difference.

Now let's look at an example:

Equipment A has default level that is at 90%.

Equipment B has default level that is at 85%.

I perform A/B test and find that B sounds worse than A. Before calling my results final, I turn up the volume by 10% on B. So here is the new situation:

Equipment A has default level that is at 90%.

Equipment B has default level that is at 95%.

I listen again and once more B sounds worse than A despite being louder!.

Surely if B sounds worse at 95%, it is not going to sound the same or better at 90% (level matched).

[Quote of fastl: "BTW, I've never heard of the plus-minus 10 percent volume method."]
You haven't heard of it because I invented it out of necessity . The beauty of this method is that you avoid having to use a seperate level control and introduce other unknown into the equation (some volume controls have more negative impact than the difference between DACs). And you don't need special test equipment to get there."

Here we learn partly what Amir means when he SO OFTEN mentions he's a trained listener:

"I am telling you that once you are trained, you can easily look past most if not all volume differences. A trained listener is not easily fooled by loudness differences because he can focus and identify real issues."

At one point there's an unexpected, sharp shift to "But level matching is hard" [paraphrased]:

"Besides, level matching is easier said than done. How do you suggest I change the level in ML DAC feeding a headphone amp? Put in another analog stage to change volume? Digital control? How do you know the effects of these circuits? There is a difference here between theory and practice."

Later:
"Because you cannot match levels easily. If you are testing DACs you want as little equipment between you and the equipment. Sticking a multi-channel gain control which may change the audio and add cross-talk between the sources adds more variables than it removes. My method accomplishes the same without changing the experiment. In other words, you have to be aware of the Heisenberg principal in that the testing itself must not change the outcome..."

Later:
"That has never been the issue in this thread. I have said multiple times that I believe in level matching.

The issue here is that I have provided an alternative which costs nothing, makes the experiment more accurate, and everyone here could use it tomorrow to test equipment. In contrast, I am still waiting to hear what equipment one needs to level match DACs, and how it is practical for people shopping for gear to deploy it."

Actually, he'd be better off listening to this wise man's sage advice: "Just because valid, scientifically controlled experimentation can be challenging to pull off properly (and time consuming) doesn't mean unscientific testing suddenly has validity."

I don't want to take anything he wrote out of context, like some people I know do, so here's the entire thread: http://www.avsforum.com/forum/86-ultra-hi-...ume-method.html There are like 500+ other posts I admit I haven't read. But I'm pretty sure most people here will find these to be some of the juiciest. P.S. Sorry, my breaks between his different posts in that thread may not be 100% perfectly accurate.

• xnor
• Developer
Understanding ABX Test Confidence Statistics
##### Reply #5 – 18 January, 2015, 07:04:54 AM
I wouldn't have expected anything less from him.

"The dotted line shows performance that is signicantly different from chance at the p<0.05 level calculated using the binomial distribution (56.25% correct comprising 160 trials combined across listeners for each condition)."

X ~ B(n, p)
with n = 160
p = 0.5

k = 90
P(X >= k) = 6.64%, which is not statistically significant at the 95% confidence level

The same P with k = 91 is barely significant.
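
Written out, the tail probability in question is the usual binomial sum. This is the standard textbook formula, not something quoted from the paper; the 6.64% figure is the one stated in this post, and the k = 91 case is only characterized as falling under 5%.

```latex
P(X \ge k) = \sum_{i=k}^{n} \binom{n}{i}\, p^{i} (1-p)^{n-i},
\qquad n = 160,\quad p = 0.5
% k = 90:  P(X >= 90) ~ 0.0664 > 0.05   (not significant)
% k = 91:  P(X >= 91) < 0.05            (barely significant)
```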

"I hear it when I see it."

• Porcus
Understanding ABX Test Confidence Statistics
##### Reply #6 – 18 January, 2015, 09:10:52 AM
Quote
Again, what I just explained is purely from statistical theory and math. It cannot be debated or second guessed.

Almost everything is factually wrong, but I guess most people here are used to it, especially the "I cannot be wrong" attitude.

... and, although I would say it is an undisputed fact that the use of p-values is widespread in statistical hypothesis testing, that is a fact from observation of scientific practice, not because those ninety/ninety-five/ninety-nine numbers follow purely from theory. It isn't above debate. But the guy (it's a "he", right?) is still blessed with the youthfulness that enables him to know f(x)ing everything after just a lesson and a half, uh?

Understanding ABX Test Confidence Statistics
##### Reply #7 – 18 January, 2015, 04:18:08 PM
A discussion of an article in Wide Screen Review magazine written by Amir/amirm: "Understanding ABX Test Confidence Statistics"  http://www.whatsbestforum.com/showthread.p...ence-Statistics

Amir is a flat out liar. I go by "m. zillch" on AVS forums and I never said the following, at all, which Amir's post from the OP's link said I did, or at the very least implies I meant, by taking my words out of context, as I shall soon explain:

Quote
"As zillch says, this makes no sense, right? I mean 50% correct answers would be "pure chance" and the listener guessing. How on earth can getting just 6% more right answers gets us to 95% confidence? "

Like this forum, the AVSforum has an "ignore member" feature which allows me to distance myself from Amir as best I can, since I consider his posts detestable, designed by a self-appointed, incorrigible lobbyist for the snake oil peddling part of the high end audio industry, incessantly spewing lies, propaganda, anti-science, anti-DBT, quote mined distortions, half-truths, strawman arguments, rewriting history, putting words in other's mouths, goal post moving, belittling/condescending ad homs, etc., so I no longer respond to him, nor do I even see his posts, but even when I catch these little glimpses of what he's written about me there, or now in his "What's Best Forum", such as this lie, it turns my stomach.

I think you are wise. I had been responding to Amir's misquotes and out and out lying, and it got me permanently booted off AVS. In the end on most forums you either ignore his taunts or eventually you're gone. Same thing happened on his WBF forum. I think that trying to maintain some sense of integrity and truthfulness is the more important thing, and any forum that won't support people who try to do that doesn't deserve the participation.

• mzil
Understanding ABX Test Confidence Statistics
##### Reply #8 – 19 January, 2015, 03:11:37 PM
Thanks Arny. AVS will not be the same without you. Your contribution there, including your truthful and accurate posts on what's audibly significant in audio, and what isn't, was refreshing considering the overwhelming influence of the incessant, "high-end" audio con artists, propagandists, and swindlers.

Should Amir delete or edit his text the OP links to, since I've exposed his quote mined lie, or if people have trouble loading it (as I did earlier) here it is:

"Understanding ABX Test Confidence Statistics.

OK, a mouthful of words for the title but this is an important topic for which there is next to nothing online.  The issue has come up recently because of the double blind test published by Stuart et al. as I explain below here the threshold of confidence in the results was just 56% right answers.  This has caused many to dismiss the results as its results being little better than "chance."  That is completely wrong.  Below, I explained this on AVS Forum on the poster making the same mistake.  I will make this a formal article later but I thought I share it now to get better awareness of this important topic.

Originally Posted by m.zillch on AVS Forum: "56% correct responses don't lie (instead of a random coin flip's 50% results) and it conclusively shows, with statistical significance, that yes, you made the right decision to only buy THE BEST!" - not a real quote "

[blockquote]I have answered this a few times but since it seems persistent, let me explain this in more detail.

ABX is a type of "forced choice" testing. At all times, the user can click on X being A or B. He has the answers. He just has to select the right one. Or vote randomly. We want to separate these two outcomes. To do that we use statistical analysis. And pick a threshold that says the probability of the listener randomly voting is less than 5%. Or put inversely, 95% chance that the results are not due to chance. Everyone more or less knows this part.

What is not known is the math that leads to this and how non-intuitive it is. Before I get into that, zillch is referencing the Stuart et al. peer reviewed listening test that was published in the AES journal. In there, they mention that the threshold that they had to cross was 56% hence the number zillch is using above. Note that this was NOT the outcome. The outcome was actually better than this. But the threshold for 95% confidence interval was just 56% of the listener answers being right.

As zillch says, this makes no sense, right? I mean 50% correct answers would be "pure chance" and the listener guessing. How on earth can getting just 6% more right answers gets us to 95% confidence? The answer lies in statistics. And the math here is conclusive and not subject to debate. Let me explain a bit of it.

Our ABX test has a statistical distribution that is "binomial." The listener either gets the results right or wrong (hence the starting letters "bi" or two outcomes). Probability of the listener getting the answer right is 0.5 or one out of two chances of being right. Given these two values, statistical math instantly gives us how many "right" answers we have to get to right, to achieve 95% confidence we desire.

If you want to follow along and repeat the math I am about to show you and have excel, the formula is "binom.inv". Here are the number of right answers we need to get for different number of trials to achieve 95% confidence and the percent right that it represents:

Trials: Number Right, Percent
10: 8, 80%
20: 14, 70%
40: 25, 63%
80: 47, 59%
160: 90, 56%

Bam!  we get the same answer as in the Stuart paper. It only takes 90 right answers out of 160 trials they ran, or 56% right, to achieve 95% confidence that the results were not due to chance.

To really blow your mind, we only need 95 right answers out of 160 to achieve 99% confidence the results are not due to chance! This only represents 59% right answers!!!

Again, what I just explained is purely from statistical theory and math. It cannot be debated or second guessed. It says what it says and that is the end of that. The fact that in our belly it seems wrong that 50% would be pure chance and 59% means 99% confidence is cause to not use lay logic to examine these complex topics.

As I said at the outset, the results of the Stuart test was actually better than 56% as I have shown before. Here are the results again:

The dashed line is the 95% confidence line. The vertical bars show the percent right. Notice how with the exception of one test, the rest easily clear the 95% confidence interval of 56% right answers. So there is nothing wrong there to make fun of. Here is the paper itself saying the same:

The dotted line shows performance that is signicantly different from chance at the p<0.05 level calculated using the binomial distribution (56.25% correct comprising 160 trials combined across listeners for each condition).

So in summary, you cannot, can NOT, use the percentage right answers as your confidence number in the outcome of ABX tests. That magnitude of that percentage in a sense is meaningless (because there is another important variable which is the number of trials). You need to compute the statistical formula and rely on that. Doing otherwise just leads to the wrong conclusions. The proof of this is mathematical and is not debatable or matter of opinion.
"

- amirm 01-16-2015, 04:28 PM

[/blockquote]
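
For anyone who wants to compare the table in the quoted post with a direct tail calculation, here is a minimal sketch. It assumes Excel's documented behavior for BINOM.INV (return the smallest k whose cumulative probability P(X <= k) is at least the criterion value); `binom_inv` and `cdf` are hypothetical Python stand-ins written for this illustration, not Excel's own code. Under that assumption, scoring exactly k still leaves P(X >= k) above 5%, so the pass mark would be k + 1.

```python
from math import comb

def cdf(n, k):
    """Exact P(X <= k) for X ~ B(n, 0.5)."""
    return sum(comb(n, i) for i in range(0, k + 1)) / 2**n

def binom_inv(n, alpha):
    """Smallest k with P(X <= k) >= alpha (assumed BINOM.INV behavior, p = 0.5)."""
    k = 0
    while cdf(n, k) < alpha:
        k += 1
    return k

for n in (10, 20, 40, 80, 160):
    k = binom_inv(n, 0.95)
    # P(X >= k) = 1 - P(X <= k - 1) is the chance of scoring k or more by guessing.
    print(n, k, round(1 - cdf(n, k - 1), 4), "pass mark:", k + 1)
```

The quoted numbers (8, 14, 25, 47, 90) are consistent with reading this function's output directly as the number of right answers required, which would explain why each is one lower than the corrected values posted earlier in the thread.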

Understanding ABX Test Confidence Statistics
##### Reply #9 – 20 January, 2015, 08:37:34 AM
Thanks Arny. AVS will not be the same without you. Your contribution there, including your truthful and accurate posts on what's audibly significant in audio, and what isn't, was refreshing considering the overwhelming influence of the incessant, "high-end" audio con artists, propagandists, and swindlers.

Thanks for the kind words. My life at AVS started downhill with the entry of a moderator who is a DJ at a Country and Western FM outlet in my home city. I don't know if he is trying to make AVS profitable, or blinded by the image of an ex-MS top executive or if there was even some hint of investment which they clearly need.

• jkeny
Understanding ABX Test Confidence Statistics
##### Reply #10 – 30 January, 2015, 07:07:42 PM
Thanks Arny. AVS will not be the same without you. Your contribution there, including your truthful and accurate posts on what's audibly significant in audio, and what isn't, was refreshing considering the overwhelming influence of the incessant, "high-end" audio con artists, propagandists, and swindlers.

Thanks for the kind words. My life at AVS started downhill with the entry of a moderator who is a DJ at a Country and Western FM outlet in my home city. I don't know if he is trying to make AVS profitable, or blinded by the image of an ex-MS top executive or if there was even some hint of investment which they clearly need.

It went downhill when you started tripping yourself up & digging yourself many holes out of which you couldn't extricate yourself
I'm not surprised you were eventually banned.

M.Zilch - give us a break with your attempted revisionism of your AVS post - it's nearly as pathetic as ArnyK's posts on AVS.

But keep up the entertainment, Statler & Waldorf

• Porcus
Understanding ABX Test Confidence Statistics
##### Reply #11 – 30 January, 2015, 10:36:32 PM

• John Sully
Understanding ABX Test Confidence Statistics
##### Reply #12 – 30 January, 2015, 11:26:12 PM
2-sided t-test?

• krabapple
Understanding ABX Test Confidence Statistics
##### Reply #13 – 31 January, 2015, 01:38:29 AM

Understanding ABX Test Confidence Statistics
##### Reply #14 – 31 January, 2015, 05:50:11 AM
It went downhill when you started tripping yourself up & digging yourself many holes out of which you couldn't extricate yourself

No, it was all about people who can't win arguments with their crappy rhetoric so they silence people as permanently as they can.

Quote
I'm not surprised you were eventually banned.

Too much truth for many to handle.

Understanding ABX Test Confidence Statistics
##### Reply #15 – 31 January, 2015, 06:00:19 AM
jkeny?

http://www.avrev.com/home-theater-preampli...c-review-4.html

One and the same. Can be verified on the WBF forum. Looks to me like a clear financial interest in getting people to disregard reliable listening tests.

• ajinfla
Understanding ABX Test Confidence Statistics
##### Reply #16 – 31 January, 2015, 06:43:39 AM
It went downhill when you started tripping yourself up & digging yourself many holes out of which you couldn't extricate yourself
I'm not surprised you were eventually banned.

M.Zilch - give us a break with your attempted revisionism of your AVS post - it's nearly as pathetic as ArnyK's posts on AVS.

But keep up the entertainment, Statler & Waldorf

Congrats John, you almost made your 10th anniversary of lurking!

I know you dismiss any ABX test lacking positive controls/ hidden reference anchors, so I'm curious what you think about your and Amir's own self administered online file ABX test log results? Did I miss where they conform to ABCHR/MUSHRA et al?
Thanks for finally chiming in.

cheers,

AJ
Loudspeaker manufacturer

Understanding ABX Test Confidence Statistics
##### Reply #17 – 31 January, 2015, 07:23:09 AM
I know you dismiss any ABX test lacking positive controls/ hidden reference anchors,

I agree that given their propensity for false negatives (IME a small price to pay for the much-needed control over false positives), listening tests need positive controls.

I have problems with people who rant and rave about this issue as an ABX-only problem when it is inherent in any listening test.  The only reason why nobody says much about the false negatives in sighted evaluation is that they are washed out by the very many false positives.

I have more problems with people who won't recognize that false negatives are a problem that is easy enough to manage.

• xnor
• Developer
Understanding ABX Test Confidence Statistics
##### Reply #18 – 31 January, 2015, 07:39:18 AM

---

Amir has demonstrated that he has no idea what he's talking about and given his "character" it is not surprising to see him quote-mining whenever possible.
"I hear it when I see it."

• jkeny
Understanding ABX Test Confidence Statistics
##### Reply #19 – 01 February, 2015, 04:12:10 PM
It went downhill when you started tripping yourself up & digging yourself many holes out of which you couldn't extricate yourself
I'm not surprised you were eventually banned.

M.Zilch - give us a break with your attempted revisionism of your AVS post - it's nearly as pathetic as ArnyK's posts on AVS.

But keep up the entertainment, Statler & Waldorf

Congrats John, you almost made your 10th anniversary of lurking!

I know you dismiss any ABX test lacking positive controls/ hidden reference anchors, so I'm curious what you think about your and Amir's own self administered online file ABX test log results? Did I miss where they conform to ABCHR/MUSHRA et al?
Thanks for finally chiming in.

cheers,

AJ

It's always amusing to me that so called objectivists don't understand the very tests that they swear by & question the validity of including controls in a test on a forum section called "Scientific Discussion". Yet they still refer to it as a "reliable test"

When an overall positive ABX result is recorded, it has by design, passed the false positive aspect of the test - which is essentially what the test is designed to do, a positive overall result means you have statistically successfully identified the audible difference i.e your results are not false positives.

There really is no concern given to false negatives in these tests i.e how many of the trial results are due to the many, many reasons that people don't hear differences when real, measurable differences actually exist - these are false negatives. They can happen for all sorts of reasons.

To get more technical, in forced choice, binary classification tests, specificity & sensitivity are the measures needed to judge the reliability & performance of the test. From Wikipedia:
Quote
"Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition), and is complementary to the false negative rate. Specificity (sometimes called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition), and is complementary to the false positive rate."
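
As a small illustration of the two definitions in that quote, the following sketch computes both rates from the four outcome counts. The function names are hypothetical and the counts are invented purely for the example; they are not data from any listening test discussed here.

```python
def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN), i.e. 1 - false negative rate."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP), i.e. 1 - false positive rate."""
    return tn / (tn + fp)

# Made-up counts for illustration only.
tp, fn, tn, fp = 30, 10, 45, 5
print(sensitivity(tp, fn))  # 0.75
print(specificity(tn, fp))  # 0.9
```

Note that filling in these counts requires knowing the ground truth for each case, which is what a positive or negative control supplies.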

In any valid test, false negatives & false positives should be given equal consideration.
Most audio DBTs are almost solely focussed on eliminating false positives - look at the recent changes to Foobar ABX.
As a result, nobody has any handle on the error rate embedded in the results due to false negatives.
Asking people to accept test results whose error rate is unknown is simply asking for their blind faith.

ArnyK's recent ABX test results were a perfect example - most of his listening trials were false negatives as he didn't listen. This was only evident because of the timings & not because there were any internal controls in the test to sense false negatives. His overall test result was no better than random (which is exactly what he was doing, randomly hitting keys).

So, if you want to produce results that aren't based on blind belief then include controls for false negatives & produce these stats along with the results.

• jkeny
Understanding ABX Test Confidence Statistics
##### Reply #20 – 01 February, 2015, 04:25:29 PM
I know you dismiss any ABX test lacking positive controls/ hidden reference anchors,

I agree that given their propensity for false negatives (IME a small price to pay for the much-needed control over false positives), listening tests need positive controls.
This is typical of the skewed approach to audio DBTs - you are happy that they are a small price to pay but you have no idea how skewed the results are towards false negatives. So what if 90% of audio DBTs were found to be suffering from an unacceptable level of false negatives - would this be a small price?

Quote
I have problems with people who rant and rave about this issue as an ABX-only problem when it is inherent in any listening test.  The only reason why nobody says much about the false negatives in sighted evaluation is that they are washed out by the very many false positives.
This is as ridiculous a statement as I have heard & shows your lack of knowledge & understanding - there's no such thing as a false negative in sighted tests

Quote
I have more problems with people who won't recognize that false negatives are a problem that is easy enough to manage.
Yes, it's easy to manage by including hidden controls within the test, but have you ever done so & can you give us the stats on false negatives for any such test you've administered? I don't think I have ever seen you give a practical, sensible approach to how these controls could be included in a test.

• xnor
• Developer
Understanding ABX Test Confidence Statistics
##### Reply #21 – 01 February, 2015, 05:07:34 PM
When an overall positive ABX result is recorded, it has by design, passed the false positive aspect of the test

Nope. I've made a perfect score multiple times just by randomly clicking. Now imagine what a spectrum analyzer does...
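
To put a rough number on random clicking: the sketch below assumes each trial is an independent 50/50 guess and that only the best run is reported. The function names and the example figures (10 trials, 100 attempts) are assumptions for illustration, not data from this thread.

```python
def p_perfect(n_trials, p_guess=0.5):
    """Probability of getting every trial right by pure guessing."""
    return p_guess ** n_trials


def p_at_least_one_perfect(n_trials, n_attempts, p_guess=0.5):
    """Probability that at least one of several independent guessing runs is perfect."""
    return 1.0 - (1.0 - p_perfect(n_trials, p_guess)) ** n_attempts


# Hypothetical example: 10-trial runs, repeated 100 times
print(p_perfect(10))                    # a single run
print(p_at_least_one_perfect(10, 100))  # best of 100 runs
```

A single perfect 10-trial run by guessing is rare, but the chance grows quickly once unreported retries are allowed, which is why a posted log alone cannot rule out a false positive.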

An online ABX test only works if you have honest participants that will not only not cheat but also point out and accept problems with the test files (like the time offset in the AVS AIX test files).

a positive overall result means you have statistically successfully identified the audible difference i.e your results are not false positives.

Nope. You really should read up on statistics again.

There really is no concern given to false negatives in these tests i.e how many of the trial results are due to the many, many reasons that people don't hear differences when real, measurable differences actually exist - these are false negatives. They can happen for all sorts of reasons.

What is a "real" difference? A measurable difference certainly does not mean that there's an audible difference anyway.

And nope, there is concern given to false negatives, for example by including low anchors in test files. But again, in an online test you can only assume that people try their best and also list the equipment they actually used.

If this is not the case, it still matters less than false positives, because we do not accept the null hypothesis anyway. Again, read up on statistics.
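
Why a failed run does not justify accepting the null hypothesis can be illustrated with a rough power calculation. The sketch below assumes a plain one-sided binomial test at the 5% level and a hypothetical listener whose true per-trial hit rate is 70%; the function names are made up for illustration.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def critical_count(n, alpha=0.05):
    """Smallest number correct whose probability under pure guessing is <= alpha."""
    for k in range(n + 1):
        if binom_tail(n, k, 0.5) <= alpha:
            return k
    return None  # too few trials to ever reach significance


def miss_rate(n, p_true, alpha=0.05):
    """Probability that a listener with true hit rate p_true fails to reach significance."""
    k = critical_count(n, alpha)
    if k is None:
        return 1.0
    return 1.0 - binom_tail(n, k, p_true)


print(critical_count(20))  # correct answers needed out of 20
print(miss_rate(20, 0.7))  # chance that a 70% listener still fails
```

Under these assumptions a listener who genuinely hears the difference 70% of the time would still fail a 20-trial test more than half of the time, so a non-significant result says little on its own.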

To get more technical, in forced choice, binary classification tests, specificity & sensitivity are the necessary measures needed to judge the reliability & performance of the test. From Wiki
Quote
"Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition), and is complementary to the false negative rate. Specificity (sometimes called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition), and is complementary to the false positive rate."

In any valid test false negatives & false positives should be given equal consideration
Most audio DBTs are almost solely focussed on eliminating false positives - look at the recent changes to Foobar ABX.
As a result nobody has any handle on the error rate embedded in the results due to false negatives.
Asking people to accept test results whose error rate is unknown is simply asking for their blind faith.

It certainly takes faith to accept (positive) results of demonstrably dishonest people.

But that aside, it seems you are trivializing this. Black/white kinda thinking, as you did above.
What would an online test look like where you can calculate specificity for each participant? I would really be interested in your answer.
It's hard enough (= impossible) to get honest people doing the required number of trials and sending in their results regardless of success.

So, if you want to produce results that aren't based on blind belief then include controls for false negatives & produce these stats along with the results.

Nope.
No faith required.

Where you need faith, or let's better call it gullibility, is with the (most of the time) positive dishonest sighted listening tests.
"I hear it when I see it."

• ajinfla
Understanding ABX Test Confidence Statistics
##### Reply #22 – 01 February, 2015, 05:30:21 PM
I'm curious what you think about your and Amir's own self-administered online file ABX test log results? Did I miss where they conform to ABC/HR, MUSHRA et al?

It's always amusing to me that so called objectivists don't understand the very tests that they swear by & question the validity of including controls in a test on a forum section called "Scientific Discussion". Yet they still refer to it as a "reliable test"

When an overall positive ABX result is recorded, it has by design, passed the false positive aspect of the test - which is essentially what the test is designed to do, a positive overall result means you have statistically successfully identified the audible difference i.e your results are not false positives.

There really is no concern given to false negatives in these tests i.e how many of the trial results are due to the many, many reasons that people don't hear differences when real, measurable differences actually exist - these are false negatives. They can happen for all sorts of reasons.

To get more technical, in forced choice, binary classification tests, specificity & sensitivity are the necessary measures needed to judge the reliability & performance of the test. From Wiki
Quote
"Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function. Sensitivity (also called the true positive rate, or the recall rate in some fields) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition), and is complementary to the false negative rate. Specificity (sometimes called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition), and is complementary to the false positive rate."

In any valid test false negatives & false positives should be given equal consideration
Most audio DBTs are almost solely focussed on eliminating false positives - look at the recent changes to Foobar ABX.
As a result nobody has any handle on the error rate embedded in the results due to false negatives.
Asking people to accept test results whose error rate is unknown is simply asking for their blind faith.
ArnyK's recent ABX test results were a perfect example - most of his listening trials were false negatives as he didn't listen. This was only evident because of the timings & not because there were any internal controls in the test to sense false negatives. His overall test result was no better than random (which is exactly what he was doing, randomly hitting keys).
So, if you want to produce results that aren't based on blind belief then include controls for false negatives & produce these stats along with the results.

Ok, forgive me if I missed it in that Gish Gallop, is that a no or yes, you and fellow Hi End distributor Amir did/not adhere to MUSHRA et al?
Thanks again for unlurking after nearly 10 years, to set us straight.

cheers,

AJ
Loudspeaker manufacturer

• greynol
• Global Moderator
Understanding ABX Test Confidence Statistics
##### Reply #23 – 01 February, 2015, 05:39:42 PM
Straight in that he doesn't have even the faintest idea what he's talking about?

I'd prefer he go back to lurking instead of miring down what could otherwise be a useful discussion due to being an active party in the useless art of remaining ignorant/intellectual dishonesty/selective quote mining/misdirection/trolling/troll-baiting.  The same goes for the rest of you who are considering doing similar if not the same.
Is 24-bit/192kHz good enough for your lo-fi vinyl, or do you need 32/384?

Understanding ABX Test Confidence Statistics
##### Reply #24 – 01 February, 2015, 05:44:05 PM
This is as ridiculous a statement as I have heard & shows your lack of knowledge & understanding - there's no such thing as a false negative in sighted tests

I do not understand this at all. If you have a sighted test where you believe the two items under test ought to sound identical (but in fact they don't) - surely that could quite easily generate false negatives?