Topic: Understanding ABX Test Confidence Statistics

Understanding ABX Test Confidence Statistics

Reply #125
Yep he gave up & stopped listening (but he never owned up to this on AVS) - the test was over at that point - probably trial 4, certainly trial 5. Yet he presented this set of results without mentioning that he stopped listening.


Wait, so if I don't think I can hear a difference after a few trials, I'm supposed to tick off some check box at the bottom of the test, which I can't seem to find at this moment, and my 8 correct out of 16 trials isn't enough for the test conductors to figure this out for themselves? Did I get this right, or again am I missing something?
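(For reference: 8 of 16 is exactly chance-level. Below is a minimal Python sketch of the one-sided binomial arithmetic that ABX tools report as a confidence figure - the function name is mine, not from any particular tool.)

Code:
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Probability of scoring at least `correct` out of `trials`
    by pure guessing (p = 0.5 per trial), i.e. the one-sided p-value."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(abx_p_value(8, 16))   # ~0.598: 8 of 16 is indistinguishable from coin flips
print(abx_p_value(12, 16))  # ~0.038: the usual 12-of-16 'pass' clears p < 0.05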

Understanding ABX Test Confidence Statistics

Reply #126
I've reproduced the ABX log along with the trial timings beside each trial


And of course the log isn't what you misrepresented it to be. The total test took almost 3 minutes, for an average of about 10 seconds per trial.

The one trial that was claimed to take 1 second is probably an artifact of the 1-second resolution of the time stamps and the fact that it actually took more like 2 seconds. Just because I may have blown off one trial doesn't mean that I necessarily blew off the whole test.
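(To make the resolution point concrete - a hypothetical sketch, since the logger's actual timestamp format isn't shown in this thread: with timestamps floored to whole seconds, a trial of nearly two seconds can show up as a 1-second gap.)

Code:
# Hypothetical illustration: timestamps floored to whole seconds.
true_times = [10.0, 11.9]               # trial actually took 1.9 s
logged = [int(t) for t in true_times]   # -> [10, 11]
print(logged[1] - logged[0])            # -> 1 (a "1-second" trial)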

Finally, to prove that there was a false negative you have to prove that under some reasonable conditions I could have passed the test on that day and at that time. Like I've said to you many times, I had done a far longer set of trials of the same files earlier that day, with similar results. I had also said before the test that my hearing wasn't up to snuff and that I wasn't doing ABX tests for the record, because it wouldn't be fair to the tests. Quit misrepresenting the truth!

My high-speed internet account has been down since Sunday morning due to the failure of my ISP's Arris Media Gateway (third one in 2 months! :-( ). This post is via a smartphone. I'll upload the files for people to listen to for themselves when it is feasible. You obviously have no experience with high-bitrate LAME files. ;-)

Understanding ABX Test Confidence Statistics

Reply #127
Looks to me like he tried for 21 seconds and thought he had found a difference, but hadn't. He tried for 60 seconds on the next trial. After that he lost interest or gave up because he was not sure.
How on earth does this discredit ABX testing?


What really happened was that I did a long, drawn-out, knock-down-drag-out ABX earlier that day, scored miserably, but had overwritten that test's ABX log doing another test.

This time I listened to a few samples, knew that the earlier results were going to be duplicated, and wasted no more time with the matter.

Besides, it's a 256k LAME file based on a fairly easy-to-code string quartet. The probability of any 68-year-old set of ears, or any ears at all, reliably hearing a difference would appear to be pretty low.

I'll upload the files when I get my high-speed internet back, probably late tomorrow.

Understanding ABX Test Confidence Statistics

Reply #128
Hmm, so let's recap the recent activity here:

1) Man who sells $400-600 Euro DAC assemblies claims that "false negatives" (which can be parsed a number of ways; type is conditionally dependent) are a serious problem in audio testing; man does not want to accept ABX style testing, but appears to offer no reasonable alternative (sighted trials being an inherently unreasonable alternative)

2) Individuals who have engaged in extensive ABX testing of hardware and/or encoders dispute claims of man in 1)

3) Man from 1) asserts a belief that "Type II" errors (again, must be specifically parsed) are a significant defect associated with ABX-style testing

4) Man from 1) presents no significant evidence whatsoever to validate assertion of belief from 3)

5) Forum users who disagree with assertions made by Man from 1) are accused by Man from 1) of being dishonest, and told that the onus is upon them to disprove his apparently baseless contentions.
______________


Bottom line: An individual with a strong financial interest in invalidating a well-established form of testing attempts to invalidate said form of testing using nothing but verbiage.

One word works.


Understanding ABX Test Confidence Statistics

Reply #129
Type I and Type II errors are standard statistical terms and shouldn't need special parsing.

I'm so tired of seeing old arguments brandished as if they were brand new and cutting. Old heads may recall that Les Leventhal was hammering on Type II errors way back in the mid-'80s, and AFAIK Stereophile still trots his objections out every so often as part of their longstanding campaign to impugn audio DBT (and ABX specifically) by any means necessary.

They reprinted this 1985 exchange about Type II error in ABX in 2000:
http://www.stereophile.com/features/141/index.html

Over time, Leventhal published several articles about listening test stats in JAES or at AES conventions, some more controversial than others:

How Conventional Statistical Analyses Can Prevent Finding Audible Differences in Listening Tests
Leventhal, Les (University of Manitoba, Winnipeg, Manitoba, Canada)
AES Convention 79 (October 1985), Paper 2275

Type 1 and Type 2 Errors in the Statistical Analysis of Listening Tests
Leventhal, Les
JAES Volume 34, Issue 6, pp. 437-453; June 1986
(followed by two corrections published the same year in JAES, and comments by David Clark, Tom Nousaine, and Daniel Shanefield)

Statistically Significant Poor Performance in Listening Tests
Leventhal, Les
JAES Volume 42, Issue 7/8, pp. 585-587; July 1994

Analyzing Listening Tests with the Directional Two-Tailed Test
Leventhal, Les; Huynh, Cam-Loi
JAES Volume 44, Issue 10, pp. 850-863; October 1996

Understanding ABX Test Confidence Statistics

Reply #130
The one trial that was claimed to take 1 second is probably an artifact of the 1-second resolution of the time stamps and the fact that it actually took more like 2 seconds. Just because I may have blown off one trial doesn't mean that I necessarily blew off the whole test.


Well, you did eleven trials in 25 seconds - or rather, "less than 26" given the resolution. Listening for a mere two seconds per trial to judge would be OK if you actually heard enough to know - which, from your scores, you didn't.

Understanding ABX Test Confidence Statistics

Reply #131
So there was a negative, but not necessarily a false negative. What's the problem?

Understanding ABX Test Confidence Statistics

Reply #132
Yep he gave up & stopped listening (but he never owned up to this on AVS) - the test was over at that point - probably trial 4, certainly trial 5. Yet he presented this set of results without mentioning that he stopped listening. Yes, honesty is needed in these tests, but that's a fairly shaky foundation on which to judge the results (as we've seen on this thread)

Who's trying to discredit ABX testing? I'm saying we could do with a measure of the false negatives in the results - in Arny's trials 5 to 15, a cannon shot could have been in one sample & a hummingbird in another, & he wouldn't have discriminated the difference due to not listening.
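(The posts here never spell out the control design, but one standard approach is to salt the run with control trials whose difference is audible by construction; chance-level scoring on those flags an inattentive run. A hypothetical sketch - the function name, threshold, and trial layout are illustrative, not from any post.)

Code:
from math import comb

def run_is_inattentive(control_results, alpha=0.05):
    """True if the score on known-audible control trials is consistent
    with guessing, i.e. the run's negatives shouldn't be trusted.
    `control_results` is one boolean per control trial (True = correct)."""
    n, k = len(control_results), sum(control_results)
    # one-sided binomial p-value for getting k or more correct by chance
    p_guess = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return p_guess > alpha  # cannot rule out guessing -> discard the run

# 4 of 8 easy controls correct: chance-level, so any negatives are suspect.
print(run_is_inattentive([True, False] * 4))  # True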

False negative? I see a perfect negative ABX test from Arny. I guess he is used to 16 trials, so he finished it.

So you suggest that the correct way to do an ABX test is like Arny did - stop listening after 4 trials?

Understanding ABX Test Confidence Statistics

Reply #133
Yep he gave up & stopped listening (but he never owned up to this on AVS) - the test was over at that point - probably trial 4, certainly trial 5. Yet he presented this set of results without mentioning that he stopped listening.


Wait, so if I don't think I can hear a difference after a few trials, I'm supposed to tick off some check box at the bottom of the test, which I can't seem to find at this moment, and my 8 correct out of 16 trials isn't enough for the test conductors to figure this out for themselves? Did I get this right, or again am I missing something?

So what you are suggesting is that in a listening test, if you can't hear a difference in the first 4 trials, you should stop listening & just randomly hit buttons. Sorry, I missed those recommendations in the ITU standards document - can you quote them for me, please?

Understanding ABX Test Confidence Statistics

Reply #134
Hmm, so let's recap the recent activity here:

1) Man who sells $400-600 Euro DAC assemblies claims that "false negatives" (which can be parsed a number of ways; type is conditionally dependent) are a serious problem in audio testing; man does not want to accept ABX style testing, but appears to offer no reasonable alternative (sighted trials being an inherently unreasonable alternative)
Man wants ABX testing to include internal controls & has suggested examples of these controls & how they would be used. Man is suggesting a way to improve ABX testing.

Quote
2) Individuals who have engaged in extensive ABX testing of hardware and/or encoders dispute claims of man in 1)
On the contrary, false negatives are stated by Arny to be a known issue with blind testing & he proceeded to say how he deals with it. The only problem is that his methodology doesn't deal with it.

Quote
3) Man from 1) asserts a belief that "Type II" errors (again, must be specifically parsed) are a significant defect associated with ABX-style testing
If you are going to repeat yourself then so am I - "On the contrary, false negatives are stated by Arny to be a known issue with blind testing & he proceeded to say how he deals with it. The only problem is that his methodology doesn't deal with it."


Quote
4) Man from 1) presents no significant evidence whatsoever to validate assertion of belief from 3)
What evidence is needed? It's an accepted issue with blind tests & even Arny agrees.

Quote
5) Forum users who disagree with assertions made by Man from 1) are accused by Man from 1) of being dishonest, and told that the onus is upon them to disprove his apparently baseless contentions.
When tests are being used dishonestly to try to prove a point or win a debate - then I accuse those people of being dishonest, yes!

Understanding ABX Test Confidence Statistics

Reply #135
So you suggest that the correct way to do an ABX test is like Arny did - stop listening after 4 trials?

Why not? He couldn't tell them apart. Why waste lifetime better spent elsewhere?
Where exactly are your positive ABX logs of the same samples?

Understanding ABX Test Confidence Statistics

Reply #136
Type I and Type II errors are standard statistical terms and shouldn't need special parsing.

I'm so tired of seeing old arguments brandished as if they were brand new and cutting. Old heads may recall that Les Leventhal was hammering on Type II errors way back in the mid-'80s, and AFAIK Stereophile still trots his objections out every so often as part of their longstanding campaign to impugn audio DBT (and ABX specifically) by any means necessary.

They reprinted this 1985 exchange about Type II error in ABX in 2000:
http://www.stereophile.com/features/141/index.html

Over time, Leventhal published several articles about listening test stats in JAES or at AES conventions, some more controversial than others:

[Leventhal reference list snipped - see Reply #129]

Thank you for those links - much appreciated.

I'm not just identifying the problem - I'm also offering the solution! It seems that it's not even accepted that there is a problem, & therefore the solution is not even discussed.

Understanding ABX Test Confidence Statistics

Reply #137
So you suggest that the correct way to do an ABX test is like Arny did - stop listening after 4 trials?

Why not? He couldn't tell them apart. Why waste lifetime better spent elsewhere?
Ah, I see - very revealing & explains the large number of null results obtained in ABX tests
Quote
Where exactly are your positive ABX logs of the same samples?
There are positive ABX results on that forum if you care to look - so that shows Arny's results to be false negatives (although the fact that he stopped listening is proof enough)

So, Arny, what say you - are you accepting the excuses these people are making for you, that you stopped listening (& most seem to think so), or are you trying to say you actually did listen & decide & click on the button in 1-second trials (as you maintained was the case before)? Which is it?

Understanding ABX Test Confidence Statistics

Reply #138
Type I and Type II errors are standard statistical terms and shouldn't need special parsing.

I'm so tired of seeing old arguments brandished as if they were brand new and cutting. Old heads may recall that Les Leventhal was hammering on Type II errors way back in the mid-'80s, and AFAIK Stereophile still trots his objections out every so often as part of their longstanding campaign to impugn audio DBT (and ABX specifically) by any means necessary.

They reprinted this 1985 exchange about Type II error in ABX in 2000:
http://www.stereophile.com/features/141/index.html

Over time, Leventhal published several articles about listening test stats in JAES or at AES conventions, some more controversial than others:

[Leventhal reference list snipped - see Reply #129]

Thank you for those links - much appreciated.


I would think that a person who has made such far-reaching claims related to the topic would already know of these articles and have studied them. I know that I have.

Quote
I'm not just identifying the problem - I'm also offering the solution!


Please provide the URL of the related documentation.


Quote
It seems that it's not even accepted that there is a problem, & therefore the solution is not even discussed.


Speaks to a less-than-adequate familiarity with the topic, particularly for a person who seems to seek to correct practitioners with decades of hands-on experience.

(1) Type 1 and Type 2 errors are constant dangers in any experiment.

(2) Even though sighted evaluations are usually utterly corrupted and overwhelmed by type 1 errors, the pervasive type 1 errors may hide type 2 errors.

(3) Controlling type 1 errors can be reasonably expected to make type 2 errors more apparent and thus facilitate their management. I think this has happened quite a bit with ABX, for example.

(4) Nobody with a brain tries to rank their importance, because they are both potentially fatal errors. Harping on type 2 errors while glossing over type 1 errors, or vice versa, is unwise.

(5) Mr. Keny's hobby seems to be going ballistic over type 2 errors and glossing over type 1 errors.

(6) Figuring out what these and other errors are in the context of a particular experiment may be very difficult. It's not as trivial as Mr. Keny makes it seem with his wild, unfounded accusations.
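(Both rates are easy to put numbers on under the standard binomial model. A minimal sketch - the 70% "true ability" figure is an illustrative assumption, not anyone's measured score.)

Code:
from math import comb

def tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

n, criterion = 16, 12                 # 'pass' at 12 or more correct of 16
alpha = tail(n, 0.5, criterion)       # type 1: a pure guesser passes anyway
beta = 1 - tail(n, 0.7, criterion)    # type 2: a 70%-accurate listener fails
print(f"alpha = {alpha:.3f}")         # 0.038
print(f"beta  = {beta:.3f}")          # 0.550 - a real but modest ability fails over half the time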

Understanding ABX Test Confidence Statistics

Reply #139
There are positive ABX results on that forum if you care to look - so that shows Arny's results to be false negatives (although the fact that he stopped listening is proof enough)

We are going in circles. Arny couldn't hear a difference, so it is not a false negative. Are you suggesting that if one positive result was obtained, all negatives must be false?

Understanding ABX Test Confidence Statistics

Reply #140
There are positive ABX results on that forum if you care to look - so that shows Arny's results to be false negatives (although the fact that he stopped listening is proof enough)

We are going in circles. Arny couldn't hear a difference, so it is not a false negative. Are you suggesting that if one positive result was obtained, all negatives must be false?


Keny is trying to lead us in circles. We can get off when we want to. 

Has he even bothered to give a workable definition of False Negative?

It would be fun to see him try to ABX the same two files in a properly-run test. Sauce for the goose, sauce for the gander.

I suspect that he has seen the results of Amir's test based on my keys-jangling file, and thinks that all DBTs related to MP3s are easy.

An alternative viewpoint is that MP3 coders as a rule implement low-pass filters in the 16-20 kHz range, and a monitoring system with spurious responses to signals > 20 kHz will thus also be invalid for MP3-related tests, just as it was invalid for tests of resampling.

While I still had sanctioned access to AVS, I noticed a post from a certain well-known person which suggests that his monitoring system never passed my tougher, pure-tone-based monitoring-system validation test, and that he was instead basing his claims of having validated his monitoring system with my files on some other random-noise-based files that I had prepared for a different purpose.

One more good reason why a neutral expert test proctor could shed a lot of light on certain exceptional results.

Understanding ABX Test Confidence Statistics

Reply #141
I would think that a person who has made such far-reaching claims related to the topic would already know of these articles and have studied them. I know that I have.
I would suggest you read them again as, based on your posts, you have not absorbed them.


Quote
Quote
It seems like it's both not accepted that there is a problem & therefore the solution is not even discussed


Speaks to a less-than-adequate familiarity with the topic, particularly for a person who seems to seek to correct practitioners with decades of hands-on experience.

(1) Type 1 and Type 2 errors are constant dangers in any experiment.
Agreed

Quote
(2) Even though sighted evaluations are usually utterly corrupted and overwhelmed by type 1 errors, the pervasive type 1 errors may hide type 2 errors.
We're not talking about sighted listening tests - why are they always introduced as a comparator? As I said before, what is being said is "Our tests might be bad, but don't look here, look over there at the sighted tests - they're worse". A typical deflection tactic seen on forums.

Quote
(3) Controlling type 1 errors can be reasonably expected to make type 2 errors more apparent and thus facilitate their management. I think this has happened quite a bit with ABX, for example.
Absolutely wrong - the opposite of what you state is true: tightening the criterion on type 1 errors INCREASES type II errors. As I said, you haven't absorbed your claimed readings above. Did you really read them, or did you stop after the first 20 seconds & pretend after that? This would explain your lack of knowledge. (A numeric check of that tradeoff is sketched below.)
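(The direction of that tradeoff is easy to verify numerically: with the trial count fixed, raising the pass criterion lowers the type 1 rate and raises the type 2 rate. A sketch, again assuming an illustrative 70%-accurate listener.)

Code:
from math import comb

def tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

n = 16
for criterion in (11, 12, 13, 14):
    alpha = tail(n, 0.5, criterion)     # type 1 rate under pure guessing
    beta = 1 - tail(n, 0.7, criterion)  # type 2 rate for a 70%-accurate listener
    print(f">= {criterion}/16: alpha = {alpha:.3f}, beta = {beta:.3f}")
# alpha falls 0.105 -> 0.002 while beta climbs 0.340 -> 0.901:
# with trials fixed, tightening one error rate loosens the other.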

Quote
(4) Nobody with a brain tries to rank their importance, because they are both potentially fatal errors. Harping on type 2 errors while glossing over type 1 errors, or vice versa, is unwise.
What are you talking about? In ABX tests, type I errors are very well dealt with. Trying to bring in sighted listening tests again, perhaps?

Quote
(5) Mr. Keny's hobby seems to be going ballistic over type 2 errors and glossing over type 1 errors.

(6) Figuring out what these and other errors are in the context of a particular experiment may be very difficult. It's not as trivial as Mr. Keny makes it seem with his wild, unfounded accusations.
It's not difficult at all to introduce controls for sensing false negatives, but there's lots of posing & deflecting going on in this thread.

Understanding ABX Test Confidence Statistics

Reply #142
There are positive ABX results on that forum if you care to look - so that shows Arny's results to be false negatives (although the fact that he stopped listening is proof enough)

We are going in circles. Arny couldn't hear a difference, so it is not a false negative. Are you suggesting that if one positive result was obtained, all negatives must be false?

Of course it is - anybody who deliberately stops listening, as you maintain, should be eliminated from the test & their results discarded. Read up on how to do these tests - or would you prefer to make up your own guidelines as you go along? Seems to be the MO here.

Understanding ABX Test Confidence Statistics

Reply #143
Of course it is - anybody who deliberately stops listening, as you maintain, should be eliminated from the test & their results discarded. Read up on how to do these tests - or would you prefer to make up your own guidelines as you go along? Seems to be the MO here.

Own rules? Where is the rule to listen for at least a specific time when I realize I hear no difference? Is there a minimum time for a valid fail?
I give up.

Understanding ABX Test Confidence Statistics

Reply #144
man wants ABX testing to include internal controls & has suggested examples of these controls & how they would be used.

Translation: man selling biochemically engineered boutique DACs wants to make it harder/improbable to easily expose $cam products. Only ITU blind tests are acceptable, because the probability of ITU-standard blind tests being done for biochemically engineered DACs and $50k amps, etc., etc. = zero.
Man and man's fellow retailer's own ABX results do not conform to ITU and must be rejected by man... well, maybe.

Man is suggesting a way to improve ABX testing

Great, let's see man's improved ABX test results for boutique DACs and $50k amps, etc.
The D-K gang rejects any blind tests for audio, so they won't be impressed and those with functional brains won't be particularly surprised when boutique DACs and $50k amps are either indistinguishable or distinguishable due to "audiophile" injurneering practices.
Thanks for the dearly-concerned-about-truth-hero-comes-to-save-the-day effort though. 

Understanding ABX Test Confidence Statistics

Reply #145
Of course it is - anybody who deliberately stops listening, as you maintain, should be eliminated from the test & their results discarded. Read up on how to do these tests - or would you prefer to make up your own guidelines as you go along? Seems to be the MO here.

Own rules? Where is the rule to listen for at least a specific time when I realize I hear no difference? Is there a minimum time for a valid fail?
I give up.

Yep, your suggestion that in a listening test, if you don't hear a difference in the first few tries, it's OK to stop listening. Geez, do you know what a listening test is?


Understanding ABX Test Confidence Statistics

Reply #146
Yes, it is OK to give up, for the reason given countless times already.
Time to move along.
Discussion closed.