It's important to understand that what JJ considers a listening test and what the ABX/Hydrogenaudio skeptics crowd considers a listening test are two very different things.
Perhaps JJ can explain what he considers a listening test and how it differs from the Hydrogenaudio standpoint. I was somehow under the impression they were not that different.
Including positive and negative controls, lots of training for the test as well as familiarity with the equipment and music, and equipment validation are the biggies. Test evaluation might be an issue, too. Many tests, including some of the MPEG tests and 1116, assume that the entire population reacts the same way to impairments. While basic masking is universal, what people dislike when they can hear something is NOT universal.
Pio's post does make mention of relegating ABX testing to practice trials, so training is touched upon at least indirectly. I don't see that we should go out of our way to engage in some debate by proxy. Maybe those players who are members here can have the debate here. Those who are not members can certainly join so long as they do so in compliance with our rules, namely TOS12. Personally, I'm not interested in advocating for yet another thread dedicated to trolling TOS8, however (somewhat related to those who aren't welcome back per TOS12).
With all due respect to Mr. J., while his criticism of many of our public mass listening tests is valid, we do not stick to that approach dogmatically. Our intent in those tests is to get ordinary-citizen feedback regarding codec quality. All of his criticisms can be addressed without violating TOS8 in any way; they just make tests harder to conduct. We're aiming to maximize the audience we get feedback from, not to maximize the quality of the results. Furthermore, as an Internet-based community, some criticisms are nigh impossible to address, such as equipment validation.
I think the criticisms were made in good faith without the intent of demeaning what we do. I fear that this thread will become divisive.
Lots of the personal listening tests are by people with considerable training. As for ABX tests by members here, the goal is to determine whether a given file or system is good enough for that individual. In this case training may not even be desirable, let alone necessary. I think it comes down to what you want to measure and how you analyze the results.
Honestly, I think our procedure is fine, given what we're trying to achieve. We get statistically significant results. There's no need to change anything. We can run tests with altered procedure, should there be a desire, but what would the goal of such a test be?
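For anyone unfamiliar with how "statistically significant" is usually established here: a forced-choice ABX result is typically checked with a one-sided binomial test against the 50% guessing rate. A minimal sketch in Python (the 12-of-16 figures are illustrative only, not from any particular HA test):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: probability of getting at least
    `correct` hits out of `trials` purely by guessing (p = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# e.g. 12 correct answers out of 16 trials
print(f"p = {abx_p_value(12, 16):.4f}")  # ~0.0384, below the usual 0.05 threshold
```

Note this only tells you the chance of the score arising by guessing; it says nothing about training, controls, or equipment, which is exactly where jj's criticisms bite.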
Would it help if I put air-quotes around "guilty" and "charged"?
Quote from: krabapple on 23 November, 2012, 10:09:07 PM
Would it help if I put air-quotes around "guilty" and "charged"?
Not that it has anything to do with attracting TOS8 bashing, but you should suggest a new title that is compliant with TOS #6. The current one doesn't make the grade, with or without scary quotes.
Also, if we're talking about forum policy, this discussion belongs in site related discussion, not listening tests. Please read the subforum descriptions if you haven't already.
How does what I said have anything to do with TOS 8 bashing?
I am frankly surprised at the apparent offense taken to what I said. I'm simply describing standard practice.
Quote
Also, if we're talking about forum policy, this discussion belongs in site related discussion, not listening tests. Please read the subforum descriptions if you haven't already.
Seems to me it's a bit of both, and Listening Tests is the more specific of the two. But feel free to move it wherever you think it fits best.
One of the big failures of that kind of testing is the forced ranking. Such tests assume that relative rankings are transitive. We all know better.
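The intransitivity point can be made concrete: three listeners can each hold perfectly consistent individual rankings of codecs A, B and C, yet the group's pairwise majority preference forms a cycle, so no single forced ranking represents the panel. A small sketch with invented ballots (the classic Condorcet cycle):

```python
# Three hypothetical listeners, each ranking codecs A, B, C (best first)
ballots = [("A", "B", "C"),
           ("B", "C", "A"),
           ("C", "A", "B")]

def majority_prefers(x: str, y: str) -> bool:
    """True if more listeners rank x above y than y above x."""
    wins = sum(b.index(x) < b.index(y) for b in ballots)
    return wins > len(ballots) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
# All three print True: A beats B, B beats C, C beats A -- a cycle,
# so any forced total ranking misrepresents the group's preferences.
```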
On the other hand, listening testers might self-select anyway, so those who go to the trouble of taking such tests may well find a request for additional documentation of their listening experience, training, etc. to be reasonable. And such documentation would be extremely useful for treating HA test results as an adjunct to clinical-/institutional-grade listening tests of the sort that jj describes.
Sorry, "transitive" doesn't describe your central idea well enough, and I'm quite sure people will interpret it in different (read: wrong) ways. You're questioning not only HA's methodology but the whole of ABC/HR, and hence all the previous tests that were used for standardization of lossy encoders. But that's not an issue. Everybody is free to believe and express ideas freely.
With the talk about "including positive and negative controls", isn't this base already covered? We've been including low and high anchors for a while now. Is there more to this criticism than just the two forms of anchor?
A negative control is A vs. A, presented as ABX or ABC/HR, of course. If that's what you mean by 'high anchor', that's good. A positive control might be a low anchor, but you would then perhaps want multiple anchors. So anchors that are not tests of identity can all be positive controls IF they should all be audible. Basically, you want a positive control at a level equal to your desired test sensitivity. Yes, I know, this isn't the easiest thing in the world to spec. But any test result has to show the results of the controls. Anchors are generally for a different purpose, that of relating one test to another, of course.
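The point about matching the positive control to the desired sensitivity can be quantified. Under a simple binomial model (assumed here for illustration, with independent trials), you can ask: if a listener genuinely hears an impairment some fraction of the time, how often will they actually clear the usual p < 0.05 bar?

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def min_hits_for_significance(trials: int, alpha: float = 0.05) -> int:
    """Smallest hit count whose one-sided p-value (vs. 50% guessing) is < alpha."""
    for k in range(trials + 1):
        if binom_tail(trials, k, 0.5) < alpha:
            return k
    return trials + 1  # alpha unreachable with this few trials

def power(trials: int, hit_rate: float, alpha: float = 0.05) -> float:
    """Chance a listener with the given per-trial hit rate passes the test."""
    k = min_hits_for_significance(trials, alpha)
    return binom_tail(trials, k, hit_rate)

# e.g. a positive control heard correctly 70% of the time, over 16 trials
print(f"power = {power(16, 0.7):.2f}")
```

With these illustrative numbers, a control that is genuinely audible 70% of the time yields significance only about 45% of the time in 16 trials, which is why a failed positive control doesn't automatically condemn the listener, and why the control results have to be reported alongside the main result.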