HydrogenAudio

Hydrogenaudio Forum => Listening Tests => Topic started by: krabapple on 2012-11-23 18:01:49

Title: Should HA promote a more rigorous listening test protocol?
Post by: krabapple on 2012-11-23 18:01:49
I was taken aback to read today this exchange on gearslutz, from earlier this year 

Quote
It's important to understand that what JJ considers a listening test and what the ABX/Hydrogen Audio skeptics crowd considers a listening test are two very, very different things.


Quote
Perhaps JJ can explain what he considers a listening test and how it's different from the Hydrogenaudio standpoint.
I was somehow under the impression they were not that different.


Quote
Including positive and negative controls, lots of training for the test as well as familiarity with the equipment and music, and equipment validation are the biggies.

Test evaluation might be an issue, too. Many tests, including some of the MPEG tests and BS.1116, assume that the entire population reacts the same way to impairments. While basic masking is universal, what people dislike when they can hear something is NOT universal.



http://www.gearslutz.com/board/7672621-post329.html
http://www.gearslutz.com/board/7674886-post337.html
http://www.gearslutz.com/board/7677113-post348.html


Now, I agree with Kees -- I don't think the HA community 'take' on listening tests is that different from what JJ mentions.  Few here, I suspect, would dismiss the real utility of training, or of positive controls, or familiarity, etc., in making a listening test maximally sensitive.  (As for the rest, I confess I'm not really clear whether JJ's criticism of test evaluation is directed at HA.)

What I think is happening is a difference in what listening tests are used for.  Most individual HA reports of ABX tests are from users wanting to know if file X sounds different from file Y to them, as they are now, using the equipment they have, not as they would be after training to hear artifacts, on the most revealing equipment.  They aren't doing basic research into a difference's audibility, as JJ did, for example, when developing lossy codecs.  For that purpose, trained listeners, positive & negative controls, familiarity and 'validated' equipment are necessities.

Still, HA *does* host mass listening tests from time to time -- which are more akin to 'basic research' --  and its few 'official' guidelines on setting up listening tests  -- the HA wiki, and Pio's sticky threads -- make no mention of training, +/- controls, etc.  as factors in such tests.     

Time to change this?
Title: Should HA promote a more rigorous listening test protocol?
Post by: saratoga on 2012-11-23 18:16:23
Lots of the personal listening tests are by people with considerable training. As for ABX tests by members, the goal is to determine if a given file or system is good enough for that individual. In this case training may not even be desirable, let alone necessary. I think it comes down to what you want to measure and how you analyze the results.
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-23 18:24:54
Pio's post does mention treating ABX tests as practice trials, so training is touched upon at least indirectly.

I don't see that we should go out of our way to engage in some debate by proxy.  Maybe those players who are members here can have the debate here. Those who are not members can certainly join so long as they do so in compliance with our rules, namely TOS12.  Personally I'm not interested in advocating for yet another thread dedicated to trolling TOS8, however (somewhat related to those who aren't welcome back per TOS12).
Title: Should HA promote a more rigorous listening test protocol?
Post by: ExUser on 2012-11-23 22:16:51
With all due respect to Mr. J., while his criticism of many of our public mass listening tests is valid, we do not stick to that approach dogmatically. Our intent in those tests is to get ordinary-citizen feedback regarding codec quality. All of his criticisms can be addressed without violating TOS8 in any way, they just make tests harder to conduct. We're aiming to maximize the audience we get feedback from, not maximize the quality of the results. Furthermore, as an Internet-based community, some criticisms are nigh impossible to address, such as equipment validation.

I think the criticisms were made in good faith without the intent of demeaning what we do. I fear that this thread will become divisive.
Title: Should HA promote a more rigorous listening test protocol?
Post by: krabapple on 2012-11-24 03:02:04
Pio's post does mention treating ABX tests as practice trials, so training is touched upon at least indirectly.

I don't see that we should go out of our way to engage in some debate by proxy.  Maybe those players who are members here can have the debate here. Those who are not members can certainly join so long as they do so in compliance with our rules, namely TOS12.  Personally I'm not interested in advocating for yet another thread dedicated to trolling TOS8, however (somewhat related to those who aren't welcome back per TOS12).



That  gearslutz thread wasn't a debate about HA's practices -- it was about 'Mastered for iTunes'.  The three posts from March were a minor and fleeting sidenote there  -- but one IMO obviously pertinent to HA, for reasons of personnel (2 of 3 'players' being respectable HA posters too), and content.  I'd certainly have posted about it here earlier, if I'd read it earlier.  Certainly I hope Kees and JJ will both participate in this thread.

I'm not advocating or encouraging TOS8 trolling, and I honestly don't see how you went from what I posted to that.  And I haven't a clue who you are referring to re: TOS12.  Kees?  JJ?  Bob Ohlsson?  I do hope Kees and JJ will both participate in this thread!  Guilty as charged, if that's the charge.
Title: Should HA promote a more rigorous listening test protocol?
Post by: krabapple on 2012-11-24 03:09:07
With all due respect to Mr. J., while his criticism of many of our public mass listening tests is valid, we do not stick to that approach dogmatically. Our intent in those tests is to get ordinary-citizen feedback regarding codec quality. All of his criticisms can be addressed without violating TOS8 in any way, they just make tests harder to conduct. We're aiming to maximize the audience we get feedback from, not maximize the quality of the results. Furthermore, as an Internet-based community, some criticisms are nigh impossible to address, such as equipment validation.


Sounds reasonable to me.  Where I'm heading is a discussion of whether there should be a revision of whatever formal HA guidelines exist for conducting listening tests.  And before someone snarks 'knock yourself out': yes, I'm willing to help craft such a revision... *after* discussion.


Quote
I think the criticisms were made in good faith without the intent of demeaning what we do. I fear that this thread will become divisive.



I'm already a bit perplexed (not defensive) at the responses.  People are concerned that a discussion about listening test rigor as defined by JJ will devolve into TOS8 and TOS12 violations?  Seriously?  Is it because I used the phrase 'guilty as charged'?  Would it help if I put air-quotes around "guilty" and "charged"?
Title: Should HA promote a more rigorous listening test protocol?
Post by: krabapple on 2012-11-24 03:10:43
Lots of the personal listening tests are by people with considerable training. As for ABX tests by members, the goal is to determine if a given file or system is good enough for that individual. In this case training may not even be desirable, let alone necessary. I think it comes down to what you want to measure and how you analyze the results.


That, and how you interpret them.  What claims you make from them.
Title: Should HA promote a more rigorous listening test protocol?
Post by: ExUser on 2012-11-24 03:39:22
Honestly, I think our procedure is fine, given what we're trying to achieve. We get statistically significant results. There's no need to change anything. We can run tests with altered procedure, should there be a desire, but what would the goal of such a test be?
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-24 04:00:08
My concern that people who come here to argue that TOS8 is for fools are attracted to threads such as this one is hardly unfounded. That said, I hope this discussion doesn't devolve into that.

About the external discussion which inspired this one, I didn't bother to look at it.  As for TOS #12, perhaps it's something not apparent to non-staff. Like everyone else, I would like to see constructive participation and welcome new members.  My invitation only extends to truly new members. Those who have been previously banned need not apply.

EDIT: Since my caveat about TOS #12 seemed to stir a mini shitstorm, let me be clear: any member here who is able to post freely is in good standing.  Kees and JJ are fine.  As far as I am aware, Bob Ohlsson has never registered here and is perfectly free to do so (I never said nor implied otherwise, unless he was indeed previously banned).
Title: Should HA promote a more rigorous listening test protocol?
Post by: krabapple on 2012-11-24 13:07:28
Honestly, I think our procedure is fine, given what we're trying to achieve. We get statistically significant results. There's no need to change anything. We can run tests with altered procedure, should there be a desire, but what would the goal of such a test be?



Some of the caveats, I would think, apply more to 'no difference' results than to statistically significant (positive) results.
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-24 16:36:31
Would it help if I put air-quotes around "guilty" and "charged"?

Not that it has anything to do with attracting TOS8 bashing, but you should suggest a new title that is compliant with TOS #6. The current one doesn't make the grade, with or without scary quotes.

Also, if we're talking about forum policy, this discussion belongs in site related discussion, not listening tests. Please read the subforum descriptions if you haven't already.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Axon on 2012-11-25 07:04:19
There's a tradeoff going on here.

On the one hand, reducing the barriers for Joe Sixpack forum readers to contribute listening test results is extremely important, and the policies of HA's listening tests have been very, very good on that front.

On the other hand, listening testers might self-select anyway, so those who go to the trouble of taking such tests may very well find a request for additional documentation of their listening experience, training, etc. to be reasonable. And such documentation would make it much easier to use HA test results as an adjunct to clinical-/institutional-grade listening tests of the sort that jj describes.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-25 08:20:50
OK, I'm a little confused here. How does what I said have anything to do with TOS 8 bashing? I'm asking for better tests, and yes, there should ALWAYS be positive and negative controls in a test, and no, they aren't that hard to add, and yes, you can add them at varying levels of positive control and get some very useful information. So you should. I'm standing absolutely firm on that position, because in my capacities as reviewer or editor I see so many tests whose results I can't even evaluate, tests that have no way to be related to other sets of results in any fashion. (No, I don't mean you should combine results.)

As to evaluating along multiple axes, that's only for tests that do more than "can you detect" testing, obviously.  I am known to be a very serious unfan of the "impaired signal multiple choice" tests people are using these days. (I am avoiding the name of the popular test; I've been accused of stealing a trademark once when I mentioned the name of this test in a critical fashion.)  One of the big failures of that kind of testing is the forced ranking. Such tests assume that relative rankings are transitive. We all know better.

I am frankly surprised at the apparent offense taken to what I said. I'm simply describing standard practice.
Title: Should HA promote a more rigorous listening test protocol?
Post by: krabapple on 2012-11-25 15:34:35
Would it help if I put air-quotes around "guilty" and "charged"?

Not that it has anything to do with attracting TOS8 bashing, but you should suggest a new title that is compliant with TOS #6. The current one doesn't make the grade, with or without scary quotes.



OK, how about, 'Should HA promote a more rigorous listening test protocol'?


Quote
Also, if we're talking about forum policy, this discussion belongs in site related discussion, not listening tests. Please read the subforum descriptions if you haven't already.


Seems to me it's a bit of both, and Listening Tests is the more specific of the two. But feel free to move it wherever you think it fits best.
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-25 16:31:45
How does what I said have anything to do with TOS 8 bashing

My reply may sound defensive, but I don't care.  Please show me where I called any particular individual out on bashing TOS8.  You can't, because I didn't.  If I didn't make myself clear enough earlier: I don't want this thread to attract yet another set of placebophile trolls who will railroad the discussion into another referendum on TOS8.  I could link discussions and name names if you want, but I don't see the point, except to demonstrate that you and Kees do not provide cause for concern.

Quote
I am frankly surprised at the apparent offense taken to what I said. I'm simply describing standard practice.

I agree with you in principle, but I am frankly surprised you haven't taken the opportunity to talk about it here; rather, you seem to only talk about it in forums which either don't require objective evidence or, worse, forums where this criterion is rejected and even shunned by a sizable portion of its more vocal and respected members.

Hopefully this thread will prove me wrong, assuming that I'm not wrong already, though I've closely followed this forum and your contributions in particular for many years now.
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-25 17:17:30
Quote
Also, if we're talking about forum policy, this discussion belongs in site related discussion, not listening tests. Please read the subforum descriptions if you haven't already.


Seems to me it's a bit of both, and Listening Tests is the more specific of the two. But feel free to move it wherever you think it fits best.

You're right, it is a bit of both.  Thanks for the updated title.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Porcus on 2012-11-26 07:25:30
I agree with Axon, if that is what is being discussed (which is also a bit unclear to me).  If it is the public listening tests, then they seem not to have the scope of e.g. identifying annoyances in order to address them in development. Any reason that they should?
Title: Should HA promote a more rigorous listening test protocol?
Post by: 2Bdecided on 2012-11-26 12:58:43
Do that many tests meet BS.1116? It's a long time since I read it, but IIRC the requirements for the listening room, number of listeners and trials, selection of test content, and training all pose a challenge.

Certain organisations find it far easier to provide a suitable listening room.
Certain groups of people find it far easier to identify problem samples.

Cheers,
David.
Title: Should HA promote a more rigorous listening test protocol?
Post by: dhromed on 2012-11-26 13:22:36
I am frankly surprised that there is no sticky at the top of the Listening Tests forum that explains what a reasonably good listening test entails, how to set it up, and how to present the results.
Title: Should HA promote a more rigorous listening test protocol?
Post by: IgorC on 2012-11-26 17:14:08
Great. A lot of problem statements.
Now people can start making propositions and formulating alternative solutions.

As a reminder, Hydrogenaudio is a community built purely on enthusiasts' resources.
So if somebody has a real proposal and is eager to work on it in his/her spare time for free: welcome.

One of the big failures of that kind of testing is the forced ranking. Such tests assume that relative rankings are transitive. We all know better.

Sorry, "transitive" alone doesn't describe your central idea well enough, and I'm quite sure people will interpret it in different (read: wrong) ways.
You're questioning not only HA's methodology but the whole ABC/HR approach, hence all previous tests that were used for standardization of lossy encoders. But that's not an issue. Everybody is free to believe and to express ideas freely.

Hydrogenaudio, like the rest of the internet, is for free speech, so if you have ideas you can start to work on them and share them. We are open to talking about anything, but someone has to start working on it and take the next steps.

On the other hand, listening testers might self-select anyways, so that those who go to the trouble to take such tests may very well find the request for additional documentation of their listening experience, training, etc. to be reasonable. And such documentation would be extremely useful to use HA test results as an adjunct for clinical-/institutional-grade listening tests, of the sort that jj describes.

You are simply not aware of the fact that the documentation was provided: http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/readme.txt

And that whole job was done with every single participant!
You simply didn't know that.



We all have suggestions, now does anybody want to work on them? Huh?
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-26 17:30:00
Krabapple, the author of this discussion, did in fact graciously offer his time and effort towards improvement.
Title: Should HA promote a more rigorous listening test protocol?
Post by: ExUser on 2012-11-26 17:38:11
With the talk about "including positive and negative controls", isn't this base already covered? We've been including low and high anchors for a while now. Is there more to this criticism than just the two forms of anchor?
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-27 01:27:56
Sorry, "transitive" alone doesn't describe your central idea well enough, and I'm quite sure people will interpret it in different (read: wrong) ways.
You're questioning not only HA's methodology but the whole ABC/HR approach, hence all previous tests that were used for standardization of lossy encoders. But that's not an issue. Everybody is free to believe and to express ideas freely.


I'm doing no such thing. ABC/hr does individual rankings, not confusing things like the tests with 4 anchors and 10 probe conditions that ask you to rank the lot of them on one scale.  I'm not talking about ABC/hr or BS.1116, although I do have some questions about some of the evaluations following some 1116 tests.

So what are you talking about?

ETA: greynol, this is why I hesitate to say anything here. Just like in audiophile forums, it seems that anything you say can and will be used against you, even if you didn't say it.  In case you weren't aware, I'm tired of audio, tired of audio enthusiasts of all sorts, and multiply tired of the people who like to grind axes.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-27 01:32:49
With the talk about "including positive and negative controls", isn't this base already covered? We've been including low and high anchors for a while now. Is there more to this criticism than just the two forms of anchor?


A negative control is A vs. A, presented as ABX or ABC/hr, of course.  If that's what you mean by 'high anchor', that's good.

A positive control might be a low anchor, but you would then perhaps want multiple anchors. So anchors that are not tests of identity can all be positive controls IF they should all be audible.

Basically, you want a positive control of a level equal to your desired test sensitivity. Yes, I know, this isn't the easiest thing in the world to spec.

But any test result has to show the results of the controls.

Anchors are generally for a different purpose, that of relating one test to another, of course.
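
To make that concrete, here is a minimal sketch of a condition list with both kinds of control. All names and impairment choices below are hypothetical, not taken from any actual HA test:

# Hypothetical condition list for a single test -- illustrative only.
conditions = {
    # Negative control: the reference presented against itself.
    # Should NOT be distinguishable beyond chance.
    "hidden_reference": {"stimulus": "reference", "expect_audible": False},
    # Positive controls: known-audible impairments, graded so the mildest
    # sits near the sensitivity the test is supposed to demonstrate.
    "control_coarse": {"stimulus": "obvious impairment", "expect_audible": True},
    "control_fine":   {"stimulus": "just-audible impairment", "expect_audible": True},
    # Systems actually under test.
    "codec_A": {"stimulus": "codec A output", "expect_audible": None},
    "codec_B": {"stimulus": "codec B output", "expect_audible": None},
}

Any published result would then report the control outcomes alongside the codec scores: a "heard" negative control or a missed positive control flags the session.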
Title: Should HA promote a more rigorous listening test protocol?
Post by: Dynamic on 2012-11-27 14:05:13
A negative control is A vs. A, presented as ABX or ABC/hr, of course.  If that's what you mean by 'high anchor', that's good.

A positive control might be a low anchor, but you would then perhaps want multiple anchors. So anchors that are not tests of identity can all be positive controls IF they should all be audible.

Basically, you want a positive control of a level equal to your desired test sensitivity. Yes, I know, this isn't the easiest thing in the world to spec.

But any test result has to show the results of the controls.

Anchors are generally for a different purpose, that of relating one test to another, of course.


I think I understand now. We're talking about Control as in Control Condition in a Controlled Experiment, where the Control is used to compare against the Test Condition.

Negative Control in this case does not refer to negative or positive numbers, but to a Null Condition where no difference should be expected.
This means that the Negative Control is there to catch False Positives (where listeners falsely detect non-transparency).
We are comparing the original sample (or possibly the high anchor) with itself, so should expect no difference. This eliminates testers who claim to discern a difference when they cannot, but might believe they can because of expectation bias or something similar and also those who might be tempted to score somewhat at random.

All the recent HA public listening tests include in their methodology a method of excluding results for any sample & tester in which the reference sample is rated for impairment. Given that ABC/HR is used, in the case of uncertainty (i.e. non-obvious flaws) a tester should be performing an ABX to verify that a difference is discernible before committing their ranking.

I think it's then obvious that the meaning of Positive Control is a sample that should be obviously inferior to the reference to all listeners, but not necessarily inferior to all the samples under test.
The Positive Control is there principally to catch False Negatives (where people think a sample is transparent when it isn't).

It's difficult to get the idea that 'negative' = 'bad' out of one's mind. In this case 'negative' means 'good' as in 'unable to detect the difference from the reference'.

In some cases, it's a low-pass filtered sample. In the case of the recent speech codec comparisons conducted by Google to evaluate Opus (in its SILK and Hybrid modes) versus other speech codecs, there was both a 3.5 kHz LPF and a 7 kHz LPF in the test which should function as a Positive Control and something of an anchor to provide comparison between different listening tests.

In recent HA tests the low anchor has consistently been scored low by all participants who weren't excluded, if I recall correctly, which tends to indicate that False Negatives (false transparency results) have been excluded. Usually the low anchor is below all the tested codecs on every sample. There may be scope for using an intermediate anchor whose quality should fall consistently in about the range of impairments expected by the codecs under test. The problem may be that the nature of impairment is consistent, making it too easy to detect the anchor.

We usually do plot the low anchor in HA public listening tests, but not the reference, though one or two tests did use a high anchor that was not the original audio and plotted it. Where ranked references result in exclusion from the results, the screened results will obviously place the Negative Control (for False Positives) at the screening level (typically 5.0), making a plot of these values trivial.
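
As a minimal sketch of that screening rule in code (the field names and the 5.0 threshold here are just illustrative):

def screen(scores, ref="hidden_reference", top=5.0):
    """Drop a listener's results for a sample if they rated the hidden
    reference as impaired, i.e. below the top of the 1.0-5.0 scale."""
    if scores.get(ref, top) < top:
        return None   # false positive on the negative control
    return scores     # otherwise the ratings stand

# screen({"hidden_reference": 5.0, "codec_A": 3.8})  -> kept
# screen({"hidden_reference": 4.2, "codec_A": 3.8})  -> None (excluded)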
Title: Should HA promote a more rigorous listening test protocol?
Post by: krabapple on 2012-11-27 14:31:32
ETA: greynol, this is why I hesitate to say anything here. Just like in audiophile forums, it seems that anything you say can and will be used against you, even if you didn't say it.  In case you weren't aware, I'm tired of audio, tired of audio enthusiasts of all sorts, and multiply tired of the people who like to grind axes.



Fair enough, but based on what I've read of his in other forums, Bob Ohlsson was grinding an anti-HA, anti-skeptic axe in that Gearslutz post you replied to. But that wouldn't have been enough to get me to post about it here.  It was your response that did that, and the realization that HA really doesn't have a formal listening test protocol to point people to, even though we surely must be one of the foremost anti-woo audio forums on the Web.


Title: Should HA promote a more rigorous listening test protocol?
Post by: krabapple on 2012-11-27 14:40:51
I think I understand now. We're talking about Control as in Control Condition in a Controlled Experiment, where the Control is used to compare against the Test Condition.

Negative Control in this case does not refer to negative or positive numbers, but to a Null Condition where no difference should be expected.
This means that the Negative Control is there to catch False Positives (where listeners falsely detect non-transparency).
We are comparing the original sample (or possibly the high anchor) with itself, so should expect no difference. This eliminates testers who claim to discern a difference when they cannot, but might believe they can because of expectation bias or something similar and also those who might be tempted to score somewhat at random.


very simply:
A negative control is a set of experimental conditions that should not show the effect you're interested in.

A positive control is one that should.

You run both of these alongside the 'real' experiments you're doing. If one or both of them go wrong, your other results are questionable.







Title: Should HA promote a more rigorous listening test protocol?
Post by: IgorC on 2012-11-27 16:43:53
I'm doing no such thing. ABC/hr does individual rankings, not confusing things like the tests with 4 anchors and 10 probe conditions that ask you to rank the lot of them on one scale.  I'm not talking about ABC/hr or BS.1116, although I do have some questions about some of the evaluations following some 1116 tests.

And how should somebody understand that from your summarized post? We're not privy to all the details of each other's thinking.
You compressed a big concept that requires a lot of description into just one word, "transitive". There are a lot of people here from different disciplines, and not everybody will understand this term. I even connected to #hydrogenaudio to ask if it's only me who didn't fully understand what people (including you) were talking about. Well, there were some other guys who didn't get your idea either. Add to that, we are not all native English speakers and our conversation is not face to face. Only a small fraction of a message actually arrives at its destination.

If you are really serious about it, then please describe your idea in more detail.

And I wish we could get something important clear. People speak here about HA listening tests as if they implied an organization of many people: admins, a conductor, etc. As if a whole entity were involved. You are making statements thinking that there is _somebody_ at the top of HA who will make a decision for all of us and get the wish list done.

Now the truth: the idea to conduct the last two public listening tests came from one single person. And I'm very thankful to the two other guys who actually made it possible.

There are no "HA public tests". There is a single person at one particular moment who says "hey, let's do it", and this person is trying to tell you: "hey, we would need to talk a lot about it. Talk to me."
Title: Should HA promote a more rigorous listening test protocol?
Post by: Porcus on 2012-11-27 17:12:24
You compressed a big concept that requires a lot of description into just one word, "transitive". There are a lot of people here from different disciplines, and not everybody will understand this term. I even connected to #hydrogenaudio to ask if it's only me who didn't fully understand what people (including you) were talking about. Well, there were some other guys who didn't get your idea either.


If anyone needs:

We have that if a>b and b>c, then a>c. That is transitivity for the > relation.
The = relation is also transitive: if a=b and b=c then a=c.

The “approximately equal to” relation is not transitive. Or call it “not far from”, to make it a bit more obvious: if a is not far from b, and b is not far from c, that does not rule out that a and c are far from each other. You would expect this with any relation which is “not far from” in the appropriate sense, such as “statistically tied to”: we can have a tied to b and b tied to c, yet not necessarily a tied to c. And here's one more: just because you cannot ABX a from b, and you cannot ABX b from c, it might still be that you can actually ABX a from c.

Also, you might encounter another issue: If you compare a and b and b is subjectively better, and you compare b and c and c is subjectively better, then you should have that c is better than a, right? Not always so in real-world experiments. That's one thing you might want to test for.
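
A toy numerical illustration of the ABX case (the threshold and positions are arbitrary, purely for illustration):

JND = 1.0                   # hypothetical just-noticeable difference
a, b, c = 0.0, 0.8, 1.6     # positions of three stimuli on some quality axis

def indistinguishable(x, y):
    return abs(x - y) < JND

print(indistinguishable(a, b))   # True:  a ~ b
print(indistinguishable(b, c))   # True:  b ~ c
print(indistinguishable(a, c))   # False: yet a and c are distinguishable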
Title: Should HA promote a more rigorous listening test protocol?
Post by: ExUser on 2012-11-27 19:21:12
There's a concept that might be useful: "transitive within an error margin"? I wonder how you'd prove something like that.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-27 22:03:07
We usually do plot the low anchor in HA public listening tests, but not the reference, though one or two tests did use a high anchor that was not the original audio and plotted it. Where ranked references result in exclusion from the results, the screened results will obviously place the Negative Control (for False Positives) at the screening level (typically 5.0), making a plot of these values trivial.


In a test like the ones you run, you want multiple positive controls, of different degrees of impairment.


That way you can tell how good the subject+setup was.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-27 22:05:08
Also, you might encounter another issue: If you compare a and b and b is subjectively better, and you compare b and c and c is subjectively better, then you should have that c is better than a, right? Not always so in real-world experiments. That's one thing you might want to test for.


Indeed, and this is a known, real problem.

The problem of a~b and b~c but a !~ c is of course also real and, as you said, nontrivial to discover.
Title: Should HA promote a more rigorous listening test protocol?
Post by: IgorC on 2012-11-28 00:17:09
Let's suppose two separate tests and 3 codecs.
Test 1:
A - 4.0 (perceptible but not annoying)
B - 3.0 (slightly annoying)

Test 2:
C - 3.5 (very slightly annoying, or a bit annoying (?))
B - 3.0 (slightly annoying)

For one particular listener:
Given that he/she applies the same scale (1.0-5.0, very annoying to imperceptible) to both tests, it's not at all invalid to think that A > C for him/her.  A listener with a certain amount of experience already has his own criteria, which he applies to all samples: "OK, if it's not that bad I give it 4.0. If a sample has this sort of artifact I give it 3.0, but my ears are more tolerant of another type of artifact (3.5)," etc.


P.S.
Now, what if there is more than one listener?
http://www.acourate.com/Download/BiasesInModernAudioQualityListeningTests.pdf
http://s18.postimage.org/u94p9yx7t/ranking.png



Title: Should HA promote a more rigorous listening test protocol?
Post by: IgorC on 2012-11-28 01:12:26
If anyone needs:

We have that if a>b and b>c, then a>c. That is transitivity for the > relation.
The = relation is also transitive: if a=b and b=c then a=c.

The “approximately equal to” relation is not transitive. Or call it “not far from”, to make it a bit more obvious: if a is not far from b, and b is not far from c, that does not rule out that a and c are far from each other. You would expect this with any relation which is “not far from” in the appropriate sense, such as “statistically tied to”: we can have a tied to b and b tied to c, yet not necessarily a tied to c. And here's one more: just because you cannot ABX a from b, and you cannot ABX b from c, it might still be that you can actually ABX a from c.

Also, you might encounter another issue: If you compare a and b and b is subjectively better, and you compare b and c and c is subjectively better, then you should have that c is better than a, right? Not always so in real-world experiments. That's one thing you might want to test for.

Thank you. I remember now; I studied it in math.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-28 03:03:39
Let's suppose two separate tests and 3 codecs.
Test 1:
A - 4.0 (perceptible but not annoying)
B - 3.0 (slightly annoying)

Test 2:
C - 3.5 (very slightly annoying, or a bit annoying (?))
B - 3.0 (slightly annoying)

For one particular listener:
Given that he/she applies the same scale (1.0-5.0, very annoying to imperceptible) to both tests, it's not at all invalid to think that A > C for him/her.  A listener with a certain amount of experience already has his own criteria, which he applies to all samples: "OK, if it's not that bad I give it 4.0. If a sample has this sort of artifact I give it 3.0, but my ears are more tolerant of another type of artifact (3.5)," etc.


P.S.
Now, what if there is more than one listener?
http://www.acourate.com/Download/BiasesInModernAudioQualityListeningTests.pdf
http://s18.postimage.org/u94p9yx7t/ranking.png


You're showing a different problem here.
Title: Should HA promote a more rigorous listening test protocol?
Post by: IgorC on 2012-11-28 04:00:57
Indeed, it's a different one. I took just one problem to show how easy it is to pose a perfectly logical issue, and how hard it will actually be to solve, or at least minimize, in reality.

But that doesn't stop the tests from showing, for example, that HE-AAC is superior to Vorbis at 64 kbps in 3 different HA public tests, organized by 3 different members at 3 different times.

Do you really believe that some extra controls will substantially change the results?
What if we perform the same test twice (once as an HA test and once with some extra controls)?
It will be interesting to hear your opinion.

And what do you expect from a public test performed via the internet? Are you familiar with how that works in a real scenario?  Please, no offense. It's important to me.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-28 05:40:50
Do you really believe that some extra controls will substantially change the results?


This would seem to indicate that you don't understand what control conditions are for.

Don't you want to know what your test measured?

I don't see why this is so hard. Just add a set of positive controls with different sensitivities, and enough negative controls to confirm that they don't get answered beyond chance.
Title: Should HA promote a more rigorous listening test protocol?
Post by: IgorC on 2012-11-28 07:03:49
Got it.
The idea of positive and negative controls is actually good. It's similar to MPEG's rules for post-screening:

Quote
Post-screening of listener responses should be applied as follows. If, for any test item in a given test, either of the following criteria is not satisfied:
•   The listener score for the hidden reference is greater than or equal to 90 (i.e. HR >= 90)
•	The listener's scores for the hidden reference, the 7.0 kHz lowpass anchor and the 3.5 kHz lowpass anchor are monotonically decreasing (i.e. HR >= LP70 >= LP35).
Then all listener responses in that test are removed from consideration.
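
The quoted rules translate almost directly into code. A sketch, with scores on the 0-100 scale the rules assume (the function name is mine, not MPEG's):

def keep_responses(hr, lp70, lp35):
    """Post-screening per the quoted rules: keep a listener's responses
    for a test item only if the hidden reference scores at least 90 and
    the scores decrease monotonically: HR >= LP70 >= LP35."""
    return hr >= 90 and hr >= lp70 >= lp35

# keep_responses(95, 60, 30) -> True
# keep_responses(85, 60, 30) -> False (hidden reference rated too low)
# keep_responses(95, 30, 60) -> False (anchors not monotonic)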
 

Only one thing: the inclusion of a hidden reference (negative control) will probably reduce the number of codecs we can test.
But that's open for discussion.


Also, it's worth noticing that the rules for the last two public tests are hard to call toothless, despite there being only one low anchor and no hidden reference. If you look through the results, the listeners who were guessing quite often picked the reference a few times, and by the rules all their results were invalidated (probably the same effect as having an additional hidden reference).
rules.txt is in the folder "Sorted by listener":
http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/miscellaneous/results.zip

After all, I think your technique is not that far from the one we were applying. Two low anchors are actually great, IMO.

P.S. Went to sleep
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-28 07:15:36
Not really JJ's technique, but that which is commonly used in the industry. When working under contract, my blind tests included the reference and two low anchors.
Title: Should HA promote a more rigorous listening test protocol?
Post by: 2Bdecided on 2012-11-28 11:05:35
I agree that using controls is necessary in a proper listening test. I wouldn't argue against anything that JJ said.

However, "casual" testers (whatever they are!) must remember that using the wrong controls in a not-quite-proper listening test could be worse than nothing. e.g. Using a low anchor that's too low will provide a positive control, but can wreck all the other answers by making them "bunch up" towards the top of the scale.

e.g. a 3.5kHz LPF anchor in a test of substantially transparent audio codecs would be idiotic - IMO.

Cheers,
David.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Dynamic on 2012-11-28 14:54:25
Good point, David.

I guess a rough-and-ready pre-test with only one or two listeners and a few samples is likely to be sufficient to place the Positive Controls (anchors) within the same region as the tested codecs, rather than too far below (such as the 128kbps MP3 test (http://listening-tests.hydrogenaudio.org/sebastian/mp3-128-1/results.htm), where the low anchor l3enc was very poor and all contenders tied for quality, though one or two showed more consistent scorings among the samples tested). Perhaps a retest with a better low anchor might untie one or two of them?

Probably a few low-pass filters and/or older codecs could be checked in the pre-test to ensure that any anchors are close to the range expected of the codecs under test.
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-28 15:51:28
If the contenders are statistically tied, changing the anchors isn't going to magically untie them.  Also, having only a few listeners and a few samples doesn't make for very compelling results, especially when the listeners are untrained.

Unlike ABX, where you rely on continued trials to demonstrate that you can consistently distinguish between two things, MUSHRA tests rely on many samples and well-chosen controls to help weed out bad data.  When working with contenders that are near-transparent, a hidden reference makes sense; otherwise it is a poor control that is too easy to identify. The same goes for low anchors if they are too low.
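
For reference, the "continued trials" part of ABX amounts to a one-sided binomial test. A minimal sketch:

from math import comb

def abx_p_value(correct, trials):
    """Probability of scoring at least `correct` out of `trials`
    by guessing alone (p = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(abx_p_value(13, 16))   # ~0.011; commonly accepted as a pass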

When the anchors are too close, low anchors may get ranked better than the contenders. High anchors may get ranked worse. This is not exactly unreasonable.  What needs to be taken seriously is that judging is subjective; not everyone ranks different artifacts the same way.  It could be that the low anchors actually do sound better or the high anchor actually does sound worse.  It is also not unreasonable to get differing rankings between all stimuli based on the specific clips being auditioned.  What may be unreasonable is to dismiss discrepancies like these from the "expected" results as "wrong".

With this in mind, I only take seriously the clear trends in very large tests (many participants and many worthwhile, typical real-life sample clips).  I somewhat reject the idea that all participants must be trained when there are large numbers of them, however.  While the testers should be able to distinguish and categorize them, they should not be steered into thinking one is less desirable than the other.

Lastly, all too often people treat the results of small tests posted here as definitive.  They really aren't.
Title: Should HA promote a more rigorous listening test protocol?
Post by: IgorC on 2012-11-28 16:41:36
e.g. a 3.5kHz LPF anchor in a test of substantially transparent audio codecs would be idiotic - IMO.

Exactly my thoughts. But standardization organizations are interested in testing it because those are widely used bandwidths: NB telephony (3.5 kHz) and WB (7 kHz).

We would probably need two low anchors, like 5 kHz and 8-10 kHz (?)

P.S. It would probably be better if we started using the same lowpass anchors for all public tests.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-28 17:32:04
If the contenders are statistically tied, changing the anchors isn't going to magically untie them.  Also, having only a few listeners and a few samples doesn't make for very compelling results, especially when the listeners are untrained.

Actually, having too low an anchor can make things tie by changing the listeners' scaling of the test results.
Quote
Unlike ABX, where you rely on continued trials to demonstrate that you can consistently distinguish between two things, MUSHRA tests rely on many samples and well-chosen controls to help weed out bad data.  When working with contenders that are near-transparent, a hidden reference makes sense; otherwise it is a poor control that is too easy to identify. The same goes for low anchors if they are too low.

Please don't use that test for near-transparent codecs. It's not appropriate. ABX or ABC/hr are appropriate. But you still need both negative and positive controls.
Quote
Not everyone ranks different artifacts the same way.


That is part of my problem with tests that compare many different codecs simultaneously along only one axis (scale).  But it's only part of the problem. There are many others.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-28 17:35:44
e.g. a 3.5kHz LPF anchor in a test of substantially transparent audio codecs would be idiotic - IMO.

Exactly my thoughts. But standardization organizations are interested in testing it because those are widely used bandwidths: NB telephony (3.5 kHz) and WB (7 kHz).

We would probably need two low anchors, like 5 kHz and 8-10 kHz (?)

P.S. It would probably be better if we started using the same lowpass anchors for all public tests.


Coded anchors with known codec pairs would be better. You want the impairments in the controls to be similar to the impairments you're testing.
Title: Should HA promote a more rigorous listening test protocol?
Post by: greynol on 2012-11-28 18:00:39
Actually, having too low an anchor can make things tie by changing the listeners' scaling of the test results.

True, however, if people actually adhere to the descriptions of the rankings, the locations of the low anchors shouldn't affect the scores of the other samples.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Woodinville on 2012-11-28 18:15:12
Actually, having too low an anchor can make things tie by changing the listeners' scaling of the test results.

True, however, if people actually adhere to the descriptions of the rankings, the locations of the low anchors shouldn't affect the scores of the other samples.


Except that's not how subjects work, and creating any kind of intellectual confusion during a test only makes it worse.
Title: Should HA promote a more rigorous listening test protocol?
Post by: Dynamic on 2012-11-28 19:08:15
If the contenders are statistically tied, changing the anchors isn't going to magically untie them.  Also, having only a few listeners and a few samples doesn't make for very compelling results, especially when the listeners are untrained.


My point was echoing David's about the potential to compress the range of ratings given to codecs vastly superior to the low anchor in order to score the low anchor sufficiently low. This may introduce more rounding error into the ratings and widen the error bars. No magical effect, just a reduction in statistical noise that might improve discrimination at the margin (or at least should make it no worse).

I was suggesting that before the main test (which still has a lot of testers and a lot of samples), appropriately close anchors could be chosen by a short test on only a few samples which rules out anchors that are vastly superior or vastly inferior to the codecs under test. I don't think Woodinville believes it is essential that the anchors be outside the range of the codecs under test (i.e. consistently lower and higher); they could be fairly consistently towards the low end and fairly consistently towards the high end, assuming we used two anchors.

I think Woodinville mentioned the trickiest thing to get right. If we presume that the nature of low-pass filter degradation is too different from the nature of typical codec flaws (warbling, sparklies, tonal problems, transient smear, pre-echo, stereo image problems, etc.), then we'd be looking for anchors instead among other encoders and settings not under test, or from consistent distortions of a similar nature. For example, we might choose a prior-generation codec, even at a slightly higher bitrate, as a lowish anchor. Maybe Lame -V7, for example, or l3enc at 160 kbps -hq or 192 kbps, or toolame at 128 kbps, perhaps, or FhG fastenc MP3 at a setting with Intensity Stereo rather than safe joint stereo. Perhaps a high anchor could be a previous test winner at a slightly higher bitrate where some flaws are still evident (so that it still acts as a Positive Control, i.e. distinguishable from the original audio). There are certain encoders so badly flawed that some testers will immediately identify them, so I suppose Xing (old) with no short blocks or BLADEenc would not be good choices.

It also partly depends on the intention we have in using these close anchors. If it's to compare one listening test quality scale to another, yet to avoid simple low-pass filters, we might wish to use a consistent set of anchors (same codec version and settings) over a number of years, even if one is a high anchor in one test and a low anchor in the next. This can be especially helpful if at least some of the test samples feature in every listening test.

Another potential use of the anchors would be to calibrate and normalize the quality scales used by different listeners, though the validity of this is questionable as some people find pre-echo more annoying than tonal problems, or find stereo collapse less objectionable than high-frequency sparklies for example, while others have the reverse preferences. The preferences here are part of the reason that results can be intransitive.
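
For what it's worth, such a calibration might look like the sketch below, rescaling each listener's ratings so their anchor scores land on common targets. The names and target values are invented, and the caveat above applies: this treats all disagreement as a scaling difference rather than genuine preference.

def calibrate(scores, low="low_anchor", high="high_anchor",
              t_low=2.0, t_high=4.5):
    """Linearly map one listener's ratings so that their low/high anchor
    scores land on the common targets t_low/t_high."""
    lo, hi = scores[low], scores[high]
    if hi <= lo:
        return None  # listener did not separate the anchors; cannot calibrate
    k = (t_high - t_low) / (hi - lo)
    return {name: t_low + (v - lo) * k for name, v in scores.items()}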

Once or twice, anchors have also been used to address a common claim or myth (e.g. that WMA at 64 kbps is as good as MP3 at 128 kbps). For guruboolez, some 80-96 kbps tests used Lame at about 128 kbps as an anchor to assess where the truth lay at the time, to his ears, for example.

I would say, however, that I think the methods of all the recent public tests are pretty darned good and provide useful information about the state of the art at the time.

These discussions might enable some more nuanced conclusions to be drawn and some comparison between the results of one test and the results of another where the same anchor on the same samples has a different rating. However, given the statistical error, there are still limits on what we can conclude.

We need to weigh up whether we'll gain enough by changing methods to be worth the additional effort. That might be an individual matter for the test organiser to choose, given how much valuable work they put in already and how they weigh up the number of codecs under test against other parameters.
Title: Should HA promote a more rigorous listening test protocol?
Post by: IgorC on 2012-11-29 00:04:16
My point was echoing David's about the potential to compress the range of ratings given to codecs vastly superior to the low anchor in order to score the low anchor sufficiently low. This may introduce more rounding error into the ratings and widen the error bars. No magical effect, just a reduction in statistical noise that might improve discrimination at the margin (or at least should make it no worse).

Well, during the last public test we picked a very low-quality low anchor but still got a very wide range for the final ratings. So it didn't have the noticeable impact one might expect.

It also partly depends on the intention we have in using these close anchors. If it's to compare one listening test quality scale to another, yet to avoid simple low-pass filters, we might wish to use a consistent set of anchors (same codec version and settings) over a number of years, even if one is a high anchor in one test and a low anchor in the next. This can be especially helpful if at least some of the test samples feature in every listening test.


...

I would say, however, that I think the methods of all the recent public tests are pretty darned good and provide useful information about the state of the art at the time.

These discussions might enable some more nuanced conclusions to be drawn and some comparison between the results of one test and the results of another where the same anchor on the same samples has a different rating. However, given the statistical error, there are still limits on what we can conclude.

We need to weigh up whether we'll gain enough by changing methods to be worth the additional effort. That might be an individual matter for the test organiser to choose, given how much valuable work they put in already and how they weigh up the number of codecs under test against other parameters.

Completely agree.

As I was one of the main organizers of the last two HA listening tests, I will try to bring some information here from time to time, to give an idea of where we are now and what the possibilities and limitations are.
First of all, we barely got enough results for the last test at 96 kbps.  Link (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=92490&view=findpost&p=780137)
The main source of listeners whose results were accepted as valid (i.e. met the rules) was HA itself, plus some members of the Doom9 forum.
The rest of the internet community failed grossly: multiple rankings of the reference, and basic ignorance of what ABX is, despite all the information and guidelines.
In short, it was HA itself and people closely familiar with or aware of it.

The best scenario is to introduce improvements (additional controls, with an updated guideline) while keeping the listeners we have and not making their lives even more difficult. Otherwise we will be left without listeners and results.

Two low anchors don't cost anything. It's just one additional low anchor, and a participant with normal hearing won't get into trouble.
I have fears about a hidden reference: now a listener must not only rank all the codecs but also keep in mind that one of them is a hidden reference, given that codecs are getting pretty close to transparent at 96 kbps.