Should HA promote a more rigorous listening test protocol?
Reply #47 – 2012-11-28 19:08:15
If the contenders are statistically tied, changing the anchors isn't going to magically untie them. Also, having only a few listeners and a few samples doesn't make for very compelling results, especially when the listeners are untrained.

My point was echoing David's about the potential to compress the range of ratings given to codecs vastly superior to the low anchor, in order to score the low anchor sufficiently low. This compression can introduce more rounding error into the ratings and widen the error bars. Closer anchors offer no magical effect, just a reduction in statistical noise that might improve discrimination at the margin (or at least should make it no worse).

I was suggesting that before the main test (which still has a lot of testers and a lot of samples), appropriately close anchors could be chosen by a short test on only a few samples, ruling out candidate anchors that are vastly superior or vastly inferior to the codecs under test.

I don't think Woodinville believes it essential that the anchors lie strictly outside the range of the codecs under test (i.e. consistently lower and higher); they could instead sit fairly consistently towards the low end and fairly consistently towards the high end, assuming we used two anchors.

I think Woodinville mentioned the trickiest thing to get right. If we presume that the nature of low-pass filter degradation is too different from the nature of typical codec flaws (warbling, sparklies, tonal problems, transient smear, pre-echo, stereo image problems etc.), then we'd be looking for anchors instead among other encoders and settings not under test, or from consistent distortions of a similar nature. For example, we might choose a prior-generation codec, even at a slightly higher bitrate, as a lowish anchor: LAME -V7, perhaps, or l3enc at 160 kbps -hq or 192 kbps, or toolame at 128 kbps, or FhG fastenc MP3 at a setting with Intensity Stereo rather than safe joint stereo.
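The rounding-error point can be illustrated with a quick simulation (a minimal sketch: the 0.1-step rating grid, the scores, and the listener-noise spread are invented for illustration, not taken from any real test). Squeezing the same pair of codecs into the top of the scale makes their true gap small relative to both listener noise and the rating grid:

```python
import random

random.seed(0)

def simulated_ratings(true_score, spread, n, step=0.1):
    """n noisy listener ratings around true_score, clipped to the
    1-5 scale and rounded to the rating tool's grid."""
    out = []
    for _ in range(n):
        r = random.gauss(true_score, spread)
        r = max(1.0, min(5.0, r))       # scale limits
        out.append(round(r / step) * step)  # grid rounding
    return out

def mean(xs):
    return sum(xs) / len(xs)

# Two codecs 0.5 apart when the whole scale is in use...
wide_gap = (mean(simulated_ratings(4.0, 0.4, 50))
            - mean(simulated_ratings(3.5, 0.4, 50)))

# ...and the same pair compressed into the top of the scale because a
# far-inferior low anchor pushed everything else upward: the true gap
# (0.1) is now comparable to the grid step and swamped by noise.
narrow_gap = (mean(simulated_ratings(4.85, 0.4, 50))
              - mean(simulated_ratings(4.75, 0.4, 50)))

print(wide_gap, narrow_gap)
```

The compressed pair also runs into the scale ceiling, which shrinks the observed gap further still.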
Perhaps a high anchor could be a previous test winner at a slightly higher bitrate where some flaws are still evident (so that it still acts as a positive control, i.e. remains distinguishable from the original audio). Certain encoders are so badly flawed that some testers will immediately identify them, so I suppose old Xing with no short blocks, or BLADEenc, would not be good choices.

It also partly depends on our intention in using these close anchors. If it's to compare one listening test's quality scale to another's, while avoiding simple low-pass filters, we might wish to use a consistent set of anchors (same codec version and settings) over a number of years, even if one is a high anchor in one test and a low anchor in the next. This is especially helpful if at least some of the test samples feature in every listening test.

Another potential use of anchors is to calibrate and normalise the quality scales used by different listeners, though the validity of this is questionable: some people find pre-echo more annoying than tonal problems, or stereo collapse less objectionable than high-frequency sparklies, while others have the reverse preferences. These differing preferences are part of the reason that results can be intransitive.

Once or twice, anchors have also been used to address a common claim or myth (e.g. that WMA at 64 kbps is as good as MP3 at 128 kbps). Some of guruboolez's 80-96 kbps tests used LAME at about 128 kbps as an anchor to assess where the truth lay at the time, to his ears.

I would say, however, that the methods of all the recent public tests are pretty darned good and provide useful information about the state of the art at the time. These discussions might enable some more nuanced conclusions to be drawn, and some comparison between the results of one test and another where the same anchor on the same samples received a different rating.
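The per-listener calibration idea can be made concrete as a linear rescale through the two shared anchors (a hypothetical sketch: the reference points, listener scores, and codec names below are invented, and as noted the whole approach is questionable when listeners weight different artefact types differently):

```python
# Assumed common reference points the anchors are mapped onto.
LOW_REF, HIGH_REF = 2.0, 4.5

def calibrate(ratings, low_anchor, high_anchor):
    """Linearly rescale one listener's ratings so that their scores
    for the shared anchors land on LOW_REF and HIGH_REF.

    ratings: dict of codec -> score on this listener's personal scale.
    """
    lo, hi = ratings[low_anchor], ratings[high_anchor]
    if hi <= lo:
        raise ValueError("anchors not ordered for this listener")
    scale = (HIGH_REF - LOW_REF) / (hi - lo)
    return {c: LOW_REF + (s - lo) * scale for c, s in ratings.items()}

# One invented listener who uses a wide personal scale.
listener = {"low_anchor": 1.5, "codec_x": 3.0,
            "codec_y": 4.0, "high_anchor": 4.8}
adjusted = calibrate(listener, "low_anchor", "high_anchor")
```

After calibration every listener's anchors coincide, so the codecs under test can be compared on a common scale; but the rescale only corrects for scale usage, not for genuinely different artefact preferences.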
However, given the statistical error, there are still limits on what we can conclude. We need to weigh up whether a change of method would gain enough to be worth the additional effort. That may be an individual matter for the test organiser to decide, given how much valuable work they put in already and how they balance the number of codecs under test against other parameters.