ETA: Graynol, this is why I hesitate to say anything here. Just like in audiophile forums, it seems that anything you say can and will be used against you, even if you didn't say it. In case you weren't aware, I'm tired of audio, tired of audio enthusiasts of all sorts, and multiply-tired of the people who like to grind axes.
I think I understand now. We're talking about Control as in the Control Condition in a Controlled Experiment, where the Control is used to compare against the Test Condition.

Negative Control in this case does not refer to negative or positive numbers, but to a Null Condition where no difference should be expected. This means that the Negative Control is there to catch False Positives (where listeners falsely detect non-transparency).

We are comparing the original sample (or possibly the high anchor) with itself, so we should expect no difference. This eliminates testers who claim to discern a difference when they cannot, but might believe they can because of expectation bias or something similar, and also those who might be tempted to score somewhat at random.
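To make the False Positive screen concrete, here's a minimal sketch (not from any actual HA test script; the function name and screening threshold are hypothetical) of how a negative-control rating could be used to flag listeners who "hear" a difference where none can exist:

Code:
# Hypothetical illustration of a negative-control (hidden reference) screen.
# Each listener rates the hidden reference against itself on a 1.0-5.0 scale;
# since the two stimuli are identical, anything below the top grade is a
# false positive caused by expectation bias or random scoring.

SCREEN_THRESHOLD = 5.0  # assumed screening level; real tests may differ

def passes_negative_control(hidden_reference_score: float,
                            threshold: float = SCREEN_THRESHOLD) -> bool:
    """Return True if the listener did NOT report a difference on the null condition."""
    return hidden_reference_score >= threshold

# Example: listener ratings of the hidden reference on three test items
ratings = {"listener_a": [5.0, 5.0, 5.0],
           "listener_b": [5.0, 3.8, 4.5]}   # reports differences that cannot exist

screened = {name: scores for name, scores in ratings.items()
            if all(passes_negative_control(s) for s in scores)}

print(screened)   # only listener_a survives the screen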
I'm doing no such thing. ABC/HR does individual rankings, not confusing things like a test with 4 anchors and 10 probe conditions that asks you to rank the lot of them on one scale. That's not ABC/HR or BS.1116, although I do have some questions about some of the evaluations that followed some BS.1116 tests.
You've compressed a big concept that needs a lot of explanation into one word: ^transitive^. There are a lot of people from different disciplines here, and not everybody will understand the term. I even went to #hydrogenaudio to ask whether it was only me who couldn't quite follow what people (including you) were talking about. It turned out a few other people didn't get your idea either.
We usually do plot the low anchor in HA public listening tests, but not the reference, though one or two tests did use (and plot) a high anchor that was not the original audio. Where a ranked reference results in exclusion from the results, the screened results will obviously place the Negative Control (for False Positives) at the screening level (typically 5.0), making a plot of those values trivial.
Also, you might encounter another issue: If you compare a and b and b is subjectively better, and you compare b and c and c is subjectively better, then you should have that c is better than a, right? Not always so in real-world experiments. That's one thing you might want to test for.
If anyone needs it: we have that if a>b and b>c, then a>c. That is transitivity for the > relation. The = relation is also transitive: if a=b and b=c, then a=c.

The "approximately equal to" relation is not transitive. Or, put as "not far from", to make it a bit more obvious: if a is not far from b, and b is not far from c, that does not rule out a and c being far from each other. You would expect this with any relation which is "not far from" in the appropriate sense, including "statistically tied to": we can have a tied to b and b tied to c, yet not necessarily a tied to c. And here's one more: just because you cannot ABX a from b, and you cannot ABX b from c, it might still be that you can actually ABX a from c.
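As a toy illustration of the last point (purely made-up numbers, not from any listening test), suppose "cannot tell apart" means the quality difference is below some detection threshold:

Code:
# Made-up example of non-transitivity for an "approximately equal" relation.
# Differences below THRESHOLD are treated as "cannot tell apart" (e.g. not ABX-able).

THRESHOLD = 1.0          # hypothetical just-noticeable difference
a, b, c = 0.0, 0.6, 1.2  # hypothetical quality values

def indistinguishable(x: float, y: float) -> bool:
    return abs(x - y) < THRESHOLD

print(indistinguishable(a, b))  # True:  a and b are "tied"
print(indistinguishable(b, c))  # True:  b and c are "tied"
print(indistinguishable(a, c))  # False: yet a and c are clearly different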
Let's suppose two separate tests and 3 codecs:

1st test:
A - 4.0 (perceptible but not annoying)
B - 3.0 (slightly annoying)

2nd test:
C - 3.5 (very slightly annoying, or a bit annoying (?))
B - 3.0 (slightly annoying)

For one particular listener: given he/she applies the same scale (1.0-5.0, very annoying to imperceptible) to both tests, it's not at all invalid to think that A>C for him/her. A listener with a certain amount of experience already has his own criteria which he applies to all samples: "OK, if it's not that bad I give it 4.0. If a sample has this sort of artifact I give it 3.0, but my ears are more tolerant of another type of artifact (3.5)", etc...

P.S. Now if there is more than one listener...
http://www.acourate.com/Download/BiasesInM...teningTests.pdf
Do you really believe that some extra control will substantially change the results?
Post-screening of listener responses should be applied as follows. If, for any test item in a given test, either of the following criteria is not satisfied:
• The listener's score for the hidden reference is greater than or equal to 90 (i.e. HR >= 90)
• The listener's scores for the hidden reference, the 7.0 kHz lowpass anchor and the 3.5 kHz lowpass anchor are monotonically decreasing (i.e. HR >= LP70 >= LP35)
then all listener responses in that test are removed from consideration.
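For anyone who wants the rule spelled out, here's a minimal sketch of that post-screening logic in Python (variable names are mine, not from the standard; the MUSHRA scale here is assumed to run 0-100):

Code:
# Sketch of the quoted post-screening rule: for each test item, the listener's
# hidden-reference score must be >= 90 AND the hidden reference, 7.0 kHz
# low-pass anchor and 3.5 kHz low-pass anchor must be monotonically decreasing
# (HR >= LP70 >= LP35). If either check fails on any item, the listener's
# responses for that test are discarded.

def listener_passes_screening(items: list[dict]) -> bool:
    """items: one dict per test item with keys 'HR', 'LP70', 'LP35' (0-100 scores)."""
    for item in items:
        hr, lp70, lp35 = item["HR"], item["LP70"], item["LP35"]
        if hr < 90:
            return False
        if not (hr >= lp70 >= lp35):
            return False
    return True

# Example: this listener scored the hidden reference at 85 on one item,
# so all of their responses for the test would be removed from consideration.
responses = [{"HR": 95, "LP70": 60, "LP35": 30},
             {"HR": 85, "LP70": 70, "LP35": 40}]
print(listener_passes_screening(responses))  # False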
e.g. a 3.5kHz LPF anchor in a test of substantially transparent audio codecs would be idiotic - IMO.
If the contenders are statistically tied, changing the anchors isn't going to magically untie them. Also, having only a few listeners and a few samples doesn't make for very compelling results, especially when the listeners are untrained.
Unlike ABX, where you rely on continued trials to demonstrate that you can consistently distinguish between two things, MUSHRA tests rely on many samples and well-chosen controls to help weed out bad data. When working with contenders that are near-transparent, a hidden reference makes sense, otherwise it is a poor control that is too easy to identify. Same goes for low anchors if they are too low.
Not everyone ranks different artifacts the same way.
Quote from: 2Bdecided on 28 November, 2012, 06:05:35 AM
e.g. a 3.5kHz LPF anchor in a test of substantially transparent audio codecs would be idiotic - IMO.

Exactly my thoughts. But standardization organizations are interested in testing it because those are widely used bandwidths: NB telephony (3.5kHz) and WB (7kHz). We would probably need two low anchors, something like 5kHz and 8-10kHz (?)

P.S. It would probably be better if we started using the same lowpass anchors for all public tests.
Actually, having too low an anchor can make things tie by changing the listeners' scaling of the test results.
Quote from: Woodinville on 28 November, 2012, 12:32:04 PM
Actually, having too low an anchor can make things tie by changing the listeners' scaling of the test results.

True; however, if people actually adhere to the descriptions of the rankings, the locations of the low anchors shouldn't affect the scores of the other samples.
My point was echoing David's about the potential to compress the range of ratings given to codecs vastly superior to the low anchor, in order to score the low anchor sufficiently low. That compression may introduce more rounding error into the ratings and widen the error bars. Avoiding it is no magical effect, just a reduction in statistical noise that might improve discrimination at the margin (or at least should make it no worse).
It also partly depends on the intention we have in using these close anchors. If it's to compare one listening test's quality scale to another, yet avoid simple low-pass filters, we might wish to use a consistent set of anchors (same codec version and settings) over a number of years, even if one is a high anchor in one test and a low anchor in the next. This can be especially helpful if at least some of the test samples feature in every listening test.

I would say, however, that I think the methods of all the recent public tests are pretty darned good and provide useful information about the state of the art at the time.

These discussions might enable some more nuanced conclusions to be drawn, and some comparison between the results of one test and another where the same anchor on the same samples has a different rating. However, given the statistical error, there are still limits on what we can conclude.

We need to weigh up whether we'll gain enough by changing methods to be worth the additional effort. That might be an individual matter for the test organiser to choose, given how much valuable work they put in already and how they weigh up the number of codecs under test against other parameters.