Why does a single sample generate only one test trial?

2008-12-02 21:44:24

So I participated briefly in Sebastian's test (thanks!) - I picked castanets (by accident). Once I found like three different artifacts in a single encode, a really curious thing happened: the test was no longer blind, and my ratings were 99% subjective. Because I ABX'd everything, ABC/HR trusted everything I did, and I wound up trying to average ratings across all the artifacts to yield a final rating for that encode. There was real listening behind the numbers, but I'm not terribly happy with the results. And I probably missed listening to a few artifact/encoder combinations in some places, etc.

What my experience suggests to me is that, for encoder configurations which have "pointlike" artifacts (they do not continuously distort, but rather have a countable set of audible distortions, which is the case for all modern codecs above 96kbps), the tests could be reoriented to test those specific positions in the files, rather than the entire files. In practice, this means telling listeners exactly where the artifacts are supposed to be, and running ABX tests and ratings on an artifact-by-artifact basis rather than on a sample-by-sample basis.

I see the following advantages to this scheme:

Far larger number of trials available from the same number of samples for improved statistical power
Lower total listening time for the same power of results
Gentler learning curve for newbie listeners - "the artifact's right here, all you have to do is ABX it and rate it"
Because the listener is no longer required to subjectively aggregate different artifact ratings into a single sample rating, statistical power should further improve.
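
To put a rough number on the statistical power argument, here is a minimal Python sketch. It assumes each ABX trial is an independent coin flip under pure guessing, and that a listener who really hears an artifact answers correctly with some fixed probability. The function names, the 0.05 threshold and the 0.8 detection probability are illustrative assumptions, not anything ABC/HR itself computes. Pooling trials across artifacts this way also assumes the artifacts are equally audible, which they won't be in practice.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """One-sided chance of scoring at least `correct` out of `trials` by guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


def min_correct(trials: int, alpha: float = 0.05) -> int:
    """Smallest score whose guessing probability is at most `alpha`.

    Returns trials + 1 when even a perfect score cannot reach `alpha`.
    """
    for c in range(trials + 1):
        if abx_p_value(c, trials) <= alpha:
            return c
    return trials + 1


def power(trials: int, p_detect: float, alpha: float = 0.05) -> float:
    """Chance that a listener who is right with probability `p_detect`
    on each trial reaches the passing score from min_correct()."""
    c = min_correct(trials, alpha)
    return sum(
        comb(trials, k) * p_detect ** k * (1 - p_detect) ** (trials - k)
        for k in range(c, trials + 1)
    )


if __name__ == "__main__":
    # Hypothetical comparison: one 8-trial ABX per sample versus
    # 16 or 32 trials gathered artifact by artifact.
    for n in (8, 16, 32):
        print(n, min_correct(n), round(power(n, 0.8), 3))
```

Under these assumptions the chance of a real detection reaching significance goes up as trials are added, which is the effect the first and last points above are counting on.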

My blinders are on right now as to exactly what the disadvantages to this scheme might be. Comments?

Reply #1 – 2008-12-02 23:48:19

I've thought about this sometimes, since it would be a perfect companion for blind tests.

In the broadest sense, "no difference" can't be ABX'd.
If the tools could give guidance on where differences exist, the work is reduced to identifying each one and rating it. No need to listen to the whole sample again and again looking for a bad part.

What are the problems?

If this is a manual process, it requires a person or several people to do it, which means running an ABX test before the real test. A manual process also implies that the artifacts spotted are the ones that matter to the people finding them, so each type of artifact will not necessarily get the same attention.

If this is an automatic process, the algorithm has to be tuned so that it acts neither as a bit comparator (useless for lossy codecs) nor as just a "human ear" (what was the name of that program that ranks codecs, again?).

I am confident that an algorithm could be made specific for each type of identified artifact that we have, but it would probably be more difficult to make one that just decides what sounds right and what sounds wrong.

Reply #2 – 2008-12-03 02:12:42

Quote (Dec 3 2008, 00:48, post 602569)

If this is an automatic process, the algorithm has to be tweaked as to not act as a bit comparator (useless for lossy codecs), neither just as an "human ear" (what was the name of that program that ranks codecs, again?)

There's EAQUAL and also PEAQ, I believe, which can be useful for developers while tuning or debugging codecs, and I guess they could point to the more probable artifact positions as guidance for inexperienced human ABC/HR users? Dunno, I've never used them, but I guess they give an objective measure that may be worth something when true transparency has been abandoned in looking for lower typical bitrates.

This approach may skew the ratings by pointing out artifacts that would otherwise go unnoticed, and so would clearly not have been annoying, so codecs would tend to receive lower ratings.

I guess a third approach is to incorporate the hints into ABC/HR so that in the event of failing the initial ABX, the user could then ask for a hint and try again, which could be recorded in the log, or used to force a minimum score of 4.0 (perceptible but not annoying). For each test sample or each of many artifacts expected within any sample (see OP), it might be possible to incorporate cue points to set start and end and isolate the section around the suspected artifacts.
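
As a rough illustration of that third approach, here is a minimal Python sketch of what a logged, hint-aware result could look like. Everything in it is hypothetical: `CuePoint`, `TrialResult`, `logged_rating` and the field names are made up for this example and are not part of ABC/HR, and the 1.0 to 5.0 rating range is an assumption about the scale.

```python
from dataclasses import dataclass


@dataclass
class CuePoint:
    """Hypothetical start/end markers (in seconds) around a suspected artifact."""
    start_s: float
    end_s: float


@dataclass
class TrialResult:
    """Hypothetical log entry for one rated artifact."""
    sample: str
    cue: CuePoint
    rating: float      # listener's rating, assumed to lie in 1.0..5.0
    hint_used: bool    # True if the listener asked for the hint


def logged_rating(result: TrialResult, hint_floor: float = 4.0) -> float:
    """Return the rating to record, raised to `hint_floor` when a hint was needed."""
    if not 1.0 <= result.rating <= 5.0:
        raise ValueError("rating must be between 1.0 and 5.0")
    if result.hint_used:
        # An artifact that could only be found with a hint is treated
        # as no worse than "perceptible but not annoying".
        return max(result.rating, hint_floor)
    return result.rating


if __name__ == "__main__":
    r = TrialResult("castanets", CuePoint(2.5, 4.0), rating=3.2, hint_used=True)
    print(logged_rating(r))
```

The floor is left as a parameter so that other minimum scores can be tried without changing the logic.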

If it makes it easier and quicker for newbies to take part in listening tests, it could improve the statistical validity of them through reduction of random error (though it might impart a systematic, non-random error by lowering people's ratings). It could also be used in an example listening test for training oneself in conducting trials within a public listening test, again to make it easier and provide encouragement among learners.

Reply #3 – 2008-12-03 03:35:50

Personally, I don't like making the tests easier by giving hints on where to look, or shortening the samples to just a couple seconds of killer artifact. I think if the test is hard, then so be it. To make it easier, we should be testing a lower bitrate. I think the current methodology is a good cross representation (as ff123 said a long time ago, I think on his site), and it also helps that to even do a test you must be at least interested enough to really try to hear something.

But if hints were implemented, maybe instead of assigning a minimum of 4.0, it would be 4.9, with the description "barely perceptible and needed hint".

The ff123 site has a nice section to train you on how to hear artifacts, and some samples, IIRC.
