Title: listening tests help
Post by: legg on 2005-06-18 19:51:55
How can I rate how good or bad an audio codec will be on average?

I'm thinking of doing at least 21 ABX trials on each file, and if the subject manages to tell a difference and the guess probability is low, I'd also use a 1-5 point scale to measure the perceived quality.

So far I have analyzed one subject. He did pretty well on most tests, except one, where the results were these:
15 out of 29, pval = 0.500

AFAIK, p<0.05 is good enough to be confident that he wasn't just guessing; what about the rest?
How do I interpret that? Is the codec transparent for him on that test?
What are the ranges of the guess probabilities and how should I interpret them?
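
For readers who want to check a figure like the one above, here is a minimal sketch of how an ABX "guess probability" is usually computed: a one-sided binomial tail, assuming each trial is an independent 50/50 guess. The function name `abx_p_value` is only illustrative and does not come from any ABX tool.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right out of `trials` by pure guessing."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Count all outcomes with `correct` or more right answers...
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    # ...out of the 2**trials equally likely guess sequences.
    return favourable / 2 ** trials

print(f"{abx_p_value(15, 29):.3f}")  # the 15 out of 29 result: prints 0.500
```

A value near 0.5 is what pure guessing looks like on average, so on its own it gives no evidence that a difference was heard.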

Post by: guruboolez on 2005-06-18 20:18:24
You should read Pio2001's explanation: it's very clear and complete.
http://www.hydrogenaudio.org/forums/index.php?showtopic=16295&
Post by: Digga on 2005-06-18 20:28:38
if you say he did pretty well except on this one, that means he could tell the difference, right? it would be more flattering for the codec (i.e. better) if he couldn't tell the difference, which would mean it's transparent for him and he's just guessing whether x is a or b.

anyway, either alpha=0.01 or alpha=0.05 is generally chosen.
btw, you are never certain: a (very) low p-value only means it is very unlikely that he was guessing, so he is considered not to be guessing.
a pval of 0.5 would be a reference for guessing.
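
To make the alpha thresholds concrete, the sketch below finds the smallest score that gives p < alpha for a given number of trials, again assuming independent 50/50 guesses. `min_correct` is a hypothetical helper name, and the 21 trials are the number mentioned in the first post.

```python
from math import comb

def min_correct(trials: int, alpha: float):
    """Smallest number of correct answers with one-sided p < alpha, or None if impossible."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be between 0 and 1")
    total = 2 ** trials
    tail = 0  # number of guess sequences with at least k correct answers
    for k in range(trials, -1, -1):
        tail += comb(trials, k)
        if tail / total >= alpha:
            # k correct is no longer significant, so k + 1 is the threshold,
            # unless even a perfect score fails to reach significance.
            return k + 1 if k < trials else None
    return None  # not reached: with k = 0 the tail is 1, which is >= alpha

print(min_correct(21, 0.05))  # 15
print(min_correct(21, 0.01))  # 17
```

So with 21 trials, 15 or more correct answers meets alpha=0.05 and 17 or more meets alpha=0.01.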

for more in-depth info, look here. (http://www.hydrogenaudio.org/forums/index.php?showtopic=16295)

Post by: legg on 2005-06-19 16:28:07
Quote
for more in-depth info, look here. (http://www.hydrogenaudio.org/forums/index.php?showtopic=16295)

Yes, I have read that thread, but it doesn't mention how to deal with the results when p>0.05, which is the case here. How should an interval like 0.05<p<0.25 be dealt with? It certainly doesn't say much about the codec, and I would hardly classify it as transparency. IMO p>0.5 means transparency, but I wanted to check with you gurus about this.

I thought that anyone could rate the quality, but what if the subject merely picked the correct file by chance? I need more certainty, so I'm using ABX as an indicator of the trustworthiness of the subject's ratings. Is this correct?

Btw, I'm more interested in rating the subjective quality of the codec.

Thanks again.
Post by: ff123 on 2005-06-19 17:34:18
If it were me, I'd just choose a whole bunch of different samples (say 30), and then rate each one against the reference using abc-hr.  Then I'd plug the results into a statistical calculator (http://ff123.net/friedman/stats.html) to determine first if you found a significant difference from the reference and second how much that difference is.  No ABX'ing is involved, plus you get a better indicator of codec quality by sampling a lot of different music.

BTW, I would also keep the samples where you rate the reference, rather than throw these cases out.  So if you make some mistakes, the reference will average something less than 5.0.

ff123
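
As a rough illustration of the idea above (rate each sample against the reference, keep the reference ratings, then test whether the difference is significant), here is a minimal sketch. It uses an exact two-sided sign test as a simple stand-in, which is not necessarily the method used by the linked calculator, and the ratings are made-up numbers for illustration only, not real test results.

```python
from math import comb

def sign_test_p(diffs) -> float:
    """Exact two-sided sign test on paired differences; ties (zeros) are dropped."""
    pos = sum(1 for d in diffs if d > 0)
    neg = sum(1 for d in diffs if d < 0)
    n = pos + neg
    if n == 0:
        return 1.0
    k = max(pos, neg)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical ABC/HR ratings for 8 samples (1-5 scale). The reference is
# below 5.0 where the listener mistook it for the encoded file.
reference = [5.0, 5.0, 4.8, 5.0, 5.0, 4.5, 5.0, 5.0]
codec = [4.2, 5.0, 4.0, 3.5, 4.8, 4.6, 4.1, 3.9]

diffs = [r - c for r, c in zip(reference, codec)]
mean_diff = sum(diffs) / len(diffs)  # how large the difference is
p = sign_test_p(diffs)               # whether it is distinguishable from chance
print(f"mean difference = {mean_diff:.2f}, p = {p:.3f}")
```

With only 8 samples the test has little power, which is one reason a larger and more varied sample set helps.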
Post by: guruboolez on 2005-06-19 18:55:39
Quote
If it were me, I'd just choose a whole bunch of different samples (say 30), and then rate each one against the reference using abc-hr.
I'm going off-topic, but I've an important question.
I'm trying to build a complete set of classical music samples, in order to replace the usual suite of 15 samples I've been using for 18 months. My goal is to obtain 100 samples, including many instruments, solo, chamber, orchestral, lyrical, noisy or not noisy, quiet and loud, etc. But I'm realizing that making ABX comparisons with so many samples would be a Herculean task.
What would be the best thing in your opinion:
- 100 samples rated in ABC/HR without ABX
- 15...20 samples rated in ABC/HR + ABX confirmation?
Post by: ff123 on 2005-06-19 19:27:51
For an experienced listener like you, I'd definitely dump the ABX and just go with the ratings.
Post by: guruboolez on 2005-06-19 19:39:09
Good point
I have to be more specific. Are such tests valid (I mean: statistically) or, more precisely, do both kinds of test have the same level of validity?

I'm asking because I usually publish the results of my tests, and I always try to avoid criticism. I just fear that a big listening test including 50 or 100 samples without ABX confirmation will be contested. Should I keep this kind of test private and favour ABX for public ones, or would you find the publication of an ABC/HR-only listening test involving many more samples more interesting?
Post by: ff123 on 2005-06-19 19:45:32
Individual results will have increased uncertainty, but overall the results will be more representative of a codec's true worth, even if you make mistakes on a few samples.

And yes, you can still say something statistically about the codecs.

ff123
Post by: guruboolez on 2005-06-19 20:06:54
That's encouraging. A listening test involving a lot of samples (and therefore introducing much greater diversity) without the necessity of listening to each one ~50 times is much less boring for the listener. Less stressful too (I'm often frightened when I have to click on the "view results" button).

Thank you for your precious assistance with statistics and listening test methodology

Luis G> Could I possibly ask you what kind of codec you are developing?
Post by: legg on 2005-06-20 01:52:05
ff123, currently I'm using 8 samples: castanets, finger snaps, french horns, timpani, triangle, trumpets 1, trumpets 2, violins 1 and violins 2. I'm also planning to add male and female voice samples, and perhaps a rock/pop/jazz test will also be included. How many subjects should I use to get a significant result?

guruboolez, it is a transform codec (MDCT) with 25 bands, one for each Bark. The data rate is variable and is always adjusted according to signal demands to achieve good quality. Expected data rates range from 60 to 340kbps.
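
As an illustration of how MDCT coefficients can be grouped into Bark bands, here is a minimal sketch. It is not the codec described above: the Zwicker and Terhardt approximation of the Bark scale, the 44.1 kHz sample rate, the 1024 coefficients per block and both function names are assumptions made for the example.

```python
from math import atan

def hz_to_bark(f_hz: float) -> float:
    """Zwicker & Terhardt approximation of the Bark scale."""
    return 13.0 * atan(0.00076 * f_hz) + 3.5 * atan((f_hz / 7500.0) ** 2)

def bark_band_of_bins(num_bins: int, sample_rate: float) -> list:
    """Bark band index (0-based) for the centre frequency of each MDCT bin."""
    bin_width = sample_rate / (2.0 * num_bins)  # Hz covered by one coefficient
    return [int(hz_to_bark((k + 0.5) * bin_width)) for k in range(num_bins)]

# Assumed parameters: 1024 coefficients per block at 44.1 kHz.
bands = bark_band_of_bins(1024, 44100)
print(len(set(bands)))  # number of distinct bands covered
```

Under these assumptions the bins fall into 25 bands (indices 0 to 24), which matches the band count mentioned above.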

Greetings.
Post by: ff123 on 2005-06-20 02:22:26
I would double the number of samples from 8 to 16.

ff123
Title: listening tests help
Quote
I would double the number of samples from 8 to 16.

ff123
And how many people should take the test?
I was thinking somewhere between 20 and 50.
Post by: ff123 on 2005-06-20 05:53:38
Heheheh.  Yes, if you can.  Good luck.

ff123