
Topic: Interpretation of results of blind test

  • IgorC
Interpretation of results of blind test
I'm curious about how to interpret the results of blind tests correctly.

For example, the average scores of two competitors are A=4.5 and B=4.7.
First of all, statistically speaking they are tied. But let's suppose there were enough samples to discard this statistical interval.
I know this is less about mathematics than about a practical approach, whose philosophy is quite different, but here are my thoughts:

Mathematical interpretation
On the one hand, A = 4.5/5 = 90% of "quality" and B = 4.7/5 = 94%.
1st type of interpretation: 94/90 = 1.0444...
Codec B is better than A by 4.444...%.

2nd type of interpretation: codec A has 90% of full quality (100% being full transparency), so it has 10% of perceptible difference (artifacts), while codec B has 94% of quality and 6% of perceptible difference.
Then Artifacts(A)/Artifacts(B) = 10%/6% = 1.666...
Codec B is better than A by a factor of 1.666...

In my opinion the 2nd type of interpretation more closely reflects the real listening experience, as people mostly remember artifacts and don't notice when a codec does a good job.
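
The two calculations above can be written out as a small Python sketch. The function names are made up for illustration, and dividing by the maximum grade of 5 simply follows the arithmetic in this post; it is not an established way of analysing ABC/HR scores.

```python
MAX_SCORE = 5.0  # top grade, as used in the post's arithmetic


def quality_fraction(score, max_score=MAX_SCORE):
    """Score expressed as a fraction of the maximum grade."""
    return score / max_score


def quality_ratio(score_a, score_b):
    """1st interpretation: quality of B relative to quality of A."""
    return quality_fraction(score_b) / quality_fraction(score_a)


def artifact_ratio(score_a, score_b):
    """2nd interpretation: shortfall of A relative to shortfall of B.

    Undefined (division by zero) if B scores the full 5.0.
    """
    return (1 - quality_fraction(score_a)) / (1 - quality_fraction(score_b))


a, b = 4.5, 4.7
print(f"quality ratio:  {quality_ratio(a, b):.4f}")   # 1.0444
print(f"artifact ratio: {artifact_ratio(a, b):.3f}")  # 1.667
```

The same pair of averages gives "4.4% better" or "1.67 times better" depending only on which quantity is put in the ratio, which is why the choice of interpretation matters so much here.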

Psychological interpretation
When can I say quality is transparent? At 5.0? Or less? 4.9, 4.8...?
I can only speak for myself here. This is how I would grade one particular sample:
1. 5.0 - indistinguishable from the original
2. 4.9 - I doubt I will ever hear the difference again, or whether I actually heard it at all.
3. 4.8 - I do hear the difference, but it was very hard and required a lot of concentration.
4. 4.7 - The sample isn't transparent at all, but the quality is high. This is the point where I say "not transparent".

So a codec is transparent (or extremely close to transparent) for me if the average score is at least 4.8 (psychological interpretation).

For the mathematical (and/or statistical) interpretation, 95% is a good approximation. So 5 * 95% = 4.75 is the minimal score for imaginary transparency.

Any comments and thoughts about your personal experience are welcome.

If this has already been discussed, please post a link.

  • MichaelW
Reply #1
I've got no practical experience with ABC/HR, nor am I competent in statistics, but I spent a working life making subjective evaluations (and comparing results and worrying about their validity).

From that, I'd wonder if we can really read this scale in a simple linear fashion. I would be surprised if people could, reliably and accurately, use more than a 10-point scale on a task like this. Perhaps even the 5 integer points are the only ones that really count, for any individual making a judgment.

Further, are our perceptions of "quality" based on a simple arithmetic scale of number of artifacts, or whatever technical measure of goodness of compression might be appropriate? Most of life seems to be logarithmic.

I, therefore, tend to read a score of 4.5 as meaning "Half the time, the testers couldn't tell this from the reference," and 4.1 as meaning "Didn't annoy people, and transparent for a few."  This, too, isn't quite right, as it ignores variability of scores, but it seems a bit closer to what the tests mean.

I'm really grateful to the testers, and especially Sebastian who has organised a lot of tests. I'm conscious that the results are being read in two ways. Some people take them as a guide to usage (as, for instance, in the latest case, we can say that a number of modern MP3 encoders are very good indeed at 128 kbps; if you have specific, critical needs, you need to do your own tests). Others are interested in absolute rankings of encoders, a kind of MP3 Olympics. I doubt if ABC/HR scores can really support such an order of merit.

Once more, thanks to everybody who does this stuff, and big ups to Sebastian.

  • ExUser
  • Read-only
Reply #2
Another possibility for interpretation of the values is to consider them as a total order. I'm not enough of a math nerd to know whether that would have any significant statistical repercussions, but perhaps it could.
  • Last Edit: 28 November, 2008, 03:22:51 AM by Canar

  • MichaelW
Reply #3
Now I'm retired, I'm trying to teach myself high-school maths, so what do I know?

But, I suspect that the scores in tests like this might not be in a transitive relationship, if that's the right way of putting it.

Because a > b, and b > c, it doesn't necessarily mean a > c (where > is to be read as "is preferred to").
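
A minimal sketch of how that can happen, using three hypothetical listeners who each rank three codecs (the rankings are invented for illustration, not taken from any test). Each listener's own ranking is perfectly consistent, yet the majority preference forms a cycle:

```python
from itertools import permutations

# Hypothetical rankings, best first, one list per listener.
rankings = [
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["c", "a", "b"],
]


def majority_prefers(x, y, rankings):
    """True if more than half of the listeners rank x ahead of y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2


for x, y in permutations("abc", 2):
    if majority_prefers(x, y, rankings):
        print(f"{x} is preferred to {y}")
# a is preferred to b
# b is preferred to c
# c is preferred to a
```

Note that this only arises for pairwise preferences. Averaged scores are plain numbers, so comparing means is always transitive; whether the means capture what listeners actually prefer is the open question.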

Better stop now, before I get totally out of my depth (before??).

  • ExUser
  • Read-only
Reply #4
I see where you're coming from. I'm just hypothesizing that if we can assert that the codec ratings are in a total order, we can manipulate them mathematically with more validity.

I don't even know if there is any validity to this at all.
  • Last Edit: 28 November, 2008, 03:24:28 AM by Canar

  • muaddib
  • Developer
Reply #5
Are you searching for an interpretation of a private or a public listening test?

First of all, statistically speaking they are tied. But let's suppose there were enough samples to discard this statistical interval.

You cannot discard intervals. It might just happen that, with enough listeners (or repetitions from one person on different days), the intervals get so small that they no longer overlap.
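
A rough sketch of that effect in Python, using made-up grades whose means match the 4.5 and 4.7 from the first post. It uses a simple normal-approximation interval for each mean, which is an assumption for illustration and not the analysis used in the public listening tests; repeating the same grades is only a stand-in for collecting more results with the same spread.

```python
import math
import statistics


def mean_ci(scores, confidence=0.95):
    """Return (mean, low, high) using a normal approximation."""
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem


def intervals_overlap(ci_a, ci_b):
    """True if the two (mean, low, high) intervals share any point."""
    return ci_a[1] <= ci_b[2] and ci_b[1] <= ci_a[2]


# Hypothetical grades: codec A averages 4.5, codec B averages 4.7.
base_a = [4.0, 4.5, 5.0, 4.5]
base_b = [4.2, 4.7, 5.0, 4.9]

for repeats in (1, 5, 25):
    ci_a = mean_ci(base_a * repeats)
    ci_b = mean_ci(base_b * repeats)
    print(
        f"n={len(base_a) * repeats:3d}  "
        f"A=[{ci_a[1]:.2f}, {ci_a[2]:.2f}]  "
        f"B=[{ci_b[1]:.2f}, {ci_b[2]:.2f}]  "
        f"overlap={intervals_overlap(ci_a, ci_b)}"
    )
```

With 4 and 20 grades per codec the intervals overlap; with 100 they separate, even though the means never change. The intervals are still there, they have just become narrow.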

1. 5.0 - indistinguishable from the original
2. 4.9 - I doubt I will ever hear the difference again, or whether I actually heard it at all.
3. 4.8 - I do hear the difference, but it was very hard and required a lot of concentration.
4. 4.7 - The sample isn't transparent at all, but the quality is high. This is the point where I say "not transparent".

It would be good to include this in
http://www.hydrogenaudio.org/forums/index....c=67547&hl=
There is also a nice recommendation in this thread for ABC/HR grades.

  • Alexxander
Reply #6
What you are trying to do, IgorC, is reach solid conclusions based on cold numbers. This is only possible if all the variables are under control and if everybody rates the same way using the same rating system.

A blind test like ABC/HR only tells how the participants rated the samples and almost nothing about whether Codec A is more transparent or has more artifacts than Codec B. Each individual has his own hearing and way of rating (and both vary over time). The end conclusion depends completely on the selection of participants (I discard controllable parameters like listening environment and the tools used).

So, even ignoring error margins (which are always there!), I cannot agree with your suggestions of mathematical interpretations as the cold ABX results mean very little. Personally I would avoid calculating mathematical relationships between results.
  • Last Edit: 28 November, 2008, 06:29:27 AM by Alexxander

  • muaddib
  • Developer
Reply #7
The end conclusion depends completely on the selection of participants (I discard controllable parameters like listening environment and the tools used).

It also depends on when each participant gave their grade, because the grade for the same sample from the same participant may vary a LOT between two trials. Even the order of encoder ratings can differ. There is evidence for this in the results of public listening tests conducted so far, where the low anchor (or was it the high anchor?) was the same in different tests. Sorry, I don't have enough time to search for this example.