New Public Multiformat Listening Test (Jan 2014)


Reply #125 – 2013-12-11 17:57:17

Quote from: Garf on 2013-12-11 17:27:03

Quote from: Kamedo2 on 2013-12-11 16:30:25
It's noticeably better than if one person took the test, and I'm not so pessimistic as to call it 'losing all power'.

I'm not sure what you are talking about here, but I think you completely misunderstood what I pointed out. If you squash all results per sample *before doing the analysis*, you have *20* results, not *280* as your graph shows. This is exactly the same input as if one person had taken the test. All the information about variability that you get from multiple listeners is forever gone. You might get lucky in that there is now less variability than with an actual test with one person, but how can you even tell?

Yes, it's *20* results, but the averaged result is far more accurate than the result of one person, because each sample was tested many times. Humans are whimsical, but less so when the test is repeated multiple times, and even less so when the repeat test is conducted by another person.

If really only one person had taken the test, that accuracy would be gone and the result would be dirty.

Squashing all the results per sample (14 on average) *before doing the analysis* does indeed improve the accuracy.


Reply #126 – 2013-12-11 18:22:04

Alright, nice. So the variability does already drop a load due to that. What's the analysis you used for analyzing samples and listeners separately, i.e. the original graph you posted? multi-way ANOVA? I'd be curious to see the (corrected for multiple comparisons) p-values then. I agree they're overstated in the original results. I have my reservations about ANOVA as well, due to the clipping at 5.0, but doing a bootstrap with dependent samples is out of my league so I think it's the best we can do for now.


Reply #127 – 2013-12-11 18:25:14

If somebody is interested here is also an IRC channel irc://irc.freenode.net/hydrogenaudio

P.S. I will update the list with codecs later.


Reply #128 – 2013-12-11 18:38:02

It's simply a bootstrapped confidence interval estimation of the averaged, squashed data below.

Code:

Nero CVBR TVBR FhG CT low_anchor
3.64 4.22 4.69 4.23 3.71 1.60
4.05 4.47 4.13 4.52 3.46 1.41
3.30 3.51 3.24 3.34 3.20 1.60
3.57 4.52 4.55 4.73 4.41 2.42
4.04 4.53 4.54 3.97 4.43 1.33
4.19 4.58 4.59 4.62 4.65 1.52
3.65 4.10 4.32 4.53 3.85 1.47
3.83 4.62 4.41 4.49 4.18 1.67
3.62 4.27 4.26 4.72 3.91 1.60
3.66 4.30 4.34 4.24 4.26 1.72
3.82 4.28 4.21 3.96 4.13 1.58
3.48 4.67 4.37 4.35 3.81 1.48
4.13 4.54 4.64 4.08 4.24 1.50
3.42 4.32 4.40 4.29 4.10 1.34
3.60 4.54 4.72 4.18 3.69 1.51
3.92 4.70 4.52 3.98 4.26 1.44
3.85 4.41 4.55 4.49 4.57 1.32
3.67 4.79 4.37 5.00 4.83 1.42
3.08 4.26 3.78 4.11 3.96 1.25
3.34 4.72 4.65 3.43 3.88 1.27
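
A minimal sketch of what a bootstrapped confidence interval on this squashed data could look like. The program actually used is not shown in the thread, so this assumes a plain percentile bootstrap that resamples the 20 per-sample means with replacement; the function name `bootstrap_ci`, the resample count and the seed are illustrative choices. Only the CVBR column is used to keep it short.

```python
import random

# CVBR column of the squashed table above (one averaged score per sample).
cvbr = [4.22, 4.47, 3.51, 4.52, 4.53, 4.58, 4.10, 4.62, 4.27, 4.30,
        4.28, 4.67, 4.54, 4.32, 4.54, 4.70, 4.41, 4.79, 4.26, 4.72]


def bootstrap_ci(scores, n_resamples=10000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean (assumed method, not the thread's program)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the samples with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


cvbr_mean = sum(cvbr) / len(cvbr)
lo, hi = bootstrap_ci(cvbr)
print(f"CVBR mean {cvbr_mean:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Because only the 20 squashed values are resampled, the interval reflects variation between samples; variation between listeners enters only through the averaging that produced the table.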


Reply #129 – 2013-12-11 18:58:09

Interesting.

I'd like to restate the (EDIT: sarcastic) comment I made earlier:

Quote from: greynol on 2013-12-10 01:26:54

the graphical representation of the overall results of [the] test doesn't do justice to the test data

Not to be a pain, but I must question once again whether Apple actually "won", since some appear to be basing it on a p figure from an analysis that seems to have been called into question.

EDIT: With the analysis by Kamedo2, I don't feel terribly inclined to believe a p figure over the graphs where all the error bars indicate a statistical tie.


Reply #130 – 2013-12-11 20:06:25

Quote from: Garf on 2013-12-11 18:22:04

Alright, nice. So the variability does already drop a load due to that. What's the analysis you used for analyzing samples and listeners separately, i.e. the original graph you posted? multi-way ANOVA? I'd be curious to see the (corrected for multiple comparisons) p-values then. I agree they're overstated in the original results. I have my reservations about ANOVA as well, due to the clipping at 5.0, but doing a bootstrap with dependent samples is out of my league so I think it's the best we can do for now.

I tried the blocked bootstrapping confidence interval estimation, using the 280 raw results.

It's almost the same as the squashed version. You've said that "All the information about variability that you get from multiple listeners is forever gone", but I can say that data is not lost by the squashing.
As for p-value, the program would be way harder than the CI estimation, but shouldn't be very different from the ANOVA of the squashed version.

Code:

FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 20
Critical significance:  0.05
Significance of data: 3.91E-013 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total               99          18.63
Testers (blocks)    19           6.48
Codecs eval'd        4           6.87    1.72   24.74  3.91E-013
Error               76           5.28    0.07
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.166

Means:

CVBR     TVBR     FhG      CT       Nero
  4.42     4.36     4.26     4.08     3.69

---------------------------- p-value Matrix ---------------------------

         TVBR     FhG      CT       Nero
CVBR     0.523    0.068    0.000*   0.000*
TVBR              0.229    0.001*   0.000*
FhG                        0.028*   0.000*
CT                                  0.000*
-----------------------------------------------------------------------

CVBR is better than CT, Nero
TVBR is better than CT, Nero
FhG is better than CT, Nero
CT is better than Nero
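
For readers who want to see where the F value in this output comes from, here is a hedged sketch of a randomized-block ANOVA on the squashed table posted earlier (low anchor omitted). It is not the FRIEDMAN program itself; `blocked_anova` is an illustrative helper, and the rows are the 20 samples, which the tool labels as "listeners".

```python
# Squashed per-sample means from the earlier table; columns follow `codecs`.
codecs = ["Nero", "CVBR", "TVBR", "FhG", "CT"]
rows = [
    [3.64, 4.22, 4.69, 4.23, 3.71], [4.05, 4.47, 4.13, 4.52, 3.46],
    [3.30, 3.51, 3.24, 3.34, 3.20], [3.57, 4.52, 4.55, 4.73, 4.41],
    [4.04, 4.53, 4.54, 3.97, 4.43], [4.19, 4.58, 4.59, 4.62, 4.65],
    [3.65, 4.10, 4.32, 4.53, 3.85], [3.83, 4.62, 4.41, 4.49, 4.18],
    [3.62, 4.27, 4.26, 4.72, 3.91], [3.66, 4.30, 4.34, 4.24, 4.26],
    [3.82, 4.28, 4.21, 3.96, 4.13], [3.48, 4.67, 4.37, 4.35, 3.81],
    [4.13, 4.54, 4.64, 4.08, 4.24], [3.42, 4.32, 4.40, 4.29, 4.10],
    [3.60, 4.54, 4.72, 4.18, 3.69], [3.92, 4.70, 4.52, 3.98, 4.26],
    [3.85, 4.41, 4.55, 4.49, 4.57], [3.67, 4.79, 4.37, 5.00, 4.83],
    [3.08, 4.26, 3.78, 4.11, 3.96], [3.34, 4.72, 4.65, 3.43, 3.88],
]


def blocked_anova(data):
    """Two-way ANOVA without replication: rows are blocks, columns are treatments."""
    b, k = len(data), len(data[0])
    grand = sum(map(sum, data)) / (b * k)
    col_means = [sum(r[j] for r in data) / b for j in range(k)]
    row_means = [sum(r) / k for r in data]
    ss_total = sum((x - grand) ** 2 for r in data for x in r)
    ss_treat = b * sum((m - grand) ** 2 for m in col_means)
    ss_block = k * sum((m - grand) ** 2 for m in row_means)
    ss_error = ss_total - ss_treat - ss_block
    df_treat, df_error = k - 1, (b - 1) * (k - 1)
    f_stat = (ss_treat / df_treat) / (ss_error / df_error)
    return col_means, f_stat, df_treat, df_error


col_means, f_stat, df_treat, df_error = blocked_anova(rows)
for name, m in zip(codecs, col_means):
    print(f"{name:5s} {m:.2f}")
print(f"F({df_treat}, {df_error}) = {f_stat:.2f}")
```

The codec means and degrees of freedom should agree with the output above, and the F value should come out close to it; small differences are possible because the table is rounded to two decimals.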


Reply #131 – 2013-12-11 20:45:04

Quote from: IgorC on 2013-12-11 17:20:57

...Great. Please, inform yourself how the samples were picked for the last HA public test and then propose how You can improve that. ...

I guess you think I'm criticizing that test. I really don't.
What I was talking about is the intrinsic limitations in generalizing the test results (of any listening test) to the universe of music and listeners, especially if the test's outcome is measured by overall statistics - the same thing Greynol said. But I don't want to continue this discussion, as IMO everything has been said about it.


Reply #132 – 2013-12-11 21:25:00

Quote from: halb27 on 2013-12-11 20:45:04

I guess you think I'm criticizing that test. I really don't.

No, I don't think You're criticizing.

Please, understand me correctly. All I'm asking is to stop "shooting into the air" and start elaborating some possible solutions and working on particular parts, as Kamedo2 is now doing by providing real numbers. He is doing real work.

"Look, I have done some research and found that we should include these and those samples because of this and that. We should include x number of samples with p, q and r characteristics. According to that paper... " You know, make a real case.

So, I ask You, have You figured out how the sample selection was done during the last test? It would be a good place to start.


Reply #133 – 2013-12-12 00:06:49

Quote from: Kamedo2 on 2013-12-11 12:50:20

Quote from: Serge Smirnoff on 2013-12-11 02:26:43
I can't prove that, but I have intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music,

Even when the 'extremely huge and diverse' population of music that fluctuates between 1.0=Very Annoying and 5.0=Imperceptible, when we randomly pick 100 samples from the population, we can reliably determine the average of the 'extremely huge and diverse' population of music in a 0.1 accuracy, without ever testing the whole 'extremely huge and diverse' population of music.

Correct me if I'm wrong.
(1) Variance of overall means originates from two sources: variance of listeners' grades and variance of sound samples.
(2) In order to determine appropriate number of sound samples we should perform analysis of variance of means of sound samples for each codec.
(3) Some estimation of the appropriateness can be derived comparing confidence intervals of means of samples' means.
(4) More precisely required number of samples can be determined by means of, for example, Cohen tables, proceeding from desired power of test and significance level.

Is your rough estimation obtained with the (4)? If not, could you make rough calculations as I'm not sure I can do this correctly.


Reply #134 – 2013-12-12 01:07:22

Quote from: Garf on 2013-12-11 10:08:32

Quote
Stimuli at SE are presented without non-hidden reference, this affects results near the edge of transparency.

Is this demonstrable or is it your suspicion? I would worry that non-hidden reference adds loads of noise to the result, and makes it harder to draw conclusions, because of people ranking fake differences. Of course this is less of a factor if you have very many listeners.

Quote from: Kamedo2 on 2013-12-11 11:43:13

What happens if 50% of people distinguished and preferred the non-reference? It happens; http://slashdot.org/story/09/03/11/153205/...s-of-mp3-format

Exactly this happened in the SE @96 test: the tables of submitted grades show an increased number of 6-grades (confused reference), which are discarded. IMO this should not affect the final scores; it just prolongs the testing period.


Reply #135 – 2013-12-12 01:42:51

Quote from: Serge Smirnoff on 2013-12-12 00:06:49

Quote from: Kamedo2 on 2013-12-11 12:50:20
Quote from: Serge Smirnoff on 2013-12-11 02:26:43
I can't prove that, but I have intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music,

Even when the 'extremely huge and diverse' population of music that fluctuates between 1.0=Very Annoying and 5.0=Imperceptible, when we randomly pick 100 samples from the population, we can reliably determine the average of the 'extremely huge and diverse' population of music in a 0.1 accuracy, without ever testing the whole 'extremely huge and diverse' population of music.

Correct me if I'm wrong.
(1) Variance of overall means originates from two sources: variance of listeners' grades and variance of sound samples.
(2) In order to determine appropriate number of sound samples we should perform analysis of variance of means of sound samples for each codec.
(3) Some estimation of the appropriateness can be derived comparing confidence intervals of means of samples' means.
(4) More precisely required number of samples can be determined by means of, for example, Cohen tables, proceeding from desired power of test and significance level.

Is your rough estimation obtained with the (4)? If not, could you make rough calculations as I'm not sure I can do this correctly.

(1) true
(2) We won't know the variance of means before the test. Instead, imagine how much accuracy we need. 3.0=Slightly Annoying 4.0=Perceptible but not annoying 5.0=Imperceptible, so I feel it's accurate enough when we determine the average score by only 0.1 of error margin. (Can we imagine the difference between 3.3 and 3.4?)
(3) You mean the post-test evaluation?
(4) Rather, we want the SEM to be small enough to meet the requirement.
The rough estimation is done this way. First, the score is between 1.0 and 5.0, so the Standard Deviation (SD) can't be more than 2.0. An SD of 2.0 is highly unlikely, because the score would have to be either 1.0 or 5.0, each 50% of the time; since 1.0 = Very Annoying, the developers would be getting tons of bug reports. Let's say SD = 1.0. Standard Error of the Mean (SEM) = SD/sqrt(sample size). If we get 100 independent results, SEM = 1.0/sqrt(100) = 0.1, which is small enough.
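
The arithmetic in this estimate can be checked with a few lines. This is only a sketch of the formula quoted above; `sem` and `results_needed` are illustrative helper names, and independence of the results is assumed exactly as in the post.

```python
import math
from statistics import pstdev


def sem(sd, n):
    """Standard error of the mean for n independent results."""
    return sd / math.sqrt(n)


def results_needed(sd, target_sem):
    """Smallest n for which sd / sqrt(n) does not exceed target_sem."""
    return math.ceil((sd / target_sem) ** 2)


# Worst case on a 1.0-5.0 scale: half the scores at each extreme.
worst_case_sd = pstdev([1.0, 5.0])

print(worst_case_sd)                 # 2.0
print(sem(1.0, 100))                 # 0.1
print(results_needed(1.0, 0.1))      # 100
```

Halving the error margin needs four times as many independent results, since the sample size enters through a square root.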


Reply #136 – 2013-12-12 03:59:35

Quote from: halb27 on 2013-12-11 16:19:59

If we have say 20 samples it is possible that this represents the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know, no matter how hard we try to do a good job with sample selection. It can always be that there are tracks out there not represented in the test sample set which show that a specific encoder (maybe the winner in the test) can behave poorly.

There were some quality verification tests of MPEG formats (MP3, AAC) where the different kinds of signals were generally grouped into 3 big groups: transients (1), tonal (2) and stereo (3). Those are the most important groups.

Here is an example of a representative set of samples. All three groups have a similar number of samples.

Here is the set of samples from the public test (2011) and its classification.

It's also an option to enrich a set with some additional samples like applause, or mixed material such as different combinations (on top of each other and/or in sequence) of speech/singing + music.


Reply #137 – 2013-12-12 11:12:44

Igor, I have understood that very well, and I have no doubt that the sample selection for that test was done with care, according to the very reasonable principles you describe. So no problem at all with the test preparation or with conducting the test.

The problem is with statements about the test results, especially when based on strong statistical aggregation. We're simply not in the world of well-behaved probabilistic distributions, but what we're doing is more or less statistical sampling out of a world of black swans (as for sample selection, but there is more about the listeners as well, especially their sensitivity towards the various samples). Classical statistical analysis is misleading here. There's more statistical stuff like the clipping of the values at 5.0 which was mentioned here but which is ignored for the sake of getting simple test results. But to me it's all over-simplification. And there's much more. For instance the judgements of the listeners are certainly not invariant in space and time, especially when the deviation from the original is perceptible, but close to nothing, that is for judgements clearly better than 4.0. I can definitely say that for me. I'm certainly not the perfect listener, but as this applies to me, I'm pretty sure that there are listeners out there to whom this applies as well - maybe to a much lesser degree.

And there's nothing really bad about it: listening tests give important information about strengths and weaknesses of encoders. Quality just can't be put simply into a simple overall result of just one number, and things like confidence intervals are more than questionable here.

And to bring things back on topic: FhG AAC is a good candidate for your test, as is Apple AAC.


Reply #138 – 2013-12-12 12:05:11

Quote from: halb27 on 2013-12-12 11:12:44

The problem is with statements about the test results, especially when based on strong statistical aggregation. We're simply not in the world of well-behaved probabilistic distributions, but what we're doing is more or less statistical sampling out of a world of black swans (as for sample selection, but there is more about the listeners as well, especially their sensitivity towards the various samples). Classical statistical analysis is misleading here.

By your argument, no medicine would be possible, nor public healthcare, investigation of industrial pollution, or even public transportation. Black swans may exist, but they must occur less often than 1/sample_number to remain undetected, and they must be extremely unpleasant to affect the overall user experience, since they are rare. If they are extremely unpleasant, why haven't the developers received any reports like that?


Reply #139 – 2013-12-12 13:21:52

I bootstrapped the last 2011 public listening test of AAC encoders @ 96kbps (280 donated results, 20 samples) to plan this upcoming test.
The past data may not be precisely applicable to another future test, but you may get a 'sense' of 'How much effort do we need to bring the error margin down?' or 'Which plan is likely to make better use of the precious donated time?'. Enjoy!


Reply #140 – 2013-12-12 13:33:56

Taking the last test as a baseline, even if one assumes that the total number of votes does not increase when we increase the number of samples, there is still a benefit (smaller error) in doing so. E.g. doubling the number of samples and halving votes/sample still yields a smaller error. The case where votes/sample is also taken constant is even better. As long as we're on the steep part of that curve there is no harm in increasing the number of samples.
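
One way to see why this holds is a simple two-component variance model, in which the variance of the overall mean is sd_samples^2 / S + sd_listeners^2 / (S * V) for S samples and V votes per sample. This is an assumed model for illustration, not the bootstrap that produced the curve, and the two spread values below are hypothetical rather than fitted to the 2011 data.

```python
import math


def overall_se(n_samples, votes_per_sample, sd_samples, sd_listeners):
    """Standard error of the overall mean under the assumed two-component model."""
    between_samples = sd_samples ** 2 / n_samples
    between_listeners = sd_listeners ** 2 / (n_samples * votes_per_sample)
    return math.sqrt(between_samples + between_listeners)


# Hypothetical spreads, chosen only for illustration.
SD_SAMPLES, SD_LISTENERS = 0.4, 1.0

baseline = overall_se(20, 14, SD_SAMPLES, SD_LISTENERS)    # 280 votes in total
same_total = overall_se(40, 7, SD_SAMPLES, SD_LISTENERS)   # still 280 votes
more_votes = overall_se(40, 14, SD_SAMPLES, SD_LISTENERS)  # 560 votes

print(f"20 samples x 14 votes: {baseline:.3f}")
print(f"40 samples x  7 votes: {same_total:.3f}")
print(f"40 samples x 14 votes: {more_votes:.3f}")
```

With the total number of votes fixed, the listener term stays the same and only the between-sample term shrinks, so in this model spreading the votes over more samples never increases the error.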

But of course minimizing the error is only one part of the whole picture. As stated earlier, how to select the samples is a major point of debate.


Reply #141 – 2013-12-12 15:10:41

Thank You for your effort, Kamedo2. We will need extra time to analyse the statistics. It's all on the to-do list.

For now, the list of candidates has been updated.
Most members are interested in testing VBR mode, so the main goal of the test will be a comparison in this mode. In other words, the question is "how do certain codecs perform (quality-wise) in VBR mode at ~96 or 80 kbps?".

The list of votes so far:
1. Apple AAC - 17
2. Opus - 17
3. Vorbis - 8
4. MP3@128 - 8

Possible:
Fraunhofer AAC - 7
MP3@96 - 7

Probably won't be tested:
MPC - 2
WMA Pro - 1
WMA Standard - 0

Bitrate (kbps) :
96 - 13
80 - 8
48 - 1

December 18 is the deadline for submitting codecs. Then we will move on to bitrate verification, sample selection and especially the issues that Kamedo2 and halb27 have raised lately.


Reply #142 – 2013-12-12 15:38:10

I bootstrapped the last 2011 public listening test of Multiformat encoders @ 64kbps (531 donated results, 30 samples) as well.
The raw data is from here: http://www.hydrogenaudio.org/forums/index....showtopic=88033
Like I said, this data may not be precisely applicable to this new test, but maybe you can get the 'sense'.

Thank you IgorC for updating and maintaining the table.

For people who voted for 80kbps, I gently ask you to rethink.
80kbps is somewhat too low for AAC-LC (but too high for HE-AAC), and, as in the past 64kbps test and this test, Opus is likely to win.
http://listening-tests.hydrogenaudio.org/igorc/results.html
http://www.hydrogenaudio.org/forums/index....showtopic=97913


Reply #143 – 2013-12-12 16:54:38

Kamedo2, I don't know if you've noticed, but given your skills and experience, what you are doing is co-organizing. Great.


Reply #144 – 2013-12-12 17:04:13

IgorC, after Chris's reply my vote no longer goes to Fraunhofer/fhgaacenc but only to Fraunhofer/fdkaac. Why is fdk not in the list? Did I miss anything?

Updated:
AAC/Apple 96 VBR
Opus 1.1 96 VBR
Fraunhofer/fdkaac 96 VBR

Thanks.


Reply #145 – 2013-12-12 18:18:31

Eahm,

Please, be patient. I'm updating it from time to time.
It would be hard to include every codec at once, it would be a mess.
You mention it, it goes there.

I don't want to influence your choice, but it's worth mentioning that AFAIRC there was a comment stating that the Winamp flavor of FhG has the best quality compared to the others. Anyway, there is still value in testing the open source flavor too. It's up to you.


Reply #146 – 2013-12-12 18:19:08

Quote from: Kamedo2 on 2013-12-12 01:42:51

Quote from: Serge Smirnoff on 2013-12-12 00:06:49

Correct me if I'm wrong.
(1) Variance of overall means originates from two sources: variance of listeners' grades and variance of sound samples.
(2) In order to determine appropriate number of sound samples we should perform analysis of variance of means of sound samples for each codec.
(3) Some estimation of the appropriateness can be derived comparing confidence intervals of means of samples' means.
(4) More precisely required number of samples can be determined by means of, for example, Cohen tables, proceeding from desired power of test and significance level.

Is your rough estimation obtained with the (4)? If not, could you make rough calculations as I'm not sure I can do this correctly.

(1) true
(2) We won't know the variance of means before the test. Instead, imagine how much accuracy we need. 3.0=Slightly Annoying 4.0=Perceptible but not annoying 5.0=Imperceptible, so I feel it's accurate enough when we determine the average score by only 0.1 of error margin. (Can we imagine the difference between 3.3 and 3.4?)
(3) You mean the post-test evaluation?
(4) Rather, we want the SEM to be small enough to meet the requirement.
The rough estimation is done this way. First, the score is between 1.0 and 5.0, so the Standard Deviation (SD) can't be more than 2.0. An SD of 2.0 is highly unlikely, because the score would have to be either 1.0 or 5.0, each 50% of the time; since 1.0 = Very Annoying, the developers would be getting tons of bug reports. Let's say SD = 1.0. Standard Error of the Mean (SEM) = SD/sqrt(sample size). If we get 100 independent results, SEM = 1.0/sqrt(100) = 0.1, which is small enough.

Assuming SD = 1.0 and results = 100, we can go a bit further and calculate the confidence interval of the mean M for sound samples, which is [M - 2*SEM, M + 2*SEM]. So the width of this 95% interval is 0.4 units (of score). Such an interval allows us to reliably discern means that differ by >= 0.3 units (allowing 25% overlap).

Using Cohen tables for determining the number of samples gives an even higher minimum discernible distance between means: >= 0.46 (the assumptions are as follows: SD = 1.0, results = 100, significance level = 0.05, power of test = 0.8; the Cohen table is for the case of a two-group t-test).

In order to determine the (representative) number of sound samples, we should at least choose the size of the confidence interval of the mean of the sound samples' means. Should it be approximately equal to the confidence intervals of the samples' means? In other words, should the accuracy of estimating sample means (variance of listeners) be equal to the accuracy of estimating the mean of those sample means (variance of sound samples)? In general, how do we address the uncertainty about overall means caused by the variance of samples?
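
The interval and power figures in this post can be approximated in a few lines. This is a sketch under a normal approximation for a two-sided, two-group comparison, not a lookup in the Cohen tables themselves; `min_detectable_difference` is an illustrative helper, and how the 100 results are split between groups is an assumption that changes the answer.

```python
import math
from statistics import NormalDist

SD, N = 1.0, 100
sem = SD / math.sqrt(N)
ci_width = 4 * sem  # width of [M - 2*SEM, M + 2*SEM]


def min_detectable_difference(sd, n_per_group, alpha=0.05, power=0.8):
    """Smallest mean difference detectable by a two-group test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sd * math.sqrt(2 / n_per_group)


print(f"CI width: {ci_width:.2f}")
print(f"100 results per group: {min_detectable_difference(SD, 100):.2f}")
print(f" 50 results per group: {min_detectable_difference(SD, 50):.2f}")
```

With 100 results in each group the approximation gives roughly 0.40, and the value grows as the groups get smaller, so the exact table figure depends on the convention used for the group sizes and on one-sided versus two-sided testing.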


Reply #147 – 2013-12-12 19:10:49

Quote from: IgorC on 2013-12-12 18:18:31

Winamp flavor of FhG has the best quality compared to the others. Anyway, there is still value in testing the open source flavor too. It's up to you.

Yes, I understand the Winamp flavor (let me call it fhgaacenc for simplicity) is "better" than the others, but I want to see a real test of how much effort they really put into the open source one.

I thought it had been removed for some reason I missed, because if I remember correctly I wasn't the only one asking for it, and I didn't see it in the list.

Thanks for YOUR patience


Reply #148 – 2013-12-12 19:28:48

Although a test @80kbps would be interesting, testing @96kbps is more useful. So my votes are:
- Opus 1.1 @96kbps VBR
- Apple AAC-LC @96kbps TVBR
- FhG AAC-LC @96kbps VBR


Reply #149 – 2013-12-13 13:33:17

Quote from: halb27 on 2013-12-12 11:12:44

The problem is with statements about the test results, especially when based on strong statistical aggregation. We're simply not in the world of well-behaved probabilistic distributions, but what we're doing is more or less statistical sampling out of a world of black swans (as for sample selection, but there is more about the listeners as well, especially their sensitivity towards the various samples). Classical statistical analysis is misleading here.

I have no idea where you get this from. Not even remotely. "We can discard physics because I say so!"

What makes you believe this analysis is concerned with exceptionally rare events?

Quote

There's more statistical stuff like the clipping of the values at 5.0 which was mentioned here but which is ignored for the sake of getting simple test results.

This is a patently false claim, which just illustrates you haven't looked at past discussions and you have no actual idea what you're talking about. We use bootstrap analysis in addition to ANOVA exactly because of this.

Please, give actual arguments. Right now you're just hand-waving with wrong assertions, and I'm not waving back.
