New Public Multiformat Listening Test (Jan 2014)
Reply #160 – 2013-12-13 22:40:24
If I interpret this correctly, instead of using 20 samples and a bunch (~14) of listeners, we could've used 65 samples with 2 listeners and gotten an equally accurate result (though one far less useful for the developers) with less than half the effort? That's pretty mind-blowing.

The point is that you get the same accuracy for the overall result (all samples taken together), while using only 2 listeners makes you lose significant information on a per-sample basis. But if the only question is "how can I minimize the error of the overall result?", i.e. find the best encoder on average, you can easily disregard that information.

So, semi-intuitively, this result seems understandable to me, but still mind-blowing indeed. That's statistics. :-)

That's why I advocate doing statistics only for each sample and leaving the interpretation towards overall quality to the user. I think this represents reality best, especially as the outcome for the various samples doesn't have the same meaning to every user. A person who is very sensitive to transients, for instance, will give those samples a much stronger weight than a person who is pretty insensitive to them.

I love the diagrams where the samples are shown on the x-axis and their average (and maybe further statistical) outcome on the y-axis, with the outcome for each encoder shown in a different color. It shows it all at a glance without any over-simplification. Even for readers who don't want to go into much detail, this diagram shows which encoders are attractive to use and which are not. Most important: this way, the information on per-sample performance is kept rather than aggregated into a single average plus additional statistical figures whose exact meaning is hardly understood by anybody, turning us all into believers.
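To make the variance argument concrete, here's a rough sketch assuming a simple random-effects model (rating = encoder quality + sample effect + listener effect + noise). The variance components are made-up illustrative numbers, not figures from this test:

```python
import math

# Illustrative variance components (assumed, not measured in the test)
var_sample   = 0.30   # sample-to-sample variance (typically the dominant term)
var_listener = 0.02   # listener-to-listener variance
var_noise    = 0.10   # residual per-rating noise

def se_overall_mean(n_samples, n_listeners):
    """Standard error of the grand mean across all samples and listeners."""
    return math.sqrt(var_sample / n_samples
                     + var_listener / n_listeners
                     + var_noise / (n_samples * n_listeners))

def se_per_sample_mean(n_listeners):
    """Standard error of a single sample's mean; depends only on listener count."""
    return math.sqrt((var_listener + var_noise) / n_listeners)

# 20 samples x 14 listeners (280 ratings) vs 65 samples x 2 listeners (130 ratings)
print(se_overall_mean(20, 14))   # ~0.13
print(se_overall_mean(65, 2))    # ~0.12  -> comparable overall accuracy, half the effort
print(se_per_sample_mean(14))    # ~0.09
print(se_per_sample_mean(2))     # ~0.24  -> per-sample information is largely lost
```

With numbers like these, the overall standard error barely changes, but the per-sample error roughly triples with only 2 listeners, which is exactly the trade-off described above.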
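And for the diagram: a minimal matplotlib sketch of that kind of plot, samples on the x-axis, per-sample mean rating on the y-axis, one color per encoder. The encoder names, means, and intervals below are invented placeholders, not results from the test:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
samples = [f"sample{i:02d}" for i in range(1, 21)]
encoders = ["Encoder A", "Encoder B", "Encoder C"]

fig, ax = plt.subplots(figsize=(10, 4))
for enc in encoders:
    # Fake per-sample mean ratings on a 1..5 scale and fake confidence intervals
    means = np.clip(rng.normal(4.2, 0.4, len(samples)), 1, 5)
    ci = rng.uniform(0.1, 0.3, len(samples))
    ax.errorbar(samples, means, yerr=ci, marker="o", linestyle="-", label=enc)

ax.set_ylabel("mean rating (1-5)")
ax.set_ylim(1, 5)
ax.legend()
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
```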