Topic: New Public Multiformat Listening Test (Jan 2014)

New Public Multiformat Listening Test (Jan 2014)

Reply #100
There has been progress in codec development since the last test, hasn't there?
lame3995o -Q1.7
opus --bitrate 140

Reply #101
You can download two versions of Winamp and run ABX on at least 2-3 samples encoded by the Fraunhofer AAC encoder.

Now that will really help to organize a test.

I'm not mad at Fraunhofer; it's an excellent encoder in my opinion. And I really mean it.
I'm desperate because nobody wants to corroborate whether there were quality changes at 96 kbps, what to expect, etc.


P.S. hey, why don't we "re-rrrun" to see if some of the MP3 encoders from here http://listening-tests.hydrogenaudio.org/s...8-1/results.htm could flip out?

Reply #102
Well, sorry, but:
Were there improvements (not bugfixes) that improved the audible quality of your AAC encoder at 96 kbps in the last 2 years?

[quote author=eahm link=msg=0 date=]Chris, where was the tuning performed? In which bit rate range, if I may ask?[/quote]
Counter-question: was the quality of Apple's AAC encoder improved over the last two years, and on which samples? Do you understand why I'm asking this?

Answers to the original question were given here a few months back and here a few weeks back. These posts say it all. The remaining doubts were cleared up by Igor's further comment (the one with the Quasimodo sample). I already re-tuned the encoder while the 2011 HA test was still running.

Why should I tell you which samples were improved by my tunings? Judging from my (unfortunately unpleasant) experience with the preparation of the 2011 test (where I mentioned samples on which the Fraunhofer encoder does well), I fear this would have an influence on whether these samples would be considered for inclusion in the test or not. After all, we apparently don't know on which samples Apple's encoder improved, knowledge which is necessary for fairness and which brings me back to the above question. Edit: Actually, I don't think such questions should be asked at all in a discussion of selection of codecs.

Anyway, the 64-kbps and 96-kbps SoundExpert tests give you a hint which samples the Winamp encoder handles quite well and which the Opus encoder doesn't.

By the way, the overall ranking of the 64-kbps and 96-kbps SoundExpert tests is nearly identical, which indicates that it can't be that wrong. Of course their sample selection is debatable and radically different from the HA tests', but concluding that "the SE test must be wrong" is a bit unfair IMHO. For the record, I got relatively similar results in internal MUSHRA tests with the same samples (Opus scored a bit better due to different bit-rate calibration).

Honestly, I really don't know what to make of this discussion, and I seriously considered leaving it after greynol - a moderator - was addressed with "grow up" and something like "Mr. know-it-all" yesterday (Edit: apparently deleted now, but the deletion doesn't undo it). I only give this last reply because Igor and eahm directly addressed me with a question.

Now my personal wish list, in case anybody cares: I have absolutely no interest in seeing a comparison between two AAC encoders at 96 kbps, and I certainly don't care which AAC encoder should be used (since, like I replied to Garf's statement, I'm sure of what the result would be). I'd rather like to see Opus 1.0.? compared against 1.1 as backup for the claim of "significantly improved encoding quality, especially for variable-bitrate (VBR)". I think HA is the logical place for such a test.

Chris
If I don't reply to your reply, it means I agree with you.

Reply #103
...P.S. hey, why don't we "re-rrrun" to see if some of the MP3 encoders from here...

Nobody has ever asked for a re-run of a listening test.
You want to organize a new listening test and you asked for codec proposals. Sure @96 kbps AAC plays a major role, and IMO the last AAC listening test does not imply that only Apple AAC is worth testing. That's what my contribution (not only mine) here was about.

But in the end it's best just to collect wishes from HA users, and leave any personal background for or against certain codecs aside.
In case HA users want FhG AAC to participate, it should be done IMO. In case there's no interest for it, it should be left out.
lame3995o -Q1.7
opus --bitrate 140

Reply #104
Why should I tell you which samples were improved by my tunings?

What's wrong with that?  The sample selection is automatic and human-independent anyway.

Judging from my (unfortunately unpleasant) experience with the preparation of the 2011 test (where I mentioned samples on which the Fraunhofer encoder does well)

If we would accept your samples then Apple developers would be yelling at us.
But anyway, you can punch me. I understand you. It's your encoder that came in second.

Once we have finished the future public tests at 64-96 kbps, we'll most likely go to lower rates like 32-48 kbps, where most probably (HE)-AAC family will beat Opus, Vorbis. Then Xiph developers will start to punch me.
It's kinda already fun for me.

Anyway, the 64-kbps and 96-kbps SoundExpert ...

Haaa ...?

Reply #105
Kamedo2,

Thank You for posting the graphs. What should we look at?

Reply #106
The type of listeners is already a bias as stated in previous posts in this thread, and one which neither of us can get around. If the selection of samples has an influence, that means a bad bias in their selection that invalidates the test (and it's the exact problem I have with your test!).

I can't prove that, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music, especially taking into account the usual practice of using killer samples in codec listening tests. The population of music is extremely huge and diverse. So, selection of samples is also a bias, like selection of listeners, in my opinion.

The practice of targeting the bitrate of VBR codecs using a big music library is not ideal either. The bitrates depend on the proportions of the various music genres the library consists of, and the resulting bitrates of different codecs react differently to changes in those proportions. As a result, the choice of codec settings is also arbitrary to some extent. Moreover, this choice turned out to be completely unrelated to the sound material actually used in the listening test. This was discussed a lot; other solutions (including the SE one) have drawbacks of their own, and the consensus was that this approach is reasonable and valid. But it is not the only one, and there is no indication that it is the best.
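
A minimal sketch of the library-composition point above. The per-genre bitrates and the two encoders (`codec_a`, `codec_b`) are invented purely for illustration; none of the numbers come from a real encoder or from the HA tests.

```python
# Hypothetical average bitrates (kbps) per genre at one fixed VBR setting.
# All figures are invented for illustration only.
GENRE_BITRATE = {
    "codec_a": {"classical": 80.0, "rock": 104.0, "electronic": 100.0},
    "codec_b": {"classical": 92.0, "rock": 98.0, "electronic": 96.0},
}


def library_bitrate(codec, genre_share):
    """Duration-weighted average bitrate of `codec` over a library.

    genre_share maps genre -> fraction of total playing time (sums to 1).
    """
    rates = GENRE_BITRATE[codec]
    return sum(share * rates[genre] for genre, share in genre_share.items())


# Two hypothetical calibration libraries with different genre proportions.
rock_heavy = {"classical": 0.1, "rock": 0.6, "electronic": 0.3}
classical_heavy = {"classical": 0.6, "rock": 0.3, "electronic": 0.1}

for name, library in [("rock-heavy", rock_heavy), ("classical-heavy", classical_heavy)]:
    a = library_bitrate("codec_a", library)
    b = library_bitrate("codec_b", library)
    print(f"{name}: codec_a={a:.1f} kbps, codec_b={b:.1f} kbps")
```

With these invented figures the same setting puts codec_a above codec_b on one library and below it on the other, so settings matched to a common target on one library would not stay matched on the other.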

I'm pretty sure there is no listening test design that produces final results. Because of those assumptions, conventions and compromises, any listening test shows only part of the whole picture. Any such test could be perfectly repeatable if it follows the same methodology and corresponding test design. There are simply valid variations of the same methodology which could affect the result. I think if you repeat that HA @96 listening test with different samples (I'm not sure about the representativeness of any such set of samples) and a different way of calculating target bitrates (having different pros and cons), the results will not be the same; some tied contenders could easily change places.

But actually I don't recommend doing this. Quite the contrary, I have the impudence to give you advice: follow your methodology, which is well established, valid, elaborated inside the HA community and thus accepted by its majority. But, please, stay away from claiming your results are the final word in the comparison of codecs. Such claims are ungrounded and unproductive. Exactly because of this hard-edged approach, the initial discussion turned into a hysterical defense of the HA sacred cow: listening test results. These results are not ideal but perfectly useful, and I am very interested in them because they help both to verify SE results (indirectly, though) and to better understand the limits of the SE methodology.

Conducting listening tests with a strict design was never a goal of SE. SE moves from the opposite side: first of all it offers a version of blind listening tests designed to be as simple as possible for ordinary listeners, and afterwards it derives as much information as possible from the collected grades. So I'm perfectly aware of the shortcomings of the SE methodology, and yet I still think it produces helpful results, less accurate but valid.

Quote
The way the stimuli are presented shouldn't affect the result. If it does, that's a flaw again. But you're not amplifying artifacts any more, right?
Stimuli at SE are presented without a non-hidden reference; this affects results near the edge of transparency. Amplification of artifacts was never used for codecs below 100 kbit/s, not a single time since the beginning in 2001; there is simply no need for it at low bitrates.

@IgorC
I think that your listening test agenda should not depend on external and unrelated factors such as SE, its results and possible advocates.
keeping audio clear together - soundexpert.org

Reply #107
Kamedo2,

Thank You for posting the graphs. What should we look at?

The error bars are bigger than the official ones. I also performed an ANOVA over the 20 samples, and the result was far less clear-cut than the official one.
I noticed that .zip/Analysis/results_AAC_2011.txt pastes all 280 individual results in a flat format, and the analysis was made as if there were 280 independent samples.
I have to say that's an incorrect statistical procedure. So I retract my past post that said the likelihood of FhG beating Apple is very small.

There's a minor possibility that FhG wins over Apple. Still, it's a multiformat listening test and I'd prefer to see the AAC-AAC battle in a separate public test rather than in this one.

And the 20 samples are indeed a statistical bottleneck. The number of samples is small, and the precision is likely to improve if we double it.
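
The gap between the two error-bar calculations can be illustrated with synthetic data. The sketch below assumes 20 samples with 14 ratings each (280 in total, an even split chosen only for simplicity) and an invented score model; it is not the HA analysis and uses none of the real results.

```python
import math
import random
import statistics


def simulate_ratings(n_samples=20, n_listeners=14, seed=1):
    """Synthetic ratings: ratings[s][l] for sample s and listener l."""
    rng = random.Random(seed)
    ratings = []
    for _ in range(n_samples):
        # Assumed per-sample difficulty, shared by every rating of that sample.
        sample_effect = rng.gauss(0.0, 0.5)
        ratings.append([3.5 + sample_effect + rng.gauss(0.0, 0.5)
                        for _ in range(n_listeners)])
    return ratings


def half_width(values, z=1.96):
    """Approximate 95% half-width of the mean, treating values as independent.

    z = 1.96 is the normal approximation; for only 20 values a t quantile
    (about 2.09) would be slightly wider.
    """
    return z * statistics.stdev(values) / math.sqrt(len(values))


ratings = simulate_ratings()
flat = [r for row in ratings for r in row]               # 280 "independent" results
per_sample = [statistics.mean(row) for row in ratings]   # 20 sample means

flat_hw = half_width(flat)
sample_hw = half_width(per_sample)
print(f"flat (n={len(flat)}): +/- {flat_hw:.3f}")
print(f"per-sample (n={len(per_sample)}): +/- {sample_hw:.3f}")
```

Under these assumptions the per-sample half-width comes out clearly wider than the flat one, because ratings of the same sample are correlated. Since the half-width scales roughly with one over the square root of the number of samples, doubling the samples would shrink it by a factor of about 1.4.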

Reply #108
Honestly, I really don't know what to make of this discussion, and I seriously considered leaving it after greynol - a moderator - was addressed with "grow up" and something like "Mr. know-it-all" yesterday (Edit: apparently deleted now, but the deletion doesn't undo it). I only give this last reply because Igor and eahm directly addressed me with a question.
FWIW, I moved offensive statements which didn't concern the original topic into the recycle bin, so nothing was really deleted. I just tried to sanitize this thread and wanted to avoid people picking up on these offensive statements, which wouldn't further the discussion.
It's only audiophile if it's inconvenient.

Reply #109
As far as I understand, how far we can "generalize" things depends on what is called external validity.

Reply #110
What's wrong with that?  The sample selection is automatic and human-independent anyway.

Yes, but I meant that the samples I thought of weren't even included in the pool from which the samples were randomly drawn.

Quote
If we would accept your samples then Apple developers would be yelling at us.

Exactly, Igor. Which is why I fear the same thing would happen in this test if I tell you which samples I tuned, or to prevent such yelling, you'd have to exclude the sample I mention. So I won't tell you. And no, Igor, I'm not punching you.

Quote
... lower rates like 32-48 kbps, where most probably (HE)-AAC family  will beat Opus, Vorbis. Then Xiph developers will start to punch me.

Why should they? Opus already won by some margin at 64 kbps, why are you so sure that HE-AAC would win there? That's why I would like a listening test at 48 kbps: to show me which coder wins (or is tied to another)!

Quote
Haaa ...?

[quote author=Serge Smirnoff link=msg=0 date=]I think that your listening test agenda should not depend on external and unrelated factors such as SE, its results and possible advocates.[/quote]
True, true. So forget what I said about the SE test. We're at HA.

Chris
If I don't reply to your reply, it means I agree with you.

Reply #111
The error bars are bigger than the official ones. I also performed an ANOVA over the 20 samples, and the result was far less clear-cut than the official one.
I noticed that .zip/Analysis/results_AAC_2011.txt pastes all 280 individual results in a flat format, and the analysis was made as if there were 280 independent samples.
I have to say that's an incorrect statistical procedure.


I agree here BTW. The past tests had an issue that the results were merged per-sample before doing the analysis, but this loses the information on the variability of the listeners and makes the test lose all power (it's the same as if one person would take the test). The fix was to keep all results, but this conflates the variability of the listeners and the samples. The bootstrap tool should be fixed to block over both samples and listeners instead of sample-listeners to give correct results with our test format.
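
One way to block over both factors is to resample samples and listeners separately and then cross the two draws. The sketch below is only that: a possible scheme on synthetic data, assuming a complete grid in which every listener rated every sample (which real public tests do not guarantee). It is not the HA bootstrap tool, and the function names are made up here.

```python
import random


def two_way_bootstrap_means(ratings, n_boot=2000, seed=0):
    """Bootstrap the grand mean by resampling samples AND listeners.

    ratings[s][l] is the score of listener l on sample s (complete grid assumed).
    """
    rng = random.Random(seed)
    n_samples, n_listeners = len(ratings), len(ratings[0])
    means = []
    for _ in range(n_boot):
        # Draw whole samples and whole listeners with replacement...
        s_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        l_idx = [rng.randrange(n_listeners) for _ in range(n_listeners)]
        # ...and use every rating at the intersection of the two draws.
        total = sum(ratings[s][l] for s in s_idx for l in l_idx)
        means.append(total / (n_samples * n_listeners))
    return means


def percentile_interval(values, lo=0.025, hi=0.975):
    """Simple percentile interval from a list of bootstrap replicates."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(lo * last)], ordered[round(hi * last)]


# Synthetic ratings with a per-sample and a per-listener effect (all invented).
rng = random.Random(42)
sample_fx = [rng.gauss(0.0, 0.5) for _ in range(20)]
listener_fx = [rng.gauss(0.0, 0.3) for _ in range(14)]
ratings = [[3.5 + s + l + rng.gauss(0.0, 0.4) for l in listener_fx]
           for s in sample_fx]

grand_mean = sum(sum(row) for row in ratings) / (20 * 14)
boot_means = two_way_bootstrap_means(ratings)
low, high = percentile_interval(boot_means)
print(f"grand mean {grand_mean:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Resampling individual sample-listener pairs instead would treat all 280 ratings as exchangeable, which is the conflation described above.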

Reply #112
I can't prove that, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music, especially taking into account the usual practice of using killer samples in codec listening tests. The population of music is extremely huge and diverse. So, selection of samples is also a bias, like selection of listeners, in my opinion.


You're claiming classical statistics is wrong?

That said, I agree on the concerns regarding *our* sample selection. We use problem samples so it's clearly biased.

More practically, we don't have an entire library of music available on which we can make a truly random choice. Ideally, we draw random numbers out of the entire (for example) Spotify catalog and test those samples.

Maybe we can come close to that: We get a list of all songs from musicbrainz (for example), someone makes a program which outputs a list of randomly picked songs + 30s excerpts (musicbrainz has duration info so it's possible), publicizes the list, and we start looking from the top if anyone actually has the CD so we can get the sample.

This would still bias towards more popular music, but a) we can probably live with that as it's arguably a wanted bias b) it's better than what we do now.
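
The picking program described above could look roughly like this. It is a sketch under assumptions: the catalog is a plain list of (track id, duration in seconds) pairs built here from dummy data, no real MusicBrainz API or data format is used, and `pick_excerpts` is a name invented for this example.

```python
import random

EXCERPT_SECONDS = 30


def pick_excerpts(catalog, n_picks, seed):
    """Return n_picks random (track_id, start, end) excerpts.

    catalog: list of (track_id, duration_seconds) pairs (hypothetical format).
    Publishing the seed lets anyone regenerate and verify the same list.
    """
    rng = random.Random(seed)
    # Tracks shorter than the excerpt length cannot supply a full excerpt.
    eligible = [(tid, dur) for tid, dur in catalog if dur >= EXCERPT_SECONDS]
    chosen = rng.sample(eligible, n_picks)  # uniform over tracks, no repeats
    picks = []
    for tid, dur in chosen:
        # Random start such that the whole excerpt fits inside the track.
        start = rng.randint(0, int(dur) - EXCERPT_SECONDS)
        picks.append((tid, start, start + EXCERPT_SECONDS))
    return picks


# Dummy catalog: 1000 tracks with made-up durations between 20 and 419 seconds.
catalog = [(f"track-{i:04d}", 20 + (i * 37) % 400) for i in range(1000)]
picks = pick_excerpts(catalog, n_picks=5, seed=2014)
for tid, start, end in picks:
    print(f"{tid}: {start}s-{end}s")
```

The draw here is uniform per track; the popularity bias mentioned above would only enter afterwards, when picks are skipped because nobody has the CD.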

Reply #113
I think if you repeat that HA @96 listening test with different samples (I'm not sure about the representativeness of any such set of samples) and a different way of calculating target bitrates (having different pros and cons), the results will not be the same


Agree on that wrt samples. The way of calculating the target bitrate is a choice. I believe it's the correct one if we don't assume people will reconfigure their encoder for every specific song they encode. If you agree with that assumption, I'd like to see a concrete proposal of another methodology that would be valid or an argument why ours isn't.

Quote
But, please, stay away from claiming your results are the final word in the comparison of codecs. Such claims are ungrounded and unproductive.


We understand that our tests have flaws which influence the result and introduce error. I think we've done a lot to eliminate them as much as possible.

The problem is people arguing: if you repeat a test you get a different result anyway. This is wrong thinking. This is only true if the test has flaws. That should be the goal of the discussion: to point out and figure out how to eliminate as many flaws as possible. If you can point out a flaw, you have an argument why a repeat test will give a different result and the result that was posted isn't definite. If you just say you will get a different result without a valid reason, you're misunderstanding statistics.

The main valid point I've seen raised here was sample selection. That's good. We can try to move to the next level there.

Quote
Stimuli at SE are presented without a non-hidden reference; this affects results near the edge of transparency.


Is this demonstrable or is it your suspicion? I would worry that non-hidden reference adds loads of noise to the result, and makes it harder to draw conclusions, because of people ranking fake differences. Of course this is less of a factor if you have very many listeners.


Reply #115
I can't prove that, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music ...

This assumption is false. Any of the developers or people involved in the tests can tell you that.

Reply #116
I can't prove that, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music,

Even when scores across the 'extremely huge and diverse' population of music fluctuate between 1.0=Very Annoying and 5.0=Imperceptible, if we randomly pick 100 samples from that population, we can reliably determine the average of the 'extremely huge and diverse' population of music to an accuracy of about 0.1, without ever testing the whole 'extremely huge and diverse' population of music.
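
How close 100 random picks get depends on how much scores vary from track to track: the standard error of the mean is the population standard deviation divided by the square root of the number of picks. The simulation below is a sketch on a synthetic population whose spread (standard deviation of about 0.5 around 4.0) is an assumption chosen for illustration, not a measured figure.

```python
import random
import statistics

rng = random.Random(7)

# Synthetic "population": one true score per track, clipped to the 1.0-5.0 scale.
population = [min(5.0, max(1.0, rng.gauss(4.0, 0.5))) for _ in range(100_000)]
true_mean = statistics.fmean(population)


def mean_of_random_draw(n):
    """Average score of n tracks drawn at random without replacement."""
    return statistics.fmean(rng.sample(population, n))


# Repeat the 100-track experiment many times and see how far off it lands.
errors = [abs(mean_of_random_draw(100) - true_mean) for _ in range(1000)]
within_01 = sum(e <= 0.1 for e in errors) / len(errors)
standard_error = statistics.pstdev(population) / 100 ** 0.5

print(f"population mean: {true_mean:.3f}")
print(f"theoretical standard error for n=100: {standard_error:.3f}")
print(f"share of 100-track draws within 0.1 of the mean: {within_01:.1%}")
```

With a wider spread of scores the same 100 draws give a proportionally wider error, so the 0.1 figure holds only for populations about as homogeneous as the one assumed here.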

Reply #117
Exactly, Igor. Which is why I fear the same thing would happen in this test if I tell you which samples I tuned, or to prevent such yelling, you'd have to exclude the sample I mention...

Agree

Quote
why are you so sure that HE-AAC would win there?

Well, HE-AAC is very efficient at 32-48 kbps. While I'm not 100% sure what can happen in a public test, personally I prefer HE-AAC in this bitrate range.

Reply #118
For the latest posts of Kamedo2, IgorC, Serge Smirnoff:

I think this is the very problem.
If we have say 20 samples it is possible that this represents the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know no matter how hard we try to do a good job with sample selection. It can always be that there are tracks out there not represented in the test sample set which show that a specific encoder (maybe the winner in the test) can behave poorly.

As soon as you accept that a listening test has an important but necessarily limited meaning, everything is fine. Of course the test should be conducted with best effort to do things right. But I've always hated rating encoders purely by statistical analysis and then thinking that, if this is done correctly (not always the case, as we have seen in this thread), we know with scientific precision that encoder A is better than B.

For encoder choice the formal statistics of average and confidence interval often have no meaning. I'm thinking of the last mp3@128kbps test. In the light of overall average and confidence intervals all the encoders were tied. But looking at the outcome for the individual samples Lame 3.97, iTunes and - to a minor degree - Fraunhofer showed some noticeable weaknesses for some samples. So without information from outside the test it is not very reasonable to choose one of these encoders. From the listening test alone only Lame 3.98.2 and Helix remain as the practical candidates for encoder choice.
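
The difference between judging by the overall average and judging by per-sample weaknesses can be sketched as follows. The two encoders and all scores are invented for illustration; they are not results from the mp3@128kbps test or any other test, and the threshold of 3.0 for a "bad" sample is an arbitrary assumption.

```python
import statistics

# Invented per-sample mean scores (1.0-5.0 scale) for two hypothetical encoders.
scores = {
    "encoder_x": [4.0, 4.1, 3.9, 4.0, 4.1, 3.9, 4.0, 4.0, 4.1, 3.9],
    "encoder_y": [4.5, 4.6, 4.4, 4.5, 2.4, 4.5, 4.6, 2.5, 4.5, 4.5],
}


def summarize(values, bad_threshold=3.0):
    """Overall average plus two simple per-sample weakness indicators."""
    return {
        "mean": statistics.fmean(values),
        "worst": min(values),
        "bad_samples": sum(v < bad_threshold for v in values),
    }


summary = {name: summarize(vals) for name, vals in scores.items()}
for name, s in summary.items():
    print(f"{name}: mean={s['mean']:.2f} worst={s['worst']:.1f} bad={s['bad_samples']}")
```

In this made-up case encoder_y has the slightly higher average while encoder_x never drops below the threshold, so the two criteria point to different encoders.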
lame3995o -Q1.7
opus --bitrate 140

Reply #119
The error bars are bigger than the official ones. I also performed an ANOVA over the 20 samples, and the result was far less clear-cut than the official one.
I noticed that .zip/Analysis/results_AAC_2011.txt pastes all 280 individual results in a flat format, and the analysis was made as if there were 280 independent samples.
I have to say that's an incorrect statistical procedure.


I agree here BTW. The past tests had an issue that the results were merged per-sample before doing the analysis, but this loses the information on the variability of the listeners and makes the test lose all power (it's the same as if one person would take the test).

It's noticeably better than if one person took the test, and I'm not so pessimistic as to call it 'losing all power'. The error bar is about +/- 0.2 in size, which is enough to get a rough idea of the quality.

Reply #120
I think this is the very problem.
If we have say 20 samples it is possible that this represents the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know no matter how hard we try to do a good job with sample selection. It can always be that there are tracks out there not represented in the test sample set which show that a specific encoder (maybe the winner in the test) can behave poorly.

I believe you are too anxious. I tend to spend a lot of time listening to encoded music rather than the collection of WAVs on my HDD. The reason is to report defects to the developer(s) if anything goes wrong. I've already sent a dozen problematic samples to a developer of FFmpeg's native AAC encoders. You don't get any reports, because nothing has gone wrong.
If you're still worrying, read this: http://scienceblogs.com/cognitivedaily/200...-dont-understa/

Reply #121
Quote
If we would accept your samples then Apple developers would be yelling at us.

Exactly, Igor. Which is why I fear the same thing would happen in this test if I tell you which samples I tuned, or to prevent such yelling, you'd have to exclude the sample I mention. So I won't tell you. And no, Igor, I'm not punching you.

Well, accepting those samples would surely make the test dubious in terms of fairness, which is indeed a bad thing, but do codec developers really yell about it?
I guess samples where company A performs worse than others will be more useful to company A's developers than samples where company A performs quite well, and I can even imagine that a developer might be able to "steal" something from the others when they use the same codec... but of course I'm not a codec developer and I could be completely wrong.

Reply #122
I think this is the very problem.
If we have say 20 samples it is possible that this represents the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know no matter how hard we try to do a good job with sample selection. It can always be that there are tracks out there not represented in the test sample set which show that a specific encoder (maybe the winner in the test) can behave poorly.

Great. Please inform yourself about how the samples were picked for the last HA public test and then propose how you can improve on that.

Study these 20 samples by:
- type of content
- type of possible artifact
- music style if it was a music sample
- ...

It wasn't just a casual choice. 

Thank You. That will help.

Reply #123
It's noticeably better than if one person took the test, and I'm not so pessimistic as to call it 'losing all power'.


I'm not sure what you are talking about here, but I think you completely misunderstood what I pointed out. If you squash all results per sample *before doing the analysis*, you have *20* results, not *280* as your graph shows. This is exactly the same input as if one person had taken the test. All the information about variability that you get from multiple listeners is forever gone. You might get lucky in that there is now less variability than with an actual test with one person, but how can you even tell?

Reply #124
I think this is the very problem.
If we have say 20 samples it is possible that this represents the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know no matter how hard we try to do a good job with sample selection.


Why not? A random sample drawn from the population without bias is fine and sufficient. This isn't a case of "failure no matter how hard you try". Why would it? Statistical sampling isn't magic. It's well-understood but just non-trivial to pull off. You're now the second person to make this claim even though it runs directly contrary to well-established mathematics. You *can* correctly infer population statistics from a random, non-biased sample. There's no point in claiming something else. If you want to show it's not possible, you should go collect your Fields Medal in the process.

Quote
It can always be that there are tracks out there not represented in the test sample set which show that a specific encoder (maybe the winner in the test) can behave poorly.
...
For encoder choice the formal statistics of average and confidence interval often have no meaning. I'm thinking of the last mp3@128kbps test. In the light of overall average and confidence intervals all the encoders were tied. But looking at the outcome for the individual samples Lame 3.97, iTunes and - to a minor degree - Fraunhofer showed some noticeable weaknesses for some samples. So without information from outside the test it is not very reasonable to choose one of these encoders. From the listening test alone only Lame 3.98.2 and Helix remain as the practical candidates for encoder choice.


You're making an argument here that the best encoder isn't the one which gives the best quality result on average, but the one which is least prone to make a bad encoding. You can estimate it by looking at the indicated bounds, selecting the codec with the highest lower bound: it's the one that's least likely to give you bad outliers. I have no idea why you claim they have no meaning as they indicate directly what you want.

The idea that the best encoder to use is one that is determined by that reasoning, rather than the one that gives the highest quality on average, is entirely on you BTW.

 