
Topic: Flawed conclusions on WMA

Flawed conclusions on WMA

Hello forum

I have been reading this forum, and it is used as a reference by people all over the internet when deciding which audio codec to use. Much of the time, the ordinary person's decision comes down to WMA vs MP3, as that is where the majority of hardware support lies.

In giving their advice, many people cite http://www.rjamorim.com/test/multiformat128/results.html and conclude that MP3 (encoded with LAME) is superior to WMA in audio quality, and that people should therefore use it.

Looking at the test results, I see that, other than ATRAC, the results merely reflect the different bitrates being played. In other words, there is actually no conclusion from the results other than that higher bitrates sound better. Furthermore, I cannot seem to find the actual details of the encoder settings used, but the top performers, including LAME MP3, appear to be using variable bitrate. Comparing these to constant-bitrate WMA files is absolutely ridiculous. Considering how close WMA comes to the others, it is almost possible to conclude that WMA is superior to them once the difference between constant and variable bitrates is factored in.

On top of this, the test's public nature means it could have been tainted by anyone with an agenda against a particular codec. It also carries significant error due to the variability of the equipment used by the listeners. Another factor is that people who are used to a particular sound, such as that of iTunes-encoded AAC files, may be more likely to rate it above the other codecs. Furthermore, there are a number of other tests, conducted in both controlled and public environments, that as far as I understand came to the opposite conclusion from the anti-WMA sentiment in this forum.

I would be happy for someone to point out where my analysis is wrong; otherwise it would seem to me that when people criticise WMA they should state this as their personal opinion and stop pointing to the 128 kbps listening test as some kind of proof that WMA is inferior to even MP3. I 'thought' the goal of many people in this forum was to enable people to make informed decisions about various codecs, but from my reading of this forum, I do not get that impression.

Flawed conclusions on WMA

Reply #1
Um, no.

From the presentation page (http://www.rjamorim.com/test/multiformat128/presentation.html):

Quote
The encoders and parameters tested are:

    * LAME encoder 3.96 -V5 --athaa-sensitivity 1
    * Apple iTunes 4.2 128kbps AAC
    * Ogg Vorbis aoTuV tuning b2 -q 4.35
    * Musepack 1.14b --quality 4.15 --xlevel
    * Sony ATRAC3 132kbps
    * Microsoft WMA9 Std Bitrate VBR 128kbps

Clearly, VBR is used for WMA.

Also, if you look at the end of the page you linked, there is the ANOVA analysis of the results:
Quote
Vorbis aoTuV is tied to Musepack at first place, Lame MP3 is tied to iTunes AAC at second place, WMA Standard is in third place and Atrac3 gets last place.


Also, since ABC/HR (a double-blind comparison method, where listeners cannot tell which codec they are rating) was used to conduct the test, there is no question of bias or pushing agendas.

Flawed conclusions on WMA

Reply #2
Why isn't the differing bit-rate the reason for the differing scores?

Flawed conclusions on WMA

Reply #3
This has been discussed way too many times. Look up the test thread and read it; you will find an answer to your question.

Short version: the varying bitrates are perfectly valid. That's the whole point of VARIABLE BIT-RATE. A good VBR encoder should raise the bitrate on difficult samples; if it doesn't, the encoder is actually worse. The encoder settings used all result in a similar AVERAGE bitrate across a large, multi-genre music collection. The overall bitrate is NOT higher or lower for any codec. It is just that some encoders were clever enough to raise the bitrate for the TEST SAMPLES while others were too stupid to do that. Thus, the perfect (utopian) encoder would average 128 kbps across a large music collection, yet would recognize the difficulty of the test samples, pump the bitrate up to something like 200 kbps, and achieve a perfect 5.0 score.
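
To make the calibration idea concrete, here is a minimal Python sketch with invented numbers (nothing from the actual test): the encoder's average over ordinary tracks sits at the 128 kbps target, while a difficult test sample legitimately gets far more bits.

Code
# Invented numbers for illustration only: per-track bitrates from a
# hypothetical VBR encoder calibrated to average ~128 kbps.
ordinary_tracks_kbps = [96, 112, 124, 131, 135, 142, 156]
killer_sample_kbps = 201  # the encoder raises the rate where it must

avg = sum(ordinary_tracks_kbps) / len(ordinary_tracks_kbps)
print(f"collection average: {avg:.0f} kbps")   # 128, the calibrated target
print(f"killer sample:      {killer_sample_kbps} kbps")  # well above average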
I am arrogant and I can afford it because I deliver.

Flawed conclusions on WMA

Reply #4
I do not agree with you at all; otherwise there would be no target bitrate settings for VBR encoding. There is also no way that a codec developer could know what the average music collection will look like, now or in the future; therefore they cannot determine the complexity of the music that will be encoded. A better codec will make sacrifices to achieve the desired bitrate; otherwise it may as well just go off on a tangent and output any bitrate it likes.

Flawed conclusions on WMA

Reply #5
Are you trolling?

Quote
I do not agree with you at all; otherwise there would be no target bitrate settings for VBR encoding.

WTF? What do the available settings have to do with the validity and logic of the test?

Quote
There is also no way that a codec developer could know what the average music collection will look like now or in the future.

What does the codec developer have to do with that? Determining a similar average bitrate for the test is the test organizer's job, and IT WAS DONE.

Quote
A better codec will make sacrifices to achieve the desired bit-rate, otherwise it may as well just go off on a tangent and output any bit-rate it likes.

Boy, get a clue about what VARIABLE BITRATE is about, and why the basic VBR setting is not a "bitrate target" but a "QUALITY target". VBR does not target any bitrate at all; that's what ABR ("average bitrate") is for. After you have educated yourself about the basics of how lossy codecs work, you can write statements about it.
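
For what it's worth, LAME itself exposes the distinction on its command line (-V n sets a VBR quality index, --abr n an average-bitrate target, -b n a constant bitrate). Below is a toy Python sketch of the two control strategies, with made-up numbers and no resemblance to any real encoder's internals:

Code
# Toy contrast (invented numbers, not a real encoder): VBR holds a
# QUALITY target and lets the bits fall where they may; ABR scales
# its effort so the RUNNING AVERAGE bitrate tracks a target.
difficulty = [0.2, 0.3, 0.9, 0.8, 0.3, 0.2]   # per-frame "hardness"

vbr_bits = [int(300 + 900 * d) for d in difficulty]  # follows content

target, abr_bits = 570, []
for d in difficulty:
    running_avg = sum(abr_bits) / len(abr_bits) if abr_bits else target
    abr_bits.append(int((300 + 900 * d) * target / running_avg))

print("VBR:", vbr_bits, "avg", sum(vbr_bits) // len(difficulty))
print("ABR:", abr_bits, "avg", sum(abr_bits) // len(difficulty))

Note how the VBR average lands wherever the content dictates, while ABR drags hard frames back toward the target.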

It is kind of weird that you claim to be a long-time forum reader, yet you act as if you do not know those basic things about how lossy codecs work.


- Lyx
I am arrogant and I can afford it because I deliver.

Flawed conclusions on WMA

Reply #6
There is a difference between ABR and VBR. When using VBR WMA with a target bitrate, you are in fact using ABR. A true VBR codec will not look to maintain a bitrate close to the one specified, but will try to maintain quality as high as possible. The problem with WMA is that its VBR engine, when used, does not come close to the bitrate of the tests - it's either way too high or way too low. Instead of using CBR, Roberto decided to use ABR.

In my test, I used CBR for the WMA Professional codec and according to the results at 128 kbps, the Professional codec is quite competitive.

Lyx was faster... >_<

Flawed conclusions on WMA

Reply #7
@ astroboy: I'm a WMA user, and I think the test was pretty fair. Simply put, all current generation codecs perform their best at VBR, but they don't all encode to the same bitrate at those settings. As such, it is a bit tough to create an absolutely level playing field, but the testers did their best.
EAC>1)fb2k>LAME3.99 -V 0 --vbr-new>WMP12 2)MAC-Extra High

Flawed conclusions on WMA

Reply #8
So even though the presentation page for that test says WMA was used at VBR 128 kbps, it should actually say WMA ABR?

In which case, MP3 VBR and WMA ABR cannot be compared meaningfully.

Eventually I will just have to test for myself anyway... I think these public listening tests are meaningless and the only testing that should be encouraged is personal testing. Why would you choose a codec based on a test done by someone else when you can do it yourself? I need to decide between WMA and MP3 because my car doesn't play anything else, and I like both formats over the competitors.

Flawed conclusions on WMA

Reply #9
Quote
In which case, MP3 VBR and WMA ABR cannot be compared meaningfully.

Why not? MS is constantly claiming that WMA at 64 kbps is better than MP3 at 128 kbps. Here is a controlled test showing that, at around the same bitrate, LAME MP3 is better. Surely the conclusions of this controlled test are more valid than uncontrolled tests and marketing spin.
Quote
Eventually I will just have to test for myself anyway... I think these public listening tests are meaningless and the only testing that should be encouraged is personal testing. Why would you choose a codec based on a test done by someone else when you can do it yourself? I need to decide between WMA and MP3 because my car doesn't play anything else, and I like both formats over the competitors.

Where on this forum does it say that public listening tests are the only form of evidence one can use to determine codec use?

These public tests are simply examples of controlled testing that serve as a good guide to how specific encoders perform. They are a lot better than uncontrolled tests based on subjective opinion, which are largely the product of placebo.

Flawed conclusions on WMA

Reply #10
Quote
... I think these public listening tests are meaningless and the only testing that should be encouraged is personal testing. ...


No, public listening tests are not meaningless, but their relevance is necessarily restricted. Most crucial to me is the fact that an encoder can have a flaw which shows up in more than just extremely rare situations, yet isn't reflected in the test samples. (And even when such a flaw is known, there is the problem of how it should be adequately weighted within the samples.)
So this is all an approximation to the truth; more exactly, a valuable approximation to the truth.

IMO the best way to deal with the outcome of listening tests is not to nitpick the resulting numbers. On that basis, looking at the test you mentioned, to me WMA is absolutely on par with Lame. But it also shows, as was mentioned before, that Microsoft's claim of WMA (Standard)'s superiority is ridiculous. Maybe this is the background for the anti-WMA attitude in this forum.

Thinking practically, you can use MP3 with your car hi-fi just as you would use WMA Standard, and thus use the most compatible format (now and in the near future).

As for your VBR remarks: I agree with you that small differences in test outcome may be due to correspondingly small bitrate differences on the tested samples. This does not contradict the fact that the VBR mode was chosen carefully by the test organizers. It's a general fair-comparison problem with VBR mode. But there is also a more general problem of which encoder setting to use: it is not always certain that the setting used is really the optimal one. So there's quite some uncertainty in this area, and the best way to deal with it is, again, not to nitpick the results.

And yes, I'd say public listening tests point in the right direction, but if you want to know exactly what this means for you, you should do listening tests of your own.
But not everybody wants to do that, and there is even a big danger in doing so: you will become oversensitive to problems. As I know from experience, this doesn't make life easier.
lame3995o -Q1.7 --lowpass 17

Flawed conclusions on WMA

Reply #11
Quote
It's a general fair-comparison problem with VBR mode. But there is also a more general problem of which encoder setting to use.

It is not a fair-comparison problem at all, because of the scenario to which we want to extrapolate such tests - or in other words, the reason why we actually test. We do not test because of killer samples. We don't even test because of short samples in general. Who cares about that? What we use lossy codecs for is listening to music. We want to extrapolate the test results (after accounting for the increased difficulty of the test samples) to real-world scenarios. In the real world, users don't encode short killer samples for listening; they encode entire music collections. The target for extrapolation is entire music collections - in other words, music overall. Our hypothetical target scenario looks like this:

- we have a 100 GB hard drive and N music albums
- we want to encode them all to a lossy format, filling up all the available space
- now we want to know which codec gives us the most bang for the buck, quality per space

Even though this scenario is a bit contrived, it is valid, because it is the only reasonable (and thus fair) way to test lossy VBR codecs. VBR codecs don't target a specific bitrate; they target a certain quality. In a utopian world, tests would work the other way around: we would already know the perceptual quality of each codec and would just do the maths to see which codec can achieve a given quality over an entire collection with the least disk space used. Unfortunately, the real world works a bit differently.

However, it is also clear that VBR is a very efficient and reasonable concept, and that it is therefore illogical and stupid to make all encoders use the same average bitrate on the test samples - that would defeat the purpose of VBR. Part of what makes one VBR codec better than another is that it recognizes difficult-to-encode passages and raises the bitrate to maintain constant quality. So our only remaining choice is a sort of compromise: take a music collection as large and varied as possible, then make it a test rule that all encoders must encode this collection at the same overall bitrate. Even though this is against what VBR is about, it leaves the encoders a lot of freedom - enough freedom to, for example, recognize difficult and easy music genres. It also leaves them a lot of freedom to do what they are best at with the difficult test samples: adjust their bitrate.
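
To put rough numbers on the hard-drive scenario (my own back-of-the-envelope arithmetic, nothing from the test itself):

Code
# How much music fits on the drive at a given overall average bitrate.
drive_bits = 100e9 * 8               # 100 GB drive, expressed in bits
for avg_kbps in (96, 128, 160):
    hours = drive_bits / (avg_kbps * 1000) / 3600
    print(f"{avg_kbps} kbps average -> ~{hours:.0f} hours of music")
# 128 kbps -> ~1736 hours; the codec that sounds best at that overall
# average wins the scenario, however it spends bits per track.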

Comparing all test samples at the same per-sample bitrate is what would be really stupid and illogical. First, it would fail to test how good a VBR codec is at being VBR. Second, the results would be totally meaningless, because they would have no relevance to the real world anymore. This has been discussed to death in the past. Get educated or STFU, damnit.

- Lyx
I am arrogant and I can afford it because I deliver.

Flawed conclusions on WMA

Reply #12

Quote
It's a general fair-comparison problem with VBR mode. But there is also a more general problem of which encoder setting to use.

Quote
It is not a fair-comparison problem at all...

???
As you describe in detail, it is the nature of VBR that makes the bitrate used on a specific sample by different VBR encoders not directly comparable.
As I said, this doesn't invalidate the test at all, and the tests are certainly done with all the necessary skill.
The conclusion should be not to disbelieve these tests, but not to nitpick the resulting numbers either. For the test mentioned: WMA Standard was, IMO, on par with Lame (and most of the encoders tested).
I wasn't talking about problem samples, but since you bring up sample selection: the selection of (normal) samples of course has an influence on the outcome, especially when some contenders are close. The same goes for the listeners (though that is covered by the usual analysis). None of this invalidates the tests, but we should be aware of the restrictions.

My solution to all this is simple: don't care about small differences in test results (which is why I very much dislike the 'zoomed' view of test results, which inadequately blows up small differences). My interpretation of the results of the aforementioned test is: Vorbis and MPC are best; Lame, iTunes and also WMA Std. are second; ATRAC is worst. To me, the differences of any more detailed differentiation are within the natural volatility of test results - volatility due not to variation in listeners' perceived quality but to the mentioned side conditions of the test itself.
My classification was only for clarification. More interesting are the practical implications: use MPC (if your DAP supports Rockbox) or Vorbis (if your DAP supports it) for the best quality. Don't worry about using MP3, as it provides very good quality; the same goes for using iTunes AAC on an iPod. Microsoft's claims regarding WMA Std. are BS, but on the other hand there's nothing wrong with it for people who want to use it. Sony's ATRAC3 isn't really bad, but it isn't really attractive either.
That's how I would interpret things in practice.
lame3995o -Q1.7 --lowpass 17

Flawed conclusions on WMA

Reply #13
Maybe I misunderstood you *unsure*. What I meant was that the varying bitrates on the test samples, between the different encoders, are not a sign of unfairness - more like the opposite. What makes VBR difficult to compare - the main problem - goes in the opposite direction: the music collection with which the encoders get calibrated. You cannot have an infinitely large collection with every music genre in existence in it. However, the larger the collection, the lower this uncertainty becomes - and, as usual, with diminishing returns. At some point - and I think we have reached it already - this side effect becomes overshadowed by other effects like error margins, etc. So, in that regard, I agree with you: as with so many other things in life, perfection is impossible here, but "good enough" is good enough.

I also agree that the test results are often wrongly extrapolated and interpreted. First, we have the zoomed images, which subjectively appear to inflate the differences. Then we have to take into account that these are results for "killer samples". And lastly, the testing methodology makes the listeners spend far more focus and attention on the music than is usual in the real world (including multiple repeats of the same sample in a short span of time). Thus, in the real world the encoders would actually perform much better than in such tests. Listening tests put codecs under far more pressure than everyday listening does. The phrase "LAME V5 is transparent to most people" is actually an understatement under normal listening conditions.

- Lyx
I am arrogant and I can afford it because I deliver.

Flawed conclusions on WMA

Reply #14
Quote
... so, in that regard, I agree with you: as with so many other things in life, perfection is impossible here, but "good enough" is good enough. ...

I'm glad the day has come that we agree on something.
lame3995o -Q1.7 --lowpass 17

Flawed conclusions on WMA

Reply #15
Quote
Why not? MS is constantly claiming that WMA at 64 kbps is better than MP3 at 128 kbps.


To be fair, they stopped claiming that years ago. And while 64k WMA is definitely not better than VBR LAME anno 2007, I suspect it stacks up quite well against 128k Blade anno 1995.


Flawed conclusions on WMA

Reply #17
Quote
Furthermore, there are a number of other tests, conducted in both controlled and public environments, that as far as I understand came to the opposite conclusion from the anti-WMA sentiment in this forum.

Are you referring to the folly of the majority, the folly of obeying placebo in non-double-blind tests, and the existence of "paid-for" audiophile tests?

HA.org is a forum in which 99% of the time I do not have to take conclusions with teaspoonfuls of salt.

Flawed conclusions on WMA

Reply #18
Quote
My solution to all this is simple: don't care about small differences in test results (which is why I very much dislike the 'zoomed' view of test results, which inadequately blows up small differences). My interpretation of the results of the aforementioned test is: Vorbis and MPC are best; Lame, iTunes and also WMA Std. are second; ATRAC is worst. To me, the differences of any more detailed differentiation are within the natural volatility of test results - volatility due not to variation in listeners' perceived quality but to the mentioned side conditions of the test itself.


For the samples tested, the group who listened, and the codecs/settings tested, Roberto's conclusion is correctly worded:

"Vorbis aoTuV is tied to Musepack at first place, Lame MP3 is tied to iTunes AAC at second place, WMA Standard is in third place and Atrac3 gets last place."

WMA should not be lumped in with the second tier.

ff123

Flawed conclusions on WMA

Reply #19
Quote
... Roberto's conclusion ...
"Vorbis aoTuV is tied to Musepack at first place, Lame MP3 is tied to iTunes AAC at second place, WMA Standard is in third place and Atrac3 gets last place."
....

I agree that grouping results, as I did and as Roberto did, is a problem, since the numerical results represent a floating scale of qualitative judgments where it's subjective where to draw the exact borders.
I was aware of that when I wrote it; that's why I added the practical implications, which are of more concern anyway.

But this is only a side effect of what I wanted to say: it's best not to care about small differences in test results.

Of course if codec A is only insignificantly better than codec B, codec B insignificantly better than codec C, codec C insignificantly better than codec D, and so on up to codec F, this does not mean that codec A is on par with codec F.

Luckily the practical implications usually are not so prone to misunderstanding and subjective judgement, and that's what counts.
lame3995o -Q1.7 --lowpass 17

Flawed conclusions on WMA

Reply #20

Quote
... Roberto's conclusion ...
"Vorbis aoTuV is tied to Musepack at first place, Lame MP3 is tied to iTunes AAC at second place, WMA Standard is in third place and Atrac3 gets last place."
....

I agree that grouping results, as I did and as Roberto did, is a problem, since the numerical results represent a floating scale of qualitative judgments where it's subjective where to draw the exact borders.
I was aware of that when I wrote it; that's why I added the practical implications, which are of more concern anyway.

But this is only a side effect of what I wanted to say: it's best not to care about small differences in test results.

Of course if codec A is only insignificantly better than codec B, codec B insignificantly better than codec C, codec C insignificantly better than codec D, and so on up to codec F, this does not mean that codec A is on par with codec F.

Luckily the practical implications usually are not so prone to misunderstanding and subjective judgement, and that's what counts.


Um, then we actually don't agree.

Roberto correctly grouped the results. The vertical bars represent the 95% confidence intervals, so you can say with 95% confidence that LAME MP3/iTunes AAC was rated better than WMA Standard. The difference was not "small" but "significant" (in the statistical sense).

Of course there is variability in the ratings. However, that variability did not obscure the significance of the results.
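
For readers unfamiliar with the bars, here is a simplified Python sketch of how such an interval is computed from a set of ratings. The scores are invented, and the real analysis was an ANOVA over the full blocked design rather than independent per-codec intervals.

Code
import math, statistics

ratings = [3.9, 4.1, 3.6, 4.4, 3.8, 4.0, 4.2, 3.7]  # invented scores
n = len(ratings)
mean = statistics.mean(ratings)
sem = statistics.stdev(ratings) / math.sqrt(n)  # standard error of mean
t95 = 2.365        # two-tailed t critical value, n - 1 = 7 deg. freedom
print(f"mean {mean:.2f} +/- {t95 * sem:.2f} (95% CI)")
# In the result plots, clearly separated intervals indicate a
# statistically significant difference; overlapping ones read as a tie.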

ff123

Quote
On top of this, the test's public nature means it could have been tainted by anyone with an agenda against a particular codec. It also carries significant error due to the variability of the equipment used by the listeners. Another factor is that people who are used to a particular sound, such as that of iTunes-encoded AAC files, may be more likely to rate it above the other codecs. Furthermore, there are a number of other tests, conducted in both controlled and public environments, that as far as I understand came to the opposite conclusion from the anti-WMA sentiment in this forum.


Anybody with an "agenda against a particular codec" would have had to break the encryption used by abchr-java. And even if he had managed to do so, that would only represent one result.

The variability of equipment is not a bug, it's a feature! The fact that significant results were achieved in spite of the equipment variability makes them that much more believable. Plus, it's more representative of real life.

Show me the evidence that if I am familiar with a codec's sound, I am likely to rate it higher.  If anything, being familiar with its shortcomings, I am likely to rate it lower.

Show me the test results which come to the opposite conclusion of Roberto's test (as regards WMA standard).

ff123



Flawed conclusions on WMA

Reply #23
Quote
... So you can say with 95% confidence that LAME MP3/iTunes AAC was rated better than WMA Standard. The difference was not "small" but "significant" (in the statistical sense). ...

That is about the reliability of the listening results, i.e., the listening and judging done by the listeners.
That is not my point.
I'm talking about side conditions of the test itself: the choice of samples, which can favor specific codecs to a certain degree, especially when VBR is used (a codec may, for instance, choose an unusually high bitrate on a certain sample, which favors that codec on that sample); the choice of encoder settings, which can be disadvantageous for a certain codec (the question of VBR or ABR or even CBR for Lame, for instance, which at least at higher bitrates is not as obvious as many people think - it was quite interesting to see people's reactions to the 64 kbps test, where WMA Pro in CBR mode came in second place); and so on.
After all, there is a certain error margin, as Lyx called it, in the results, and it has absolutely nothing to do with the usual statistical analysis, which addresses the reliability of the judgements of different listeners.
And that's why IMO it's best to ignore small differences in the outcome of a test.
Forget about the quality grouping, which has its own issues.
Think directly about the practical implications of a test; these do not depend on small differences on the quality scale.
lame3995o -Q1.7 --lowpass 17

Flawed conclusions on WMA

Reply #24
Quote
I'm talking about side conditions of the test itself: the choice of samples, which can favor specific codecs to a certain degree, especially when VBR is used (a codec may, for instance, choose an unusually high bitrate on a certain sample, which favors that codec on that sample); the choice of encoder settings, which can be disadvantageous for a certain codec (the question of VBR or ABR or even CBR for Lame, for instance, which at least at higher bitrates is not as obvious as many people think...

Oh really? Do you have any evidence to substantiate this outside of the three or four killer samples that you continuously mention?