Topic: Multiformat listening test (July 2014): Results and discussion

Multiformat listening test (July 2014): Results and discussion

Thank you all for participating in this 2014 public multiformat listening test!

The test is finished and the results are now available.
Per-track results are available, too.

Summary: Opus won, Apple AAC came second, and MP3 came third despite using ~30 kbps of extra bitrate; Ogg Vorbis tied with MP3 for third.


Multiformat listening test (July 2014): Results and discussion

Reply #1
Finally I can see my precise results, thanks a lot! I seem to have some problems recognizing issues in the stereo field, so maybe Vorbis isn't actually as good as I made it look on some samples.

Multiformat listening test (July 2014): Results and discussion

Reply #2
I posted this in the test thread as well, but maybe it will be caught quicker here. In the Donators and Contributions table (donators_and_contributions_table2.png) you have my nickname as yourload instead of yourlord. In the actual individual results section my nickname is correct.

Multiformat listening test (July 2014): Results and discussion

Reply #3
Kamedo2, I have noticed that contributors' comments on the results page are cut off where a newline occurs.
Many of my comments (and maybe others' too) consist of two or more lines, e.g. '[Braindead]' on the first line and the comment itself on the next.
Could you please rebuild the page so the comments appear as they were written?

Multiformat listening test (July 2014): Results and discussion

Reply #4
Kamedo2, please correct "Post-screening":

Quote
If you rank the low anchor at 5.0, your result of the sample will be invalid.
If you rank the mid-low anchor at 5.0, your result of the sample will be invalid.
If you rank the low anchor higher than the mid-low anchor, your result of the sample will be invalid.
If you rank the reference worse than 4.5, your result of the sample will be invalid.
If you rank the reference worse than 5.0 on 25% or more of submitted results, all of your results will be invalid.
If you submit 25% or more invalid results, all of your results will be invalid.
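For concreteness, the quoted rules are easy to express as code. A minimal sketch (my own illustration with an assumed data layout, not the test's actual script):

Code
# Per-sample and listener-level screening from the quoted rules. Each
# submitted result is assumed to carry the listener's scores for the
# hidden reference, the low anchor, and the mid-low anchor.

def sample_invalid(ref, low, mid_low):
    """Per-sample checks."""
    return (low == 5.0 or       # low anchor ranked transparent
            mid_low == 5.0 or   # mid-low anchor ranked transparent
            low > mid_low or    # anchors ranked in the wrong order
            ref < 4.5)          # hidden reference ranked clearly worse

def all_results_invalid(results):
    """Listener-level checks; results is a list of (ref, low, mid_low)."""
    n = len(results)
    ref_misses = sum(1 for ref, _, _ in results if ref < 5.0)
    bad = sum(1 for r in results if sample_invalid(*r))
    return ref_misses >= 0.25 * n or bad >= 0.25 * n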
🇺🇦 Glory to Ukraine!

Multiformat listening test (July 2014): Results and discussion

Reply #5
Thank you very much for carrying out the test! It's nice to see that Opus offers both high and consistent quality.

Also kudos for reporting the results so soon after the end of the test.

Multiformat listening test (July 2014): Results and discussion

Reply #6
Nice to see the final version of the graph, Kamedo2.

In this test, testers could score the samples "almost freely", giving any value between 1.0 and 5.0.

However, I think it would be better for testers to have only five choices, since the criteria are straightforward:
for example, a score of 4.0 means "perceptible but not annoying", and 5.0 means "imperceptible".

Then how can we interpret scores between 4.0 and 5.0?
Assume that sample X is given a score of 4.5, and sample Y a 4.7. Does that mean more artifacts occur in sample X than in sample Y?
Though I am not a professional statistician, in my opinion such values are meaningful as an average of multiple ratings
(the closer to 5.0, the more people considered the sample "imperceptible") but are ambiguous as individual choices.
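To make the averaging point concrete, a stdlib-only sketch (the scores below are made-up placeholders, and the normal-approximation interval is my assumption, not necessarily how this test's analysis was done):

Code
# Mean and rough 95% interval over several listeners' scores for one sample.
from statistics import mean, stdev
from math import sqrt

scores = [4.2, 4.5, 4.7, 5.0, 4.4, 4.8, 4.6, 4.9]  # hypothetical listeners
m, s, n = mean(scores), stdev(scores), len(scores)
half = 1.96 * s / sqrt(n)  # normal approximation to the 95% interval
print(f"mean {m:.2f} +/- {half:.2f}")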

I am curious about others' opinions on the current scoring system.

Multiformat listening test (July 2014): Results and discussion

Reply #7
It's all about perception, and everything is relative. My 4.7 is not the same as your 4.7. Even worse: my 4.7 on the morning of day 1 might differ from my 4.7 on the evening of day 2, although I try to avoid that.

First, I spotted possible problem spots. In this case, the mid anchor often gave good hints. The easier it was to find problem spots, the lower the initial rating.

Second, I ABXed. The easier it was to ABX the problem spots successfully, the lower I pushed the rating.

Third, when all samples were finished, I compared them by listening to the complete samples and adjusted the ratings. So I think it's really necessary to allow decimals in the ratings.

Whether something is perceptible or annoying: I take into account whether the problem spot would be perceptible on speakers, in a car, etc. I always ABX with headphones and am fairly sure my ratings would have been higher if I had ABXed through speakers.

Multiformat listening test (July 2014): Results and discussion

Reply #8
Is there an easy way I could get my hands on all of the samples, preferably at their native sampling rate? I think they would make an interesting basis for my own listening tests.

Btw: why hasn't this made it to the front page?

Multiformat listening test (July 2014): Results and discussion

Reply #9
Answering a couple of posts above:
• I think if one artifact is slightly more audible than another yet both are not annoying, it is necessary to be able to rank one above the other within the 4.0 to 5.0 range, so at least two or three steps between 4.0 and 5.0 are needed.
• I think the low anchor and mid-low anchor do a lot to reduce the variation in ranking among listeners, and are probably about as good as we're likely to get. They were ranked with remarkable consistency in this listening test, and both were consistently worse than the contenders.


Thanks to everyone who took part - listeners, organisers and those who have helped develop the robust methodology and analysis in the past. I think the test, its exclusion criteria and methods of analysis (both established in advance) and its organisation were among the best I've seen.

Trying to look for methodological flaws as devil's advocate, the only scintilla of doubt I could cast is whether FAAC's artifacts as the mid-low anchor (aside from its low-pass filter) are capable of priming people to notice artifacts in Apple AAC more than in non-AAC competitors. I doubt this idea quite strongly, given that they are such different encoders, neither fundamentally flawed, and that all the competitors are capable of transparency at higher bitrates (which VBR enables, if the signal analysis is smart enough to spend the extra bits where they're needed), so I believe the test is about as scrupulously fair as it's possible to be.

From my perusal I'd imagine that a Condorcet analysis would show a lot more ties for first place than the 2011 64 kbps multiformat test (which had a lot of win results for Opus), but it would probably reflect slightly better on the codec with the narrowest spread of results (which also turns out to be Opus), as it's usually in the not-annoying range and less likely to lag far behind Apple AAC than a codec with a wider spread.
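For anyone curious what such a pairwise analysis looks like, a small sketch (my own illustration; the score table is an assumed layout with placeholder numbers, not the actual results data):

Code
# Condorcet-style pairwise comparison: for each pair of codecs, count how
# often one outscores the other across aligned (listener, sample) ratings.
from itertools import combinations
from collections import Counter

scores = {  # hypothetical per-(listener, sample) ratings, aligned by index
    "Opus":   [4.6, 4.2, 4.8, 4.5],
    "AAC":    [4.4, 4.3, 4.5, 4.6],
    "Vorbis": [4.1, 4.4, 4.3, 4.2],
}

wins = Counter()
for a, b in combinations(scores, 2):
    a_wins = sum(x > y for x, y in zip(scores[a], scores[b]))
    b_wins = sum(y > x for x, y in zip(scores[a], scores[b]))
    if a_wins > b_wins:
        wins[a] += 1   # a beats b head-to-head
    elif b_wins > a_wins:
        wins[b] += 1   # b beats a head-to-head
    # equal counts are a pairwise tie: nobody is credited

print(wins.most_common())  # a Condorcet winner beats every rival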


It would be quite nice to draw up some general conclusions based on this test and a modicum of other recent knowledge.

Can we add to or critique the following list:

Tentative conclusions and notes:

General quality of the best encoders at 96 kbps
  • Opus 1.1 is the clear winner on average: opusenc --bitrate 96
  • Apple AAC (iTunes 11.2.2) is clearly in second place: qaac --cvbr 96 (or 96 kbps with VBR enabled in iTunes)
  • Ogg Vorbis (aoTuV Beta6.03) is third of the 96 kbps contenders: vorbisenc -q2.2
  • The higher-bitrate LAME MP3 (about 128-140 kbps) is tied with Ogg Vorbis for joint third place: lame -V 5
  • The best mature 128-140 kbps MP3 encoder (significantly better than early 128 kbps MP3) was used as a well-known quality reference. The three more modern contenders matched or beat it at around 25% lower bitrate. While LAME itself has come a long way since early MP3 encoders and is very mature within the limitations of the format, the newer formats had clearly moved ahead by 2014. As noted in previous tests, LAME degrades noticeably before its average drops as low as 96 kbps.
  • These formats are not reliably transparent at 96 kbps when subjected to close critical listening on headphones in ideal conditions.
  • All of these formats can reach transparency at higher bitrate settings. However, getting statistically significant listening test results becomes extremely difficult as we approach transparent settings. Any such test is likely to be a statistical tie. The results of this test are not necessarily applicable to different bitrates, however.
  • Broadly speaking, "Perceptible but not annoying" (4.0 or better) indicates that music should remain enjoyable, and for less critical listeners (e.g. listening on speakers, at low volumes or in noisy environments, or untrained in codec-artifact spotting) differences from the original may go unnoticed. Many testers said this test was difficult, because most differences were subtle (with the exception of the low anchors).
  • The difficulty of detecting differences at such modest bitrates demonstrates the maturity of the best encoders of today (2014) and the advances made over the last decade or so.
  • All contenders except the FAAC anchors are well-tuned variable bitrate (constant quality) encoders. Do not worry about unusually low or unusually high bitrates within your music collection: in such mature encoders these are the result of smart decision-making, not errors. The bitrate-versus-quality scatter plots show no appreciable bias to support the idea that a low bitrate is chosen in error, and over a large collection the average bitrate will still be close to the target (see the sketch after this list).
  • The results are specific to the encoder and settings used, not the format in general. This is very clear from FAAC performing much worse than Apple AAC. Ogg Vorbis has less variation among encoders and Opus has only two versions to date: Opus 1.0 was well tuned at launch, and Opus 1.1 is largely an incremental improvement.
  • There is no significant evidence in this test of any encoder being more suitable for a certain genre of music.
  • Other considerations may be important, such as compatibility with many devices (MP3 is near-universal, and AAC is very widely supported too) or patent-free status (Vorbis and Opus are widely considered patent-free and thus suitable for projects like Wikimedia or for use in games). The balance may shift over time, e.g. as new standards like WebRTC become adopted: Opus, being Mandatory to Implement there, may become more readily playable on more devices.
  • Note that LC-AAC is capable of transparency. Apple HE-AAC may well not be capable of transparency, and in any case is worse than Apple LC-AAC above about 64 kbps, so you should not force HE mode on the assumption that higher efficiency means higher quality. The default behaviour is usually well chosen, and HE-AAC's extra features are used to simulate high-frequency content when bitrate is very limited.
  • Opus won with the encoder from the standard's developers. There are no poor-quality Opus encoders at present. Poor AAC encoders do exist, and in certain areas such as some video transcoders and video-streaming sites, are used more widely than is desirable for music encoding.
  • This whole test was done using double-blind testing rather than sighted testing, so that deliberate or unintended bias cannot influence results. The results surprised a number of the listeners who took part, in fact, when un-blinded after the conclusion of the test.
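As promised above, a quick way to check the bitrate-averaging claim on your own collection (a sketch, not the test's tooling; it assumes the mutagen tagging library is installed, and the figure includes container and tag overhead):

Code
# Per-file bitrate derived from file size and decoded duration, plus the
# collection average. Run over a folder of encodes to see the VBR spread.
import pathlib
from mutagen import File  # pip install mutagen; parses most audio formats

def bitrate_kbps(path):
    audio = File(str(path))  # returns None for unrecognised files
    if audio is None or not audio.info.length:
        return None
    return path.stat().st_size * 8 / 1000 / audio.info.length

paths = pathlib.Path("music").rglob("*.opus")  # hypothetical folder
rates = [r for r in map(bitrate_kbps, paths) if r is not None]
if rates:
    print(f"{len(rates)} files: min {min(rates):.0f}, "
          f"max {max(rates):.0f}, mean {sum(rates)/len(rates):.0f} kbps")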


This is by no means definitive. Your comments and additions would be welcome.
Dynamic – the artist formerly known as DickD

Multiformat listening test (July 2014): Results and discussion

Reply #10
Sadly I couldn't participate in the test, because I caught a cold and could barely hear anything for a week, and then I ran out of time to submit results. Next time!

Honestly, I didn't expect Opus to stand out as the winner; I thought QAAC would win this round. Fortunately I was wrong. I think we can safely say we have a new recommended codec for portable use. (I really hope Poweramp's creator now puts the decoder into his player.)
It would be interesting to see how Opus competes at 64 kbps now; it seems to me it has improved a lot since the last 64 kbps listening test.

Multiformat listening test (July 2014): Results and discussion

Reply #11
One observation I would like to point out:

When looking at the ABX-rate versus bitrate graphs, one clearly sees the constrained-VBR behaviour of Apple AAC. Opus samples spanned a wider range of bitrates. Could it be a bit unfair to compare QAAC's CVBR quality with Opus as we are doing now?

About two years ago I did some personal ABXing through foobar2000 of QAAC CVBR vs. TVBR at around 128 kbps. My conclusion was that there was no perceptible difference, so since then I have always used CVBR, as it has a more predictable rate. But that's my personal experience at 128 kbps, and there might be a difference between CVBR and TVBR at 96 kbps.
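If anyone wants to repeat that comparison at 96 kbps, a minimal sketch (assumes qaac is on PATH; the TVBR quality index that lands near 96 kbps varies with the material, so the value below is only a guess to adjust):

Code
# Encode the same source with CVBR and TVBR, then compare output sizes
# before ABXing the decodes in foobar2000 as usual.
import os
import subprocess

SRC = "sample.wav"  # hypothetical input file
subprocess.run(["qaac", "--cvbr", "96", SRC, "-o", "cvbr.m4a"], check=True)
subprocess.run(["qaac", "--tvbr", "36", SRC, "-o", "tvbr.m4a"], check=True)  # q36 is a guess
for name in ("cvbr.m4a", "tvbr.m4a"):
    print(name, os.path.getsize(name), "bytes")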

Multiformat listening test (July 2014): Results and discussion

Reply #12
In a number of preliminary tests over the last few years, CVBR was shown to beat TVBR, albeit by a small margin. I recall this being the case both at 96 kbps and (with HE-AAC) at 64 kbps using Apple AAC. It may vary according to personal preference and at different bitrates.

Dynamic – the artist formerly known as DickD


Multiformat listening test (July 2014): Results and discussion

Reply #14
That 2011 result was not a clear victory over TVBR (p = 0.333), but CVBR beat FhG while TVBR tied with FhG.

A few other personal tests by various users seem to have backed up the slight preference for CVBR. I believe Kamedo2 was one of them.

Dynamic – the artist formerly known as DickD

Multiformat listening test (July 2014): Results and discussion

Reply #15
Quote
Broadly speaking, "Perceptible but not annoying" (4.0 or better) indicates that music should remain enjoyable, and for less critical listeners (e.g. listening on speakers, at low volumes or in noisy environments, or untrained in codec-artifact spotting) differences from the original may go unnoticed. Many testers said this test was difficult, because most differences were subtle (with the exception of the low anchors).
My experience (a statistically significant sample of one) was that initially everything apart from the low anchors sounded fine. Then I learnt what the artefacts sounded like. Then I could hear them on most (not all) samples.

Other people have reported the following before, but this is the first time I've really experienced it: I could eventually hear the artefacts much more easily when just listening than when attempting to ABX. Initially, ABXing made the artefacts go away. Not because they were placebo (I eventually ABXed them successfully), but because trying to ABX with immediate repetition dulled my senses. I had to slow down, leave gaps, and go for guesses rather than certainties. In previous tests, I haven't clicked "X is A" or "X is B" until I felt sure, but that takes a lot of listening per trial; this time I found it worked better to just go with a hunch after listening maybe once, and to take more trials. Even if I got a few wrong, with more trials the statistics would determine whether I really heard a difference or not. Strangely, my hunches were just as good as my "don't click until you're sure".
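The "more trials" logic is just the binomial tail. A stdlib-only sketch of how quick hunches can reach significance (my illustration, not the test's analysis; the trial counts are made up):

Code
# One-sided p-value of an ABX run: the chance of scoring at least this
# well by guessing (coin flips).
from math import comb

def abx_p_value(correct, trials):
    """P(X >= correct) for X ~ Binomial(trials, 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(abx_p_value(7, 8))    # ~0.035: slow, only-when-sure answers
print(abx_p_value(14, 20))  # ~0.058: quick hunches, not quite there
print(abx_p_value(16, 20))  # ~0.006: hunches a bit better than chance add up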

I think, once you learn the sound of the artefacts, subtle though they are, they're sometimes not-too-difficult to hear in normal listening. Specifically, they're easier to hear than a tricky ABX session would imply.

In "professional" listening tests, people are trained to hear the artefacts before taking the test. This doesn't match real-world codec use, but it does reduce the number of 5.0s.

Cheers,
David.

Multiformat listening test (July 2014): Results and discussion

Reply #16
Would you suggest a rewording or adjustment to the tone of what I said to summarise that experience, David?

Especially if it has become a little annoying now that you've learnt to discern it, or if you wish to say it's very likely to go unnoticed when not comparing to the lossless original.

I've certainly had occasions like that when ABXing, although I've done far less of it than most of the contributors to this listening test (I wasn't one of them). Sometimes I did OK when I went with my hunch, and other times the statistics revealed that I couldn't reliably hear it either way. I specifically remember it with one of NickC's test builds of lossyWAV that allowed settings worse than the normal lowest, to help me try to learn the likely locations of noise artifacts. As soon as I raised the quality to extraportable, I was still left with those fairly plausible-feeling hunches, but they then came out as random guessing in the statistics.
Dynamic – the artist formerly known as DickD

Multiformat listening test (July 2014): Results and discussion

Reply #17
Quote
General quality of the best encoders at 96 kbps
7. All of these formats can reach transparency at higher bitrate settings. However, getting statistically significant listening test results becomes extremely difficult as we approach transparent settings. Any such test is likely to be a statistical tie. The results of this test are not necessarily applicable to different bitrates, however.

Thank you, Dynamic, for the nicely written observations. Just one point: if I were the writer, I would say "All of these formats can reach transparency at higher bitrate settings on most music tracks". MP3 on castanets or MP3 on fatboy is thought to be impossible to make transparent under ideal listening conditions within the limits of the format. Aside from such extremely rare cases, it's safe to say "All of these formats can reach transparency at higher bitrate settings".

Multiformat listening test (July 2014): Results and discussion

Reply #18
Kamedo2, http://listening-test.coresv.net/results.htm - very good job. A lot of useful information.

After looking at my results I could see that Apple AAC and Opus were on par for me, while Opus was superior for most people. A possible explanation is that I have performed a lot of tests with Opus over the last few years and have become very familiar with its artifacts.
Still, quality is more consistent for Opus: it didn't have as many killer samples as the other encoders did. I'm very glad that there is still progress in the area of audio compression.

All right, Opus is superior to LC-AAC at 96 kbps. But what about some post-AAC encoders, like USAC? I had some files encoded with USAC from the MPEG verification tests. USAC is somewhat/slightly superior to AAC at 96 kbps (without a statistically significant difference), but the chances are extremely low that any new codec will be superior to Opus within the next 5+ years (if at all). Right now there are a few new formats in the pipeline, but they pursue very specific targets, like 3D audio (efficient coding of e.g. 22.2 channels) or speech-only and very low bitrate encoding.

Multiformat listening test (July 2014): Results and discussion

Reply #19
In the future I would like to see some quality improvements below 64 kbps for Opus (if possible) so it can compete with HE-AACv2. Obviously it won't be transparent at those rates.
Also, the sample scored closest to 4.0 might be a great place to start optimizing Opus even further.

Multiformat listening test (July 2014): Results and discussion

Reply #20
Thanks for the test, Kamedo2 and everyone else involved. I didn't take part because I don't really care about or use lossy codecs anymore.

But... I can't wait to see how Opus will compare to this one: http://www.iis.fraunhofer.de/en/ff/amm/for...multi/usac.html

Multiformat listening test (July 2014): Results and discussion

Reply #21
Thanks for spotting that, Kamedo2. This wording might be better:

7. All of these formats can reach transparency at higher bitrate settings for the vast majority of music. However, getting statistically significant listening test results becomes extremely difficult as we approach transparent settings. Any such test is likely to be a statistical tie. The results of this test are not necessarily applicable to different bitrates, however.
Dynamic – the artist formerly known as DickD

Multiformat listening test (July 2014): Results and discussion

Reply #22
Another aspect that has to be taken into account when it comes to transparency: the DirectSound output in Windows 7 additionally alters the output of lossy codecs with its limiter, especially when you play files directly from the source. The only way around this is to use something like WASAPI exclusive mode.

Multiformat listening test (July 2014): Results and discussion

Reply #23
I only managed to submit one sample (I was allocated Anonymous8), but it's quite interesting to see how others rated the same sample.
I've generally found LC-AAC not to perform particularly well at 96 kbps, and I'm probably the only person in the world who thinks that 80 kbps HE-AAC actually sounds better. The last time I checked was years ago though, so it may be different now.

Thanks to everyone who made this test possible!