Multiformat listening test (July 2014): Results and discussion
Reply #9 – 2014-09-16 18:41:58
Answering a couple of posts above:

• I think that if one artifact is slightly more audible than another, yet both are not-annoying, we need to be able to rank one above the other within the 4.0 to 5.0 range, so having at least two or three steps between 4.0 and 5.0 is necessary.

• I think the low anchor and mid-low anchor do a lot to reduce the variation in ranking among listeners, and are probably about as good as we're likely to get. They were ranked with remarkable consistency in this listening test and were both consistently worse than the contenders.

Thanks to everyone who took part - listeners, organisers and those who have helped develop the robust methodology and analysis in the past. I think the test, its exclusion criteria and methods of analysis (both established in advance) and its organisation were among the best I've seen.

Playing devil's advocate and looking for methodological flaws, the only scintilla of doubt I can cast is whether FAAC's artifacts as the low-mid anchor (aside from its low-pass filter) could prime people to notice artifacts in Apple AAC more than in non-AAC competitors. I doubt this quite strongly: they are very different encoders, neither is fundamentally flawed, and all the competitors are capable of transparency at higher bitrates (which VBR enables, if the signal analysis is smart enough to spend the extra bits where they're needed). So I believe the test is about as scrupulously fair as it's possible to be.

From my perusal, I'd imagine a Condorcet analysis would produce far more ties for first place than the 2011 64 kbps Multiformat test (which had a lot of win results for Opus), but it would probably reflect slightly better on the codec with the narrowest spread of results (which also turns out to be Opus), as it's usually in the not-annoying range and less likely to lag far behind Apple AAC than a codec with a wider spread.
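To illustrate the sort of Condorcet analysis mentioned above, here is a minimal Python sketch: it counts head-to-head wins between codecs across samples, and a codec that wins every pairwise comparison is the Condorcet winner. The scores below are invented for illustration only; they are not data from this test.

```python
from itertools import combinations

# Hypothetical per-sample scores on the 1.0-5.0 scale (invented, not test data).
scores = {
    "Opus":   [4.5, 4.2, 4.8, 4.0, 4.6],
    "AAC":    [4.4, 4.3, 4.1, 4.2, 4.0],
    "Vorbis": [4.0, 3.9, 4.2, 4.1, 3.8],
}

def pairwise(a, b):
    """Count samples where codec a scores above / below codec b."""
    wins = sum(x > y for x, y in zip(scores[a], scores[b]))
    losses = sum(x < y for x, y in zip(scores[a], scores[b]))
    return wins, losses

# A Condorcet winner beats every other codec head-to-head;
# near-transparent settings would instead produce many ties here.
for a, b in combinations(scores, 2):
    wins, losses = pairwise(a, b)
    result = a if wins > losses else b if losses > wins else "tie"
    print(f"{a} vs {b}: {result}")
```

With real test data there would be one row per listener and sample, but the pairwise counting is the same.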
It would be quite nice to draw up some general conclusions based on this test and a modicum of other recent knowledge. Can we add to or critique the following list?

Tentative conclusions and notes:

General quality of the best encoders at 96 kbps:
• Opus 1.1 is the clear winner on average: opusenc --bitrate 96
• Apple AAC (iTunes 11.2.2) is the clear second place: qaac --cvbr 96 (or 96 kbps, VBR enabled, in iTunes)
• Ogg Vorbis (aoTuV Beta6.03) is third of the 96 kbps contenders: vorbisenc -q2.2
• The higher-bitrate LAME MP3 (about 128-140 kbps) is tied with Ogg Vorbis for joint third place: lame -V 5

The best mature 128-140 kbps MP3 encoder (significantly better than early 128 kbps MP3) was used as a well-known quality reference. The three more modern contenders matched or beat it at around 25% lower bitrate. While LAME MP3 itself has come a long way since early MP3 encoders and is very mature within the limitations of the format, newer technology has come a long way indeed by 2014. As noted in previous tests, LAME degrades noticeably before its average drops as low as 96 kbps.

These formats are not reliably transparent at 96 kbps when subjected to close critical listening on headphones in ideal conditions. All of them can reach transparency at higher bitrate settings. However, statistically significant listening test results become extremely difficult to obtain as settings approach transparency; any such test is likely to be a statistical tie. The results of this test are not necessarily applicable to different bitrates, however.

Broadly speaking, "perceptible but not annoying" (4.0 or better) indicates that music should remain enjoyable, and for less critical listening (e.g. on speakers, at low volumes, in noisy environments, or by listeners untrained in codec artifact spotting) differences compared to the original may go unnoticed. Many testers said this test was difficult, because most differences were subtle (with the exception of the low anchors).
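To illustrate why tests at near-transparent settings tend to end in statistical ties, here is a small Python sketch (using invented scores, not this test's data) that computes bootstrap confidence intervals for two encoders' mean ratings. When both encoders score close to 5.0, the intervals overlap heavily and no winner can be declared.

```python
import random

random.seed(0)

# Invented per-listener mean scores for two near-transparent encoders.
codec_a = [4.6, 4.8, 4.5, 4.9, 4.7, 4.6, 4.8, 4.5, 4.7, 4.9]
codec_b = [4.5, 4.9, 4.6, 4.8, 4.6, 4.7, 4.9, 4.4, 4.8, 4.8]

def bootstrap_ci(data, n_resamples=10_000, alpha=0.05):
    """95% bootstrap confidence interval for the mean score."""
    means = sorted(
        sum(random.choices(data, k=len(data))) / len(data)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2))]
    return lo, hi

ci_a = bootstrap_ci(codec_a)
ci_b = bootstrap_ci(codec_b)
# Heavily overlapping intervals: no significant difference, i.e. a statistical tie.
print(f"codec A: {ci_a}, codec B: {ci_b}")
```

The actual test's pre-established analysis was more sophisticated than this, but the underlying problem is the same: as scores bunch up near the top of the scale, ever more listeners and samples are needed to separate the contenders.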
The difficulty of detecting differences at such modest bitrates demonstrates the maturity of the best encoders of today (2014) and the advances made over the last decade or so. All contenders except the FAAC anchors are well-tuned Variable Bitrate (constant quality) encoders.

Do not worry about unusually low or unusually high bitrates within your music collection. In such mature encoders these are the result of smart decision-making, not errors. The bitrate-versus-quality scatter plots show no appreciable bias to support the idea that a low bitrate is chosen in error, and on average over a large collection the bitrate will still be close to the target.

The results are specific to the encoder and settings used, not the format in general. This is very clear from FAAC performing much worse than Apple AAC. Ogg Vorbis has less variation among encoders, and Opus has only two versions to date: Opus 1.0 was well tuned at launch, and Opus 1.1 is largely an incremental improvement.

There is no significant evidence in this test of any encoder being more suitable for a particular genre of music.

Other considerations may be important, such as compatibility with many devices (MP3 is near-universal; AAC is very widely supported too) or patent-free status (Vorbis and Opus are widely considered patent-free and thus suitable for projects like Wikimedia or for use in games). The balance may shift over time: as new standards like WebRTC (where Opus is Mandatory to Implement) become adopted, Opus may become readily playable on more devices.

Note that LC-AAC is capable of transparency. Apple HE-AAC may well not be, and in any case is worse than Apple LC-AAC above about 64 kbps, so you should not force HE mode on the assumption that higher efficiency must mean higher quality. The default behaviour is usually well chosen, and HE-AAC's extra features serve to simulate high-frequency content when bitrate is very limited.
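A quick sketch of the point about VBR bitrates, with invented per-track numbers: individual tracks from a 96 kbps target encode can land well above or below the target, while the average over a collection stays close to it.

```python
# Hypothetical per-track bitrates (kbps) from a 96 kbps VBR target encode.
# Invented numbers for illustration; a quiet acoustic track may need far
# fewer bits than dense electronic music, and neither is an encoder error.
bitrates = [72, 110, 95, 88, 130, 64, 101, 97, 92, 115, 85, 103]

average = sum(bitrates) / len(bitrates)
spread = max(bitrates) - min(bitrates)
print(f"average: {average:.1f} kbps, per-track spread: {spread} kbps")
```

Despite a per-track spread of tens of kbps, the collection average sits at the target, which is exactly the behaviour the scatter plots from the test support.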
Opus won with the encoder from the standard's developers, and there are no poor-quality Opus encoders at present. Poor AAC encoders do exist, and in certain areas, such as some video transcoders and video-streaming sites, they are used more widely than is desirable for music encoding.

This whole test was done using double-blind testing rather than sighted testing, so that deliberate or unintended bias could not influence the results. In fact, the results surprised a number of the listeners who took part when they were un-blinded after the conclusion of the test.

This is by no means definitive. Your comments and additions would be welcome.