[span style=\'font-size:14pt;line-height:100%\']Preliminary notes[/span]
Two years ago I performed and published my two first listening tests. Both included different formats and encoders at ~130 kbps and involved a dozen of samples: classical music only. My purpose was to see which encoder was able to produce the best encoding at a friendly bitrate (friendly for portable players), and for a specific kind of music. iTunes AAC & WMAPro appeared to be the best encoders (for myself), and the absolute quality of both encoders at such bitrate surprised me. Last year (December 2004) I performed two similar tests: the first was dedicated to AAC (Nero, Apple, old and new encoders) and the second was a match between the best AAC encoder (Nero Digital “fast” VBR) and the most advanced Vorbis one (aoTuV beta 3). Quality and enjoyment were even higher!
This year I performed a fresh multiformat listening test at 130 kbps. This new test is very different from their predecessors from a methodological point of view. I progressively improved my approach of listening tests and tried to answered to all criticism addressed in the past to previous tests (and not necessary mine). Consequently, my “personal evaluations” which were first a friendly exercise feasible in one rainy, autumnal afternoon now looks as a gigantic task which took me approximately 10 days (shared with family, friends, job, and discouragement) to complete. I improved several point of the methodology; to sum them up:
• diversity : the following test is not only based on “classical” music, and will also include several (fifty!) samples of “modern” music.
• grading : described once as “temperamentic” I decided to stick all marks between two anchors, a low and a high one. It will decrease the contrast between different encoders and increase at the same time the difficulty of the full exercise but it should also ensure a more accurate grading. The low anchor is vital to prevent an excessively harsh grading; the high anchor is essential to temper enthusiasm: a very good encoding at 130 kbps should be marked in regard to an excellent and high bitrate one. The presence of both anchors should guarantee a right grading: not too low, not too high.
• complexity : people reproached to some listening tests to focus only “critical” or “complex” samples. It may be a problem with some VBR implementations, which sometimes decrease too much the bitrate on “non-complex” samples. In my opinion, a listening test should include both types of samples, at least to verify that non-complex/low bitrate parts are as well encoded as complex/high bitrate ones. Usually, VBR encoders handle very well non-complex part. Usually… The complexity range of my gallery of samples is wide enough to represent all situations (from ultra-low bitrate to ultra high ones) and to check the strength of VBR implementations.
• abundance : a bunch of 12…15 samples is maybe not enough to give an accurate idea of the strength and weakness of different encoders. I experienced it myself in the past: my previous tests didn’t reveal some problems I only noticed after on real usage, and more important, they were unable to expose the recurrence of the detected problems. Detecting one problem (like rumbling or ringing) is one thing, measuring the periodicity of this problem is another thing. My test is based on 200 samples; this number should be enough to expose all common problems plus several uncommon ones and is also sufficient to get an idea of their redundancy. This is in my opinion the biggest advantage of my personal listening test over collective ones (which must stay friendly to avoid discouragement and attract a lot of testers).
• statistical analysis : it might appear as trivial to mention this, but statistical analysis of results and confidence bars are presents (they were not used last year and the year before).
• “Apples and Oranges” : no need to recall the problem. This test only mobilizes VBR encoders. No debate this time.
[span style=\'font-size:14pt;line-height:100%\']THE TEST: CHOICE OF ENCODERS [/span]
The market of audio encoders is ruled by a Darwinian process: the stronger only survive. Between my first test (October 2003) and this one (november 2005), only few encoders really progressed. Most other (some of them are still in use) are unchanged or only changed once: MPC, WMA (Standard and Pro), faac, all MP3 encoders (excepted LAME). Another one appeared and disappeared in the meantime (Compaact!).
On the hardware side, the situation is now very different from the one I lived two years ago. With the exception of one or two devices, AAC and Vorbis support in hardware players were more a dream than reality. Testing different audio formats was useful for a virtual and opened future, rich in dreams and promises. Now, the concrete situation is more interesting than dreams. MP3 and WMA (Std) are still the two well-established formats, but Vorbis now benefits from a growing interest of several manufacturers and if AAC still looks like an Apple monopoly the iPod market has at least mutated into several form (flash memory players, Microdrive™ based jukebox). One victim of reality is WMAPro, still not supported; and the growing popularity of WMA labeled as PlaysForSure (based on WMA Std) seems to sentence WMAPro to a long exile.
For all these considerations, I restricted the test to the most usable and interesting encoders: AAC-LC (highly developed by Apple and Nero Digital), MP3 (vigorous as ever, thanks to LAME devs), Vorbis (saved from inertia by Aoyumi). Besides these four encoders, I add two anchors. More precisely:
• Apple AAC: I used iTunes 6.0.0.18 (based on QuickTime 7.03), at 128 kbps and with the recently added VBR mode . I test Apple AAC in VBR for the first time. I sadly discovered that this encoder use the same trick as the MP3 encoder included in iTunes: the minimal size of the frames are not inferior to the targeted bitrate (apart maybe digital silence). In other words, for 128 VBR encodings the bitrate starts at 128 kbps and is increased with complexity. No need to precise that if average bitrate stays close to the target, the variations are necessary limited. One advantage: this restricted mode prevents the VBR engine to use inadequately low bitrate frames, and should guarantee quality from bad surprises compared to a CBR encoding.
• Nero Digital AAC: I used the very new encoder released two weeks ago (aac.dll v.3 and aacenc32.dll v.4.2.1.0 ), in VBR mode too. –internet profile is the closed to 128 kbps (slightly inferior with classical music, but higher with non-classical. I didn’t use the “fast” mode, which is now pretty similar but probably inferior to the “high” one.
• LAME MP3: I used latest alpha of 3.98 (alpha 2) in order to add the –athaa-sensitivity 1 command to the –V5 --vbr-new mode. For the second group of samples and to slightly lower the bitrate I simply used –V5 –vbr-new.
• Vorbis: I used aoTuV beta 4 (4.5 was released during the testing phase) instead of official 1.1.1 which corresponds to the 18 months old aoTuV beta 2 version. I used –q4,25 for the first group and –q4,00 for the second.
• As low anchor, I looked for something really low and also usable in batch mode. I found a very old AAC encoder on ReallyRareWares called mbaacencoder version 0.3: it’s awfully slow, quality is terrific and is as anecdote ideal to get an idea of all progress made around AAC between 1999 (release date of mbaaencoder) and 2005 (Apple and Nero Digital). I tried to get joint stereo and LC profile in batch mode, but the encoder apparently stayed in default mode (Main Profile, 128 kbps and dual stereo).
• As high anchor, I didn’t hesitate and used LAME 3.97 beta 1 –V2 --vbr new (or --preset standard) which is a reference for efficient, high quality and universal encodings. Furthermore, it would be interesting to evaluate the remaining gap between modern implementation of AAC and Vorbis at ~128 kbps to HQ MP3 at ~192 kbps.
[span style=\'font-size:14pt;line-height:100%\']SAMPLES [/span]
The test hinges on two big groups of samples: 150 for “classical” music group and 50 for “non-classical” (or “various”, or “modern”, or “popular”… choose your own) group. I already used the first group in three different tests in the past (80 kbps, 96 kbps, and LAME –V5). The complete collection is available for download. The 2nd group consist on all (35) non-classical samples used in previous collective listening tests; they’re all available on rarewares. To decrease the gap between the first and the second group I’ve add 15 other samples, all recently submitted for the postponed 64 kbps listening test of Sebastian Mares. Most of these last files may still be available.
[span style=\'font-size:14pt;line-height:100%\']THE BITRATE [/span]
The bitrate comparison is more accurate for the first group: it’s based on full tracks (6min 30 sec. per file on average) instead of short samples (10 sec. on average), and the complete collection is last but not least very representative of my entire library. For the second group of samples, I proceeded differently and I based the bitrate calculation on the 50 samples (which are longer: 24 sec. on average) and on external data (bitrate table for LAME posted by someone else). This way to evaluate the bitrate is not very precise, but I don’t have enough material to build a more accurate bitrate table. That’s why I tried to lower at maximum the difference in bitrate for all settings, and changed the command line for Vorbis (from –q4,25 to –q4,00) and LAME (--athaa-sensitivity 1 was removed).
To sum up the datas (a complete bitrate table will follow in the next days):
CLASSICAL (full tracks)
low anchor 128,00 kbps (estimated)
AAC iTunes 133,33 kbps [+4,16 %]
AAC Nero 125,71 kbps [-1,79 %]
MP3 LAME 130,81 kbps [+2,20 %]
Vorbis aoTuV 131,69 kbps [+2,88 %]
high anchor 181,46 kbps [+41,77 %]
NON-CLASSICAL (short samples)
low anchor 128,00 kbps (estimated)
AAC iTunes 137,31 kbps [+7,27 %]
AAC Nero 134,10 kbps [+4,76 %]
MP3 LAME¹ 137,82 kbps [+7,67 %]
Vorbis aoTuV² 133,42 kbps [+4,23 %]
high anchor 196,28 kbps [+53,34 %]
¹ with --athaa-sensitivity 1 bitrate reaches 139,38 kbps
² with –q4,25 bitrate reaches 140,21 kbps
[span style=\'font-size:14pt;line-height:100%\']TESTING CONDITIONS [/span]
The full test consists on pure ABC notation. The double blind test conditions are ensured by schnofler ABC/HR 0.5 beta (2005.08.31) software. All samples were decoded by CLI decoded within ABC/HR; offset were removed each times and minor differences in gain were systematically corrected (the highest difference reached 1.2 dB). Small mention for Vorbis: all files were decoded with foobar2000 (I still can’t make ABC/HR decode Vorbis files). There are no ABX comparisons: it’s a luxury I can’t afford with 1200 files awaiting for evaluation (200 x 6). If a difference is really unsure, I don’t rank the file. I finally ranked 16 times the reference instead of the encoded one (and 6 mistakes concern the high anchor). The error is inferior to 1.5%. I didn’t discard the errors from the final results (they don’t have a significant impact).
My hardware setting: Beyerdynamic DT-531 headphone; Audigy2 soundcard; Onkyo A-5 amp.
[span style=\'font-size:14pt;line-height:100%\']DREAM AND REALITY…[/span]
Last words before posting the results: I planed to write a complete review, including a complete synthesis on most common problems encountered in this test. Different encoders have different problems, and some of them are recurrent. As example, LAME produce often weird kind of rumbling (noise in low frequencies) and smearing; Vorbis has still issues with what I called “microdetails” (blurred and replaced by noise) and sometimes coarseness; iTunes suffers sometimes from a form of ringing I can’t define; Nero Digital has serious troubles on tonal passage and poor pre-echo performance.
I didn’t compile this memento yet, which should interest developers more than users. But I publish the results yet, because I feel that it’ time for me to close this test (honestly, seeing ABC/HR running somewhere drives me mad or sick).
Results are published as big png files; file size is not an issue (only 111 kb) but the image size may cause issue on small display resolution (800x600). I apologize for inconvenience. Small comments are ending the graphs. Here again, I planned to write more detailed comments, but until I achieve what I planed to do I fear that the week-end and maybe the month will be over. I postponed several activities during the two last weeks to perform and present this test, but I can’t continue anymore. If I remember correctly there’s a life outside ABC/HR I also suspect that most people are not reading comments or details and are more interested by the final ranking. That’s why my results I’ll post today are a bit in “raw” form. I sincerely apologize, and will try to (slowly) give more textual substance in the next days. Now, results