Personal evaluation at ~130..135 kbps, 200 samples

2005-11-15 08:14:29

[span style='font-size:14pt;line-height:100%']Preliminary notes[/span]

Two years ago I performed and published my first two listening tests. Both included different formats and encoders at ~130 kbps and involved a dozen samples, classical music only. My purpose was to see which encoder could produce the best encoding at a friendly bitrate (friendly for portable players) and for a specific kind of music. iTunes AAC and WMAPro appeared to be the best encoders (to my ears), and the absolute quality of both at such a bitrate surprised me. Last year (December 2004) I performed two similar tests: the first was dedicated to AAC (Nero, Apple, old and new encoders) and the second was a match between the best AAC encoder (Nero Digital “fast” VBR) and the most advanced Vorbis one (aoTuV beta 3). Quality and enjoyment were even higher!

This year I performed a fresh multiformat listening test at 130 kbps. This new test is very different from its predecessors from a methodological point of view. I progressively improved my approach to listening tests and tried to answer all the criticism addressed in the past to previous tests (not necessarily mine). Consequently, my “personal evaluations”, which began as a friendly exercise feasible in one rainy autumn afternoon, now look like a gigantic task that took me approximately 10 days (shared with family, friends, job, and discouragement) to complete. I improved several points of the methodology; to sum them up:

• diversity: the following test is not based only on “classical” music; it also includes several (fifty!) samples of “modern” music.
• grading: my grading was once described as “temperamental”, so I decided to keep all marks between two anchors, a low one and a high one. This decreases the contrast between encoders and at the same time increases the difficulty of the whole exercise, but it should also ensure more accurate grading. The low anchor is vital to prevent excessively harsh grading; the high anchor is essential to temper enthusiasm: a very good encoding at 130 kbps should be marked relative to an excellent, high-bitrate one. The presence of both anchors should guarantee fair grading: not too low, not too high.
• complexity: some listening tests have been reproached for focusing only on “critical” or “complex” samples. This may be a problem with some VBR implementations, which sometimes lower the bitrate too much on “non-complex” samples. In my opinion, a listening test should include both types of samples, at least to verify that non-complex/low-bitrate parts are encoded as well as complex/high-bitrate ones. Usually, VBR encoders handle non-complex parts very well. Usually… The complexity range of my sample gallery is wide enough to represent all situations (from ultra-low to ultra-high bitrates) and to check the strength of VBR implementations.
• abundance: a set of 12…15 samples may not be enough to give an accurate idea of the strengths and weaknesses of different encoders. I experienced it myself in the past: my previous tests didn’t reveal some problems I only noticed later in real usage and, more importantly, they were unable to show how often the detected problems recur. Detecting one problem (like rumbling or ringing) is one thing; measuring how frequently it occurs is another. My test is based on 200 samples; this number should be enough to expose all common problems plus several uncommon ones, and is also sufficient to get an idea of their frequency. This is in my opinion the biggest advantage of my personal listening test over collective ones (which must stay friendly to avoid discouragement and attract a lot of testers).
• statistical analysis: it might seem trivial to mention this, but statistical analysis of the results and confidence bars are present (they were not used last year or the year before).
• “Apples and Oranges”: no need to recall the problem. This test only involves VBR encoders. No debate this time.
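
The post does not say which statistical method produced the confidence bars. As a minimal sketch only, assuming a plain normal-approximation 95% interval around the mean grade (the function name and the grades below are made up for illustration and are not test data), the idea can be written as:

```python
import math
import statistics

def mean_with_confidence(scores, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Assumption: a simple normal approximation; the original test may have
    used a different analysis (e.g. ANOVA-based bars).
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    return mean, z * sem

# Hypothetical grades on a 5-point scale, for illustration only.
example_scores = [4.0, 4.5, 3.5, 5.0, 4.0, 4.5]
mean, half_width = mean_with_confidence(example_scores)
print(f"{mean:.2f} +/- {half_width:.2f}")
```

Two encoders whose intervals overlap heavily cannot be confidently separated, which is why a larger number of samples narrows the bars.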

[span style='font-size:14pt;line-height:100%']THE TEST: CHOICE OF ENCODERS[/span]

The market of audio encoders is ruled by a Darwinian process: only the strongest survive. Between my first test (October 2003) and this one (November 2005), only a few encoders really progressed. Most others (some of them still in use) are unchanged or changed only once: MPC, WMA (Standard and Pro), faac, all MP3 encoders (except LAME). Another one appeared and disappeared in the meantime (Compaact!).
On the hardware side, the situation is now very different from the one I knew two years ago. Back then, with the exception of one or two devices, AAC and Vorbis support in hardware players was more a dream than a reality. Testing different audio formats was useful for a virtual and open future, rich in dreams and promises. Now, the concrete situation is more interesting than dreams. MP3 and WMA (Std) are still the two well-established formats, but Vorbis now benefits from the growing interest of several manufacturers, and if AAC still looks like an Apple monopoly, the iPod line has at least diversified into several forms (flash memory players, Microdrive™-based jukeboxes). One victim of reality is WMAPro, still not supported; and the growing popularity of WMA labeled as PlaysForSure (based on WMA Std) seems to sentence WMAPro to a long exile.
For all these reasons, I restricted the test to the most usable and interesting formats: AAC-LC (highly developed by Apple and Nero Digital), MP3 (vigorous as ever, thanks to the LAME devs), Vorbis (saved from inertia by Aoyumi). Besides these four encoders, I added two anchors. More precisely:

• Apple AAC: I used iTunes 6.0.0.18 (based on QuickTime 7.03), at 128 kbps and with the recently added VBR mode. I am testing Apple AAC in VBR for the first time. I sadly discovered that this encoder uses the same trick as the MP3 encoder included in iTunes: the minimum frame size never falls below the targeted bitrate (apart maybe from digital silence). In other words, for 128 kbps VBR encodings the bitrate starts at 128 kbps and increases with complexity. Needless to say, if the average bitrate stays close to the target, the variations are necessarily limited. One advantage: this restricted mode prevents the VBR engine from using inadequately low-bitrate frames, and should protect quality from bad surprises compared to a CBR encoding.

• Nero Digital AAC: I used the very new encoder released two weeks ago (aac.dll v.3 and aacenc32.dll v.4.2.1.0), in VBR mode too. The -internet profile is the closest to 128 kbps (slightly lower with classical music, but higher with non-classical). I didn’t use the “fast” mode, which is now pretty similar but probably inferior to the “high” one.

• LAME MP3: I used the latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 switch to the -V5 --vbr-new mode. For the second group of samples, to slightly lower the bitrate, I simply used -V5 --vbr-new.

• Vorbis: I used aoTuV beta 4 (4.5 was released during the testing phase) instead of the official 1.1.1, which corresponds to the 18-month-old aoTuV beta 2. I used -q4,25 for the first group and -q4,00 for the second.

• As low anchor, I looked for something really low and also usable in batch mode. I found a very old AAC encoder on ReallyRareWares called mbaacencoder version 0.3: it’s awfully slow, quality is terrible and, as an anecdote, it is ideal to get an idea of all the progress made on AAC between 1999 (release date of mbaacencoder) and 2005 (Apple and Nero Digital). I tried to get joint stereo and LC profile in batch mode, but the encoder apparently stayed in its default mode (Main Profile, 128 kbps and dual stereo).

• As high anchor, I didn’t hesitate and used LAME 3.97 beta 1 -V2 --vbr-new (or --preset standard), which is a reference for efficient, high-quality and universal encodings. Furthermore, it should be interesting to evaluate the remaining gap between modern implementations of AAC and Vorbis at ~128 kbps and HQ MP3 at ~192 kbps.

[span style='font-size:14pt;line-height:100%']SAMPLES[/span]

The test hinges on two big groups of samples: 150 for the “classical” music group and 50 for the “non-classical” (or “various”, or “modern”, or “popular”… choose your own) group. I already used the first group in three different tests in the past (80 kbps, 96 kbps, and LAME -V5). The complete collection is available for download. The second group consists of all (35) non-classical samples used in previous collective listening tests; they’re all available on rarewares. To narrow the gap between the first and the second group I’ve added 15 other samples, all recently submitted for the postponed 64 kbps listening test of Sebastian Mares. Most of these last files may still be available.

[span style='font-size:14pt;line-height:100%']THE BITRATE[/span]

The bitrate comparison is more accurate for the first group: it’s based on full tracks (6 min 30 sec. per file on average) instead of short samples (10 sec. on average), and, last but not least, the complete collection is very representative of my entire library. For the second group of samples I proceeded differently: I based the bitrate calculation on the 50 samples themselves (which are longer: 24 sec. on average) and on external data (a bitrate table for LAME posted by someone else). This way of evaluating the bitrate is not very precise, but I don’t have enough material to build a more accurate bitrate table. That’s why I tried to minimize the difference in bitrate between all settings, and changed the command line for Vorbis (from -q4,25 to -q4,00) and LAME (--athaa-sensitivity 1 was removed).
To sum up the data (a complete bitrate table will follow in the next days):

CLASSICAL (full tracks)

low anchor	128,00 kbps (estimated)
AAC iTunes	133,33 kbps [+4,16 %]
AAC Nero	125,71 kbps [-1,79 %]
MP3 LAME	130,81 kbps [+2,20 %]
Vorbis aoTuV	131,69 kbps [+2,88 %]
high anchor	181,46 kbps [+41,77 %]


NON-CLASSICAL (short samples)

low anchor	128,00 kbps (estimated)
AAC iTunes	137,31 kbps [+7,27 %]
AAC Nero	134,10 kbps [+4,76 %]
MP3 LAME¹	137,82 kbps [+7,67 %]
Vorbis aoTuV²	133,42 kbps [+4,23 %]
high anchor	196,28 kbps [+53,34 %]

¹ with --athaa-sensitivity 1 bitrate reaches 139,38 kbps 
² with -q4,25 bitrate reaches 140,21 kbps
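
The bracketed percentages are not explained in the table, but they appear to be deviations from the nominal 128 kbps of the low anchor. A short sketch that reproduces the classical column under that assumption (the baseline constant and the function name are mine, not the author's):

```python
# Assumed baseline: 128 kbps reproduces the bracketed figures in the table.
NOMINAL_KBPS = 128.0

# Average bitrates of the classical group, copied from the table.
classical_kbps = {
    "AAC iTunes": 133.33,
    "AAC Nero": 125.71,
    "MP3 LAME": 130.81,
    "Vorbis aoTuV": 131.69,
    "high anchor": 181.46,
}

def deviation_percent(kbps, nominal=NOMINAL_KBPS):
    """Relative difference to the nominal bitrate, in percent."""
    return (kbps / nominal - 1.0) * 100.0

deviations = {
    name: round(deviation_percent(kbps), 2)
    for name, kbps in classical_kbps.items()
}

for name, dev in deviations.items():
    print(f"{name:<14}{dev:+.2f} %")
```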

[span style='font-size:14pt;line-height:100%']TESTING CONDITIONS[/span]

The full test consists of pure ABC rating. Double-blind conditions are ensured by schnofler’s ABC/HR 0.5 beta (2005.08.31) software. All samples were decoded by CLI decoders within ABC/HR; offsets were removed each time and minor differences in gain were systematically corrected (the largest difference reached 1.2 dB). A small note for Vorbis: all files were decoded with foobar2000 (I still can’t make ABC/HR decode Vorbis files). There are no ABX comparisons: it’s a luxury I can’t afford with 1200 files awaiting evaluation (200 x 6). If a difference is really unsure, I don’t rank the file. In the end I ranked the reference instead of the encoded file 16 times (and 6 of these mistakes concern the high anchor). The error rate is below 1.5%. I didn’t discard the errors from the final results (they don’t have a significant impact).
My hardware setup: Beyerdynamic DT-531 headphones; Audigy2 soundcard; Onkyo A-5 amp.
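
For readers who want to check the figures quoted above, the arithmetic behind the 1200 ratings and the sub-1.5% error rate is simply the following (variable names are illustrative):

```python
samples = 200
contenders = 6      # four encoders plus the low and high anchors
ratings = samples * contenders

misranked = 16      # times the reference was ranked instead of the encoded file
error_rate = misranked / ratings

print(f"{ratings} ratings, error rate {error_rate:.2%}")
```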

[span style='font-size:14pt;line-height:100%']DREAM AND REALITY…[/span]

Last words before posting the results: I planned to write a complete review, including a full synthesis of the most common problems encountered in this test. Different encoders have different problems, and some of them are recurrent. For example, LAME often produces a weird kind of rumbling (noise in low frequencies) and smearing; Vorbis still has issues with what I call “microdetails” (blurred and replaced by noise) and sometimes coarseness; iTunes sometimes suffers from a form of ringing I can’t define; Nero Digital has serious trouble with tonal passages and poor pre-echo performance.
I haven’t compiled this memento yet; it should interest developers more than users. But I’m publishing the results now, because I feel that it’s time for me to close this test (honestly, seeing ABC/HR running somewhere drives me mad or sick).
Results are published as big PNG files; file size is not an issue (only 111 KB) but the image dimensions may cause problems at small display resolutions (800x600). I apologize for the inconvenience. Short comments follow the graphs. Here again, I planned to write more detailed comments, but I fear that by the time I achieve what I planned to do, the weekend and maybe the month will be over. I postponed several activities during the last two weeks to perform and present this test, but I can’t continue anymore. If I remember correctly there’s a life outside ABC/HR. I also suspect that most people don’t read comments or details and are more interested in the final ranking. That’s why the results I’ll post today are a bit “raw”. I sincerely apologize, and will try to (slowly) add more text in the next days. Now, results…

Reply #1 – 2005-11-15 08:14:59

[span style='font-size:20pt;line-height:100%']RESULTS[/span]

[span style='font-size:14pt;line-height:100%']I. CLASSICAL: 5 electronic/artificial samples micro-group[/span]

[span style='font-size:14pt;line-height:100%']II. CLASSICAL: 60 orchestral & chamber samples macro-group[/span]

[span style='font-size:14pt;line-height:100%']III. CLASSICAL: 55 solo instruments samples macro-group[/span]

[span style='font-size:14pt;line-height:100%']IV. CLASSICAL: 30 samples macro-group[/span]

[span style='font-size:14pt;line-height:100%']V. NON-CLASSICAL or MODERN or VARIOUS: 50 samples macro-group[/span]

Reply #2 – 2005-11-15 08:15:25

A few words to conclude the test…
It’s pretty clear that all encoders tested here deliver good or even very good output quality. There is currently no winner between AAC (iTunes) and Vorbis. It’s funny to see that the results are so close at the finish line when the problems are so different. Encodings are not fully transparent, but quality is in my opinion excellent most of the time (but not always).
LAME gives MP3 the chance to stay competitive against AAC and Vorbis. Not fully competitive, but the efficiency of this format commands respect.
Nero Digital’s implementation of AAC is slightly disappointing, especially with classical music, which is still a weak point of this encoder. But the quality is far from a disaster (which wasn’t the case two years ago), is on average really good, gets even better with “non-classical” music and should satisfy many users.
Last but not least, the difference among all these encoders is really small (don't read too much into the "zoomed" plots).

But the average mark is somewhat misleading. LAME’s average is ~0.5 point lower than that of iTunes or Vorbis, but this doesn’t mean, for example, that the quality of encoded albums is 0.5 point lower. This lower ranking is rather the expression of higher fragility than of lower quality. LAME and Nero Digital are more prone to serious distortions than Vorbis or iTunes AAC at the same bitrate. With such encoders, the concept of quality may be replaced by the concept of strength or robustness. To illustrate this I made the following histogram (sorry for the poor quality, I’ll change it later):

Here, Vorbis and iTunes both get a mark between 4.5 and 5.0 for 50% of the tested samples, whereas Nero only achieves this level (near-transparency or full transparency) for 20% of the same samples. In the classical group, 30% of the samples were ranked below 3.0 with Nero, whereas iTunes and Vorbis got such marks for less than 10% of the samples. The two winners are stronger, and can handle more situations than LAME and Nero Digital AAC.
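
The distribution figures above amount to counting the share of samples whose grade falls in a given range. A minimal sketch of that computation, with a hypothetical helper and made-up grades (these are not the test's data):

```python
def share_in_range(scores, low, high):
    """Fraction of scores with low <= score <= high."""
    if not scores:
        raise ValueError("no scores given")
    hits = sum(1 for s in scores if low <= s <= high)
    return hits / len(scores)

# Hypothetical grades for one encoder, for illustration only.
example_scores = [5.0, 4.7, 4.5, 4.2, 3.8, 3.0, 2.5, 4.9, 4.6, 2.8]

# Share of near-transparent or transparent results (4.5 to 5.0).
near_transparent = share_in_range(example_scores, 4.5, 5.0)
# Share of results ranked below 3.0.
poor = share_in_range(example_scores, 1.0, 2.99)

print(f"4.5-5.0: {near_transparent:.0%}, below 3.0: {poor:.0%}")
```

Comparing these two shares between encoders says more about robustness than the mean grade alone does.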

Reply #3 – 2005-11-15 08:30:26

Quote

(honestly, seeing ABC/HR running somewhere drives me mad or sick).

I can imagine that. Boy, performing this test must have been such a huge task... I'm extremely impressed!

Thanks a lot for sharing this with us, it's very interesting (especially now that you also included non-classical music).

My hat's off to you, Sir!

Reply #4 – 2005-11-15 08:34:09

Bravo! You are much braver and more patient than I am! It would seem that buying from the iTunes store isn't such a bad quality sacrifice, going by your test. Also, it's too bad aoTuV saw another update in the middle of your test... now you have to start again... only kidding! I don't think the quality level you tested was tuned any further in 4.5.

Thanks again, your blind tests are one of the top attractions around here.

Reply #5 – 2005-11-15 08:42:03

Changes in aoTuV beta 4.5 only concern lower settings (up to -q3,00). Fortunately, I would (exceptionally) say.

Reply #6 – 2005-11-15 08:45:24

Once again guruboolez, thank you for your amazingly informative tests! And thanks for subjecting your ears to the rigours of modern music...

It's also nice to see that Aoyumi's work on vorbis is keeping it at the forefront of modern audio compression.

Reply #7 – 2005-11-15 08:53:07

Thank you guruboolez.

These tests are so important to the community.

Reply #8 – 2005-11-15 09:13:44

Quote

Consequently, my “personal evaluations”, which began as a friendly exercise feasible in one rainy autumn afternoon, now look like a gigantic task that took me approximately 10 days (shared with family, friends, job, and discouragement) to complete. I improved several points of the methodology

I am in awe...

I always do my own personal ABX test for my personal usage, but it is nothing compared to the enormous amount of work you do. Your tests and public results are very much appreciated, thank you.

edit: just finished reading the test results twice (to go through all the details), and I find it interesting that Nero still does not match Itunes, even though it uses a true VBR mode, whereas Itunes does not. I have been testing the new nero codec in VBR LC mode at lower bitrates for my W800i, and have been disappointed by it. What I did not do is compare it to Itunes. I will now.

Reply #9 – 2005-11-15 09:32:42

That must have been a heap load of data to compile
Did you nose bleed?

Reply #10 – 2005-11-15 09:40:42

Invaluable tests again. Thank you so much. Vorbis aoTuV is the leading codec at medium bitrates (tied with iTunes AAC). And from other tests you did, Vorbis also shines at low and high bitrates. Nice to confirm that LAME -V2 --vbr-new is still superior to iTunes AAC at medium bitrates (and pretty tied with AAC ~180kbps, I guess).

Reply #11 – 2005-11-15 09:41:49

As we say "down under", "Good on ya, mate!"

Reply #12 – 2005-11-15 10:18:22

Thanks Guru!

Reply #13 – 2005-11-15 10:41:09

Cheers Guru, fascinating stuff.

Reply #14 – 2005-11-15 10:45:44

Thanks a lot, very interesting test again!

Reply #15 – 2005-11-15 10:46:34

Thanks Guruboolez, very informative.

About LAME encoder being not well balanced:

Quote

LAME MP3: I used the latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 switch to the -V5 --vbr-new mode. For the second group of samples, to slightly lower the bitrate, I simply used -V5 --vbr-new.

I'm wondering, would your results be different if the encoder settings had been the same for the classical and non-classical groups?

Reply #16 – 2005-11-15 11:07:36

Oh. I don't think this test can be ignored just because it was done by one person. Nero really needs to do some work to improve their AAC implementation (maybe it's already in Ivan's brain).

Thank you guruboolez for your great work.

Reply #17 – 2005-11-15 11:09:19

Thanx!
guruboolez, how about a low-bitrate comparison (64 kbps and below)?

Reply #18 – 2005-11-15 11:15:05

you must be crazy
impressive work!

thanks a lot guruboolez

Reply #19 – 2005-11-15 11:38:20

Nice listening test, as always

Reply #20 – 2005-11-15 12:11:04

Awesome, awesome, awesome.

Very big thanks, Francis. You're a legend.

Reply #21 – 2005-11-15 12:23:19

Quote

Quote
LAME MP3: I used the latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 switch to the -V5 --vbr-new mode. For the second group of samples, to slightly lower the bitrate, I simply used -V5 --vbr-new.

I'm wondering, would your results be different if the encoder settings had been the same for the classical and non-classical groups?

I don't think so. The --athaa-sensitivity switch prevents a specific kind of ringing (I usually call it "background ringing"), and I don't remember any sample of the second group suffering from this problem (there are maybe one or two of them).

I already noticed this disparity in performance between the classical group and the "various" samples during my summer listening tests performed at [a href="http://foobar2000.net/divers/tests/2005.07/80/80TEST_PLOTS_06.png"]80 kbps[/a] and 96 kbps.
The difference is also not very large. And as you can see on the distribution histograms, the main difference occurs in the last part (ranking > 4.5). ~40% of the tested samples (classical) were ranked below 4.5 with LAME, but the proportion falls to 20% for the second category. It seems that for LAME there are more "easy to handle" situations in my sample gallery than among the 50 samples I collected from various listening tests. (I don't know if I'm really clear...).

Reply #22 – 2005-11-15 12:28:54

Quote

Thanx!
guruboolez, how about a low-bitrate comparison (64 kbps and below)?

I'm not very happy with the quality of current encoders at this bitrate. It is not really suitable for my personal use. Curiosity would therefore be my only motivation for such an exercise.

Reply #23 – 2005-11-15 12:36:31

Surprising to see how close Vorbis and iTunes are to the high anchor. I guess one could safely use 160kbps VBR for transparency with iTunes now (I previously used 192kbps).

Reply #24 – 2005-11-15 13:27:33

To guruboolez, thank you for yet another incredibly fascinating and informative listening test.

I am again very pleased to see Vorbis doing so well. Full credits to Aoyumi for his wonderful work. I'm also very pleased to see iTunes AAC doing so well too. It seems we do get value for money with these two encoders (ie. they're free!!! even better )