80 kbps personal listening test (summer 2005)
2005-07-10 19:13:53
[span style='font-size:17pt;line-height:100%']I. INTRODUCTION [/span] After some listening tests performed during two years, I’ve experimented something new, based on this discussion . This time, I've tried to perform a multiformat blind comparison based on a much larger group of samples, but without ABX confirmation. Tests are still performed within a double-blind methodology: only difference is that I haven’t spent time to confirm the audible differences with an ABX session. The spared time was invested in something more interesting (to my eyes but also for statistical analysis tools): 150 samples instead of the 15 usual ones.[span style='font-size:14pt;line-height:100%']1.1/ classical samples [/i][/span] Few words about this extravagant number. I was used to perform comparisons on a limited number of classical samples (15…20). It was probably enough to draw reliable leads about relative quality of various codecs, but such limited collection couldn’t represent the fullness of classical music, which consists of numerous instruments played in countless combinations, offering for most of them a wide dynamics. There are also voice, electronic, and to finish all variants linked to technical factors (acoustic, recording noise, etc…). That’s why I’ve tried to build a structured collection of “classical music” situations, which of course doesn’t aspire to completeness, but which should represent most situations. The collection is made up of very hard to encode samples as well as of very easy ones, loud (+10 dB) and ultra-quiet (+30 dB); noisy and crystal clear recordings; ultra-tonal and micro-detailed sounds. I’ve split it in four series:• artificial : electronic samples – most should correspond to critical samples for lossy encoders. Total: 5 samples . • ensemble : various instruments (no voice) played together. I’ve divided it in 2 categories: chamber music and orchestral music (wider ensemble). For each category, I’ve distinguished period instruments (Middle-Age, Renaissance, Baroque) and modern ones (~19 and 20 century). Total: 60 samples . • solo : instrument played alone. Again, I’ve created separate categories (winds, bow, pinch strings [i.e. guitar family: lute, theorbo, harp…], keyboards). Total: 55 samples . • voice : male, female, child – in solo, duo and chorus. Total: 30 samples . (note#1 : all samples are deliberately short. First, it’s easier to upload them. Second, there’s only one acoustic phenomenon to test per sample, and it makes comparison between different tests a bit more interesting. The exact length for the collection is 25 minutes; it corresponds to 10.00 seconds per sample on average). (note#2 : all samples were named following a simple convention. The first letter (A, E, S, V) corresponds to the category (artificial, ensemble, solo, voice). The number to the catalogue number. Then, additional information is tied: nature of instrument, type of instrument or voice, etc…ex : S11_KEYBOARD_Harpsichord_Aex : E35_PERIOD_CHAMBER_E_flutes_harpsichord.mpc To make short, samples will be called S11, E35, etc…) With such a collection, I should obtain very precise idea of different lossy encoders performance on classical. For me, it’s interesting, especially if I plan to buy in the near future a portable player supporting one new audio format, as Vorbis, AAC or WMAPro. I’d like to know how good these new formats are compared to MP3. These 150 samples may also help developers/testers for evaluating the performance of codec on a wide panel of situations.[span style='font-size:14pt;line-height:100%']1.2/ various music samples [/i][/span] Last and not least, I’ve decided to give more audience to this test by adding samples representing some other genres than classical. For an elementary reason –99.9% of my CDs are classical- I can’t build the same kind of structured collection with what I will call now to make short “various music”. I used all samples selected by Roberto during his listening tests, removed all classical ones, and kept the 35 samples representing “various music”. It’s much less than the 150 above, but more than the double of what was used during all previous collective listening tests. => total = 150 classical + 35 various = 185 samples.[span style='font-size:14pt;line-height:100%']1.3/ choice of bitrate [/i][/span] For my first test based on these samples, I’ve selected a friendly bitrate (at least as tester): 80 kbps. It may appear as uninteresting, that’s why I must explain my choice. First, I plan to perform similar tests at higher bitrate. My dream is to build a coherent set of tests including all bitrate from 80 to 160 or 192. But this project is very ambitious –too ambitious certainly- and I’ll possibly stop my tests (in this current form) at ~130 kbps. But why 80, and not 64 kbps? To my ears, there is currently no encoder that sound satisfying at 64 kbps. They’re all disappointing or unsuitable to listening on headphone, even crap ones, even on urban environment (I repeat: to my ears). But I’ve noticed that the perceptible and highly annoying distortions I’ve heard at 64 kbps are seriously lowered once the bitrate reaches the next step. Vorbis has less problems, AAC-LC (at least advanced encoders) also seems to improve quickly beyond 64 kbps. It’s a bit like mp3, which was considered as acceptable at 128 kbps, but which quickly sunk below this value. I would consider as reasonable the *idea* of an acceptable quality at 80 kbps with modern encoders. Let’s see the facts...[span style='font-size:17pt;line-height:100%']II. PROBLEMS [/span][span style='font-size:14pt;line-height:100%']2.1/ competitors [/i][/span] One big problem with this kind of test is the choice of competitors. Choosing the formats is easy: tester has just to select want he considers as interesting. Here, I’ll exclude outdated formats (vqf, MP3Pro) and unsuitable ones (MPC, MP3 – this last one would also be interesting to test, just for reference...). Remains: WMA, WMAPro (if available at this bitrate), AAC-LC, AAC-HE, Vorbis. But what implementation should I use? Nero AAC or iTunes AAC? Nero AAC features a VBR mode, but is VBR reliable at this bitrate, especially for samples which represents a wide dynamic? And for Nero, which encoder would be the best: the “high” one (default, which has verified issues with classical) or the “fast” one (which performs better with classical, but maybe not as well with various music, and which is still considered as not completely mature by Nero’s developers)? Vorbis CVS or Vorbis aoTuV? I’d say aoTuV, but if vorbis fails people will (legitimately) suspect the other one could have performed better. WMA CBR or WMA VBR? VBR is theoretically better than CBR, but tests have already shown that VBR could be unsafe at low bitrate. My first idea was to test them all. Schnofler ABC/HR allows the use of countless encoders in a same round (ff123 software is limited to 8 contenders). But after a quick enumeration of all possible competitors (iTunes AAC, Nero AAC CBR fast, Nero AAC CBR high, Nero AAC VBR fast, Nero AAC VBR high, faac, Vorbis aoTuV, Vorbis CVS, Vorbis ABR, WMA CBR, WMA VBR, HE-AAC fast, high, CBR & VBR...) and a mental calculation of the number of comparisons I have to perform with 185 samples and so many contenders, I’ve immediately canceled this project. Last but not least, multiplying the competitors in a single test will lower the significance (statistically speaking) of the results. Then, I came to a second idea: testing all competitors for one single format in a single pool, and put the winner of each pool in the final arena. It’s like sports: qualification first, final for the best. Remaining problem is the additional work. I’ve planned to test 4…5 codecs per bitrate with 185 samples, not 13 or 14. That’s why I’ve reduced the number of tested samples for the preliminary pools. I’ve limited the number at 40 samples, using 25 samples coming from different categories of the complete classical collection and 15 from the 35 samples representing “various music”. The imbalance in favor of classical is intended: the whole test is clearly focused on classical – “various music” is just an extension or bonus.[span style='font-size:14pt;line-height:100%']2.2/ Encoding mode and output bitrate [/i][/span] Other problem: VBR and CBR. Testing VBR and CBR has always been a source of controversy. In my opinion, testing a VBR encoder which outputs the targeted bitrate on average (i.e. a full set of CDs) is absolutely not a problem, even if bitrate reach amazing value on short tested samples. It’s not a problem, but the test should meet in my opinion the following condition: the test must include samples for which VBR encoders produce high bitrate as well as low one. VBR encoders have the chance to automatically increase the bitrate when a difficulty is detected – possibility that CBR encoders don’t have, and they sometime suffer from that handicap, especially on critical samples. But VBR encoders also decrease the bitrate of musical parts they don’t consider as difficult – and this diminution is sometimes very important; theoretically it shouldn’t affect the quality, but we know the gap between theory and reality, between principle and implementations of the principle. Testing the output quality of ‘non-difficult’ part is therefore very important, because these samples are the possible handicap of VBR encoders; otherwise there’s a big risk of favoring VBR encoders over CBR by testing only samples apparently favorable to VBR (whatever the format). My classical music gallery is not exclusively based on critical or difficult samples; most of them don’t exhibit any specific issue. The sample pool should therefore be fairly distributed between samples with lower bitrate than the targeted one and samples with a higher bitrate. I’ll post as appendix a distribution curve which confirms this.[span style='font-size:14pt;line-height:100%']2.3/ degree of tolerance [/span] By testing VBR profiles, it’s not always possible to match the exact target. Some encoders don’t have a precise scale of VBR settings. With luck, one available profile will approximately correspond to the fixed bitrate; sometimes, the output bitrate will deviate too much from the target. CBR is not free of problem either, although they’re less important. With AAC for example, CBR is a form of ABR: output bitrate could vary a little (but fortunately not very much). That’s why trying to obtain identical bitrate between various contender could be considered as an utopia, even when the test is limited to CBR encoders only. The tester has therefore to allow some freedom: not too much of course in order to keep significant comparisons and not too less in order to make the test possible. I consider a deviation of 10% as acceptable, but again, at one condition: 10% between the lowest averaged bitrate and the highest averaged one, and not 10% between all encoders and the target. As example, if one encoder reaches 72 kbps (80 kbps - 10%) and another 88 kbps (80 kbps + 10%), the total difference would be ~20%: too much. However, I will possibly allow rare exceptions: when a VBR profile is outside but close to the limit or if it would be more interesting to test a more common profile (example : musepack –quality 4 instead of –quality 3.9). Of course, the deviation mustn’t be exaggerated; and I’ll try to limit the possible exceptions to the pool, in order to keep the fairest conditions during the final test.[span style='font-size:14pt;line-height:100%']2.4/ Bitrate evaluation for VBR encoders [/span] Now that rules are fixed, we have to estimate the corresponding bitrate for each VBR encoder and profile. It’s not as easy as we can suppose. Ideally, I had to encode a lot of albums at each profile. But with my slow computer, it’s not really possible. And doing it would only help to obtain the corresponding bitrate for classical; according to my experience, this average bitrate could seriously differ from the output value that other people listening to other music (like metal) have already reported. Think about LAME sf21 issues, which could inflate the bitrate up to 230…250 kbps with –preset-standard, and compare it to the average bitrate I obtain with classical: <190 kbps! Other but different example: lossless. For practical reasons, I followed a methodology I don’t really consider as acceptable, and took the average bitrate of the 185 kbps as reference for my test. I don’t like it, because short samples could dramatically exaggerate the behavior of VBR encoders, and therefore distort the final estimation. Nevertheless, with 185 samples, this kind of over- and underrating occurring with some samples would normally be softened. And indeed, it seems that the average bitrate of encodings I’ve done of the full suite with formats I’ve used in the past (lame –preset standard, MPC) are very close to the average bitrate of my ancient music library. I can’t absolutely be certain that my gallery works like a microcosm and that bitrate matches the real usage of a full library, but I’m pretty sure that the deviation isn’t significative (+/- 5%, something like that).[span style='font-size:14pt;line-height:100%']2.5/ Bitrate report [/i][/span] There’s, before starting to reveal the results one last problem I’d like to put in the spotlight. It concerns the different way to calculate the bitrate. I’ve tried to obtain the most reliable value, and that’s why I’ve logically thought to calculate it myself with the filesize as basis. As long as no tags are integrated within the files, the calculated bitrate should correspond to the real one (audio stream). But the problem is somewhere else. Some formats are apparently embedded in complex containers, which weigh the size down. It’s not a problem in real life: adding something like 30 Kb per 5 Mb file is totally insignificant. But when these 30 Kb are appended to very short encodings, the calculation of the average bitrate is as consequence completely distorted. Concrete example: iTunes AAC. Just experiment the following thing: encode a sample (length: one second exactly) in CBR. At 80 kbps, we should obtain an 80 Kbits or 10 Kb file (80 x 1 / . But the final size is 60 Kb, and it corresponds to a 480 kbps (60x8) encoding! What’s the problem? Simply because iTunes add for each encoding something like 50 Kb of extra-chunks. The problem could be solved with foobar2000 0.8 and the “optimize mp4 layout” command: filesize drops to 14 Kb. But even here, the 14 Kb correspond to ~128 kbps bitrate, and the audio stream is only 80 kbps. iTunes is not apparently alone in this situation. I haven’t looked closely, but it seems that WMA (Pro) have the same behavior, and we have no “optimize WMA layout” tool to partially correct this. If we keep in mind that the average length of my samples is 10 second with some of them at only 5 seconds, we have to admit that calculating the bitrate with filesize/length formula is for this test anything but reliable. That’s why I followed the value calculated by specialized software. MrQuestionMan 0.7 was released during my test, but the software have some issue to calculate a correct average size on short-sized encodings (iTunes AAC encodings as example). Foobar2000 appeared as the most reliable tool, and I’ve decided to trust the calculated value. For practical reasons, foobar2000 is also preferable: the “copy name” command could be modified to easily export bitrate in spreadsheet.[span style='font-size:14pt;line-height:100%']2.6/ notation and scale [/u][/span] The -really- last problem Each time I have to evaluate quality at low bitrates I regret the inappropriateness of the scale in use in ABC/HR. At 80 kbps, encodings would rarely reach the 4.0 state (“slight but not annoying difference”). 3.0 (“slightly annoying”) would rather be the best quality degree that modern encoders could obtain at this bitrate. It implies that the notation will fluctuate within a compressed scale, from 1.0 to 3.0. It’s not very much, especially when big differences in quality between contenders are noticed by the tester. To solve this issue, I’ve simply mentally lowered the visible scale by one point. Example: when I considered an encoding to be “annoying” (state corresponding to “2.0”) I put the slider to 3.0. The scale I used for the test was: 5.0 : “perceptible but not annoying” 4.0 : “slightly annoying” 3.0 : “annoying” 2.0 : “very annoying” 1.0 : “totally crap” If exceptionally one encoding appeared as corresponding to “perceptible but not annoying” I’ve put the slider on 4.9, which means “5.0”; if the quality was superior to this state, I wrote the exact notation in comments. A transparent encoding obtained 6.0. When the tests were finished, I’ve removed one point to all notation. 6.0 became 5.0, 3.4 -> 2.4 and 1.0 were transformed in a shameful 0.0! By doing it, I maintain the usual scale; only change is therefore a lower floor, corresponding to an exceptionally bad quality. The redefinition of the quality scale could directly be redefined with Schnofler’s ABC/HR software, but apparently the tester have to type the description for each new test (did I miss an option?); it was faster for me to do this small mental exercise rather than typing more than 200 time the same content Now, the pools !