Personal Listening Test of Opus, Celt, AAC at 75-100kbps
Reply #28 – 2012-11-22 18:03:54
I think the objectives in tests (experiments) matter. And the questions they aim to answer should determine the primary graphs produced. Scatter plots with actual bitrates in test samples are often not important to the question being asked but may be interesting secondary information. I suspect for a lot of consumers, bitrate really amounts to a proxy for "How much music can I store on my device?" or "How much space can I save in my music to take photos on my smartphone?". Instantaneous bitrate or even bitrate over a whole song isn't that important to them. They may also ask: "What's the most music I can store on my device at a reasonably good quality?" The meaning of reasonably good quality will vary. For some, perceptible but not annoying differences may be tolerable much of the time, with just once or twice every few hours something mildly annoying and unmusical being noticed. For others, entirely transparent most of the time, but once or twice every few hours, some difference that's perceptible but not annoying being noticed is their quality lower limit. These are just two examples from the range of possible requirements. For example, if testing 160 variants (e.g. 20 samples over 2 bitrates of 4 encoders or 10 samples over 4 bitrates of 4 encoders)Question 1: At the same bitrate over a general representative collection (thus being able to store the same amount of music on my device), which encoder offers the best quality? (This is the question your tests are roughly set up to answer)Question 2: At the same bitrate over a general representative collection (thus being able to store the same amount of music on my device), what typical quality can we expect from each encoder (ignoring problem samples unless they account for 5% or more of general music). (This is the question 'some people' seem to want answered) Each question might warrant a different test: Question 1 might warrant a large number of problem samples of various classes to make rare annoyances easy to detect and penalize the encoder that deals with problem samples worse. This might be best for people who wish to avoid the occasional annoying and unmusical artifacts. Question 2 might warrant a collection of normal samples, not known to be problematic, to get a representative idea of typical quality, giving people an idea which bitrates might suit them. This might be best for people who are more forgiving of occasional artifacts, but would start to penalize any encoder frequently exposes artifacts rather than rarely. It might also be sensible to include the same 'typical' samples over a range of a few bitrates and limit the number of samples in the test corpus (e.g. make it 10 samples but test over 4 bitrates per encoder) to match this aim. There are reasons of academic interest and limited practical use to the generic listener, where you'd really want to see a scatter plot to know either how variable VBR might be in each encoder, or to compare quality with actual bitrate to get an idea of bitrate-efficiency of the coding tools available in a particular format. Even then, the length of the sample may have a big influence on the reported bitrate, e.g. a 2 second period requiring lots of high bitrate short blocks within a 30 second clip of otherwise ordinary sound, will exhibit a lower bitrate than the same 2 seconds within a 5 second clip even though the instantaneous bit rate of the difficult part would be the same. Then again, there are some samples like fatboy that occur almost throughout the song (Kalifornia by Fatboy Slim). In a sense it ought to be made clear to the average listener that the average bitrate line, not the scattered bitrate is what's important for the question of "How much music can I fit on my device" or "How much space will be left on my phone's microSD card for taking photos and videos?" and the y-axis height is the important point regarding quality, where the average quality may have some importance, the spread of quality values might also be important (less spread usually being good) and the lowest quality reported also having some importance (if you'll be rather unforgiving of a really nasty artifact, for example - in my case it's 'birdies' and 'warbling' in bad MP3 encoders that drive me crazy). I think Question 1 is where you're leaning when saying that changing the type of sample used changes the quality (easy sample - higher quality) even for VBR. I think Question 2 is the sort of listening test 'some people' have asked you for. Where many of us at Hydrogen Audio differ from you is in our belief that average bitrate over a collection of CDs is the important bitrate calibration even for Question 1. When making an experiment to test Question 1, we use hard samples to differentiate encoders more easily and find out which is best. At the same time, we realise that the reported Quality is more representative of rare problem cases only, and is therefore penalised, and is not representative of normal music, so we wouldn't claim to have useful information about general quality that can be expected. Question 2 is actually quite hard to test, partly because the Quality is often very close to transparent (5.0) especially at around 96-100 kbps on recent encoders with easy samples. I guess a useful graph for a single codec and mode (e.g. Opus in VBR mode) could be one that showed Quality (y) versus Average Bitrate (x) but plotted as two different lines at each bitrate. The upper line (blue) would be general music quality (based on 5 to 20 samples of normal music of various genres). This might start rather low at 32kbps, increase at 48 kbps, get pretty high at 64 kbps and reach 5.0 at 96 kbps The lower line (red) would be 'problem sample quality' where a collection of typical codec-killers is used (fatboy, tomsdiner, eig, etc.). This might start terribly low at 32kbps, still be pretty bad at 48, get a bit better and 64 kbps and reach somethiing like 4.0 at 96 kbps for example, and if extended to, say 128 kbps, it might get quite close to 5.0 for example. The lines could also be fit lines, with quality scatter points above and below them to give an indication of the spread of quality for general samples (blue) and for problem samples (red). The results of such a test might be relatively informative especially in terms of total space occupied by your music or total music duration per amount of storage space (e.g. expressed in hours per Gigabyte or Gigabytes per 10 hours) rather than focusing on bitrate.