As SoundExpert now has a pretty stable flow of volunteer testers, it is time to update the codecs in all bitrate sections. After a short discussion at a Russian bit-torrent tracker we decided to update the 64-96-128-192-256-320 sections first. For the 64kbit/s section five codecs were chosen:
Fhg AAC (2012-06-24) - 59.2 kbit/s (fhgaacenc --vbr 2 se_ref.wav)
QAAC TVBR (v1.42) - 59.6 kbit/s (qaac --he -v56 se_ref.wav)
Nero AAC (v1.5.4.0) - 60.1 kbit/s (neroAacEnc.exe -q 0.25 -if se_ref.wav -of out.mp4)
Vorbis (Xiph 1.3.3) - 60.3 kbit/s (oggenc2.exe -q-0.3 se_ref.wav)
Opus (libopus 1.0.1) - 59.9 kbit/s (opusenc --bitrate 59 se_ref48.wav out.opus)
Conversion chain for Opus: 44.1/16 -> 48/24 (Audition CS6) -> opusenc -> foobar2000 (48/24) -> 44.1/16 (Audition CS6)
Bitrates are calculated on the basis of the nine SE test samples concatenated. Fortunately, the first two codecs produce close resulting bitrates at their corresponding discrete quality settings. The other contenders were adjusted to match.
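That "bitrate over concatenated samples" calculation can be sketched as a tiny helper (illustrative only: the function name is mine, and the file sizes and duration below are made up):

```python
def average_bitrate_kbps(encoded_sizes_bytes, total_duration_s):
    """Overall bitrate of a set of encoded test samples, measured over
    their concatenated duration, as described above."""
    total_bits = 8 * sum(encoded_sizes_bytes)
    return total_bits / total_duration_s / 1000.0

# e.g. nine encoded samples totalling 150 s of audio (made-up sizes):
sizes = [125_000] * 9
print(round(average_bitrate_kbps(sizes, 150.0), 1))  # -> 60.0
```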
Did we miss something?
BTW, qaac doesn't contain a codec: it uses AAC encoder from iTunes. So probably it's better to write:
QAAC 1.42 + iTunes 10.7
or something like this.
As SoundExpert now has a pretty stable flow of volunteer testers, it is time to update the codecs in all bitrate sections. After a short discussion at a Russian bit-torrent tracker we decided to update the 64-96-128-192-256-320 sections first.
It's probably a very daunting task to get good results going beyond 128 kbps. Have you performed successful listening tests at high bitrates in the past?
Wouldn't it be interesting to include MP3/LAME instead of three flavours of AAC?
QAAC TVBR (v1.42) - 59.6 kbit/s (qaac --he -v56 se_ref.wav)
-v is not TVBR but CVBR. -V is TVBR, and --he cannot be combined with it.
qaac TVBR settings:
Q0 - Q4 (0) = ~40 Kbps
Q5 - Q13 (9) = ~45 Kbps
Q14 - Q22 (18) = ~75 Kbps
Q23 - Q31 (27) = ~80 Kbps
Q32 - Q40 (36) = ~95 Kbps
Q41 - Q49 (45) = ~105 Kbps
Q50 - Q58 (54) = ~115 Kbps
Q59 - Q68 (63) = ~135 Kbps
Q69 - Q77 (73) = ~150 Kbps
Q78 - Q86 (82) = ~165 Kbps
Q87 - Q95 (91) = ~195 Kbps
Q96 - Q104 (100) = ~225 Kbps
Q105 - Q113 (109) = ~255 Kbps
Q114 - Q122 (118) = ~285 Kbps
Q123 - Q127 (127) = ~320 Kbps
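As a rough illustration of the quantization in the table above, the bins can be encoded as a lookup (the edges and effective values are taken directly from the table; the function name is mine):

```python
import bisect

# Lower edge of each -V bin and the effective value it maps to,
# per the table above:
EDGES = [0, 5, 14, 23, 32, 41, 50, 59, 69, 78, 87, 96, 105, 114, 123]
EFFECTIVE = [0, 9, 18, 27, 36, 45, 54, 63, 73, 82, 91, 100, 109, 118, 127]

def effective_tvbr(q):
    """Map any -V value (0-127) to the quantized step from the table."""
    return EFFECTIVE[bisect.bisect_right(EDGES, q) - 1]

print(effective_tvbr(56))  # falls in the Q50-Q58 bin -> 54
```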
BTW, qaac doesn't contain a codec: it uses AAC encoder from iTunes. So probably it's better to write:
QAAC 1.42 + iTunes 10.7
or something like this.
Even better if he adds the CoreAudioToolbox.dll version he's using (7.9.7.9, 7.9.8.1, etc.).
It's probably a very daunting task to get good results going beyond 128 kbps. Have you performed successful listening tests at high bitrates in the past?
Ratings above 5-th grade mean that the devices/technologies have some quality headroom, their artifacts are beyond threshold of human audibility. Testing files of such devices are processed additionally - sound artifacts are amplified to the extent when they could be heard by ordinary listeners.
(from soundexpert.org, emphasis mine)
It's probably a very daunting task to get good results going beyond 128 kbps. Have you performed successful listening tests at high bitrates in the past?
Ratings above 5-th grade mean that the devices/technologies have some quality headroom, their artifacts are beyond threshold of human audibility. Testing files of such devices are processed additionally - sound artifacts are amplified to the extent when they could be heard by ordinary listeners.
(from soundexpert.org, emphasis mine)
How do you algorithmically selectively amplify artifacts, other than by modifying the encoders to work worse than they normally would?
How do you algorithmically selectively amplify artifacts, other than by modifying the encoders to work worse than they normally would?
IIRC they subtract the encoded signal from the lossless input and then do some kind of processing on it and then add it back to "enhance" artifacts. The end result is that the encoded files are easier to distinguish from lossless, but OTOH I don't think anyone has ever shown that the differences correlate with actual audio quality.
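That description (as I understand it; this is not SE's actual code) can be sketched as follows. The `gain` parameter is a made-up knob:

```python
def amplify_artifacts(original, decoded, gain):
    """Sketch of the scheme described above: the difference signal
    carries the coding artifacts; scaling it and adding it back to
    the decoded audio exaggerates them. gain = 1 reconstructs the
    original; gain > 1 makes artifacts easier to hear."""
    return [d + gain * (o - d) for o, d in zip(original, decoded)]

# Toy signals, purely illustrative:
orig = [0.0, 0.5, 1.0, 0.5]
dec = [0.0, 0.45, 0.9, 0.55]
boosted = amplify_artifacts(orig, dec, 3.0)
```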
QAAC TVBR (v1.42) - 59.6 kbit/s (qaac --he -v56 se_ref.wav)
-v is not TVBR but CVBR. -V is TVBR, and --he cannot be combined with it.
Accepted, thanks. For 64kbit/s I think HE-AAC is the more appropriate setting for this codec; true VBR will be used at higher bitrates.
BTW, qaac doesn't contain a codec: it uses AAC encoder from iTunes. So probably it's better to write:
QAAC 1.42 + iTunes 10.7
or something like this.
Even better if he adds the CoreAudioToolbox.dll version he's using (7.9.7.9, 7.9.8.1, etc.).
QAAC 1.42 + iTunes 10.7.0.21 (CoreAudioToolbox.dll 7.9.7.3)
This CoreAudioToolbox looks outdated but iTunes and QuickTime are the latest on my system. Is it important to have the latest version of the file?
Wouldn't it be interesting to include MP3/LAME instead of three flavours of AAC?
I think 64 is too low for mp3: the result is too predictable and the usage scenario is too uncommon.
Sounds like a useless test to me. There is nothing to be gained by exposing otherwise inaudible artifacts. These codecs make decisions based on the fact the artifacts this test seeks to expose would be inaudible.
As for including mp3, it's still the most popular codec, and it would be useful to see the bitrate at which mp3 gains parity with the test cases.
Sounds like a useless test to me. There is nothing to be gained by exposing otherwise inaudible artifacts. These codecs make decisions based on the fact the artifacts this test seeks to expose would be inaudible.
Below 128kbit/s artifact amplification is not applied. Outputs of codecs are used as is.
As for including mp3, it's still the most popular codec, and it would be useful to see the bitrate at which mp3 gains parity with the test cases.
Preliminary decision was to start testing of mp3 from 96kbit/s, should we really begin with 64kbit/s? There are a lot of outdated mp3 codecs in this section already, btw.
Below 128kbit/s artifact amplification is not applied. Outputs of codecs are used as is.
The fact remains I question doing it above 128kbps. Amplifying inaudible artifacts to the point they become audible SERVES NO PURPOSE. You can't judge the quality of a lossy codec that way. The whole exercise will do nothing but provide misleading reference material which nimrods will use to base inaccurate claims that some codec is better or worse than another because of the INAUDIBLE artifacts this test needlessly exposes.
Each codec should be included in all test samples. If for no other reason than to illustrate how badly mp3 works at those low bit rates compared to the more modern ones.
If every codec other than mp3 becomes transparent at 128kbps then let your study confirm that so that it adds even more to the mountain of evidence that above those bitrates you can use any codec you like with no audible problems, and that mp3 sucks.
The fact remains I question doing it above 128kbps. Amplifying inaudible artifacts to the point they become audible SERVES NO PURPOSE. You can't judge the quality of a lossy codec that way. The whole exercise will do nothing but provide misleading reference material which nimrods will use to base inaccurate claims that some codec is better or worse than another because of the INAUDIBLE artifacts this test needlessly exposes.
Please not here, there is a more appropriate place for the discussion - http://www.hydrogenaudio.org/forums/index.php?showtopic=85182&st=0
Each codec should be included in all test samples. If for no other reason than to illustrate how badly mp3 works at those low bit rates compared to the more modern ones.
I'm not sure there is a need to prove shortcomings of mp3 at low bitrates over and over again.
The fact remains I question doing it above 128kbps. Amplifying inaudible artifacts to the point they become audible SERVES NO PURPOSE. You can't judge the quality of a lossy codec that way. The whole exercise will do nothing but provide misleading reference material which nimrods will use to base inaccurate claims that some codec is better or worse than another because of the INAUDIBLE artifacts this test needlessly exposes.
Please not here, there is a more appropriate place for the discussion - http://www.hydrogenaudio.org/forums/index.php?showtopic=85182&st=0
While the concerns with the basic premise have not been addressed as far as I can tell, this is the thread you asked for input and comments in, so let's stick with this one instead of reviving the old one.
Each codec should be included in all test samples. If for no other reason than to illustrate how badly mp3 works at those low bit rates compared to the more modern ones.
I'm not sure there is a need to prove shortcomings of mp3 at low bitrates over and over again.
I'm on the other hand not sure if there is any need to prove "shortcomings" of lossy encoders by trying to inflate certain, previously inaudible, artifacts in a listening test. How do you make sure this method doesn't artificially bias towards certain encoders/artifacts? If you want to prove that lossy encodes differ from the original, you're done now, since they obviously do and have to. Another useful metric to me is the binary issue of transparency. Either the (unaltered!) encoder result is transparent or it isn't. In the real world you'll never have weird mixes where you superimpose difference signals onto the encoded signal. This method is completely artificial with no real world application or meaning. Another thing done regularly here are the ABC tests, but those are mainly useful to grade encoders on results with obvious audible flaws, to decide which encoder produces the less annoying results.
I'm on the other hand not sure if there is any need to prove "shortcomings" of lossy encoders by trying to inflate certain, previously inaudible, artifacts in a listening test.
At 64kbit/s there is no need for artifact amplification for sure. Above 128kbit/s, meaningful results of ABX testing become more and more expensive (but still meaningful). SoundExpert proposes a methodology that makes those tests less expensive. SE quality ratings of devices with small impairments could be considered as results of specially simplified listening tests. Results are experimental, which is clearly stated on the site.
What I would like to see one day is more CVBR tests. All the previous listening test I have seen were more concerned about offline storage than streaming.
How do you make sure this method doesn't artificially bias towards certain encoders/artifacts? If you want to prove that lossy encodes differ from the original, you're done now, since they obviously do and have to. Another useful metric to me is the binary issue of transparency. Either the (unaltered!) encoder result is transparent or it isn't. In the real world you'll never have weird mixes where you superimpose difference signals onto the encoded signal. This method is completely artificial with no real world application or meaning.
To add to this point, I notice that the newer version of the site no longer ranks SBR codecs above non-SBR codecs, presumably due to some adjustment of the 'enhancement' process to give less obviously incorrect results?
What I would like to see one day is more CVBR tests. All the previous listening test I have seen were more concerned about offline storage than streaming.
All AAC contenders for this 64kbit/s testing are in CVBR mode.
How do you make sure this method doesn't artificially bias towards certain encoders/artifacts? If you want to prove that lossy encodes differ from the original, you're done now, since they obviously do and have to. Another useful metric to me is the binary issue of transparency. Either the (unaltered!) encoder result is transparent or it isn't. In the real world you'll never have weird mixes where you superimpose difference signals onto the encoded signal. This method is completely artificial with no real world application or meaning.
To add to this point, I notice that the newer version of the site no longer ranks SBR codecs above non-SBR codecs, presumably due to some adjustment of the 'enhancement' process to give less obviously incorrect results?
The only adjustment that was brought into operation last year is post-screening of incoming grades. The reason for the instability of high bit-rate ratings (320+) is an insufficient number of testing points, and the problem still needs some research. Another SBR codec in the 192 section never showed higher results.
At 64kbit/s there is no need for artifact amplification for sure. Above 128kbit/s, meaningful results of ABX testing become more and more expensive (but still meaningful). SoundExpert proposes a methodology that makes those tests less expensive. SE quality ratings of devices with small impairments could be considered as results of specially simplified listening tests. Results are experimental, which is clearly stated on the site.
If meaningful results of ABX tests above 128kbps become more and more expensive it's because the codecs are doing their jobs and producing audibly transparent output. At a point where normal ABX results become statistically insignificant then transparency has been reached and we're done. Artificially altering encoder output to highlight normally inaudible artifacts of the encoding process and then trying to assign a quality to a codec based on those artificially accentuated normally inaudible artifacts is a USELESS process. It has no application in the real world, it means nothing, and the results obtained from such "tests" are useless noise and best ignored.
You might as well subtract the lossy output from the original, post the spectrograms, and start running around screaming the sky is falling and vinyl is better than digital.
In short, you need to understand what lossy audio/video encoding tries to achieve. It aims to produce audibly/visibly artifact-free files, and not generally artifact-free files. That is what lossless compression is for.
All AAC contenders for this 64kbit/s testing are in CVBR mode.
Can we also use Vorbis and Opus at CVBR rates?
Let's say we have codecs X and Y.
X we do at VBR. With some difficult songs X jumps to 82Kbps for some sections. That is okay for offline use, because the average might be around 67Kbit/s across all the songs encoded.
Y we do at CVBR. It stays at +/- 64Kbps.
If codec X won the listening test, people might think X is also better than Y when it comes to streaming (like internet radio stations). This might not be the case, because X simply used more bits on difficult sections.
How about, can we have a test worth performing?
If AAC, Vorbis, Opus, and MP3 are all statistically transparent at a given nominal bit rate then they are all audibly the SAME QUALITY at that bit rate. No one codec offers any audible benefit over the others at that point. There is nothing to gain in claiming to judge codec quality by adding distortion to their output and pretending it somehow matters in the real world.
I'm honestly not sure why this thread hasn't been locked/removed. It smells of snake oil and pixie dust, or at the least is ill-conceived. These "tests" of adulterated codec outputs offer us no relevant results on which to base any kind of rational discussion or decisions, other than how NOT to conduct a codec quality test. It can only serve to spread disinformation and ignorance. IMO it has no place on HA.
This argument is nothing new here (I too am skeptical about the relevance of SE tests). So long as TOS #8 or any other rule isn't being violated, the discussion can stand.
I was tempted to ask people to refrain from this line of conversation, but Serge did solicit criticism. I also think those who aren't familiar with SE should be aware that results from SE are not exactly in keeping with the spirit of this forum.
At 64kbit/s there is no need for artifact amplification for sure. Above 128kbit/s, meaningful results of ABX testing become more and more expensive (but still meaningful). SoundExpert proposes a methodology that makes those tests less expensive. SE quality ratings of devices with small impairments could be considered as results of specially simplified listening tests. Results are experimental, which is clearly stated on the site.
If meaningful results of ABX tests above 128kbps become more and more expensive it's because the codecs are doing their jobs and producing audibly transparent output. At a point where normal ABX results become statistically insignificant then transparency has been reached and we're done.
The problem is that there is no such "point" in practice. Another, more seriously organized listening test moves the point of transparency to higher bitrates. Codecs at 256 are not the same even if your "normal ABX results" show that; other, super-normal ABX results will reveal the differences for sure. In other words, differences between "equally transparent" codecs can be revealed by more thoroughly prepared listening tests. Besides codecs, there is a lot of audio equipment with small impairments that requires evaluation and expensive listening tests. So the purpose of the SE testing methodology is to make such listening tests cheaper but still relevant.
How about, can we have a test worth performing?
If AAC, Vorbis, Opus, and MP3 are all statistically transparent at a given nominal bit rate then they are all audibly the SAME QUALITY at that bit rate. No one codec offers any audible benefit over the others at that point. There is nothing to gain in claiming to judge codec quality by adding distortion to their output and pretending it somehow matters in the real world.
I'm honestly not sure why this thread hasn't been locked/removed. It smells of snake oil and pixie dust, or at the least is ill-conceived. These "tests" of adulterated codec outputs offer us no relevant results on which to base any kind of rational discussion or decisions, other than how NOT to conduct a codec quality test. It can only serve to spread disinformation and ignorance. IMO it has no place on HA.
Probably I should repeat once again - from 32kbps up to 128kbps, SE listening tests are performed without artifact amplification. This topic is devoted to codecs and settings for the 64kbps listening test.
All AAC contenders for this 64kbit/s testing are in CVBR mode.
Can we also use Vorbis and Opus at CVBR rates?
Both Opus and Vorbis have a managed-bitrate mode. But you are right - the codecs and settings already chosen are storage oriented. I suppose if the goal were testing codecs for streaming, the contenders and their settings would be different.
Probably I should repeat once again - from 32kbps up to 128kbps, SE listening tests are performed without artifact amplification. This topic is devoted to codecs and settings for the 64kbps listening test.
Which is even worse since you're mixing potentially valid and useful results in with utter trash results that deliver no practical information. The fact that some results in the test may be valid doesn't correct the fact that the overall tests are tainted with intentionally corrupted and useless results. The fact that you've chosen to intentionally distort the resulting waveforms at the higher bitrates brings into question the testing methodology of all of the results.
I'll repeat, the testing you intend to perform has no place being used as a reference for anything other than how NOT to perform a codec listening test. I would personally question the morality of even labelling your results with the names of the codecs since your tested output has NOTHING to do with nominal operation of those codecs.
The following codecs were added to 64kbit/s section:
AAC+ VBR@59.2 (Winamp 5.63) - CVBR, HE-AAC
AAC Encoder v1.04 (Fraunhofer IIS) from Winamp 5.63: variable Bitrate, preset: 2
AAC+ VBR@59.6 (QTime 7.7.2) - CVBR, HE-AAC
QuickTime (7.7.2) AAC Encoder via qaac 1.45 (CoreAudioToolbox 7.9.8.1): qaac --he -v56 ref.wav
AAC+ VBR@60.1 (NeroRef 1540) - CVBR, HE-AAC
Nero AAC Encoder 1.5.4.0 (build 2010-02-18): neroAacEnc.exe -q 0.25 -if ref.wav -of out.mp4
Vorbis VBR@60.3 (Xiph 1.3.3)
OggEnc v2.87 (libVorbis 1.3.3): oggenc2 -q0.3 ref.wav
Opus VBR@59.9 (libopus 1.0.1)
opusenc --bitrate 59 ref48.wav (44.1/16 -> 48/24 by Audition CS6)
Thanks! How does the reliability rating work? How many listeners are needed for, say, 5 percent? And why is Opus graded worse than mp3?
Chris
That must be one good MP3 encoder
Who knew MP3 64Kbit/s CBR at 22050Hz Stereo would sound better than Opus 59.9Kbit/s VBR at 48000Hz Stereo...
How does the reliability rating work?
Each time a device under test receives a grade, its rating is recalculated. A sequence of such ratings tends to some final value. Reliability is ((Max - Min) / Last value) * 100% over the last N values. Currently N = the number of test files for a device. For low bitrates, where artifact amplification is not applied, the number of test files equals 18 (9 samples × 2).
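A minimal sketch of that formula (the function name and the rating history below are hypothetical):

```python
def reliability(ratings, n):
    """Reliability as defined above: ((max - min) / last) * 100%
    over the last n recalculated rating values."""
    window = ratings[-n:]
    return (max(window) - min(window)) / window[-1] * 100.0

# Made-up sequence of recalculated ratings for one device:
history = [4.00, 4.30, 4.20, 4.25, 4.21, 4.22]
print(round(reliability(history, 4), 2))  # -> 1.18
```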
How many listeners are needed for, say, 5 percent?
Usually 5-7 grades are necessary for each test file in order to achieve 5% reliability of rating. Due to the above-mentioned nature of the parameter, there is no strict relationship between the accuracy of ratings and the number of returned grades.
And why is Opus graded worse than mp3?
At the moment Opus received 9 grades only and its rating is completely unreliable. I will check now why it shows 4% …
Usually 5-7 grades are necessary for each test file in order to achieve 5% reliability of rating. […] At the moment Opus received 9 grades only and its rating is completely unreliable. I will check now why it shows 4% …
I don't understand
Usually 5-7 grades are necessary for each test file in order to achieve 5% reliability of rating. […] At the moment Opus received 9 grades only and its rating is completely unreliable. I will check now why it shows 4% …
I don't understand
5-7 grades for each test file, and the number of test files is 18. Opus received 9 grades in total, i.e. not all test files have received even 1 grade.
It turned out that the reliability of ratings in the 64kbit/s section was calculated according to the old parameter N=9, which was in use before 2006. The last codec was added to the section in 2008 and since then the section was a bit neglected; probably because of that the old N remained unnoticed. Fixed now. As all other codecs in the section have an outdated reliability parameter, we decided to put all of them on rotation for several days at least. Codecs with confirmed reliability (<5%) will be returned to on-hold status. Testing of devices in other sections has been suspended as well, so for now only codecs from the 64kbit/s section are tested. Test files of the newly added codecs will be given out more frequently because the old test files have more grades.
One thing I can't comprehend is why each of the files is encoded at a different bit rate. Why on earth are they not all encoded to 64kbps? It is the nature of VBR to produce files which may have bit rates higher or lower than the target.
One thing I can't comprehend is why each of the files is encoded at a different bit rate. Why on earth are they not all encoded to 64kbps? It is the nature of VBR to produce files which may have bit rates higher or lower than the target.
Agree, I wanted to say this yesterday; he should use 64 on all of them and then see what each codec does to optimize the bitrate.
Conversion chain for Opus: 44.1/16 -> 48/24 (Audition CS6) -> opusenc -> foobar2000 (48/24) -> 44.1/16 (Audition CS6)
Why on earth would you do that? You should just use opusenc and opusdec. opusenc will happily accept your 44.1/16 input and running the result through opusdec will by default give you 44.1/16 output.
opusenc and opusdec will handle any resampling internally and transparently, even making sure that opusdec's output has the exact same number of samples as the input to opusenc had. Using a convoluted Audition/foobar/Audition toolchain is more likely to introduce extraneous problems.
Also, I agree with other posters that your bitrate setting selections for these codecs seems rather odd. In concluding that these are the rates that "really" give an average 64kbps output, have you really tried this with a large and diverse collection, or are you jumping to conclusions based on a few choice files? If the latter, you may well be unjustly penalizing codecs that have better VBR rate control.
Bitrate issue: bitrates are calculated on the basis of the nine SE samples concatenated with each other; so all 5 added codecs produce almost the same bitrate for this bunch of test samples. The approach has been used from the beginning of SE and has its pros and cons. Calculating target bitrates on a large sound collection also has its drawback: for classical, rock, minimal, etc. music collections the bitrates will differ due to the different complexity of the music styles. If you mix them you'll get some arbitrary averages.
Resampling issue: I couldn't find info about the internal Opus resampler and its quality except that "The opus-tools package source code contains a small, high quality, high performance, BSD licensed resampler which can be used where resampling is required". On the other hand, Audition CS6 has one of the best (http://src.infinitewave.ca/). Another reason for using an external resampler is that OpusDec can't produce 24/32-bit output, which is necessary for the SE utility that generates test files. To be accurate, the resulting conversion chain was 44.1/16 -> 48/24 (Audition CS6) -> opusenc -> foobar2000 (48/24) -> 44.1/32 (Audition CS6) -> test file production.
I am also generally suspicious of the other results on the website. How is it that files at 320kbps have a higher rating than at 256kbps, when, for certain encoders, 256kbps is already well above the transparency threshold?
It would seem there are 10 kinds of people in the world: those who understand perceptual transparency and those who don't. A lossy encoding either achieves it or doesn't. To say that 320 is superior to 256 when both are transparent is utter rubbish.
Detailed results of this listening test are available. (http://soundexpert.org/news/-/blogs/opus-aac-and-vorbis-in-64-kbit-s-section#results)
(http://i52.fastpic.ru/big/2013/0412/fd/e2185e43c15daaff324c6f1da5df99fd.png)
I too am skeptical about the relevance of SE tests
Looking over this discussion I noticed:
Below 128kbit/s artifact amplification is not applied. Outputs of codecs are used as is.
With this in mind, I think this is a worthwhile test for our members.
Thank you for your hard work, Serge.
From this I can gather that Vorbis should give about the same quality as Opus at 64Kbps?
Thanks, greynol.
I still have some questions concerning stat. analysis of the results. I started new thread (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=100394&view=findpost&p=831043) to clear them up.
Below are results using slightly different statistical analysis of the same collected grades. Changes:
- The resulting mean of each codec (the black ones) is an average of its nine sample means (previously it was an average of all grades submitted for the codec). Its bootstrapped confidence intervals are also computed using these nine sample means and therefore show the consistency of codec performance across different types of audio material.
- All bootstrapped confidence intervals of means are computed using the basic percentile method, which is simpler and clearer (previously the bias-corrected and accelerated percentile method was used)
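For reference, the basic percentile bootstrap mentioned in the second point can be sketched like this (the function, its parameters, and the sample means below are illustrative, not SE's actual code):

```python
import random

def percentile_ci(values, n_boot=2000, alpha=0.05, seed=1):
    """Basic percentile bootstrap CI for the mean: resample with
    replacement, take each resample's mean, and read the interval
    straight off the empirical percentiles."""
    rng = random.Random(seed)
    k = len(values)
    means = sorted(sum(rng.choices(values, k=k)) / k for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical per-sample means for one codec (nine SE samples):
sample_means = [3.1, 3.4, 2.9, 3.0, 3.2, 3.3, 3.5, 2.8, 3.6]
lo, hi = percentile_ci(sample_means)
```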
(http://soundexpert.org/documents/10179/12683/se_ListenTest@64_11-2012.png)
Some reasoning which backs the changes in analysis is here - http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=100525&view=findpost&p=850741
Following this very painful but insightful discussion (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853777) it became clear that the above calculation of overall confidence intervals (the wide ones) using sample means is not correct. Any arbitrary set of samples (especially a small one) chosen for a listening test is representative of some unknown/undefined general population of music. Consequently, the confidence intervals calculated for this unknown population have little or no meaning. Results of such a test can't be generalized beyond this set of samples. Losing that generalization allows us to discard separate samples from the analysis of overall means and consider all grades of all samples as a single/indivisible entity. Confidence intervals of overall means calculated using grades turn out to be small. Such an increase in test power is the reward for losing the generalization of the test.
Taking all this into account, SE returns to the initial/standard calculation of overall confidence intervals. The corrected version is below.
(http://soundexpert.org/documents/10179/12683/se-test@64_2012-11.png)