HydrogenAudio

Hydrogenaudio Forum => Listening Tests => Topic started by: Jplus on 2013-02-08 19:15:58

Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-08 19:15:58
My primary motivation for performing this listening test was to find the lowest QT AAC TVBR setting that is fully transparent for me, because I want to use that for my music collection. Secondary motivation was to find out how the other encoders compare to QT AAC at high quality settings.

This is my first rigorous listening test, and a rather extensive one, so I wanted to share the results with the audio community. I hope others may learn as much from this experiment as I did!

Results in a nutshell (for the impatient)
QT AAC was judged fully transparent at q91 and close to transparent at q82. The sample in which I heard a faint difference between these presets had a bitrate of only 128kbps at q82 and 159kbps at q91, so taking that into consideration together with the expected bitrates at q82, in CBR mode I would assume files at 190kbps and up to be reasonably safe for my ears.
AoTuV (Vorbis) was judged very close to transparent at q5 and q6 and fully transparent at q7. If I were to use Vorbis for my music collection I would pick q6 because I think the tradeoff between file size and perceived sound quality is better at that preset than at q7. I would trust CBR files of 200kbps or greater.
Opus was judged fully transparent at VBR with target bitrate 224kbps, which is considerably higher than I expected based on previous reports. At preset 192 I judged it untransparent, so there's no grey area like in AAC or Vorbis. Opus VBR seems to be a lot less variable than the other codecs, so in CBR mode I would trust Opus files of 230kbps and up.
LAME (MP3) was judged very close to transparent at V1 and V0 and fully transparent at c320. I would pick V0 if I were to use LAME for my music collection. In CBR mode I would trust files of 260kbps or greater.

Hardware


Software


Encoder details
QT AAC: my installation of Mac OS X included CoreAudio 3.2.6, QuickTime 7.6.6 and QuickTimeX 10.0. I used TVBR mode and overall encoder quality "max".
AoTuV: XLD included release 1. Apart from the target quality setting no options were shown.
Opus: XLD included libopus 1.0.2. I used VBR mode and framesize 20ms. opus-tools 0.1.6 also uses libopus 1.0.2.
LAME: XLD included version 3.99.5. I used VBR mode with -q2 and the new VBR method.

Ambient conditions
Test setup was in an apartment with reasonably good sound isolation, in a moderately quiet environment with singing birds and low traffic. During ABX trials I kept the room door and the ventilation window closed. Computer fans were turned down. Under those conditions while wearing the headphones, most of the time the only sound I heard was the low humming of the external hard drive that carried the samples. Usually I became unaware of that sound when actively listening to a sample.

Samples
I selected 15 samples from the LAME Quality and Listening Test Information (http://lame.sourceforge.net/quality.php) page. In 8 of those samples I didn't hear a difference in any of the encodings I tested. The remaining 7 samples are numbered below. In addition I included a 10-second fragment from Central Industrial by The Future Sound of London, which I had previously found to contain obvious artifacts when encoded with QT AAC q63:

Henceforth I'll refer to these samples by their numbers. See the appendix for detailed discussion of each sample.

General test procedure
As a general preparation I transcoded the WavPack samples to ALAC in order to make them playable in ABXTester. I always used the lossless original as sample A and the lossy compressed file as sample B. I took regular breaks in order to prevent fatigue. The measurements were spread over multiple sessions with almost a week between the first and the last session.

For each codec, I would first encode all samples at the middle preset, i.e. q63 for QT AAC, V5 for LAME, q4 for aoTuV and 96kbps for Opus. Then for each sample I would conduct ABX testing and conclude one of the following levels of quality:

By default I set the audio volume to 5 notches out of 16. I tended to turn it up to 6 notches if I didn't immediately hear a difference in all samples except for #1, which I experienced as very loud already. Occasionally I would try the sample with the channels reversed (by reversing my headphones) in order to test if something new might come to my attention.
After testing all samples at the middle preset I would proceed to higher presets with the samples in which I heard any difference, until I found the minimal preset at which I heard no difference or until I couldn't go higher. A preset was judged "fully transparent" if I heard no difference in any sample, "very close to transparent" if I heard a marginal difference in at most one sample, and "untransparent" otherwise. I decided to assign QT AAC q82 an intermediate category "close to transparent" because I heard a clear but very faint difference in one sample. More on that below. The overall search path from preset to preset was generally "jumpy", like a binary search.
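That jumpy search can be sketched as a plain bisection over the ordered preset list, under the simplifying assumption that audibility is monotone in the preset (a difference that disappears at some quality level stays gone above it). This is only an illustration; `lowest_transparent_preset` and the `audible_difference` oracle are made-up names, with a real ABX session standing behind the oracle in practice:

```python
def lowest_transparent_preset(presets, audible_difference):
    """Binary search over an ascending list of quality presets.

    `audible_difference(preset)` is a hypothetical oracle standing in for
    an ABX session at that preset.  Returns the lowest preset judged
    transparent, or None if even the highest preset sounds different.
    Assumes audibility is monotone in the preset.
    """
    lo, hi = 0, len(presets) - 1
    result = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if audible_difference(presets[mid]):
            lo = mid + 1           # difference heard: must go higher
        else:
            result = presets[mid]  # transparent: try to go lower
            hi = mid - 1
    return result

# Toy run: pretend everything below q82 is audibly different.
qt_presets = [27, 36, 45, 54, 63, 73, 82, 91, 100]
print(lowest_transparent_preset(qt_presets, lambda q: q < 82))  # prints 82
```

In practice each oracle call is expensive (a full ABX test), which is exactly why a bisection-like path over presets saves so much listening time compared to walking up one preset at a time.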

I executed the above procedure first for QT AAC, then for LAME, then aoTuV and finally Opus. During the course of the experiment I noticed I had become better at detecting artifacts, so in the end I returned to QT AAC to verify my end results for that encoder.

QT AAC
Observed bitrate range: varies wildly around the official expected value. For example, at q63 (135kbps expected) some samples had an average bitrate of 80kbps while others went over 190kbps.
Observed artifacts: even at medium bitrates (q27) most artifacts were slight changes in timbre or texture rather than very obtrusive stand-alone sounds. The exception is sample 8, which acquired some obvious, very sharp "ticks" after encoding, audible up to q82 at an average file bitrate of 128kbps.

Stage 1: all samples at q63.
I heard no differences except for a clear difference in sample 8. I decided to ignore that for the moment and to continue my search downwards first.

Stage 2: samples 1-7 at q27.
I heard clear differences in samples 1, 2, 6, 7.

Stage 3: samples 1, 2, 6, 7 at q45.
I heard clear differences in samples 2, 6 but no difference in samples 1, 7.

Stage 4: samples 2, 6 at q54.
I heard no differences anymore and decided q54 to be fully transparent if disregarding sample 8.

Stage 5: sample 8 at q100.
No difference.

Stage 6: sample 8 at q82.
No difference.

Stage 7: sample 8 at q73.
Clear difference; I chose q82 as my search result for the time being.

Stage 8: samples 1, 2, 6, 7, 8 at q82 (verification after finishing the other encoders).
I did hear a clear difference in sample 8 after all, though I had to listen to A and B a few times before I noticed it. I heard no difference in the other samples.
Note: I have not reviewed stages 1-4. With my trained ears I might actually hear some additional differences at q54 or even q63 but I haven't tested.

Stage 9: sample 8 at q91.
No difference. I decided q91 to be my final search result for QT AAC.

LAME
Observed bitrate range: the spread is somewhat less than in QT AAC; generally the highest and lowest average bitrates were within 30kbps of the expected bitrate for the given quality preset.
Observed artifacts: no standalone "objects", but changes in timbre or texture could be very un-subtle.

Stage 1: all samples at V5.
I heard clear differences in samples 1, 4, 6, marginal difference in sample 7 and no difference in samples 2, 3, 5, 8.

Stage 2: samples 1, 4, 6, 7 at V3.
Clear differences in samples 1, 6.

Stage 3: samples 1, 6 at V1.
Marginal difference in sample 1. I decided V1 to be my search result for the time being.

Stage 4: sample 1 at V2 (checking for consistency with aoTuV after finishing Opus).
Clear difference. I chose V0 as my final search result instead.

Stage 5: sample 1 at V0 (for completeness, shortly before starting this report).
Marginal difference (yes really, I believe I heard a difference and I identified 18 out of 25 Xs correctly: 72%, p=0.014).

Stage 6: sample 1 at c320.
No difference (at first I thought I heard a difference but ABX testing showed I didn't).

AoTuV
Observed bitrate range: average file bitrate is usually greater than the official target bitrate for the given quality preset. For example, the average bitrates at q4 were all greater than 128kbps. Upwards spread from the target bitrate seemed to be similar to QT AAC.
Observed artifacts: few and subtle. The marginal difference in sample 3 that I consistently heard up to q6 was an attenuation effect, the high frequency components were slightly softened.

Stage 1: all samples at q4.
Clear difference in sample 1, marginal difference in sample 3 and no difference in the other samples.

Stage 2: samples 1, 3 at q6.
Marginal difference in sample 3, no difference in sample 1.

Stage 3: sample 1 at q5.
Marginal difference. I decided q6 to be my search result.

Stage 4: sample 1 at q7 (for completeness, shortly before starting this report).
No difference.

Opus
Observed bitrate range: average bitrates were always very close to the target bitrate, with a spread of less than 10kbps in each direction. I would compare Opus VBR to QT AAC ABR.
Observed artifacts: texture changes, some of them very severe, including "rattling" and "grinding" sounds. Usually the timbre became more "metallic".

Stage 1: all samples at target 96kbps.
Clear differences in samples 1, 2, 4, 5, 6, 7, no difference in samples 3, 8.

Stage 2: samples 1, 2, 4, 5, 6, 7 at target 192kbps.
Clear differences in samples 4, 6, no difference in samples 1, 2, 5, 7.

Stage 3: samples 4, 6 at target 256kbps.
No differences.

Stage 4: samples 4, 6 at target 224kbps.
No differences. I chose 224kbps to be my search result.

Conclusions and recommendations
QT AAC and aoTuV are the clear winners in this comparison, with QT AAC achieving full transparency at the best compression ratio. I was a bit surprised to find that the highest quality preset is not overkill (for my ears) in LAME. Opus doesn't seem to perform exceptionally well (though better than LAME) at high bitrates although it's known (http://listening-tests.hydrogenaudio.org/igorc/results.html) to beat QT HE-AAC (more or less) at 64kbps. This is probably in part explained by the fact that Opus is still very young. Another explanation is that Opus might be more intended for low bitrates, which is somewhat suggested by the way it's described on the Opus home page (http://opus-codec.org/).

According to the Hydrogenaudio wiki, most people find AAC to be transparent at about 150kbps, Vorbis at about 150-170kbps and LAME at about 160-224kbps. Given the results of this experiment, my ears might be slightly better than average.

If you wish to repeat this experiment, you might be able to save a lot of time by using my results as a hint where to find the most significant differences. The sample details in the appendix may help you to "look" in the right direction. In addition, you can probably start your searches for Opus and LAME at higher presets than I did.

If you just want to use this report as a hint for choosing your ideal encoder setting, I suggest that you perform a miniature version of my experiment using just a single sample in the encoder that you're interested in. If you hear a difference go up one preset until you don't, otherwise do the opposite by going down. Specifically:
For QT AAC, I would recommend listening to sample 8 and starting at q73. If you descend below q54 I recommend listening to samples 2, 6 instead.
For aoTuV, I would recommend listening to sample 3 and starting at q5. If you don't hear any difference switch to sample 1 at q4.
For Opus, you could take sample 4 at target 160kbps.
For LAME, I recommend listening to sample 1 starting at V3.

Appendix: sample details
Sample 1
Loud applause, with a "thank you" yelled through a microphone shortly after the start. The "thank you" is loud but sounds a bit muffled because of the microphone and there's a faint echo to it.
In the lossless original the applause sounds "wet"; you could compare it to rain or perhaps to oil spattering in a hot pan. In audibly different encodings it may sound drier, noisier and coarser, perhaps like sandblasting, or very coarse and metallic (in Opus at 96kbps target bitrate).
The "thank you" should be a separate sound layered on top of the applause, and should sound fairly smooth. In audibly different encodings you may expect it to interact with the applause in several ways:


Sample 2
Some sawtooth-like signal with an additional trill effect that seems to contain vowels. I'm not sure whether this is a heavily filtered human voice or just something creative from a synthesizer, but either way it sounds quite interesting.
At medium bitrates in QT AAC and Opus it sounded distorted and metallic.

Sample 3
Symphonic fragment with drums, trumpets, violins, vocals and some high-pitched snare instrument which I think might be a steel guitar. There's also some high tingling in the right channel which I suspect is an artifact in the original file coming from the snare instrument. Sounds like a soundtrack to an epic 1960s movie.
In aoTuV you may find that the snare instrument (the proper sound slightly to the left, not the tingling in the right channel) is arpeggiated less sharply and sounds softer overall; I would call it a bit "timid" compared to the original.

Sample 4
Bagpipe playing a slow high-pitched melody over a constant bass. The sound is smooth overall although you'll find some irregularity especially in the second long-lasting high note. In the background there's the occasional hollow, raspy, low-pitched sound which might be either the bag being inflated by the artist or (a suggestion of) wind.
Focus on the long-lasting high-pitched notes, especially the very last one. In case of an audible difference you'll find that they sound metallic and/or less smooth, or even outright distorted (Opus at 96kbps target bitrate).

Sample 5
Drums (something that sounds similar to a conga or a djembe) playing a samba-like rhythm. At the start an alto voice sings "aaaa", which is a bit of a shame because the voice will not help you to distinguish the encoded sample from the original and it partially masks the drums.
In Opus at 96kbps target bitrate the high-pitched slap beats sound more metallic than in the original.

Sample 6
Western guitar playing a country tune.
At lesser bitrates you might recognise the encoded sample directly because it sounds metallic and perhaps even a bit distorted. At high bitrates you might be able to make out the difference if you focus on the initial arpeggio and the final note. The last note of the initial arpeggio (which lasts longer than the previous notes) might sound a bit rougher than in the original. The final note might sound metallic. The latter difference is probably easier to hear than the former. You probably won't find a difference in the chords.

Sample 7
Monotone (synthetic) drum rhythm with bass: a big tom beating every second bass beat, an open-closing hi-hat in the right channel alternating with the bass beat, and another closed hi-hat in the left channel beating four times for every bass beat.
You'll only hear a difference at the lesser quality settings, and you are most likely to find it in the closed hi-hat in the left channel.

Sample 8
Synthesizer music of fairly low complexity.
Frankly, the sounds aren't really important, because the main reason to listen to this fragment is the sharp ticks that are introduced by QT AAC. I don't think I need to tell you where they are because you're pretty much guaranteed to hear them at q63 and below.
Since this sample isn't available from the LAME Quality and Listening Test Information page, I made it available for download over here: https://dl.dropbox.com/u/3512486/central%20industrial.m4a (https://dl.dropbox.com/u/3512486/central%20industrial.m4a)
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: eahm on 2013-02-08 19:26:59
I believe you but I am skeptical; these bitrates are way too high. I used to test a lot and train my ears to hear artifacts, but I gave up since I couldn't believe I found Apple's AAC to be transparent to me at ~100kbps. I now use -V73 just to cover a wider range of music and I still think it's too much; I was happy with -V63 as well.

I'd like to see your ABX logs at "even" ~96/~128/~160kbps.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: DonP on 2013-02-08 20:43:07
What versions of these encoders did you use? For Opus I hear a lot of improvement with 1.1a over the previous 0.1.5 and can see more variability in the variable bit rate.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-08 20:53:21
@eahm:
If by "logs" you mean the kind of logs you often see at this forum (I presume those are produced by foobar2000) I'll have to disappoint you, because ABXTester doesn't produce anything like that. Apart from that, I'll happily do some additional tests for you.

What do you mean by "even"?

@DonP:
I listed the versions of all encoders clearly in my post. See the sections "Software" and "Encoder details" close to the top.
(Edit: I checked the Opus website and libopus 1.0.2, which I used, seems to be the latest version.)
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: eahm on 2013-02-08 21:05:41
Jplus, yes I meant these logs. Until I see a proper ABX test that tells me you really hear a quality difference between lossless and 192 AAC I have to remain skeptical; AAC is so good at low bitrates.

Even = low. It was more for the ~96.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: DonP on 2013-02-08 21:35:09
@DonP:
I listed the versions of all encoders clearly in my post. See the sections "Software" and "Encoder details" close to the top.
(Edit: I checked the Opus website and libopus 1.0.2, which I used, seems to be the latest version.)

OK... for aoTuV you listed version 1. Rarewares shows the current version as 6.03.

I grabbed the pipes sample, and have been able to ABX Opus 1.1a up to target bitrate 128 (so far). The three rates I tried (70, 100, 128) all encoded at 40-50% over the target rate, with foobar showing 180 kb/s for the section I was using for the 128 kb ABX. So the encoder does seem to recognize this sample as hard.

edit: description of opus 1.1a vs 1.0x: http://jmspeex.livejournal.com/11737.html (http://jmspeex.livejournal.com/11737.html)

foo_abx 1.3.4 report
foobar2000 v1.2
2013/02/08 16:03:27

File A: D:\rips\abxstuff\pipes\pipes.wv
File B: D:\rips\abxstuff\pipes\_\_\track 128kb.opus

16:03:27 : Test started.
16:04:03 : 01/01  50.0%
16:04:20 : 02/02  25.0%
16:04:50 : 02/03  50.0%
16:05:35 : 03/04  31.3%
16:05:59 : 04/05  18.8%
16:06:19 : 05/06  10.9%
16:06:40 : 06/07  6.3%
16:07:01 : Test finished.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-08 22:42:46
@eahm:
I'm sorry for quote-sniping you, but I see several things in your post which I think need to be addressed before I can do any additional testing:

Jplus, yes I meant these logs. Until I see a proper ABX test that tells me you really hear quality difference

Excuse me, but what is improper about my ABX tests? The only difference from the foobar2000 logs is that you can't see trial-by-trial whether I identified the X correctly or not. I explained my definitions of "clear difference" and "marginal difference". Everywhere I said I heard a "clear difference" I got a 100% or near-100% score, with a probability of less than 0.002 that I identified them correctly by luck. For example, I might have correctly identified 18 out of 20 trials. The probability of getting that score by guessing is 0.5^20*choose(20,2) = 0.00018 (judging from an example log, foobar2000 would round that down to 0.0%). A similar but less extreme story applies to the "marginal differences".

If you think that I might have made this up then logs shouldn't change anything, because I can make those up as well.

If knowing my exact score for each individual trial is important for you, I can keep track of that during my next experiments and write it down in my own way in my next post. Would that solve your issue? Because frankly, I won't be able to run foobar2000 on my mac.

Quote
between lossless and 192 AAC

Note that from my VBR results I concluded that most files at or above 190kbps are probably transparent to me (I called it "probably safe for my ears" but that amounts to the same thing). That means that I actually don't expect to hear a difference between lossless and 192kbps AAC. If you want, I can check whether any of the QT AAC samples that I found audibly different from the lossless original had a bitrate near 192kbps.
Edit: I did this, and none of them did. The highest average bitrate was 161kbps for sample 2 at q45. If I'd hear the difference in sample 2 at q54 (which I didn't verify after my ears became more trained) then that one would come close at 186kbps.

Quote
I have to remain skeptical, AAC is soo good at low bitrates.

I completely agree with that! Note that at the start of my experiment, I heard no difference at q54 (expected bitrate 95kbps) in any of my samples except for #8, which had obvious ticks which were probably caused by QT choosing the bitrate too low. I didn't hear those ticks anymore at q91, where the average bitrate in sample 8 was still only 159kbps.

I'm not denying that QT AAC is really good even at medium bitrates (where I follow the apparent convention that 80-120kbps is medium). I'm just saying that I found a case where q82 isn't strictly transparent, so I'll have to choose q91 for my music in order to be on the safe side.

Quote
Even = low. It was more for the ~96.

So you'd like me to test at about 96kbps, 128kbps and 160kbps. I'm fine with that, but how would you want me to approach that? Use the VBR preset which has an expected bitrate near the proposed bitrate?
Why exactly would you like me to do that? Do you expect results that are somehow in conflict with my first post?


@DonP:
Alright, so libopus 1.1a is probably better and more variable than 1.0.2. That seems to confirm my suspicion that 1.0.2 didn't score very well in my experiment because it's still a very young codec. I acknowledge that I wasn't using the bleeding-edge version in my measurements and that Opus would probably have scored better if I had.
I prefer testing release versions only because you never know what rare errors an alpha encoder might have that happen not to show up in my limited set of test samples. It seems that you are concerned that Opus might look worse from my results than it deserves. Will I make you happier if I repeat my Opus measurements when 1.1 is ready for release?

As for aoTuV, XLD is probably just displaying the version number incorrectly (indeed if you search "aotuv" at the XLD homepage the last hit you'll find indicates that the default included version should be at least 4.51). My results don't seem any worse than you'd expect from aoTuV (as compared to QT AAC), so I think there's no reason for concern.

That said, XLD offers plugins for Opus 1.1a and for aoTuV 6.03b, so if more people think I really have to test those, I can without needing to jump through hoops. Please do keep in mind that what I've done here is very time-consuming though.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: DonP on 2013-02-08 23:58:11
@DonP:
Alright, so libopus 1.1a is probably better and more variable than 1.0.2. That seems to confirm my suspicion that 1.0.2 didn't score very well in my experiment because it's still a very young codec. I acknowledge that I wasn't using the bleeding-edge version in my measurements and that Opus would probably have scored better if I had.
I prefer testing release versions only because you never know what rare errors an alpha encoder might have that happen not to show up in my limited set of test samples. It seems that you are concerned that Opus might look worse from my results than it deserves. Will I make you happier if I repeat my Opus measurements when 1.1 is ready for release?


I guess my 2 points were that a) you noted that the VBR wasn't varying much, and that has been fixed (the overall average on music I've coded still seems pretty close to target), and b) someone was questioning your lack of logs, so I pointed out that with the one sample I tried, my results at an actual 180 kb/s were consistent with yours at roughly the same rate, and log supplied.

Though in general, anyone requiring stability would be nuts to count on alpha software, I've found no case where it performs worse than the "stable" release, and it's more dependable than the development builds, where folks who had problems with the production release (including me) had been directed for quite a while.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-09 01:04:51
Ahh, I guess I've been too defensive. Thanks for pointing out these useful bits of information!

(Concerning the logs: I realised I could write a little script to complement ABXTester and produce a simple log of the same kind as those produced by foobar2000. So in the future that should fix the issue for those who care very much about the textual representation of my results. Example:
Code: [Select]
test
batch  score  subtotal  p
    1    4/5      4/ 5  0.15625
    2    5/5      9/10  0.009765625
    3    5/5    14/15  0.0004577637
)
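A minimal sketch of such a script could simply accumulate per-batch scores and recompute the running p-value after each batch. This is Python with names of my own invention (`abx_rows`, `print_log`); it computes the point probability 0.5^n * choose(n, k) of the running score arising by guessing, which is what the example table above shows, and the output formatting is only approximate:

```python
from math import comb

def abx_rows(batch_scores, batch_size=5):
    """Running ABX totals from per-batch scores (ABXTester presents Xs in
    batches of five).  Each row is (batch, score, correct, total, p), where
    p = 0.5**total * C(total, correct) is the point probability of the
    running score under pure guessing."""
    rows = []
    correct = total = 0
    for i, score in enumerate(batch_scores, start=1):
        correct += score
        total += batch_size
        p = 0.5 ** total * comb(total, correct)
        rows.append((i, score, correct, total, p))
    return rows

def print_log(rows, batch_size=5):
    """Print a simple text log, one line per batch."""
    print("batch  score  subtotal  p")
    for i, score, correct, total, p in rows:
        print(f"{i:5}    {score}/{batch_size}    {correct:2}/{total:<2}  {p:.6g}")

print_log(abx_rows([4, 5, 5]))  # reproduces the 4/5, 9/10, 14/15 example
```

Entering one number per batch by hand is all the manual bookkeeping this would need, instead of the calculator-and-envelope routine.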
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: eahm on 2013-02-09 04:26:52
Jplus, don't worry about lower bitrates just for me if you hear a difference in higher ones.

I meant proper logs when I said proper ABX test. I am sure you tested, but you only say "clear difference" and "no difference" here, here and here, and who's to say you really did.

Everyone who talks about testing and transparency needs to post logs for every single test taken. Proper logs with percentages, seconds, test number etc.

I don't understand why this time, this thread is different.


Let me be clear and not rude. For example this test:
Stage 1: all samples at target 96kbps.
Clear differences in samples 1, 2, 4, 5, 6, 7, no difference in samples 3, 8.

I'd like to see the logs for every sample.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Mach-X on 2013-02-09 06:03:05
The OP presented clear and concise criteria for testing, do we really need to hammer him with the TOS 8 card? Yes I understand the importance of unbiased log results, but being a mac user he doesn't have the ability to do foobar abx testing, if the mods truly believe his results to be biased they can delete the topic. It's not like he wandered in here screaming 'wma is better than mp3 so nyah'.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: lvqcl on 2013-02-09 08:11:36
About aotuv version numbers: homepage (http://www.geocities.jp/aoyoume/aotuv/)

Quote
aoTuV Release 1 (2006/08/23)
# This is the stable version. The contents are almost the same as beta4.51.


Current version is aoTuV Beta6.03.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-09 10:20:27
@lvqcl:
Thank you, that clarifies some things. I hope you don't mind me asking for your personal opinion: do you think I should test aoTuV 6.03b?

@Mach-X:
Thank you.

Concerning my clear and concise criteria, I should say they were also very strict. Judging from foobar2000 logs that I've seen elsewhere on Hydrogenaudio, many people would already be comfortable concluding that they really hear a difference at p=0.004 (or probably even higher). According to my criteria that would only qualify as a marginal difference.

@eahm:
I'm confused. You've said you believe me and that you're sure I did the tests. At the same time you emphasize that you're skeptical and you insist on viewing logs of every individual test.

The thing is this. I ran 116 tests, give or take a few for counting errors. I've done you a great favor by compressing my results to just a single datum per test, i.e. at what level of confidence I heard a difference, if any at all (you won't find 116 results in my OP, but that's because I skipped over all samples that I never heard a difference in, and because some "no difference" judgments were implicit). If I had posted foobar2000 logs for all of those tests, or even just the 53 that are interesting, would that really help you? Would you read and verify all of them?

There's another thing. The only service that ABXTester offers is to present me with an A, a B and five Xs. I can try to identify the Xs and then ask for my score which is shown in a popup window. I can ask for new batches of Xs as often as I want but ABXTester doesn't keep track of my running total. Any logging will have to be done manually by me. Which I did in my own way, using a calculator and the back of an envelope. Between batches I would recalculate my p-value. At some point I would decide to end the test and assign the current confidence level (clear/marginal/no difference), which I logged on my envelope with a symbol (respectively star/half star/dash).

Those symbols at the back of that envelope are the only permanent log I've kept, so there's no way I'll be able to show you "proper" logs for the tests I've already conducted. I would have to repeat the tests and manually enter my score for each individual batch into my new script in order to do that. That would be several days of work if I were to do it for all of my tests.

However, I take your skepticism seriously and I can offer to repeat the three tests that you're most skeptical about. I guess that might be these:

Please let me know what you think of this offer.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Alexxander on 2013-02-09 11:20:55
Without hard numbers, only you can draw your conclusions and they're valid just for you. You're on the right track, but you're not there.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: greynol on 2013-02-09 12:30:03
I have my doubts that the p-values are being calculated the same way as fb2k.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-09 13:33:06
TLDR version: greynol is right that fb2k does something slightly different from what I do. I figured out what it calculates and will do the same from now on. So from now on my p-values will be compatible with those of everyone else at the HA forums.

I calculate the probability of a false positive, i.e. the probability that I would get the score if I were guessing randomly.

The probability of identifying a single trial correctly by luck is 0.5, and the probability of identifying it incorrectly is also 0.5. So the probability of identifying N trials all correctly (or all incorrectly) is 0.5^N.

If you identify K out of N trials correctly (meaning some but not all of them) then the probability is greater because there are multiple ways to get exactly K out of N trials correct. For example, if you get 14 out of 15 trials correct then the single incorrect trial might be the first, the second, and so on, so there are 15 ways to get that score, so I'd have to multiply 0.5^15 by 15 to get the true probability of a false positive.

The overall formula for a false positive with K out of N trials correct is 0.5^N * (number of ways to get exactly K out of N correct). This actually also applies to getting all of the trials or none of them correct, because in both cases there's only one way to do that, so you're just multiplying 0.5^N by 1. The "number of ways to get exactly K out of N correct" is known mathematically as a combination (https://en.wikipedia.org/wiki/Combinations). In computer-related contexts this is often written as choose(N, K). Therefore the more formal way to write the false positive probability formula is 0.5^N * choose(N, K). This is the binomial distribution (actually a special case of it, because the probability of success is equal to the probability of failure, but that doesn't matter now).

I checked whether foobar2000 is calculating the same thing using the first log in this topic (index.php?showtopic=98841). You are right that it isn't the same:
foobar2000 p-value at the fifth trial: 18.8%
what I would calculate: 0.5^5*choose(5, 4) = 0.15625 =~ 15.6%

I checked every fifth value, and it turns out that my p-values are consistently close to, but slightly smaller than, the foobar2000 p-values. Fortunately I know where this difference comes from: foobar2000 calculates the probability of a false positive if you correctly identify K *or more* trials out of N. See the math:
0.5^5*choose(5, 4) + 0.5^5*choose(5, 5) = 0.1875 =~ 18.8%

It also works for the other trials. For example, here's the tenth trial:
foobar2000 p-value: 5.5%
doing it manually: 0.5^10*choose(10, 8) + 0.5^10*choose(10, 9) + 0.5^10*choose(10, 10) = 0.0546875 =~ 5.5%

And the sixteenth trial:
foobar2000: 0.2%
manually: 0.5^16 * (choose(16, 14) + choose(16, 15) + choose(16, 16)) = 0.002090454 =~ 0.2%
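For anyone who wants to reproduce these numbers, here is a minimal Python sketch (the function name is mine, not from any tool) of the foobar2000-style cumulative p-value, i.e. 0.5^N times the sum of choose(N, k) for k from K up to N:

```python
from math import comb  # exact binomial coefficients, Python 3.8+

def abx_p_value(correct, total):
    """Probability of identifying `correct` or more out of `total` ABX
    trials by pure guessing: 0.5^N * sum of choose(N, k) for k = K..N."""
    return sum(comb(total, k) for k in range(correct, total + 1)) / 2 ** total

print(abx_p_value(4, 5))    # 0.1875        -> the 18.8% foobar2000 shows
print(abx_p_value(8, 10))   # 0.0546875     -> ~5.5%
print(abx_p_value(14, 16))  # ~0.00209      -> ~0.2%
```

This reproduces all three worked examples above; dropping the sum and keeping only the k = K term gives my old, slightly smaller values.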

I'll adopt the foobar2000-style p-values in my future tests in order to make my outcomes comparable with those of other people at the HA forums. It's to my own benefit as well, as the slightly larger p-values will force me to be slightly more cautious. Thank you for making me aware of this difference!
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: LordWarlock on 2013-02-09 13:53:42
Why are you all so hung up on the ABX logs? It's sufficient that he said he used ABX and got statistically relevant results. Until the ABX tools start to append some randomly generated string at the end of the log and then sign the whole thing with a sufficiently long cryptographic key (similar to what new versions of EAC do to their logs), there is nothing to prevent the creation of fabricated logs, so their value is questionable at best.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: eahm on 2013-02-09 18:55:48
LordWarlock, I was able to hear a difference between CBR 320kbps FhG and CBR 320kbps LAME in the first minute of Guns N' Roses - Don't Cry. I tested this for four straight hours, but sorry, I don't have any logs; you have to trust me. Go tell LAME and Fraunhofer there is something wrong with their encoders.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: greynol on 2013-02-09 19:16:55
Thanks for the clarification, Jplus. Even more thanks for sharing your findings and for doing so with such rigor.  It is clear to me that you have keen hearing and a strong handle on methodology.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: db1989 on 2013-02-09 19:27:01
No thanks to eahm for continuing to troll someone who clearly has good intentions and has made a huge effort to tell us about his results.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: vinnie97 on 2013-02-09 21:58:03
Even realizing that Opus was built for mobility, its apparent regression at higher bitrates (when compared to Vorbis and AAC) is still disconcerting.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: halb27 on 2013-02-09 21:58:54
A big thank you from me too, Jplus. I appreciate your findings. And I can't see any reason why not to trust your hard work.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: LordWarlock on 2013-02-09 22:36:33
LordWarlock, I was able to hear a difference between CBR 320kbps FhG and CBR 320kbps LAME in the first minute of Guns N' Roses - Don't Cry. I tested this for four straight hours, but sorry, I don't have any logs; you have to trust me. Go tell LAME and Fraunhofer there is something wrong with their encoders.
And? Even if you provided a log (or logs) supporting this statement, I still wouldn't have any reason to believe you (or not to believe you, if I took your side...). You could type it in manually, you could create one by comparing completely different sounds, or you could plain and simple go for a brute-force method and repeat the test until you get your desired result by pure luck.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: eahm on 2013-02-09 23:16:38
And? Even if you provided a log (or logs) supporting this statement, I still wouldn't have any reason to believe you (or not to believe you, if I took your side...). You could type it in manually, you could create one by comparing completely different sounds, or you could plain and simple go for a brute-force method and repeat the test until you get your desired result by pure luck.

Of course, and for the same reason I thank Jplus for putting in the effort to test that much.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: IgorC on 2013-02-10 01:04:49
Jplus, welcome to the HA forum. It's great to see your OP.

Will I make you happier if I repeat my Opus measurements when 1.1 is ready for release?

Or you can try it now.
It's a closed circle: everybody is waiting for a final release while the devs are waiting for you to try it, as if nobody is sure who should make the first step. Would you do it?

Even realizing that Opus was built for mobility, its apparent regression at higher bitrates (when compared to Vorbis and AAC) is still disconcerting.

Opus is still a very young format. Jplus tested version 1.0.2, while there is a new 1.1 alpha.
1.1 has unconstrained VBR.
There is a lot of stuff going on. Check it here: http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=86580&view=findpost&p=823712
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-10 12:36:13
Thanks for the grateful and welcoming reactions, I feel all warm and fuzzy! It was interesting to perform the measurements and a pleasure to share the results with you, and given your reactions I'd definitely do it again. With a bonus: next time I'll include (Jplus-scripted) logs for those who care about it, and provide p-values that are calculated in the same way as in foobar2000.

Speaking of the next time. IgorC, yes I do think I'll test libopus 1.1 while it's still in alpha stage. You provide a solid argument (devs need listening results in order to improve their encoder), and while I'd never use alpha software for my music collection it can't hurt to just try how it performs. To be honest I've become quite curious anyway because of the change from constrained to unconstrained VBR. I really like unconstrained VBR.

I might try aoTuV 6.03b as well, while I'm at it. I can probably save a lot of time by restricting myself to the samples that I already found a difference in.

Question on good HA forums style: should I post the additional results in this thread or in a new topic?

Stay tuned!
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: DonP on 2013-02-10 12:54:41
To be honest I've become quite curious anyway because of the change from constrained to unconstrained VBR. I really like unconstrained VBR.

I might try aoTuV 6.03b as well, while I'm at it. I can probably save a lot of time by restricting myself to the samples that I already found a difference in.


Not to give you assignments, but it would be interesting to note the actual bitrate and see how quality/transparency compares at a given target between hard and easy samples.

If all your current samples come from a set of cases considered hard tests for lame, I'd guess they lean to the high rate side for others as well.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-10 14:43:26
Writing down the (average) bitrate of each encoded file seems like a good idea to me, I'll do that.

Not all of my samples are considered hard for LAME (at least not anymore); in fact sample 8 was not taken from the LAME testing page at all. I'm sure that sample 8 is easy because it's invariably encoded at lower bitrates than expected for the given preset. I think sample 5 is easy as well, because it doesn't sound complex at all and I almost never heard a difference in it, but I'd have to check the bitrates to verify.
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: IgorC on 2013-02-11 16:37:14
Jplus,

It might be worth mentioning that ABX is adequate for a "lossless vs. lossy" comparison, while if you want to strictly compare the performance of several lossy encoders, there is ABC/HR for that purpose: ABC/HR Java (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=77573&view=findpost&p=683924). It's quite easy to use.

Some links
http://wiki.hydrogenaudio.org/index.php?title=ABC/HR (http://wiki.hydrogenaudio.org/index.php?title=ABC/HR)
http://www.rarewares.org/rja/ListeningTest.pdf (http://www.rarewares.org/rja/ListeningTest.pdf)
Title: QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings
Post by: Jplus on 2013-02-11 18:04:53
As promised: additional measurements for the latest versions of libopus and aoTuV, with precise bitrate information. Logs are in the appendix.

But first a reaction to IgorC, because their post above appeared while I was writing this report.

IgorC: that's certainly worth pointing out. Thank you! While my primary question is "at what presets do these codecs reach transparency" and my secondary question is "which codec will cost me the least amount of disk space if I encode everything transparently", I am ranking the codecs by these results, and that ranking will be misleading at lower bitrates. If I were to ask instead "which codec will give me the best results if my target bitrate is X kbps", this experiment would not provide an answer (unless X=190).

I'm interested in the latter question as well and I think I might want to perform an ABC/HR test sometime in the future in order to answer it. I'm thinking that 96kbps might be a good target. Perhaps I should try a second target as well. Judging from the documentation that you referred to, and given that I've used samples that "push me to the safe side", I'd probably need to make some changes to my selection of listening samples. I'd be most grateful if anyone is willing to provide their input on these considerations!

Now, back to the current report.


Results in a nutshell
Opus 1.1a shows a considerable improvement over version 1.0.2 in terms of the variability of bitrates. I reached full transparency at preset 192 rather than at 224. I also judged preset 160 to be very close to transparency. In terms of efficiency however the improvement seems to be less extreme; in CBR mode I would trust files with total bitrate 220kbps or greater while I would previously do so at 230kbps.
AoTuV 6.03b seems to be a much more gradual improvement over release 1, which is perhaps unsurprising since aoTuV is older and more mature than Opus. This time I judged q6 to be fully transparent rather than very close to it (but see discussion in the aoTuV section). Regardless, q6 is still my optimal setting if I would use aoTuV for my music. In the OP I stated that I would trust aoTuV files in CBR mode at 200kbps or greater, but that was based solely on expected bitrates. Now that I've paid full attention to the observed bitrates, I have to increase that estimate to a shocking 290kbps.
The new bitrate information does not affect the conclusions for QT AAC and LAME. If I were to rank the codecs for their performance at high quality settings, QT AAC ends up at a distinct first place and Opus at second place, while I'd have a hard time to decide whether to put LAME or aoTuV next. Theoretically aoTuV seems better because of the expected bitrates associated with my optimal setting in both codecs (192-224 for aoTuV q6 versus 256 for LAME V0), but since aoTuV encodes everything at equal or higher bitrates than expected while LAME seems to actually meet the target, aoTuV might not be the most efficient of the two.

Equipment, procedure, and so on
Nothing changed compared to the OP, except that I used the following additional software: libopus 1.1a and aoTuV 6.03b.
In the Opus search I restricted myself to the eight numbered samples from the OP, skipping over the other eight samples in which I've never heard any difference. In the aoTuV search I only listened to samples 1 and 3, because I didn't expect to hear a difference anywhere else.
The limits for "marginal difference" and "clear difference" are still the same (0.05 and 0.002 respectively), but that means my criteria have become stricter: my p-values are now calculated the same way as in foobar2000, which yields slightly larger numbers.
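These limits amount to a tiny classifier; a sketch of my own (the function name and the inclusive boundaries are my assumptions — the post only states the two limits):

```python
def verdict(p, marginal=0.05, clear=0.002):
    """Classify a test's final p-value using the limits above:
    p <= 0.002 is a clear difference, p <= 0.05 a marginal one."""
    if p <= clear:
        return "clear difference"
    if p <= marginal:
        return "marginal difference"
    return "no difference"

print(verdict(0.001288414))  # clear difference   (e.g. 17/20)
print(verdict(0.02138697))   # marginal difference (e.g. 21/30)
print(verdict(0.1002442))    # no difference       (e.g. 19/30)
```

The example p-values match final-batch subtotals from the logs in the appendix.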

Observed bitrates
1. Bitrates for the files encoded with libopus 1.1a and aoTuV 6.03b, together with ALAC bitrates of the original files as an indicator of entropy. A zero means "not encoded".
Official expected bitrate for Vorbis q4 (according to ogginfo) is 129kbps.
Code: [Select]
  opus.96 opus.128 opus.160 opus.192 aotuv.q4 aotuv.q5 aotuv.q6 alac
1    114      151      188      226      257      321        0 1078
2    170        0        0        0      214        0        0  673
3      95        0        0        0      139      164      198  872
4    137      172      209      239      138        0        0  817
5    112        0        0        0      137        0        0  683
6    141      180      220      255      165        0        0  835
7    111      147        0      221      161        0        0  983
8      94        0        0        0      146        0        0  811
2. Bitrates for the files encoded with QT AAC and LAME, as they were used in the OP. ALAC bitrates again included as a reference.
Expected bitrate for QT AAC q45 is 105kbps. Expected bitrate for LAME V5 is 135kbps.
Code: [Select]
  qtaac.q45 qtaac.q54 qtaac.q82 qtaac.q91 lame.V5 lame.V3 lame.V0 alac
1      102        0        0        0    168    206    296 1078
2      161      186        0        0    230      0      0  673
3      103        0        0        0    146      0      0  872
4        84        0        0        0    127      0      0  817
5        93        0        0        0    126      0      0  683
6        81        95        0        0    129    171      0  835
7      107        0        0        0    168      0      0  983
8        70        0      128      159    128      0      0  811
Some interesting observations:

Opus
1. Target 96, samples 1-8. Clear differences in 1, 4, 6, 7.
2. Target 192, samples 1, 4, 6, 7. No difference.
3. Target 128, samples 1, 4, 6, 7. Clear differences in 1, 4, marginal difference in 6.
4. Target 160, samples 1, 4, 6. Marginal difference in 4.

AoTuV
1. q4, samples 1, 3. Clear difference in 1 (at 257kbps!), marginal difference in 3.
2. q5, samples 1, 3. Marginal difference in 3.
3. q6, sample 3. No difference (but see comment).
Comment: I believe in stage 3 I heard the same difference as in stage 2, but it was too subtle for me to prove it. As you can see from the log, I came quite close to the marginal difference limit before the last batch. So strictly speaking, aoTuV q6 might be "extremely close to transparent" instead of "fully transparent".

Appendix: logs
Often there's no log of the tests in which I heard no difference, because I didn't even try to identify the Xs.
The p-values are compatible with the percentages in foobar2000 logs.

Code: [Select]
opus.96.sample1
batch  score  subtotal  p
    1    5/5      5/ 5  0.03125
    2    5/5    10/10  0.0009765625
clear difference

opus.96.sample3
batch  score  subtotal  p
    1    2/5      2/ 5  0.8125
    2    2/5      4/10  0.828125
no difference

opus.96.sample4
batch  score  subtotal  p
    1    5/5      5/ 5  0.03125
    2    5/5    10/10  0.0009765625
clear difference

opus.96.sample6
batch  score  subtotal  p
    1    5/5      5/ 5  0.03125
    2    4/5      9/10  0.01074219
    3    5/5    14/15  0.0004882812
clear difference

opus.96.sample7
batch  score  subtotal  p
    1    4/5      4/ 5  0.1875
    2    5/5      9/10  0.01074219
    3    4/5    13/15  0.003692627
    4    4/5    17/20  0.001288414
clear difference

opus.192.sample4
batch  score  subtotal  p
    1    2/5      2/ 5  0.8125
    2    2/5      4/10  0.828125
no difference

opus.128.sample1
batch  score  subtotal  p
    1    4/5      4/ 5  0.1875
    2    4/5      8/10  0.0546875
    3    4/5    12/15  0.01757812
    4    5/5    17/20  0.001288414
clear difference

opus.128.sample4
batch  score  subtotal  p
    1    5/5      5/ 5  0.03125
    2    5/5    10/10  0.0009765625
clear difference

opus.128.sample6
batch  score  subtotal  p
    1    2/5      2/ 5  0.8125
    2    3/5      5/10  0.6230469
    3    4/5      9/15  0.3036194
    4    4/5    13/20  0.131588
    5    3/5    16/25  0.1147615
    6    5/5    21/30  0.02138697
marginal difference

opus.160.sample4
batch  score  subtotal  p
    1    2/5      2/ 5  0.8125
    2    4/5      6/10  0.3769531
    3    2/5      8/15  0.5
    4    3/5    11/20  0.4119015
    5    4/5    15/25  0.2121781
    6    4/5    19/30  0.1002442
    7    4/5    23/35  0.04476554
marginal difference

opus.160.sample6
batch  score  subtotal  p
    1    3/5      3/ 5  0.5
    2    4/5      7/10  0.171875
    3    4/5    11/15  0.05923462
    4    2/5    13/20  0.131588
    5    1/5    14/25  0.345019
    6    4/5    18/30  0.1807973
    7    3/5    21/35  0.1552523
no difference

aotuv.q4.sample1
batch  score  subtotal  p
    1    4/5      4/ 5  0.1875
    2    5/5      9/10  0.01074219
    3    4/5    13/15  0.003692627
    4    5/5    18/20  0.0002012253
clear difference

aotuv.q4.sample3
batch  score  subtotal  p
    1    3/5      3/ 5  0.5
    2    3/5      6/10  0.3769531
    3    5/5    11/15  0.05923462
    4    3/5    14/20  0.05765915
    5    4/5    18/25  0.02164263
marginal difference

aotuv.q5.sample3
batch  score  subtotal  p
    1    2/5      2/ 5  0.8125
    2    3/5      5/10  0.6230469
    3    3/5      8/15  0.5
    4    4/5    12/20  0.2517223
    5    3/5    15/25  0.2121781
    6    5/5    20/30  0.04936857
    7    4/5    24/35  0.0204798
marginal difference

aotuv.q6.sample3
batch  score  subtotal  p
    1    4/5      4/ 5  0.1875
    2    4/5      8/10  0.0546875
    3    3/5    11/15  0.05923462
    4    3/5    14/20  0.05765915
    5    3/5    17/25  0.05387607
    6    2/5    19/30  0.1002442
no difference