
QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings

My primary motivation for performing this listening test was to find the lowest QT AAC TVBR setting that is fully transparent for me, because I want to use that setting for my music collection. A secondary motivation was to find out how the other encoders compare to QT AAC at high quality settings.

This is my first rigorous listening test, and a rather extensive one, so I wanted to share the results with the audio community. I hope others may learn as much from this experiment as I did!

Results in a nutshell (for the impatient)
QT AAC was judged fully transparent at q91 and close to transparent at q82. The sample in which I heard a faint difference between these presets had a bitrate of only 128kbps at q82 and 159kbps at q91, so taking that into consideration together with the expected bitrates at q82, in CBR mode I would assume files at 190kbps and up to be reasonably safe for my ears.
AoTuV (Vorbis) was judged very close to transparent at q5 and q6 and fully transparent at q7. If I were to use Vorbis for my music collection I would pick q6 because I think the tradeoff between file size and perceived sound quality is better at that preset than at q7. I would trust CBR files of 200kbps or greater.
Opus was judged fully transparent at VBR with target bitrate 224kbps, which is considerably higher than I expected based on previous reports. At preset 192 I judged it untransparent, so there is no grey area as in AAC or Vorbis. Opus VBR seems to vary a lot less than the other codecs, so in CBR mode I would trust Opus files of 230kbps and up.
LAME (MP3) was judged very close to transparent at V1 and V0 and fully transparent at c320 (CBR 320kbps). I would pick V0 if I were to use LAME for my music collection. In CBR mode I would trust files of 260kbps or greater.

Hardware
  • iMac7,1 with default Intel HD sound processor
  • Sennheiser HD 201 ear-enclosing headphones
  • fairly sensitive ears which were recently rinsed


Software
  • Mac OS X 10.6.8
  • X Lossless Decoder 20130127 for transcoding the samples (see encoder details below)
  • ABXTester 0.9, a simple GUI tool that presents the Xs in batches of 5 and uses QuickTime to play the samples
  • opus-tools 0.1.6 in order to decode Opus files to WAV so I could play them with QuickTime in ABXTester
  • Perian 1.2.3 QuickTime component that allows for playback of Vorbis ogg files


Encoder details
QT AAC: my installation of Mac OS X included CoreAudio 3.2.6, QuickTime 7.6.6 and QuickTimeX 10.0. I used TVBR mode and overall encoder quality "max".
AoTuV: XLD included release 1. Apart from the target quality setting no options were shown.
Opus: XLD included libopus 1.0.2. I used VBR mode and framesize 20ms. opus-tools 0.1.6 also uses libopus 1.0.2.
LAME: XLD included version 3.99.5. I used VBR mode with -q2 and the new VBR method.

Ambient conditions
The test setup was in an apartment with reasonably good sound isolation, in a moderately quiet environment with singing birds and low traffic. During ABX trials I kept the room door and the ventilation window closed, and the computer fans were turned down. Under those conditions, while wearing the headphones, most of the time the only sound I heard was the low humming of the external hard drive that carried the samples. Usually I became unaware of that sound when actively listening to a sample.

Samples
I selected 15 samples from the LAME Quality and Listening Test Information page. In 8 of those samples I didn't hear a difference in any of the encodings I tested. The remaining 7 samples are numbered 1-7 below. In addition I included, as sample 8, a 10-second fragment from Central Industrial by The Future Sound of London, which I had previously found to contain obvious artifacts when encoded with QT AAC q63:
  1. applaud.wv
  2. fatboy.wv
  3. goldc.wv
  4. pipes.wv
  5. testsignal2.wv
  6. vbrtest.wv
  7. velvet.wv
  8. central industrial.m4a (ALAC)

Henceforth I'll refer to these samples by their numbers. See the appendix for detailed discussion of each sample.

General test procedure
As a general preparation I transcoded the WavPack samples to ALAC in order to make them playable in ABXTester. I always used the lossless original as sample A and the lossy compressed file as sample B. I took regular breaks in order to prevent fatigue. The measurements were spread over multiple sessions with almost a week between the first and the last session.

For each codec, I would first encode all samples at the middle preset, i.e. q63 for QT AAC, V5 for LAME, q4 for aoTuV and 96kbps for Opus. Then for each sample I would conduct ABX testing and reach one of the following verdicts (a code sketch of this decision rule follows the list):
  • clear difference if I was very sure I heard obvious artifacts and I scored 5 out of 5 after the first batch, or if I scored near 100% after multiple batches with overall p <= 0.002;
  • marginal difference if I wasn't absolutely sure in each trial but testing showed that I was able to hear the difference, i.e. at least three batches with overall p <= 0.05;
  • no difference if testing didn't disprove that I might be just guessing (p > 0.05) or if I gave up in advance.
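
To make the decision rule concrete, here is a minimal sketch in Python (illustrative only: my actual bookkeeping was done on paper, the thresholds are the criteria above with "near 100%" read as at most one miss, and the p-value is the probability of getting exactly that score by guessing, as explained further down the thread):
Code:
from math import comb  # Python 3.8+

def exact_p(correct, total):
    # Probability of getting exactly `correct` of `total` trials right
    # by pure guessing: 0.5^total * choose(total, correct).
    return 0.5 ** total * comb(total, correct)

def verdict(batch_scores):
    # `batch_scores` holds the score (out of 5) for each batch of 5 Xs.
    correct = sum(batch_scores)
    total = 5 * len(batch_scores)
    p = exact_p(correct, total)
    if correct == total == 5 or (correct >= total - 1 and p <= 0.002):
        return "clear difference", p
    if len(batch_scores) >= 3 and p <= 0.05:
        return "marginal difference", p
    return "no difference", p

print(verdict([5]))        # ('clear difference', 0.03125): 5/5 on the first batch
print(verdict([4, 4, 5]))  # ('marginal difference', ~0.0032): 13/15 over 3 batches
print(verdict([3, 3, 3]))  # ('no difference', ~0.153): 9/15 is compatible with guessing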

By default I set the audio volume to 5 notches out of 16. If I didn't immediately hear a difference I tended to turn it up to 6 notches, for all samples except #1, which I experienced as very loud already. Occasionally I would try a sample with the channels reversed (by reversing my headphones) in order to test whether something new might come to my attention.
After testing all samples at the middle preset I would proceed to higher presets with the samples in which I heard any difference, until I found the minimal preset at which I heard no difference or until I couldn't go higher. A preset was judged "fully transparent" if I heard no difference in any sample, "very close to transparent" if I heard a marginal difference in at most one sample, and "untransparent" otherwise. I decided to assign QT AAC q82 an intermediate category, "close to transparent", because I heard a clear but very faint difference in one sample; more on that below. The overall search path from preset to preset generally followed a binary-search-like, "jumpy" pattern, as in the sketch below.
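
As an illustration of that search, a short sketch (assuming, for simplicity, that a preset which is transparent for a sample stays transparent at every higher preset; the hear_difference callback stands in for a full ABX session and is hypothetical):
Code:
def lowest_transparent(presets, hear_difference):
    # Binary search for the lowest preset with no audible difference.
    # `presets` must be ordered from lowest to highest quality.
    lo, hi = 0, len(presets) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if hear_difference(presets[mid]):
            lo = mid + 1          # still audible: search the higher presets
        else:
            best = presets[mid]   # transparent: remember it, try lower
            hi = mid - 1
    return best                   # None if even the highest preset is audible

# Example with the QT AAC TVBR ladder used in this test:
qt_tvbr = [27, 45, 54, 63, 73, 82, 91, 100]
print(lowest_transparent(qt_tvbr, lambda q: q < 91))  # -> 91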

I executed the above procedure first for QT AAC, then for LAME, then aoTuV and finally Opus. During the course of the experiment I noticed I had become better at detecting artifacts, so in the end I returned to QT AAC to verify my end results for that encoder.

QT AAC
Observed bitrate range: varies wildly around the official expected value. For example, at q63 (135kbps expected) some samples had an average bitrate of 80kbps while others went over 190kbps.
Observed artifacts: even at medium bitrates (q27) most artifacts were slight changes in timbre or texture rather than very obtrusive stand-alone sounds. The exception is sample 8, which acquired some obvious, very sharp "ticks" after encoding; these remained audible up to q82, at an average file bitrate of 128kbps.

Stage 1: all samples at q63.
I heard no differences except for a clear difference in sample 8. I decided to ignore that for the moment and to continue my search downwards first.

Stage 2: samples 1-7 at q27.
I heard clear differences in samples 1, 2, 6, 7.

Stage 3: samples 1, 2, 6, 7 at q45.
I heard clear differences in samples 2, 6 but no difference in samples 1, 7.

Stage 4: samples 2, 6 at q54.
I heard no differences anymore and decided q54 to be fully transparent, disregarding sample 8.

Stage 5: sample 8 at q100.
No difference.

Stage 6: sample 8 at q82.
No difference.

Stage 7: sample 8 at q73.
Clear difference; I chose q82 as my search result for the time being.

Stage 8: samples 1, 2, 6, 7, 8 at q82 (verification after finishing the other encoders).
I did hear a clear difference in sample 8 after all, though I had to listen to A and B a few times before I noticed it. I heard no difference in the other samples.
Note: I have not re-examined stages 1-4. With my trained ears I might actually hear some additional differences at q54 or even q63, but I haven't tested that.

Stage 9: sample 8 at q91.
No difference. I decided q91 to be my final search result for QT AAC.

LAME
Observed bitrate range: the spread is somewhat smaller than in QT AAC; generally the highest and lowest average bitrates were within 30kbps of the expected bitrate for the given quality preset.
Observed artifacts: no standalone "objects", but changes in timbre or texture could be very unsubtle.

Stage 1: all samples at V5.
I heard clear differences in samples 1, 4, 6, marginal difference in sample 7 and no difference in samples 2, 3, 5, 8.

Stage 2: samples 1, 4, 6, 7 at V3.
Clear differences in samples 1, 6.

Stage 3: samples 1, 6 at V1.
Marginal difference in sample 1. I decided V1 to be my search result for the time being.

Stage 4: sample 1 at V2 (checking for consistency with aoTuV after finishing Opus).
Clear difference. I chose V0 as my final search result instead.

Stage 5: sample 1 at V0 (for completeness, shortly before starting this report).
Marginal difference (yes, really: I believe I heard a difference, and I identified 18 out of 25 Xs correctly: 72%, p=0.014).

Stage 6: sample 1 at c320.
No difference (at first I thought I heard a difference but ABX testing showed I didn't).

AoTuV
Observed bitrate range: the average file bitrate is usually greater than the official target bitrate for the given quality preset. For example, the average bitrates at q4 were all greater than 128kbps. The upwards spread from the target bitrate seemed similar to that of QT AAC.
Observed artifacts: few and subtle. The marginal difference in sample 3 that I consistently heard up to q6 was an attenuation effect: the high-frequency components were slightly softened.

Stage 1: all samples at q4.
Clear difference in sample 1, marginal difference in sample 3 and no difference in the other samples.

Stage 2: samples 1, 3 at q6.
Marginal difference in sample 3, no difference in sample 1.

Stage 3: sample 1 at q5.
Marginal difference. I decided q6 to be my search result.

Stage 4: sample 1 at q7 (for completeness, shortly before starting this report).
No difference.

Opus
Observed bitrate range: average bitrates were always very close to the target bitrate, with a spread of less than 10kbps in each direction. I would compare Opus VBR to QT AAC ABR.
Observed artifacts: texture changes, some of them very severe, including "rattling" and "grinding" sounds. Usually the timbre became more "metallic".

Stage 1: all samples at target 96kbps.
Clear differences in samples 1, 2, 4, 5, 6, 7, no difference in samples 3, 8.

Stage 2: samples 1, 2, 4, 5, 6, 7 at target 192kbps.
Clear differences in samples 4, 6, no difference in samples 1, 2, 5, 7.

Stage 3: samples 4, 6 at target 256kbps.
No differences.

Stage 4: samples 4, 6 at target 224kbps.
No differences. I chose 224kbps to be my search result.

Conclusions and recommendations
QT AAC and aoTuV are the clear winners in this comparison, with QT AAC achieving full transparency at the best compression ratio. I was a bit surprised to find that LAME's highest quality preset is not overkill (for my ears). Opus doesn't seem to perform exceptionally well at high bitrates (though better than LAME), although it's known to beat QT HE-AAC (more or less) at 64kbps. This is probably in part explained by the fact that Opus is still very young. Another explanation is that Opus might be intended more for low bitrates, which is somewhat suggested by the way it's described on the Opus home page.

According to the Hydrogenaudio wiki, most people find AAC to be transparent at about 150kbps, Vorbis at about 150-170kbps and LAME at about 160-224kbps. Given the results of this experiment, my ears might be slightly better than average.

If you wish to repeat this experiment, you might be able to save a lot of time by using my results as a hint where to find the most significant differences. The sample details in the appendix may help you to "look" in the right direction. In addition, you can probably start your searches for Opus and LAME at higher presets than I did.

If you just want to use this report as a hint for choosing your ideal encoder setting, I suggest that you perform a miniature version of my experiment using just a single sample in the encoder that you're interested in. If you hear a difference, go up one preset until you don't; otherwise do the opposite and go down. Specifically:
For QT AAC, I would recommend listening to sample 8 and starting at q73. If you descend below q54 I recommend listening to samples 2, 6 instead.
For aoTuV, I would recommend listening to sample 3 and starting at q5. If you don't hear any difference switch to sample 1 at q4.
For Opus, you could take sample 4 at target 160kbps.
For LAME, I recommend listening to sample 1 starting at V3.

Appendix: sample details
Sample 1
Loud applause, with a "thank you" yelled through a microphone shortly after the start. The "thank you" is loud but sounds a bit muffled because of the microphone and there's a faint echo to it.
In the lossless original the applause sounds "wet"; you could compare it to rain or perhaps to oil spattering in a hot pan. In audibly different encodings it may sound drier, noisier and coarser, perhaps like sandblasting, or very coarse and metallic (in Opus at 96kbps target bitrate).
The "thank you" should be a separate sound layered on top of the applause, and should sound fairly smooth. In audibly different encodings you may expect it to interact with the applause in several ways:
  • The applause may seem less clear, noisier or softer during the "thank you".
  • Directly after the "thank you" the applause may seem to be slightly louder and much coarser.
  • The echo to the "thank you" may seem to be amplified compared to the original and include some noise.
  • The "thank" syllable may sound slightly less smooth, a bit raspy, as if affected by the sandblasting (this is the primary way in which I made out the difference at maximum quality settings in LAME).


Sample 2
Some sawtooth-like signal with an additional trill effect that seems to contain vowels. I'm not sure whether this is a heavily filtered human voice or just something creative from a synthesizer, but either way it sounds quite interesting.
At medium bitrates in QT AAC and Opus it sounded distorted and metallic.

Sample 3
Symphonic fragment with drums, trumpets, violins, vocals and some high-pitched string instrument which I think might be a steel guitar. There's also some high tingling in the right channel, which I suspect is an artifact in the original file coming from the string instrument. It sounds like the soundtrack to an epic 1960s movie.
In aoTuV you may find that the string instrument (the proper sound slightly to the left, not the tingling in the right channel) is arpeggiated less sharply and sounds softer overall; I would call it a bit "timid" compared to the original.

Sample 4
Bagpipe playing a slow high-pitched melody over a constant bass. The sound is smooth overall although you'll find some irregularity especially in the second long-lasting high note. In the background there's the occasional hollow, raspy, low-pitched sound which might be either the bag being inflated by the artist or (a suggestion of) wind.
Focus on the long-lasting high-pitched notes, especially the very last one. In case of an audible difference you'll find that they sound metallic and/or less smooth, or even outright distorted (Opus at 96kbps target bitrate).

Sample 5
Drums (something that sounds similar to a conga or a djembe) playing a samba-like rhythm. At the start an alto voice sings "aaaa", which is a bit of a shame because the voice will not help you to distinguish the encoded sample from the original and it partially masks the drums.
In Opus at 96kbps target bitrate the high-pitched slap beats sound more metallic than in the original.

Sample 6
Western guitar playing a country tune.
At lower bitrates you might recognise the encoded sample directly because it sounds metallic and perhaps even a bit distorted. At high bitrates you might be able to make out the difference if you focus on the initial arpeggio and the final note. The last note of the initial arpeggio (which lasts longer than the previous notes) might sound a bit rougher than in the original. The final note might sound metallic. The latter difference is probably easier to hear than the former. You probably won't find a difference in the chords.

Sample 7
Monotone (synthetic) drum rhythm with bass: a big tom beating on every second bass beat, an opening-and-closing hi-hat in the right channel alternating with the bass beat, and another, closed hi-hat in the left channel beating four times for every bass beat.
You'll only hear a difference at the lower quality settings, and you are most likely to find it in the closed hi-hat in the left channel.

Sample 8
Synthesizer music of fairly low complexity.
Frankly, the sounds aren't really important, because the main reason to listen to this fragment is the sharp ticks that are introduced by QT AAC. I don't think I need to tell you where they are because you're pretty much guaranteed to hear them at q63 and below.
Since this sample isn't available from the LAME Quality and Listening Test Information page, I made it available for download over here: https://dl.dropbox.com/u/3512486/central%20industrial.m4a

Reply #1
I believe you, but I am skeptical: these bitrates are way too high. I used to test a lot and train my ears to hear artifacts, but I gave up when I couldn't believe I found Apple's AAC to be transparent at ~100kbps. I now use -V73 just to cover a wider range of music and I still think it's too much; I was happy with -V63 as well.

I'd like to see your ABX logs at "even" ~96/~128/~160kbps.

Reply #2
What versions of these encoders did you use? For Opus I hear a lot of improvement with 1.1a over the previous 0.1.5, and I can see more variability in the variable bit rate.

Reply #3
@eahm:
If by "logs" you mean the kind of logs you often see at this forum (I presume those are produced by foobar2000) I'll have to disappoint you, because ABXTester doesn't produce anything like that. Apart from that, I'll happily do some additional tests for you.

What do you mean by "even"?

@DonP:
I listed the versions of all encoders clearly in my post. See the sections "Software" and "Encoder details" close to the top.
(Edit: I checked the Opus website, and libopus 1.0.2, which I used, seems to be the latest version.)

 

Reply #4
Jplus, yes, I meant those logs. Until I see a proper ABX test that tells me you really hear a quality difference between lossless and 192 AAC I have to remain skeptical; AAC is soo good at low bitrates.

Even = low. It was more for the ~96.

Reply #5
Quote
@DonP:
I listed the versions of all encoders clearly in my post. See the sections "Software" and "Encoder details" close to the top.
(Edit: I checked the Opus website, and libopus 1.0.2, which I used, seems to be the latest version.)




OK... for aoTuV you listed version 1. Rarewares shows the current version as 6.03.

I grabbed the pipes sample, and have been able to ABX Opus 1.1a up to target bitrate 128 (so far). The 3 rates I tried (70, 100, 128) all encoded at 40-50% over the target rate, with foobar showing 180 kb/s for the section I was using for the 128 kb ABX. So the encoder does seem to recognize this sample as hard.

edit: a description of Opus 1.1a vs 1.0.x: http://jmspeex.livejournal.com/11737.html

foo_abx 1.3.4 report
foobar2000 v1.2
2013/02/08 16:03:27

File A: D:\rips\abxstuff\pipes\pipes.wv
File B: D:\rips\abxstuff\pipes\_\_\track 128kb.opus

16:03:27 : Test started.
16:04:03 : 01/01  50.0%
16:04:20 : 02/02  25.0%
16:04:50 : 02/03  50.0%
16:05:35 : 03/04  31.3%
16:05:59 : 04/05  18.8%
16:06:19 : 05/06  10.9%
16:06:40 : 06/07  6.3%
16:07:01 : Test finished.

Reply #6
@eahm:
I'm sorry for quote-sniping you, but I see several things in your post which I think need to be addressed before I can do any additional testing:

Quote
Jplus, yes, I meant those logs. Until I see a proper ABX test that tells me you really hear a quality difference

Excuse me, but what is improper about my ABX tests? The only difference from the foobar2000 logs is that you can't see trial-by-trial whether I identified the X correctly or not. I explained my definitions of "clear difference" and "marginal difference". Everywhere I said I heard a "clear difference" I got a 100% or near-100% score, with a probability of less than 0.002 that I identified the Xs correctly by luck. For example, I might have correctly identified 18 out of 20 trials. The probability of getting that score by guessing is 0.5^20 * choose(20, 2) = 0.00018 (judging from an example log, foobar2000 would round that down to 0.0%). A similar but less extreme story applies to the "marginal differences".
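
For anyone who wants to verify that arithmetic, a minimal Python check (illustrative only):
Code:
from math import comb  # Python 3.8+

# 18 of 20 correct leaves 2 wrong, and choose(20, 2) ways to place them.
print(0.5 ** 20 * comb(20, 2))  # 0.0001811981201171875, i.e. ~0.00018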

If you think that I might have made this up then logs shouldn't change anything, because I can make those up as well.

If knowing my exact score for each individual trial is important to you, I can keep track of that during my next experiments and write it down in my own way in my next post. Would that solve your issue? Because frankly, I won't be able to run foobar2000 on my Mac.

Quote
between lossless and 192 AAC

Note that from my VBR results I concluded that most files at or above 190kbps are probably transparent to me (I called it "reasonably safe for my ears", but that amounts to the same thing). That means I actually don't expect to hear a difference between lossless and 192kbps AAC. If you want, I can check whether any of the QT AAC samples that I found audibly different from the lossless original had a bitrate near 192kbps.
Edit: I did this, and none of them did. The highest average bitrate was 161kbps, for sample 2 at q45. If I could hear the difference in sample 2 at q54 (which I didn't verify after my ears became more trained), that one would come close, at 186kbps.

Quote
I have to remain skeptical; AAC is soo good at low bitrates.

I completely agree with that! Note that at the start of my experiment I heard no difference at q54 (expected bitrate 95kbps) in any of my samples except #8, which had obvious ticks that were probably caused by QT choosing the bitrate too low. I no longer heard those ticks at q91, where the average bitrate of sample 8 was still only 159kbps.

I'm not denying that QT AAC is really good even at medium bitrates (where I follow the apparent convention that 80-120kbps is medium). I'm just saying that I found a case where q82 isn't strictly transparent, so I'll have to choose q91 for my music in order to be on the safe side.

Quote
Even = low. It was more for the ~96.

So you'd like me to test at about 96kbps, 128kbps and 160kbps. I'm fine with that, but how would you want me to approach it? Use the VBR preset whose expected bitrate is nearest the proposed bitrate?
And why exactly would you like me to do that? Do you expect results that are somehow in conflict with my first post?


@DonP:
Alright, so libopus 1.1a is probably better and more variable than 1.0.2. That seems to confirm my suspicion that 1.0.2 didn't score very well in my experiment because it's still a very young codec. I acknowledge that I wasn't using the bleeding-edge version in my measurements and that Opus would probably have scored better if I had.
I prefer testing release versions only, because you never know what rare errors an alpha encoder might have that happen not to show up in my limited set of test samples. It seems that you are concerned that Opus might look worse in my results than it deserves. Will I make you happier if I repeat my Opus measurements when 1.1 is ready for release?

As for aoTuV, XLD is probably just displaying the version number incorrectly (indeed, if you search for "aotuv" on the XLD homepage, the last hit you'll find indicates that the included version should be at least 4.51). My results don't seem any worse than you'd expect from aoTuV (as compared to QT AAC), so I think there's no reason for concern.

That said, XLD offers plugins for Opus 1.1a and for aoTuV 6.03b, so if more people think I really should test those, I can do so without jumping through hoops. Please do keep in mind that what I've done here is very time-consuming, though.

Reply #7
Quote
@DonP:
Alright, so libopus 1.1a is probably better and more variable than 1.0.2. That seems to confirm my suspicion that 1.0.2 didn't score very well in my experiment because it's still a very young codec. I acknowledge that I wasn't using the bleeding-edge version in my measurements and that Opus would probably have scored better if I had.
I prefer testing release versions only, because you never know what rare errors an alpha encoder might have that happen not to show up in my limited set of test samples. It seems that you are concerned that Opus might look worse in my results than it deserves. Will I make you happier if I repeat my Opus measurements when 1.1 is ready for release?


I guess my two points were that (a) you noted that the VBR wasn't varying much, and that has been fixed (the overall average on music I've encoded still seems pretty close to the target), and (b) someone was questioning your lack of logs, so I pointed out that with the one sample I tried, my results at an actual 180 kb/s were consistent with yours at roughly the same rate, log supplied.

Though in general anyone requiring stability would be nuts to count on alpha software, I've found no case where it performs worse than the "stable" release, and it's more dependable than the development builds, to which folks who had problems with the production release (including me) had been directed for quite a while.

Reply #8
Ahh, I guess I've been too defensive. Thanks for pointing out these useful bits of information!

(Concerning the logs: I realised I could write a little script to complement ABXTester and produce a simple log of the same kind as those produced by foobar2000. So in the future that should fix the issue for those who care very much about the textual representation of my results. Example:
Code:
test
batch  score  subtotal  p
    1    4/5      4/ 5  0.15625
    2    5/5      9/10  0.009765625
    3    5/5     14/15  0.0004577637
)
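
A minimal sketch of how such a script could compute that table (the p-value is the exact-score probability 0.5^N * choose(N, K); the function name and interface are illustrative only, not anything ABXTester itself provides):
Code:
from math import comb  # Python 3.8+

def print_log(batch_scores, batch_size=5):
    # One row per batch of Xs: batch score, running subtotal and the
    # exact-score p-value 0.5^total * choose(total, correct).
    correct = total = 0
    print("batch  score  subtotal  p")
    for i, score in enumerate(batch_scores, start=1):
        correct += score
        total += batch_size
        p = 0.5 ** total * comb(total, correct)
        print(f"{i:5d}  {score}/{batch_size}  {correct:4d}/{total:<4d}  {p:.10g}")

print_log([4, 5, 5])  # same scores and p-values as the example above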

Reply #9
Jplus, don't worry about the lower bitrates just for me if you hear differences at higher ones.

I meant proper logs when I said "proper ABX test". I am sure you tested, but you only say "clear difference" and "no difference" here, here and here; who's to tell us you really did?

Everyone who talks about testing and transparency needs to post logs for every single test taken. Proper logs with percentages, seconds, test numbers, etc.

I don't understand why this time, and this thread, should be different.


Let me be clear and not rude. For example, this test:
Quote
Stage 1: all samples at target 96kbps.
Clear differences in samples 1, 2, 4, 5, 6, 7, no difference in samples 3, 8.

I'd like to see the logs for every sample.

Reply #10
The OP presented clear and concise criteria for testing; do we really need to hammer him with the TOS 8 card? Yes, I understand the importance of unbiased log results, but being a Mac user he doesn't have the ability to do foobar2000 ABX testing. If the mods truly believe his results to be biased they can delete the topic. It's not like he wandered in here screaming 'wma is better than mp3, so nyah'.

Reply #11
About aoTuV version numbers: homepage

Quote
aoTuV Release 1 (2006/08/23)
# This is the stable version. The contents are almost the same as beta4.51.


Current version is aoTuV Beta6.03.

Reply #12
@lvqcl:
Thank you, that clarifies some things. I hope you don't mind me asking for your personal opinion: do you think I should test aoTuV 6.03b?

@Mach-X:
Thank you.

Concerning my clear and concise criteria, I should say they were also very strict. Judging from foobar2000 logs that I've seen elsewhere on Hydrogenaudio, many people would already be comfortable concluding that they really heard a difference at p=0.004 (or probably even higher). According to my criteria that would only qualify as a marginal difference.

@eahm:
I'm confused. You've said that you believe me and that you're sure I did the tests. At the same time you emphasize that you're skeptical and you insist on viewing logs of every individual test.

The thing is this: I did 116 tests, give or take a few for counting errors. I've done you a great favor by compressing my results to just a single datum per test, i.e. at what level of confidence I heard a difference, if any at all (you won't find 116 results in my OP, but that's because I skipped over all samples in which I never heard a difference and because some "no difference" judgments were implicit). If I had posted foobar2000 logs for all of those tests, or even just the 53 that are interesting, would that really help you? Would you read and verify all of them?

There's another thing. The only service that ABXTester offers is to present me with an A, a B and five Xs. I can try to identify the Xs and then ask for my score, which is shown in a popup window. I can ask for new batches of Xs as often as I want, but ABXTester doesn't keep track of my running total. Any logging has to be done manually by me. Which I did in my own way, using a calculator and the back of an envelope: between batches I would recalculate my p-value, and at some point I would decide to end the test and assign the current confidence level (clear/marginal/no difference), which I logged on my envelope with a symbol (respectively a star, a half star or a dash).

Those symbols on the back of that envelope are the only permanent log I've kept, so there's no way I'll be able to show you "proper" logs for the tests I've already conducted. I would have to repeat the tests and manually enter my score for each individual batch into my new script in order to do that. That would be several days of work if I were to do it for all of my tests.

However, I take your skepticism seriously, and I can offer to repeat the three tests that you're most skeptical about. I guess those might be these:
  • Sample 8 in QT AAC q82 (clear difference).
  • Sample 3 in aoTuV q6 (marginal difference).
  • Sample 1 in LAME V0 (marginal difference).

Please let me know what you think of this offer.

Reply #13
Without hard numbers, only you can draw your conclusions and they're valid just for you. You're on the right track, but you're not there.


Reply #15
TLDR version: greynol is right that fb2k does something slightly different from what I do. I figured out what it calculates and will do the same from now on, so my future p-values will be comparable with those of everyone else at the HA forums.

I calculate the probability of a false positive, i.e. the probability that I would get the score if I were guessing randomly.

The probability of identifying a single trial correctly by luck is 0.5, and the probability of identifying it incorrectly is also 0.5. So the probability of identifying N trials all correctly (or all incorrectly) is 0.5^N.

If you identify K out of N trials correctly (meaning some but not all of them), the probability is greater, because there are multiple ways to get exactly K out of N trials correct. For example, if you get 14 out of 15 trials correct, the single incorrect trial might be the first, the second, and so on; there are 15 ways to get that score, so I have to multiply 0.5^15 by 15 to get the true probability of a false positive.

The overall formula for a false positive with K out of N trials correct is 0.5^N * (number of ways to get exactly K out of N correct). This also applies to getting all of the trials (or none of them) correct, because in both cases there's only one way to do that, so you're just multiplying 0.5^N by 1. The "number of ways to get exactly K out of N correct" is known mathematically as the binomial coefficient; in computer-related contexts it is often written as choose(N, K). Therefore the more formal way to write the false-positive probability formula is 0.5^N * choose(N, K). This is the binomial distribution (here in the special case where the probability of success equals the probability of failure, but that doesn't matter now).

I checked whether foobar2000 is calculating the same thing using the first log in this topic: index.php?showtopic=98841. You are right that it isn't the same:
foobar2000 p-value at the fifth trial: 18.8%
what I would calculate: 0.5^5*choose(5, 4) = 0.15625 =~ 15.6%

I checked every fifth value, and it turns out that my p-values are consistently close to, but slightly smaller than, the foobar2000 p-values. Fortunately I know where the difference comes from: foobar2000 is calculating the probability of a false positive if you correctly identify K or more trials out of N. See the math:
0.5^5*choose(5, 4) + 0.5^5*choose(5, 5) = 0.1875 =~ 18.8%

It also works for the other trials. For example, here's the tenth trial:
foobar2000 p-value: 5.5%
doing it manually: 0.5^10*choose(10, 8) + 0.5^10*choose(10, 9) + 0.5^10*choose(10, 10) = 0.0546875 =~ 5.5%

And the sixteenth trial:
foobar2000: 0.2%
manually: 0.5^16 * (choose(16, 14) + choose(16, 15) + choose(16, 16)) = 0.002090454 =~ 0.2%
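
For the record, the two styles side by side in a short Python sketch (illustrative only), reproducing the numbers above:
Code:
from math import comb  # Python 3.8+

def exact_p(correct, total):
    # My old style: probability of *exactly* `correct` right by guessing.
    return comb(total, correct) / 2 ** total

def cumulative_p(correct, total):
    # foobar2000 style: probability of `correct` *or more* right by guessing.
    return sum(comb(total, k) for k in range(correct, total + 1)) / 2 ** total

print(exact_p(4, 5))         # 0.15625            (~15.6%)
print(cumulative_p(4, 5))    # 0.1875             (~18.8%)
print(cumulative_p(8, 10))   # 0.0546875          (~5.5%)
print(cumulative_p(14, 16))  # 0.0020904541...    (~0.2%)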

I'll adopt the foobar2000-style p-values in my future tests in order to make my outcomes comparable with those of other people at the HA forums. It's to my own benefit as well, since the slightly larger p-values will force me to be slightly more cautious. Thank you for making me aware of this difference!

Reply #16
Why are you all so hung up on the ABX logs? It's sufficient that he said he used ABX and got statistically significant results. Until the ABX tools start to append some randomly generated string at the end of the log and then sign the whole thing with some sufficiently long cryptographic key (similar to what new versions of EAC do to their logs), there is nothing to prevent the creation of fabricated logs, so their value is questionable at best.

Reply #17
LordWarlock, I was able to hear a difference between CBR 320kbps FhG and CBR 320kbps LAME in the first minute of Guns n' Roses - Don't Cry. I tested this for four straight hours, but sorry, I don't have any logs; you have to trust me. Go tell LAME and Fraunhofer there is something wrong with their encoders.

Reply #18
Thanks for the clarification, Jplus. Even more thanks for sharing your findings and for doing so with such rigor.  It is clear to me that you have keen hearing and a strong handle on methodology.

Reply #19
Not so much thanks to eahm for continuing to troll someone who clearly has good intentions and has made a huge amount of effort to tell us about the results.

Reply #20
Even realizing that Opus was built for mobility, its apparent regression at higher bitrates (compared to Vorbis and AAC) is still disconcerting.

Reply #21
A big thank you from me too, Jplus. I appreciate your findings. And I can't see any reason not to trust your hard work.
lame3995o -Q1.7 --lowpass 17

Reply #22
Quote
LordWarlock, I was able to hear a difference between CBR 320kbps FhG and CBR 320kbps LAME in the first minute of Guns n' Roses - Don't Cry. I tested this for four straight hours, but sorry, I don't have any logs; you have to trust me. Go tell LAME and Fraunhofer there is something wrong with their encoders.
And? Even if you provided a log (or logs) supporting this statement, I still wouldn't have any reason to believe you (or not to believe you, if I took your side...). You could type it in manually, you could create one by comparing completely different sounds, or you could simply go for a brute-force method and repeat the test until you got your desired result by pure luck.

Reply #23
Quote
And? Even if you provided a log (or logs) supporting this statement, I still wouldn't have any reason to believe you (or not to believe you, if I took your side...). You could type it in manually, you could create one by comparing completely different sounds, or you could simply go for a brute-force method and repeat the test until you got your desired result by pure luck.

Of course, and for the same reason I thank Jplus for putting in the effort to test that much.

Reply #24
Jplus, welcome to the HA forum. It's great to see your OP.

Quote
Will I make you happier if I repeat my Opus measurements when 1.1 is ready for release?

Or you can try it now.
It's a closed circle: everybody is waiting for a final release while the devs are waiting for you to try it, as if nobody is sure who should make the first step. Would you do it?

Quote
Even realizing that Opus was built for mobility, its apparent regression at higher bitrates (compared to Vorbis and AAC) is still disconcerting.

Opus is still a very young format. Jplus tested version 1.0.2, while there is a new alpha, 1.1, which has unconstrained VBR.
There is a lot of stuff going on. Check it here: http://www.hydrogenaudio.org/forums/index....st&p=823712