
Topic: Public AAC Listening Test @ ~96 kbps [July 2011]: Results

  • Nezmer
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #25
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.

  • Garf
  • Developer (Donating)
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #26
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.


Sorry, but this just isn't true for the ffmpeg AAC encoder. Have you actually looked at it? It's reasonably sophisticated, more sophisticated than FAAC for example. It has a real psymodel, 3 different quantization loop algorithms, proper short block switching, etc.

Even so, there's no particular reason to believe a non-sophisticated AAC encoder must suck terribly. Again, FAAC is a good reference.
http://listeningtests.t35.me/html/AAC_at_1...est_results.htm

As far as I can tell, the problem is that it is utterly riddled with bugs and was probably never properly tested and debugged. It might be misdesigned too, but I feel like I'm sticking my neck out here because I could be wrong on that - maybe the current design works fine if you fix the bugs.

The ffmpeg AAC encoder is crap because it's buggy and insufficiently tested. Not because it's missing sophisticated algorithms.
  • Last Edit: 24 August, 2011, 07:49:55 AM by Garf

  • Nezmer
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #27
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.


Sorry, but this just isn't true for the ffmpeg AAC encoder. Have you actually looked at it? It's reasonably sophisticated, more sophisticated than FAAC for example. It has a real psymodel, 3 different quantization loop algorithms, proper short block switching, etc.

Even so, there's no particular reason to believe a non-sophisticated AAC encoder must suck terribly. Again, FAAC is a good reference.
http://listeningtests.t35.me/html/AAC_at_1...est_results.htm

As far as I can tell, the problem is that it is utterly riddled with bugs and was probably never properly tested and debugged. It might be misdesigned too, but I feel like I'm sticking my neck out here because I could be wrong on that - maybe the current design works fine if you fix the bugs.

The ffmpeg AAC encoder is crap because it's buggy and insufficiently tested. Not because it's missing sophisticated algorithms.


I stand corrected.

The AAC encoder still needs `-strict experimental` to be enabled, and I assumed they would distribute a basic encoder first, then gradually implement optimisations.

Looking at the git log of 'aacenc.c', the last four commits contain three fixes and one library change. But before that, the psymodel seems to have been the focus of the work done earlier this year.

How does all this affect the quality of the encoder? I don't know.

  • C.R.Helmrich
  • Developer
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #28
Some bit-rate statistics, like those presented in previous test results. Feel free to double-check and add them to the results page. All figures are mean bit-rates in kbps, obtained using foobar2000 v1.1.8 beta 5. If a mean bit-rate is given as a range, my own calculations differ from the ones reported by foobar.

Code:
Sample   Length[s]  nero     QT CVBR  QT TVBR  FhG      CT CBR   Anchor
--------------------------------------------------------------------------
01       30         109      108      119      120      100      102
02        9          75       94       67       77      100       76
03       13          93      112      102       97      100      101
04       28         102       99       98      113      100      103
05       30          95       97       95       99      100       98
06       20          81       98       84       90      100      105
07       22         109      107      107      125      100      103
08       28          94      105       82       95      100       97
09        9          96       98       95      106      100      104
10       30          98      106      106      101      100       99
11       20          96       97       87      104      100      100
12       15         100      110      101      100      100      100
13       10         101      101      101       95      100       99
14       10          89       97       97      105      100      104
15       19         105      109      113      117      100      101
16       28          90       96       84       91      100      101
17       20         104       97       90      105      100      104
18       18          65       93       67       84      100      102
19       16         106       98       91      101      100       96
20       30          90       96       83       83      100       97
--------------------------------------------------------------------------
Mean     20.3        95-96   101       93-94   100-101  100      100
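One plausible source of those ranges is the averaging method: a plain mean of the 20 per-sample rates and a length-weighted mean (total bits over total time) give slightly different values. A sketch using the Nero column above (which method foobar actually uses is an assumption on my part):

```python
# Recompute the Nero mean bit-rate from the table above, two ways.
lengths = [30, 9, 13, 28, 30, 20, 22, 28, 9, 30,
           20, 15, 10, 10, 19, 28, 20, 18, 16, 30]
nero = [109, 75, 93, 102, 95, 81, 109, 94, 96, 98,
        96, 100, 101, 89, 105, 90, 104, 65, 106, 90]

unweighted = sum(nero) / len(nero)                     # plain mean of the 20 rates
weighted = (sum(r * s for r, s in zip(nero, lengths))  # mean weighted by sample
            / sum(lengths))                            # length: total bits / total time

print(round(unweighted, 1), round(weighted, 1))  # 94.9 95.8 -- spanning the 95-96 range
```

The two results bracket the "95-96" entry in the Mean row, which would explain a range rather than a single number.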

Chris
If I don't reply to your reply, it means I agree with you.

  • Garf
  • Developer (Donating)
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #29
Some bit-rate statistics, like those presented in previous test results. Feel free to double-check and add them to the results page. All figures are mean bit-rates in kbps, obtained using foobar2000 v1.1.8 beta 5. If a mean bit-rate is given as a range, my own calculations differ from the ones reported by foobar.


Thanks, added to the results page.

  • Zarggg
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #30
Just looking for a quick verification on whether I'm interpreting the results properly:

Am I correct in concluding that QuickTime Constrained VBR performed slightly better than QuickTime True VBR, but not by enough to make an obvious difference?

  • greynol
  • Global Moderator
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #31
CVBR and TVBR are statistically tied.  One did not do better than the other, not even slightly.
13 February 2016: The world was blessed with the passing of a truly vile and wretched person.

Your eyes cannot hear.

  • IgorC
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #32
Am I correct in concluding that QuickTime Constrained VBR performed slightly better than QuickTime True VBR, but not by enough to make an obvious difference?


The reason is that CVBR was at 100 kbps and TVBR at 95 kbps. That was a limitation of the bitrate scale.
  • Last Edit: 25 August, 2011, 01:30:22 PM by IgorC

  • greynol
  • Global Moderator
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #33
That assumes facts not in evidence.

  • Gecko
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #34
First, thank you IgorC and everyone involved!

How do I perform and interpret the analysis on a different set of data (e.g. only my personal results)? Here's what I've got so far:
1. From the provided results.zip copy the "Sorted by sample" folder to a new location and delete all unwanted test results (e.g. keep only  34_GECKO_test??.txt).
2. Use chunky to gather the ratings: chunky.py --codecs=1,Nero;2,CVBR;3,TVBR;4,FhG;5,CT;6,ffmpeg -n --ratings=results --warn -p 0.05 --directory="d:\foo"
3. Take chunky's output "results.txt" and feed it to bootstrap: bootstrap.py --blocked --compare-all -p 100000 -s 100000 results.txt > bootstrapped.txt

a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?
b) Can step 1. be done more efficiently?
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)

  • Zarggg
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #35
CVBR and TVBR are statistically tied.  One did not do better than the other, not even slightly.

That answer is just as good for my own edification. Thanks.

  • mjb2006
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #36
Even though I sent in results, they didn't get included at all, neither accepted nor rejected. I sent them to the correct address on the last day (Aug. 20). Anyway, going through them now, I found that I kind of botched #04, which was the clip from the intro to OMD's "Enola Gay".

The second half of the clip has the hi-hats panned slightly to the right of center. All the encoders except ffmpeg(!) seem to put the hi-hats pretty much dead-center. The panning is so slight that I didn't notice the issue at all until the 5th comparison, which for me was Apple CVBR (Sample04_2). In that comparison, the panning was the only difference I noticed.

At that point, I should've checked the reference clip, but instead I checked my previous answers, and found that on one comparison (which turns out to be ffmpeg), the encoded version sounded really bad, but the hi-hats were panned slightly to the right, so I incorrectly guessed that the panning was an artifact. Oops. So for comparison #5, I guessed that CVBR was the original and I rated the actual original as inferior.

Now I feel like I should've listened to the reference clip at that point and spotted it there, then gone back and changed my answers on the previous comparisons. But instead I just left them as they were:
  • comparison #1: no differences noticed between original & Sample04_4 (FhG)
  • comparison #2: Sample04_6 (ffmpeg) rated very inferior due to ringing synths, syrupy hi-hats
  • comparison #3: Sample04_1 (Nero) rated somewhat inferior due to ringing synths
  • comparison #4: no differences noticed between original & Sample04_3 (TVBR)

If I were to go back after noticing the panning issue and listen for it in the comparisons I had already completed, I would've noticed it in #1 and #4, and would've correctly spotted and downgraded my ratings for the encoded clips. But doesn't going back like that, once I've told myself what to listen for, invalidate those results? If I only "naturally" notice the panning some of the time, shouldn't that just be accepted?
  • Last Edit: 25 August, 2011, 05:54:32 PM by mjb2006

  • IgorC
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #37
First, thank you IgorC and everyone involved!

 
Thank you, too, for your complete set of 20 results.

How do I perform and interpret the analysis on a different set of data (e.g. only my personal results)? Here's what I've got so far:
1. From the provided results.zip copy the "Sorted by sample" folder to a new location and delete all unwanted test results (e.g. keep only  34_GECKO_test??.txt).
2. Use chunky to gather the ratings: chunky.py --codecs=1,Nero;2,CVBR;3,TVBR;4,FhG;5,CT;6,ffmpeg -n --ratings=results --warn -p 0.05 --directory="d:\foo"
3. Take chunky's output "results.txt" and feed it to bootstrap: bootstrap.py --blocked --compare-all -p 100000 -s 100000 results.txt > bootstrapped.txt

a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?
b) Can step 1. be done more efficiently?
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)

a) Both are fine. Though I'm also interested to hear Garf on this subject.
b) Yes, there is an easier way: there is a "Sorted by listener" folder. Find the folder with your results ("34_GECKO"), rename it to "Sample01" and run chunky on it.
c) You should copy-paste all results (results01, results02, ..., results20) into results_AAC_2011.txt, without spaces or comments. You will have 280 results in total: sample01 - 21 results, sample02 - 20 results, etc. If you have any issues, see "results_AAC_2011.txt".



mjb2006
I do not accept the results after the closure of the test (evening 20 Aug).
Your results would be discarded anyway.
Your results for samples 03 and 04 are invalid. Two invalid results out of your total of 5 (01, 02, 03, 04, 05) means a complete discard; read rules.txt.
I've repeated many times to send in single results as soon as possible, so they can be re-done in case of errors.
And your results are dated 26 July. There is nobody to blame.
  • Last Edit: 25 August, 2011, 10:47:58 PM by IgorC

  • mjb2006
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #38
I do not accept the results after the closure of the test (evening 20 Aug). Your results would be discarded anyway.
I'm not upset, and I did not wish to imply that I was arguing about whether my results should have been considered valid. Clearly they are not.

Besides, I see now what happened. On 20 Aug I realized I would not have time to do more tests, so I checked the thread, and you had not yet made your post saying the test was closed, so I RARed my old results (file modification time 16:04:09-0600) and sent them (email time 16:05:34). I see now that you posted in that very short interval (post time 16:04:xx).

And I didn't realize that you would be contacting people about errors and offering them the chance to re-do those tests. This meaning is not at all obvious when you said that sending results early "helps to prevent some simple errors related to ABC-HR application or any other at early stage," which sounds like you're referring to logistical issues and also seems to be the only time you mentioned it in the test thread, not something you "repeated many times."

Anyway, is it normal for ~27% of listeners to have their results discarded?


  • IgorC
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #39
Anyway, is it normal for ~27% of listeners to have their results discarded?

Yes, it is normal. The quality bar is pretty high.


And I didn't realize that you would be contacting people about errors and offering them the chance to re-do those tests

http://www.hydrogenaudio.org/forums/index....st&p=764051
http://www.hydrogenaudio.org/forums/index....st&p=763480

  • Last Edit: 26 August, 2011, 12:43:44 AM by IgorC

  • Garf
  • Developer (Donating)
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #40
a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?


Always look at the adjusted p-values. Multiple comparisons doesn't refer to listeners or samples, but simply to the fact that every codec is compared to every other codec. Bootstrap shows "15 comparisons" for this test, so the p-values must be adjusted for that.

This is a cartoon explaining what would happen if we DIDN'T do that:
http://www.xkcd.com/882/
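The adjustment itself can be as simple as a Bonferroni correction (a sketch with hypothetical p-values, not data from this test; bootstrap.py may well use a different, less conservative method):

```python
# 6 codecs -> 6*5/2 = 15 pairwise comparisons, so each raw p-value
# is multiplied by 15 (capped at 1.0) before comparing against 0.05.
n_codecs = 6
m = n_codecs * (n_codecs - 1) // 2  # 15 comparisons

raw = [0.004, 0.02, 0.30]            # hypothetical unadjusted p-values
adjusted = [min(1.0, p * m) for p in raw]

# 0.02 looks significant on its own, but adjusted (0.30) it no longer is --
# exactly the "green jelly beans" trap in the xkcd strip above.
```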

Quote
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)


This must be done by hand. Chunky has a bug where by default it slams all listeners together per sample in its final result (so you end up with a result as if only a single person had taken the test).

  • Garf
  • Developer (Donating)
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #41
CVBR and TVBR are statistically tied.  One did not do better than the other, not even slightly.


More correctly:

One did do better than the other (the means are not equal). But there's still a high enough probability that the result was due to random chance rather than one codec being better than the other, so we don't want to draw any conclusions from it.

  • IgorC
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #42
Always look at the adjusted p-values. Multiple comparisons doesn't refer to listeners or samples, but simply to the fact that every codec is compared to every other codec. Bootstrap shows "15 comparisons" for this test, so the p-values must be adjusted for that.

This is a cartoon explaining what would happen if we DIDN'T do that:
http://www.xkcd.com/882/

Then I completely misunderstood it and I apologize for my wrong answer to Gecko.

  • Gecko
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #43
Thank you IgorC and Garf for answering my questions!

In the output of bootstrap, is there any semantic difference between "a is better than b" and "b is worse than a"?

During the test I had the feeling that one (regular) contender was often worse and one often a tad better. Only the former assumption seems to be backed by my data (using the adjusted p-values): ffmpeg < Nero < all others.

Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.

  • IgorC
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #44
Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.

That's actually how bootstrap works.
For chunky it's all the same whether a particular sample has 1 or 100 results: it throws away the individual data and works only with the average score per sample (only 20 average scores, because there were 20 samples).

Bootstrap, in contrast, performs its analysis on the whole set of data (280 results for this test). That permits finding more statistically significant differences.
Literally every single result is useful.
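The difference between averaging per sample and resampling every individual result can be illustrated with a toy bootstrap. This is only the core idea with made-up ratings, not the actual bootstrap.py (which also blocks by listener):

```python
import random

random.seed(0)

# Made-up per-result ratings for two codecs on the same trials.
codec_a = [4.1, 3.9, 4.5, 4.0, 4.2, 3.8, 4.4, 4.1, 4.3, 4.0]
codec_b = [3.9, 3.7, 4.6, 3.8, 4.0, 3.6, 4.2, 3.9, 4.1, 3.8]
diffs = [a - b for a, b in zip(codec_a, codec_b)]  # paired = the "blocked" idea

n_boot = 10_000
at_or_below_zero = 0
for _ in range(n_boot):
    # resample the individual differences with replacement
    resample = [random.choice(diffs) for _ in diffs]
    if sum(resample) / len(resample) <= 0:
        at_or_below_zero += 1

p = at_or_below_zero / n_boot  # one-sided p-value for "codec A rated higher"
```

Collapsing the ten results to one average first would leave nothing to resample; keeping every individual result is what gives the test its power.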
  • Last Edit: 26 August, 2011, 06:35:13 AM by IgorC

  • Garf
  • Developer (Donating)
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #45
In the output of bootstrap, is there any semantic difference between "a is better than b" and "b is worse than a"?


No.


  • Garf
  • Developer (Donating)
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #46
Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.

That's actually how bootstrap works.


The result has nothing in particular to do with bootstrap. Friedman's "classic" ANOVA tool shows the same results. The problem is that Chunky has a bug that was throwing away most listeners, and this wasn't realized for a while.

With 280 submissions, quite a few conclusions can be drawn.

What is bothersome is that the first samples were (once again) tested more often and hence weighted more. Maybe next time we should add a small DB and offer sample downloads one by one, after calculating from the DB which samples are the least tested.
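That scheme needs very little machinery: keep a download counter per sample and always hand out the least-tested one next. A sketch (sample names and counts are hypothetical):

```python
# Per-sample download counts serve as the small "DB" (hypothetical numbers).
counts = {"Sample01": 21, "Sample02": 20, "Sample03": 14, "Sample04": 18}

def next_sample(counts):
    """Pick the least-downloaded sample to serve next."""
    return min(counts, key=counts.get)

pick = next_sample(counts)  # the least-tested sample
counts[pick] += 1           # record the new download
```

Ties could be broken randomly so that listeners arriving at the same time don't all get the same clip.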

  • lvqcl
  • Developer
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #47
Basically, the graphics suck, but they look cute 

Exactly.  So, the graphs: 

Ratings for all 20 samples (without low anchor):
[graph]

Sorted by Nero rating:
[graph]

Without Nero, sorted by CT rating:
[graph]

CVBR, TVBR and FhG, sorted by FhG rating:
[graph]

All encoders, sorted by their ratings independently(!):
[graph]

  • IgorC
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #48
I found the first and the last graphs to be particularly informative.

From the first graph it's easy to see that half of the AAC encoders did excellently on male English speech (Sample 18). The same holds for female English speech (Sample 06).
So it's fair to say that modern high-quality AAC encoders perform very well on speech at 96-100 kbps.

I think we missed a sample category such as "speech with some background sounds". It's different from a song with vocals (Sample 17).
Though I haven't seen such samples among the test-sample repositories.

  • lvqcl
  • Developer
Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Reply #49
I think we missed a sample category such as "speech with some background sounds"

Such as French_ad? Or maybe something like rawhide is enough?