HydrogenAudio

Hydrogenaudio Forum => Validated News => Topic started by: IgorC on 2011-08-23 19:56:36

Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-23 19:56:36
After a long period of preparation, discussion, and running the test itself, the results are finally here.

http://listening-tests.hydrogenaudio.org/i...-a/results.html (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/results.html)

Summary: Apple won, FhG is second, Coding Technologies is third, and Nero is last.

I appreciate everyone who supported the test and participated in it.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: benski on 2011-08-23 20:18:15
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.
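
For concreteness, here's the kind of pairwise rank-sum comparison I mean (just a sketch with made-up scores, using scipy's Mann-Whitney U as the rank-sum test; the real per-listener scores would have to be pulled out of the result files first):

Code:
# Rough sketch of a pairwise rank-sum (Mann-Whitney U) comparison of two
# encoders. The score lists are hypothetical, one value per listener.
from scipy.stats import mannwhitneyu

apple_cvbr = [4.2, 4.5, 3.8, 4.9, 4.1, 4.4]
nero       = [3.9, 4.0, 3.5, 4.6, 3.7, 4.2]

stat, p = mannwhitneyu(apple_cvbr, nero, alternative='two-sided')
print("U = %.1f, p = %.4f" % (stat, p))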
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-23 20:27:12
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.


Completely and utterly false. We're asking listeners to grade on a reference scale, compare to a low anchor, and judge the severity of distortions, not just to rank codecs against each other.

If you're going to claim this only "seems like legitimate", you'd better back up that statement. Specifically, explain why the interval scale used here (and in each and every previous test) suddenly has to be abandoned for an ordinal scale, or why we should drop the ITU-R BS.1116-1 methodology that these tests generally follow. Are you saying the ITU methodology only "seems like legitimate"?
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: C.R.Helmrich on 2011-08-23 20:42:46
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.

Interesting results. I guess I have to add Sample20 to my standard test set at work...

Chris
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-23 20:47:23
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.


Yes (aggregate over all listeners). Note that the graphics are simplified plots, and don't have the correct confidence intervals for the bootstrap (because the tool doesn't support generating them) nor for ANOVA (IIRC, the plots don't consider the blocking).

This is why you'll see overlap in the graphics but not in the bootstrap nor blocked ANOVA results.

Basically, the graphics suck, but they look cute 
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: benski on 2011-08-23 20:52:58
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.

Interesting results. I guess I have to add Sample20 to my standard test set at work...

Chris


Yes, the ANOVA test uses Friedman, which ranks the codecs. The graphs seem to be built on a parametric analysis of the test results, as if they were normally distributed data.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: benski on 2011-08-23 20:54:47
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.


Completely and utterly false. We're asking listeners to grade on a reference scale, compare to a low anchor, and judge the severity of distortions, not just to rank codecs against each other.

If you're going to claim this only "seems like legitimate", you'd better back up that statement. Specifically, explain why the interval scale used here (and in each and every previous test) suddenly has to be abandoned for an ordinal scale, or why we should drop the ITU-R BS.1116-1 methodology that these tests generally follow. Are you saying the ITU methodology only "seems like legitimate"?


Sorry, I only now read the caveat on the results page: "The graphs are a simple ANOVA analysis over all submitted and valid results. This is compatible with the graphs of previous listening tests, but should only be considered as a visual support for the real analysis." My initial reaction was to the box-plot graphs, not to the analysis at the bottom of the page.

The Friedman ANOVA analysis (bootstrap or not) uses rank-based testing.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-23 21:01:26
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.

A rank-sum analysis would be even more unfavorable for the FhG encoder.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: benski on 2011-08-23 21:07:49
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.

A rank-sum analysis would be even more unfavorable for the FhG encoder.


Actually, it gives the same order.  CVBR > TVBR > FhG > CT > Nero
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-23 21:11:13
Actually, it gives the same order.  CVBR > TVBR > FhG > CT > Nero


Yes, but TVBR > FhG with p = 0.00 for rank-sum, though not for bootstrap.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-23 21:11:30
The Friedman ANOVA analysis (bootstrap or not) uses rank-based testing.


(Blocked) ANOVA is a parametric, means-based test. FRIEDMAN is the name of the utility (which, unsurprisingly, also supports Friedman analysis). The result posted is means-based, not rank-based. It's there mostly to allow cross-referencing with older tests and with other statistical packages, which are more likely to support normal blocked ANOVA than the nonparametric variants. Friedman wasn't developed further because it doesn't allow p-value step-down without losing a significant amount of power when there are many comparisons, and because for high-bitrate tests it is no longer clear the results are normally distributed. That's exactly what led to the bootstrap.
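
To illustrate what the bootstrap does (a toy sketch with made-up scores, ignoring the blocking on listeners and samples that the real tool applies):

Code:
# Toy percentile bootstrap of the mean score difference between two codecs.
# The blocking on listeners/samples used by the real analysis is ignored here,
# so treat this purely as an illustration of the principle.
import numpy as np

rng = np.random.default_rng(0)
codec_a = np.array([4.2, 4.5, 3.8, 4.9, 4.1, 4.4])   # hypothetical scores
codec_b = np.array([3.9, 4.0, 3.5, 4.6, 3.7, 4.2])   # hypothetical scores

diffs = []
for _ in range(100000):
    resample_a = rng.choice(codec_a, size=len(codec_a), replace=True)
    resample_b = rng.choice(codec_b, size=len(codec_b), replace=True)
    diffs.append(resample_a.mean() - resample_b.mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])
print("95%% bootstrap CI of the mean difference: [%.3f, %.3f]" % (lo, hi))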
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-23 21:30:31
I should also mention that I participated in this test too. Steve Forte Rio created the ABC/HR sessions and a new key for me, and he has checked my results. (You can find this key in results.zip.)
A big thank you to him for that.

If somebody is interested in analysing the results:

SampleXX - original
SampleXX_1 - Nero
SampleXX_2 - Apple CVBR
SampleXX_3 - Apple TVBR
SampleXX_4 - FhG (Winamp 5.62)
SampleXX_5 - Coding Technologies (Winamp 5.61)
SampleXX_6 - ffmpeg AAC (low anchor)
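
If you script your analysis, a small mapping like this may save some typing (just a convenience sketch; the exact file extension is an assumption):

Code:
# Convenience mapping from the file-name suffix to the encoder, per the list above.
CODECS = {
    "1": "Nero",
    "2": "Apple CVBR",
    "3": "Apple TVBR",
    "4": "FhG (Winamp 5.62)",
    "5": "Coding Technologies (Winamp 5.61)",
    "6": "ffmpeg AAC (low anchor)",
}

def codec_of(filename):
    # e.g. "Sample07_4.wav" -> "FhG (Winamp 5.62)"; plain "SampleXX" -> "original"
    stem = filename.rsplit(".", 1)[0]
    return CODECS.get(stem.rsplit("_", 1)[-1], "original")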
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: zima on 2011-08-23 21:37:46
Maybe there could be a legend for the X-axis abbreviations, at least under the first graph?

FhG, low_anchor* and Nero are almost clear enough (*though "wait, what was it again?" ;p ), but making sense of CT, CVBR, and TVBR might require going back to the test page, which I think should be unnecessary.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: lvqcl on 2011-08-23 22:05:11
It is interesting that the QT TVBR- and CVBR-encoded files are identical for samples #7, 10, 13, and 14. (foobar2000 comparator: "No differences in decoded data found")
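
For anyone without foobar2000, roughly the same check can be scripted (a sketch assuming ffmpeg is on the PATH, with hypothetical file names):

Code:
# Rough equivalent of the foobar2000 bit-comparator: decode both files to raw
# PCM with ffmpeg and compare the streams byte for byte. Since the same decoder
# is used for both files, identical output means identical encoded audio.
import subprocess

def decode_pcm(path):
    return subprocess.run(
        ["ffmpeg", "-v", "error", "-i", path, "-f", "s16le", "-"],
        stdout=subprocess.PIPE, check=True).stdout

a = decode_pcm("Sample07_2.m4a")   # hypothetical file names
b = decode_pcm("Sample07_3.m4a")
print("identical" if a == b else "different")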
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-23 22:11:53
zima,

will fix it later.

It is interesting that the QT TVBR- and CVBR-encoded files are identical for samples #7, 10, 13, and 14. (foobar2000 comparator: "No differences in decoded data found")

Yes, I've noticed that too. It's interesting that listeners still rated them differently (even though they are bit-exact), though that's normal.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Alexxander on 2011-08-23 22:31:51
Thanks to all who participated in this test and to those who made this test possible, especially to IgorC.

My findings are in line with the general results, and I am actually surprised by Nero ending up rather low. It's curious that the CVBR mean is a bit higher than TVBR's, but I suppose this doesn't mean much, as each falls within the other's confidence interval.

In some personal testing about a year ago with Apple CVBR at around 128 kbps, I found it stunningly good, but I never really compared it to Nero (I have used Nero for 2 years now). Is it safe to conclude that if a codec is better at about 100 kbps, it also is at 128 kbps? Or might the quality of tuning differ between quality settings (and therefore bitrates)?

Many thanks again!

Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-23 23:02:16
Is it safe to conclude that if a codec is better at about 100 kbps, it also is at 128 kbps? Or might the quality of tuning differ between quality settings (and therefore bitrates)?


This is a tough question. The quality of the tuning can make a difference. But barring any more information, I'd bet the codec that is better tuned/performing at 100kbps will perform better at 128kbps, too.

You could say that a codec's performance at 100 kbps is a hint, but not proof, of how it will do at 128 kbps.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Dakeryas on 2011-08-23 23:11:19
Many thanks for the test !

Interesting to notice Nero's lesser performance. Even though I'm encoding at much higher bitrates, I should definitely have a look at that qtaacenc thing (huh, Apple stuff).
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-23 23:51:01
I've noticed that the previous version of Nero, 1.0.7.0, produces much better quality than the latest 1.5.4.0 at 96-100 kbps (I've done blind tests, though, so don't hit me with TOS 8).

The only explanation that comes to my mind is that tuning for some bitrates can produce regressions at other ones.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Gornot on 2011-08-24 00:29:19
To be perfectly honest, I am surprised that FhG did so well against Coding Technologies. Since Winamp introduced it, some of my songs have seemed to retain more quality when encoded with the Coding Technologies encoder rather than FhG's.
Too bad I found out about the test two days after it had already closed. I've been anxious to see the results; interesting how Nero did the worst. Great information for future reference.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: /mnt on 2011-08-24 01:23:26
Interesting results. I gotta see if the pre-echo handling of sharp attacks is improved in QuickTime, though.

Sadly I am not surprised that Nero lost. It still has trouble with hi-hats and with certain regressions that were introduced after 1.0.0.7.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: kennedyb4 on 2011-08-24 01:42:03
If it is fair to say that many of the samples were "killer" samples, the performance of CVBR is quite good. I will still continue with TVBR as there are substantial bits saved on "easier" samples.

Thanks again for this learning experience, and for the hard work of all concerned.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Sebastian Mares on 2011-08-24 07:29:58
It appears to me that the low anchor was way too bad. Shouldn't the low anchor be at around the same quality as the contenders, but "slightly" worse than all of them?
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: greynol on 2011-08-24 07:49:59
I was wondering the same thing.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-24 09:35:36
It appears to me that the low anchor was way too bad. Shouldn't the low anchor be at around the same quality as the contenders, but "slightly" worse than all of them?


Not sure about this one; I thought it should "calibrate the scale". (Because the overall quality is so high, it's less needed at the upper end.)

If you don't use an anchor, what happens is that users will tend to slam the slider down for a minor distortion. The anchor serves as a reminder of what "really bad" really is.

It would be more useful if the anchor stayed the same throughout the tests, I guess. Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.

FWIW, this is a somewhat relevant and interesting paper I hadn't seen before:
http://www.acourate.com/Download/BiasesInM...teningTests.pdf (http://www.acourate.com/Download/BiasesInModernAudioQualityListeningTests.pdf)
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Nezmer on 2011-08-24 11:35:08
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-24 12:46:24
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.


Sorry, but this just isn't true for the ffmpeg AAC encoder. Have you actually looked at it? It's reasonably sophisticated, more sophisticated than FAAC for example. It has a real psymodel, 3 different quantization loop algorithms, proper short block switching, etc.

Even so, there's no particular reason to believe a non-sophisticated AAC encoder must suck terribly. Again, FAAC is a good reference.
http://listeningtests.t35.me/html/AAC_at_1...est_results.htm (http://listeningtests.t35.me/html/AAC_at_128kbps_v2_public_listening_test_results.htm)

As far as I can tell, the problem is that it is utterly riddled with bugs and was probably never properly tested and debugged. It might be misdesigned too, but I feel like I'm sticking my neck out here because I could be wrong on that - maybe the current design works fine if you fix the bugs.

The ffmpeg AAC encoder is crap because it's buggy and insufficiently tested. Not because it's missing sophisticated algorithms.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Nezmer on 2011-08-24 18:26:11
Probably the opportunity to test ffmpeg in one swoop was interesting. No idea if it was understood it is *this* bad.


The AAC and Vorbis encoders in FFmpeg/libav were written to produce valid bitstreams without implementing any sophisticated optimisations. So, the results here shouldn't be a surprise.


Sorry, but this just isn't true for the ffmpeg AAC encoder. Have you actually looked at it? It's reasonably sophisticated, more sophisticated than FAAC for example. It has a real psymodel, 3 different quantization loop algorithms, proper short block switching, etc.

Even so, there's no particular reason to believe a non-sophisticated AAC encoder must suck terribly. Again, FAAC is a good reference.
http://listeningtests.t35.me/html/AAC_at_1...est_results.htm (http://listeningtests.t35.me/html/AAC_at_128kbps_v2_public_listening_test_results.htm)

As far as I can tell, the problem is that it is utterly riddled with bugs and was probably never properly tested and debugged. It might be misdesigned too, but I feel like I'm sticking my neck out here because I could be wrong on that - maybe the current design works fine if you fix the bugs.

The ffmpeg AAC encoder is crap because it's buggy and insufficiently tested. Not because it's missing sophisticated algorithms.


I stand corrected.

The AAC encoder still needs `-strict experimental` to be enabled, and I assumed they would distribute a basic encoder first and then gradually implement optimisations.

Looking at the git log of 'aacenc.c', the last four commits contain three fixes and one library change. But before that, the psymodel seems to have been the focus of the work done earlier this year.

How does all this affect the quality of the encoder? I don't know.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: C.R.Helmrich on 2011-08-24 18:27:36
Some bit-rate statistics, like those presented in previous test results. Feel free to double-check and add them to the results page. All data were obtained using foobar v1.1.8 beta 5. If a mean bit rate is given as a range, it means my own calculations differ from the ones reported by foobar.

Code:
Sample   Length[s]  nero     QT CVBR  QT TVBR  FhG      CT CBR   Anchor
--------------------------------------------------------------------------
01       30         109      108      119      120      100      102
02        9          75       94       67       77      100       76
03       13          93      112      102       97      100      101
04       28         102       99       98      113      100      103
05       30          95       97       95       99      100       98
06       20          81       98       84       90      100      105
07       22         109      107      107      125      100      103
08       28          94      105       82       95      100       97
09        9          96       98       95      106      100      104
10       30          98      106      106      101      100       99
11       20          96       97       87      104      100      100
12       15         100      110      101      100      100      100
13       10         101      101      101       95      100       99
14       10          89       97       97      105      100      104
15       19         105      109      113      117      100      101
16       28          90       96       84       91      100      101
17       20         104       97       90      105      100      104
18       18          65       93       67       84      100      102
19       16         106       98       91      101      100       96
20       30          90       96       83       83      100       97
--------------------------------------------------------------------------
Mean     20.3        95-96   101       93-94   100-101  100      100
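
For anyone who wants to redo the calculation, a mean bit rate is essentially the file size in bits divided by the clip length (a rough sketch; counting the whole file includes MP4 container overhead, which is one reason such numbers can differ slightly from foobar's):

Code:
# Rough recomputation of a mean bit rate in kbps from file size and clip length.
import os

def bitrate_kbps(path, length_seconds):
    return os.path.getsize(path) * 8 / length_seconds / 1000.0

print("%.0f kbps" % bitrate_kbps("Sample01_2.m4a", 30))   # hypothetical file name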

Chris
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-25 07:29:58
Some bit-rate statistics, like those presented in previous test results. Feel free to double-check and add them to the results page. All data were obtained using foobar v1.1.8 beta 5. If a mean bit rate is given as a range, it means my own calculations differ from the ones reported by foobar.


Thanks, added to the results page.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Zarggg on 2011-08-25 18:06:10
Just looking for a quick verification on whether I'm interpreting the results properly:

Am I correct in concluding that QuickTime Constrained VBR performed slightly better than QuickTime True VBR, but not by enough to make an obvious difference?
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: greynol on 2011-08-25 18:12:59
CVBR and TVBR are statistically tied.  One did not do better than the other, not even slightly.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-25 18:28:25
Am I correct in concluding that QuickTime Constrained VBR performed slightly better than QuickTime True VBR, but not by enough to make an obvious difference?


The reason is that CVBR was at 100 kbps and TVBR at 95 kbps. That was a limitation of the bitrate scale.
(http://s2.ipicture.ru/uploads/20110707/P1gi45j4.png)
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: greynol on 2011-08-25 19:04:07
That assumes facts not in evidence.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Gecko on 2011-08-25 20:24:24
First, thank you IgorC and everyone involved!

How do I perform and interpret the analysis on a different set of data (e.g. only my personal results)? Here's what I've got so far:
1. From the provided results.zip copy the "Sorted by sample" folder to a new location and delete all unwanted test results (e.g. keep only  34_GECKO_test??.txt).
2. Use chunky to gather the ratings: chunky.py --codecs=1,Nero;2,CVBR;3,TVBR;4,FhG;5,CT;6,ffmpeg -n --ratings=results --warn -p 0.05 --directory="d:\foo"
3. Take chunky's output "results.txt" and feed it to bootstrap: bootstrap.py --blocked --compare-all -p 100000 -s 100000 results.txt > bootstrapped.txt

a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?
b) Can step 1. be done more efficiently?
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Zarggg on 2011-08-25 22:47:26
CVBR and TVBR are statistically tied.  One did not do better than the other, not even slightly.

That answer is just as good for my own edification. Thanks.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: mjb2006 on 2011-08-25 22:50:49
Even though I sent in results, they didn't get included at all, neither accepted nor rejected. I sent them to the correct address on the last day (Aug. 20). Anyway, going through them now, I found that I kind of botched #04, which was the clip from the intro to OMD's "Enola Gay".

The second half of the clip has the hi-hats panned slightly to the right of center. All the encoders except ffmpeg(!) seem to put the hi-hats pretty much dead-center. The panning is so slight that I didn't notice the issue at all until the 5th comparison, which for me was Apple CVBR (Sample04_2). In that comparison, the panning was the only difference I noticed.

At that point, I should've checked the reference clip, but instead I checked my previous answers, and found that on one comparison (which turns out to be ffmpeg), the encoded version sounded really bad, but the hi-hats were panned slightly to the right, so I incorrectly guessed that the panning was an artifact. Oops. So for comparison #5, I guessed that CVBR was the original and I rated the actual original as inferior.

Now I feel like I should've listened to the reference clip at that point and spotted it there, then gone back and changed my answers on the previous comparisons. But instead I just left them as they were:

If I were to go back after noticing the panning issue and listen for it in the comparisons I had already completed, I would've noticed it in #1 and #4, and would've correctly spotted and downgraded my ratings for the encoded clips. But doesn't going back like that, once I've told myself what to listen for, invalidate those results? If I only "naturally" notice the panning some of the time, shouldn't that just be accepted?
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-26 03:30:18
First, thank you IgorC and everyone involved!

 
Thank you too for your complete set of 20 results.

How do I perform and interpret the analysis on a different set of data (e.g. only my personal results)? Here's what I've got so far:
1. From the provided results.zip copy the "Sorted by sample" folder to a new location and delete all unwanted test results (e.g. keep only  34_GECKO_test??.txt).
2. Use chunky to gather the ratings: chunky.py --codecs=1,Nero;2,CVBR;3,TVBR;4,FhG;5,CT;6,ffmpeg -n --ratings=results --warn -p 0.05 --directory="d:\foo"
3. Take chunky's output "results.txt" and feed it to bootstrap: bootstrap.py --blocked --compare-all -p 100000 -s 100000 results.txt > bootstrapped.txt

a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?
b) Can step 1. be done more efficiently?
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)

a) Both are fine, though I'm also interested to hear Garf's take on this.
b) Yes, there is an easier way. There is a "Sorted by listener" folder. Find the folder with your results ("34_GECKO"), rename it to "Sample01", and run chunky on it.
c) You should copy and paste all the per-sample results (results01, results02, ..., results20) into results_AAC_2011.txt, without blank lines or comments. You will end up with 280 results in total (sample01: 21 results, sample02: 20 results, etc.). If you have any issues, compare against the provided "results_AAC_2011.txt".
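
Something like this can do the merging (a rough sketch; the per-sample file names are an assumption, so adjust them to whatever chunky actually wrote out):

Code:
# Concatenate chunky's per-sample rating files into one merged file for bootstrap,
# skipping blank lines. The "resultsNN.txt" naming is an assumption.
import glob

with open("results_AAC_2011_merged.txt", "w") as merged:
    for name in sorted(glob.glob("results[0-9][0-9].txt")):
        with open(name) as f:
            for line in f:
                if line.strip():
                    merged.write(line)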



mjb2006
I do not accept results after the closure of the test (the evening of 20 Aug).
Your results would be discarded anyway.
Your results for samples 03 and 04 are invalid. Two invalid results out of your total of 5 results (01, 02, 03, 04, 05) means a complete discard. Read rules.txt.
I've repeated many times that single results should be sent as soon as possible, so they can be re-done in case of errors.
And your results are dated 26 July. There is nobody to blame.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: mjb2006 on 2011-08-26 05:18:57
I do not accept results after the closure of the test (the evening of 20 Aug). Your results would be discarded anyway.
I'm not upset, and I did not wish to imply that I was arguing about whether my results should have been considered valid. Clearly they are not.

Besides, I see now what happened. On 20 Aug I realized I would not have time to do more tests, so I checked the thread, and you had not yet made your post saying the test was closed, so I RARed my old results (file modification time 16:04:09-0600) and sent them (email time 16:05:34). I see now that you posted in that very short interval (post time 16:04:xx).

And I didn't realize that you would be contacting people about errors and offering them the chance to re-do those tests. This meaning is not at all obvious when you said that sending results early "helps to prevent some simple errors related to ABC-HR application or any other at early stage," which sounds like you're referring to logistical issues and also seems to be the only time you mentioned it in the test thread, not something you "repeated many times."

Anyway, is it normal for ~27% of listeners to have their results discarded?

Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-26 05:24:37
Anyway, is it normal for ~27% of listeners to have their results discarded?

Yes, it is normal. The quality is pretty high.


And I didn't realize that you would be contacting people about errors and offering them the chance to re-do those tests

http://www.hydrogenaudio.org/forums/index....st&p=764051 (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89765&view=findpost&p=764051)
http://www.hydrogenaudio.org/forums/index....st&p=763480 (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89765&view=findpost&p=763480)

Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-26 07:16:19
a) Do I need to look at "Unadjusted p-values:" or "p-values adjusted for multiple comparison:" if I am just checking my own results? In other words: does the "multiple comparisons" refer to multiple listeners or multiple samples (or something else)?


Always look at the adjusted p-values. "Multiple comparisons" doesn't refer to listeners or samples, but simply to the fact that every codec is compared to every other codec. Bootstrap shows "15 comparisons" for this test, so the p-values must be adjusted for this.

This is a cartoon explaining what happens if we WOULDN'T do that:
http://www.xkcd.com/882/ (http://www.xkcd.com/882/)
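
If you want to apply such an adjustment to your own numbers, Holm's step-down method is one common way to do it (a sketch with made-up p-values; the bootstrap tool's own step-down procedure may differ):

Code:
# Sketch: adjust pairwise p-values for multiple comparisons with Holm's method.
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.004, 0.03, 0.04, 0.20]   # hypothetical unadjusted p-values
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='holm')
for p, pa, r in zip(raw_p, adj_p, reject):
    print("raw p=%.3f  adjusted p=%.3f  significant: %s" % (p, pa, r))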

Quote
c) How do I run chunky over all results to get one merged results file like "results_AAC_2011.txt" in results.zip? Right now I get per sample results averaged over all listeners (and results for individual samples which could be merged by hand)


This must be done by hand. Chunky has a bug where by default it slams all listeners together per sample in its final result (so you end up with a result as if only a single person had taken the test).
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-26 07:20:53
CVBR and TVBR are statistically tied.  One did not do better than the other, not even slightly.


More correctly:

One did do better than the other (the means are not equal). But there's still a high enough probability that that result was due to random chance instead of one codec being better than another, so we don't want to make any conclusions based on it.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-26 08:51:57
Always look at the adjusted p-values. "Multiple comparisons" doesn't refer to listeners or samples, but simply to the fact that every codec is compared to every other codec. Bootstrap shows "15 comparisons" for this test, so the p-values must be adjusted for this.

This is a cartoon explaining what happens if we WOULDN'T do that:
http://www.xkcd.com/882/ (http://www.xkcd.com/882/)

Then I completely misunderstood it and I apologize for my wrong answer to Gecko.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Gecko on 2011-08-26 10:28:53
Thank you IgorC and Garf for answering my questions!

In the output of bootstrap, is there any semantic difference between "a is better than b" and "b is worse than a"?

During the test I had the feeling that one (regular) sample was often worse and one often a tad better. Only the former assumption seems to be backed by my data (using the adjusted p-values): ffmpeg < Nero < All other

Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-26 11:20:35
Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.

That's actually how bootstrap works.
For chunky it makes no difference whether a particular sample has 1 or 100 results: it throws away the individual data and works only with the average score per sample (only 20 average scores, because there were 20 samples).

Bootstrap, in contrast, performs its analysis on the whole set of data (280 results for this test), which allows it to establish more statistically significant differences.
Literally every single result is useful.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-27 17:39:16
In the output of bootstrap, is there any semantic difference between "a is better than b" and "b is worse than a"?


No.

Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: Garf on 2011-08-27 17:48:48
Given the high level of quality, I'm surprised that such a strong ranking could be established. I would have expected more ties.

That's actually how bootstrap works.


The result has nothing in particular to do with bootstrap; the FRIEDMAN tool's "classic" ANOVA shows the same results. The problem is that Chunky has a bug that throws away most listeners, and this wasn't realized for a while.

With 280 submissions, quite a few conclusions can be made.

What is bothersome is that the first samples were (once again) tested more often and hence weighted more. Maybe next time we should add a small DB and offer sample downloads one by one, after calculating from the DB which samples are the least tested.
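
Roughly what I have in mind, as a sketch (the table and column names are made up):

Code:
# Sketch of the "small DB" idea: keep a download counter per sample and always
# hand out the least-tested sample next. Table/column names are made up.
import sqlite3

db = sqlite3.connect("listening_test.db")
db.execute("CREATE TABLE IF NOT EXISTS samples "
           "(name TEXT PRIMARY KEY, downloads INTEGER DEFAULT 0)")
db.executemany("INSERT OR IGNORE INTO samples (name) VALUES (?)",
               [("Sample%02d" % i,) for i in range(1, 21)])

def next_sample():
    # Least-downloaded sample first; bump its counter when handing it out.
    name, = db.execute(
        "SELECT name FROM samples ORDER BY downloads, name LIMIT 1").fetchone()
    db.execute("UPDATE samples SET downloads = downloads + 1 WHERE name = ?", (name,))
    db.commit()
    return name

print(next_sample())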
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: lvqcl on 2011-08-27 20:22:04
Basically, the graphics suck, but they look cute 

Exactly.  So, the graphs: 

Ratings for all 20 samples (without the low anchor):

(http://img855.imageshack.us/img855/6668/94838365.png)


Sorted by Nero rating:

(http://img839.imageshack.us/img839/1456/56516717.png)


Without Nero, sorted by CT rating:

(http://img155.imageshack.us/img155/338/65393755.png)


CVBR, TVBR and FhG, sorted by FhG rating:

(http://img849.imageshack.us/img849/7480/48971450.png)


All encoders, sorted by their ratings independently(!):

(http://img560.imageshack.us/img560/816/32513938.png)
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-27 21:46:03
I found the first and the last graphs to be particularly informative.

From the first graph it's easy to see that half of the AAC encoders did excellently on male English speech (Sample 18). The same holds for female English speech (Sample 06).
So it's fair to say that modern high-quality AAC encoders perform very well on speech at 96-100 kbps.

I think we missed a sample category such as "speech with some background sounds". It's different from a song with vocals (Sample 17).
Though I haven't seen such samples among the repositories of test samples.





Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: lvqcl on 2011-08-27 22:20:50
I think we missed a sample category such as "speech with some background sounds"

Such as French_ad (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=40022&view=findpost&p=352243)? Or maybe something like rawhide (http://www.ff123.net/samples.html) is enough?
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-08-27 22:39:11
Yeah
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: no404error on 2011-09-04 04:01:31
CVBR, TVBR and FhG, sorted by FhG rating:

Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-10-01 20:21:34
BTW, if someone wants to organize the next public test, that would be great. It doesn't necessarily have to be me, though I will still be glad to conduct a public multiformat test at 100 kbps (probably next year).
I'd prefer to see a trusted HA member with at least 2-3 years of registration. It would be great if Steve Forte Rio, Garf, /mnt or AlexB conducted a test, since they always help to organize them. Maybe Sebastian Mares would like to come back.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: jukkap on 2011-10-01 20:49:13
How about 48 kbps HE-AAC? Or a low-bitrate multiformat test with a few HE-AAC encoders, Ogg Vorbis, and possibly some other formats included?
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-10-01 21:58:32
Well, this last time we tested LC-AAC encoders at 96 kbps to pick the best of them for inclusion in the future multiformat test at the same bitrate (96 kbps). That public test (multiformat at 96 kbps) has the highest priority for me, and I'm willing to conduct it in the future if there is no other volunteer to organize it.

Obviously, if someone wants to organize a public test at a different bitrate, that will be up to him/her and other members as well. Anyway, some members (me included) will help in any scenario.

Can you say when your encoder will be ready?
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: jukkap on 2011-10-02 05:10:20
Can you say when your encoder will be ready?


It is ready and should be public within a week or two.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-10-10 23:17:54
How about 48 kbps HE-AAC? Or a low-bitrate multiformat test with a few HE-AAC encoders, Ogg Vorbis, and possibly some other formats included?

What about 64 kbps? The latest FhG HE-AAC encoder doesn't have a setting for 48 kbps, only 64 kbps.

The last 64 kbps public test is already partially outdated due to the new FhG HE-AAC and your upcoming Dolby HE-AAC.

Do you volunteer to conduct it?
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: jukkap on 2011-10-20 10:04:09
The last 64 kbps public test is already partially outdated due to the new FhG HE-AAC and your upcoming Dolby HE-AAC.

Do you volunteer to conduct it?


Dolby Pulse is finally out. I'm afraid I have neither the experience nor the skills to conduct a public listening test. I'd rather leave it to specialists.

Anyway, I will offer support if anyone else is interested in conducting a new HE-AAC test. I am interested to see how the new encoder included with my product really performs against Nero, iTunes, and Fraunhofer.
Title: Public AAC Listening Test @ ~96 kbps [July 2011]: Results
Post by: IgorC on 2011-10-23 07:14:32
Late but still here.

Some participants have answered a few questions.

16 persons.

Gender:
15 - male
1 - female

Age (years):
Min – 20
Max – 50
Average – 28.53


Headphones/loudspeakers:
15 persons  - headphones
1 person - loudspeakers

Headphones: Sennheiser (HD 280 Pro, 650, 600, CX500), Razer Moray+, Beyerdynamic (DT800), Heco Victa 300 stereo speakers, Panasonic RP-HC500, Creative EP-630 earphones, Audio-Technica ATH-M50, Grado SR80, Sony MDR-V150, Technics RP-8801 ...
   

Soundcard:
On-board – 6 persons
Not on-board – 10 persons

Creative X-Fi Xtreme Gamer, 82801H (ICH8 Family) HD Audio Controller (rev 03),  Apogee Duet interface, on-board VIA HD (on the Asus P7P55D Pro), creative x-fi,
Creative Audigy 4, Realtek ALC888, Little Dot Mk IV headphone amp    running from M-Audiophile 24/96 soundcard,  Realtek internal soundcard on laptop + FiiO E7 DAC/Amp, Realtek ALC887, ...
On-board soundcards are fine. Nowadays on-board solutions are actually very good.

Operating System:
Win7 – 53.3%
WinXP – 20%
Linux – 20%
MAC  - 6.7%


Computer:
PC – 57%
Notebook – 43%

Fan noise (approximate data):
Moderate – 50%
Low – 43%
High – 7%

Quiet room?:
Yes – 86%
No – 14%

Time of testing:
Morning – 15%
Afternoon – 18.3%
Evening – 50.7%
Night - 16%


The place (room):
Home – 82%
Data server / computer room or office – 18%

Previous participation in public tests:
9 persons – it was their first time.
2 persons – 1 previous public test.
5 persons – 4-5 or more previous public tests.