Topic: AAC @ 128kbps listening test discussion (Read 63774 times)

  • rjamorim
AAC @ 128kbps listening test discussion
Reply #300
Quote
Can someone enlighten me on the origins of Velvet?
http://lame.sourceforge.net/download/samples/velvet.wav

All I know is that it was submitted by Roel (r3mix).

Does anybody know the artist (Velvet Underground?), title and album of this song? Also, what would be the style? (There is no way to figure it out from just the introduction.)

ff123 already enlightened me about it. Thank you very much.

Details are available at the listening test results page.
Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org

  • bond
AAC @ 128kbps listening test discussion
Reply #301
Quote
Nope. I couldn't decrypt your sample 09 results. It's the only result file that gave me problems in the entire test. I sent it to schnofler so that he can investigate. Sorry about that.

Damn, I shouldn't have tried to manipulate the result files
  • Last Edit: 01 March, 2004, 12:18:30 PM by bond
I know, that I know nothing (Socrates)

  • rjamorim
AAC @ 128kbps listening test discussion
Reply #302
A VERY IMPORTANT STATEMENT

OK. It seems I f-ed up very badly this time.

First, let me specify what ISN'T wrong: The ranking values are absolutely correct, as well as the screening methodology and the statistical calculations.

What is wrong: The error bars.

I didn't check how the error bars were being drawn in the excel spreadsheet I got from ff123. I thought the plots were getting values from a certain cell, but actually the values were hard-coded in the plot building routines.

So, the error bars are to this day the same ones used in his 64kbps listening test. And it affects all my listening tests, both the overall plots and the individual ones.

I can't express how sorry I am.

Tomorrow I'll start fixing all the test results pages. Until I announce the results have been fixed, please disregard them.

In case someone is in a hurry to check the corrected zoomed result plot for the AAC test:
http://pessoal.onda.com.br/rjamorim/screen2.png
The only thing that changed is that iTunes is now clearly first place and Nero is second place.

Again, I'm terribly sorry. I can already feel my credibility going down the drain.

Kind regards;

Roberto Amorim.
  • Last Edit: 05 March, 2004, 07:14:19 PM by JohnV

  • ff123
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #303
Quote
What is wrong: The error bars.

I didn't check how the error bars were being drawn in the excel spreadsheet I got from ff123. I thought the plots were getting values from a certain cell, but actually the values were hard-coded in the plot building routines.

The fault is also mine for not making it perfectly clear how I was drawing the error bars.  Plus I violated an Excel/software rule by not using a spreadsheet as a spreadsheet should be used, instead hard-coding in the error bar values.

Quote
Again, I'm terribly sorry. I can already feel my credibility going down the drain.


Your integrity is intact.  Credibility is a matter of trust.  If you own up to your mistakes, correct them, and prevent future ones, that goes a long way towards enhancing your credibility.

I suggest keeping both the old (incorrect) overall graphs and showing the new, corrected overall graphs side by side, to show the before and after.  I think the individual sample graphs can just be replaced.

ff123

Edit:  You should probably rename the old overall graph and then use the original name of the graph for the corrected one.  That way, websites which link to your overall graphs will be automatically updated.
  • Last Edit: 05 March, 2004, 02:23:15 AM by ff123

  • rpop
  • Global Moderator
AAC @ 128kbps listening test discussion
Reply #304
Quote
Your integrity is intact.  Credibility is a matter of trust.  If you own up to your mistakes, correct them, and prevent future ones, that goes a long way towards enhancing your credibility.

Your integrity is, indeed, intact. I've seen a few other listening tests online, and discussion of their results always stops soon after the tests, with the page receding into internet history. Updating these tests now goes a long way toward proving their reliability will be maintained in the future.
Happiness - The agreeable sensation of contemplating the misery of others.

  • Garf
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #305
Quote
In case someone is in a hurry to check the corrected zoomed result plot for the AAC test:
http://pessoal.onda.com.br/rjamorim/screen2.png
The only thing that changed is that iTunes is now clearly first place and Nero is second place.

Aaaaaah, this explains my previous complaint that the graph didn't seem to align with your written statement about the test significance.

Now it does. iTunes indeed almost beats Nero by a significant margin.

As far as the moral winner is concerned, though:

  • Continuum
AAC @ 128kbps listening test discussion
Reply #306
Quote
As far as the moral winner is concerned, though:


"Moral winner"?

  • rjamorim
AAC @ 128kbps listening test discussion
Reply #307
Quote
Now it does. iTunes indeed almost beats Nero by a significant margin.

Erm.. I use Darryl's method to evaluate ranking positions.

Check, for instance, thear1 in his 64kbps test results
http://ff123.net/64test/results.html

Oggs are ranked second, according to him, although they overlap a little with MP3pro.

To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.
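That tie rule is easy to state in code. A minimal sketch (my own illustration; the scores and LSD value below are hypothetical, not taken from any of the test plots), assuming each error bar spans the mean plus or minus half the Fisher LSD:

```python
def is_tied(mean_a: float, mean_b: float, lsd: float) -> bool:
    """True when one codec's error bar reaches past the other's mean.

    Each error bar spans mean +/- lsd/2, so bar A covers codec B's
    mean score exactly when the two means differ by less than lsd/2;
    equivalently, more than half of the two margins overlap.
    """
    return abs(mean_a - mean_b) < lsd / 2

# Hypothetical scores for illustration:
print(is_tied(4.60, 4.45, 0.20))  # -> False: bars overlap a little, still ranked apart
print(is_tied(4.60, 4.52, 0.20))  # -> True: B's mean sits inside A's error bar
```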

  • Gabriel
  • Developer
AAC @ 128kbps listening test discussion
Reply #308
Quote
I can already feel my credibility going down the drain


Finding, admitting, correcting your own errors only increases credibility I think.

  • guruboolez
  • Members (Donating)
AAC @ 128kbps listening test discussion
Reply #309
Your credibility, your honesty and your honor are now stronger. Thank you.

  • ScorLibran
  • Banned
AAC @ 128kbps listening test discussion
Reply #310
You have nothing to worry about, Roberto... your credibility is quite secure.  Anyone who conducts tests like this will occasionally make a mistake.  It's inevitable.  You took the best approach in resolving it.  Our trust in you is only higher now.

Quote
Quote
Now it does. iTunes indeed almost beats Nero by a significant margin.

...To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.

That's what I had always thought was the case, but it was just an assumption on my part (that I never communicated).  Glad to know it was correct.

  • ff123
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #311
Quote from: ScorLibran,Mar 5 2004, 07:12 AM
...To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.

That's what I had always thought was the case, but it was just an assumption on my part (that I never communicated).  Glad to know it was correct.
To be absolutely correct, a codec wins with 95% confidence, for that group of listeners and set of samples, when the bars do not overlap.  Or to put it another way, 19 times out of 20, those results would not occur by chance.  Any overlap reduces that confidence.  If the bars just barely overlap, there is still quite a high likelihood that that result did not occur by chance.  A reasonable way to describe this situation would be to say that the results are suggestive (if not significant).  Actually, in an ideal world, the graphs would speak for themselves, and there would be no "interpretation" to cause controversy.

If this were a drug test or something else where there is a lot at stake for making the right decision, everything below 95% confidence (or whatever threshold is chosen) would not be considered to be significant.

Also, the test would be corrected for comparing multiple samples, which would make the error bars overlap more.  I personally don't think it's a real big deal if the type I errors in this sort of test (falsely identifying a codec as being better than another) are higher than they would be in a more conservative analysis.  But others, for example on slashdot, can (and do) complain about this sort of thing.

ff123
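The "19 times out of 20" reading can be illustrated with a toy simulation (a sketch with made-up unit-variance scores and group sizes, not data from any of these tests): when two codecs truly sound identical, their sample means still exceed the LSD threshold in roughly 5% of repeated experiments.

```python
import math
import random

random.seed(1)
n, trials = 40, 2000             # hypothetical listeners per codec, repeated experiments

# For unit-variance scores, the 95% LSD is about 1.96 * sqrt(2/n)
lsd = 1.96 * math.sqrt(2 / n)

false_wins = 0
for _ in range(trials):
    # Both "codecs" draw from the same distribution: no real difference
    a = sum(random.gauss(0, 1) for _ in range(n)) / n
    b = sum(random.gauss(0, 1) for _ in range(n)) / n
    if abs(a - b) > lsd:         # looks like a significant win, but isn't
        false_wins += 1

print(false_wins / trials)       # close to 0.05, i.e. about 1 time in 20
```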

  • Garf
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #312
I take it from the previous comment by rjamorim that 'bars' should be interpreted as 'error bars' and 'mean score marker' and not 2x 'error bars'?

  • ff123
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #313
Quote
Check, for instance, thear1 in his 64kbps test results
http://ff123.net/64test/results.html

Oggs are ranked second, according to him, although they overlap a little with MP3pro.

In that test I used an "eyeball" method to rank the codecs when trying to determine an appropriate overall ranking.  People (including me) didn't like the subjectivity involved in that method, so I changed to the method used now, which is to perform another ANOVA/Fisher LSD once the means for each music sample are determined.  The assumption this method makes is that each sample is equally important to the final overall results.  This may not actually be true if, for example, there are lots of people listening to some samples and only a few listening to others.  Also, the choice of samples greatly affects the overall results.

But at least it seems to produce reasonable results, and it's removed the subjectivity involved in the earlier method.

Quote
I take it from the previous comment by rjamorim that 'bars' should be interpreted as 'error bars' and 'mean score marker' and not 2x 'error bars'?


The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.

ff123
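For anyone curious how such an LSD value comes out of the ANOVA numbers, here is a minimal sketch. The formula is the standard one; the MSE and group size below are made up, and a normal quantile stands in for Student's t, which is a close approximation at these listener counts:

```python
import math
from statistics import NormalDist  # stdlib normal quantile

def fisher_lsd(mse: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Least Significant Difference for comparing two group means.

    mse is the mean squared error from the ANOVA; n_per_group is the
    number of ratings per codec.  The normal quantile approximates
    Student's t, which is reasonable for large listener counts.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * math.sqrt(2 * mse / n_per_group)

# Hypothetical numbers, purely illustrative:
print(round(fisher_lsd(mse=0.5, n_per_group=60), 3))  # -> 0.253
```

More listeners per sample shrinks the LSD, which is why the error bars tighten as participation grows.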

  • JohnV
  • Developer
AAC @ 128kbps listening test discussion
Reply #314
Quote
To be absolutely correct, a codec wins with 95% confidence, for that group of listeners and set of samples, when the bars do not overlap.  Or to put it another way, 19 times out of 20, those results would not occur by chance.  Any overlap reduces that confidence.  If the bars just barely overlap, there is still quite a high likelihood that that result did not occur by chance.  A reasonable way to describe this situation would be to say that the results are suggestive (if not significant).  Actually, in an ideal world, the graphs would speak for themselves, and there would be no "interpretation" to cause controversy.

If this were a drug test or something else where there is a lot at stake for making the right decision, everything below 95% confidence (or whatever threshold is chosen) would not be considered to be significant.

Also, the test would be corrected for comparing multiple samples, which would make the error bars overlap more.  I personally don't think it's a real big deal if the type I errors in this sort of test (falsely identifying a codec as being better than another) are higher than they would be in a more conservative analysis.  But others, for example on slashdot, can (and do) complain about this sort of thing.

ff123

Right, well, with 95% confidence for the tested 12 samples:
iTunes is better than Real,FAAC and Compaact
Nero is better than Real and Compaact

With lower confidence for the tested 12 samples:
Nero is better than FAAC (small overlap)

With even lower confidence for the tested 12 samples:
iTunes is better than Nero (a bit bigger overlap than with Nero-FAAC)

Correct?
  • Last Edit: 05 March, 2004, 11:21:34 AM by JohnV
Juha Laaksonheimo

  • Garf
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #315
Quote
The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.

So there shouldn't be any overlap between error bars at all, if I get that correctly, since no overlap between error bar and mean is only half the error length. (And hence my original comment was right).

  • Zed
  • Banned
AAC @ 128kbps listening test discussion
Reply #316
Quote
Quote
Now it does. iTunes indeed almost beats Nero by a significant margin.

Erm.. I use Darryl's method to evaluate ranking positions.

Check, for instance, thear1 in his 64kbps test results
http://ff123.net/64test/results.html

Oggs are ranked second, according to him, although they overlap a little with MP3pro.

To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.

How about this one?

Where is the truth?

  • ff123
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #317
Quote
Quote
The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.

So there shouldn't be any overlap between error bars at all, if I get that correctly, since no overlap between error bar and mean is only half the error length. (And hence my original comment was right).

Yes.  If the error bars do not overlap, that is a difference to 95% confidence.  And yes, iTunes almost beats Nero with 95% confidence.

  • eagleray
AAC @ 128kbps listening test discussion
Reply #318
Is there anything in the testing methodology to assure that iTunes does not sound "better" than the original CD through the addition of some audio "sugar"?

I hope the experts around here do not think this is too off the wall.  For that matter I don't know if there is a way to make any recording sound "better" than the original.

  • ff123
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #319
Quote
this one?

where is the truth?

The biggest weakness of this test IMO is that there were only 3 samples tested, and they made it even worse by combining them into one medley.  Other problems:  IIRC, people were asked to rank the codecs from best to worst, not to compare and rate against a known reference.  I believe the reference was hidden as one of the samples to be ranked.

But the 3 sample medley is really the killer.  They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.

ff123

  • rjamorim
AAC @ 128kbps listening test discussion
Reply #320
Hello.

Thank you very much for your support.

I have been correcting the plots (will upload them later) and so far, it seems very few will change:

-At the first AAC@128kbps test, it only becomes more clear that QuickTime is the winner.
-At the Extension test, it seems Vorbis and WMAPro are no longer tied with AAC and MPC, and now share second place. I'll leave it to others to discuss.
-The 64kbps test results stay the same: Lame wins, followed by HE AAC, then MP3pro, then Vorbis. LC AAC, Real and WMA are still tied at fifth place, and FhG MP3 is still way down the graph.
-The MP3 test stays the same as well.

Regards;

Roberto.
  • Last Edit: 05 March, 2004, 12:24:14 PM by rjamorim

  • ff123
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #321
Quote
Is there anything in the testing methodology to assure that iTunes does not sound "better" than the original CD through the addition of some audio "sugar"?

I hope the experts around here do not think this is too off the wall.  For that matter I don't know if there is a way to make any recording sound "better" than the original.

Yes, the listener is asked to rate the sample against the reference.  The reference is 5.0 by default, so any difference, even if it "sounds better" than the reference, must be rated less than 5.0.

ff123

  • Zed
  • Banned
AAC @ 128kbps listening test discussion
Reply #322
Quote
But the 3 sample medley is really the killer.  They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.

But a small number of ears is also a killer, I guess...

  • ff123
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #323
Quote
Quote
But the 3 sample medley is really the killer.  They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.

But a small number of ears is also a killer, I guess...

They had about 3000 listeners for both the 64 kbit/s and 128 kbit/s tests.  If they had distributed 50 separate samples instead of the one medley, they could have gotten more than 50 listeners per sample.  That's more than enough to make a statistical inference.  In fact, one can do quite well with far fewer.

ff123

  • Garf
  • Developer (Donating)
AAC @ 128kbps listening test discussion
Reply #324
The test also seems to be at least 1.5 years old. A lot has happened with AAC in that time.
  • Last Edit: 05 March, 2004, 01:08:06 PM by Garf