HydrogenAudio

Hydrogenaudio Forum => Listening Tests => Topic started by: guruboolez on 2004-07-12 00:50:54

Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-12 00:50:54
[span style='font-size:14pt;line-height:100%']PRELIMINARY NOTES[/span]

• My internet access is now very limited. Therefore, the encoders I'm using for my tests are not necessarily the most recent available on the web. Here, the tests were done when Vorbis 1.1 RC1 was released, but I didn't have access to this information…

• This test is something like a work-in-progress. I plan to add more results with time.



[span style='font-size:14pt;line-height:100%']I. PURPOSE OF THE TEST[/span]

Like many people on this board, my principal motivation for audio encoding lies in the possibility of listening to and enjoying music in high quality directly from a computer, which allows very fast browsing and access to an entire record collection. High-quality encoding is a requirement, security a need. I used successively LAME MP3, then Musepack, and now lossless, which offers the security of digital data identical to the CD.
Nevertheless, lossy encoding is still interesting: modern hard disks are not necessarily big enough for whole collections, and I think there are benefits to feeding an expensive digital jukebox with "better than just good" quality encodings – AAC/Vorbis at 128 kbps is fine but perfectible.

The choice of the best lossy encoder isn't really problematic. Musepack (MPC) still wins most approvals, and is considered fully transparent with the --standard preset. But some elements encouraged me to seriously question this leading position of MPC.

• 1/ by occasionally testing the --standard preset of MPC, I discovered that small differences are sometimes audible with ordinary music. Now, if MPC isn't fully transparent at 175 kbps, this format is definitely comparable (which doesn't mean "equal") to other lossy solutions, which suffer from the same reproach.

• 2/ the leading position of MPC was established a long time ago. It was crowned "best lossy format" when the challengers were not very strong: betas of Vorbis, LAME < 3.90, suboptimal AAC encoders. But now there are powerful Vorbis encoders (the recent "megamix" merge looks like a serious challenger), optimized AAC encoders (QuickTime CBR and Nero VBR), and mature MP3 solutions (the VBR presets of LAME). The leading position must therefore be questioned again, at least by people able to detect differences.

• 3/ This challenge becomes necessary with the growing number of devices supporting new audio formats like AAC and Vorbis. MPC is still confined to the computer, or at best to PDAs – and is maybe doomed to this limited usage.


Consequently, I've tried to pit other serious encoding solutions against MPC --standard, in order to get a better, up-to-date and personal idea of the relative quality of this encoder compared to modern and convenient challengers.



[span style='font-size:14pt;line-height:100%']II. CHALLENGERS[/span]

Against Musepack --standard, I decided to field two formats: MP3 with LAME 3.97a3 and Ogg Vorbis with the recent combined encoder named "megamix". Explanations follow.

• first, no [span style='font-size:12pt;line-height:100%']AAC[/span] encoder in the arena. I was tempted to use [span style='font-size:12pt;line-height:100%']Nero AAC[/span], but the last version I have (2.6.2.0) has some recognized quality problems and is doomed to an imminent conceptual death with the third version of Ivan Dimkovic's encoder. No need to test something outdated… I was also tempted to take [span style='font-size:12pt;line-height:100%']QuickTime AAC[/span], though it's not VBR and not very flexible (nothing between 160 and 192 kbps: annoying for a fair comparison with MPC --standard). But this encoder is not really suitable, in my opinion, for HQ listening, at least when the user is fond of opera and most of his CDs absolutely need real gapless playback. AAC will be added later, but for now it's absent from this test.


• the choice of [span style='font-size:12pt;line-height:100%']LAME MP3[/span] version is highly problematic too. Three choices are possible: the last "tested" release (3.90.3), the last gold release (3.96) or the last alpha release (3.97 alpha 3). I've decided not to use [span style='font-size:12pt;line-height:100%']3.90.3[/span]. I know that for some people this encoder is the best MP3 codec ever released; I also know that for historical reasons 3.90.3 is probably the safest choice. But the difference between the dead 3.90.x branch and the active 3.9x one is not only a matter of quality: 3.9x is much faster (not a luxury considering the slowness of the 3.90.x presets), more complete (a full and redesigned VBR preset scale: the nice -V 5 used in Roberto's listening test is, for example, a new feature unavailable in 3.90.x), and last but not least in perpetual evolution. There's nobody to correct flaws in 3.90.x, whereas audible bugs in 3.9x can be corrected or reduced by Gabriel, Robert, Takehiro and the other developers.
I definitively ruled out 3.90.x for another important reason: there's no VBR preset corresponding to the MPC --standard bitrate. --alt-preset standard is clearly too high, --medium too low, the -Y switch a hack, and ABR is probably not efficient enough. With the 3.9x branch, there's an existing preset between --standard and --medium: -V 3. And the -V 3 average bitrate should be close to the MPC --standard one.

Then: [span style='font-size:12pt;line-height:100%']3.96 "gold" or 3.97 alpha[/span]? I've decided on the alpha release. I know the risks (of regression but also of progress). But I also know that 3.96 is buggy in --fast mode: that decided me to use a corrected release, even if the test doesn't concern LAME's fast mode.


• the choice of [span style='font-size:12pt;line-height:100%']vorbis[/span] version is less problematic. Recent tests have been done. CVS/GT3b2 couldn't resist the aoTuV/GT3/QK32 dream team (aka [span style='font-size:12pt;line-height:100%']megamix[/span]), at least up to -q 5,99. And even higher, GT3b2 (the previous reference encoder for high bitrates) doesn't really sound superior (except maybe for one family of problems: micro-attacks). I also recall that I began this test unaware of the release of 1.1 RC1. That encoder nevertheless seems inferior to "megamix" (the essential but maybe "excessive" tunings of Garf, used at bitrates > -q 5,00, are apparently missing from this RC1 version). The use of "megamix" is therefore pertinent, and my test is probably not made obsolete by this encouraging pre-release of oggenc 1.1.


• I haven't forgotten the promising [span style='font-size:12pt;line-height:100%']WMApro[/span]: I was really pleased and even enthusiastic about the quality this format reaches with classical music at mid bitrates. Nevertheless, I didn't include it in the test. First, I had to limit the number of competitors. Then, I'm not familiar with this encoder and don't know which setting is best (which VBR mode? And is the WMApro VBR implementation reliable, or is 2-pass ABR preferable, etc.?). Last: there's still no hardware device for WMApro (though that's not a reason to exclude an audio format from a test including MPC, it's a disappointing situation).



[span style='font-size:14pt;line-height:100%']III. SAMPLES[/span]

Mid/high bitrate tests are, for me at least, especially painful. That doesn't mean I hate them, quite the opposite. ...
The samples are all « classical music », with one exception. I deliberately limited my choice to the music I like. It's not out of snobbery, and it's not an egocentric attitude: other music is much harder for me to ABX, and my motivation would quickly disappear with music I don't really like. In other words, the scope of these results is VERY LIMITED: they concern my subjectivity (and only mine), and a particular genre of samples (natural instruments, recordings made according to high-fidelity principles - and not the marketing "loud" one).
There are solo instruments (organ with Dom Bedos; harpsichord with Fuga; trombones with Orion II), instruments with small accompaniment (cymbals with Krall and Marche Royale, drums with Marche Royale, 2nd part), orchestra (Weihnachts-Oratorium and Platée), chorus (Ich bin der Welt abhanden gekommen) and voice ("Dover, giustizia, amor"). Additional information (artist, performer…) is available in the file tags.



[span style='font-size:14pt;line-height:100%']IV. SETTINGS[/span]

Comparing VBR encoders/settings is problematic. The ideal approach is to fix a target bitrate and then find the corresponding preset for each encoder. I followed the usual (and IMHO the best) methodology: the setting must be chosen against a wide selection of music, and not against the selected samples.
The target bitrate is the average bitrate of the MUSEPACK --standard preset. This average can't be evaluated precisely: it's somewhere between 170 and 180 kbps – approximately 175 kbps. I verified this value on my classical music library, and people have reported similar values with completely different music.
The remaining task is to find the corresponding VBR settings for LAME MP3 and Vorbis "megamix".
This is where the problems begin…
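
For readers who want to check such averages on their own collection, a small helper along these lines is enough (a sketch only: the Python mutagen library, the folder and the file pattern are my assumptions, not part of my test tools). It simply divides the total file size by the total playing time:

Code: [Select]
# Duration-weighted average bitrate of a library (hypothetical helper).
from pathlib import Path
import mutagen  # third-party tagging library; any tool that reports duration works

def average_bitrate_kbps(folder, pattern):
    total_bits, total_seconds = 0, 0.0
    for path in Path(folder).expanduser().rglob(pattern):
        audio = mutagen.File(path)
        if audio is None or not audio.info.length:
            continue  # skip files mutagen cannot read
        total_bits += path.stat().st_size * 8
        total_seconds += audio.info.length
    return total_bits / total_seconds / 1000

print(round(average_bitrate_kbps("~/music/mpc", "*.mpc")))  # ~170...180 expected for --standard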


[span style='font-size:12pt;line-height:100%']4.1. VORBIS SETTINGS[/span]

• The biggest problem lies in the way Vorbis's average bitrate differs, at the same setting, depending on the kind of music encoded. Classical is bitrate-friendly compared to most other stereo, modern material. With the CVS encoder, I estimated this difference at 10…15 kbps on average for -q 5…6. With "megamix" (or other GT3b2-based encoders), the difference might reach 25…30 kbps at the same setting. I don't know what to do…
- by testing vorbis with a -q value corresponding to 175 kbps for classical but 200…210 kbps for pop/rock, people may blame me for pitting an advantaged Vorbis challenger against Musepack.
- by testing vorbis with a -q value corresponding to 175 kbps for pop/rock but 140 kbps for classical, the test would be pointless for me (the winner between mpc@175 and vorbis@145 isn't very hard to guess…).
- by testing vorbis with a half-baked -q value, I fear the test wouldn't correspond to either situation.

• The second big problem is the break in the linearity of the Vorbis quality scale. Between -q 5,99 and -q 6,00, there's a considerable bitrate difference (~10 kbps), which also corresponds to a serious quality difference, at least with vorbis 1.00 – 1.01 (including GT3b2). aoTuV (and therefore "megamix") is based on the same code, but its tuning tried to correct, or at least minimize, the quality gap between the two settings. I discovered that for classical music, the fair Vorbis setting is very close to this 5,99 value. 6,00 is slightly too high, and I could disadvantage MPC by comparing it to vorbis -q 6,00. On the other hand, I have the feeling that -q 6,00 would show the full potential of Vorbis, and that the extra 8…10 kbps could be worth it for daily use. Would someone renounce the correction of a quality bug at such a low price (+5% in file size), especially with archiving in mind? Seriously, I don't think so.

For all these reasons, I've decided to put vorbis megamix into the arena at three different settings:
[span style='font-size:12pt;line-height:100%']-q 6,00[/span]: clearly too "heavy" compared to mpc --standard with non-classical music, but interesting to test against -q 5,99 (to see if the frontier between these two settings still exists with aoTuV/megamix/1.1)
[span style='font-size:12pt;line-height:100%']-q 5,99[/span]: the setting whose bitrate matches mpc --standard for classical music (still too heavy with other music), but maybe suboptimal quality for vorbis
[span style='font-size:12pt;line-height:100%']-q 5,50[/span]: a more universal setting for an acceptable test against mpc --standard. It will be interesting to compare the quality difference between 5,50/5,99 and 5,99/6,00. I suspect (and fear) a much greater jump for the second pair than for the first.


[span style='font-size:12pt;line-height:100%']4.2. LAME SETTINGS[/span]

I discovered that the bitrate of the [span style='font-size:12pt;line-height:100%']-V 3[/span] preset (lame 3.97a3) is really close to the average bitrate of mpc --standard. This holds at least for classical music (I don't have enough material to measure the average bitrate on other musical genres). -V 3 will therefore be tested.
I've also decided to add [span style='font-size:12pt;line-height:100%']-V 2 (--preset standard)[/span]. The bitrate is higher, but I really want to see if this historic leading preset of LAME MP3 is competitive against Musepack. It will also be interesting to see how LAME -V 2 performs compared to vorbis megamix, which is also playable on portable players, but with bad consequences for battery life. A sketch of the full line-up follows.
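
For concreteness, the complete line-up can be scripted like this (a sketch; the binary names mppenc, oggenc and lame are my assumptions – use whatever your builds are called):

Code: [Select]
# The six encodings under test, driven from Python.
import subprocess

def encode_all(wav):
    jobs = [
        ["mppenc", "--standard", wav, "mpc_standard.mpc"],
        ["oggenc", "-q", "5.50", "-o", "vorbis_q550.ogg", wav],
        ["oggenc", "-q", "5.99", "-o", "vorbis_q599.ogg", wav],
        ["oggenc", "-q", "6.00", "-o", "vorbis_q600.ogg", wav],
        ["lame", "-V", "3", wav, "lame_v3.mp3"],
        ["lame", "-V", "2", wav, "lame_v2.mp3"],
    ]
    for cmd in jobs:
        subprocess.run(cmd, check=True)  # abort on the first encoder error

encode_all("sample.wav")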


[span style='font-size:12pt;line-height:100%']4.3. BITRATE TABLE[/span]

Instead of posting a bitrate table for the short samples used in the test, I prefer to post data about more audio material. The average bitrate for ~20 albums (mostly classical), plus additional data for tracks coming from 50 different CDs (+15 others in mono), are available in the following tables:
OpenOffice: http://audiotests.free.fr/tests/200...RATE175kbps.sxc (http://audiotests.free.fr/tests/2004.07/hq1/BITRATE175kbps.sxc)
Excel: http://audiotests.free.fr/tests/200...RATE175kbps.xls (http://audiotests.free.fr/tests/2004.07/hq1/BITRATE175kbps.xls)




[span style='font-size:14pt;line-height:100%']V. RESULTS AND CONCLUSIONS[/span]


(http://audiotests.free.fr/tests/2004.07/hq1/resultshq1.png)


First comment: I've added 10 points to each grade. I had to find a way to prevent misinterpretation of grades which could at first sight appear excessively severe. I didn't use a low anchor for this test, and slight flaws sometimes come across as terribly annoying in such tests, lowering the grades considerably. By artificially shifting all the grades, I also had in mind to disconnect my notation from the EBU scale (4 = "perceptible but not annoying"; 3 = "slightly annoying", etc.).



With only 10 results, I can't draw strong conclusions. But some elements of a conclusion are now appearing:

• [span style='font-size:12pt;line-height:100%']MPC --standard[/span] has a serious chance of being the best of the three competitors. Eight times in first place, once second, and never last. Very good performance. We can also note that the --standard setting wasn't sufficient to reach the "transparency" level (except for the organ sample, with negative ABX tests). Nevertheless, I can seriously expect full transparency at a higher setting: none of these samples (except maybe the chorus one) showed severe artifacts, just slight differences. It's typically the kind of "problem" that disappears at a higher bitrate. Anyway, I'm impressed, because I didn't think MPC --standard was so far ahead...

• [span style='font-size:12pt;line-height:100%']LAME MP3[/span] has, in my opinion, little chance of competing with Vorbis and Musepack at ~175 kbps. The new -V 3 setting sits in last place eight times: too often… even with a limited set of samples. It doesn't mean that -V 3 sounds bad, just that it's inferior to the modern lossy formats at a similar bitrate. But with improvements, who knows...
The -V 2 setting (aka --alt-preset standard), however, is apparently competitive, and can fight with (and sometimes beat) vorbis "megamix" -q 5,50 and -q 5,99. Only problem: the bitrate is no longer the same (195 kbps vs something between 162 and 180 kbps, with classical music only). But it's imperative to point out that LAME -V 2 and -V 3 suffer from huge artifacts (the harpsichord and organ samples are severely wounded to my ears), whereas the Vorbis artifacts were never that bad (except, maybe, with the Orion II sample – micro-attack problems).
In short, LAME -V 2 (--preset standard) is apparently competitive with VORBIS "megamix" -q 5,99, at least with classical music. It would be interesting to see how both contenders perform with other kinds of music at the same settings, which implies a completely different bitrate range (+10…15% for vorbis, and maybe -x% for lame).

• I expected a lot from the [span style='font-size:12pt;line-height:100%']vorbis mixture[/span]. The progress of "megamix" over the CVS encoder is really impressive, and I really wondered how it would perform against the other challengers. In the end I'm disappointed, for several reasons:
- First, the coarse-sounding problem of the format is still audible with "megamix" up to 5,99. No need to suspect any of the GT3b2 or QK32 tunings of ruining the benefits of the original aoTuV in this area: the noise problem is particularly audible in "tonal" passages, encoded with pure aoTuV code (bit-for-bit identical samples between the aoTuV encoder and the megamix one). This additional noise is probably not too disturbing in daily listening, but in direct comparison with the other challengers, the contrast is still annoying. The problem doesn't really lie in the noise, but in the coarse rendering of voices or instruments: lack of subtlety, fat texture… I think this problem is a legacy of internal changes that occurred during the RC3 development of Vorbis, in spring 2002. I think I established this fact at ~128 kbps some months ago (http://www.hydrogenaudio.org/forums/index.php?showtopic=18359) (correct me if I'm wrong), and I suppose it's still true at ~160…170 kbps, even with aoTuV (based on the same buggy "final" CVS code).
- Second reason to be disappointed: because this coarseness problem persists up to -q 5,99, there's still a considerable quality gap between this setting and the rounded -q 6,00. It's my fault: I expected the aoTuV tuning to erase the existing frontier between -q 5,99 and -q 6,00; this encoder only reduced the gap. There's a ~10 kbps difference between 5,50 and 5,99 but little quality improvement. There's also a 10 kbps difference between 5,99 and 6,00, but huge quality progress is audible. For daily use of the Vorbis encoder, this difference poses no real problem: the 10 additional kbps of -q 6,00 are obviously worth it if someone is looking for high quality or archiving, and there's no need to hesitate. But for my test, or any similar one, the difference is much more problematic. On one side, I can't oppose mpc --standard to megamix -q 6,00 on a fair basis (the average bitrates no longer match). And on the other side, it's pointless to compare mpc --standard to a handicapped Vorbis setting (5,99). It's like using Musepack at --quality 4.99, which also suffers from problems (and a bitrate gap) that no longer exist at --quality 5.00. Cruel dilemma…
- Third reason to be disappointed: even at -q 6,00 (and 10 extra kbps), megamix apparently can't reach the quality of Musepack --standard. More samples are of course needed to firm up this tentative conclusion, but I really fear the solution doesn't lie in the selection of samples, but rather in further development.



As I said at the very beginning, I consider this test a first step. Additional results should, and normally will, complete this first phase. I expect a quick release of the Nero AAC encoder in version 3.0.0.0 to add some spice to the test. A separate test, opposing vorbis megamix to the new 1.1, must also be done, in order to be sure that megamix is the best Vorbis encoder at this bitrate.

I'd also like to see this test followed up by other people. It would help to compare the different HQ encoders on an empirical basis. Feel free to post some results, even for a single sample, in this topic.



[span style='font-size:14pt;line-height:100%']APPENDIX. SAMPLE LOCATION[/span]

I've uploaded all the samples to a temporary link. I can't keep them online for long, so don't wait if you're planning to do personal tests. ABX logs are available in each archive. Samples are in the OptimFROG (http://losslessaudiocompression.com/) lossless audio format.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-12 00:52:57
[span style='font-size:14pt;line-height:100%']Additional results (2004.08.22):[/span]

(http://audiotests.free.fr/tests/2004.07/hq1/resultshq18_8add.png)


[span style='font-size:21pt;line-height:100%']Cumulative results:[/span]

(http://audiotests.free.fr/tests/2004.07/hq1/resultshq18.png)


see this post on page 4 (http://www.hydrogenaudio.org/forums/index.php?showtopic=23355&view=findpost&p=236220) for more details.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: westgroveg on 2004-07-12 01:34:25
Quote
The problem doesn't really lie in the noise, but in the coarse rendering of voices or instruments: lack of subtlety, fat texture… I think this problem is a legacy of internal changes that occurred during the RC3 development of Vorbis, in spring 2002.

From what I can understand, this problem disappears after q5.99.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: phong on 2004-07-12 01:36:20
Outstanding work as usual guru.  I don't know if it would really solve the fairness issue, but could you increase the mpc setting to 5.1 or 5.2 or something to make it the same bitrate as megamix at -q 6? The bitrates could be matched that way without having to put either codec on the bad side of one of those annoying "thresholds".
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-12 01:41:33
Quote
From what I can understand, this problem disappears after q5.99.
[a href="index.php?act=findpost&pid=225055"][{POST_SNAPBACK}][/a]

Yes, the complete range between -q -1 and -q 5,99 is affected by this phenomenon. It's easy to notice with the CVS encoders (except 1.1). RC3 and earlier releases are probably free of this problem, and aoTuV/1.1 lowers the amplitude of the coarseness.
It's a great shame that this quality frontier is located so high on the bitrate scale. At -q4 or -q5, it would be less annoying. But here, this fat sound also affects encodings at 170…210 kbps, HQ settings which should be free of this kind of problem.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: westgroveg on 2004-07-12 01:44:05
I think that to be a fair test, MPC should use the q7/Insane profile, & LAME 3.90.3 should also be included using --alt-preset standard; this would put all the formats at a close bitrate.

The results of LAME 3.90.3 against 3.96 are not convincing enough.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-12 01:49:06
Quote
Outstanding work as usual guru.  I don't know if it would really solve the fairness issue, but could you increase the mpc setting to 5.1 or 5.2 or something to make it the same bitrate as megamix at -q 6? The bitrates could be matched that way without having to put either codec on the bad side of one of those annoying "thresholds".
[a href="index.php?act=findpost&pid=225056"][{POST_SNAPBACK}][/a]

It's a solution, but I don't like it. It's not for the reference to fit the challengers, but the opposite. Most people use the --standard preset with mpc. They won't use --quality 5.2 or 5.4 and waste bits.
The first step of excellence for mpc is --standard, which corresponds to ~175 kbps on average with the 1.14 encoder. If the first step of excellence for vorbis megamix lies at -q6, which corresponds to 185...210 kbps, it's a Vorbis handicap (a developers' choice - good or bad, I can't say), which proves that the first encoder is more efficient than the second. In other words, there's an advantage to using mpc: optimal quality is reached at a lower bitrate. A test shouldn't break this balance.

Anyway, even at a lower bitrate, mpc seems to keep some distance from megamix -q 6,00. I don't expect great changes from using a slightly higher setting for Musepack.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: westgroveg on 2004-07-12 01:59:20
Quote
Anyway, even at a lower bitrate, mpc seems to keep some distance from megamix -q 6,00. I don't expect great changes from using a slightly higher setting for Musepack.



Quote
We can also note that the --standard setting wasn't sufficient to reach the "transparency" level (except for the organ sample, with negative ABX tests). Nevertheless, I can seriously expect full transparency at a higher setting: none of these samples (except maybe the chorus one) showed severe artifacts, just slight differences.


I think it would be interesting to see.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-12 02:07:31
Quote
Quote
I don't expect great changes from using a slightly higher setting for Musepack.



Quote
I can seriously expect full transparency at a higher setting: none of these samples (except maybe the chorus one) showed severe artifacts, just slight differences.



It might look like a contradiction, but in my past experience, problems are never solved by adding a few kbps. A more substantial inflation (from standard to extreme, there's a 30 kbps difference) is - I'm sure - needed in most cases to "solve" problems (i.e. to lower the distortion below the tester's threshold of hearing).

Adding 0.2...0.5 points to a quality level is rarely convincing: look at the difference between vorbis 5.50 and 5.99: near nonexistent.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: indybrett on 2004-07-12 02:54:56
Quote
Quote
From what I can understand this this problem disappears after q5.99.
[a href="index.php?act=findpost&pid=225055"][{POST_SNAPBACK}][/a]

Yes, the complete range between -q -1 and -q 5,99 is affected by this phenomenon. It's easy to notice with the CVS encoders (except 1.1). RC3 and earlier releases are probably free of this problem, and aoTuV/1.1 lowers the amplitude of the coarseness.
It's a great shame that this quality frontier is located so high on the bitrate scale. At -q4 or -q5, it would be less annoying. But here, this fat sound also affects encodings at 170…210 kbps, HQ settings which should be free of this kind of problem.
[a href="index.php?act=findpost&pid=225058"][{POST_SNAPBACK}][/a]


Is it coincidence that the problem goes away after q5.99, which also happens to be the point at which lossless channel coupling begins?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: kjoonlee on 2004-07-12 03:16:08
Quote
Is it coincidence that the problem goes away after q5.99, which also happens to be the point at which lossless channel coupling begins?[a href="index.php?act=findpost&pid=225074"][{POST_SNAPBACK}][/a]

Lossless channel coupling can be used below q5.99 as well. Q5.99 and below can use a mixture of lossy and lossless coupling if necessary. Q6 is the point at which lossy channel coupling is no longer used.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: indybrett on 2004-07-12 03:17:50
Quote
Quote
Is it coincidence that the problem goes away after q5.99, which also happens to be the point at which lossless channel coupling begins?[a href="index.php?act=findpost&pid=225074"][{POST_SNAPBACK}][/a]

Lossless channel coupling can be used below q5.99 as well. Q5.99 and below can use a mixture of lossy and lossless coupling if necessary. Q6 is the point at which lossy channel coupling is no longer used.
[a href="index.php?act=findpost&pid=225077"][{POST_SNAPBACK}][/a]

So, is it coincidence that the problem goes away at the point at which lossy channel coupling is no longer used?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: kjoonlee on 2004-07-12 03:21:10
Quote
So, is it coincidence that the problem goes away at the point at which lossy channel coupling is no longer used?

Could be, because q5.99 might have been using lossless coupling exclusively.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Faelix on 2004-07-12 03:33:18
Quote
MPC is still confined to the computer, or at best to PDAs – and is maybe doomed to this limited usage.


It would be wonderful if this best case were true, but no: on my Palm I can only listen to MP3, Ogg Vorbis and WMA. And I know the same applies to PocketPC, besides some obscure AAC player. Musepack is unfortunately really confined to computers.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: QuantumKnot on 2004-07-12 05:15:02
Very interesting test.  It does confirm the well-known weakness of Vorbis with classical music, and more work needs to be put in to correct this.  I'm not sure what causes the difference in quality between -q 5.99 and 6.  The switching off of lossy stereo at -q 6 is one possibility, but point stereo only causes stereo collapse in the high-frequency bands.  Noise normalisation also affects the higher frequencies, and turns off at -q 7 I think, so that may not be the reason either.  Hmm… I don't know. 
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Dologan on 2004-07-12 06:27:17
Quote
Quote
MPC is still confined to the computer, or at best to PDAs – and is maybe doomed to this limited usage.


It would be wonderful if this best case were true, but no: on my Palm I can only listen to MP3, Ogg Vorbis and WMA. And I know the same applies to PocketPC, besides some obscure AAC player. Musepack is unfortunately really confined to computers.
[a href="index.php?act=findpost&pid=225084"][{POST_SNAPBACK}][/a]

Hopefully, not for long.  See here (http://www.hydrogenaudio.org/forums/index.php?showtopic=23362).
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Gabriel on 2004-07-12 09:06:48
Very interesting...
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-12 11:58:17
Thank you for sharing.

Do you have details on the ABX tests? Did you do them for every sample? Did you train before beginning? How many ABX sessions did you perform? What were the results?

Last time you posted something like this, no one cared to perform a statistical analysis in order to rank the encoders with 95% confidence bars. I guess I'll have to do it myself, but since I don't know how, it will take some time.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-12 12:00:29
By the way, did you use ABC/HR ?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-12 12:11:49
Quote
Do you have details on the ABX tests?


If you're talking about the ABX logs and comments, they're all in the .zip archives accompanying each sample. Not the best idea, I must say. I'll upload the log files in a separate, slimmer archive.


Quote
Did you do them for every sample?


Yes, but for some files I gave up ABXing encoded files against other encoded files. Sometimes the difference is very small, and that kind of test needs much more concentration. I've nevertheless tried to compare the encoded files with each other when they share the same kind of flaw, in order to get a better idea of which sounded better.

Quote
Did you train before beginning?


No. I didn't use the latest ff123 ABC/HR software (which offers a training module). The only training I did was with the Diana Krall sample. It's a sample I discovered some time ago, when I noticed that mpc --standard produces audible distortion on the cymbals. I first approached this sample as a dilettante, without ambition, comparing MPC against one Vorbis encoding and one MP3 encoding. Then I decided to head off possible criticism about bitrate by using a wider set of encodings for vorbis and mp3, in order to see how these formats perform even at higher bitrates: at their optimal quality (the "excellence step" for each format: --standard, --alt-preset standard, and -q 6,00).

Quote
How many ABX sessions did you perform? What were the results?


Generally, I stopped when the p-value was low enough. For some files, I ruined the results by making a mistake. Angry, I damaged the results even more. So for some files I went up to 50 trials in order to reach a satisfying p-value again.
But you can find the precise values by downloading the archive from my FTP.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: robUx4 on 2004-07-12 12:13:56
Could you consider adding WavPack hybrid mode (using only the lossy part) at similar bitrates? Because if it doesn't perform badly, it could be a serious alternative (you can have both a lossy and a lossless file).
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-12 12:18:23
Quote
Could you consider adding WavPack hybrid mode (using only the lossy part) at similar bitrates? Because if it doesn't perform badly, it could be a serious alternative (you can have both a lossy and a lossless file).
[a href="index.php?act=findpost&pid=225188"][{POST_SNAPBACK}][/a]

Hybrid encoders perform poorly at this bitrate, at least with classical: they sound terribly noisy, and the coarseness of vorbis is nothing compared to them. These encoders (DualStream and WavPack lossy) are more interesting at ~300 kbps (or maybe lower, with very loud music, like metal). Otherwise, I would have included one of these hybrid encoders.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-12 12:25:56
Log files are more easily accessible >> HERE << (http://audiotests.free.fr/tests/2004.07/hq1/log/)
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: manusate on 2004-07-12 13:17:17
Very interesting as always, Guruboolez. Thank you very much.



Enjoy!
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: dev0 on 2004-07-12 13:44:34
[span style='font-size:8pt;line-height:100%']Celsus' trolling attempt has been split into the Recycle Bin (http://www.hydrogenaudio.org/forums/index.php?showtopic=23383).[/span]
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-12 19:59:47
Since you used sequential ABX tests, with a maximum number of trials equal to 50, and stopped at p<=0.05, then, according to this post (http://www.hydrogenaudio.org/forums/index.php?showtopic=15192&view=findpost&p=151958), the corrected p-value you got for the successful ones is
p=0.1579
We can see from your logs that among the 60 possible original-vs-encoded ABX tests, you succeeded in 21 of them with p<=0.05. If you had been guessing, 9 successes would have been expected instead of 21.
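
The correction is easy to reproduce with a small Monte-Carlo simulation (my own sketch in Python, not the code behind the linked post): simulate a pure guesser who checks the displayed p-value after every trial and stops as soon as it dips under 0.05.

Code: [Select]
import math, random

def abx_pvalue(correct, trials):
    # one-sided binomial: P(at least `correct` right out of `trials`) when guessing
    return sum(math.comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def significance_thresholds(max_trials, alpha):
    # k_min[n] = fewest correct answers out of n that reach p <= alpha
    k_min = {}
    for n in range(1, max_trials + 1):
        for k in range(n + 1):
            if abx_pvalue(k, n) <= alpha:
                k_min[n] = k
                break
    return k_min

def false_positive_rate(max_trials=50, alpha=0.05, runs=100_000):
    # fraction of guessing-only sessions that cross the nominal alpha at least once
    k_min = significance_thresholds(max_trials, alpha)
    hits = 0
    for _ in range(runs):
        correct = 0
        for n in range(1, max_trials + 1):
            correct += random.random() < 0.5
            if n in k_min and correct >= k_min[n]:
                hits += 1
                break
    return hits / runs

print(false_positive_rate())  # ~0.16 -- close to the corrected 0.1579, not 0.05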
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-12 20:18:09
I fed this table into ff123's analyzer (http://ff123.net/friedman/stats.html):

Code: [Select]
MP3-V2    MP3-V3    MPC-q5    MGX-q5.5  MGX-q5.99 MGX-q6    
2.00      1.50      3.00      2.00      2.00      3.20      
1.50      1.00      4.00      2.90      2.90      3.50      
3.00      2.50      2.80      3.00      3.30      4.00      
3.00      2.00      4.00      2.00      2.00      2.30      
1.50      1.00      4.90      2.50      2.50      3.30      
3.00      1.80      3.80      2.20      2.40      3.00      
1.50      1.20      3.50      1.80      2.30      3.40      
1.50      2.70      4.00      2.00      2.00      2.30      
3.00      2.80      4.20      1.60      1.50      3.00      
3.00      2.30      4.00      2.30      2.50      3.50      



I chose Anova, p=0.05, which gives

Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 10
Critical significance:  0.05
Significance of data: 1.24E-08 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total               59          45.38
Testers (blocks)     9           3.67
Codecs eval'd        5          26.01    5.20   14.92  1.24E-08
Error               45          15.69    0.35
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.532

Means:

MPC-q5   MGX-q6   MGX-q5.9 MP3-V2   MGX-q5.5 MP3-V3  
 3.82     3.15     2.34     2.30     2.23     1.88  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MGX-q5.9 MP3-V2   MGX-q5.5 MP3-V3  
MPC-q5   0.015*   0.000*   0.000*   0.000*   0.000*  
MGX-q6            0.004*   0.002*   0.001*   0.000*  
MGX-q5.9                   0.880    0.679    0.088    
MP3-V2                              0.792    0.119    
MGX-q5.5                                     0.192    
-----------------------------------------------------------------------

MPC-q5 is better than MGX-q6, MGX-q5.99, MP3-V2, MGX-q5.5, MP3-V3
MGX-q6 is better than MGX-q5.99, MP3-V2, MGX-q5.5, MP3-V3


Conclusion: if I understand the above properly, for Guruboolez' ears and samples, MPC --standard ranks above everything else, megamix -q 6 ranks above the four remaining settings, and MGX-q5.99, MP3-V2, MGX-q5.5 and MP3-V3 cannot be separated from one another with 95% confidence.
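
For anyone who wants to double-check without ff123's program, the same table can be fed to SciPy's Friedman test, a rank-based cousin of the blocked ANOVA above (this is my sketch, not part of the analyzer):

Code: [Select]
from scipy.stats import friedmanchisquare

# columns of the table above, one list per codec (10 samples each)
mp3_v2   = [2.0, 1.5, 3.0, 3.0, 1.5, 3.0, 1.5, 1.5, 3.0, 3.0]
mp3_v3   = [1.5, 1.0, 2.5, 2.0, 1.0, 1.8, 1.2, 2.7, 2.8, 2.3]
mpc_q5   = [3.0, 4.0, 2.8, 4.0, 4.9, 3.8, 3.5, 4.0, 4.2, 4.0]
mgx_q550 = [2.0, 2.9, 3.0, 2.0, 2.5, 2.2, 1.8, 2.0, 1.6, 2.3]
mgx_q599 = [2.0, 2.9, 3.3, 2.0, 2.5, 2.4, 2.3, 2.0, 1.5, 2.5]
mgx_q600 = [3.2, 3.5, 4.0, 2.3, 3.3, 3.0, 3.4, 2.3, 3.0, 3.5]

stat, p = friedmanchisquare(mp3_v2, mp3_v3, mpc_q5, mgx_q550, mgx_q599, mgx_q600)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.2e}")  # also highly significant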
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-13 02:08:00
First, thanks for the analysis. I can't do this. But...

I wonder: lame -V 3 appeared to sound the worst on 8/10 samples; and on one of the two remaining samples, -V 3 obtained the same grade as vorbis -q 5,50 and a lower grade than -q 5,99. Lame -V 3 sometimes shows weird artifacts (organ, harpsichord) that are not audible with vorbis.
In short, lame -V 3 is eight times worse than vorbis -q 5,99, once equal, and once better, and it has the strongest artifacts. That's why I have no doubt that -V 3 is not competitive against the other contenders.


So how is it possible that a statistical tool concludes that the two encoders are "identical"? To me (I'm unfortunately not a statistician, but I was the tester, not Mr Friedman ) it defies common sense, or at least my overall impression.

Is this kind of analysis suited to results produced by ONE listener on MULTIPLE samples? I saw:
Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of ***listeners***: 10


Could someone enlighten me?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-13 02:17:35
Quote
Is this kind of analysis suited to results produced by ONE listener on MULTIPLE samples? I saw:
Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of ***listeners***: 10


Could someone enlighten me?
[a href="index.php?act=findpost&pid=225435"][{POST_SNAPBACK}][/a]


The tool does make the assumption that if you were to draw a histogram of the music samples by "difficulty" (average rating across all codecs), you would end up with a bell curve.  But even if this assumption is violated, it is robust enough that you'd probably still get a reasonable answer.

Short answer:  you can replace "listeners" with "music samples" to get an indication of which encoder you personally prefer, with Pio's important qualification that the results apply only to you, and only to the group of samples you tested.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-13 02:30:21
But how do you explain the fact that this analysis completely changes the conclusions of the listener? In this example, how could lame -V 3 appear equal to vorbis -q 5,99, for me and for the tested samples, if for me and for the tested samples -V 3 was inferior 80% of the time? It's something I can't understand.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-13 04:03:42
Quote
But how do you explain the fact that this analysis completely changes the conclusions of the listener? In this example, how could lame -V 3 appear equal to vorbis -q 5,99, for me and for the tested samples, if for me and for the tested samples -V 3 was inferior 80% of the time? It's something I can't understand.
[a href="index.php?act=findpost&pid=225440"][{POST_SNAPBACK}][/a]


MGX-q5.99 is better than MP3-V3 with a p-value of 0.088, so it doesn't reach statistical significance, but the numbers suggest it is better.  You'd probably get more definitive results with a handful more samples.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-13 04:13:17
OK. Another question: are these "confidence values" linked to the grades, or to the ABX results?
If I had chosen to stay close to the EBU (or ITU, I never know which) ranking system, with most grades between 4 and 5 (rather than 1 and 4), wouldn't the confidence margin be ruined?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-13 05:07:31
Quote
OK. Another question: are these "confidence values" linked to the grades, or to the ABX results?
If I had chosen to stay close to the EBU (or ITU, I never know which) ranking system, with most grades between 4 and 5 (rather than 1 and 4), wouldn't the confidence margin be ruined?
[a href="index.php?act=findpost&pid=225456"][{POST_SNAPBACK}][/a]


ABX results are not considered at all when the ANOVA results are computed.

It doesn't matter at all whether you use a rating scale from 1 to 5 or from 1 to 10.  The only thing that matters is the relative difference between the codecs.  Also, the fact that the analysis is "blocked" means that the program accounts for the fact that some music samples (the difficult ones) have lower average ratings than others.

The single best way to improve confidence in your results is to listen to as many different samples as possible.
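
If you want to convince yourself, shift the whole table and re-run the analysis; the conclusions don't move. Here is a quick way to see it with SciPy's rank-based Friedman test (a sketch of mine; the blocked ANOVA F-ratio is likewise unaffected by adding a constant, such as guru's +10 offset):

Code: [Select]
from scipy.stats import friedmanchisquare

ratings = [
    [2.0, 1.5, 3.0],  # codec A rated on three samples
    [1.5, 1.0, 2.5],  # codec B
    [3.0, 4.0, 2.8],  # codec C
]
shifted = [[x + 10 for x in codec] for codec in ratings]

print(friedmanchisquare(*ratings))
print(friedmanchisquare(*shifted))  # identical: only relative differences matter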

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-13 11:15:21
Quote
Quote
how could lame -V 3 appear equal to vorbis -q 5,99, for me and for the tested samples, if for me and for the tested samples -V 3 was inferior 80% of the time? It's something I can't understand.
[a href="index.php?act=findpost&pid=225440"][{POST_SNAPBACK}][/a]


MGX-q5.99 is better than MP3-V3 with a p-value of 0.088,
[a href="index.php?act=findpost&pid=225454"][{POST_SNAPBACK}][/a]


In other words, it's not impossible that -V 3 came out inferior 80% of the time by chance, because the differences between the grades are not that big compared to the random variation in your grading.
The result might have been different if I had chosen a threshold higher than 0.05. Which means that although not significant with respect to this threshold, -V 3 is nonetheless likely inferior to q5.99. Likely, but not certain.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-13 12:01:24
Anyway, I plan to progressively add more results over time.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: 2Bdecided on 2004-07-13 13:07:45
Fascinating thread. Thank you guruboolez!

D.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: westgroveg on 2004-07-13 13:16:56
Quote
Anyway, I plan to progressively add more results over time.
[a href="index.php?act=findpost&pid=225516"][{POST_SNAPBACK}][/a]

Great, thanks a lot for sharing your results with us guruboolez.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: phong on 2004-07-13 15:53:15
This may be the thread that pushes me into actually reading some vorbis code.  It would be interesting to find the real culprit behind this 5.99 -> 6.0 discontinuity, or at least eliminate some possibilities.  For example, it would be interesting to produce an encoder (just for testing purposes) that turned on lossless channel coupling at 5 instead of 6.  Based on others' posts, though, I doubt that's the culprit. 

So where's the gremlin in our Cheerios?

Another issue is the whole point of the -q settings...  According to the vorbis docs, if you pick a -q setting, future versions of vorbis should give the same "quality" at that setting but at a lower bitrate.  In the tuned versions that are being produced, mostly the quality has increased, but at the expense of a higher bitrate.  "In theory" the whole scale could be adjusted so that the same -q levels produce the same bitrates, or, if there were some way to quantify quality, the same quality at a lower bitrate.  "In practice" that seems technically difficult, not to mention there is no consistent definition of what each -q level is supposed to achieve, nor a standard corpus of music to benchmark bitrates on.

A common question is what the "transparency setting" for a given codec is.  Strictly speaking, the answer is always "listen for yourself".  For mp3 or mpc, the practical answer is "lame --preset standard" or "mpc --standard".  For vorbis, no one can agree, because nobody ever decided on any particular "excellence step" (to steal guru's terminology, which I hope becomes a meme).  Some will say "start with -q 4 and work your way up", others will recommend -q 5 or -q 6 (which, from these and previous tests, is the one I think is best supported by the evidence).  Even at -q 6, does vorbis even approach the consistency of mpc --standard, or even lame aps?

I guess the good news is that there's suddenly lots of interest in tuning vorbis after what seems like years of inactivity.  Maybe there is finally some progress toward some sort of excellence step.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-13 16:09:47
Quote
This may be the thread that pushes me into actually reading some vorbis code.  It would be interesting to find the real culprit behind this 5.99 -> 6.0 discontinuity, or at least eliminate some possibilities.  For example, it would be interesting to produce an encoder (just for testing purposes) that turned on lossless channel coupling at 5 instead of 6.  Based on others' posts though, I doubt that's the culprit.
[a href="index.php?act=findpost&pid=225576"][{POST_SNAPBACK}][/a]

Uncoupled vorbis encoders were released by QuantumKnot and Aoyumi (or Nyaochi, or Harashin, I can't remember), and the coarseness of vorbis disappeared, even at lower -q settings. But the bitrate is seriously higher.

Anyway, the aoTuV tuning severely reduces this problem. But some traces remain...
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ScorLibran on 2004-07-14 04:07:43
Thanks for the time and effort you put into this test, guru.  It provides invaluable info for those of us interested in these codecs, but without enough time to perform the tests ourselves.

I'd like to perform a similar test using a sample set of rock music.  Though since my hearing sensitivity isn't NEAR what yours is, I may end up not being able to distinguish any differences at this bitrate.  I can at least try, though.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: indybrett on 2004-07-14 04:09:57
@Guruboolez

Do you think Megamix II would improve the results of this test?

Edit: Sorry, I should open that same question up to QuantumKnot
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: QuantumKnot on 2004-07-14 04:16:35
Quote
A common question is what the "transparency setting" for a given codec is. Strictly speaking, the answer is always "listen for yourself". For mp3 or mpc, the practical answer is "lame --preset standard" or "mpc --standard". For vorbis, no one can agree, because nobody ever decided on any particular "excellence step" (to steal guru's terminology, which I hope becomes a meme). Some will say "start with -q 4 and work your way up", others will recommend -q 5 or -q 6 (which, from these and previous tests, is the one I think is best supported by the evidence). Even at -q 6, does vorbis even approach the consistency of mpc --standard, or even lame aps?

[a href="index.php?act=findpost&pid=225576"][{POST_SNAPBACK}][/a]


One of the problems with Vorbis quality is that it doesn't seem consistent.  At -q 4.35, Roberto's 128 kbps listening test showed that aoTuV beta 2 was quite good in quality.  But as we go up the q scale, the bitrate gets consistently higher, yet problems still exist here and there.  There doesn't seem to be a particular q that is transparent.  Either it is pre-echo that kills transparency, or coarse rendering, or something else.  I think more tuning needs to be done in the q 5, 6, 7 range to iron out all these problems.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: QuantumKnot on 2004-07-14 04:18:20
Quote
@Guruboolez

Do you think Megamix II would improve the results of this test?

Edit: Sorry, I should open that same question up to QuantumKnot
[a href="index.php?act=findpost&pid=225773"][{POST_SNAPBACK}][/a]


I think only the wonderful ears of guruboolez or other golden-eared members can answer that question with certainty.  For me, the only concern is whether or not I've missed something again while doing the merging.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-14 11:32:56
Quote
@Guruboolez

Do you think Megamix II would improve the results of this test?

Edit: Sorry, I should open that same question up to QuantumKnot
[a href="index.php?act=findpost&pid=225773"][{POST_SNAPBACK}][/a]

One of the last files I added to this first batch of results is Orion II.wav, which is problematic with non-GT3 vorbis (something like micro-attacks is generated by the trombone). On this sample, the results would probably be much better:

http://audiotests.free.fr/tests/200..._megamix_q5.png (http://audiotests.free.fr/tests/2004.06/results_megamix_q5.png)
http://audiotests.free.fr/tests/200..._megamix_q6.png (http://audiotests.free.fr/tests/2004.06/results_megamix_q6.png)

As you can see, I heard serious improvements with GT3 in the very recent past.

But I don't know if I should retest this sample: is that acceptable?

A second result might improve with megamix: I think it's the Weihnachts-Oratorium sample. There's a short passage with brass, and IIRC the feeling I had during the blind test, a slight blurring was audible with the vorbis encodings. But here I don't think the results could get much better.

For the eight other files, I don't know. Maybe the additional tunings performed by the SVN team have audible consequences on quality with all samples. Good or bad. Megamix II was released before I saw any test of this 1.1 RC1, and before I had tested it.

[edit: in bold]
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: QuantumKnot on 2004-07-14 11:41:28
Quote
For the eight other files, I don't know. Maybe the additional tunings performed by the SVN team have audible consequences on quality with all samples. Good or bad. Megamix II was released before I saw any test of this 1.1 RC1, and before I had tested it.
[a href="index.php?act=findpost&pid=225848"][{POST_SNAPBACK}][/a]


The impression I got from Monty's announcement and the commit logs is that 1.1 RC1 is essentially aoTuV beta 2 with some fixes for bitrate management and a tonality bug of some sort.  Low-pass filter cutoffs have changed (about 18 kHz now for q 4, as opposed to 20 kHz), but I'm not sure if that was in aoTuV beta 2 or is a new tweak.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: indybrett on 2004-07-15 03:04:05
I would really like to see FAAC in future tests, unless it is already known to be so inferior as to not be worth testing.

It's free (sort of), it's gapless, and there are nice encoder/frontends for it.

I could not even guess what quality setting would produce results equal to Lame -APS or Vorbis -q6, or if any quality setting would achieve this level of quality.

Edit: I suppose what I'm really saying is that it would be nice if it were being actively tuned the way Vorbis now is.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-15 03:39:13
Well, I can't test every encoder. I'm not considering faac, because it's not really optimized to my taste. It also suffers from severe problems even at high bitrates, especially with vocals or other tonal signals, due to weird short-block artifacts (see the warbling with compostelle.flac (http://membres.lycos.fr/guruboolez/AUDIO/test_03/compostelle.flac)).
BTW, even in the developer's opinion (Krzysztof, aka knik), faac isn't optimized for high bitrates:

Quote
I don't think faac is very optimized for high bitrates (and it's still not very optimized at all). I usually use it at ~125kbps.

Author: knik
Date:  11-09-03 16:55
source (http://www.audiocoding.com/phorum/read.php?f=1&i=4137&t=4016#reply_4137)

faac has improved with time, but this bug is still present.
Nero AAC or a hypothetical gapless QuickTime AAC encoder are preferable in my opinion, though they are not free, and not as friendly as a CLI encoder like faac.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: indybrett on 2004-07-15 03:41:52
Nero would be great, except you have to buy a rather large software package, and then use external software to encode from FLAC or anything worthwhile.

If only iTunes/QuickTime were gapless...
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: QuantumKnot on 2004-07-15 04:12:46
Quote
Edit: I suppose what I'm really saying is that it would be nice if it were being actively tuned the way Vorbis now is.
[a href="index.php?act=findpost&pid=226062"][{POST_SNAPBACK}][/a]


If someone gave me an iPod, I'd probably be compelled to work on FAAC, since it's in my interest to have a free VBR AAC encoder.  Just kidding.  I wouldn't have much of a clue anyway.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-16 23:22:53
indybrett, or someone with some experience of faac > what setting would give me an approximate bitrate of ~175 kbps? I have only a little experience with faac, and from what I've seen, the quality scale (-q) doesn't appear to correspond to a target bitrate (i.e. -q 100 doesn't seem to output 100 kbps, at least with some material - cf. Roberto's 128 kbps AAC test: the setting was -q 115, not -q 128).

I'm interested in giving faac a chance (at least in a preliminary test), but I'm not really motivated to hunt for the ideal setting myself. I also don't want to test something and be flamed for using false or wrong settings. Help would be appreciated
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: indybrett on 2004-07-17 02:41:21
-q 135 should be pretty close to 175 kbps on rock music.

I don't even know if FAAC can be transparent. At the very least, it would be nice to know how far it has to go to compete with Vorbis and MPC.

Edit: might need to go a bit higher. Maybe -q 140. Depends on the material.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: LoFiYo on 2004-07-17 04:29:18
-q135 should sound OK on non-killer rock/pop samples to most people, but to Guruboolez, IMO it is probably not up to par.

<$0.02>My ears are not especially sensitive or trained at all, but on a personal sample taken from the soundtrack CD of E.T., I had to go up as high as -q210 until the quality reached un-ABXable transparency. </$0.02>

edit: I used the first stable release of version 1.24 from Rarewares (file date = April 25, 2004 - 11:24am).
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-17 12:37:39
Thanks for the report. I very quickly tried to see what bitrate corresponds to -q 135 with faac 1.24. It's apparently close to 150 kbps on average with classical (~150 kbps on common orchestral/lyrical material; ~140 with less complex music such as piano; ~160 with the four solo harpsichord tracks I've encoded so far). A sketch of the procedure follows below.

It's the same problem as with vorbis megamix II. To avoid contestation, I chose three different vorbis settings. Adding two more encodings (faac/-q classical & faac/-q rock) would make my test more difficult, especially for building a correct ranking for each sample (it's a long task with 5 contenders; it would be much longer with 7).
I'll check the overall quality of faac in preliminary tests (next week). If the encoder is competitive, I'll see what I can do. If the codec isn't competitive enough, I'll probably wait for another AAC encoder and, why not, test faac later in a similar test opposing different AAC encoders at the same bitrate.
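
Here is roughly how such a scan can be done (a throwaway sketch: faac's -q and -o flags as in the 1.24 builds discussed here, hypothetical file names, and mutagen used only to read the duration):

Code: [Select]
import subprocess
from pathlib import Path
import mutagen

def encoded_kbps(wav, q):
    # encode once at quality `q`, then compute bitrate from size and duration
    out = Path(f"faac_q{q}.aac")
    subprocess.run(["faac", "-q", str(q), "-o", str(out), wav],
                   check=True, capture_output=True)
    seconds = mutagen.File(out).info.length
    return out.stat().st_size * 8 / seconds / 1000

for q in (125, 135, 150):
    print(q, round(encoded_kbps("sample.wav", q)), "kbps")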
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: indybrett on 2004-07-17 16:45:31
I did another test with a more common CD.

Pink Floyd, The Dark Side of the Moon.

FAAC 1.24
-q 150

Average Bitrate = 175 kbps

Lowest Bitrate = 163 kbps
Highest Bitrate = 194 kbps
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: harashin on 2004-07-22 11:59:42
Hello. Although this test was done last week, I couldn't publish it until today for various reasons. This time I did ABX only, since doing ABC/HR on non-killer samples at high bitrate is too hard for me.

This test features three encoders with the following settings, which I'm interested in right now.

LAME 3.90.3 --alt-preset standard (avg. 203 kbps, the setting for my portable player)
Musepack 1.14 --standard --xlevel (avg. 188 kbps, the setting for my archiving)
Ogg Vorbis 1.1 RC1 -q 6.00 (avg. 191 kbps)

(http://cyberquebec.ca/harashin/highbitrate.png)

These samples, IMO not the kind of killer samples such as castanets or badvilbel, were cropped from tracks I usually listen to. They can be found here (http://harashin.host.sk/samples/samples.html)

123RedLight;
Vocal problems. Vorbis behaved well.

AngelWalk;
Vocal problems. All samples weren't hard to ABX.

BWV1005_vn;
I heard some kind of distortion. Not easy to ABX for all encoders.

BWV565_org;
Noise or distortion.

BWV847_cemb;
Pre-echo or something. Musepack was harder to ABX.

ElBimbo;
Pre-echo for clavichord(?) sound. Vorbis was harder to ABX.

GrosseFuge;
Distortion. Musepack was good.

LadyMacbeth;
Pre-echo for percussions and distortion for trumpets.

Liebestod;
I found distortion during the quiet part (last few seconds).

Marteau;
I heard pre-echo and some problems with the mezzo-soprano.

nigai_namida;
Vocal problems. LAME was harder to ABX.

Edit: Results are included with each sample.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-22 12:15:52
Thanks a lot for the results and samples. I can't download all of them yet; I've just finished marteau.flac (Boulez, I suppose ;)). Apparently the file is corrupted: "error while decoding metadata".
Could someone confirm?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: harashin on 2004-07-22 12:24:28
Quote
I've just finished marteau.flac (Boulez I suppose;)).

Yes, of course. 
Quote
Apparently, the file is corrupted: "error while decoding metadata".
Could someone confirm?
[a href="index.php?act=findpost&pid=228063"][{POST_SNAPBACK}][/a]

Oh, excuse me. I'm not yet familiar with my new hosting space. I'll upload them zipped.

Edit: Updated, played correctly here.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-23 23:42:35
Thank you for your work. However, considering that the number of trials varies, it seems that you performed, like Guruboolez, sequential ABX testing. That renders all the p-values useless. What was the maximum number of trials you fixed before giving up?

I recall that, to avoid any difficulty in interpreting the results, the cleanest way to perform ABX tests is to fix a number of trials before the test begins, then to perform the test once.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-24 00:39:51
Quote
I recall that, to avoid any difficulty in interpreting the results, the cleanest way to perform ABX tests is to fix a number of trials before the test begins, then to perform the test once.

It's easy to say, but harder to do.
By fixing a precise number of trials (16 could be exhausting, given the difficulty of such tests; 8...12 is more realistic in my opinion), there's a BIG risk: finishing the test with non-significant results.
With eight trials, the situation is simple: you can't afford to miss two of them; with 16 trials, errors are less crucial... but with 6 contenders there are 96 trials to perform for one sample, 960 for ten, and listening fatigue is very hard or even impossible to avoid. Fatigue implies more errors, and again the risk of finishing the test with non-significant results.
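A quick calculation shows how big that risk is. The sketch below (plain Python; the 75 % detection rate is an arbitrary assumption for illustration) computes the chance of passing a fixed 8-trial test, which requires 7/8 to reach p < 0.05:

Code: [Select]
from math import comb

def pass_probability(n, min_correct, p_hear):
    """Chance of scoring at least min_correct out of n trials when the
    listener answers correctly with probability p_hear on each trial."""
    return sum(comb(n, k) * p_hear**k * (1 - p_hear)**(n - k)
               for k in range(min_correct, n + 1))

# A fixed 8-trial test needs 7/8 to reach p < 0.05 (7/8 -> p = 0.035).
# Assume a listener who really hears the difference 75 % of the time:
print(pass_probability(8, 7, 0.75))  # ~0.37: the test fails ~63 % of the time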

By "pushing" the test beyond the fixed value, the tester tries to prove that the difference is not placebo, and that he could hear it, and even ABX it. As a tester, I prefer finish a test with 24/30 than with a poor 10/16 due to bad concentration or something else. And in order to avoid fatigue (and therefore errors), I sometime stop the test very quickly when ABX score are perfect after 5...8 trials.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-24 01:53:50
This way of doing things completely screws the results. I recall that the corrected p-values given there (http://www.hydrogenaudio.org/forums/index.php?showtopic=15192&st=0&p=151958&#entry151958), though obtained with simulations, are accurate.
When you are ready to go up to 100 trials, and get p=0.05 in the ABX program, your test has failed, because your real p-value is not 0.05, it is 0.2!! p-values displayed in ABX programs are only valid for tests run either without looking at the results before the test is over, or with a fixed number of trials, and only for tests run for the first time! If you take the test a second time, or if you look at the results before the end while the number of trials is not fixed, then the displayed p-values are plain wrong!
It was discussed in the thread linked above, and in the other thread linked again from there.
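The inflation is easy to check numerically. This Python sketch (parameters arbitrary) simulates a pure guesser who stops as soon as the displayed fixed-test p-value dips below 0.05, within a maximum of 100 trials:

Code: [Select]
import random
from math import comb

def fixed_p(correct, trials):
    """One-sided binomial p-value for a fixed-length ABX test:
    the chance of >= `correct` successes in `trials` coin flips."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def pass_threshold(trials, alpha):
    """Smallest score that an ABX program would display as p <= alpha."""
    for c in range(trials + 1):
        if fixed_p(c, trials) <= alpha:
            return c
    return trials + 1  # no passing score at this length

def sequential_false_positive_rate(max_trials=100, alpha=0.05, runs=20000):
    """Fraction of pure guessers who hit a displayed p <= alpha at some
    point within max_trials -- the real risk behind a 'sequential' pass."""
    thresholds = [pass_threshold(n, alpha) for n in range(max_trials + 1)]
    passes = 0
    for _ in range(runs):
        correct = 0
        for n in range(1, max_trials + 1):
            correct += random.random() < 0.5  # guessing blindly
            if correct >= thresholds[n]:
                passes += 1
                break
    return passes / runs

random.seed(0)
print(sequential_false_positive_rate())  # far above the nominal 0.05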

Quote
It's easy to say, but harder to do.


Exactly! Getting a real p=0.05 is much harder than getting p=0.2, because there is less room for errors. That's why only a real p of 0.05 is considered valid. Otherwise, it would be too easy.

Quote
By fixing a precise number of trials (16 could be exhausting, given the difficulty of such tests; 8...12 is more realistic in my opinion),


I use 16 for easy tests, and 8 for hard ones.

Quote
there's a BIG risk: finishing the test with non-significant results.


The significance is given by the real p-value, and nothing else. If you finish the test with p above 0.05, it just means that there is more than 1 chance out of 20 that you were guessing.
The risk of failure increases because the test is more significant, that's all.

Remember :

Low p value versus high p value
Meaningful result versus not meaningful result
Low probability of guessing versus high probability of guessing
Hard test versus easy test
High risk of failure versus low risk of failure.

All these sentences are mathematically equivalent. If you want to make the test easier, then you want to make it less meaningful.
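For reference, the displayed fixed-test p-values behind these statements come from a simple one-sided binomial sum; a sketch in Python:

Code: [Select]
from math import comb

def abx_p_value(correct, trials):
    """Chance of doing at least this well by pure guessing (one-sided binomial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(abx_p_value(8, 8))    # 8/8   -> 0.0039
print(abx_p_value(7, 8))    # 7/8   -> 0.0352
print(abx_p_value(12, 16))  # 12/16 -> 0.0384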

Quote
Fatigue implies more errors, and again the risk of finishing the test with non-significant results.


You can't avoid this, the test must last longer, in order to allow the hearing system to rest.

Quote
By "pushing" the test beyond the fixed value, the tester tries to prove that the difference is not placebo, and that he could hear it, and even ABX it. [...]I sometime stop the test very quickly when ABX score are perfect after 5...8 trials.
[a href="index.php?act=findpost&pid=228449"][{POST_SNAPBACK}][/a]


This is cheating. The p-value suffers from random variations. What you are doing is waiting for the p-value to get below 0.05 by chance, and deciding to stop there.
Remember when Gabriel got p = 0.003 without listening to anything? (http://www.hydrogenaudio.org/forums/index.php?showtopic=15192&view=findpost&p=151932)
If you want to allow more room for errors, increase the number of fixed trials; but in any case, don't stop before the end if you see the p-value coming down accidentally, unless you use the corrected p-value table linked above.

Your results in this test are entirely based on the ABC/HR ratings you gave, granted that you didn't know which codec was which.
The analysis of your ABX values led to no conclusion. I could only show that you got more successes than expected randomly, but since I don't know the standard deviation of the probability of getting p < 0.05 in a 50-trial sequential test (I only know its value), I can't tell whether you got significantly more positive results than expected, or just randomly more positive results than expected!
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-24 02:40:10
Quote
Quote
By "pushing" the test beyond the fixed value, the tester tries to prove that the difference is not placebo, and that he could hear it, and even ABX it. [...]I sometime stop the test very quickly when ABX score are perfect after 5...8 trials.
[{POST_SNAPBACK}][/a] (http://index.php?act=findpost&pid=228449")


This is cheating.

No, it's not cheating.
- Listening tests in my apartment are far from ideal listening conditions. There's a lot of noise: computer, fridge, phone, street, road, neighbours. It's easy to miss one trial after losing concentration because of a sudden, disturbing noise; then it's easy to miss the next ten trials out of anger.
- Sometimes the first trials are bad for other reasons: focusing on the wrong problem, for example [that's why the training module of abc/hr 1.1 is precious].
- Also, don't forget that the encoded files are not all tested (in the ABX module) under the same conditions: I'm more familiar with the original when I'm testing the sixth and last encoding. For that reason, it's easier to fail at ABXing the first file than the third. My ears are also more saturated on the sixth file than on the third. I might find it necessary to redo the ABX session for the first file, failed not because it was especially difficult but because I was not yet very familiar with the reference. The score will report the failure, but not the reason for it. There are many reasons that could explain a failure, and rather than redoing the whole test, it's preferable for practical reasons to "push" the number of trials. It's easy to understand, I suppose. It doesn't sound like "cheating" to me... Stopping at 10/16 when I know that I could obtain a much better score is not far from cheating either...

An ABX score is not only a score: there's a history behind it. It's like the final score of a soccer match: it says nothing about the quality of the winner or loser. A team can dominate a full match and still finish as the loser. The score would conclude the winner's superiority, but the full match would show the contrary. The same applies to a listening test: bad results can have causes other than difficulty.

If I decide to stop a test after 16 trials, whatever the history of that test, and to publish the result, people would say: "look, guruboolez rated 1L = 3.2, but he is probably guessing according to the ABX score". At least, that's the statistician's conclusion...
If I decide to continue the test, it's not to prove to myself that the difference is really audible, but to publish a decent score. The ABX log files don't say anything about the testing conditions. They don't help to understand why a score is low; they don't show that the tester successfully ABXed the last 32 trials but missed the first 10, but just reveal an enigmatic, unusual, "randomly" stopped 32/42. In my opinion, 32/42 is a much better score than 14/16, if the last 32 trials were correct and the first 10 were bad because I had focused my attention on the wrong problem.


Quote
What you are doing is waiting for the p-value to get below 0.05 by chance, and deciding to stop there.

That's a wrong interpretation.
I sometimes fail a test for one file and stop it at 9/16 (for example). After I have finished the other files successfully, it happens that I resume this failed test. I can't reset the score, so my new attempt begins at 9/16 and not 0/0. If I decide to add 20 more trials, the final number of trials will be x/36, which is unusual. You could conclude it was a random stop when it was an intended one.
When this kind of situation happens (and it happens very frequently), I generally add a few words in the comments. But sometimes I forget, or I'm too tired to bother.
I often stop a test when I succeed in 16 consecutive trials, whatever the final score looks like. The ABX log won't show that.
Again, the reader could be fooled by the bare numbers.

Quote
Remember when Gabriel got p = 0.003 without listening to anything? (http://www.hydrogenaudio.org/forums/index.php?showtopic=15192&view=findpost&p=151932)

Yes, I remember. But he performed the test in one session and stopped randomly, IIRC. As I said before, multiple scenarios are possible for the same score. Gabriel may have stopped randomly after 26 trials, but for someone else, 26 trials could mean 10+16 (a new test inside the global one, with 16 fixed trials).

Quote
If you want to allow more room for errors, increase the number of fixed trials; but in any case, don't stop before the end if you see the p-value coming down accidentally
Again, it's easy to lay down principles... But there's a human tester behind the score or p-value. I could go to 32 trials in order to minimize the impact of ABX errors, but only for one or two contenders, certainly not for 6, at least not at this bitrate.

Quote
The analysis of your ABX values led to no conclusion

[a href="index.php?act=findpost&pid=228474"][{POST_SNAPBACK}][/a]

I never rated the reference. I found all 60 encoded files and rated them very carefully (rating was the most important task of the test, and hierarchy its purpose). I think that's meaningful enough. I don't see what kind of additional conclusion you're trying to build from undescribed ABX scores.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-24 14:03:07
I understand what you are saying. But imagine that you are the one reading the test results. When you see a 20/36 result, how do you know whether it was a one-shot test (failed, sequential or not), or a 4/20 failure followed by a 16/16 success?
According to rule 8, the results must prove to the reader that the difference was audible. Maybe, given the way the test was conducted internally, that is the case, but there is no published result that proves it.

Now, what if the one who did the listening tells it all: 4/20, due to lack of concentration, then 16/16?
First, we want to get rid of placebo and only analyse the results of the blind test. Therefore the comment about concentration can't be taken into account. It is an unproven opinion about the test result.
So we are left trying to compute the p-value of the result, that is, the probability of getting p <= 1/65536 in a series of ABX tests beginning with 20 and 16 trials, and with an unknown sequel if the second test had failed (maybe the guy would have tried 12, then 8, and claimed a success, we don't know). The real result of this test takes at least an hour to compute for the people on this board who have enough math knowledge to sort it out, and it is inaccessible to most members, who never studied probability and statistics.
For example, I can't tell if this 4/20-then-16/16 result has any significance. I don't know if its p-value is above or below 0.01. I think it is probably below 0.05, but I can't prove it in 5 minutes.

A binomial table giving the p-value for fixed ABX tests has been made; it is linked in the FAQ about ABX, and its values are the ones given in all ABX software. It allows anyone to perform tests and publish the results. By not following the standard methodology, one makes one's results unreadable for most of the community and gives a lot of analysis work to the math people of the forum.
We have a tool allowing anyone to analyse ABX test results, use it!
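Such a table is easy to regenerate; a Python sketch printing the minimum passing score for a few fixed test lengths (the lengths chosen here are arbitrary):

Code: [Select]
from math import comb

def p_value(correct, trials):
    """One-sided binomial p-value for a fixed ABX test."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

def min_score(trials, alpha):
    """Minimum correct answers for a fixed test of this length to reach alpha."""
    for c in range(trials + 1):
        if p_value(c, trials) <= alpha:
            return c
    return None  # no passing score at this length

print("trials  p<0.05  p<0.01")
for n in (8, 12, 16, 20, 24, 32):
    print(f"{n:6}  {min_score(n, 0.05):6}  {min_score(n, 0.01):6}")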

In your ABX logs, we can see that you performed a total of 1088 trials. If you had fixed the number of trials to 8 for each sample and codec, you would only have performed 10 x 6 x 8 = 480 trials, all codecs would have been tested, all results would have been understandable, and, most important, the victory of MPC and Vorbis would not have changed! They won with a confidence of 95 % even if all the ABX tests had failed. The rankings show it.

In conclusion, we can't deduce anything from Harashin's results right now. I just hope that the information he will give us about his methodology will help to find some significance in the results, and that he has not done all this in vain.
About your tests, Guruboolez: you see that if you don't follow the standard methodology of fixed, single ABX sessions, there is no point spending so much time ABXing in a way from which no clear results can be deduced. The ABC/HR results are enough, thanks to ff123's analyzer (http://ff123.net/friedman/stats.html), to provide some useful information.

Thanks to this discussion, I think we should be able to write some instructions for ABX testing and include them in the forum rules, as well, maybe, as a tutorial for analyzing ABC/HR results.
By the way, shouldn't the Anova analyzer be included in ABCHR software ? ABX software gives the significance of the result, why shouldn't ABC/HR do the same ?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-24 14:37:10
Quote
About your tests, Guruboolez: you see that if you don't follow the standard methodology of fixed, single ABX sessions, there is no point spending so much time ABXing in a way from which no clear results can be deduced. The ABC/HR results are enough, thanks to ff123's analyzer, to provide some useful information.

I understand. I've tried my best to publish "valid" results, in order to avoid criticism like "hmm, he rated some files, but it doesn't prove to us that he could really hear a difference". But in my opinion, even if the ABX tests could be interpreted as sequential due to the disparity in the total number of trials, and even if the p-value degrades from 0.05 to 0.2 because I didn't respect the number of trials I had fixed beforehand, the ABX scores I obtained are certainly better than nothing. If a statistician can't be happy, another reader with common sense could say: "well, 39 out of 59, pval = 0.009 for the Dover Giustizia.mpc sample, it's probably not luck".
Quote
Now, what if the one who did the listening tells it all: 4/20, due to lack of concentration, then 16/16?
First, we want to get rid of placebo and only analyse the results of the blind test. Therefore the comment about concentration can't be taken into account. It is an unproven opinion about the test result.


If someone wanted to adopt a suspicious attitude toward the results, there would be no need for him to examine the validity of the ABX scores and the real p-value they imply: he could simply question the authenticity of the log file.
All these results are based on trust: trust in the methodology, trust in the listener, trust that he tried to prove that a difference really exists and that he could hear it. Failing (partially or completely) an ABX test may lead to the conclusion that the listener probably can't hear a difference. This conclusion is wrong: multiple ABX sessions are not always a good thing. Differences are sometimes very subtle and don't survive an intensive test like ABX. That's why some people have tried to perform long-term ABX tests, to prove that a difference could be audible under other listening conditions. I've tried to obtain "valid" (or failing that, "good") scores in listening conditions that were not ideal. I'll probably continue this way... probably not the "best" or "ideal" way, but probably the most practical given the difficulty of such tests.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-24 15:03:39
Quote
If a statistician can't be happy, another reader with common sense could say: "well, 39 out of 59, pval = 0.009 for the Dover Giustizia.mpc sample, it's probably not luck".


Keep in mind that if someone says this, we will fight this interpretation, since it is wrong and spreads misinformation.

Quote
If someone wanted to adopt a suspicious attitude toward the results, there would be no need for him to examine the validity of the ABX scores and the real p-value they imply: he could simply question the authenticity of the log file.
All these results are based on trust: trust in the methodology, trust in the listener, trust that he tried to prove that a difference really exists and that he could hear it.


I don't think so. Audiophiles are not evil. When they say they can hear a difference, they aren't lying in order to fool us; they really believe they do.
The widespread existence of strong placebo effects has led us to listen not to opinions, but to facts. Opinions about sound quality are honest. Often wrong, but 99.9 % of the time honest. So are log files. We can trust them 99.9 % of the time. But unlike opinions, they are facts that can be interpreted.

Quote
Failing (partially or completely) an ABX test may lead to the conclusion that the listener probably can't hear a difference. This conclusion is wrong


The right conclusion is that he didn't hear the difference, at least when he made the mistakes. For the rest of the test, we can't know. No proof. We're still waiting for a positive result.

Quote
That's why some people have tried to perform long-term ABX tests, to prove that a difference could be audible under other listening conditions.


I remember the 24-bit vs 16-bit test, passed after several days, but as far as I remember, it was not a sequential test, was it?

Quote
I've tried to obtain "valid" (or failing that, "good") scores in listening conditions that were not ideal. I'll probably continue this way... probably not the "best" or "ideal" way, but probably the most practical given the difficulty of such tests.


Do as you wish, but keep an eye on the forum rules and tutorials: they might soon be updated to point out the non-significance of such results, if other specialists agree. Once that's done, interpreting such results as a success will be a violation of rule 8.

Whatever way you ABX, keep up the good work with ABC/HR, it is deeply appreciated!
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-24 15:38:14
Quote
Quote
If a statistician can't be happy, another reader with common sense could say: "well, 39 out of 59, pval = 0.009 for the Dover Giustizia.mpc sample, it's probably not luck".


Keep in mind that if someone says this, we will fight this interpretation, since it is wrong and spreads misinformation.

Misinformation? Could you be more precise? What chance do I have of obtaining this result by guessing? I wonder... I can't obtain this kind of result by ABXing MPC Q10 in ideal conditions, but I can obtain it in bad conditions with MPC Q5. It's definitely not luck.

Quote
We can trust them 99.9 % of the time. But unlike opinions, they are facts that can be interpreted.

The problem is that SCORES are not simple FACTS.
What are you trying to prove when:
- REF vs CODEC_A = 14/16
- REF vs CODEC_B = 18/22
- REF vs CODEC_C = 8/8
and the ratings are:
- CODEC_A = 3.5/5
- CODEC_B = 2.0/5
- CODEC_C = 4.5/5
What are your conclusions here? I'm interested.

Quote
Quote
Failing (partially or completely) an ABX test may lead to the conclusion that the listener probably can't hear a difference. This conclusion is wrong


The right conclusion is that he didn't hear the difference, at least when he made the mistakes. For the rest of the test, we can't know. No proof. We're still waiting for a positive result.

No proof of what? If you take a look at the log files I've posted, I sometimes add comments about the score's evolution. Apparently you're not taking this into account, because you don't know how to compute this situation.

Quote
Quote
That's why some people have tried to perform long-term ABX tests, to prove that a difference could be audible under other listening conditions.


I remember the 24-bit vs 16-bit test, passed after several days, but as far as I remember, it was not a sequential test, was it?

I'm not talking about 16 vs 24 bit, but about people trying to ABX high-bitrate encodings after listening to the same disc many, many times.

Quote
Quote
I've tried to obtain "valid" (or failing that, "good") scores in listening conditions that were not ideal. I'll probably continue this way... probably not the "best" or "ideal" way, but probably the most practical given the difficulty of such tests.


Do as you wish, but keep an eye on the forum rules and tutorials: they might soon be updated to point out the non-significance of such results, if other specialists agree. Once that's done, interpreting such results as a success will be a violation of rule 8.


I'd like to see that. The consequences would be funny. Most listening tests already done would simply be invalid. Roberto's tests should be removed from the news, because they don't respect some scientific conditions, for practical reasons (p-value of 0.01, too few samples, not enough listeners, disparity between critical and casual listeners, etc.; ff123 already pointed out those limits (http://www.hydrogenaudio.org/forums/index.php?showtopic=19190&view=findpost&p=189263)). All HA tacit knowledge would have to be eradicated, because proof of MPC's superiority against other contenders was never published (yet it's a common and shared idea). The "recommended encoder and settings" threads could simply be erased, except maybe for the well-tested 3.90.3. The GT3b2/aoTuV/Megamix... recommendations are all based on invalid tests. Enforce the rule 8 conditions, and the only "valid" tests you'll see will be at 32 or 64 kbps. HA will be a place of reliable knowledge about low-bitrate audio encoding, and a vast desert of uncertainty elsewhere, because no tester on this board will have enough courage to risk publishing a listening test that follows the "rules".
Tests and the evolution of knowledge were possible on this board because absolutely strict conditions were never required. Limits on rigour were always accepted for practical reasons, and even with this relaxed attitude, few testers post results. I don't know what you or anyone else on this board can expect from more exactness. Chaos? Assumptions only?

Quote
Whatever way you ABX, keep up the good work with ABC/HR, it is deeply appreciated!

Don't be sarcastic or insincere. I doubt that something considered invalid and soon-to-be illegal can really be appreciated.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-24 17:02:36
Quote
By the way, shouldn't the Anova analyzer be included in ABCHR software ? ABX software gives the significance of the result, why shouldn't ABC/HR do the same ?
[a href="index.php?act=findpost&pid=228569"][{POST_SNAPBACK}][/a]


The Anova analyzer is meant to look at the results of either multiple listeners or multiple samples.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-24 17:29:52
Guruboolez, I've got no time to answer your last post right now, but since you ended on a negative note, I'd like all the same to clarify one point now: I think you don't understand the meaning of the Anova analysis that Roberto performed in his tests, and that I performed on yours.

For a chosen confidence level, it gives the result of the test. For example, in this test, it shows that you found MPC superior to Vorbis, and Vorbis superior to the rest, with p < 0.05. It means that there is less than one chance in 20 that you rated them higher accidentally. Which makes your results (as well as those of Roberto's tests) perfectly valid. That's why I was thanking you. I'm not in the habit of thanking people sarcastically, nor of posting ambiguous messages.

I just wanted to point out that we are making a fuss about ABX methodology, and that it has nothing to do with your test results; people who can't be bothered to read everything we post would otherwise think that I'm disputing your conclusions, while I'm just discussing the possible analysis of your ABX results, which nearly nobody reads anyway, since they are hidden as an addendum to your log files.

Ratings are meaningful without the need for ABX tests, because there is no way (or, to be precise, a probability inferior to the p-value) that one codec comes first every time if you don't know which one is which, since the ABC/HR software hides them. Anova is a way of computing the p-value for this event.

So, to make it short, we have MPC > Vorbis > other codecs with p < 0.05 in your test (I didn't compute the results for other p-values).
The ABX results reported in your logs don't provide much more information (or it is hidden from me), nor do Harashin's.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-24 17:31:52
I said precisely

Quote
Whatever way you ABX, keep up the good work with ABC/HR, it is deeply appreciated!


ABX results are one thing: those results are not meaningful, so claiming they are positive is a rule 8 violation.
ABC/HR ratings are another thing: those results are meaningful, and the work is appreciated.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-24 17:50:12
Quote
ABX results are one thing: those results are not meaningful, so claiming they are positive is a rule 8 violation.
ABC/HR ratings are another thing: those results are meaningful, and the work is appreciated.

ABC/HR ratings without ABX confirmation don't count for much... It's a blind test, OK, but not a double-blind one. Such tests won't be really and genuinely accepted. Look at the LAME (3.90.3 vs. new release) testing phase for example:

Quote
4. Your test results have to include the following:

    * ABX results for
      3.90.3 vs. Original
      3.96 vs. Original
      3.96 vs. 3.90.3
    * ABC/HR results are appreciated especially at lower bitrates, but shouldn't be considered a requirement.
    * (Short) descriptions of the artifacts/differences

[a href="http://www.hydrogenaudio.org/forums/index.php?showtopic=20715]http://www.hydrogenaudio.org/forums/index....showtopic=20715[/url]

Those conditions are required. Ratings without ABX tests are often considered useless. ABX tests are required, especially those opposing the different encoders to each other. So please don't tell me that plain ABC rankings are appreciated when other threads and people's reactions show that, without ABX confirmation, these ratings are treated as hot air...
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-24 17:57:05
When comparing different codecs in abchr.exe, the purpose of the abx module is really just to help clarify in the listener's mind how he thinks things should be rated in the abc/hr module.  Pio's point is that abx results by themselves (without the ratings) don't say anything about the relative standings of the codecs.  I agree with that.

ABX:  purpose is to determine if an individual can reliably detect a difference between 2 files using multiple trials.

ABC/HR:  purpose is to determine preference between 2 or more codecs, but not necessarily reliably!  Multiple listeners or multiple samples increase reliability for ABC/HR in the same way the multiple trials increase reliability for ABX.  Generally, it is more important that multiple samples be tested than multiple people.

The helper role of the abx module in abchr.exe version 1.1 (I need to spend a little time to clean up the last few minor bugs) is further emphasized since it unhides the hidden reference after a successful abx run.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-24 18:18:12
Quote
(...) Pio's point is that abx results by themselves (without the ratings) don't say anything about the relative standings of the codecs.  I agree with that.

I also agree. That's why I spent much more time and attention on the rating phase. When testing many encoders, I'm only interested in the hierarchy (the best, the second best, etc.). ABX scores can't reveal anything about quality, or even about difficulty. I also agree that harashin's results don't give me any information about the relative quality of the three different formats; I just know that there's a serious chance that he heard differences between the encoded files and the reference.
In my opinion, ABX phase is useful for three things:

• helping me refine the ratings (I often lower or raise a rating after ABX tests).

• reassuring myself that I wasn't dreaming up artifacts when rating the different encoders (i.e. avoiding placebo). Useful when, during the ABC/HR phase, I've ranked two or more files with a slight difference: if I can't ABX this difference, I often change the rating and give the same one to both files [or sometimes, even if I failed to ABX the difference, I keep a slight difference of 0.1 point for the file I still suspect sounds better].

• giving others the feeling (or in the best case, the proof) that the differences were really audible. I'm sorry to repeat it again, but I consider something like 45/60 better than nothing, at least when I ended the test with a nice consecutive series of correct trials.


P.S. What does "HR" stand for in the name "ABC/HR"?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-24 18:23:35
Quote
P.S. What is the meaning of "HR" in "ABC/HR" name?


Hidden Reference

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-24 20:49:43
Quote from: guruboolez,Jul 24 2004, 03:38 PM
Quote from: Pio2001,Jul 24 2004, 03:03 PM
Quote from: guruboolez,Jul 24 2004, 02:37 PM
"well, 39 out of 59, pval = 0.009 for Dover, Giustizia.mpc sample, it's probably not luck".[{POST_SNAPBACK}][/a]
(http://index.php?act=findpost&pid=228578")
Keep in mind that if someone says this, we will fight this interpretation, since it is wrong and spreads misinformation.

Misinformation? Could you be more precise? What chance do I have of obtaining this result by guessing?


In the case of a sequential ABX test, the p-value can't be 0.009 for 39 out of 59, since that is the p-value for a fixed ABX test. Saying pval = 0.009 is misinformation. The maximum number of trials must be known, and the corrected p-val table (http://stud4.tuwien.ac.at/~e0025119/CorrPVal5.xls) must be extended to this number to get the right value.

Quote from: guruboolez,Jul 24 2004, 03:38 PM
What are you trying to prove when:
- REF vs CODEC_A = 14/16
- REF vs CODEC_B = 18/22
- REF vs CODEC_C = 8/8
and the ratings are:
- CODEC_A = 3.5/5
- CODEC_B = 2.0/5
- CODEC_C = 4.5/5
What are your conclusions here? I'm interested.


My conclusions are that codec B must be a bit underrated, since an "annoying difference" couldn't be distinguished from the original 4 times (unless the tester states that he hit the wrong button). I don't know how to interpret the ABX scores, since I don't know whether they were run in a sequential or a fixed way. From a fixed-test point of view, however, the confidence level is high.


Quote from: guruboolez,Jul 24 2004, 03:38 PM
No proof of what? If you take a look at the log files I've posted, I sometimes add comments about the score's evolution. Apparently you're not taking this into account, because you don't know how to compute this situation.


Exactly. I'm not going to spend a whole weekend trying to analyse partly sequential ABX results with additional conditions, with pages of calculations, while we have a binomial table that gives us the result at once if the number of trials is fixed in advance, especially after most people on this board have hammered home (but I'm not sure if I repeated it in the ABX tutorial) the necessity of fixing the number of trials before the test begins OR not looking at them during the test, for the results to be valid.

Quote from: guruboolez,Jul 24 2004, 03:38 PM
Quote
Quote from: guruboolez,Jul 24 2004, 02:37 PM
That's why some people have tried to perform long-term ABX tests, to prove that a difference could be audible under other listening conditions.

I remember the 24-bit vs 16-bit test, passed after several days, but as far as I remember, it was not a sequential test, was it?

I'm not talking about 16 vs 24 bit, but about people trying to ABX high-bitrate encodings after listening to the same disc many, many times.


So what? Long term or short term doesn't change the methodology... Either the number of trials is fixed, or you don't look at the results until the test is finished, or you fix a maximum number of trials and use the corrected p-val table (http://stud4.tuwien.ac.at/~e0025119/CorrPVal5.xls). All three methods are valid for short- or long-term tests.

Quote from: guruboolez,Jul 24 2004, 03:38 PM
I'd like to see that. The consequences would be funny. Most listening tests already done would simply be invalid. Roberto's tests should be removed from the news, because they don't respect some scientific conditions, for practical reasons (p-value of 0.01, too few samples, not enough listeners, disparity between critical and casual listeners, etc...


Roberto's results are perfectly valid:
- The tests were double-blind
- The p-value is strictly below 0.05 (<0.01 is a good thing, <0.05 is required)

Quote from: guruboolez,Jul 24 2004, 03:38 PM
ff123 already pointed out those limits (http://www.hydrogenaudio.org/forums/index.php?showtopic=19190&view=findpost&p=189263).


The limits pointed out by ff123 have nothing to do with the validity of the test results, but with the scope of the test. In the same way, your test is valid in itself, because you got a success with p < 0.05, but the scope is very narrow, because you were the only listener, and it is not certain that someone else would get the same (valid) results. It's like saying "this man is taller than this woman". The test consists of measuring them. The results are:
Man: 181 cm
Woman: 176 cm.
The right conclusion is "this man is taller than this woman". The result is valid, proven by a repeatable experiment on the same pair of people. But the scope is very narrow: we can't conclude that every man is taller than every woman.

Quote from: guruboolez,Jul 24 2004, 03:38 PM
All HA tacit knowledge would have to be eradicated, because proof of MPC's superiority against other contenders was never published (yet it's a common and shared idea).


You just published it (implicitly) at the top of this thread. Your results are valid, V.A.L.I.D. Didn't you read the Anova log I posted and its conclusion?
Here's the first link I found about Anova while searching the web: http://www.psychstat.smsu.edu/introbook/sbk27.htm
The column it talks about refers to another piece of software, and the value discussed is the p-value.

Quote
If the number (or numbers) found in this column is (are) less than the critical value (α) set by the experimenter, then the effect is said to be significant. Since this value is usually set at .05, any value less than this will result in significant effects, while any value greater than this value will result in nonsignificant effects.
If the effects are found to be significant using the above procedure, it implies that the means differ more than would be expected by chance alone. In terms of the above experiment, it would mean that the treatments were not equally effective. This table does not tell the researcher anything about what the effects were, just that there most likely were real effects.
If the effects are found to be nonsignificant, then the differences between the means are not great enough to allow the researcher to say that they are different. In that case, no further interpretation is attempted.


Quote from: guruboolez,Jul 24 2004, 05:50 PM
ABC/HR ratings without ABX confirmation don't count for much... It's a blind test, OK, but not a double-blind one.


First, yes it is. Your computer is hiding the names of the samples, and you have no other way of finding the reference than your ears. Therefore the test IS double-blind.
A simple blind test would be, for example, a listening test between a copied CD and an original one, with someone putting the CD in the drive to make you listen to it. Listening to what he does with the CD he takes out of the drive, you might tell whether he is putting the same one back in or setting it aside and inserting the other. This is a simple blind test. For it to become double-blind, you'd have to use 10 identical CD players, each with a CD hidden inside. You're left alone in the room and must tell which drives hold an original and which ones hold a copy. This is a double-blind test, because you can't be influenced by the operator. Fortunately, computers allow us to hide and play samples without any means for us to guess which one is playing.

Quote from: guruboolez,Jul 24 2004, 05:50 PM
Such tests won't be really and genuinely accepted. Look at the LAME (3.90.3 vs. new release) testing phase for example:

Quote
4. Your test results have to include the following:

    * ABX results for
      3.90.3 vs. Original
      3.96 vs. Original
      3.96 vs. 3.90.3
    * ABC/HR results are appreciated especially at lower bitrates, but shouldn't be considered a requirement.
    * (Short) descriptions of the artifacts/differences

[a href="http://www.hydrogenaudio.org/forums/index.php?showtopic=20715]http://www.hydrogenaudio.org/forums/index....showtopic=20715[/url]

Those conditions are required. Ratings without ABX tests are often considered useless. ABX tests are required, especially those opposing the different encoders to each other. So please don't tell me that plain ABC rankings are appreciated when other threads and people's reactions show that, without ABX confirmation, these ratings are treated as hot air...


This is because, until now, Roberto was the only one to use Anova analysis on ABC/HR tests. Remember your last test: you posted some rankings, and they were discussed. I was on the verge of brandishing rule 8, but instead I asked if someone could compute the result and post the graph with error bars. No one did.
This time you tested Lame vs Vorbis vs MPC at high bitrate. Since I found this test very important, and I saw that no one was capable of computing the results the last time, I read Roberto's pages more carefully, and found FF123's Anova analyzer.

When, having rated MPC superior to MP3 9 times out of 10, you get p < 0.05 in the ABC/HR Anova analysis, it is mathematically equivalent to succeeding in a fixed ABX test with p < 0.05.

The ABC/HR results tell this, not the ABX ones. They show, among other things, that you can consistently hear the difference between MPC and MP3 with the settings you chose, on the samples you chose.
It has not been much pointed out outside Roberto's tests, but ABC/HR can be a substitute for ABX. I think that it's time to explain this in a tutorial. Your test proves the great usefulness of this method of testing, even for one person with several samples, instead of several people and several samples.
It should even work with one person and one sample, but with multiple ABC/HR sessions. I think it should be considered in future ABC/HR software.
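The arithmetic behind the 9-out-of-10 figure is a plain sign test; a one-line check in Python:

Code: [Select]
from math import comb

# Chance of rating one codec above the other on at least 9 of 10 samples
# if the listener were really choosing at random (a sign test):
p = sum(comb(10, k) for k in (9, 10)) / 2**10
print(p)  # 11/1024 ~ 0.011, below the usual 0.05 threshold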
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-24 21:43:16
Quote
It should even work with one person and one sample, but with multiple ABC/HR sessions. I think it should be considered in future ABC/HR software.


You're talking about multiple trials of rating a codec in the abc/hr module.  For example, rate a certain number of codecs for trial 1, then reshuffle them and rate them again for trial 2.  At the end of N trials, one could average the ratings.  On the face of it, it would seem the more codecs there are, and the less difference between them, the more benefit one could get from a procedure like this.  Imagine testing just two, but very different quality codecs.  Then it doesn't make much sense to repeat the ratings: they will be rated exactly the same every time.

So I tend to think that rating more music clips is probably better than trying to get the variability out of the ratings for a single music clip.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-24 22:04:52
Quote
Quote
What are you're conclusions here ? I'm interested.


My conclusions are that codec B must be a bit underrated, since an "annoying difference" couldn't be distinguished from the original 4 times (unless the tester states that he hit the wrong button).

The problem is that it might not be. Some of the reasons are easy to explain.

Imagine that you're testing many formats in the same test. The first step is to rate each file. The first one (1L) is excellent, very hard to distinguish (4.5/5); you're not even sure that the difference really exists. The second file suffers by comparison: coarseness is clearly audible (2/5). Now the second step: ABX. The first file is hard to ABX; a lot of concentration is needed. I could distinguish a slight amount of pre-echo in a precise range, that's all. 14/16 [with 16 as the fixed value]: not bad. The second file should be much easier to ABX. But the first six trials are bad (2/6). Why? Because all my attention is focused on pre-echo I can't hear, simply because this file doesn't suffer from that problem. By changing the selected range a bit and focusing my attention on another problem, I find again the annoyance I had immediately detected the first time and perform a very nice 16/16 in 2 minutes. Final score: 18/22.
Is your conclusion still the same: "codec B must be a bit underrated"?

There's a serious problem with tests including more than one encoded file: conditions are not equal for all of them. By changing the order, you could change the ABX scores. Beginning with an easy test could help you warm up your ears and give you confidence, but an easy 'victory' could also handicap you by breeding overconfidence, etc. You could be tired after two files if you begin with the two most difficult, etc.
Of course, the solution would be to rest your ears as often as needed and to watch your concentration... like a sportsman during a competition. The problem is that some people (including me) can't always spend three or four hours just to complete one single test with 6 contenders.

Quote
Exactly. I'm not going to spend a whole weekend trying to analyse partly sequential ABX results with additional conditions (...) especially after most people on this board have hammered home (but I'm not sure if I repeated it in the ABX tutorial) the necessity of fixing the number of trials before the test begins OR not looking at them during the test, for the results to be valid.

Nobody forces you to analyse these ABX results.
What kind of conclusions could you build by computing ABX scores (I'm serious, I still don't understand)? What could you conclude when you see that one file was ABXed at 10/16 and the other at 15/16? That the second one has stronger flaws? That's a wrong conclusion. The tester is not a robot, doesn't live in a studio, and is not a champion. He can't necessarily maintain the same level of concentration during a whole test; he can't necessarily keep his ears at the same level of freshness; he logically doesn't have the same familiarity with the reference during the first ABX session as during the sixth and last one...
By fixing a strict number of trials, you solve the problem if and only if the tester maintained the same listening abilities (a generic term covering freshness, concentration, motivation, patience, silence in the room) during the whole test.
If the tester admits that his listening conditions changed during a test, there's no need to spend a weekend, or even one minute, computing additional data based on ABX scores, which represent nothing (or at least, they don't only reflect the level of difficulty of the samples, but may also reflect variations in the listening conditions themselves).

Quote
Roberto's results are perfectly valid:
- The tests were double-blind
- The p-value is strictly below 0.05 (<0.01 is a good thing, <0.05 is required)


And what about the number of listeners? What about the samples? Many people, including JohnV, ff123 and others, have pointed out that different samples might seriously change the results. Roberto's tests are probably valid (he can't use 100 samples and force 200 HA members to participate), but the conclusions built upon the final results are often... questionable. Faac tied with Nero AAC, or WMA@128 close to having a "perceptible but not annoying" difference.
Quote
Your results are valid, V.A.L.I.D. Didn't you read the Anova log I posted and its conclusion?

OK, I was a bit angry. Sorry


Quote
(...) Therefore the test IS double-blind. (...) A simple blind test would be (...)

Thank you for the explanation. I thought a double-blind test was a single-blind test repeated twice.

Quote
When, having rated MPC superior to MP3 9 times out of 10, you get p < 0.05 in the ABC/HR Anova analysis, it is mathematically equivalent to succeeding in a fixed ABX test with p < 0.05.


But it's only true under certain conditions, isn't it? The level of degradation (artifacts) could also play a role, I suppose.
Quote
It has not been much pointed out outside Roberto's tests, but ABC/HR can be a substitute for ABX. I think that it's time to explain this in a tutorial.

I'm learning many things (though it's sometimes confusing). A tutorial would indeed be necessary.

If I have further questions, I'll probably ask them in French (by private message): comprehension will be easier for me.

Anyway, thanks for the long explanations. And sorry again for the irritating tone of my previous posts.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-24 23:56:01
Quote
You're talking about multiple trials of rating a codec in the abc/hr module.  For example, rate a certain number of codecs for trial 1, then reshuffle them and rate them again for trial 2.


Yes, exactly.

Quote
Imagine testing just two, but very different quality codecs.  Then it doesn't make much sense to repeat the ratings: they will be rated exactly the same every time.


But in this case, it would provide both the ratings and the ABX results at once, with nearly no more work than for two ABX tests. We just have to identify the reference in addition.
Recognizing them 8 times out of 8 without ranking the reference would replace ABXing the first codec against the reference and ABXing the second against the reference; and, the ranking being consistent, it would also replace ABXing them against each other, without us having to do it!

I tried this data in your analyzer :

Code: [Select]
Reference Codec1 Codec2
5.00      3.01   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00



I chose 0.01 as the p limit and ran the Anova analysis (by the way, what's the difference from the Friedman / non-parametric one?).
The results were

Reference is better than Codec2, Codec1
Codec2 is better than Codec1


All this for p < 0.01

Thus the analyzer recognized that ranking codec 1 below the (never ranked) reference 8 times out of 8 meant that the Reference is better than Codec 1 with p < 0.01.
It recognized that the Reference is better than Codec 2 with p < 0.01; so far we have the same information as with two ABX tests.
And it also says that Codec 2 is better than Codec 1 with p < 0.01. This is right, since the listener obviously distinguished the codecs (rating codec 1 at 3.00 and codec 2 at 4.00) 8 times out of 8 without a mistake.

By the way, your analyzer is buggy: it doesn't work if the first rating for codec 1 is 3.00. I had to set it to 3.01 instead.


I also tested one mistake in the codec choice (which stands for a 7/8 ABX between the codecs, but still 8/8 for each codec against the reference)

Code: [Select]
Reference Codec1 Codec2
5.00      3.01   4.00
5.00      4.00   3.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00


The Anova analysis still tells me that codec 2 is better than codec 1 (with p < 0.001!). This is strange.
However, the Friedman / non-parametric analysis detects the problem and says that only the reference was recognized as superior to the codecs with p < 0.01.


Hey ! What's the problem with the Anova analysis ??

Code: [Select]
Reference Codec1 Codec2
5.00      3.01   4.00
5.00      3.00   4.00


It says from the above that Reference is better than Codec2, Codec1, and that Codec2 is better than Codec1, all with p < 0.001 ! It is plain wrong ! The above results can happen by chance !


The Friedman analysis seems to work well (it says that the above data is not significant).
So I ran Guruboolez's data through the analyzer again, but with the Friedman analysis this time, in case of an Anova computation failure:

Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis

Number of listeners: 10
Critical significance:  0.05
Significance of data: 1.44E-05 (highly significant)
Fisher's protected LSD for rank sums:  16.398

Ranksums:

MPC-q5   MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
55.00    49.00    31.50    30.50    26.50    17.50  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
MPC-q5   0.473    0.005*   0.003*   0.001*   0.000*  
MGX-q6            0.036*   0.027*   0.007*   0.000*  
MP3-V2                     0.905    0.550    0.094    
MGX-q5.9                            0.633    0.120    
MGX-q5.5                                     0.282    
-----------------------------------------------------------------------

MPC-q5 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3


Fortunately, it still says that MPC and Megamix q6 are the winners. However, MPC no longer wins over Megamix q6. This time, it says there is about one chance in two of getting this result by chance!
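For comparison, the same 7/8 toy data can be fed to an off-the-shelf Friedman test; a Python sketch using scipy (which reports only the omnibus significance, not the pairwise comparisons ff123's tool adds):

Code: [Select]
from scipy.stats import friedmanchisquare

# Pio2001's second toy data set: reference vs. two codecs, 8 sessions,
# with the ratings of the two codecs swapped in session 2 (the 7/8 case).
reference = [5.00] * 8
codec1    = [3.01, 4.00, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00]
codec2    = [4.00, 3.00, 4.00, 4.00, 4.00, 4.00, 4.00, 4.00]

stat, p = friedmanchisquare(reference, codec1, codec2)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")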
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-07-25 01:43:22
Quote
So I ran Guruboolez's data through the analyzer again, but with the Friedman analysis this time, in case of an Anova computation failure (...)
However, MPC no longer wins over Megamix q6. This time, it says there is about one chance in two of getting this result by chance!


I have simulated the addition of more results (i.e. samples) by simply replicating the scores obtained for the first 10 samples.
With 70 results (= 10 x 7), the Friedman conclusion:
Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis

Number of listeners: 70
Critical significance:  0.05
Significance of data: 0.00E+00 (highly significant)
Fisher's protected LSD for rank sums:  43.386

Ranksums:

MPC-q5   MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
385.00   343.00   220.50   213.50   185.50   122.50  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
MPC-q5   0.058    0.000*   0.000*   0.000*   0.000*  
MGX-q6            0.000*   0.000*   0.000*   0.000*  
MP3-V2                     0.752    0.114    0.000*  
MGX-q5.9                            0.206    0.000*  
MGX-q5.5                                     0.004*  
-----------------------------------------------------------------------

MPC-q5 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MP3-V2 is better than MP3-V3
MGX-q5.99 is better than MP3-V3
MGX-q5.5 is better than MP3-V3


With 7 copies of the same bunch of results, MPC still can't be called better than Vorbis -q 6 with confidence, even though 56 samples were superior with MPC and only 14 with Vorbis... Weird.

It's only with 8 times the same results that significance is reached:
Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis

Number of listeners: 80
Critical significance:  0.05
Significance of data: 0.00E+00 (highly significant)
Fisher's protected LSD for rank sums:  46.381

Ranksums:

MPC-q5   MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
440.00   392.00   252.00   244.00   212.00   140.00  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
MPC-q5   0.043*   0.000*   0.000*   0.000*   0.000*  
MGX-q6            0.000*   0.000*   0.000*   0.000*  
MP3-V2                     0.735    0.091    0.000*  
MGX-q5.9                            0.176    0.000*  
MGX-q5.5                                     0.002*  
-----------------------------------------------------------------------

MPC-q5 is better than MGX-q6, MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MP3-V2 is better than MP3-V3
MGX-q5.99 is better than MP3-V3
MGX-q5.5 is better than MP3-V3


Now, if I suppose that the scores I initially planned to add to this first bunch of 10 results won't really differ from the first 10, I'd need to find and test about 70 additional samples in order to claim that MPC is superior to vorbis "megamix" -q 6.00 without risking banishment. Forget guruboolez's test: I have other things to do with my life


With ANOVA analysis, the situation is less pathetic:
Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 20
Critical significance:  0.05
Significance of data: 0.00E+00 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total              119          90.75
Testers (blocks)    19           7.35
Codecs eval'd        5          52.03   10.41   31.50  0.00E+00
Error               95          31.38    0.33
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.361

Means:

MPC-q5   MGX-q6   MGX-q5.9 MP3-V2   MGX-q5.5 MP3-V3  
 3.82     3.15     2.34     2.30     2.23     1.88  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MGX-q5.9 MP3-V2   MGX-q5.5 MP3-V3  
MPC-q5   0.000*   0.000*   0.000*   0.000*   0.000*  
MGX-q6            0.000*   0.000*   0.000*   0.000*  
MGX-q5.9                   0.826    0.546    0.013*  
MP3-V2                              0.701    0.023*  
MGX-q5.5                                     0.057    
-----------------------------------------------------------------------

MPC-q5 is better than MGX-q6, MGX-q5.99, MP3-V2, MGX-q5.5, MP3-V3
MGX-q6 is better than MGX-q5.99, MP3-V2, MGX-q5.5, MP3-V3
MGX-q5.99 is better than MP3-V3
MP3-V2 is better than MP3-V3


If the next 10 samples I test get the same ratings as the first 10, then I could conclude that mpc is superior.


May I suggest forgetting the "Friedman/non-parametric Fisher" analysis for analysing ABC/HR scores? It could be helpful for testers...
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-25 06:52:51
Quote
I chose 0.01 as the p limit. The Anova analysis (by the way, what's the difference with Friedman / non-parametric ?).


Non-parametric means that you're giving each codec a ranking (i.e., first, second, third, etc.) instead of a rating on a scale from 1.0 to 5.0.  Ranking can be more robust than rating, but also less sensitive.
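To picture the difference, here is a minimal sketch (my own illustration, not ff123's actual code) of the rating-to-rank conversion a Friedman-style analysis performs for each listener; the hypothetical ratings below lose everything but their ordering:
Code: [Select]
# One listener's ABC/HR ratings for six codecs (hypothetical values)
from scipy.stats import rankdata

ratings = [3.8, 3.2, 2.5, 2.5, 2.2, 1.9]
ranks = rankdata(ratings)   # ascending: the worst-rated codec gets rank 1
print(ranks)                # [6.  5.  3.5 3.5 2.  1. ] -- ties share the average rank
# Friedman then sums these per-listener ranks into the "Ranksums" seen above.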

Quote
By the way, your analyzer is bugged : it doesn't work if the first rating  for codec 1 is 3.00. I had to set 3.01 instead.


You're running into a divide by 0 problem.  If you set any number to be different (not just codec 1 in the first row) it will sidestep this problem.  It's not a bug in the program -- that's the way the calculations work.  If you use real data, you should never see this kind of behavior.

Quote
I also tested one mistake in the codec choice (that stands for a 7/8 ABX between the codecs, but still 8/8 for each codec against reference)

Code: [Select]
Reference Codec1 Codec2
5.00      3.01   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00


The Anova analysis still tells me that codec 2 is better than codec 1 (with p<0.001 ! ). This is strange.


Also not a bug.  Set another row to be like row 2 and you'll see the p-value start to creep up.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-25 07:05:00
Quote
May I suggest forgetting the "Friedman/non-parametric Fisher" analysis for analysing ABC/HR scores? It could be helpful for testers...
[a href="index.php?act=findpost&pid=228716"][{POST_SNAPBACK}][/a]


The Friedman non-parametric analysis makes fewer assumptions about the data, and is therefore more robust, but can also be less powerful than ANOVA.  If one wanted to be ultra-conservative, he would do a non-parametric Tukey's analysis, which corrects for the fact that there are multiple codecs being ranked.  But for abc/hr, there's little reason to use friedman.  I should probably change the default.

ff123

Edit:  I should also probably add the Tukey's analyses back in to the web page.  They're in the actual command line program.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-25 16:45:28
Quote
Quote
By the way, your analyzer is bugged : it doesn't work if the first rating  for codec 1 is 3.00. I had to set 3.01 instead.


You're running into a divide by 0 problem.  If you set any number to be different (not just codec 1 in the first row) it will sidestep this problem.  It's not a bug in the program -- that's the way the calculations work.  If you use real data, you should never see this kind of behavior.[a href="index.php?act=findpost&pid=228760"][{POST_SNAPBACK}][/a]


Why ?
Is it forbidden to find one codec consistently rated 3 and the other always 4 ? I didn't set 0 anywhere, just 3.00 and 4.00.

EDIT : and I still don't understand how two people rating codecs 3 and 4 can lead to a confidence superior to 99.9 % !
I guess that the analyzer finds the coincidence very big : 3.00 and 3.00 again, while it could have been 3.05, or 2.99; that can't be a coincidence !
It should absolutely be avoided ! Real people will never rate a codec 2.99.  Will the analyzer drop in accuracy if we set 3 without dot and digits ? Or will it have to be rewritten in order to work with integer precision instead of one hundredth ?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: bleh on 2004-07-25 17:10:03
With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence.  That's how the test works.  However, such an assumption is terrible with a sample size that low, so trying to run the test with only two people ranking codecs is a bad idea.

Also, the division by zero came from the fact that all scores for a codec were the same (each score minus its mean is 0, so the error variance vanishes).  Again, this is either a symptom of the sample size being too low or of the probability of a real difference being staggeringly high.
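To see exactly which quantity hits zero, here is a rough re-computation of the blocked-ANOVA sums of squares on Pio2001's table (a sketch under my own assumptions, not the friedman.exe source):
Code: [Select]
import numpy as np

scores = np.array([[3.00, 4.00]] * 8)   # the failing case: 8 identical listener rows
grand = scores.mean()
n, k = scores.shape
ss_codec = n * ((scores.mean(axis=0) - grand) ** 2).sum()   # 4.0
ss_block = k * ((scores.mean(axis=1) - grand) ** 2).sum()   # 0.0
ss_error = ((scores - grand) ** 2).sum() - ss_codec - ss_block
print(ss_error)   # 0.0 -> the F ratio divides its mean square by zero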
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-25 18:38:13
Quote
With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence.  That's how the test works. [a href="index.php?act=findpost&pid=228855"][{POST_SNAPBACK}][/a]


In this case, it should never be applied to ABC/HR tests. We ask people to choose between 1.00, 2.00, 3.00, 4.00, or 5.00. The analyzer will find that people always giving an integer answer can't be a coincidence, and will return insanely high levels of confidence because of this.

Quote
However, such an assumption is terrible with a sample size that low, so trying to run the test with only two people ranking codecs is a bad idea.[a href="index.php?act=findpost&pid=228855"][{POST_SNAPBACK}][/a]


What's the meaning of the p values then ?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-25 21:58:52
Quote
Quote
With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence.  That's how the test works. [a href="index.php?act=findpost&pid=228855"][{POST_SNAPBACK}][/a]


In this case, it should never be applied to ABC/HR tests. We ask people to choose between 1.00, 2.00, 3.00, 4.00, or 5.00. The analyzer will find that people always giving an integer answer can't be a coincidence, and will return insanely high levels of confidence because of this.


No, not true.  One of the assumptions that ANOVA makes is that the scale is continuous.  ABC/HR's scale is not continuous, but it is close enough, since it has many intervals in between the major divisions.  As I said, in real-world data, you are not likely to see a table of scores like the one you posted.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-25 22:20:29
For anybody interested in seeing exactly where in the calculations this thing blows up with the sort of data Pio supplied, download this spreadsheet, which shows how things are computed:

http://ff123.net/export/anova.zip (http://ff123.net/export/anova.zip)

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-07-26 00:53:47
In the meantime, I found some info on the web :

Anova : http://www.psychstat.smsu.edu/introbook/sbk27.htm (http://www.psychstat.smsu.edu/introbook/sbk27.htm)
Friedman : http://www.graphpad.com/articles/interpret...A/friedmans.htm (http://www.graphpad.com/articles/interpret/ANOVA/friedmans.htm)

In short, it says that the Friedman analysis only cares about the ranking of the samples in each line.
If, in one line, codec 1 is rated 5.00 and codec 2 is rated 4.99, for the Friedman analysis it is exactly the same thing as if they were rated 2000 and 1, as long as codec 1 is first and codec 2 second. It doesn't care about the scores at all.

The ANOVA analysis computes the variance of the results that each codec got, then the variation between the codecs. If it finds the variation between codecs abnormally high compared to the variance of each codec's ratings, it says that a codec is superior, or inferior.
If it finds that the difference between the codecs is similar to the differences between individual people or samples, it says that the variation was to be expected, and rates the codecs equal.
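A quick way to convince oneself of the ranking point (my own sketch, with made-up numbers): feed two tables having the same per-row ordering but wildly different scores to SciPy's Friedman test, and the statistic and p-value come out identical:
Code: [Select]
from scipy.stats import friedmanchisquare

# 4 listeners x 3 codecs; in every row codec A beats B, which beats C
A  = [5.00, 4.80, 4.60, 4.90]
B  = [4.99, 4.70, 4.50, 4.80]
C  = [3.00, 2.00, 2.50, 1.00]
A2 = [2000,   30,   10,    9]   # same orderings, absurd values
B2 = [   1,   20,    9,    8]
C2 = [   0,   10,    8,    7]

print(friedmanchisquare(A, B, C))      # statistic 8.0, p ~ 0.018
print(friedmanchisquare(A2, B2, C2))   # identical: only the rankings matter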
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: deaf on 2004-07-26 02:38:27
It is interesting to see that applying scientific methods makes some fine detail disappear from the results and wastes the effort that was put into the test. Looking at charts of test results, like the latest low-bitrate one, even if the confidence intervals overlap we still rate one codec better than the other, even though the probability of being wrong increases. Without violating rule #8, we do make comments on it.
It has been discussed several times how to deal with differences in bitrate between the samples. I have not made an effort to find much about it, and it is controversial because it is subjective, but may I suggest an XY chart of bitrate vs. rating for this result? A "how much bang for the money" style. Maybe others are aware of some scientific way of calculating mean/std circles for each codec, bearing in mind that quality is not linear, nor even proportional, to size/bitrate. Could that give another perspective on the results?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-26 03:10:19
Quote
It is interesting to see that applying scientific methods makes some fine detail disappear from the results and wastes the effort that was put into the test.


For a group test, to get more sensitive results, decrease the number of codecs being tested.  If you only compare 2 codecs, for example, you can get very fine detail.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-07-26 03:40:15
Quote
Edit:  I should also probably add the Tukey's analyses back in to the web page.  They're in the actual command line program.
[a href="index.php?act=findpost&pid=228761"][{POST_SNAPBACK}][/a]


Added to the web page analyzer.  I also made the Parametric Tukey's HSD the default, which is the conservative option, but the most statistically correct, especially with large numbers of codecs being compared.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-08-04 23:23:37
FF123, could you explain to us in common language why, when two codecs are analyzed the Friedman way, the confidence that a difference exists matches the binomial table, whereas after adding completely independent columns standing for other codecs, the exact same data between our first two codecs becomes non-significant ?

Is it because the probability of having a low probability of guessing among all possible pairs of codecs is taken into account ?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-08-05 01:04:35
Quote
FF123, could you explain to us in common language why, when two codecs are analyzed the Friedman way, the confidence that a difference exists matches the binomial table, whereas after adding completely independent columns standing for other codecs, the exact same data between our first two codecs becomes non-significant ?

Is it because the probability of having a low probability of guessing among all possible pairs of codecs is taken into account ?
[a href="index.php?act=findpost&pid=231968"][{POST_SNAPBACK}][/a]


The answer to the latter question:  the Friedman (non-parametric) method does not do a Bonferroni-type correction for multiple comparisons (like the Tukey methods do).

I don't really know the answer to the first, but I can guess:  there would have to be a separate LSD number for each comparison (for 2 codecs there can only be one comparison, for 3 codecs there are 3 comparisons, for 4 codecs 6 comparisons, etc.).  Since there is only one LSD number, all of the comparisons would have to be exactly alike to match the binomial table.  But that would almost never happen.
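As a trivial check of those counts (my own snippet), the number of pairwise comparisons for k codecs is k*(k-1)/2:
Code: [Select]
for k in (2, 3, 4, 6):
    print(k, "codecs ->", k * (k - 1) // 2, "pairwise comparisons")
# 2 -> 1, 3 -> 3, 4 -> 6, and 6 codecs -> the 15 comparisons discussed later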

The way to get a better match to the binomial table would be to do a comparison like the resampling method used by the bootstrap program here:

http://ff123.net/bootstrap/ (http://ff123.net/bootstrap/)

This method essentially performs many simulations, and produces a separate confidence interval for each comparison.  The downside to using this type of method is that you can't really use the nice graphs any more (which we can draw because there is only one size error bar which applies to all comparisons), and have to stick to showing the results in tabular format.
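As a rough illustration of the resampling idea (a toy sign-flip simulation written under my own assumptions, not ff123's actual bootstrap program):
Code: [Select]
import random

def simulated_pair_p(a, b, iters=10000, seed=1):
    """One-sided p for 'codec a rated above codec b': flip the sign of each
    paired difference at random, as if neither codec were really better,
    and count how often the shuffled total beats the observed one."""
    random.seed(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs)
    hits = sum(1 for _ in range(iters)
               if sum(d if random.random() < 0.5 else -d for d in diffs) >= observed)
    return hits / iters

# e.g. simulated_pair_p(ogg_scores, lame_scores) for each pair of codecs,
# keeping one simulated confidence figure per comparison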

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-08-22 13:53:50
[span style='font-size:21pt;line-height:100%']   ...::: 8 additional results :::...[/span]



[span style='font-size:14pt;line-height:100%']I. TESTING CONDITIONS[/span]

Few changes since the last bunch of tests: same hardware, same kind of music (classical), same software. I've nevertheless drawn the conclusions of the past discussion with pio2001, and fixed the number of trials for all ABX tests: 12 trials, no more, no less. This drastic condition implies a lot of concentration and many rests, and is therefore very time-consuming. Tests are less enjoyable in my opinion (motivation is harder to find). Another consequence of this: there are now 5.0 [transparent] ratings. If I failed [EDIT: "completely failed"] to ABX something, I cancelled my ABC/HR rating and gave a nice 5.0 as the final note. I nevertheless kept a trace of my initial impression in the "general comment".



[span style='font-size:14pt;line-height:100%']II. SAMPLES[/span]

I tried to vary the samples as much as possible (kind of instruments/signal). There are no known killer samples among them. All samples should be ‘normal’, with no correspondence to typical lossy/perceptual problems (such as sharp attacks and micro-attack signals, for example).

Eight more samples. Two are from harashin:
- Liebestod: opera (soprano voice with orchestra)
- LadyMacbeth: festive orchestra, with predominant brass and cymbals

Six others are mine:
- Trumpet Voluntar: trumpet with organ (noisy recording)
- Vivaldi RV93: baroque strings, i.e. period instruments (small ensemble)
- Troisième Ballet: cousin of bagpipes, playing with a baroque ensemble
- Vivaldi – Bassoon [13]: solo bassoon, with light accompaniment
- Seminarist: male voice (baritone) with a lot of sibilant consonants and piano accompaniment
- ButterflyLovers: solo violin playing alternately with full string orchestra 



[span style='font-size:14pt;line-height:100%']III. RESULTS[/span]

[span style='font-size:12pt;line-height:100%']3.1. eight new results[/span]




[span style='font-size:14pt;line-height:100%']IV. STATISTICAL ANALYSIS[/span]


I fed ff123’s friedman.exe application with the following table:
Code: [Select]
 
LAME_V2   LAME_V3   MPC_Q5    OGG5.5    OGG5.99   OGG6.00  
2.00      1.50      3.00      2.00      2.00      3.20      
1.50      1.00      4.00      2.90      2.90      3.50      
3.00      2.50      2.80      3.00      3.30      4.00      
3.00      2.00      4.00      2.00      2.00      2.30      
1.50      1.00      4.90      2.50      2.50      3.30      
3.00      1.80      3.80      2.20      2.40      3.00      
1.50      1.20      3.50      1.80      2.30      3.40      
1.50      2.70      4.00      2.00      2.00      2.30      
3.00      2.80      4.20      1.60      1.50      3.00      
3.00      2.30      4.00      2.30      2.50      3.50      
2.00      2.00      4.00      2.50      2.50      3.50      
3.50      2.50      5.00      1.50      1.50      4.00      
1.50      1.00      4.00      2.00      2.50      3.00      
1.40      1.20      3.50      1.70      2.00      2.20      
4.00      3.00      5.00      4.00      4.00      4.50      
2.50      1.30      3.50      1.70      1.70      2.70      
3.00      1.20      3.00      1.40      2.00      2.20      
3.50      3.00      3.00      2.00      2.00      5.00      

[span style='font-size:9pt;line-height:100%'][Interesting to note: the conclusions and values computed by the tool are exactly the same if I keep the original notation (e.g. 12.3 and not 2.30).][/span]
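Out of curiosity, the overall verdict of that table can be cross-checked with SciPy's Friedman test (my own sketch; it reproduces the "highly significant" line but not the post-hoc LSD matrix):
Code: [Select]
import numpy as np
from scipy.stats import friedmanchisquare

# columns: LAME_V2, LAME_V3, MPC_Q5, OGG5.5, OGG5.99, OGG6.00 (table above)
scores = np.array([
    [2.00, 1.50, 3.00, 2.00, 2.00, 3.20],
    [1.50, 1.00, 4.00, 2.90, 2.90, 3.50],
    [3.00, 2.50, 2.80, 3.00, 3.30, 4.00],
    [3.00, 2.00, 4.00, 2.00, 2.00, 2.30],
    [1.50, 1.00, 4.90, 2.50, 2.50, 3.30],
    [3.00, 1.80, 3.80, 2.20, 2.40, 3.00],
    [1.50, 1.20, 3.50, 1.80, 2.30, 3.40],
    [1.50, 2.70, 4.00, 2.00, 2.00, 2.30],
    [3.00, 2.80, 4.20, 1.60, 1.50, 3.00],
    [3.00, 2.30, 4.00, 2.30, 2.50, 3.50],
    [2.00, 2.00, 4.00, 2.50, 2.50, 3.50],
    [3.50, 2.50, 5.00, 1.50, 1.50, 4.00],
    [1.50, 1.00, 4.00, 2.00, 2.50, 3.00],
    [1.40, 1.20, 3.50, 1.70, 2.00, 2.20],
    [4.00, 3.00, 5.00, 4.00, 4.00, 4.50],
    [2.50, 1.30, 3.50, 1.70, 1.70, 2.70],
    [3.00, 1.20, 3.00, 1.40, 2.00, 2.20],
    [3.50, 3.00, 3.00, 2.00, 2.00, 5.00],
])
stat, p = friedmanchisquare(*scores.T)
print(stat, p)   # a tiny p, matching "Significance of data: highly significant"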

The ANOVA analysis conclusion is:

Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 18
Critical significance:  0.05
Significance of data: 0.00E+000 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total              107         102.73
Testers (blocks)    17          23.75
Codecs eval'd        5          49.48    9.90   28.53  0.00E+000
Error               85          29.49    0.35
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.390

Means:

MPC_Q5   OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3  
 3.84     3.26     2.47     2.31     2.17     1.89  

---------------------------- p-value Matrix ---------------------------

        OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3  
MPC_Q5   0.004*   0.000*   0.000*   0.000*   0.000*  
OGG6.00           0.000*   0.000*   0.000*   0.000*  
LAME_V2                    0.430    0.137    0.004*  
OGG5.99                             0.481    0.034*  
OGG5.5                                       0.153    
-----------------------------------------------------------------------

MPC_Q5 is better than OGG6.00, LAME_V2, OGG5.99, OGG5.5, LAME_V3
OGG6.00 is better than LAME_V2, OGG5.99, OGG5.5, LAME_V3
LAME_V2 is better than LAME_V3
OGG5.99 is better than LAME_V3


And now, the “most statistically correct” (http://www.hydrogenaudio.org/forums/index.php?showtopic=23355&view=findpost&p=228977) parametric Tukey analysis:

Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Tukey HSD analysis

Number of listeners: 18
Critical significance:  0.05
Tukey's HSD:   0.574

Means:

MPC_Q5   OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3  
 3.84     3.26     2.47     2.31     2.17     1.89  

-------------------------- Difference Matrix --------------------------

        OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3  
MPC_Q5     0.589*   1.378*   1.533*   1.672*   1.956*
OGG6.00             0.789*   0.944*   1.083*   1.367*
LAME_V2                      0.156    0.294    0.578*
OGG5.99                               0.139    0.422  
OGG5.5                                         0.283  
-----------------------------------------------------------------------

MPC_Q5 is better than OGG6.00, LAME_V2, OGG5.99, OGG5.5, LAME_V3
OGG6.00 is better than LAME_V2, OGG5.99, OGG5.5, LAME_V3
LAME_V2 is better than LAME_V3


According to the last analysis, lame -V3 and vorbis megamix1 -q 5,50/5,99 offer comparable performances (they are tied). In other words, I can't say that megamix at -q 5,99 is superior to lame -V 3, even though 13 samples (72%) are favorable to megamix 5,99, one is identical (6%) and only four (22%) are favorable to lame V3. If I understand correctly, for me and the set of 18 tested samples, I should admit that lame is tied with vorbis even though the latter is superior on 72% of the tested samples! It's totally insane in my opinion… There's maybe a problem somewhere, or are 18 samples still not enough?
The ANOVA analysis is slightly more acceptable: it concludes that megamix 5,99 is superior over the 18 samples, but still not megamix 5,50 (66% of favorable samples).

But both analyses conclude on:
1/ full MPC -Q5 superiority (even against Vorbis megamix1 -Q6)
2/ megamix1 Q6 superiority over lame -V2 and -V3, and over megamix Q5,50 and Q5,99
3/ LAME V2 > LAME V3

More schematically:
• ANOVA: MPC_Q5 > OGG_Q6 > OGG_Q5,99/Q5,50/MP3_V2/MP3_V3
• ANOVA: OGG_Q5,99 > LAME_V3
• ANOVA: LAME_V2 > LAME_V3

• TUKEY_PARAMETRIC: MPC_Q5 > OGG_Q6 > OGG_Q5,99/Q5,50/MP3_V2/MP3_V3
• TUKEY_PARAMETRIC: LAME_V2 > LAME_V3


In other words, it means that for me, and after double blind tests on non-critical material:
- musepack --standard superiority is not a legend, and isn't invalidated by the recent progress made by lame developers and vorbis people.
- the lame --standard preset is still competitive against vorbis, at least up to q5,99, which still suffers from audible and sometimes irritating coarseness. Nevertheless, the quality of lame MP3 quickly drops below this standard preset, which is worth noting for hardware playback.
- vorbis aoTuV/CVS 1.1 begins to be suitable for high quality at q 6,00, but absolutely not below this floor.


[span style='font-size:14pt;line-height:100%']APPENDIX. SAMPLE LOCATION AND ABX LOGS[/span]

ABX logs are available here:
http://audiotests.free.fr/tests/2004.07/hq1/log/ (http://audiotests.free.fr/tests/2004.07/hq1/log/)
The eight new log files are merged in one single archive (http://audiotests.free.fr/tests/2004.07/hq1/log/ABX%20log%208%20new.zip)

Samples are not uploaded. I could do it. Is anyone interested?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: eagleray on 2004-08-22 14:31:58
So many numbers...ouch.

Thank you Guruboolez for all your work.  There is definitely a shortage of encoder comparison tests at relatively high bitrates, and no shortage of opinions.

One thing continues to bug me:

Someone with really good hearing, including the training to listen for artifacts, can do a valid abx comparison and produce results at a good confidence level.

Someone like me can not.

Am I better off using the encoder that the person with good hearing can identify?  In other words, even if I can not objectively identify the differences in abx testing, is there some subjective additional level of enjoyment of the music, other than a possible placebo effect?  Is there any way to verify this?

There is the final unfortunate truth:  MP3 hardware support is universal, Ogg Vorbis hardware support is relatively limited along with the battery life issue, and MPC is confined to playback on computers.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-08-22 15:16:55
Quote
According to the last analysis, lame -V3 and vorbis megamix1 -q 5,50/5,99 offer comparable performances (they are tied). In other words, I can't say that megamix at -q 5,99 is superior to lame -V 3, even though 13 samples (72%) are favorable to megamix 5,99, one is identical (6%) and only four (22%) are favorable to lame V3. If I understand correctly, for me and the set of 18 tested samples, I should admit that lame is tied with vorbis even though the latter is superior on 72% of the tested samples! It's totally insane in my opinion… There's maybe a problem somewhere, or are 18 samples still not enough?


I verified with the bootstrap program:

http://ff123.net/bootstrap/ (http://ff123.net/bootstrap/)

that statistically speaking, if you adjust for the fact that there are actually 15 comparisons with 6 codecs, then ogg5.99 must be considered tied to lamev3.  The bootstrap (simulation) program is almost as good as one can do for adjusted p-values.

Nice comparison, guru.

Code: [Select]
                             Adjusted p-values
        OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3
MPC_Q5   0.021*   0.000*   0.000*   0.000*   0.000*
OGG6.00    -      0.001*   0.000*   0.000*   0.000*
LAME_V2    -        -      0.633    0.367    0.021*
OGG5.99    -        -        -      0.633    0.128
OGG5.5     -        -        -        -      0.367

                            Means
MPC_Q5   OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3
3.844    3.256    2.467    2.311    2.172    1.889


ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: kuniklo on 2004-08-22 17:28:01
Thanks very much for taking all the time to do these comparisons Guru.  So much has changed since all the original high-bitrate comparisons were made that it's very useful to get new data.  I guess I'll continue using mpc myself. 
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-08-23 00:44:30
Thank you very much for your work and analyses !


Quote
In other words, I can't say that megamix at -q 5,99 is superior to lame -V 3, even though 13 samples (72%) are favorable to megamix 5,99, one is identical (6%) and only four (22%) are favorable to lame V3. [...] It's totally insane in my opinion…
[a href="index.php?act=findpost&pid=236220"][{POST_SNAPBACK}][/a]


I don't see what's wrong with it. If you interpret it as an ABX test, you got Megamix superior to Lame with a score of 13/18. The p-value is 0.048, which is already very borderline for a valid result.
But here, 6 codecs are compared, which gives a total of 15 possible codec comparisons. If you are answering at random, it is perfectly expectable that, among the 15 possible 1-to-1 codec comparisons, one of them comes out positive with p = 1/15. This would be considered complete chance, with p clearly higher than 0.5, and not 1/15.
In the same way, the probability of guessing the 13/18 result that you got is not 0.048, but much higher. It says that it is equal to 0.633. So if this can happen more than one time out of two, I should be able to reproduce it easily with random results.
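The 0.048 figure, and the rough "one chance out of two" quoted earlier in the thread, are easy to verify (my own arithmetic check; the independence assumption in the second computation is only a heuristic, not the actual Tukey/bootstrap adjustment):
Code: [Select]
from math import comb

# one-sided binomial: chance of at least 13 wins out of 18 fair coin flips
p = sum(comb(18, k) for k in range(13, 19)) / 2**18
print(round(p, 3))                 # 0.048

# crude inflation over 15 pairs, pretending they were independent:
print(round(1 - (1 - p) ** 15, 2)) # ~0.52, i.e. "one chance out of two"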

First try with completely random numbers, generated by my calculator :

Code: [Select]
Joke1   Joke2   Joke3  Joke4  Joke5   Joke6
3.40    1.70    4.70   1.30   2.10    1.10
4.70    4.70    1.60   3.60   2.30    1.20
3.70    1.90    2.50   1.10   2.50    4.30
2.50    3.10    3.90   3.40   2.40    4.00
1.30    4.60    4.40   3.40   1.50    2.50
4.00    1.20    2.40   4.90   4.30    1.50
3.40    2.50    4.50   1.40   3.10    2.00
1.20    3.30    4.50   4.10   2.50    1.90
4.50    4.30    4.70   4.70   5.00    4.30
4.50    4.10    3.10   4.50   2.60    3.40
2.60    2.30    1.80   4.80   3.00    1.90
2.40    2.20    2.10   4.00   2.60    2.80
1.20    1.80    1.10   1.10   3.90    3.30
4.90    1.30    2.40   4.60   4.20    2.20
2.50    2.10    4.70   4.00   4.80    1.50
1.90    3.10    3.80   1.50   3.90    2.80
1.30    4.70    3.40   3.10   3.20    2.70
4.30    2.30    3.70   1.80   1.30    4.10


No score as good as 13/18.

Second try :

Code: [Select]
Joke1 Joke2 Joke3 Joke4 Joke5 Joke6
14    31    50    17    34    29
13    22    21    36    23    23
50    48    17    31    14    11
28    49    24    50    43    50
12    48    23    33    22    43
40    28    25    15    47    33
23    13    37    29    38    30
41    40    19    25    33    18
28    48    40    12    13    44
32    25    40    26    49    17
11    29    43    15    36    47
41    18    22    22    24    44
15    13    25    13    39    48
16    17    17    40    37    24
30    29    49    29    12    43
33    40    14    49    42    48
19    47    11    47    40    31
42    34    41    24    25    21


Here, you can see that Joke6 is better than Joke1 13 times out of 18, and with random numbers ! This is not an insane result. Two tries were enough for it to happen.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-08-23 09:42:16
I don't understand what these random numbers are supposed to prove.
I've tested some codecs with 18 samples. By comparing two of these encoders, I saw that one is inferior to the other on 78% of the tested samples, and 'identical' on 6%. It should be very obvious that one is ABSOLUTELY inferior to the second, at least on the 18 tested samples.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Pio2001 on 2004-08-23 11:58:24
If someone runs an ABX test in which I am the listener, and he plays A 18 times, and I say 78% of the time that it is B, is it obvious that B was absolutely played 78% of the time ?

But again, this discussion only matters for the interpretation of the Anova and Tukey analyses. Here, you also got some ABX results, whose meaning goes much beyond what Anova and Tukey say. We must consider the ABX results separately from the ABC/HR analyses in order to draw a general conclusion.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-08-23 16:03:13
Quote
I don't understand what these random numbers are supposed to prove.
I've tested some codecs with 18 samples. By comparing two of these encoders, I saw that one is inferior to the other on 78% of the tested samples, and 'identical' on 6%. It should be very obvious that one is ABSOLUTELY inferior to the second, at least on the 18 tested samples.
[a href="index.php?act=findpost&pid=236400"][{POST_SNAPBACK}][/a]


The adjustment for multiple comparisons can be harsh.  That's why it's good to keep the number of comparisons down to a minimum.  If you had just compared Ogg5.99 against lameV3, it's likely you would have come up with a significant difference.  But with so many comparisons, the statistical "noise" gets larger.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-08-23 19:58:34
Quote
(...) That's why it's good to keep the number of comparisons down to a minimum. (...)
[a href="index.php?act=findpost&pid=236453"][{POST_SNAPBACK}][/a]


But is it really up to the tester to adapt his test to the analysis tool? Or isn't it more logical to ask the analysis tool to deal with the conditions of the test?

It sounds like the methodological problems introduced with VBR tests at a target bitrate: there's sometimes a big temptation to select specific samples (not too high, not too low) in order to match the targeted bitrate, rather than choosing the samples we really want to test, which could be more interesting. If a tester chooses to avoid some samples for this reason, the risk is to limit the impact (and maybe the significance) of the test.

Same thing here. It's probably better to limit the number of comparisons, for many reasons. But on the other hand, it'll be harder to get solid ideas about the relative performances of different encoders.
With my test for example, I now have solid ideas about:
- the big difference existing between vorbis -q6 and lower profiles, including 5,99
- the very limited difference between vorbis 5,50 and 5,99 (therefore, there's little to expect from increasing the bitrate by a 0.2...0.5 level)
- serious differences between lame --preset standard and -V2

If I had removed three contenders, keeping one lame setting and one vorbis quality level, the three previous conclusions wouldn't be possible. And if I had tested vorbis and lame separately in two different sessions, I couldn't seriously compare the results with each other (such comparisons need at least the same low and high anchors, which makes two separate tests with three contenders each + 2 anchors much longer than one single test with 6 contenders).


In other words, I don't think it would be a good idea to adapt the conditions of any test to the conditions of the analysis tool. The analysis must be passive, with no influence on the subject of the analysis.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ff123 on 2004-08-23 21:35:39
Quote
In other words, I don't think it would be a good idea to adapt the conditions of any test to the conditions of the analysis tool. The analysis must be passive, with no influence on the subject of the analysis.


That's fine.  A tester can set up any test he likes, but the fact is that the test conditions affect the subsequent analysis.  So you've got to be aware of this when you set up your test.  In this particular case, if you really wanted to be certain that ogg5.99 is really better than lameV3 (for your ears and samples), then you should run another test with just the two codecs to confirm it.

That's the way statistics works.  You go into a test with your criteria for significance set prior to running the test (meaning you should choose which analysis you're going to run prior to the test as well; i.e., ANOVA or Tukey's, etc.).  And then live with the results.

ff123
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: echo on 2004-08-23 23:30:44
Quote
- serious differences between lame --preset standard and -V2

Huh? I'm pretty sure you meant -V3 here. 

I'd also like to point out, for proper wording, that it is not the ANOVA test that shows that codec A is rated better than codec B or codec C. ANOVA just shows whether or not differences between the codecs exist, by means of the p-value. It is the post-hoc test (Fisher's protected LSD, or the Tukey test) that shows exactly where these differences are.

Thanks for a nice test guru!
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: eagleray on 2004-08-24 01:22:49
From the discussion I notice the difficulty of ABX comparisons at high bitrates among highly developed codecs.

By the way Guruboolez, how is the pate (sp)?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Adie on 2004-11-19 15:59:17
Quote
Quote
Quote
MPC is still confined to computer, or in best case on PDA – and is maybe doomed to this limited usage.


It would be wonderful if this best case were true, but no: on my Palm I can only listen to MP3, Ogg Vorbis and WMA. And I know the same applies to PocketPC, besides some obscure AAC player. Musepack is unfortunately really confined to computers.
[a href="index.php?act=findpost&pid=225084"][{POST_SNAPBACK}][/a]

Hopefully, not for long.  See here (http://www.hydrogenaudio.org/forums/index.php?showtopic=23362).
[a href="index.php?act=findpost&pid=225125"][{POST_SNAPBACK}][/a]


You can always listen using BetaPlayer, which handles mpc on PPC (it's great, I can play movies from the PC over wifi)

BTW, I'm using vorbis 1.1 from Rarewares, and the setting "-q5 --advanced-encode-option impulse_noisetune=-5" is really almost transparent to me in most cases. It causes bigger bitrate fluctuation because it uses short blocks more frequently and thus adds more "texture" to the encoding. I've ABXed it using SoundStorm connected via S/PDIF to a Sony amplituner with Pascal speakers and found that mpc at q6 sounds duller than vorbis (but around q8/q9 mpc is also good). Sometimes I have to switch to q6, but only when the source CD is very well mastered. Tried mpc 1.14beta
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: music_man_mpc on 2004-11-19 16:34:37
Quote
I've ABXed it using SoundStorm connected via S/PDIF to a Sony amplituner with Pascal speakers and found that mpc at q6 sounds duller than vorbis (but around q8/q9 mpc is also good). Sometimes I have to switch to q6, but only when the source CD is very well mastered. Tried mpc 1.14beta
[a href="index.php?act=findpost&pid=255004"][{POST_SNAPBACK}][/a]

I find this very hard to believe.  Could you use ABC/HR (http://ff123.net/abchr/abchr.html) and post your results please?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Adie on 2004-11-19 21:52:03
Quote
Quote
I've ABXed it using SoundStorm connected via S/PDIF to a Sony amplituner with Pascal speakers and found that mpc at q6 sounds duller than vorbis (but around q8/q9 mpc is also good). Sometimes I have to switch to q6, but only when the source CD is very well mastered. Tried mpc 1.14beta
[a href="index.php?act=findpost&pid=255004"][{POST_SNAPBACK}][/a]

I find this very hard to believe.  Could you use ABC/HR (http://ff123.net/abchr/abchr.html) and post your results please?
[a href="index.php?act=findpost&pid=255010"][{POST_SNAPBACK}][/a]


As soon as I return to my hometown. I'm currently studying in another city.

Update: I've ABXed mpc q6 a little using ABC/HR and tested against vorbis q5 impulse; I must admit that the dullness in the high frequencies which I heard before was virtually inaudible.
I think I should try with mpc q5, because vorbis had 165 kbps and mpc 224 kbps. And of course try to ABX vorbis. I'll try to find more time and test it more.
***
ABC/HR Version 1.0, 6 May 2004
Testname: roxette

1L = C:\CHIP\temp\019B4DD3,06.mpc.wav
2L = C:\CHIP\temp\019B4DD3,06.ogg.wav

---------------------------------------
General Comments:

---------------------------------------
1L File: C:\CHIP\temp\019B4DD3,06.mpc.wav
1L Rating: 4.5
1L Comment:
---------------------------------------
2L File: C:\CHIP\temp\019B4DD3,06.ogg.wav
2L Rating: 4.7
2L Comment:
---------------------------------------
ABX Results:
Original vs C:\CHIP\temp\019B4DD3,06.mpc.wav
    7 out of 10, pval = 0.172

***
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: shadowking on 2004-11-20 00:47:34
You haven't ABXed it at all; 7/10 is useless.. try 8/8, 14/16, 16/16

You need pval < 3% over several trials to be credible here.
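For reference, the guessing probabilities behind those scores (a one-sided binomial computation of my own):
Code: [Select]
from math import comb

def abx_p(correct, trials):
    """Chance of at least `correct` right answers out of `trials` by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

for c, t in [(7, 10), (8, 8), (14, 16), (16, 16)]:
    print(f"{c}/{t}: p = {abx_p(c, t):.4f}")
# 7/10 -> 0.1719 (the 0.172 in the log above); 8/8 -> 0.0039;
# 14/16 -> 0.0021; 16/16 -> 0.0000 (1 in 65536)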
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Adie on 2004-11-20 09:07:49
Quote
You haven't ABXed it at all; 7/10 is useless.. try 8/8, 14/16, 16/16

You need pval < 3% over several trials to be credible here.
[a href="index.php?act=findpost&pid=255072"][{POST_SNAPBACK}][/a]

I've admitted in the previous post that I heard almost no difference this time. So these results are OK (they just don't prove anything). I will try with other samples, or admit that I can't get credible results. Anyway whatever comes I will stick with vorbis because of smaller filesize and portability. I wish there was a foobar2000 port for PPC.
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: guruboolez on 2004-11-20 09:15:36
Quote
(...) Anyway whatever comes I will stick with vorbis because of smaller filesize and portability. [a href="index.php?act=findpost&pid=255118"][{POST_SNAPBACK}][/a]

For transparent encodings?
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: Adie on 2004-11-20 09:31:03
Quote
Quote
(...) Anyway whatever comes I will stick with vorbis because of smaller filesize and portability. [a href="index.php?act=findpost&pid=255118"][{POST_SNAPBACK}][/a]

For transparent encodings?
[a href="index.php?act=findpost&pid=255119"][{POST_SNAPBACK}][/a]


The transparency bitrate level for vorbis (with impulse_noisetune) seems to be lower than for mpc. Realistically speaking, I should use a lossless encoder for that, but those are not portable and don't produce small filesizes (with vorbis you can have one virtually transparent file and listen to it on your computer, PDA, Xbox and a few mp3 players).
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: abbe learning on 2006-08-25 19:10:14
Quote
Quote
MPC is still confined to computer, or in best case on PDA – and is maybe doomed to this limited usage.


It would be wonderful if this best case were true, but no: on my Palm I can only listen to MP3, Ogg Vorbis and WMA. And I know the same applies to PocketPC, besides some obscure AAC player. Musepack is unfortunately really confined to computers.
[a href="index.php?act=findpost&pid=225084"][{POST_SNAPBACK}][/a]

Hopefully, not for long.  See here (http://www.hydrogenaudio.org/forums/index.php?showtopic=23362).



This is ALMOST true. For some of the better portable players an open source firmware update exists. It is called "Rockbox" and you can find it here: http://www.rockbox.org/ (http://www.rockbox.org/).

Of course your point is still valid, as the majority of players do not support MPC, but as you see a few can be modded to do so.

I haven't got any portable player at the moment because I listen to very good audio quality at home and don't want to sacrifice that experience on the road. Then I learned about Rockbox and the quality of the D/A converter in some of the players that can be modded with Rockbox, and when I have some money that will be my solution to portable audio (with some more than decent headphones, naturally  )

Yours for now,

Abbe
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: ExUser on 2006-08-25 19:17:55
Holy thread necromancy!
Title: MPC vs OGG VORBIS vs MP3 at 175 kbps
Post by: indybrett on 2006-08-25 20:16:37
Rockbox, eh?  Never heard of it