Topic: Public Listening Test [2010]

Public Listening Test [2010]

Reply #100
Am I understanding correctly that the constrained VBR setting in QT is more likely to be tested than the "constrained" iTunes VBR? I'd prefer iTunes, because then the test would tell whether using QT is worthwhile.


QT, constrained VBR, at medium quality is identical to iTunes' VBR. The only exception is the iTunes Plus preset, which is identical to QT, 256 kbit/s constrained VBR, max quality.

In general, does anyone have an opinion on which method should be used for checking the AAC bitrates? Should we blindly trust the applications that just read the header data?


If you want to verify which application reports bitrates with the highest precision, extract a raw AAC stream from an MP4/M4A (e.g. with mp4box) and divide the number of bytes by the number of seconds.
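As a sketch of that calculation, assuming the raw stream has already been extracted (e.g. with mp4box), the file size and duration used below are hypothetical:

```python
def stream_bitrate_kbps(num_bytes: int, duration_seconds: float) -> float:
    """Average bitrate of a raw (header-free) AAC stream in kbit/s."""
    return num_bytes * 8 / duration_seconds / 1000

# Hypothetical: a 3,932,160-byte raw stream from a 240-second track
print(stream_bitrate_kbps(3_932_160, 240))  # 131.072
```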

Even if it is more work, I'd vote for hand-selected quality settings. Run each encoder in a FOR loop with a few increments from q 0.38 to q 0.45 and then choose the version closest to 128 kbit/s, instead of using the same preset for the whole test. The primary goal is to see how encoders compare at 128 kbit/s. How well their VBR algorithms scale up for problematic content cannot be tested in a fixed-bitrate comparison, anyway.
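The selection step could be sketched like this, assuming the average bitrate for each quality setting has already been measured over the test set (the settings and bitrates below are hypothetical):

```python
# Hypothetical measured average bitrates (kbit/s) per quality setting,
# e.g. gathered by encoding the test set once per setting in a loop.
measured = {0.38: 121.4, 0.40: 126.9, 0.42: 130.8, 0.45: 138.2}

target = 128.0  # kbit/s
# Pick the setting whose measured average is closest to the target.
best_q = min(measured, key=lambda q: abs(measured[q] - target))
print(best_q)  # the quality setting closest to 128 kbit/s
```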

Public Listening Test [2010]

Reply #101
Alex B,

I've counted the votes and opinions of HA's experienced members regarding the 96/128 kbps poll topic. The results are 50/50.
So both are still fine.

CVBR vs TVBR settings to test:
CVBR (identical to iTunes) = qtaacenc --cvbr --normal
TVBR = qtaacenc --tvbr --highest

I compare file sizes; this way the results are independent of the application (foobar2000, MrQ, etc.).

Public Listening Test [2010]

Reply #102
Even if it is more work, I'd vote for hand-selected quality settings. Run each encoder in a FOR loop with a few increments from q 0.38 to q 0.45 and then choose the version closest to 128 kbit/s, instead of using the same preset for the whole test. The primary goal is to see how encoders compare at 128 kbit/s. How well their VBR algorithms scale up for problematic content cannot be tested in a fixed-bitrate comparison, anyway.

The test should reflect real-life conditions. Usually a user won't use a different -q value per song. It's not realistic.

Public Listening Test [2010]

Reply #103
QT, constrained VBR, at medium quality is identical to iTunes' VBR. The only exception is the iTunes Plus preset, which is identical to QT, 256 kbit/s constrained VBR, max quality.

IMHO, the constrained VBR samples should then be encoded with iTunes and labeled as such. It would be a rather unwise move not to include a well-known brand like iTunes. Of course a test is not a commercial product that needs marketing, but surely it would be good if the test gained more publicity.

Quote
If you want to verify, which application reports with the highest precision, extract a raw AAC stream from an MP4/M4A (e.g. with mp4box) and divide the number of bytes by number of seconds.

I could do that. My sets contain only 25 + 25 carefully selected files, so that would not be too much work.

Quote
Even if it is more work, I'd vote for hand-selected quality settings. Run each encoder in a FOR loop with a few increments from q 0.38 to q 0.45 and then choose the version closest to 128 kbit/s, instead of using the same preset for the whole test.

I hope you don't mean that each sample & each encoder should be adjusted individually to produce 128 kbps or as close as possible.

Quote
The primary goal is to see how encoders compare at 128 kbit/s. How well their VBR algorithms scale up for problematic content cannot be tested in a fixed-bitrate comparison, anyway.

The primary goal is probably going to be
"to see how encoders compare at a setting that produces an average bitrate of 131 kbps or as close as possible when a big varied audio library is encoded."

The exact target bitrate depends on those encoders that cannot be adjusted precisely. Their average bitrate should be calculated, and then the encoders that can be freely set should be tested in order to find the matching setting.

Public Listening Test [2010]

Reply #104
Bitrate limit, How much VBR is allowed?

The discussion has indicated that everybody agrees to test VBR without any limits.

Public Listening Test [2010]

Reply #105
I've counted the votes and opinions of HA's experienced members regarding the 96/128 kbps poll topic. The results are 50/50.
So both are still fine.
Well, if you don't want to end up frustrated because the result of the test's long and hard work is a statistical tie at 4.5, especially if you apply strict post-processing, go for 96k. The latter hasn't been publicly tested recently, unlike 128k.

Moreover, bear in mind that with rigorous post-processing, however noble the idea is, you'll have even fewer results, widening the statistical error margins even further and making the all-tied end result a self-fulfilling prophecy. Sebastian's most recent public listening test, 128k MP3, conducted a little over a year ago, averaged just 27 test results per sample... Would you really want to put so much energy into testing and post-processing for only 10 to 20 results?

Another reminder quote from Sebastian:
Well, this was definitely the last test at 128 kbps, that is for sure.

Public Listening Test [2010]

Reply #106
..., go for 96k. The latter hasn't been publicly tested recently, unlike 128k.

When was the last time an AAC-only test with only critical samples was done at 128 kbps?

Quote
Moreover, bear in mind that with rigorous post-processing, however noble the idea is, you'll have even fewer results, widening the statistical error margins even further, ...

Incorrect. By rigorous post-screening, you reduce the statistical error margins because the results become more consistent. Of course, you shouldn't end up with only a handful of accepted listeners.
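The claim can be illustrated with the standard error of the mean: screening out inconsistent listeners reduces the spread of grades, which can outweigh the loss in listener count. The numbers below are hypothetical:

```python
import math

def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the mean grade over n results."""
    return std_dev / math.sqrt(n)

# Hypothetical numbers: 40 raw results with a large spread vs.
# 25 screened, more consistent results.
raw = standard_error(1.2, 40)       # wide spread, many listeners
screened = standard_error(0.7, 25)  # tighter spread, fewer listeners
print(screened < raw)  # True: the margin shrank despite fewer results
```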

Quote
Would you really want to put in so much energy in testing and post-processing for only 10 to 20 results?

Yes, anything higher than 10 after post-screening is a very useful number.

Quote
Another reminder quote from Sebastian:
Well, this was definitely the last test at 128 kbps, that is for sure.


I would say the same if I had to test with non-critical samples.

Quote from: IgorC
The discussion has indicated that everybody agrees to test VBR without any limits.

Careful! That discussion had nothing to do with the choice of encoders for this test. I was just asking if people mind excessive bitrates in VBR encoders.

Quote
Imagine a situation where there is codec A which has very noticeable artifacts on a sample and is ranked at a pretty low score, while codec B did a good job on lowpassing an old noisy record. Codec B could be ranked higher than the lossless reference -> the reference can be ranked lower than 5.0, as described in http://www.rarewares.org/rja/ListeningTest.pdf

This would not be a result which we are interested in! For an encoder, sounding better than the original is not a goal. If it sounds better than the original, it's not transparent and hence must be graded lower than 5.0.

Quote from: Alex B
It would be a rather unwise move not to include a well-known brand like iTunes. Of course a test is not a commercial product that needs marketing, but surely it would be good if the test gained more publicity.

Agreed. Of QT and iTunes, the latter is arguably the more widely used software for encoding (after all, AAC is the default codec for CD ripping in iTunes). Why not just use iTunes CVBR in the test and leave the ABR-CVBR-TVBR discussion for a separate test? The corresponding poll shows a tie.

Quote
The primary goal is probably going to be
"to see how encoders compare at a setting that produces an average bitrate of 131 kbps or as close as possible when a big varied audio library is encoded."

The exact target bitrate depends on those encoders that cannot be adjusted precisely. Their average bitrate should be calculated, and then the encoders that can be freely set should be tested in order to find the matching setting.

Agreed. I think it's time to decide on the "varied audio library" for calibrating the VBR coders. That would also allow me to tune the average bitrate of Fraunhofer's AAC VBR encoder to match those of iTunes and DivX, for example (in case Fraunhofer's encoder is one day compared against those encoders).

A spontaneous proposal from my side is Pink Floyd's 2-CD best-of "Echoes" because musically, it's very diverse (loud and quiet, tonal and noisy stuff), and it has no silence between tracks (it's all one mix).

Chris
If I don't reply to your reply, it means I agree with you.

Public Listening Test [2010]

Reply #107
The rules:
Code:
Remove all listeners from analysis who
1. graded the reference lower than 4.5
2. graded the low anchor higher than all competitors.
3. didn't grade the low anchor.
4. didn't grade any of competitors.

Would you remove the results for all samples from that user?

Chris has suggested changing the 1st rule to "graded the reference lower than 5.0" here.
I tend to disagree.
Imagine a situation where there is codec A which has very noticeable artifacts on a sample and is ranked at a pretty low score, while codec B did a good job on lowpassing an old noisy record. Codec B could be ranked higher than the lossless reference -> the reference can be ranked lower than 5.0, as described in http://www.rarewares.org/rja/ListeningTest.pdf

The result could be:
Code:
Codec A - 3.0
Reference A - 5.0

Codec B - 5.0
Reference B - 4.9


It's a valid result for me.

Also imagine people guessing for one encoder that is transparent, and you discard only those guesses where they were unlucky. All those lucky guesses stay and the encoder gets a lower grade than it deserves. I still hold the opinion that an ABX test for each codec on each sample should be mandatory when one wants to give a grade.
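The four screening rules quoted at the top of this post could be sketched as a small filter; the result structure and field names below are assumptions for illustration:

```python
def passes_screening(result: dict) -> bool:
    """Apply the four post-screening rules to one listener's result for
    one sample. `result` maps stimulus names to grades (None = ungraded);
    the key names are illustrative assumptions, not the test's format.
    """
    ref = result.get("reference")
    anchor = result.get("low_anchor")
    graded = [g for k, g in result.items()
              if k not in ("reference", "low_anchor") and g is not None]
    if ref is None or ref < 4.5:         # rule 1: reference graded < 4.5
        return False
    if anchor is None:                   # rule 3: low anchor not graded
        return False
    if not graded:                       # rule 4: no competitor graded
        return False
    if all(anchor > g for g in graded):  # rule 2: anchor above everything
        return False
    return True

ok = {"reference": 5.0, "low_anchor": 1.5, "codec_a": 3.0, "codec_b": 4.2}
bad = {"reference": 4.0, "low_anchor": 1.5, "codec_a": 3.0}
print(passes_screening(ok), passes_screening(bad))  # True False
```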

Public Listening Test [2010]

Reply #108
muaddib,

It's an important observation indeed.
I think we are all human and have the right to make a mistake sometimes, but not always. I would accept valid results from a listener with previous invalid ones.
I propose the following solution:
a) If a listener submits an invalid result, he will be informed and will have one more, unique possibility to submit a result for that particular sample (but now with an ABX log).
b) If the listener submits 3 or more invalid results, then only ABX results will be accepted from him/her.


About lucky guessing.
Those lucky guesses will cancel each other out, because all encoders have the same probability of being lucky-guessed. Obviously the average scores will be a bit lower, but it can't affect one particular codec without affecting the rest of the competitors... because that's exactly how a lucky guess works -> no privileges for one particular competitor.
Now, why do I say no to ABX but yes to ABC/HR in this particular test?
1. From my previous blind test experience (and I want to hear the opinions of other listeners here) I can say ABX is actually an exhausting activity. The listener (at least me) will rather lose concentration after ABXing all competitors against lossless and won't be able to grade the competitors among themselves.
2. We will test difficult samples, so it should be easier to spot the artifacts.
I would rather accept the possibility of a lucky guess, but also higher chances of getting useful results.
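For context, the probability of passing an ABX run by pure guessing is straightforward to compute; a common 5-trial run gives roughly a 3% chance:

```python
def lucky_guess_probability(trials: int) -> float:
    """Chance of getting every trial right by pure 50/50 guessing."""
    return 0.5 ** trials

print(lucky_guess_probability(5))  # 0.03125, i.e. about a 3% chance
print(lucky_guess_probability(8))  # 0.00390625
```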


Public Listening Test [2010]

Reply #110
Chris,

I think the TVBR vs CVBR comparison should take place. People want to see it (see the poll). The poll is closed and I think we shouldn't discuss it anymore. It doesn't matter if it's tied. There are simply more people who want to see it. There is a very long and popular discussion about the efficiency of T/CVBR here on HA.

Public Listening Test [2010]

Reply #111
I was under the impression that it is common for portable audio players to have a VBR limit?


That's all cleanly defined in the MPEG specification, nothing to worry about.

Is this not a concern? Is unrestrained VBR truly unrestrained or just very high?


Unconstrained VBR in the context of this discussion means unconstrained variability, not an unconstrained max bitrate.

Public Listening Test [2010]

Reply #112
1. From my previous blind test experience (and I want to hear the opinions of other listeners here) I can say ABX is actually an exhausting activity. The listener (at least me) will rather lose concentration after ABXing all competitors against lossless and won't be able to grade the competitors among themselves.
2. We will test difficult samples, so it should be easier to spot the artifacts.
I would rather accept the possibility of a lucky guess, but also higher chances of getting useful results.

ABX is an exhausting activity if it is hard for someone to spot the artifact, that is, if a sample is difficult for him. Difficulty is determined by the observer.
ABX can be used to help users decide what grade to give.
For example, a user should give a grade below 4 only if one listen to the original is enough for doing a 5/5 ABX.
A grade between 4 and 4.5 should be given only if the user doesn't make a mistake and it is not hard for him to get 5/5, but he is allowed to listen to the original before each try.
A grade below 3 should be given only if there is no need to listen to the original for doing a 5/5 ABX.
And so on...
This way the SDG becomes less vague.
I am not proposing a rule to refuse all results without ABX, but rather a description of a listening test procedure that would make people give valid results.
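Purely as an illustration, the proposed bands could be condensed into a lookup; the category names and boundaries below merely restate the suggestion above and are not an official scale:

```python
def max_allowed_grade(abx_difficulty: str) -> float:
    """Upper bound on the SDG a listener should give, based on how hard
    the 5/5 ABX was for them. Categories and bounds are illustrative,
    restating the forum proposal rather than any standard.
    """
    bounds = {
        "no_reference_needed": 3.0,  # artifact obvious without the original
        "one_listen_enough": 4.0,    # one listen to the original suffices
        "reference_each_try": 4.5,   # needs the original before each trial
    }
    return bounds[abx_difficulty]

print(max_allowed_grade("one_listen_enough"))  # 4.0
```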

Public Listening Test [2010]

Reply #113
The rules:
Code:
Remove all listeners from analysis who
1. graded the reference lower than 4.5
2. graded the low anchor higher than all competitors.
3. didn't grade the low anchor.
4. didn't grade any of competitors.

Would you remove the results for all samples from that user?

Initially, I would have said yes, but since the rules seem to have become more strict, I'd say we only remove the results for that particular sample.

Quote
Also imagine people guessing for one encoder that is transparent, and you discard only those guesses where they were unlucky. All those lucky guesses stay and the encoder gets a lower grade than it deserves. I still hold the opinion that an ABX test for each codec on each sample should be mandatory when one wants to give a grade.

Good point. It's a listening-time vs. reliability tradeoff. Certainly, an extra ABX test for each sample would get rid of such "false results", but it would make the test much longer. Due to the latter, I'm not sure yet whether I want extra ABX tests.

Igor, I wasn't questioning the necessity of comparing CVBR and TVBR. Of course people want to see it (myself included). But I think we should make a separate test out of it. I don't mind at all if we would make that one public as well. We could even do that Apple-only test before the multi-company test. Then we can take the winner of that test (if there is one) and put it on the multi-company test. Btw, this should have been an option in the poll, I think.

Chris
If I don't reply to your reply, it means I agree with you.

Public Listening Test [2010]

Reply #114
Chris, we have been talking about encoders for about a month, and maybe it is a little bit late for such drastic changes.
I don't think there is a good chance for more than one AAC test. Take into account that it's not a multiformat test, so we will get fewer results. Even fewer if it is only an Apple AAC multi-settings test. After just one codec-specific test, the interest in public tests can drop drastically.

Conducting a single AAC public test will already be a hard task in my opinion.

We can make poll to ask people or something else.

I think it will be better if we determine the list of AAC encoders together, with a new deadline, maybe until the 7th of February?

I don't see a reason why we should conduct separate tests. The number of good AAC encoders is actually small:
Nero, TVBR, CVBR, (CT vs DivX pre-test).

Public Listening Test [2010]

Reply #115
Three cheers for rpp3po!

Even if it is more work, I'd vote for hand-selected quality settings. Run each encoder in a FOR loop with a few increments from q 0.38 to q 0.45 and then choose the version closest to 128 kbit/s, instead of using the same preset for the whole test. The primary goal is to see how encoders compare at 128 kbit/s. How well their VBR algorithms scale up for problematic content cannot be tested in a fixed-bitrate comparison, anyway.


Why do you guys actually need the 96/128 kbps poll?
128 is only about 33% larger than 96, and most of you don't mind an encoder using 150% of the nominal bitrate: Bitrate limit, How much VBR is allowed?

It's okay that consumers want to evaluate encoders in the modes they regularly use. But I am surprised that people like C.R.Helmrich and muaddib believe in the magic of encoder frontend settings. There are very few principal encoding parameters in audio (bitrate, bit-reservoir mode, fs) and it would be easy to conduct a fair comparison. Ridiculous, but you don't want to fix any of them.

Wouldn't it be fair to exclude the CT encoder? It only has a CBR mode and outputs ADTS, so you are going to steal another 2 kbps.

Public Listening Test [2010]

Reply #116
Regarding the correct and fair VBR settings for each individual encoder, here is one of my related replies that I posted when the previous public listening test was prepared:

http://www.hydrogenaudio.org/forums/index....st&p=593735

The complete thread * would be good reading for anyone who is interested in the preparatory actions that are needed before a public listening test can be launched. For instance, while the test was prepared we discovered a serious problem with the iTunes MP3 encoder. The problem had existed for several years, and only our discovery made Apple finally fix it (I hope it is fixed now).

* There was also a preceding 14-page discussion about a year earlier: http://www.hydrogenaudio.org/forums/index....showtopic=47313

I have posted some other links to older threads here:
Quote
While we wait for the test to begin, it might be useful to revisit the comments that were posted in the 64 kbps multiformat test's announcement thread: http://www.hydrogenaudio.org/forums/index....showtopic=56397

In that thread I made some suggestions about how the test presentation and instructions could be developed further: http://www.hydrogenaudio.org/forums/index....st&p=509971

In addition, the comments in the post-test thread make a good read: http://www.hydrogenaudio.org/forums/index....showtopic=56851

Public Listening Test [2010]

Reply #117
QT doesn't let its psymodel sort out >16 kHz content for ~128 kbit/s material, but chooses to lowpass completely (and gains the benefit of only needing 32 kHz). Is this development choice going to be honored, or are you planning to force a sample rate?

Public Listening Test [2010]

Reply #118
.alexander., rpp3po,

sorry, I don't understand what you're talking about. Already at 96 kb VBR, iTunes gives me 44.1-kHz MP4 files. Using a 32 kHz sampling rate at 128 kb or more is a bad idea, anyway. Pre-echo-sensitive people like /mnt will tell you why.

C.R.Helmrich and muaddib believe in the magic of encoder frontend settings because they know that, in principle, there are about 100 encoding parameters in AAC. There's a reason why consumers are allowed to access only very few of them. If a codec developer decides (after hundreds of hours of testing) to use a certain default sampling rate for a given bitrate, why should we disallow that?

Wouldn't it be fair to exclude the CT encoder? Fair to whom? Stealing another 2 kbps? From where? It's included in the 128 kb.

Chris
If I don't reply to your reply, it means I agree with you.

Public Listening Test [2010]

Reply #119
I was referring to this post. Is it false information? I didn't verify it before posting.

I just noticed that for Q values <59, the output is resampled to 32 kHz. For me, Q 59 results in 139 kbps on average. Isn't resampling done at much lower bitrates with mp3?

Public Listening Test [2010]

Reply #120
I have just verified singaiya's report. QuickTime, left at its default settings for TVBR and CVBR, downsamples automatically below Q59. Apple's bare-bones afconvert front-end produces the following output:

Code:
rpp3po:Desktop rpp3po$ afconvert test.wav -f m4af -s 3 -d aac -u vbrq 58 -o test.m4a -v
Input file: test.wav, 11393676 frames
strategy = 3
user property 'qrbv' = 58
Formats:
  Input file     2 ch,  44100 Hz, 'lpcm' (0x0000000C) 16-bit little-endian signed integer
  Output file    2 ch,      0 Hz, 'aac ' (0x00000000) 0 bits/channel, 0 bytes/packet, 0 frames/packet, 0 bytes/frame
  Output client  2 ch,  44100 Hz, 'lpcm' (0x0000000C) 16-bit little-endian signed integer
AudioConverter 0x595004 [0x10012d490]:
  CodecConverter 0x0x10013e1e0
    Input:   2 ch,  44100 Hz, 'lpcm' (0x0000000C) 16-bit little-endian signed integer
    Output:  2 ch,  32000 Hz, 'aac ' (0x00000000) 0 bits/channel, 0 bytes/packet, 1024 frames/packet, 0 bytes/frame
    codec: 'aenc'/'aac '/'appl'
    Input layout tag: 0x650002
    Output layout tag: 0x650002
Optimizing test.m4a... done
Output file: test.m4a, 8267520 frames


Code:
rpp3po:Desktop rpp3po$ afconvert test.wav -f m4af -s 3 -d aac -u vbrq 59 -o test.m4a -v
Input file: test.wav, 11393676 frames
strategy = 3
user property 'qrbv' = 59
Formats:
  Input file     2 ch,  44100 Hz, 'lpcm' (0x0000000C) 16-bit little-endian signed integer
  Output file    2 ch,      0 Hz, 'aac ' (0x00000000) 0 bits/channel, 0 bytes/packet, 0 frames/packet, 0 bytes/frame
  Output client  2 ch,  44100 Hz, 'lpcm' (0x0000000C) 16-bit little-endian signed integer
AudioConverter 0x59a004 [0x10012cc00]:
  CodecConverter 0x0x10013d910
    Input:   2 ch,  44100 Hz, 'lpcm' (0x0000000C) 16-bit little-endian signed integer
    Output:  2 ch,  44100 Hz, 'aac ' (0x00000000) 0 bits/channel, 0 bytes/packet, 1024 frames/packet, 0 bytes/frame
    codec: 'aenc'/'aac '/'appl'
    Input layout tag: 0x650002
    Output layout tag: 0x650002
Optimizing test.m4a... done
Output file: test.m4a, 11393676 frames


.alexander., rpp3po,

sorry, I don't understand what you're talking about. Already at 96 kb VBR, iTunes gives me 44.1-kHz MP4 files. Using 32 kHz sampling rate at 128 kb or more is a bad idea, anyway. Pre-echo sensitive people like /mnt will tell you why.


I don't understand why you have addressed both posts together. My point wasn't related.

If a codec developer decides (after hundreds of hours of testing) to use a certain default sampling rate for a given bitrate, why should we disallow that?


Does that mean that you would prefer to use Apple's default or not?

Could you please point me to a reference on why having only 16 kHz of bandwidth makes it harder to avoid pre-echo?

Public Listening Test [2010]

Reply #121
Could you please point me to a reference on why having only 16 kHz of bandwidth makes it harder to avoid pre-echo?


The bandwidth doesn't matter, but the temporal resolution of the codec will be different, due to the constant block size regardless of sample rate.

Public Listening Test [2010]

Reply #122
Ah, thanks, I didn't think about that. The blocks last much longer.
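The difference is easy to quantify: an AAC long block is 1024 samples regardless of sample rate, so at 32 kHz each block spans more time than at 44.1 kHz:

```python
def block_duration_ms(block_samples: int, sample_rate_hz: int) -> float:
    """Duration of one transform block in milliseconds."""
    return block_samples / sample_rate_hz * 1000

print(block_duration_ms(1024, 44100))  # ~23.2 ms at 44.1 kHz
print(block_duration_ms(1024, 32000))  # 32.0 ms at 32 kHz
```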

Public Listening Test [2010]

Reply #123
For some reason Olympic sprinters don't run in casual wear, even though they can regularly walk barefoot or wear suits. In my opinion it would be a good idea to compare streams having equal bitrates (amounts of bits), not streams produced using similarly spelled settings.

C.R.Helmrich and muaddib believe in magic of encoder frontend settings because they know that, in principal, there are about 100 encoding parameters in AAC. There's a reason why consumers are allowed to access only very few of them.


100 parameters per encoder is roughly 400 for this test. And that's why I propose to focus on the parameters of the bitstreams. Actually there are fewer than 100 data_elements in the AAC-LC syntax (see 4.148).

If a codec developer decides (after hundreds of hours of testing) to use a certain default sampling rate for a given bitrate, why should we disallow that?


You are right, and I appreciate the invested hundreds of hours of testing. Though resampling isn't a primary AAC compression technique, and some applications require a fixed sample rate.

Wouldn't it be fair to exclude CT encoder? Fair to whom? Stealing another 2kbps? From where? It's included in the 128 kb.


Note that in previous tests the CT streams were in ADTS, while the others targeted the bitrate of the raw data. The CT encoder could use the 56 bits of each ADTS header for Huffman codes. And I kindly ask you to increase the CT bitrate up to 130 kbps.
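That overhead estimate can be checked directly: each AAC frame carries 1024 samples, so at 44.1 kHz there are about 43 frames per second, and a 56-bit ADTS header per frame costs roughly 2.4 kbps:

```python
def adts_overhead_kbps(sample_rate_hz: int, header_bits: int = 56,
                       frame_samples: int = 1024) -> float:
    """Bitrate consumed by per-frame ADTS headers, in kbit/s."""
    frames_per_second = sample_rate_hz / frame_samples
    return frames_per_second * header_bits / 1000

print(adts_overhead_kbps(44100))  # ~2.41 kbps, matching the ~2 kbps claim
```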

Public Listening Test [2010]

Reply #124
In my opinion it would be good idea to compare streams having equal bitrates (amount of bits), but not produced using similarly spelled settings.

That won't happen, I promise you.
Why?
99% of the HA community are on the same page that a codec should be tested without any bitrate restriction while producing the same average bitrate over a large enough set of files.

Quote
And I kindly ask you to increase CT bitrate upto 130 kbps.

Yes, CT's bitrate will be shifted to ~130 kbps.