Topic: New Public Multiformat Listening Test (Jan 2014)

New Public Multiformat Listening Test (Jan 2014)

Reply #325
... The codecs produce the expected VBR bitrates over a large corpus. Why does it matter for what reason they're varying their bitrates over the test set?

Exactly: it absolutely doesn't "matter for what reason they're varying their bitrates over the test set". And if it doesn't matter, then such variation should be removed from the test. Otherwise it is not clear why such variation, which has no meaning, is present in the test setup. It has no meaning; that is the point. It only spoils the results, being just a bias.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #326
Regarding sample selection:

Can we assume the sample pool of the 2011 test is included? In any case, I (still) recommend the test set I constructed in 2010, which Igor already kindly mentioned here:
http://www.hydrogenaudio.org/forums/index....st&p=695576
IIRC only BerlinDrug was actually chosen from that list in the 2011 test. One of the samples, CantWait, is stereo-miked a-cappella male singing, which nicely fits the category TheBashar suggested here.

I can provide samples which aren't available on HA any more.

Regarding VBR behavior:

Why worry about the VBR behavior now? In the 2011 96-kbps test most coders behaved identically over the entire sample pool and over the actually tested subset of samples (FhG, CVBR, and Dolby all ended up at an average of 100 kbps; TVBR was chosen as close to 100 kbps as possible). Only the Nero encoder differed a bit, but it's not included in the 2014 test.

The point of VBR is that a codec can spend more bits where it needs to. Serge is now advocating that the workings of VBR be "filtered" out of the test?

I understood something different, namely that the actually tested samples shall be coded with the target bit-rate on average. Meaning: of course the codecs can still run with VBR, but their average behavior shall be adjusted to the set of test samples. But like I said, in the previous test it didn't matter except for the Nero encoder (meaning no re-adjustment was really necessary), so I recommend focusing on the sample selection for now, and taking a look at the average bit-rates only once the test set is completed.

One question about a special scenario, though. Let's assume that all codecs' VBR settings were calibrated on a very large sample set S, incl. a handful of samples X where a codec greatly increases the bit-rate to a factor of, say, 1.5 times the target (sample-set-average) bit-rate. Let's also assume that the number of samples X is so small compared to S that their removal from S doesn't affect the calculation of the average VBR bit-rate over S. Now let's also assume that one or more of the samples X are included in the listening test set L, which – since L is smaller than S – leads to the average VBR rate over L (incl. X) being quite a bit larger than the average VBR rate over S (also incl. X).
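To make the arithmetic concrete, here is a toy sketch of that scenario in Python (all numbers are invented for illustration):

[code]
# Toy illustration of the S/X/L scenario above (all numbers invented).
# S: large calibration set, X: few high-bitrate outliers, L: small test set.
target = 96.0                      # target average bitrate in kbps
S_regular = [target] * 995         # 995 ordinary samples coded near target
X = [1.5 * target] * 5             # 5 outliers boosted to 1.5x the target

S = S_regular + X
avg_S = sum(S) / len(S)            # ~96.24 kbps: X barely moves the S average

L = [target] * 18 + X[:2]          # 20-sample test set containing 2 outliers
avg_L = sum(L) / len(L)            # 100.8 kbps: X clearly inflates the L average

print(f"average over S: {avg_S:.2f} kbps")
print(f"average over L: {avg_L:.2f} kbps")
[/code]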

In such a scenario, I conclude that there is no penalty for a codec which greatly boosts the bit-rate on some item (e.g. up to a factor of 1.5), when compared to another codec which does not boost the VBR rate on the same item (e.g. stays close to 1.0), even if both codecs end up providing (roughly) the same audio quality. Again, the a-priori bit-rate calibration is assumed to not have revealed this behavior, since set X is much smaller than S.

Loosely following this thread I also conclude that most contributors find it acceptable that, given such a scenario, there is no such penalty. Is this correct?

Chris
If I don't reply to your reply, it means I agree with you.

New Public Multiformat Listening Test (Jan 2014)

Reply #327
Concerning sample selection.

Quote
It would be perfect ground if we could select tracks randomly from the population. But this is impossible in practice; it would need tons of research to perform.

I outlined such a method, which is not very complicated, earlier in this thread. The problem with the current method is that it is biased towards music that is more popular with our audience. The upshot of that flaw is that it makes the results more, not less, meaningful for our readers, although you're free to point out that the results are biased towards popular rather than unpopular music when discussing the result.


So we have the listening test.

[blockquote](1) This test is not aimed at comparing the efficiency of VBR encoding; it compares encoders at specific settings.

(2) What are these settings and why are they important to us? Because these settings provide almost equal bitrates with the aggregated music material of lvqcl, Gecko, kamedo2, Kohlrabi and the two previous tests. We also believe that this music material is pretty common for our forumers, and that's why these settings are interesting to compare (s4).
[/blockquote]
If both statements are correct, then the only possible method of choosing sound samples is to pick them randomly from that aggregated sound material. As all the material is at hand, there is no problem performing almost perfect sampling (random and sufficient).

Once the test set is properly selected, the bitrates of all codecs will inevitably be equal (magic, isn't it?). And this equality will indicate that the test set is representative. That is the correct design of the test.
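As a sketch of what such a selection procedure could look like (the pool layout, retry limit and tolerance are my assumptions, purely for illustration):

[code]
# Hypothetical sketch: randomly draw a test set from the aggregated pool and
# check that each encoder's average bitrate over the draw matches its average
# over the whole pool within a tolerance.
import random

def sample_test_set(pool, size, bitrates_per_encoder, tolerance_kbps=1.0):
    """pool: list of track ids; bitrates_per_encoder: {encoder: {track: kbps}}."""
    for _ in range(1000):                     # retry until a fair draw is found
        candidate = random.sample(pool, size)
        ok = all(
            abs(sum(rates[t] for t in candidate) / size -
                sum(rates.values()) / len(pool)) <= tolerance_kbps
            for rates in bitrates_per_encoder.values()
        )
        if ok:
            return candidate
    raise RuntimeError("no representative draw found; raise size or tolerance")
[/code]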

If we want to test encoders with some predefined set of samples, then that is another (different) listening test. In this case the use of any external music material is irrelevant in the context of the test. And there are only two options for setting up the encoders – providing equal bitrates for the test set (s0) or using natural (integer) ones (s1), depending on what we want to compare in the test – efficiency or popular settings.

We need to decide what kind of results we want to see.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #328
Well,

Robert, lvqcl and Garf disagree with you.

So my answer to you is: No.

New Public Multiformat Listening Test (Jan 2014)

Reply #329
So if a codec consumes more bits with this particular test set, it is probably considered smart enough to spot problem samples and increase the bitrate for them to preserve the required quality. That is a valid hypothesis, but there is an alternative one – the codec requires more bits than the other contenders for this test set because its VBR algorithm is less efficient. You can't choose which hypothesis is true until you get the scores of perceptual quality.

The beauty of VBR is that it assigns more bits where they are needed while taking bits away from where they are needed less. If the VBR algorithm is less efficient and fruitlessly puts more bits in random places, it will surely take bits away from where they are needed. That's why many immature and poorly-tuned VBR encoders exist. The concern is very real. I'm currently putting a lot of effort into improving FFmpeg's native AAC encoder, both CBR and VBR. The CBR mode is constantly getting better, but the VBR mode distributes more bits to random noise and fewer bits to tonal samples (which need more bits). It is a disaster, and many tonal samples collapse or degrade. The inefficiency is, of course, detectable by the traditional HA method.
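As a toy model of that trade-off (all numbers invented): with a fixed total bit budget, every bit wasted on frames that don't need it is a bit taken from frames that do, and the overall distortion rises.

[code]
# Toy model: fixed bit budget, per-frame distortion ~ difficulty / bits.
# A mistuned VBR that over-spends on easy frames starves the hard ones.
need = [1, 1, 1, 8, 8]             # hypothetical coding difficulty per frame
                                   # (last two frames are "tonal", i.e. hard)

def distortion(bits):
    return sum(n / b for n, b in zip(need, bits))

good = [2, 2, 2, 7, 7]             # 20 bits total, allocation follows difficulty
bad  = [5, 5, 5, 2, 3]             # 20 bits total, wasted on the easy frames

print(distortion(good))            # ~3.79: hard frames get enough bits
print(distortion(bad))             # ~7.27: hard frames collapse
[/code]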


Furthermore, calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like – what size it has, what structure, how it changes in time, and how to get access to all of it. “The whole music universe” is an absolutely unscientific quantity; we can only guess some of its properties.

Likewise, many "global" pharmaceutical or other investigations involving humans don't test North Koreans. If we wanted to draw generalized conclusions about women, we would have to test all women on the planet, which is impossible, or randomly pick a large enough number of women across the globe to test. So, strictly speaking, they would have to pick one North Korean woman per 300 women if we wanted a generalized statement about women. That step is typically omitted. Still, the results are typically highly applicable to North Korean women, and foreign humanitarian medical aid has produced many positive results. We are all Homo sapiens. We act like humans, we do what humans like to do, and we create what is pleasing to human beings.


New Public Multiformat Listening Test (Jan 2014)

Reply #330
Hi all. I have something like a proposal, and also a question that should be thoroughly discussed (in my opinion).

You know that many recordings today use the full digital scale, up to 0 dBFS. And after lossy encoding, in almost all cases we get samples with a level higher than 0 dBFS. First of all, if the encoder itself uses fixed-point processing, these samples will be lost already during the encoding process. But AFAIK the encoders which take part in our test allow floating point, so encoding will be fine at any level.
But let's see what happens then, especially in the "real world". People just encode their recordings into some format and then, for example, upload them to their portable players. Again, AFAIK almost all portable equipment uses decoders with fixed-point processing. So if we have samples with levels much higher than 0 dBFS (1.00000), we will get deep clipping on such equipment. And this clipping really can be audible (for example, I once successfully passed an ABX test comparing a clipped MP3 with a peak of about 1.36 against the same MP3 with the clipping removed).
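A minimal sketch of the effect I mean (synthetic signal; the 1.1 peak is just an assumed example):

[code]
# Sketch: a float decode with peaks above full scale survives intact, but a
# fixed-point (int16) output stage must clamp those peaks, i.e. clip them.
import numpy as np

t = np.linspace(0, 1, 48000, endpoint=False)
decoded = 1.1 * np.sin(2 * np.pi * 440 * t)     # float decode, peak 1.1 > 0 dBFS

pcm = np.clip(decoded, -1.0, 1.0)               # what an int16 output path must do
pcm_int16 = (pcm * 32767).astype(np.int16)

clipped = np.count_nonzero(np.abs(decoded) > 1.0)
print(f"{clipped} of {decoded.size} samples clipped "
      f"({100 * clipped / decoded.size:.1f}%)")
[/code]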

So my question is: maybe we should consider clipping as part of the quality losses and as a flaw of the encoding algorithm. If so, we must not take any action to prevent clipping (like attenuation before or after encoding).

I think it really makes sense, because it makes our test closer to real-life conditions.

I would like to know what all of you think about it.

add:
On the other hand, we can consider clipping only as a problem of the decoder, not the encoder. In that case we take into account only irretrievable losses.
🇺🇦 Glory to Ukraine!

New Public Multiformat Listening Test (Jan 2014)

Reply #331
The inefficiency is, of course, detectable by the traditional HA method.

I'm sure that if the test were properly designed it could detect inefficiency even better.

Furthermore, calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like – what size it has, what structure, how it changes in time, and how to get access to all of it. “The whole music universe” is an absolutely unscientific quantity; we can only guess some of its properties.

Likewise, many "global" pharmaceutical or other investigations involving humans don't test North Koreans. If we wanted to draw generalized conclusions about women, we would have to test all women on the planet, which is impossible, or randomly pick a large enough number of women across the globe to test. So, strictly speaking, they would have to pick one North Korean woman per 300 women if we wanted a generalized statement about women. That step is typically omitted. Still, the results are typically highly applicable to North Korean women, and foreign humanitarian medical aid has produced many positive results. We are all Homo sapiens. We act like humans, we do what humans like to do, and we create what is pleasing to human beings.

Of course we can get some idea about the music universe using a limited amount of music. The problem with the current listening test design is (using your analogy) that the pharma company studies the preferences of women on a global scale, sampling them from different countries, and then tests its products on North Korean women.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #332
...AFAIK almost all portable equipment uses decoders with fixed-point processing. So if we have samples with levels much higher than 0 dBFS (1.00000), we will get deep clipping on such equipment. And this clipping really can be audible...

This is a real-life problem which should be covered by the RG mechanism. We shouldn't blame an encoder, IMO, if its lossy encoding peaks happen to be higher than with other encoders. We should rather use RG on the test samples whenever necessary.
lame3995o -Q1.7 --lowpass 17

New Public Multiformat Listening Test (Jan 2014)

Reply #333
Quote
This is a real-life problem which should be covered by the RG mechanism

ReplayGain can prevent clipping only if it receives floating-point data. Otherwise the samples are already clipped. This is if you mean the RG mechanism in some hardware players. If you mean foobar2000's RG – of course, that easily helps to prevent clipping. But anyway, this requires additional processing, not just decoding.
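A sketch of why the order of operations matters (the signal and the -3 dB gain are assumed values, purely for illustration):

[code]
# Sketch: applying a ReplayGain attenuation before vs. after the fixed-point
# stage. Once the signal has been clamped to full scale, gain comes too late.
import numpy as np

t = np.linspace(0, 1, 48000, endpoint=False)
decoded = 1.1 * np.sin(2 * np.pi * 440 * t)   # float decode, peak above 0 dBFS
gain = 10 ** (-3 / 20)                        # assumed -3 dB RG attenuation

# Float path: attenuate first, then quantize -> no clipping (peak ~0.78).
safe = np.clip(decoded * gain, -1.0, 1.0)

# Fixed-point path: clamp first (at 1.0), then attenuate -> the peaks are
# already flattened; the gain cannot restore them.
too_late = np.clip(decoded, -1.0, 1.0) * gain

print(np.max(np.abs(decoded * gain)))         # ~0.78, below full scale
print(np.allclose(safe, too_late))            # False: clipping already happened
[/code]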

🇺🇦 Glory to Ukraine!

New Public Multiformat Listening Test (Jan 2014)

Reply #334
Regarding VBR behavior:

Why worry about the VBR behavior now? In the 2011 96-kbps test most coders behaved identically over the entire sample pool and over the actually tested subset of samples (FhG, CVBR, and Dolby all ended up at an average of 100 kbps; TVBR was chosen as close to 100 kbps as possible). Only the Nero encoder differed a bit, but it's not included in the 2014 test.

I see a big difference between two cases: encoder settings that have been chosen correctly, and encoder settings that turned out to be correct (coincided with the correct ones). The procedure should be clearly defined and have a clear meaning.

One question about a special scenario, though. Let's assume that all codecs' VBR settings were calibrated on a very large sample set S, incl. a handful of samples X where a codec greatly increases the bit-rate to a factor of, say, 1.5 times the target (sample-set-average) bit-rate. Let's also assume that the number of samples X is so small compared to S that their removal from S doesn't affect the calculation of the average VBR bit-rate over S. Now let's also assume that one or more of the samples X are included in the listening test set L, which – since L is smaller than S – leads to the average VBR rate over L (incl. X) being quite a bit larger than the average VBR rate over S (also incl. X).

In such a scenario, I conclude that there is no penalty for a codec which greatly boosts the bit-rate on some item (e.g. up to a factor of 1.5), when compared to another codec which does not boost the VBR rate on the same item (e.g. stays close to 1.0), even if both codecs end up providing (roughly) the same audio quality. Again, the a-priori bit-rate calibration is assumed to not have revealed this behavior, since set X is much smaller than S.

Loosely following this thread I also conclude that most contributors find it acceptable that, given such a scenario, there is no such penalty. Is this correct?


Seems I'm the only one who thinks there should be a penalty. Such a penalty is called bias correction after the test. But it's much better to avoid such a situation altogether. If L is properly sampled from S there will be no such problem – the average VBR rates will be equal (with some error which can be controlled by varying the size of L). At the moment there is no relation between S and L (L doesn't belong to the population S); as a result, the VBR rates of different encoders over the set L have random variance. In the HA@96 listening test the max. difference between rates is 8% (93-101). This is a fundamentally incorrect test design. You can't set up codecs with one music material but test them with another.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #335
The inefficiency is, of course, detectable by the traditional HA method.

I'm sure that if the test were properly designed it could detect inefficiency even better.

I don't think fine-tuning the q-value on every sample is "properly designed". Most users don't do that. We can increase the set of samples to address the concern about correlation.

Furthermore, calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like – what size it has, what structure, how it changes in time, and how to get access to all of it. “The whole music universe” is an absolutely unscientific quantity; we can only guess some of its properties.

Likewise, many "global" pharmaceutical or other investigations involving humans don't test North Koreans. If we wanted to draw generalized conclusions about women, we would have to test all women on the planet, which is impossible, or randomly pick a large enough number of women across the globe to test. So, strictly speaking, they would have to pick one North Korean woman per 300 women if we wanted a generalized statement about women. That step is typically omitted. Still, the results are typically highly applicable to North Korean women, and foreign humanitarian medical aid has produced many positive results. We are all Homo sapiens. We act like humans, we do what humans like to do, and we create what is pleasing to human beings.

Of course we can get some idea about the music universe using a limited amount of music. The problem with the current listening test design is (using your analogy) that the pharma company studies the preferences of women on a global scale, sampling them from different countries, and then tests its products on North Korean women.

Your concern could be addressed by the concept of 'effect size'. We test the North Korean women and the non-North Korean women, and if the effect size is zero or small, the study of the rest of the globe can safely be applied to the North Korean women as well. By the way, if you believe the effect size is big, it's your job to demonstrate it. Otherwise you could go to a hospital, question the applicability of any therapy to nerds, queers, immigrants, or amateur golfers, and stop the therapy right there.
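For the record, a minimal sketch of one common effect-size measure, Cohen's d (the two score lists are invented):

[code]
# Sketch: Cohen's d = difference of means / pooled standard deviation.
import statistics

def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

group_a = [4.1, 4.3, 4.0, 4.2, 4.4]   # invented listening-test scores
group_b = [4.1, 4.3, 4.0, 4.2, 4.3]
print(cohens_d(group_a, group_b))     # ~0.14, small: the groups behave alike
[/code]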

New Public Multiformat Listening Test (Jan 2014)

Reply #336
I don't think fine-tuning the q-value on every sample is "properly designed". Most users don't do that. We can increase the set of samples to address the concern about correlation.

I mentioned that this scenario is unrealistic (but perfectly valid). Fine-tuning with the test set is the next option.

Your concern could be addressed by the concept of 'effect size'. We test the North Korean women and the non-North Korean women, and if the effect size is zero or small, the study of the rest of the globe can safely be applied to the North Korean women as well.

But not vice versa, when a study of North Korean women is applied to the rest of the globe, as is done in the current design of the listening test.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #337
Finally the puzzle of equal bitrates for VBR encodes is solved. Here is how.

A set of test samples L for a listening test can be obtained in two ways – (1) sampled from some population of music S, or (2) chosen independently according to some criteria, for example problem samples.

In case (1), if the test set is sampled properly (sufficiently and randomly), the target bitrates of encoders are equal over both S and L, so it doesn't matter with what sound material they are calculated – the whole population S or the selected samples L. If it is possible to find settings such that the target bitrates are equal, then the results of such a listening test show a comparison of the encoders' VBR efficiency. If the bitrates can't be set equal (due to the discontinuity of q-parameters), then the results of such a listening test show a comparison of encoders at specific settings. Such specific settings can be of only one kind – natural (integer) ones (as the bitrates can't be set equal over S and consequently over L, all other settings are just random, without any meaning).

In case (2), the test set L is already predefined and the population of music S from which it is sampled is undefined (a population of problem samples would be the best guess). Consequently there is no possibility of calculating bitrates (and the corresponding settings) over S. Any attempt to do this with some other music population leads to random variance of bitrates over the test set L, because the latter is not representative of that music population chosen out of the blue. That random variance in turn leads to variance of the results, making them less accurate. Thus in case (2) the target bitrates can be calculated only over the test set L (no other sound material is present in the context of such a listening test). As in the first case there are two choices – to make the bitrates equal for the test set L (the results then show a comparison of VBR encoders' efficiency) or to use natural (integer) values (the results then show a comparison of popular settings). All other settings are just random, without any meaning.
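As a sketch of how the bitrates could be made equal over a given test set L (the encode-and-measure callback, the q range and the monotonicity assumption are all mine, for illustration):

[code]
# Hypothetical sketch: bisect an encoder's continuous q-parameter until its
# average bitrate over the test set L matches the target.
def calibrate_q(avg_bitrate_for_q, target_kbps, q_lo, q_hi, tol_kbps=0.5):
    """avg_bitrate_for_q(q) encodes all of L at quality q and returns the
    average bitrate in kbps; assumed monotonically increasing in q."""
    for _ in range(50):
        q = (q_lo + q_hi) / 2
        rate = avg_bitrate_for_q(q)
        if abs(rate - target_kbps) <= tol_kbps:
            return q
        if rate < target_kbps:
            q_lo = q
        else:
            q_hi = q
    return (q_lo + q_hi) / 2
[/code]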

In case (1) the results of the listening test are biased towards the population of music S which was chosen for the test (some genre or a mix of them). In case (2) the results are biased towards the particular test set L.

Case (1) needs many more sound samples in the test set because the results are supposed to be generalized to the whole population S. All listening tests that were ever conducted belong to case (2) – the test set was chosen according to some criteria (problem samples, usual samples ...) but never sampled from some population as in (1). And the reason is quite obvious – with more samples (which case (1) needs) the test becomes labor-intensive, but the results are hardly better than with problem samples.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #338
I don't think fine-tuning the q-value on every sample is "properly designed". Most users don't do that. We can increase the set of samples to address the concern about correlation.

I mentioned that this scenario is unrealistic (but perfectly valid). Fine-tuning with the test set is the next option.

Fine-tuning on the test set and fine-tuning on a large set of samples both produce roughly the same result anyway. And I like it.

Your concern could be addressed by the concept of 'effect size'. We test the North Korean women and the non-North Korean women, and if the effect size is zero or small, the study of the rest of the globe can safely be applied to the North Korean women as well.

But not vice versa, when a study of North Korean women is applied to the rest of the globe, as is done in the current design of the listening test.

The testers and tested samples came from all over the world. Norway, France, Germany, Argentina, Japan.... Not a single person came from North Korea, though.

New Public Multiformat Listening Test (Jan 2014)

Reply #339
Loosely following this thread I also conclude that most contributors find it acceptable that, given such a scenario, there is no such penalty. Is this correct?

Chris,

Yes, we all agree about it. Serge Smirnoff argues with himself.




Loosely following this thread I also conclude that most contributors find it acceptable that, given such a scenario, there is no such penalty. Is this correct?


Seems I'm the only one who thinks there should be a penalty.

Finally you have realized it. Alleluia!





I will ask the moderators to split the thread. One thing is this particular test; another is the endless disagreement between Hydrogenaudio and SoundExpert.
We shouldn't suffer from it here.

New Public Multiformat Listening Test (Jan 2014)

Reply #340
I can provide samples which aren't available on HA any more.

Yes, please. Many samples have gone offline.

Regarding sample selection:

Can we assume the sample pool of the 2011 test is included? In any case, I (still) recommend the test set I constructed in 2010, which Igor already kindly mentioned here:
http://www.hydrogenaudio.org/forums/index....st&p=695576
IIRC only BerlinDrug was actually chosen from that list in the 2011 test. One of the samples, CantWait, is stereo-miked a-cappella male singing, which nicely fits the category TheBashar suggested here.

Agreed, this set of samples is fantastic. It will be great to see at least a good part of them (if not all of them).

New Public Multiformat Listening Test (Jan 2014)

Reply #341
ReplayGain can prevent clipping only if it receives floating-point data. Otherwise the samples are already clipped. ...

You make me worry. Do you know if the Rockbox RG mechanism does it right? Is it safe to assume that any player that provides an RG-based 'prevent clipping' option does it well?
Sorry for being OT for a moment.
lame3995o -Q1.7 --lowpass 17

New Public Multiformat Listening Test (Jan 2014)

Reply #342
Rather than work off a testimonial that does not satisfy the requirements of this forum, I would like to see proof that this will be a legitimate issue with the samples in question.

New Public Multiformat Listening Test (Jan 2014)

Reply #343
ReplayGain can prevent clipping only if it receives floating-point data. Otherwise the samples are already clipped. ...

You make me worry. Do you know if the Rockbox RG mechanism does it right? Is it safe to assume that any player that provides an RG-based 'prevent clipping' option does it well?
Sorry for being OT for a moment.

As far as I can see, Rockbox uses int32 as its internal sample format for DSP.

New Public Multiformat Listening Test (Jan 2014)

Reply #344
Rather than work off a testimonial that does not satisfy the requirements of this forum, I would like to see proof that this will be a legitimate issue with the samples in question.

I understand your caution. But something should be done to avoid a flood of 100+ line posts like this one. I don't even read them. It really slows down the discussion.

The organizers (including me) are open to criticism and suggestions. You can ask the people who are involved in the test.

Though let's face it: watch out where the criticism comes from. SoundExpert has received very negative criticism for its tests. And it's not just me.

He speaks here about "mathematical perfection": "a bitrate should be exactly the same".
To begin with, that doesn't take into account that different formats have different overheads. So there will be a 2-3% difference in overhead if the samples are short enough (10 seconds).
And it was only "to begin with..."

2-3%. So what has happened to the mathematical perfection? It's not perfect anymore. Not even close.
Math is a very good tool, but without good context and interpretation it has little value.
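A back-of-the-envelope check of that overhead argument (the per-file overhead figures are invented; real container overheads vary by format):

[code]
# Rough arithmetic: a fixed per-file container/header overhead weighs more on
# short clips. The overhead figures below are invented, not measured.
target_kbps = 96
clip_seconds = 10
payload_bytes = target_kbps * 1000 / 8 * clip_seconds   # 120,000 bytes

for overhead_bytes in (1200, 2400, 3600):   # hypothetical per-file overheads
    share = 100 * overhead_bytes / payload_bytes
    print(f"{overhead_bytes} bytes of overhead = {share:.1f}% of a 10 s clip")
[/code]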


New Public Multiformat Listening Test (Jan 2014)

Reply #346
I understand your caution. But...

Care to elaborate rather than use my comment as an excuse to berate Serge?

New Public Multiformat Listening Test (Jan 2014)

Reply #347
I don't get your reaction.

New Public Multiformat Listening Test (Jan 2014)

Reply #348
IgorC, in case you missed it: during this discussion I analysed your/HA listening test setup. This analysis has no connection to SE tests at all. I did this because I intuitively saw the flaw but couldn't prove my suspicions. Those long posts reflect my progress in the above analysis. It was research in real time, if you want. Finally I found the flaw and disclosed it in this post. In short, calculating the target bitrates with huge aggregated music material makes no sense in the context of the listening test and leads to inaccurate final scores of codecs. This became possible due to incorrect use of statistical methods (incorrect sampling). The flaw is serious and affects not only the current test but also previous ones (HA@64 and HA@96 at least); it does not completely invalidate them, but it changes the interpretation of the results and the generalizations. I call for a serious examination of the issue, as it is not too late yet. If this is a scientific discussion, let's discuss arguments and figures, not personalities.
keeping audio clear together - soundexpert.org

New Public Multiformat Listening Test (Jan 2014)

Reply #349
Quote
Are the details available of the normalisation method (and any other pre-processing) that will be used?

Normalization of the decoded .wav files is done in ABC/HR for Java: http://listening-tests.hydrogenaudio.org/i.../ABC-HR_bin.zip
It was mentioned before that there may be better ways to do normalization.
Steve Forte Rio has raised the question of pre-normalizing the source files before encoding as well.
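Since the thread doesn't pin down the exact method, here is a minimal sketch of one common option, plain peak normalization (the headroom parameter is my assumption; ABC/HR for Java may do something different):

[code]
# Minimal sketch of peak normalization; NOT necessarily what ABC/HR Java does.
import numpy as np

def peak_normalize(samples, headroom_db=0.0):
    """Scale a float signal so its peak reaches full scale minus headroom."""
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples
    target = 10 ** (-headroom_db / 20)
    return samples * (target / peak)
[/code]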