Topic: New Public Multiformat Listening Test (Jan 2014)


Reply #76
greynol already included that in his list.


Reply #77
Then I guess that's it for my request.

I'd still love to see FhG from Winamp 5.666 tested at 96 kbit/s, but it doesn't seem to make much sense to me, and I don't want IgorC to feel obligated to do something that he likely thinks is a waste of time.

Let that conclude my unwelcome visit to this discussion.


Reply #78
I'm not sure what else there would be. I remember Vorbis scoring lower than Nero here.

Um... mp3PRO? WMA Pro? Does anyone use those anymore? Hope not...

My choices:
Apple AAC @ 80 kbps
Opus 1.1 @ 80 kbps (or even less, owing to its mostly-transparent performance at those rates)
LAME 3.99.5 MP3 @ 96 kbps (could be the low anchor?)
FhG AAC or FDK AAC @ 80 kbps

Awaiting further discussion on which Fraunhofer encoder to use. I'd say whichever has a brighter future...


Reply #79
If the Winamp encoder in the latest Winamp release is current, that's great.

Are there relevant differences between the libfdk_aac that you sold to Google and this encoder?

Yes, Winamp 5.666 has the latest AAC encoder quality-wise; there are no new quality tunings ready for release.

I'll let you know when quality is improved. Or just ask.

The Winamp/Sonnox/... encoder has a completely different code base from fdkaac and is a bit better tuned, especially for VBR.

Chris

Chris, how are you?
Let's clear up our doubts.

Were there improvements (not bugfixes) that improved the audible quality of your AAC encoder at 96 kbps in the last 2 years?
If the answer is yes, could you please indicate on which samples, because I really fail to find any audible difference.

There was only one sample (which was actually submitted to you by me), "On the roof with Quasimodo", that is coded differently by different versions of your encoder. But there is still no audible difference for me.

Thank you.



It's for our information. The FhG encoder will be included anyway if enough people want it.


Reply #80
Were there improvements (not bugfixes) that improved the audible quality of your AAC encoder at 96 kbps in the last 2 years?
If the answer is yes, could you please indicate on which samples, because I really fail to find any audible difference.

There was only one sample (which was actually submitted to you by me), "On the roof with Quasimodo", that is coded differently by different versions of your encoder. But there is still no audible difference for me.

If the answer is no, I will no longer care to see AAC/FhG/fhgaacenc in the test; only AAC/Apple, AAC/FhG/FDK, and Opus, still @96.

Chris, where was the tuning performed? In which bitrate range, if I can put it that way? Thanks.


Reply #81
Best of breed of the modern codecs: Apple AAC to represent AAC, the latest Vorbis reference encoder, Opus 1.1, and the latest LAME.

All at ~96 kbps, though 80 kbps might be easier to test, as most of these codecs (sans MP3) are pushing into transparent territory at 96 kbps.

It would also be nice to have a LAME encode at ~128 kbps as a reference, to drive home the quality advantage of the modern codecs over MP3 even when MP3 has a significant bitrate advantage, if any.


Reply #82
SE's test results didn't agree with yours so they must be wrong,


Yes. Either there's a flaw in SE's test, or in ours. It has been pointed out what the issue with the SE test is (limited, non-representative, biased sample selection). It's up to you to point out what could be wrong with IgorC's previous test if you believe the results are invalid. If they aren't invalid, the other test has to be wrong.

Quote
The next time I see data from two different tests that aren't in agreement, I'll just ask you to point me in the right direction.  The scientific method of repeating an experiment to confirm the results be damned.


The idea is to set up the test so it can be repeated and subsequent tests will be in agreement. Do you have any arguments why this would not be the case? It would be nice to verify it, but there are already enough candidate codecs and not enough people with time to run the tests, so I see no reason to repeat without a good argument.

If you believe you saw a flaw in the test setup that invalidates the result and could change the outcome on a repeat, speak up now so we can see if it can be fixed. If not, then what's your argument in the first place?

The entire point of doing statistical analysis instead of just reporting mean scores is to ensure that a subsequent test gives the same result even when there is random variance in listener ratings and samples.
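
To make that concrete, here is a minimal sketch of that kind of analysis in Python (with made-up ratings; this is not the script actually used for the HA tests): a paired bootstrap over the rating differences yields a confidence interval, and only a difference whose interval excludes zero is expected to survive a re-run.
Code:
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired ratings, one per (listener, sample) pair; the numbers
# are invented and the pairing order matches in both arrays.
codec_a = np.array([4.4, 4.1, 4.6, 3.9, 4.3, 4.5, 4.2, 4.0])
codec_b = np.array([4.2, 4.3, 4.4, 3.7, 4.1, 4.6, 4.0, 3.8])

diff = codec_a - codec_b                     # paired rating differences
boot = rng.choice(diff, size=(10_000, diff.size)).mean(axis=1)
lo, hi = np.percentile(boot, [2.5, 97.5])    # 95% bootstrap CI for the mean

print(f"mean difference: {diff.mean():+.3f}, 95% CI: [{lo:+.3f}, {hi:+.3f}]")
# If the interval excludes zero, a repeat with similar listeners and samples
# should reproduce the sign of the difference; if it straddles zero, a repeat
# can legitimately come out either way.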


Reply #83
Sorry for asking a rather foolish question.
I have almost zero knowledge in this area, but considering a listening test as a kind of sampling survey, what is considered the "population" here when computing the reliability/validity of the result?

Considering both the human (subject) and the audio sample (object) as parameters, is it:
1. "population" = all individuals in the world
2. "population" = all songs and non-songs in the world (kind of ridiculous, looks impossible)
or something like that?


Reply #84
For (2) I'd say stuff that's generally considered to be "music". There is quite some codec research regarding speech, but our tests steer clear of that.

For (1), yes, although our listener selection is obviously biased toward the HA audience. So the population is probably the generic audio-enthusiast nerd with a PC, above-average listening equipment, and in many cases some training with respect to typical encoder artifacts.

So the question is really what the best codec is for the "discerning" listener to encode his music.


Reply #85
For (2) I'd say stuff that's generally considered to be "music". There is quite some codec research regarding speech, but our tests steer clear of that.

Thanks.
Taking speech or other non-music into consideration would make the population completely indefinable, so that makes sense to me. (Even with music only, I can't imagine how large the class will become...)


Reply #86
Pardon, the discussion is temporarily stopped.
It will probably be better for everyone if I take it to a different place. It will take us some time.

Please stay in touch. Here is my mail
igoruso at gmail dot com


Reply #87
For (2) I'd say stuff that's generally considered to be "music". There is quite some codec research regarding speech but our tests steer clear of that.


I appreciated the two spoken samples (3 & 15) in the HA2011 test. While it's much less likely I'd use this test's bitrates (80, 96, 128 kbps) for speech, I would very much like at least one sample to be single-voice chanting / a cappella singing, like sample 4 in that test.


Reply #88
For (2) I'd say stuff that's generally considered to be "music". There is quite some codec research regarding speech, but our tests steer clear of that.


I appreciated the two spoken samples (3 & 15) in the HA2011 test. While it's much less likely I'd use this test's bitrates (80, 96, 128 kbps) for speech, I would very much like at least one sample to be single-voice chanting / a cappella singing, like sample 4 in that test.


When I say speech I mean just that, i.e. not singing/chanting/a cappella; like what you have in a radio show in between the music. Speech can be encoded extremely effectively at much lower bitrates than music, so it's a bit of a different area, codec-wise. (Things like Opus and USAC switch to a different mode to handle it.)

I don't actually know if pure speech codecs handle singing.


Reply #89
You're dismissing the best available listening evidence

You're dismissing the SE result, which doesn't exactly agree. I guess I'll have to take you at your word as to why that is.

The whole point of this is that FhG could beat Apple in a re-match, especially when it tied Apple in a perfectly valid test, personal attacks against me aside.

I would like to see such a re-match.

Let me quote the last AAC public listening test. http://listening-tests.hydrogenaudio.org/i...-a/results.html
Code:
        Nero      CVBR      TVBR       FhG        CT  low_anchor
       3.698     4.391     4.342     4.253     4.039     1.545

Here, CVBR and TVBR have slightly higher average scores, with unadjusted p-values of 0.002 and 0.059. So it's not totally unthinkable for FhG to beat Apple in a re-match, although it's quite unlikely. But even in that case, FhG beating Apple by a significant margin is unlikely; the difference, if it exists, is less than 0.100. The difference is tiny. Do you really care?
(1) Is there a statistically significant difference? (2) Is it a big difference? These questions are not the same, and typically (2) is more important.
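
A synthetic illustration of the distinction, sketched in Python (all numbers invented, and a paired t-test merely standing in for the test's actual analysis): with enough paired ratings, a 0.05-point difference on the 5-point scale becomes highly significant, yet it remains far too small to matter.
Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 2000                                  # pretend we have 2000 paired ratings
codec_a = rng.normal(4.30, 0.50, n)       # hypothetical scores on a 1-5 scale
codec_b = codec_a - 0.05 + rng.normal(0.0, 0.20, n)  # 0.05 worse on average

t_stat, p_value = stats.ttest_rel(codec_a, codec_b)
print(f"(1) p-value: {p_value:.2e}")                 # highly significant...
print(f"(2) mean difference: {np.mean(codec_a - codec_b):+.3f}")  # ...but tiny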

I'm deeply skeptical about the SE result, because sometimes MP2 wins, Opus is statistically tied with LAME, and then there's this: http://slashdot.org/story/09/03/11/153205/...s-of-mp3-format



Reply #90
SE's test results didn't agree with yours so they must be wrong,


Yes. Either there's a flaw in SE's test, or in ours. It has been pointed out what the issue with the SE test is (limited, non-representative, biased sample selection). It's up to you to point out what could be wrong with IgorC's previous test if you believe the results are invalid. If they aren't invalid, the other test has to be wrong.


Those HA and SE @96 tests have different versions of the participating codecs, slightly different settings, different sample sets, different ways of presenting stimuli to testers, and obviously different types of participants. How can you expect exactly the same results from both tests? IMHO they correlate well with all this in mind.
keeping audio clear together - soundexpert.org


Reply #91
As it's only about whether or not the FhG AAC encoder should participate: the results of the HA test are so close that IMO FhG could of course win another listening test.
Not even the confidence intervals say that Apple AAC is better than FhG, and these only make a statement about this particular test, with the specific samples used and the listeners participating.
Sure, if we assume (for good reasons) that the test was conducted well, we would not expect a codec like Nero, which came out much worse in that test, to win a new test; but nobody has considered testing Nero here as far as I can see.
lame3995o -Q1.7 --lowpass 17


Reply #92
Those HA and SE @96 tests have different versions of the participating codecs, slightly different settings,


This is relevant, but IIRC some results are incompatible even with the same or nearly the same versions.

Quote
different sample sets, different ways of presenting stimuli to testers, and obviously different types of participants. How can you expect exactly the same results from both tests? IMHO they correlate well with all this in mind.


These should not be relevant. The type of listeners is already a bias, as stated in previous posts in this thread, and one which neither of us can get around. If the selection of samples has an influence, that means a bad bias in their selection that invalidates the test (and that's the exact problem I have with your test!).

The way the stimuli are presented shouldn't affect the result. If it does, that's a flaw again. But you're not amplifying artifacts any more, right?


Reply #93
As it's only about whether or not the FhG AAC encoder should participate: the results of the HA test are so close that IMO FhG could of course win another listening test.


The whole point we've been trying to explain is that this should be impossible if both tests are conducted correctly. We know our selection of listeners is biased, and that could affect things. However, I wouldn't expect that to make a difference between two AAC codecs; rather, a test with generic listeners will produce higher ratings on average, because more people are unable to discern differences. And it remains to be seen whether the audience on SE is wildly different from the one here.

Let me state it again: if you repeat the test, you should get a compatible result. If someone else runs a similar test, they should get a compatible result. That's the whole point of the test setup. If you can run the same test and get another result, what's the point of running a test in the first place?

Quote
Not even the confidence intervals say that Apple AAC is better than FhG


This is downright false: FhG is worse than CVBR (p=0.005).



Reply #94
these only make a statement about this particular test, with the specific samples used and the listeners participating.


No, they don't. They would if they were treated as the entire population, but they're analyzed as a sample of the population. What you say is both wrong and irrelevant. These are really basic things.

http://en.wikipedia.org/wiki/Statistical_sampling

To illustrate the difference: we KNOW that CVBR>TVBR for those specific samples and those specific listeners, because that's exactly what was tested and we can see the result. But the variance of the result indicates that this ordering may not hold for all music samples and every person-with-a-PC-and-interested-in-audio, so this wasn't concluded from the test. On the other hand, CVBR was concluded to be better than FhG because the result indicates that if you reran the test 200 times, with a similarly representative sample selection and a similar, but not necessarily identical, set of listeners, FhG would win only once and lose 199 times.
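
The "200 times" reading can be illustrated with a crude simulation, sketched below in Python: resample the per-sample scores (taken from the table Kamedo2 posts in Reply #97 below) with replacement to mimic drawing a new, similarly representative sample set, and count how often FhG ends up ahead of CVBR. A real re-run would also redraw listeners, so this bootstrap over per-sample means is only an illustration of the idea, not the test's actual analysis.
Code:
import numpy as np

rng = np.random.default_rng(0)

# Per-sample mean scores for CVBR and FhG, from the table in Reply #97.
cvbr = np.array([4.22, 4.47, 3.51, 4.52, 4.53, 4.58, 4.10, 4.62, 4.27, 4.30,
                 4.28, 4.67, 4.54, 4.32, 4.54, 4.70, 4.41, 4.79, 4.26, 4.72])
fhg  = np.array([4.23, 4.52, 3.34, 4.73, 3.97, 4.62, 4.53, 4.49, 4.72, 4.24,
                 3.96, 4.35, 4.08, 4.29, 4.18, 3.98, 4.49, 5.00, 4.11, 3.43])

runs = 10_000
idx = rng.integers(0, cvbr.size, size=(runs, cvbr.size))  # resampled test sets
fhg_wins = (fhg[idx].mean(axis=1) > cvbr[idx].mean(axis=1)).mean()

print(f"FhG ahead of CVBR in {fhg_wins:.1%} of simulated re-runs")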


Reply #95
As it's only about whether or not the FhG AAC encoder should participate: the results of the HA test are so close that IMO FhG could of course win another listening test.

All you do is look at the average score and draw conclusions based on that.

You can download the results: http://listening-tests.hydrogenaudio.org/i...ous/results.zip
You will find quite enough people who rated Apple significantly higher than FhG, and fewer results that preferred FhG, and those not significantly.

As far as I can see, only Kamedo2 took up the job and had a closer look.



Reply #97
Visualization of the last (2011) AAC 96kbps public listening test results.
http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/

[Close-up plot of the interesting section; unlike the previous post, one plot denotes one music track.]


Online visualization tool: http://zak.s206.xrea.com/bitratetest/graphmaker4.htm
Code:
Nero	CVBR	TVBR	FhG	CT	low_anchor
3.64 4.22 4.69 4.23 3.71 1.60
4.05 4.47 4.13 4.52 3.46 1.41
3.30 3.51 3.24 3.34 3.20 1.60
3.57 4.52 4.55 4.73 4.41 2.42
4.04 4.53 4.54 3.97 4.43 1.33
4.19 4.58 4.59 4.62 4.65 1.52
3.65 4.10 4.32 4.53 3.85 1.47
3.83 4.62 4.41 4.49 4.18 1.67
3.62 4.27 4.26 4.72 3.91 1.60
3.66 4.30 4.34 4.24 4.26 1.72
3.82 4.28 4.21 3.96 4.13 1.58
3.48 4.67 4.37 4.35 3.81 1.48
4.13 4.54 4.64 4.08 4.24 1.50
3.42 4.32 4.40 4.29 4.10 1.34
3.60 4.54 4.72 4.18 3.69 1.51
3.92 4.70 4.52 3.98 4.26 1.44
3.85 4.41 4.55 4.49 4.57 1.32
3.67 4.79 4.37 5.00 4.83 1.42
3.08 4.26 3.78 4.11 3.96 1.25
3.34 4.72 4.65 3.43 3.88 1.27
%samples 01 - Reunion Blues
%samples 02 - Castanets
%samples 03 - Berlin Drug
%samples 04 - Enola Gay
%samples 05 - Mahler
%samples 06 - Toms Diner
%samples 07 - I want to break free
%samples 08 - Skinny2a
%samples 09 - Fugue Premikres notes
%samples 10 - Jerkin Back n Forth
%samples 11 - Blackwater
%samples 12 - Dogies
%samples 13 - Convulsion
%samples 14 - Trumpet
%samples 15 - A train
%samples 16 - Enchantment
%samples 17 - Experiencia
%samples 18 - Male speech
%samples 19 - Smashing Pumpkins - Earphoria
%samples 20 - on the roof with Quasimodo
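
For anyone who wants to poke at these numbers without the online tool, here is a small Python sketch that recomputes per-codec statistics from the table above. The column means land near, though not exactly on, the overall averages quoted in Reply #89, since those were computed over the individual ratings rather than over per-sample means.
Code:
import numpy as np

codecs = ["Nero", "CVBR", "TVBR", "FhG", "CT", "low_anchor"]

# Per-sample mean scores pasted above (rows = the 20 samples, columns = codecs).
scores = np.array([
    [3.64, 4.22, 4.69, 4.23, 3.71, 1.60],
    [4.05, 4.47, 4.13, 4.52, 3.46, 1.41],
    [3.30, 3.51, 3.24, 3.34, 3.20, 1.60],
    [3.57, 4.52, 4.55, 4.73, 4.41, 2.42],
    [4.04, 4.53, 4.54, 3.97, 4.43, 1.33],
    [4.19, 4.58, 4.59, 4.62, 4.65, 1.52],
    [3.65, 4.10, 4.32, 4.53, 3.85, 1.47],
    [3.83, 4.62, 4.41, 4.49, 4.18, 1.67],
    [3.62, 4.27, 4.26, 4.72, 3.91, 1.60],
    [3.66, 4.30, 4.34, 4.24, 4.26, 1.72],
    [3.82, 4.28, 4.21, 3.96, 4.13, 1.58],
    [3.48, 4.67, 4.37, 4.35, 3.81, 1.48],
    [4.13, 4.54, 4.64, 4.08, 4.24, 1.50],
    [3.42, 4.32, 4.40, 4.29, 4.10, 1.34],
    [3.60, 4.54, 4.72, 4.18, 3.69, 1.51],
    [3.92, 4.70, 4.52, 3.98, 4.26, 1.44],
    [3.85, 4.41, 4.55, 4.49, 4.57, 1.32],
    [3.67, 4.79, 4.37, 5.00, 4.83, 1.42],
    [3.08, 4.26, 3.78, 4.11, 3.96, 1.25],
    [3.34, 4.72, 4.65, 3.43, 3.88, 1.27],
])

# Mean and range across the 20 samples for each codec.
for name, col in zip(codecs, scores.T):
    print(f"{name:>10}: mean {col.mean():.3f}  min {col.min():.2f}  max {col.max():.2f}")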


Reply #98
...All you do is look at the average score and draw conclusions based on that...

I did in my argumentation here, but personally I am much more interested in the quality of the particular samples.
Look at Kamedo's graphs for Enola Gay (FhG shines here) and Mahler (FhG has weaknesses here).
In another test with a sample selection similar, but not identical, to the samples used here, a small variation in samples can cause a relevant change in the test outcome.
As for the listeners it's similar, especially as far as the percentage of very experienced listeners is concerned (the less experienced listeners smooth differences out, as they often judge sample issues as imperceptible). What we can also learn from Kamedo's graphs above is that experienced users are differently sensitive to the various artifacts. If you have a look at that, it's clear that this is a most relevant factor for variation in test results.
IMO the (scientifically correct) statistics over all the samples (averages and confidence intervals) give a feeling of safe judgement about encoder quality which is misleading. Looking at the outcome of all the listeners for the various samples gives an impression of this.

What I'm trying to say is: these listening tests are meaningful, but we shouldn't take them as gospel (at least not without looking at the detailed results of all the listeners for all the samples). In case two encoders turn out to have a very similar outcome, we should take them both as participants for a new test, especially as there seems to be serious interest in both of them.
lame3995o -Q1.7 --lowpass 17


Reply #99
In case two encoders turn out to have a very similar outcome, we should take them both as participants for a new test, especially as there seems to be serious interest in both of them.

Isn't that a contradiction? If two codecs were found to be so close, another test won't change anything, because each time there will be people who want to re-test, arguing exactly the same "small difference".

As I can see, you have some expertise in listening tests, so you can define goals and organize a new test; a different one that will resolve your doubts.

I'm considering taking the preparation to another place; it won't be here on HA. So it's all yours.