Multiformat Listening Test @ 64 kbps

Topic: Multiformat Listening Test @ 64 kbps - FINISHED (Read 132797 times) previous topic - next topic

0 Members and 1 Guest are viewing this topic.

Multiformat Listening Test @ 64 kbps - FINISHED

2007-08-16 00:00:17

The much awaited results of the Public, Multiformat Listening Test @ 64 kbps are ready - partially. So far, I only uploaded an overall plot along with a zoomed version. The details will be available tomorrow. You can also download the encryption key on the results page that is located here:

http://www.listening-tests.info/mf-64-1/results.htm
http://www.listening-tests.info/mf-64-1/resultsz.png

Nero and WMA Professional 10 are tied and WMA Professional 10 is tied to Vorbis. Vorbis however performed worse than Nero. Of course, High Anchor is best and Low Anchor loses.

This one goes to the experts: How would you rank codecs in such a situation, where A=B and B=C, but C<A?

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #1 – 2007-08-16 00:06:13

Wow, thanks a lot for posting so fast these results.
WMAPro is competitive against HE-AAC at 64 kbps... great result for this new format. What were Microsoft listening tests on this subject (I forgot it)?

EDIT: correct link is http://www.listening-tests.info/mf-64-1/results.htm

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #2 – 2007-08-16 00:27:31

Compare to the last 48kbit/s listening test, 64kbits will only bring slightly better results.

I-tunes at 96kbits ist transparent for most users on both tests.

WMA is not interesting for me.

Nero-AAC HE score was 3,64 points at 48kbits, now we can see 3,74 points at 64kbits.
This is not very impressive for me. I thought Nero will performe better at 64kbit/s.
Of course, it is still usable for e.g. portable devices or good quality webradio.

Vorbis is also better at 64kbits/ (3,16 to 3,32 points )

So i can go with itunes at 96kbit for high quality use (maybe nero performing better at this bitrate?), and 48-64kbits for medium quality use.

maybe 80kbits/s will hit a 4.xx score?

i think the next test should be a 96-112kbit multi-format test, also including Lame.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #3 – 2007-08-16 00:28:38

Very interesting, Sebastian. Congratulations, and thank-you very much!

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #4 – 2007-08-16 00:32:17

Quote from: guruboolez on 2007-08-16 00:06:13

EDIT: correct link is http://www.listening-tests.info/mf-64-1/results.htm

They're actually both correct, but now I agree that the first format which I posted doesn't make sense anymore since the listening tests have their own page. That htaccess redirection was good for the time where the tests were in subfolders of the MaresWEB site.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #5 – 2007-08-16 00:34:51

Nice!

I'm a little surprised that Vorbis is on par with the others. During the test I had a feeling it would be worse. Now I need to check my own results.

A QUESTION:

Pardon my ignorance, is there any automated way to combine my own decrypted txt results into one table?
(in order to feed it to the ff123's ANOVA calculator)

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #6 – 2007-08-16 00:45:24

All you need is Chunky! http://www.phong.org/chunky/

And if you need a guide:

http://www.rarewares.org/rja/ListeningTest.pdf

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #7 – 2007-08-16 00:48:39

Quote from: Sebastian Mares on 2007-08-16 00:45:24

All you need is Chunky! http://www.phong.org/chunky/

And if you need a guide:

http://www.rarewares.org/rja/ListeningTest.pdf

Thanks!

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #8 – 2007-08-16 01:21:41

My personal results:

Code: [Select]

WMAPro	high	Vorbis	low	HEAAC
2.3	3.7	2.0	1.0	3.2
2.0	3.0	1.5	1.0	2.5
2.0	2.5	2.5	1.0	1.7
2.8	4.3	3.2	1.5	3.8
2.5	4.5	2.8	1.0	1.8
2.7	2.5	2.0	1.0	1.5
1.8	5.0	1.5	1.0	3.0
1.8	3.5	3.0	1.0	2.2
2.0	3.5	3.0	1.0	2.3
3.5	3.0	3.0	1.0	2.0
2.0	3.0	2.0	1.0	1.7
1.5	2.3	1.3	1.0	1.5
4.0	3.0	3.5	1.0	4.0
3.5	3.0	2.5	1.0	2.8
2.1	1.5	3.0	1.0	2.0
3.0	4.5	2.0	1.5	3.0
1.2	3.5	2.0	1.0	1.5
3.5	3.0	2.0	1.0	2.0

FRIEDMAN version 1.24 (Jan 17, 2002) [url=http://ff123.net/]http://ff123.net/[/url]
Tukey HSD analysis

Number of listeners: 18
Critical significance:  0.05
Tukey's HSD:   0.574

Means:

high     WMAPro   Vorbis   HEAAC    low      
  3.29     2.46     2.38     2.36     1.06   

-------------------------- Difference Matrix --------------------------

         WMAPro   Vorbis   HEAAC    low      
high       0.839*   0.917*   0.933*   2.239* 
WMAPro              0.078    0.094    1.400* 
Vorbis                       0.017    1.322* 
HEAAC                                 1.306* 
-----------------------------------------------------------------------

high is better than WMAPro, Vorbis, HEAAC, low
WMAPro is better than low
Vorbis is better than low
HEAAC is better than low

For the first time in listening tests my personal results are more evangelical than the collective one... no winner nor loser for my ears.

A direct comparison between my average scores and the collective one:

Code: [Select]

          collective   guruboolez  (diff)
low          1.55         1.06     -0.49
HE-AAC       3.74         2.36     -1.38
VORBIS       3.32         2.38     -0.94
WMAPRO       3.52         2.46     -1.06
high         4.59         3.29     -1.30
            ______       ______    ______
             3.34         2.31     -1.03

Compared to the whole group of testers my global evaluation for all competitors is clearly more harsh (-1.03 points on average), especially with the high anchor (-1.3 points) and HE-AAC (biggest deviation with -1.38 points). It confirms the lake of sympathy I feel for the SBR trick (there's several complains in my log files against the "SBR texture/noise"). I'm more disappointed by the high anchor which doesn't sound great to my ears. I expected more from LC-AAC two years after my previous test at 96 kbps.

WMAPro is a weird case. I'm not familar at all with this format (I never tested since its last metamorphosis in WMP11) and the new kind of distortion it produces. I disliked it on the beginning but I was much more enthousiastic after some times. Indeed, the second half of tested samples was better marked than the first one while it was at best the same for all other competitors. In other words my notation was more harsh during the second half but WMAPro's one has drastically grown in this severe period
WMAPro artefacts were close to HE-AAC ones; it has a stronger smearing (cf kraftwerk, eig...) and share the same kind of SBRish issue (noise packets altering tonal sound, cymbals...) but often with less annoyance. It also has a kind of "noise sharpening" (for people knowing this foobar2000's plug-in) which tends to add some energy to high frequencies. Sound is often a bit brighter than reference to my ears. It's unexpected, not necessary a good thing but I find it rather pleasant in some situations, and certainly more enjoying than stereo reduction, pre-echo, lowpass or noise filtering. I simply fear that this kind of enhancement would quickly appear as tiresome (like noise sharpening IMO). That's why I wonder if I would still consider WMAPro so kindly with additionnal experience with this encoder and its own texture...

I was never fond of Vorbis at <80 kbps so I'm not surprised to see it inferior to HE-AAC with a confidence >95%. It often sound coarse, fat, with serious stereo issues (and a bit lowpassed too, but a smaller one would maybe increase the ringing...). I'm simply disappointed that for my taste no other format could currently outdistance this format.

As a consequence I'm disappointed. I maybe expected a miracle too soon after reading other people's comments. I will see in a future test if 80 or 96 kbps are more enjoyable for my taste.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #9 – 2007-08-16 01:43:25

Quote from: -Nepomuk- on 2007-08-16 00:27:31

Compare to the last 48kbit/s listening test, 64kbits will only bring slightly better results.

I-tunes at 96kbits ist transparent for most users on both tests.

WMA is not interesting for me.

Nero-AAC HE score was 3,64 points at 48kbits, now we can see 3,74 points at 64kbits.
This is not very impressive for me. I thought Nero will performe better at 64kbit/s.
Of course, it is still usable for e.g. portable devices or good quality webradio.

Vorbis is also better at 64kbits/ (3,16 to 3,32 points )

So i can go with itunes at 96kbit for high quality use (maybe nero performing better at this bitrate?), and 48-64kbits for medium quality use.

maybe 80kbits/s will hit a 4.xx score?

i think the next test should be a 96-112kbit multi-format test, also including Lame.

It's technically not valid to compare results between tests, although the ratings differences do seem to make some sense.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #10 – 2007-08-16 01:52:44

Quote from: ff123 on 2007-08-16 01:43:25

It's technically not valid to compare results between tests, although the ratings differences do seem to make some sense.

I think it's not completely pointless to note that both high and low anchor (which haven't change in the meantime - iTunes's version excepted) are now slightly worse than previously (samples are harder and/or listeners a bit more sensitive on average). A direct comparison between 48 kbps and 64 kbps performance should take this difference into account. It increases a bit the difference between 48 and 64 kbps encodings.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #11 – 2007-08-16 01:58:48

Quote from: Sebastian Mares on 2007-08-16 00:00:17

How would you rank codecs in such a situation, where A=B and B=C, but C<A?

not an expert, but at leas mathematically if A=B and B=C, A=C.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #12 – 2007-08-16 02:00:32

Very interesting. After all my results are not so different from average except my ratings are spanned over a wider range.

Here are my ratings:

Code: [Select]

% 2.78    4.89    2.78    2.03    3.89
WMApro    high    Vorbis    low    Nero

Code: [Select]

FRIEDMAN version 1.24 (Jan 17, 2002) [url=http://ff123.net/]http://ff123.net/[/url]
Tukey HSD analysis

Number of listeners: 18
Critical significance:  0.05
Tukey's HSD:   0.804

Means:

high     Nero     Vorbis   WMApro   low      
  4.89     3.89     2.78     2.78     2.03   

-------------------------- Difference Matrix --------------------------

         Nero     Vorbis   WMApro   low      
high       1.000*   2.111*   2.111*   2.861* 
Nero                1.111*   1.111*   1.861* 
Vorbis                       0.000    0.750  
WMApro                                0.750  
-----------------------------------------------------------------------

high is better than Nero, Vorbis, WMApro, low
Nero is better than Vorbis, WMApro, low

Kudos to Nero! A clear winner according to me. Probably I must like SBR sort of trickery. Ranked it "annoying" only twice.
(And I guess Nero needs some work on the classical orchestra sample "macabre")

WMA pro is disappointing. I'm not impressed. All narrow stereo problems turned out to be WMA.

Vorbis is not worse than WMA but it sounds to me that it didn't really improve very much (at this bitrate) for the last couple of years.

Both WMA and Vorbis tend to distort lower frequencies, which is very easy for me to notice on natural acoustic instruments (guitars, violin, trumpet, also voice). Too distorted sometimes, even worse than low anchor.

(I am not so sensitive to high frequency artifacts. At least typically I don't find it annoying.)

High anchor is very good. Almost transparent. However, I didn't really concentrate very much on the high anchor. Otherwise I could have given it a few more "4"s. But very impressive anyways.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #13 – 2007-08-16 02:58:22

It seems that Itunes at 96 VBR has outscored Itunes 128 CBR from the previous multi-format test.

That's a substantial improvement unless the difficulty of the samples is not comparable.

Guru's results make me think that prolonged exposure to various artifacts might cause scores to drop over time.

Thanks to all organizers and participants.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #14 – 2007-08-16 05:42:15

Quote from: Sebastian Mares on 2007-08-16 00:00:17

This one goes to the experts:

How would you rank codecs in such a situation, where A=B and B=C, but C<A?

I think you have to just stick with your description and refer to the graph. Otherwise the explanation becomes unwieldy. A=B and B=C because if you repeated the test, there's a fair chance (more than 1 in 20) that A would score higher than B, or that C would score higher than B. But we say A>C because there's less than a 1 in 20 chance that a repeat test would show the opposite.

BTW, these results do seem to contradict the NSTL results, but they can actually both be consistent because neither yielded a clear winner between nero he-aac and wma pro 10.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #15 – 2007-08-16 07:39:38

Guru, my taste mirrors yours on Vorbis...anything below 80 kbps and the codec is displeasing with the artifacts. At 80 kbps, without a reference, my tin ears (a place where our similarities vanish) simply couldn't be happier. *This* is the reason that I request that we stick with the original plan and do an 80 kbps multiformat test next.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #16 – 2007-08-16 09:17:56

Little Question: How do I use the key to see my results?

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #17 – 2007-08-16 09:34:11

Quote from: Slacker on 2007-08-16 09:17:56

Little Question: How do I use the key to see my results?

Open java abc/hr and go to menu Tools/Process result files.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #18 – 2007-08-16 09:46:48

I used the key and decrypted results through java abc/hr menu Tools/Process and got 18 text files. Some resulting text files don't include all 5 ratings in text file (I rated all 5 tracks s of all 18 samples). Is this some kind of bug?

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #19 – 2007-08-16 10:07:50

Wow, much more results than what I expected!
Thank you Mares for organizing the test!
Thanks to all participants for doing the test!

Quote from: Alexxander on 2007-08-16 09:46:48

I used the key and decrypted results through java abc/hr menu Tools/Process and got 18 text files. Some resulting text files don't include all 5 ratings in text file (I rated all 5 tracks s of all 18 samples). Is this some kind of bug?

I also have suspicion that java abc/hr has some bugs in processing encrypted results. Just never had time to check it.

Quote from: kennedyb4 on 2007-08-16 02:58:22

It seems that Itunes at 96 VBR has outscored Itunes 128 CBR from the previous multi-format test.
That's a substantial improvement unless the difficulty of the samples is not comparable.

Different samples, different participants. Just look at how personal results posted here differ from the average.
Results from different listening tests are just not easily comparable.

Quote from: kwanbis on 2007-08-16 01:58:48

Quote from: Sebastian Mares on 2007-08-16 00:00:17
How would you rank codecs in such a situation, where A=B and B=C, but C<A?
not an expert, but at leas mathematically if A=B and B=C, A=C.

Operator = and < have in this case different meaning. If average score of A is greater than average score of B, then B=A means that there is chance greater than threshold x that in another test B could have higher average score. B<A means that the chance that in another test B is in average better than A is less than x (x is predefined by procedure used for ranking). This is roughly speaking, correct definitions would be more complicated.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #20 – 2007-08-16 11:53:36

Here are my personal results:

Code: [Select]

% Sample Averages:
WMA    High    Vorbis    Low    Nero
2.60    4.00    1.70    1.00    3.30
2.00    3.50    2.00    1.00    3.00
2.80    4.00    2.30    1.00    2.70
3.40    4.00    3.10    1.00    3.70
2.40    3.60    2.20    1.00    2.30
2.10    3.50    1.70    1.00    2.50
1.70    2.50    2.00    1.00    1.70
2.20    3.40    3.00    1.00    2.60
1.60    3.20    2.30    1.00    2.60
3.10    3.50    2.80    1.00    2.60
2.60    3.50    2.40    1.00    2.80
1.80    3.40    2.00    1.00    1.80
2.90    3.80    2.30    1.00    2.60
3.00    3.90    2.00    1.00    2.70
2.00    3.70    2.30    1.00    1.70
3.00    4.00    2.10    1.20    2.10
2.30    3.50    2.80    1.00    1.80
3.40    4.00    3.40    1.00    3.10

% Codec averages:
% 2.49    3.61    2.36    1.01    2.53

I too am a bit disappointed. I would have expected a few pleasant surprises where the new codecs would have reached almost transparent listening experience. For me, only the high anchor would be usable, even though it is far from transparency.

Out of curiosity, I played some of the samples through my big & good Hi-Fi speakers. I did know that only headphones can reveal codec problems properly, but I was still surprised about how much better the encoded samples sounded through a standard stereo speaker system in a casual listening situation. I suppose that the normal room echoes get mixed with pre-echo and other codec faults and the listener's brain "calculates" subconsciously a new "combined acoustic space", which does not sound completely wrong.

WMA Pro behavior is interesting. It clearly produces more distortion than the other encoders (I mean constant distortion like an analog amp produces when it is played too loud) and behaves rather oddly with some samples. Despite these problems it was occasionally the best contender.

When the WMA Pro samples are inspected with an audio analyzer it looks like the MS developers are very optimistic about how high frequencies their codec can successfully fit in 64 kbps files. WMA Pro uses a lowpass filter at around 20 kHz. However, I suspect that the highest frequency range is more like an artificial byproduct of the MS version of "HE" than a real attempt to represent the original sound faithfully. The WMA Pro samples seem to produce quite altered waterfall displays at about 15-20 kHz when compared with the reference.

Edit: ~~encoder~~ > contender & a couple of typos

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #21 – 2007-08-16 12:49:07

Quote from: Alexxander on 2007-08-16 09:46:48

I used the key and decrypted results through java abc/hr menu Tools/Process and got 18 text files. Some resulting text files don't include all 5 ratings in text file (I rated all 5 tracks s of all 18 samples). Is this some kind of bug?

Now this is weird!

OK, I uploaded all user comments - you can either browse here or download everything as signed, solid and locked RAR. Notice that those were the comments used for evaluating. Please check if you find all five codecs rated in my decrypted result files.

An updated HTML results file will be online this evening.

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #22 – 2007-08-16 14:12:42

i downloaded the rar file and tried to process the results with chunky but i always get this error:

Code: [Select]

G:\listeningtest\chunky-0.8.4-win>chunky.exe --codec-file="codecs.txt" -n --ratings=results --warn -p 0.05
Parsing result files...
Traceback (most recent call last):
  File "chunky", line 639, in ?
  File "chunky", line 595, in main
  File "abchr_parser.pyc", line 634, in __init__
  File "abchr_parser.pyc", line 646, in _handleTargets
  File "abchr_parser.pyc", line 697, in __init__
abchr_parser.Error: Sample directory names must end in a number.

but they do end in numbers as you can see:

Code: [Select]

G:\listeningtest\chunky-0.8.4-win>dir
25.05.2004  21:26            49.152 chunky.exe
16.08.2007  15:00                60 codecs.txt
25.05.2004  21:26            45.123 datetime.pyd
25.05.2004  21:26           712.726 library.zip
25.05.2004  21:26           135.234 pyexpat.pyd
25.05.2004  21:26           974.915 python23.dll
16.08.2007  13:40    <DIR>          Sample01
15.08.2007  23:37    <DIR>          Sample02
15.08.2007  23:38    <DIR>          Sample03
15.08.2007  23:38    <DIR>          Sample04
15.08.2007  23:27    <DIR>          Sample05
15.08.2007  23:38    <DIR>          Sample06
15.08.2007  23:39    <DIR>          Sample07
15.08.2007  23:42    <DIR>          Sample08
15.08.2007  23:42    <DIR>          Sample09
15.08.2007  23:42    <DIR>          Sample10
15.08.2007  23:43    <DIR>          Sample11
15.08.2007  23:43    <DIR>          Sample12
15.08.2007  23:43    <DIR>          Sample13
15.08.2007  23:43    <DIR>          Sample14
15.08.2007  23:44    <DIR>          Sample15
15.08.2007  23:44    <DIR>          Sample16
15.08.2007  23:27    <DIR>          Sample17
15.08.2007  23:50    <DIR>          Sample18
25.05.2004  21:26            16.384 w9xpopen.exe
25.05.2004  21:26            49.218 _socket.pyd
25.05.2004  21:26            57.407 _sre.pyd
25.05.2004  21:26           495.616 _ssl.pyd
25.05.2004  21:26            36.864 _winreg.pyd

what am i doing wrong?

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #23 – 2007-08-16 14:20:14

Quote from: thana on 2007-08-16 14:12:42

i downloaded the rar file and tried to process the results with chunky but i always get this error:

What I did was this: I made a new empty folder and moved all samples subfolders there, and also added a switch to Chunky, something like --directory=".\empty_folder"

Multiformat Listening Test @ 64 kbps - FINISHED

Reply #24 – 2007-08-16 15:08:57

Quote from: thana on 2007-08-16 14:12:42

i downloaded the rar file and tried to process the results with chunky but i always get this error: ...

The "sample01", "sample 02" etc folders must be inside an empty base folder.

After strugling with the same problem for a while I found that the following worked:

First I saved the "codecs.txt" file in the chunky program folder.

Then I created a subfolder named "res" under my chunky program folder and placed the sample folders inside the empty "res" folder.

After that I opened a command prompt and went to this "res" folder:
C:\Documents and Settings\Alex B>L:
L:\>CD 64test\chunky\res\
L:\64test\chunky\res>

and used this command line:
L:\64test\chunky\res>..\chunky.exe --codec-file=..\codecs.txt -n --ratings=results --warn -p 0.05

(italics=prompt, bold=command line)

Chunky didn't like one of the text lines in the source files:
Unrecognized line: "Ratings on a scale from 1.0 to 5.0"
However, despite the warnings it created apparently correct result files.