
Multiformat listening test @ ~64kbps: Results

The test is finished, results are available here:

http://listening-tests.hydrogenaudio.org/igorc/results.html

Summary: CELT/Opus won, Apple HE-AAC is better than Nero HE-AAC, and Vorbis has caught up with Nero HE-AAC.

Multiformat listening test @ ~64kbps: Results

Reply #1
If someone can assist with a bitrate table or per-sample results, that would be nice...

Multiformat listening test @ ~64kbps: Results

Reply #2
Oh, and given that Opus is open source, if one of the developers could give a technical explanation for our audience of which codec features and design decisions made it able to win this test, that would be pretty damn interesting, too.

Multiformat listening test @ ~64kbps: Results

Reply #3
I just wonder one thing: when the Vorbis encoder was tested, how was it lowpassed? Was it tested with the default 14 kHz lowpass?



Multiformat listening test @ ~64kbps: Results

Reply #6
Congratulations to CELT/Opus!

I wanted to compare ratings by testers per sample, but it seems that every tester gets a randomized testing sequence.
Is there any way I can get such data and produce the plot I want? In case that's not clear: I want to know which source format each of the 5 rating bins corresponds to, for each tester.

Thanks

Edit: never mind, I found a way. It seems that the sample name suffixes are the same (the ones describing the 5 bins in the header of each test result).

Multiformat listening test @ ~64kbps: Results

Reply #7
I think the results of lessthanjoey and AlexB are also anonymous. It will be changed.
If anyone is interested in his/her results, there is a key, or email me and I will send the results.


Oh, I participated in this test too.
Garf had the key for my results and checked them.

It's also good to say strong words like "thank you, great job". But this time I want to say a big thank you to all the participants and the people who helped to conduct this test.
Sebastian Mares - for his previous public tests. This test benefited a lot from them.
AlexB - for providing pre-decoded packages and being here.
Especially Garf.

And many other people who were around here. Your time is valuable and highly appreciated.

Multiformat listening test @ ~64kbps: Results

Reply #8
I'm stunned by the CELT/Opus results! I would have assumed that your toolbox is smaller than usual when you are targeting low delay. And now CELT even beats the others by a wide margin.

Thanks for the great work, guys!

Multiformat listening test @ ~64kbps: Results

Reply #9
Thanks guys! Interesting results.


One note though:

Code:
Read 5 treatments, 531 samples => 10 comparisons
    Means:
          Vorbis   Nero_HE-AAC  Apple_HE-AAC          Opus    AAC-LC@48k
           3.513         3.547         3.817         3.999         1.656


For processing the result .txt files with chunky I organized them into sample folders. I removed the results that were marked "invalid" and results that apparently had a fixed newer version (marked as such). I had a duplicate problem with romor's results (a couple of duplicates in a subfolder), but I decided to keep the newer result files. I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what can cause the difference. Did you disqualify more results after creating the rar package, or does "531 samples" mean something other than the total number of result files?

Here's how chunky parses the 566 result files I have:

Code:
% Result file produced by chunky-0.8.4-beta
% ..\chunky.exe --codec-file=..\codecs.txt -n --ratings=results --warn -p 0.05
%
% Sample Averages:

Vorbis    Nero    Apple    CELT    Anchor
2.56    4.28    4.19    2.67    1.87
2.95    4.20    4.03    2.36    1.68
3.42    3.51    3.98    4.73    2.51
4.12    3.84    4.49    4.64    2.18
4.18    3.59    3.87    4.52    1.95
3.35    3.68    3.34    4.00    1.56
3.86    2.98    2.96    3.50    1.85
4.03    3.78    4.09    4.49    2.02
3.60    3.71    3.89    3.94    1.51
4.28    2.78    2.19    4.12    1.44
4.12    3.93    4.17    4.39    1.70
3.25    3.18    3.20    4.14    1.77
3.83    3.63    3.86    4.56    1.41
3.49    3.81    4.01    4.27    1.37
4.15    3.84    4.08    4.76    2.04
3.97    2.74    3.09    4.38    1.74
3.35    3.24    4.15    4.44    1.56
2.68    2.96    3.63    4.10    1.51
3.58    4.37    4.88    3.73    1.76
3.40    4.10    4.68    4.26    1.61
3.80    3.49    3.55    4.43    1.38
3.81    3.30    4.27    4.26    1.13
3.59    3.14    3.51    4.09    1.18
3.29    3.61    3.88    4.16    1.36
3.66    3.84    4.37    3.86    1.55
2.78    3.99    4.18    2.82    1.57
3.62    3.88    3.92    3.93    1.34
3.39    4.03    4.39    3.96    1.46
3.61    4.12    4.36    4.09    1.54
4.42    3.48    4.29    4.68    1.82

% Codec averages:
% 3.60    3.63    3.92    4.08    1.65

Multiformat listening test @ ~64kbps: Results

Reply #10
I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what can cause the difference.

I get the same result as you. It looks like the results submitted on the 10th of April are missing.

Edit: See below.

Multiformat listening test @ ~64kbps: Results

Reply #11
For comparison, I uploaded a rar package of my "chunky" folder. It contains the reorganized result files and phong's chunky (Windows version). The command line I used is in the instructions.txt file.

I had to partially rename the result files to reorganize them into the sample folders. In addition, I needed to change all r.wav strings inside the result files to .wav before chunky could work (a Python equivalent is sketched below). I batch processed the files with Notepad++. I believe it was a "safe" edit.

The package is here: http://www.hydrogenaudio.org/forums/index....showtopic=88033
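
For anyone redoing this step on another platform, here is a minimal sketch of that r.wav -> .wav batch edit in Python (the "results" folder name is an assumption; the original edit was done with Notepad++):

Code:
# Rewrite every "r.wav" occurrence to ".wav" inside the result .txt files
# so chunky can match the sample file names.
from pathlib import Path

results_dir = Path("results")  # hypothetical location of the reorganized result files
for txt in results_dir.rglob("*.txt"):
    text = txt.read_text(encoding="utf-8", errors="replace")
    fixed = text.replace("r.wav", ".wav")
    if fixed != text:
        txt.write_text(fixed, encoding="utf-8")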

Multiformat listening test @ ~64kbps: Results

Reply #12
Quote
For processing the result .txt files with chunky I organized them into sample folders. I removed the results that were marked "invalid" and results that apparently had a fixed newer version (marked as such). I had a duplicate problem with romor's results (a couple of duplicates in a subfolder), but I decided to keep the newer result files. I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what can cause the difference. Did you disqualify more results after creating the rar package, or does "531 samples" mean something other than the total number of result files?


Sounds like you didn't eliminate the listeners with more than 4 invalid results.

The filtering rules on the page are:

*    If the listener ranked the reference worse than 4.5 on a sample, the listener's results for that sample were discarded.
*    If the listener ranked the low anchor at 5.0 on a sample, the listener's results for that sample were discarded.
*    If the listener ranked the reference below 5.0 on more than 4 samples, all of that listener's results were discarded.

You'll have to modify chunky to get that behavior.
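
For reference, a rough sketch of those screening rules in Python (the nested-dict layout is purely illustrative, not chunky's actual data structures):

Code:
# Apply the three screening rules above to ratings organized as
# {listener: {sample: {"reference": x, "low_anchor": y, ...}}} (assumed layout).
def screen(results):
    screened = {}
    for listener, samples in results.items():
        kept = {}
        imperfect_refs = 0
        for sample, ratings in samples.items():
            if ratings["reference"] < 5.0:
                imperfect_refs += 1           # counts toward the "more than 4 samples" rule
            if ratings["reference"] < 4.5:    # rule 1: reference ranked worse than 4.5
                continue
            if ratings["low_anchor"] == 5.0:  # rule 2: low anchor ranked at 5.0
                continue
            kept[sample] = ratings
        if imperfect_refs > 4:                # rule 3: discard the whole listener
            continue
        screened[listener] = kept
    return screened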

Multiformat listening test @ ~64kbps: Results

Reply #13
For comparison, I uploaded a rar package of my "chunky" folder. It contains the reorganized result files and phong's chunky (Windows version). The command line I used is in the instructions.txt file.

I had to partially rename the result files to reorganize them into the sample folders. In addition, I needed to change all r.wav strings in filenames to .wav before chunky could work. I batch processed the files with Notepad++. I believe it was a "safe" edit.

The package is here: http://www.hydrogenaudio.org/forums/index....showtopic=88033


Thanks, I didn't have the triaged results here, so this was welcome. By the way, chunky has quite dangerous behavior: by default, it squashes all listeners together per sample for the overall results. In other words, it's discarding most of the information in the test, as if only a single listener did all samples! The per-sample results don't suffer from that, so those should be fine.
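
To make that "squashing" concrete, here is an illustrative sketch (hypothetical data layout, not chunky's code) of the difference between keeping every (listener, sample) score and collapsing the listeners into one mean per sample:

Code:
# ratings[sample][listener] = score for one codec -- assumed layout for illustration.
from statistics import mean, stdev

def squashed(ratings):
    # One mean per sample, listeners averaged away: what remains looks like
    # a single listener who rated each sample once.
    sample_means = [mean(per_listener.values()) for per_listener in ratings.values()]
    return mean(sample_means), stdev(sample_means)

def full(ratings):
    # Keep every (sample, listener) score: this is the spread that the
    # squashed view throws away.
    scores = [s for per_listener in ratings.values() for s in per_listener.values()]
    return mean(scores), stdev(scores)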

Edit: Whoops, I indeed missed some results that should have been discarded.

Multiformat listening test @ ~64kbps: Results

Reply #14
Sounds like you didn't eliminate the listeners with more than 4 invalid results.

The filtering rules on the page are:

*    If the listener ranked the reference worse than 4.5 on a sample, the listener's results for that sample were discarded.
*    If the listener ranked the low anchor at 5.0 on a sample, the listener's results for that sample were discarded.
*    If the listener ranked the reference below 5.0 on more than 4 samples, all of that listener's results were discarded.

You'll have to modify chunky to get that behavior.


Ah, good point. There were two discarded listeners; I got those. I saw one result with a rated reference that didn't cause an invalidation, so I got that right too.

But there are a few results with 5.0's for the low anchor. After discarding those, I'm at 559 samples now.

Multiformat listening test @ ~64kbps: Results

Reply #15
Sounds like you didn't eliminate the listeners with more than 4 invalid results.


I removed two folders (= listeners) before doing the tasks I mentioned:

- 09 (too many invalid results; the listener never answered any email)
- 27 (something went wrong, or a cheater)

I trusted the comments in the folder and file names. I did not look inside each and every result file.


Multiformat listening test @ ~64kbps: Results

Reply #17
Sounds like you didn't eliminate the listeners with more than 4 invalid results.


I removed two folders (= listeners) before doing the tasks I mentioned:

- 09 (too many invalid results; the listener never answered any email)
- 27 (something went wrong, or a cheater)

I trusted the comments in the folder and file names. I did not look inside each and every result file.


Ah, okay!

(Moving and amending this from my edited post, since others already replied. Sorry.)
The users who should have been excluded according to that rule are 09, 27, and 22, but IgorC decided to keep 22 (because 22 didn't understand the procedure at first but got better later). I also expected 21 to be filtered (because he only rated the low anchor on almost all the samples: 23/30 are either low-anchor-only or invalid, including many of the really obvious ones).

Multiformat listening test @ ~64kbps: Results

Reply #18
But there are a few results with 5.0's for the low anchor. After discarding those, I'm at 559 samples now.

I found six "low anchor = 5.0" instances (I exported a CSV file from chunky and sorted the data by the low anchor column in Excel).

My math says 560. 

(or did you actually remove the "rated but accepted reference" instance?)

 

Multiformat listening test @ ~64kbps: Results

Reply #19
But there are a few results with 5.0's for the low anchor. After discarding those, I'm at 559 samples now.

I found six "low anchor = 5.0" instances (I exported a CSV file from chunky and sorted the data by the low anchor column in Excel).

My math says 560. 

(or did you actually remove the "rated but accepted reference" instance?)


No. But after running chunky I only had 565, not 566 files. It appears to reject one input file for some reason (this is on Linux).

A lesson here is that the post-screened data set should be published too, because it's easy to make mistakes there, and it would make further analysis easier for anyone who wants to do it. But considering the comment from NullC, the results on the site are probably correct.

Multiformat listening test @ ~64kbps: Results

Reply #20
Regarding the bitrate table,

I guess that CELT/Opus is not supported in any program that can display and/or export accurate bit rate data.

If the bitrate needs to be calculated from the file size, should the size of the Ogg container data be subtracted from the file size before performing the calculation? What would be the correct amount?

Would the bitrate value then be comparable with the values that foobar shows for the other contenders? (It is quite simple to export bitrate data from foobar.)

Multiformat listening test @ ~64kbps: Results

Reply #21
Sounds like you didn't eliminate the listeners with more than 4 invalid results.


I removed two folders (= listeners) before doing the tasks I mentioned:

- 09 (too many invalid results; the listener never answered any email)
- 27 (something went wrong, or a cheater)

I trusted the comments in the folder and file names. I did not look inside each and every result file.


#27 are my results. I do not know if something went wrong, but I am definitely not a cheater.
Over a week ago, I sent Igor some wave files he asked for, but he has not answered my email yet.

Multiformat listening test @ ~64kbps: Results

Reply #22
#27 are my results. I do not know if something went wrong, but I am definitely not a cheater.
Over a week ago, I sent Igor some wave files he asked for, but he has not answered my email yet.


I think it's really unfortunate that Igor released a file with the word "cheater" in it. There are so many ways for a result to go weird that have nothing to do with "cheating".

Your results can be excluded purely on the basis of the previously published confused-reference criteria (samples 2, 4, 9, 22, and 30 invalid), so that should settle the question of whether excluding those results was correct, and it should have been left at that. Even with good and careful listeners this can happen, and it's nothing anyone should take too personally.

Though, your results are pretty weird: you ranked the reference fairly low (e.g. 3) on a couple of comparisons where many people found the reference and the codec indistinguishable. I think you also failed to reverse your preference on some samples where the other listeners changed theirs (behavior characteristic of a non-blind test?).

I don't mean to cause offense, but were you listening via speakers, or could you have far less HF sensitivity than most of the other listeners (if you are male and older than most participants, the answer might be yes)? Any other ideas why your results might be so different, both overall and on specific samples?

Multiformat listening test @ ~64kbps: Results

Reply #23
Regarding the bitrate table,
I guess that CELT/Opus is not supported in any program that can display and/or export accurate bit rate data.
If the bitrate needs to be calculated from the file size, should the size of the Ogg container data be subtracted from the file size before performing the calculation? What would be the correct amount?
Would the bitrate value then be comparable with the values that foobar shows for the other contenders? (It is quite simple to export bitrate data from foobar.)


If you wish to remove container overhead for the Vorbis and Opus files, you can use a tool like ogg-dump from oggztools to extract all the packet sizes.

On a few samples Vorbis suffers a bit because the Vorbis headers are fairly large compared to an 8-second 64 kbit/s file (e.g. Sample01), but I don't think the container overhead is all that significant.
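
For what it's worth, here is a minimal sketch of the container-overhead calculation: walking the raw Ogg pages in Python and summing only the payload bytes (an illustration, not oggztools; note that it still counts the codec's own header packets, such as the Vorbis headers mentioned above):

Code:
# Sum the payload bytes of an Ogg file, excluding the 27-byte page headers
# and segment tables, to estimate the codec bitrate without container framing.
def ogg_payload_bytes(path):
    with open(path, "rb") as f:
        data = f.read()
    payload = 0
    pos = 0
    while True:
        pos = data.find(b"OggS", pos)    # Ogg page capture pattern
        if pos == -1:
            break
        nsegs = data[pos + 26]           # number of lacing values in this page
        segtable = data[pos + 27 : pos + 27 + nsegs]
        body = sum(segtable)             # payload size of this page in bytes
        payload += body
        pos += 27 + nsegs + body
    return payload

# e.g. net kbit/s for a hypothetical 8-second sample:
# print(ogg_payload_bytes("Sample01.ogg") * 8 / 8.0 / 1000)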

Multiformat listening test @ ~64kbps: Results

Reply #24
Yes, I was too strict. Sorry about that.

Some of the listeners preferred Nero over Vorbis or vice versa. Some of them rated Vorbis higher than the HE-AAC codecs.
Others preferred Apple HE-AAC over CELT on the second half of the samples. These variations are all fine.
In the end, on average, Opus/CELT was better for all listeners with enough results.
It was very strange that you ranked Opus as low as the low anchor (e.g. on sample 10 and many others) where ALL other listeners scored it very well.
Your average scores (including the 5 invalid samples):
Vorbis - 3.53
Nero - 3.15
Apple - 3.51
CELT - 2.34


Maybe your hardware has some issues.

Earlier I also wrote you to re run again the whole test  because there were 5 invalid results and all test was discarded.