Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)


2020-10-17 16:21:51

RESULTS

Code:

FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 105
Critical significance:  0.05
Significance of data: 0.00E+000 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total              629         921.84
Testers (blocks)   104         392.24
Codecs eval'd        5         332.55   66.51   175.51  0.00E+000
Error              520         197.05    0.38
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.167

Means:

Apple    FHG      FDK      NERO     FAAC     FFMPG    
  4.63     4.13     4.04     3.69     2.94     2.50   

---------------------------- p-value Matrix ---------------------------

         FHG      FDK      NERO     FAAC     FFMPG    
Apple    0.000*   0.000*   0.000*   0.000*   0.000*   
FHG               0.282    0.000*   0.000*   0.000*   
FDK                        0.000*   0.000*   0.000*   
NERO                                0.000*   0.000*   
FAAC                                         0.000*   
-----------------------------------------------------------------------

Apple is better than FHG, FDK, NERO, FAAC, FFMPG
FHG is better than NERO, FAAC, FFMPG
FDK is better than NERO, FAAC, FFMPG
NERO is better than FAAC, FFMPG
FAAC is better than FFMPG
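
For readers who want to check the table by hand, here is a minimal Python sketch that recomputes the mean squares and the F value from the sums of squares and degrees of freedom printed above. The variable names are mine, not part of the FRIEDMAN tool. The LSD line is only an approximation: I assume the tool uses a Student's t quantile, which Python's standard library doesn't provide, so a normal quantile is used as a stand-in (reasonable with 520 error degrees of freedom, noticeably less so for the small groups).

```python
from math import sqrt
from statistics import NormalDist

# Values copied from the overall ANOVA table above.
ss_codecs, df_codecs = 332.55, 5
ss_error, df_error = 197.05, 520
n_blocks = 105  # labelled "listeners" by the tool; here each block is one sample

# Mean square = sum of squares / degrees of freedom.
ms_codecs = ss_codecs / df_codecs
ms_error = ss_error / df_error

# F statistic for the codec effect.
f_stat = ms_codecs / ms_error

# Fisher's LSD = quantile * sqrt(2 * MS_error / n).
# Assumption: normal quantile used in place of the t quantile.
z = NormalDist().inv_cdf(1 - 0.05 / 2)
lsd_approx = z * sqrt(2 * ms_error / n_blocks)

print(f"MS codecs = {ms_codecs:.2f}")
print(f"MS error  = {ms_error:.2f}")
print(f"F         = {f_stat:.2f}")
print(f"LSD (normal approx.) = {lsd_approx:.3f}")
```

The recomputed values agree with the table to the precision it prints.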

• Apple’s AAC encoder (QuickTime, iTunes) really plays in a different league. Quality is outstanding and it outperforms the competition.

• FDK and FHG (Winamp) are very close to each other. This is confirmed across all groups of samples except one (the problem samples group). They probably share the same DNA. While they are both inferior to Apple’s encoder, they provide very satisfying sound quality at 128 kbps. However, Winamp seems to be more robust against known issues: the difficult samples tested here are generally better with FHG than with FDK (fatboy is one of the most obvious examples).

• Nero: not too far from Fraunhofer’s encoders with classical music, but clearly inferior with all other tested samples (pop/rock/electro). Quality even becomes bad with the hardest samples group. Keep in mind that I used 2-pass ABR, which should give Nero some advantage over its competitors. Nero has gone almost ten years without development. I guess I’ll definitely put this encoder in the graveyard with Winamp’s AAC and I won’t test it anymore.

• FAAC: As I remember it, this was a very bad encoder. It was recently developed further for speed and quality improvements. It’s indeed very fast, but there are still (too) many quality issues on music, and it often sounds bad on problem samples.

• FFmpeg: I was very curious to check precisely how this encoder performs against the competition. Why? Because this encoder is now distributed with HandBrake (a popular video converter) on the Windows platform. Unfortunately, quality is rather poor, with distortions almost everywhere. I suspect that quality may be more acceptable for movie encoding, but I wouldn’t use it, at least not at 128 kbps.

Now, let's see in detail how the encoders perform depending on the genre of the samples.

CLASSICAL MUSIC ONLY

POP/ROCK/ELECTRO…

PROBLEM SAMPLES

Code:

•••••••••••••••••••••••
•CLASSICAL MUSIC GROUP•
•••••••••••••••••••••••

FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 75
Critical significance:  0.05
Significance of data: 0.00E+000 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total              449         675.32
Testers (blocks)    74         318.69
Codecs eval'd        5         214.13   42.83   111.19  0.00E+000
Error              370         142.51    0.39
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.199

Means:

Apple    FHG      FDK      NERO     FAAC     FFMPG    
  4.71     4.12     4.09     3.87     3.09     2.64   

---------------------------- p-value Matrix ---------------------------

         FHG      FDK      NERO     FAAC     FFMPG    
Apple    0.000*   0.000*   0.000*   0.000*   0.000*   
FHG               0.762    0.014*   0.000*   0.000*   
FDK                        0.032*   0.000*   0.000*   
NERO                                0.000*   0.000*   
FAAC                                         0.000*   
-----------------------------------------------------------------------

Apple is better than FHG, FDK, NERO, FAAC, FFMPG
FHG is better than NERO, FAAC, FFMPG
FDK is better than NERO, FAAC, FFMPG
NERO is better than FAAC, FFMPG
FAAC is better than FFMPG


•••••••••••••••••••••
•VARIOUS MUSIC GROUP•
•••••••••••••••••••••

FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 20
Critical significance:  0.05
Significance of data: 0.00E+000 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total              119         128.62
Testers (blocks)    19          32.52
Codecs eval'd        5          75.99   15.20   71.78  0.00E+000
Error               95          20.11    0.21
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.289

Means:

Apple    FDK      FHG      NERO     FAAC     FFMPG    
  4.56     4.22     4.21     3.47     2.93     2.32   

---------------------------- p-value Matrix ---------------------------

         FDK      FHG      NERO     FAAC     FFMPG    
Apple    0.024*   0.018*   0.000*   0.000*   0.000*   
FDK               0.918    0.000*   0.000*   0.000*   
FHG                        0.000*   0.000*   0.000*   
NERO                                0.000*   0.000*   
FAAC                                         0.000*   
-----------------------------------------------------------------------

Apple is better than FDK, FHG, NERO, FAAC, FFMPG
FDK is better than NERO, FAAC, FFMPG
FHG is better than NERO, FAAC, FFMPG
NERO is better than FAAC, FFMPG
FAAC is better than FFMPG

•••••••••••••••••••••••
•PROBLEM SAMPLES GROUP•
•••••••••••••••••••••••

FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 10
Critical significance:  0.05
Significance of data: 7.33E-011 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total               59          89.26
Testers (blocks)     9          12.40
Codecs eval'd        5          53.46   10.69   20.56  7.33E-011
Error               45          23.40    0.52
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.650

Means:

Apple    FHG      FDK      NERO     FAAC     FFMPG    
  4.15     4.11     3.35     2.78     1.87     1.84   

---------------------------- p-value Matrix ---------------------------

         FHG      FDK      NERO     FAAC     FFMPG    
Apple    0.902    0.017*   0.000*   0.000*   0.000*   
FHG               0.023*   0.000*   0.000*   0.000*   
FDK                        0.084    0.000*   0.000*   
NERO                                0.007*   0.006*   
FAAC                                         0.926    
-----------------------------------------------------------------------

Apple is better than FDK, NERO, FAAC, FFMPG
FHG is better than FDK, NERO, FAAC, FFMPG
FDK is better than FAAC, FFMPG
NERO is better than FAAC, FFMPG
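
The "is better than" lists can be reproduced from the means and the LSD value alone: two codecs are separated when their means differ by more than the LSD. Below is a minimal Python sketch of that rule. The function name `better_than` is mine, and the means are the two-decimal values printed above, so a pair sitting right at the threshold could come out differently from the tool's own p-values.

```python
from itertools import combinations

def better_than(means, lsd):
    """Map each codec to the codecs it beats by more than the LSD."""
    result = {name: [] for name in means}
    for a, b in combinations(means, 2):
        # Order the pair so that `hi` is the codec with the higher mean.
        hi, lo = (a, b) if means[a] >= means[b] else (b, a)
        if means[hi] - means[lo] > lsd:
            result[hi].append(lo)
    return result

# Means and LSD values copied from the overall and problem-sample outputs.
overall = {"Apple": 4.63, "FHG": 4.13, "FDK": 4.04,
           "NERO": 3.69, "FAAC": 2.94, "FFMPG": 2.50}
problem = {"Apple": 4.15, "FHG": 4.11, "FDK": 3.35,
           "NERO": 2.78, "FAAC": 1.87, "FFMPG": 1.84}

for label, means, lsd in [("overall", overall, 0.167),
                          ("problem samples", problem, 0.650)]:
    print(label)
    for codec, losers in better_than(means, lsd).items():
        if losers:
            print(f"  {codec} is better than {', '.join(losers)}")
```

With the problem samples, the Apple/FHG, FDK/NERO and FAAC/FFMPG pairs all fall within the LSD of 0.650, which is why those pairs are missing from the list above.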

• One interesting point is that performance is quite similar with classical music and other musical genres. Even 20 samples were enough to get statistically significant results: Apple > FHG & FDK > Nero > FAAC > FFMPEG

• Problem samples are, however, a turbulent region for encoders: Apple is not as good and is not superior to FHG (Winamp). And for the first time, FDK and FHG performances diverge, with a clear superiority for FHG on the test samples. A more anecdotal result is that FFMPEG is now statistically tied with FAAC.

TEST FAQ

Why test so many AAC implementations?

Out of curiosity first. It’s something I’ve wanted to do for a decade. The purpose is not only to check which one sounds best to my ears, but also to see how FDK performs against FHG, and how FFMPEG’s encoder (now used in Handbrake) holds up. Second reason: future tests. I’d like to have a solid reference test if I later decide to run a personal multiformat test and need to choose one AAC implementation among the six (at least) existing ones.

Why not use VBR?

Most AAC encoders don’t have a precise VBR scale, and it's not always possible to get 128 kbps or something close. To make the test easier, I therefore decided this time to avoid VBR and all the issues that come with this mode. I chose CBR or ABR when available (2-pass ABR with Nero). With this setup I can evaluate the core performance of each encoder (and imagine a small quality jump with VBR when available).

Are these samples difficult ones?

No, except for a small number of them. The 75 classical samples are musical passages I really enjoy. There was no selection based on difficulty. The 10 “Billboard” samples were cut indiscriminately: 30 seconds taken from the same fixed range of each track (1 min 00 sec to 1 min 30 sec). But the HA.org samples may have been selected by members for their possible difficulty. And a final group of 10 samples is intentionally a possible “killer-samples” group.
These samples and this selection were already used for previous tests I made this year.

Methodology

Java ABC/HR is my software of choice. Volume was normalized and delay removed by the software (faad2 decoder). It was a listen-and-rate test with only a little effort spent finding artifacts and adjusting the ranking. It’s still a blind test, but with no ABX sessions. I’d rather test a wide set of samples non-extensively and let small ranking errors average out statistically than spend a lot of time on a smaller set of samples. My hardware setup is very basic: laptop headphone output (no DAC), AKG Q701 headphones, moderate listening volume.
EDIT: I ranked 443 files (score < 5.0) and made only two mistakes (ranking the reference instead of the encoded file). I gave a 5.0 score to these two files. The final score is therefore 441/443, and the probability of getting there by guessing is very close to zero.
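
To put a number on "very close to zero", here is a minimal Python sketch of the binomial tail for 441 or more correct identifications out of 443. It rests on an assumption that is not stated in the post: that each ranking is an independent two-way choice (reference vs. encoded file), so a pure guesser is right with probability 1/2.

```python
from fractions import Fraction
from math import comb

trials = 443    # files ranked below 5.0
correct = 441   # rankings where the encoded file was the one rated

# Assumption: guessing picks the encoded file with probability 1/2,
# so every one of the 2**trials outcomes is equally likely.
tail_count = sum(comb(trials, k) for k in range(correct, trials + 1))
p_guess = Fraction(tail_count, 2 ** trials)

print(f"favourable outcomes: {tail_count}")
print(f"P(at least {correct}/{trials} by guessing) = {float(p_guess):.3e}")
```

Only 98347 of the 2^443 equally likely outcomes contain two mistakes or fewer, so under this assumption the probability is far below any usual significance threshold.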

Encoders setup

• Apple: encoded through foobar2000's converter (AAC Apple Graphical Interface): ABR 128. Metadata: qaac 2.67, CoreAudioToolbox 7.10.9.0, AAC-LC Encoder, ABR 128kbps, Quality 96
• FAAC: encoded through foobar2000 command line encoder: -b 128 - -o %d. Metadata: FAAC 1.30
• FDK: encoded through foobar2000's converter (AAC FDK Graphical Interface): CBR 128. Metadata: fdkaac 1.0.0, libfdk-aac 4.0.0, CBR 128kbps
• FHG (Winamp): encoded through foobar2000's converter (AAC Winamp FHG Graphical Interface): CBR 128. Metadata: fhgaac v03.02.15;CBR=128000
• Nero: encoded through foobar2000's converter (AAC Nero Graphical Interface): 2 pass ABR 128. Metadata: ndaudio 1.5.4.0 / -2pass -br 128000
• FFMPEG: encoded through foobar2000 command line encoder: -i - -c:a aac -b:a 128k %d. Metadata: Lavf58.45.100. FFmpeg version used: ffmpeg-4.3.1-2020-10-01
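
For anyone who wants to script the two command-line encoders outside foobar2000, here is a minimal Python sketch that expands the parameter strings listed above into argument lists. The flags are copied verbatim from the list. Everything else is my assumption: the executable names `faac` and `ffmpeg` being on the PATH, the hypothetical output name `out.m4a`, and the reading of `%d` as foobar2000's destination-file placeholder and of the lone `-` as "read audio from standard input".

```python
import shlex

# Parameter strings copied from the encoder setup list above.
TEMPLATES = {
    "faac": "-b 128 - -o %d",
    "ffmpeg": "-i - -c:a aac -b:a 128k %d",
}

def build_command(executable, destination):
    """Expand a foobar2000-style parameter template into an argument list."""
    args = shlex.split(TEMPLATES[executable])
    # Replace the destination placeholder, keep every other argument as-is.
    return [executable] + [destination if arg == "%d" else arg for arg in args]

# Hypothetical output file name, for illustration only.
print(build_command("faac", "out.m4a"))
print(build_command("ffmpeg", "out.m4a"))
```

The resulting lists could be handed to `subprocess.run` with WAV data piped to standard input, which is roughly what the converter does for command-line encoders.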

Next investigations?

I’d like to check how much more efficient next-gen formats (OPUS, xHE-AAC) are compared to AAC at its best, and whether I can save some space without sacrificing quality. These formats are technically stronger but not necessarily as well tuned as Apple’s AAC.

TABLE

(with the nice help of kamedo2!!! thanks again)

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #1 – 2020-10-17 18:43:18

Thanks for doing the test. I wouldn't be surprised if Helix or LAME MP3 did better than FFMPEG's AAC encoder.

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #2 – 2020-10-20 13:12:55

A superb test; it must have taken a substantial effort to perform.

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #3 – 2020-10-20 15:25:32

Excellent post!

Will there be a difference when VBR is used? If so how much and in what direction? I am asking this question because AFAIK VBR is one of the main advantages of AAC.

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #4 – 2020-10-20 18:18:58

Quote from: LithosZA on 2020-10-17 18:43:18

Thanks for doing the test. I wouldn't be surprised if Helix or LAME MP3 did better than FFMPEG's AAC encoder.

I'm sure that both MP3 encoders would sound better than current FFMPEG's AAC

Quote from: Kamedo2 on 2020-10-20 13:12:55

A superb test; it must have taken a substantial effort to perform.

Oh yes, I wouldn't do it every week

Quote from: pr0m3th3u5 on 2020-10-20 15:25:32

Excellent post!

Will there be a difference when VBR is used? If so how much and in what direction? I am asking this question because AFAIK VBR is one of the main advantages of AAC.

Thanks. VBR should make a small difference. How much: I can't say, but it shouldn't be a huge gap. In what direction: logically better, but who knows… VBR isn't an advantage specific to AAC: MP3, MPC, WMA, Vorbis, OPUS and USAC also support VBR. It's very common.

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #5 – 2020-10-20 23:51:29

Guru, Another great personal test.

While I'm not surprised by the results, as they are the same as in the public test we performed in 2011
http://listening-tests.hydrogenaud.io/igorc/aac-96-a/index.htm (Apple >> FhG >> Nero )
I'm still happy to see that we didn't make any big mistakes back then.
All haters now can go and touch this and can go f...

And btw the quality of the encoders hasn't changed since then. Some bugfixes here, some misc. changes there. But it's all the same.
Nice!

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #6 – 2020-10-21 07:39:03

@Guruboolez

Excellent work, as always.
Do you think that Apple encoder is equally well tuned in all modes (cbr, abr, cvbr and tvbr)?

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #7 – 2020-10-21 10:14:42

Quote from: synclagz on 2020-10-21 07:39:03

Excellent work, as always.
Do you think that Apple encoder is equally well tuned in all modes (cbr, abr, cvbr and tvbr)?

Thank you

I can't provide any objective evidence. But I'm very confident that all encoding modes are well tuned. Apple's AAC is probably the most advanced encoder. I already noticed its very good performance in 2003 and confirmed it in the following years:
https://hydrogenaud.io/index.php?topic=16395.msg163116#msg163116
https://hydrogenaud.io/index.php?topic=29924.msg258644#msg258644
https://hydrogenaud.io/index.php?topic=58724.msg527434#msg527434

And a lot of work has gone into it since then, even if there has been almost no development in recent years. This encoder was also used by Apple to sell billions of 128 kbps files, and to upgrade them to 256 kbps a few years later. It's pretty solid in many respects and I'm sure there are no hidden flaws.

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #8 – 2020-10-21 18:50:05

Top-notch work! Thanks for doing this big effort and sharing it!

This confirms again that codecs haven't evolved significantly in the last few years at these bitrates. And with every passing year, 128 kbps becomes less significant compared to network speeds and storage capacities.

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #9 – 2021-09-19 21:25:32

Thank you very much for sharing these tests Guru. It must have taken quite a lot of your time.

I was wondering if the fact that FDK AAC's bandwidth is slightly narrower than Apple's (see: https://wiki.hydrogenaud.io/index.php?title=Fraunhofer_FDK_AAC#Bandwidth) played any part in making it less transparent overall (Qaac has a low-pass filter cutting at about 18.5 kHz with the settings you used, if I recall correctly).

Would you say that it might have contributed to FDK AAC's slightly lower marks in your test?

Thanks!

Re: Personal Blind Listening Test of AAC at 128 kbps (six encoders & 105 samples)

Reply #10 – 2021-09-20 06:47:45

@guruboolez
Hi,
Could you please upload the A18 Monteverdi sample on which FDK and FhG got a mark of only 1.7?
I would like to test it to hear how bad it is (probably distorted everywhere) and how quality improves at higher bitrates.
