RESULTS
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis
Number of listeners: 105
Critical significance: 0.05
Significance of data: 0.00E+000 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings
Source of          Degrees      Sum of     Mean
variation          of Freedom   squares    Square    F        p
Total              629          921.84
Testers (blocks)   104          392.24
Codecs eval'd      5            332.55     66.51     175.51   0.00E+000
Error              520          197.05     0.38
---------------------------------------------------------------
Fisher's protected LSD for ANOVA: 0.167
Means:
Apple    FHG      FDK      NERO     FAAC     FFMPG
4.63     4.13     4.04     3.69     2.94     2.50
---------------------------- p-value Matrix ---------------------------
         FHG      FDK      NERO     FAAC     FFMPG
Apple    0.000*   0.000*   0.000*   0.000*   0.000*
FHG               0.282    0.000*   0.000*   0.000*
FDK                        0.000*   0.000*   0.000*
NERO                                0.000*   0.000*
FAAC                                         0.000*
-----------------------------------------------------------------------
Apple is better than FHG, FDK, NERO, FAAC, FFMPG
FHG is better than NERO, FAAC, FFMPG
FDK is better than NERO, FAAC, FFMPG
NERO is better than FAAC, FFMPG
FAAC is better than FFMPG
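The F value in the table above can be re-derived from the sums of squares and degrees of freedom it lists. The following is only a minimal sketch of that arithmetic, not the FRIEDMAN tool's own code; the function and variable names are made up for illustration, and the inputs are the rounded figures printed in the table.

```python
# Values copied from the overall ANOVA table above.
ss_blocks, df_blocks = 392.24, 104   # "Testers (blocks)" row
ss_codecs, df_codecs = 332.55, 5     # "Codecs eval'd" row
ss_error, df_error = 197.05, 520     # "Error" row


def f_statistic(ss_treatment, df_treatment, ss_err, df_err):
    """Return (treatment mean square, error mean square, F ratio)."""
    ms_treatment = ss_treatment / df_treatment
    ms_err = ss_err / df_err
    return ms_treatment, ms_err, ms_treatment / ms_err


ms_codecs, ms_error, f_ratio = f_statistic(ss_codecs, df_codecs, ss_error, df_error)

# The three rows should add up to the "Total" row (629 and 921.84).
df_total = df_blocks + df_codecs + df_error
ss_total = ss_blocks + ss_codecs + ss_error

print(f"MS codecs = {ms_codecs:.2f}, MS error = {ms_error:.2f}, F = {f_ratio:.2f}")
print(f"Total: df = {df_total}, SS = {ss_total:.2f}")
```

The same function can be fed the rows of the three per-group tables further down; small differences in the last digit are possible because the printed inputs are already rounded.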
• Apple’s AAC encoder (QuickTime, iTunes) really plays in a different league. Quality is outstanding and it outperforms the competition.
• FDK and FHG (Winamp) are very close to each other. This holds for every group of samples except one (the problem samples group). They probably share the same DNA. While both are inferior to Apple’s encoder, they provide very satisfying sound quality at 128 kbps. However, Winamp's FHG seems to be more robust against known issues: the difficult samples tested here generally sound better with FHG than with FDK (fatboy is one of the most obvious examples).
• Nero: not too far from Fraunhofer’s encoders with classical music, but clearly inferior with all other tested samples (pop/rock/electro). Quality even becomes bad with the hardest samples group. Recall that I used 2-pass ABR, which should give Nero some advantage over its competitors. Nero has gone almost ten years without development. I guess I’ll definitely put this encoder in the graveyard with Winamp’s AAC and I won’t test it anymore.
• FAAC: As I remembered it, this was a very bad encoder. It has recently been developed again for speed and quality improvements. It is indeed very fast, but there are still (too) many quality issues on music, and it often sounds bad on problem samples.
• FFmpeg: I was very curious to check precisely how this encoder performs against the competition. Why? Because this encoder is now distributed with Handbrake (a popular video converter) on the Windows platform. Unfortunately, quality is rather poor, with distortions almost everywhere. I suspect that quality may be more acceptable for movie encoding, but I wouldn’t use it, at least not at 128 kbps.
Now let's see in detail how the encoders perform depending on the genre of the samples.
CLASSICAL MUSIC ONLY
POP/ROCK/ELECTRO…
PROBLEM SAMPLES
•••••••••••••••••••••••
•CLASSICAL MUSIC GROUP•
•••••••••••••••••••••••
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis
Number of listeners: 75
Critical significance: 0.05
Significance of data: 0.00E+000 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings
Source of          Degrees      Sum of     Mean
variation          of Freedom   squares    Square    F        p
Total              449          675.32
Testers (blocks)   74           318.69
Codecs eval'd      5            214.13     42.83     111.19   0.00E+000
Error              370          142.51     0.39
---------------------------------------------------------------
Fisher's protected LSD for ANOVA: 0.199
Means:
Apple    FHG      FDK      NERO     FAAC     FFMPG
4.71     4.12     4.09     3.87     3.09     2.64
---------------------------- p-value Matrix ---------------------------
         FHG      FDK      NERO     FAAC     FFMPG
Apple    0.000*   0.000*   0.000*   0.000*   0.000*
FHG               0.762    0.014*   0.000*   0.000*
FDK                        0.032*   0.000*   0.000*
NERO                                0.000*   0.000*
FAAC                                         0.000*
-----------------------------------------------------------------------
Apple is better than FHG, FDK, NERO, FAAC, FFMPG
FHG is better than NERO, FAAC, FFMPG
FDK is better than NERO, FAAC, FFMPG
NERO is better than FAAC, FFMPG
FAAC is better than FFMPG
•••••••••••••••••••••
•VARIOUS MUSIC GROUP•
•••••••••••••••••••••
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis
Number of listeners: 20
Critical significance: 0.05
Significance of data: 0.00E+000 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings
Source of          Degrees      Sum of     Mean
variation          of Freedom   squares    Square    F        p
Total              119          128.62
Testers (blocks)   19           32.52
Codecs eval'd      5            75.99      15.20     71.78    0.00E+000
Error              95           20.11      0.21
---------------------------------------------------------------
Fisher's protected LSD for ANOVA: 0.289
Means:
Apple    FDK      FHG      NERO     FAAC     FFMPG
4.56     4.22     4.21     3.47     2.93     2.32
---------------------------- p-value Matrix ---------------------------
         FDK      FHG      NERO     FAAC     FFMPG
Apple    0.024*   0.018*   0.000*   0.000*   0.000*
FDK               0.918    0.000*   0.000*   0.000*
FHG                        0.000*   0.000*   0.000*
NERO                                0.000*   0.000*
FAAC                                         0.000*
-----------------------------------------------------------------------
Apple is better than FDK, FHG, NERO, FAAC, FFMPG
FDK is better than NERO, FAAC, FFMPG
FHG is better than NERO, FAAC, FFMPG
NERO is better than FAAC, FFMPG
FAAC is better than FFMPG
•••••••••••••••••••••••
•PROBLEM SAMPLES GROUP•
•••••••••••••••••••••••
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis
Number of listeners: 10
Critical significance: 0.05
Significance of data: 7.33E-011 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings
Source of          Degrees      Sum of     Mean
variation          of Freedom   squares    Square    F        p
Total              59           89.26
Testers (blocks)   9            12.40
Codecs eval'd      5            53.46      10.69     20.56    7.33E-011
Error              45           23.40      0.52
---------------------------------------------------------------
Fisher's protected LSD for ANOVA: 0.650
Means:
Apple    FHG      FDK      NERO     FAAC     FFMPG
4.15     4.11     3.35     2.78     1.87     1.84
---------------------------- p-value Matrix ---------------------------
         FHG      FDK      NERO     FAAC     FFMPG
Apple    0.902    0.017*   0.000*   0.000*   0.000*
FHG               0.023*   0.000*   0.000*   0.000*
FDK                        0.084    0.000*   0.000*
NERO                                0.007*   0.006*
FAAC                                         0.926
-----------------------------------------------------------------------
Apple is better than FDK, NERO, FAAC, FFMPG
FHG is better than FDK, NERO, FAAC, FFMPG
FDK is better than FAAC, FFMPG
NERO is better than FAAC, FFMPG
• One interesting point is that performance is quite similar with classical music and with other musical genres. Even 20 samples were enough to get statistically significant results: Apple > FHG & FDK > Nero > FAAC > FFMPEG.
• Problem samples are, however, a turbulent region for encoders: Apple is not as good and is not superior to FHG (Winamp). And for the first time, FDK and FHG performances diverge, with a clear superiority for FHG on the tested samples. A more anecdotal result is that FFMPEG is now statistically tied with FAAC.
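The "is better than" lists in the FRIEDMAN output follow from comparing each pair of means against Fisher's protected LSD: a difference larger than the LSD is treated as significant. Below is a rough sketch of that comparison for the problem samples group, using the means and the LSD (0.650) printed in its table. The function name `better_than` is invented for this illustration, and because the printed means are rounded to two decimals, a pair sitting very close to the threshold could come out differently than in the tool's own output.

```python
from itertools import combinations


def better_than(means, lsd):
    """Map each encoder to the encoders it beats by more than the LSD."""
    # Rank encoders from best to worst mean score.
    ranked = sorted(means, key=means.get, reverse=True)
    result = {name: [] for name in ranked}
    # combinations() keeps the ranked order, so `a` always has the higher mean.
    for a, b in combinations(ranked, 2):
        if means[a] - means[b] > lsd:
            result[a].append(b)
    return result


# Means and LSD copied from the problem samples group output.
problem_means = {
    "Apple": 4.15, "FHG": 4.11, "FDK": 3.35,
    "NERO": 2.78, "FAAC": 1.87, "FFMPG": 1.84,
}
problem = better_than(problem_means, lsd=0.650)

for encoder, beaten in problem.items():
    if beaten:
        print(f"{encoder} is better than {', '.join(beaten)}")
```

With these inputs Apple and FHG come out tied, FDK and NERO come out tied, and FAAC no longer beats FFMPG, which is the pattern described in the bullet above.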
TEST FAQ
Why test so many AAC implementations?
Out of curiosity first. It’s something I’ve wanted to do for a decade. The purpose is not only to check which one sounds best to my ears, but also to see how FDK performs against FHG, and how FFMPEG’s encoder (now used in Handbrake) fares. Second reason: future tests. I’d like to have a solid reference test if I later decide to run a personal multiformat test and have to choose one AAC implementation among the six (at least) existing ones.
Why not use VBR?
Most AAC encoders don’t have a precise VBR scale, and it's not always possible to get 128 kbps or something close. To make the test easier, I therefore decided this time to avoid VBR and all the issues that come with this mode. I chose CBR, or ABR when available (2-pass ABR with Nero). With this setup I can evaluate the core performance of each encoder (and imagine a small quality jump with VBR when available).
Are these samples difficult ones?
No, except for a small number of them. The 75 classical samples are musical passages I really enjoy. There was no selection based on difficulty. The 10 “Billboard” samples were chosen indiscriminately: 30 seconds taken from the same fixed range of each track (1 min 00 sec to 1 min 30 sec). But the HA.org samples may have been selected by members for their possible difficulty. And a final group of 10 samples is intentionally a possible “killer samples” group.
These samples and this selection were already used for previous tests I made this year.
Methodology
Java ABC/HR is my software of choice. Volume was normalized and delay removed by the software (faad2 decoder). It was a listen-and-rate test with only a little effort spent finding artifacts and adjusting the ranking. It’s still a blind test but with no ABX sessions. I’d rather test a wide set of samples non-extensively and let small ranking errors average out statistically than spend a lot of time on a smaller set of samples. My hardware setup is very basic: laptop headphone output (no DAC), AKG Q701 headphones, playback at moderate listening volume.
EDIT: I ranked 443 files (score < 5.0) and made only two mistakes (ranking the reference instead of the encoded file). I gave a 5.0 score to these two files. The final score is therefore 441/443, and the probability of guessing is very close to zero.
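To put a number on "very close to zero", here is a small sketch of the binomial tail probability. It assumes (this is my assumption, not something ABC/HR reports) that each of the 443 ranked files is an independent 50/50 choice between the reference and the encoded file; the function name is made up for the example.

```python
from math import comb


def guess_probability(trials, correct, p=0.5):
    """Probability of getting at least `correct` right out of `trials` by pure guessing."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# 441 correct identifications out of 443 ranked files.
p_guess = guess_probability(443, 441)
print(f"P(at least 441/443 by guessing) = {p_guess:.3e}")
```

Under that assumption the result is far below any usual significance threshold, so the two mistakes do not change the conclusion that the rankings were not produced by chance.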
Encoders setup
• Apple: encoded through foobar2000's converter (AAC Apple Graphical Interface): ABR 128. Metadata: qaac 2.67, CoreAudioToolbox 7.10.9.0, AAC-LC Encoder, ABR 128kbps, Quality 96
• FAAC: encoded through foobar2000 command line encoder: -b 128 - -o %d. Metadata: FAAC 1.30
• FDK: encoded through foobar2000's converter (AAC FDK Graphical Interface): CBR 128. Metadata: fdkaac 1.0.0, libfdk-aac 4.0.0, CBR 128kbps
• FHG (Winamp): encoded through foobar2000's converter (AAC Winamp FHG Graphical Interface): CBR 128. Metadata: fhgaac v03.02.15;CBR=128000
• Nero: encoded through foobar2000's converter (AAC Nero Graphical Interface): 2 pass ABR 128. Metadata: ndaudio 1.5.4.0 / -2pass -br 128000
• FFMPEG: encoded through foobar2000 command line encoder: -i - -c:a aac -b:a 128k %d. Metadata: Lavf58.45.100. FFmpeg version used: ffmpeg-4.3.1-2020-10-01
Next investigations?
I’d like to check how much more efficient next-gen formats (Opus, xHE-AAC) are compared to AAC at its best, and whether I can save some space without sacrificing quality. These formats are technically stronger but not necessarily as well tuned as Apple’s AAC.
TABLE
(with the nice help of kamedo2!!! thanks again )