HydrogenAudio

Hydrogenaudio Forum => Listening Tests => Topic started by: IgorC on 2013-12-08 16:57:00

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 16:57:00
Hi, guys.

It's time to discuss an upcoming listening test. It will be a multiformat test at 96 kbps, as a logical continuation of the last public AAC listening test from 2011.


There are a few things we need to talk about:

1. Number of codecs.
I think the feasible number is around 3-5.

2. Selection of codecs.
Please propose the codecs you want to test: AAC, MP3, Opus, Vorbis, ...
We already had some discussion here http://www.hydrogenaudio.org/forums/index....c=92490&hl= (http://www.hydrogenaudio.org/forums/index.php?showtopic=92490&hl=)
But since that was two years ago, it will be good to start from scratch.
I will be updating this list "choice of codecs" (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing)


Also, I think it will be more interesting to compare MP3 at 128-135 kbps against AAC/aoTuV at 96-100 kbps. A lot of people are probably interested in the trade-off between compatibility and compression efficiency. But that's just my point of view.

Let's discuss.

Also, this time Steve Forte Rio will be conducting the test. He helped a lot to organize and conduct the last public AAC test.
I will just organize the discussion and help him. He will receive the results from participants and keep a dialogue with them while the test is open.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-08 17:31:52
I'd like to see the results of Opus. The development is very active, and I'm very interested to see the progress since two years ago, especially at high bitrates.
Also, AAC is a mature, strong codec with good compatibility, and the Apple AAC encoder in particular is known for its very high quality. Will Opus beat Apple?
And MP3 128kbps is also interesting. It has the best compatibility, and many people know what to expect from MP3 128kbps, so it's a good "anchor".

And we need a low anchor as well. FAAC 96kbps CBR has bad quality, which makes it a good low anchor.

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: LithosZA on 2013-12-08 17:35:41
My choices would be:
MP3 (Helix or LAME or Fraunhofer) - which one is better at 96 kbps?
AoTuV Vorbis
Apple AAC
Opus

All of them running at 96 kbps
EDIT: And FAAC 96 kbps as the low anchor
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-08 18:23:19
My choice:

1. MP3@128 kbps (96 kbps will surely lead to complete defeat). We could use the Fraunhofer IIS MP3 Surround encoder, which is sometimes better than LAME at low bitrates starting from 128 kbps, but I consider LAME much more popular, so maybe we should use LAME 3.99.5 -b 128 -q0.

2. OGG Vorbis aoTuV b6.03

3. QuickTime AAC TVBR.

4. WMA Pro

5. Opus 1.1, of course.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-08 18:30:01
I think we should increase the number of samples. More samples lead to more statistically valid results.
And I think we should choose the samples so that the average bitrate of the samples tested and the average bitrate of albums are roughly equal, as I did here:
http://www.hydrogenaudio.org/forums/index....howtopic=100896 (http://www.hydrogenaudio.org/forums/index.php?showtopic=100896)
If the average bitrate of albums is 96k and the average bitrate of tested samples is 144k, the corpus is overrepresented by critical samples.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-08 18:33:54
Thank you for organizing this, my choices:

AAC (Apple/qaac)

AAC (Fraunhofer/fhgaacenc)

AAC (Fraunhofer/fdkaac)

Opus (1.1)

Vorbis (libvorbis 1.3.3)

Vorbis (aoTuV b6.03)

WMA Standard

WMA Pro


Don't care about MP3, don't care about MPC.

edit: good observation from IgorC, let me flag the "optional" ones. I don't know much about WMA; I just want to see how it performs, even if nothing has changed in the last few years (I'm not even really sure about this, Microsoft is a mess). Is Standard or Pro more compatible?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 18:40:09
eahm,

OK. But some observations.

We already know that Apple was the best AAC encoder. It represents the AAC format very well. There is no need to test the FhG encoder again.
The same goes for Vorbis. Only aoTuV.

We can't test 10 codecs. The optimal number is 3-5.

Guys, correct me if I'm wrong.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 18:48:58
I'd like to see the results of Opus. The development is very active, and I'm very interested to see the progress since two years ago, especially at high bitrates.

Do you mean including both Opus 1.0 and 1.1?

Also, AAC is a mature, strong codec with good compatibility, and the Apple AAC encoder in particular is known for its very high quality. Will Opus beat Apple?

Good question. Let's see.

And MP3 128kbps is also interesting. It has the best compatibility, and many people know what to expect from MP3 128kbps, so it's a good "anchor".

And we need a low anchor as well. FAAC 96kbps CBR has bad quality, which makes it a good low anchor.

Agree.



Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kohlrabi on 2013-12-08 18:49:45
Opus, QT-AAC, Musepack and Vorbis (AoTuV). Musepack really deserves to be tested again vs. other modern codecs.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 18:56:00
I would like to point to one comment from member Gecko, who has participated in previous tests:

http://www.hydrogenaudio.org/forums/index....st&p=780195 (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=92490&view=findpost&p=780195)
Quote
+1 for keeping the number of codecs small. Consider only three perhaps? Four is stretching it. I really struggled with the five in the last test (AAC @ ~96 kbps [July 2011]).
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-08 18:57:35
Just to be difficult, I propose 80kbps instead of 96kbps (easier for testers), and Apple LC-AAC, Apple HE-AAC, FhG (libfdk) LC-AAC, FhG (libfdk) HE-AAC, Opus 1.1, Vorbis aoTuV.

The AAC encoders could/should be a separate pre-test, especially if FhG wants to send in a newer encoder than what is in libfdk. I'd favor libfdk over anything else AAC as it's used a lot together with ffmpeg now.

Edit: This isn't 100% a serious suggestion, but I want people to think about some things.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: o-l-a-v on 2013-12-08 18:59:12
Opus, MP3(LAME), AAC(QAAC) and Vorbis(aoTuv).
MPC and WMA are not interesting imo
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-08 19:00:47
I'd like to see the results of Opus. The development is very active, and I'm very interested to see the progress since two years ago, especially at high bitrates.

Do You mean include  Opus 1.0 and 1.1?

The latest one, Opus 1.1. Testing Opus 1.0 would likely lead to a redundant duplicate of http://listening-tests.hydrogenaudio.org/igorc/results.html (http://listening-tests.hydrogenaudio.org/igorc/results.html) and http://www.hydrogenaudio.org/forums/index....showtopic=97913 (http://www.hydrogenaudio.org/forums/index.php?showtopic=97913)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-08 19:03:55
Musepack really deserves to be tested again vs. other modern codecs.

Doesn't MPC really only shine at settings that are intended to deliver transparent results, which are like 3x what is being proposed for this test?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 19:08:15
The latest one, Opus 1.1. Testing Opus 1.0 would likely lead to a redundant duplicate of http://listening-tests.hydrogenaudio.org/igorc/results.html (http://listening-tests.hydrogenaudio.org/igorc/results.html) and http://www.hydrogenaudio.org/forums/index....showtopic=97913 (http://www.hydrogenaudio.org/forums/index.php?showtopic=97913)

Agree, only Opus 1.1
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 19:13:11
My choice:

1. MP3@128 kbps (96 kbps will surely lead to complete defeat). We could use the Fraunhofer IIS MP3 Surround encoder, which is sometimes better than LAME at low bitrates starting from 128 kbps, but I consider LAME much more popular, so maybe we should use LAME 3.99.5 -b 128 -q0.

2. OGG Vorbis aoTuV b6.03

3. QuickTime AAC TVBR.

4. WMA Pro

5. Opus 1.1, of course.

I think we should probably keep in mind both TVBR and CVBR, because if TVBR ends up at ~94 kbps and the other codecs at ~96-100 kbps, then we should probably go with CVBR at ~100 kbps. Anyway, both Apple TVBR and CVBR are great.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 19:18:09
Just to be difficult, I propose 80kbps instead of 96kbps (easier for testers), and Apple LC-AAC, Apple HE-AAC, FhG (libfdk) LC-AAC, FhG (libfdk) HE-AAC, Opus 1.1, Vorbis aoTuV.

The AAC encoders could/should be a separate pre-test, especially if FhG wants to send in a newer encoder than what is in libfdk. I'd favor libfdk over anything else AAC as it's used a lot together with ffmpeg now.

Edit: This isn't 100% a serious suggestion, but I want people to think about some things.

Why not?

I agree that it will be interesting to see it. So let's see what people propose.

However, let's also see where we're coming from. We've tested AAC encoders at 96 kbps, and it's logical to test the best AAC encoder, Apple, against the rest of the codecs.

Personally it's hard for me to do a pre-test and then the test, but, yeah, let's test it in the future or even now. It's all up to people's decision.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 19:20:30
Opus, MP3(LAME), AAC(QAAC) and Vorbis(aoTuv).
MPC and WMA are not interesting imo


MP3 at 96 or 128 kbps?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: o-l-a-v on 2013-12-08 19:27:51
Opus, MP3(LAME), AAC(QAAC) and Vorbis(aoTuv).
MPC and WMA are not interesting imo


MP3 at 96 or 128 kbps?


Let's say 128. If I were to convert music to low-bitrate MP3, I would never even think about going lower than that.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 19:36:57
I will participate in this test too so here is my wishlist.

1. MP3 128 kbps. LAME 3.99.5 -V5 (high anchor)
2. MP3 96 kbps. LAME ABR is better than VBR (?)
3. Apple AAC 96 kbps (QAAC highest quality TVBR or CVBR.)
4. Opus 1.1 vbr 96 kbps.
5. Vorbis AoTuv 6.0.3 vbr 96 kbps.

low anchor - FAAC CBR 96 kbps, as Kamedo2 said. It has reasonably low quality.
We also had a discussion about having 2 low anchors: actually a low anchor and a low-middle anchor. It's good to have two anchors to validate the results. The low-middle anchor should be better than the low anchor.
It can be: FAAC 64 (low anchor) and FAAC 96 (low-middle anchor).
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 20:02:36
I think we should increase the number of samples. More samples lead to more statistically valid results.
And I think we should choose the samples so that the average bitrate of the samples tested and the average bitrate of albums are roughly equal, as I did here:
http://www.hydrogenaudio.org/forums/index....howtopic=100896 (http://www.hydrogenaudio.org/forums/index.php?showtopic=100896)
If the average bitrate of albums is 96k and the average bitrate of tested samples is 144k, the corpus is overrepresented by critical samples.

More than 20 samples? Hm, maybe, I don't know.
20 is already a high number. During the last test we waited a little more than a month to get enough results.

What do others think about it?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 21:10:29
... so maybe we should use LAME 3.99.5 -b 128 -q0.

What about VBR, -V5 at ~128 kbps?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-08 21:46:43
What I'd like to see in the test:

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-08 21:58:54
1) Musepack at 96kbps will have lowpass ~14kHz. That's too low IMHO.

2) Bitrates of WMA VBR:
WMA Std Q50: 74 kbps;
WMA Std Q75: 115 kbps;
WMA Pro Q25: 83 kbps;
WMA Pro Q50: 113 kbps.

None of them is close to the target bitrate.

3) IMHO for LAME 3.99.x low-bitrate VBR is better than low-bitrate ABR.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-08 22:41:06
For WMA you can also set the bit rate at 96.

OT
lvqcl, where did you get the bit rates for the WMA quality settings? Do you have all of them (Std and Pro)?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-08 22:49:47
... Bitrates of WMA VBR:
WMA Std Q50: 74 kbps;
WMA Std Q75: 115 kbps;
WMA Pro Q25: 83 kbps;
WMA Pro Q50: 113 kbps.

None of them is close to the target bitrate.

Then CBR @96 kbps  is the only option.

By the way, You've posted here http://www.hydrogenaudio.org/forums/index....st&p=779933 (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=92490&view=findpost&p=779933)
Quote
+1 for AAC
+1 for Vorbis
+0.5 for MP3
+0.5 for WMA standard (who knows, maybe it isn't very bad...)

IMHO mp3@96kbps cannot compete with aac/vorbis@96kbps. 112 or 128 kbps MP3 is more interesting.


Is it still so?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kohlrabi on 2013-12-09 05:01:57
Doesn't MPC really only shine at settings that are intended to deliver transparent results, which are like 3x what is being proposed for this test?
I only knew about this listening test at 128 kbps (http://listening-tests.freetzi.com/html/Multiformat_128kbps_public_listening_test_results.htm), where Musepack tied for the win and beat at least MP3, WMA (which half of the people here want to see tested again), and a young implementation of AAC. But...

1) Musepack at 96kbps will have lowpass ~14kHz. That's too low IMHO.
...might be more important than preconceptions about the format. I also found another test at 96kbps (http://forum.hardware.fr/hfr/VideoSon/Traitement-Audio/mp3-aac-ogg-sujet_84950_1.htm), where Musepack was outperformed by a huge margin, maybe due to the mentioned lowpass. With this in mind I'd opt to only test Opus, QTAAC and Vorbis, and maybe use MPC as the low anchor. Looking at the latter listening test, where's the point in testing WMA and MP3 again? It will put unnecessary strain on all listeners by having them test another one or two encoders per sample. Why not try to get smaller error bars on the tested encoders by having more results on the relevant modern codecs, instead of discouraging people with a flood of encoders which have already been shown to be outperformed?*

Garf's suggestion of comparing HE/LC AAC implementations is also interesting, but should be done in a different test, similar to the recent MP3 listening test. For this one QTAAC-LC is probably fine.

EDIT: *OK, there are no more recent results than this old listening test, so maybe there were improvements to WMA and MP3 (Helix?) warranting another comparison. Also the PR effect might be larger if MP3 was included (Slashdot: "MP3 beaten once again by modern codecs"  )
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: jensend on 2013-12-09 05:52:23
Holding a huge comparison isn't going to work; there just aren't enough people ready to spend the time testing seven codecs on many samples, esp. with modern encoders at 96kbps where differences are anything but obvious to most listeners.

The real priority here is getting a comparison between Opus 1.1 and the best AAC encoder, and getting enough samples and enough participants to feel confident about the result.

I still think that being able to compare to MP3 at the same rate is important to make the results meaningful to a wider audience beyond HA regulars. Yes, it won't win. That's fine. Maybe it could serve as low anchor; maybe it's too good to serve as low anchor.

Rather than tossing in another MP3 rate, I think trying to nail down "modern codecs at X kbps ~= MP3 at Y kbps" should be a separate test- probably one with just one modern codec and around three different MP3 rates.

The last time we talked about this I said we should include Vorbis (http://www.hydrogenaudio.org/forums/index.php?showtopic=92490&view=findpost&p=788281) because it was still somewhat commonly used by end users and because of its HTML5 etc use. But end-user Vorbis use has been slowly but steadily declining, webm never really took off, and Firefox gave in and started supporting using system codecs for MP3 and AAC in HTML5. Vorbis results could still be nice to have but I don't think it's a priority.

Musepack and WMA simply don't garner sufficient interest these days to justify the additional workload on volunteers for this test and the accompanying reduction in how many results actually get submitted and in the statistical meaningfulness of the conclusions. Esp. since musepack is (similar to what greynol said but without the silly exaggeration) generally considered to only be interesting at --quality 4 and up; musepack --quality 3 lost to even same-bitrate LAME ABR by fairly wide margins in tests (e.g. this (http://forum.hardware.fr/hfr/VideoSon/Traitement-Audio/mp3-aac-ogg-sujet_84950_1.htm)).

So: either Opus v QT-AAC v LAME 96kbps + some other low anchor, Opus v QT-AAC + LAME 96kbps as the low anchor, or possibly Opus v QT-AAC v one other codec + LAME 96kbps as low anchor.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: jensend on 2013-12-09 06:17:14
I started writing my post, left and came back, submitted it, then saw Kohlrabi's post which says some of the same things, esp. re. Musepack and re. MP3 being important to wider audiences.

Garf's suggestion of comparing HE/LC AAC implementations is also interesting, but should be done in a different test, similar to the recent MP3 listening test. For this one QTAAC-LC is probably fine.
Garf's second suggestion there really depends on his first. At 80kbps LC vs HE and FhG vs Apple are questions that may warrant exploring.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-09 08:11:10
Garf's suggestion of comparing HE/LC AAC implementations is also interesting, but should be done in a different test, similar to the recent MP3 listening test. For this one QTAAC-LC is probably fine.
Garf's second suggestion there really depends on his first. At 80kbps LC vs HE and FhG vs Apple are questions that may warrant exploring.


I specifically put 80kbps because I'm reasonably confident of the answer for 96kbps. If we're going 96kbps we avoid the entire LC/HE question, and I would put in FhG (libfdk) and Apple. So for 96kbps I would do (consider this my serious suggestion):

- Apple 96kbps
- FhG libfdk 96kbps
- Opus 1.1 96kbps
- MP3 96kbps (~low anchor)
- MP3 128kbps (~high anchor but it may fail at that )

I would exclude Vorbis. It's still used for streaming a lot (e.g. Spotify), but it also hasn't evolved since the previous test. I have no real idea whether Apple evolved, but libfdk is significant in that it's a state-of-the-art open-sourced encoder easily available in ffmpeg and used by Android, and AAC is getting enough use nowadays that the two best encoders are interesting to compare. Nobody outside HA uses Musepack, and I have seen no case that it's competitive at 96kbps, so I would exclude it as well. Someone here stated it did well in the last 128kbps test, but check which encoder it lost against and how that one did in the last tests...

LAME or Helix? Helix did actually win the last test many years ago, but hasn't evolved at all since and I'm not sure anyone actually uses it.

If the FhG guys want to submit a codec then it's going to be tricky to decide what to do with it. I think it'd almost need a pretest vs  Apple, as I wouldn't want to lose libfdk in the test. (Edit: Eh actually IIRC last time we allowed it because they were going to update Winamp with it. Winamp is dead now, so the most logical thing for a new FhG codec would be for them to update libfdk? Or are the FhG encoders available elsewhere?)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: LithosZA on 2013-12-09 09:55:26
I still think that being able to compare to MP3 at the same rate is important to make the results meaningful to a wider audience beyond HA regulars. Yes, it won't win. That's fine. Maybe it could serve as low anchor; maybe it's too good to serve as low anchor.


I agree with what jensend said. We already know 128 kbps MP3 should be transparent to most users anyway.
There needs to be a 96 kbps MP3 even if it doesn't sound too good.
We might use that as a low anchor instead of FAAC 96 kbps?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: C.R.Helmrich on 2013-12-09 10:20:47
I specifically put 80kbps because I'm reasonably confident of the answer for 96kbps.

I totally agree. Is there no interest in lower bit-rates? 48 kbps perhaps?

Quote
If the FhG guys want to submit a codec... Or are the FhG encoders available elsewhere?)

Why would Fraunhofer want to submit a codec? The codec is already out there, you guys decide whether it should be in the test.
The codec has long been available in e.g. some Magix, Sonnox, or Sony software. They should contain the same version as Winamp, especially the Sonnox plug-in (http://www.sonnoxplugins.com/pub/plugins/products/codec/codectoolbox.html). Or are you asking for free-of-charge software?

Chris
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: LithosZA on 2013-12-09 11:34:49
Quote
LAME or Helix? Helix did actually win the last test many years ago, but hasn't evolved at all since and I'm not sure anyone actually uses it.

What is the latest version of the Helix MP3 encoder? I want to do a quick test to 'hear' if there are any differences between Helix and LAME at 96 kbps.
I only found a binary on the RareWares site.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-09 11:45:52
I think we should increase the number of samples. More samples lead to more statistically valid results.
And I think we should choose the samples so that the average bitrate of the samples tested and the average bitrate of albums are roughly equal, as I did here:
http://www.hydrogenaudio.org/forums/index....howtopic=100896 (http://www.hydrogenaudio.org/forums/index.php?showtopic=100896)
If the average bitrate of albums is 96k and the average bitrate of tested samples is 144k, the corpus is overrepresented by critical samples.

More than 20 samples? Hm, maybe, I don't know.
20 is already a high number. During the last test we waited a little more than a month to get enough results.

What do others think about it?

In the last Opus test in 2011, the contributors submitted 531 valid results, but there were only 30 samples. (17.7 results/sample)
http://listening-tests.hydrogenaudio.org/igorc/results.html (http://listening-tests.hydrogenaudio.org/igorc/results.html)
This is not the most efficient use of the effort. The number of samples, 30, is the statistical bottleneck that hinders drawing even more conclusions.
I recommend choosing the number of samples by this formula: 4*sqrt(expected number of valid results/4).
This way, donors put 50% of their effort into the overall conclusions, while the remaining 50% of their effort goes to accurately measuring the quality of each sample, which helps developers.
The value used in the overall conclusion is about 2x more accurate than the average quality of one sample.

In the last AAC 96kbps test, there were 280 results, so if we were to expect the same number of contributions, 33 is the proposed number of samples.
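To make the formula concrete, here is a minimal sketch in Python (the function name is mine, not from the thread):

Code:
import math

def recommended_samples(expected_valid_results: int) -> int:
    # Kamedo2's rule of thumb: 4 * sqrt(expected_valid_results / 4),
    # which simplifies to 2 * sqrt(expected_valid_results).
    return round(4 * math.sqrt(expected_valid_results / 4))

# With the ~280 valid results of the last AAC@96 test:
print(recommended_samples(280))  # -> 33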
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gecko on 2013-12-09 12:35:40
Instead of asking for the desired codecs, I'd like to ask the following: "Which questions would you like answered by the listening test?"

I'm having a hard time coming up with any relevant questions which could be answered by a 96k or 80k multi-codec test, but that's just me. What about you guys? Are any of you using these codecs at these bitrates?

As far as I'm concerned, the most interesting questions revolve around Opus. All of the other codecs seem mature. libfdk may also be interesting, but I hardly know anything about it.
"Can Opus be considered (pretty much) transparent (most of the time) and at what bitrate?"
"How much has Opus 1.1 improved over older versions?" (e.g. the ones used in http://www.ietf.org/proceedings/80/slides/codec-4.pdf (http://www.ietf.org/proceedings/80/slides/codec-4.pdf))
"As an online-radio station, should I replace my 64k ACCP stream with Opus?" (If this is even currently possible)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-09 12:36:00
Quote
If the FhG guys want to submit a codec... Or are the FhG encoders available elsewhere?)

Why would Fraunhofer want to submit a codec? The codec is already out there, you guys decide whether it should be in the test.
The codec has long been available in e.g. some Magix, Sonnox, or Sony software. They should contain the same version as Winamp, especially the Sonnox plug-in (http://www.sonnoxplugins.com/pub/plugins/products/codec/codectoolbox.html). Or are you asking for free-of-charge software?

We can buy the encoder, that's not the problem. But I'm wondering where to get the latest and greatest since that's not so obvious from my side. If the Winamp encoder in whatever the latest Winamp release was is current, that's great.

Are there relevant differences between the libfdk_aac that you sold to Google and this encoder?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-09 12:58:35
Instead of asking for the desired codecs, I'd like to ask the following: "Which questions would you like answered by the listening test?"

I'm having a hard time coming up with any relevant questions which could be answered by a 96k or 80k multi-codec test, but that's just me. What about you guys? Are any of you using these codecs at these bitrates?


Spotify currently streams to mobile devices at 96kbps. There is supposed to be a re-launch of their mobile stuff with free streaming this week, BTW.

Realistically most applications are using even higher bitrates nowadays but they are likely pointless to test. YouTube used 96kbps for many videos, but switched to 128kbps a year or so ago. 80kbps stereo means about 256kbps for 5.1 audio which I also use. But such results aren't directly comparable.

You could say 96kbps is the highest bitrate where we still expect to be able to detect differences between codecs.

Quote
I totally agree. Is there no interest in lower bit-rates? 48 kbps perhaps?


Technically, yes. But practically, do you know many examples where people are still deploying 48kbps music?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-09 13:40:51
I will participate in this test too so here is my wishlist.

1. MP3 128 kbps. LAME 3.99.5 -V5 (high anchor)
2. MP3 96 kbps. LAME ABR is better than VBR (?)
3. Apple AAC 96 kbps (QAAC highest quality TVBR or CVBR.)
4. Opus 1.1 vbr 96 kbps.
5. Vorbis AoTuv 6.0.3 vbr 96 kbps.

low anchor - FAAC CBR 96 kbps, as Kamedo2 said. It has reasonably low quality.
We also had a discussion about having 2 low anchors: actually a low anchor and a low-middle anchor. It's good to have two anchors to validate the results. The low-middle anchor should be better than the low anchor.
It can be: FAAC 64 (low anchor) and FAAC 96 (low-middle anchor).

I recommend FFmpeg MP2 at 96 as the very-low anchor, if you want to split the low anchor into two. It has a low-pass filter at 5.6kHz, much lower than FAAC 96, which is at 10kHz.
A comparison of FAAC 64, FAAC 96, LAME 96, LAME 128 is below. I think MP3 at 96 kbps is too good to be a low anchor.
http://www.hydrogenaudio.org/forums/index....howtopic=102876 (http://www.hydrogenaudio.org/forums/index.php?showtopic=102876)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gecko on 2013-12-09 14:33:09
Thanks for answering, I hadn't considered streaming to mobile. I guess that would be an interesting question: "How does the audio quality of popular mobile streaming services compare?" Codecs/settings should be chosen accordingly.

I agree that higher bitrates are pointless, even though there are many threads looking for the "absolute best mp3" etc.

Quote
Technically, yes. But practically, do you know many examples where people are still deploying 48kbps music?

If shoutcast counts (hopefully they'll leave the shoutcast page online after taking down Winamp) low bitrates seem to be quite popular. Maybe also due to mobile use? Here in Germany, most data flatrates are throttled to 64kbps after using up the paid-for high-speed traffic.

Bitrates of the top 10 stations (sorted by listeners):
192 x 1 (mp3)
128 x 1 (mp3)
64 x 3 (2x mp3, aac+)
48 x 1 (mp3 <-- yikes!)
32 x 4 (aac+)

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-09 14:49:35
OT
lvqcl, where did you get the bit rates for the WMA quality settings? Do you have all of them (Std and Pro)?

I simply encoded several albums and took the average bitrate. The bitrates are as follows (Quality 10/25/50/75/90/98):
std: 42 / 53 / 74 / 115 / 176 / 322
pro: 53 / 83 / 113 / 134 / 166 / 266
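The averaging method isn't spelled out above; a duration-weighted mean over the encoded files is one reasonable reading. A minimal sketch in Python, with invented file sizes and durations:

Code:
def average_bitrate_kbps(files):
    # files: list of (size_in_bytes, duration_in_seconds) pairs.
    total_bits = sum(size * 8 for size, _ in files)
    total_seconds = sum(duration for _, duration in files)
    return total_bits / total_seconds / 1000

# Two hypothetical tracks: 2.8 MB over 240 s and 4.1 MB over 312 s.
print(average_bitrate_kbps([(2_800_000, 240), (4_100_000, 312)]))  # -> 100.0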

By the way, You've posted here
[...]
Is it still so?

+ AAC (Apple or FhG or both)
+ Opus 1.1
+ Vorbis (aotuv)
(MP3 and WMA aren't very interesting to me now)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-09 15:50:58
In the last AAC 96kbps test, there were 280 results, so if we were to expect the same number of contributions, 33 is the proposed number of samples.

Some info about the last AAC@96 test:

Discarded listeners: 13

Accepted listeners: 25. Among them:

10 listeners submitted results for all 20 samples
3 listeners submitted results for only 1 sample
2 listeners: results for 4 samples
2 listeners: results for 7 samples
and the remaining 8 listeners: results for 2, 3, 5, 6, 9, 10, 11, 14 samples.
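As a quick sanity check (my arithmetic, not from the thread), the breakdown above can be tallied in Python:

Code:
# Results per accepted listener, as listed above.
per_listener = [20] * 10 + [1] * 3 + [4] * 2 + [7] * 2 + [2, 3, 5, 6, 9, 10, 11, 14]
print(len(per_listener), sum(per_listener))  # -> 25 listeners, 285 results

which is in the same ballpark as the ~280 valid results cited earlier.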
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-09 16:32:54
Some info about the last AAC@96 test:

Discarded listeners: 13

Accepted listeners: 25. Among them:

10 listeners submitted results for all 20 samples
3 listeners submitted results for only 1 sample
2 listeners: results for 4 samples
2 listeners: results for 7 samples
and the remaining 8 listeners: results for 2, 3, 5, 6, 9, 10, 11, 14 samples.

Thank you.
(http://i43.tinypic.com/5nrvyo.png)
It seems the majority of results come from the "full" contributors. I think having the same sample tested by more than 10 people is a bit of overkill, but if we were to double the sample count to 40, which is good from a statistical point of view, few could be "full" contributors.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: jmvalin on 2013-12-09 18:28:58
In the last Opus test in 2011, the contributors submitted 531 valid results, but there were only 30 samples. (17.7 results/sample)
http://listening-tests.hydrogenaudio.org/igorc/results.html (http://listening-tests.hydrogenaudio.org/igorc/results.html)
This is not the most efficient use of the effort. The number of samples, 30, is the statistical bottleneck that hinders drawing even more conclusions.
I recommend choosing the number of samples by this formula: 4*sqrt(expected number of valid results/4).
This way, donors put 50% of their effort into the overall conclusions, while the remaining 50% of their effort goes to accurately measuring the quality of each sample, which helps developers.
The value used in the overall conclusion is about 2x more accurate than the average quality of one sample.

In the last AAC 96kbps test, there were 280 results, so if we were to expect the same number of contributions, 33 is the proposed number of samples.


I would actually go one step further. Why not have only one listener for each sample, i.e. give everybody different samples. That would maximize both the statistical significance of the conclusion and the usefulness to the developers (at least for me).
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 18:35:04
Well, a lot of stuff is going on.

Kamedo2,
Until now, one of the conditions of HA tests has been "no less than 10 results per sample".
Please have a look through these 10 "full" contributors in the zip file here: http://listening-tests.hydrogenaudio.org/i...-a/results.html (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/results.html)

Sadly, some of them got tired after, let's say, 10 samples, and after that they just rated the low anchor.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-09 18:43:03
I would like to change my choice to:

AAC (Apple/qaac) 80 kbps

AAC (Fraunhofer/fhgaacenc) 80 kbps

AAC (Fraunhofer/fdkaac) 80 kbps

Opus (1.1) 80 kbps

Vorbis (libvorbis 1.3.3)

Vorbis (aoTuV b6.03)

WMA Standard

WMA Pro

Still, don't care about MP3 and MPC.

I simply encoded several albums and took the average bitrate. The bitrates are as follows (Quality 10/25/50/75/90/98):
std: 42 / 53 / 74 / 115 / 176 / 322
pro: 53 / 83 / 113 / 134 / 166 / 266
Thanks, I'll do some tests too.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 18:44:14
Instead of asking for the desired codecs, I'd like to ask the following: "Which questions would you like answered by the listening test?"

Hi, Gecko.

Agree. But there are so many questions that could be answered by one public test.
Streaming, portable use, etc. People will express what they need and we will test that.

I'm having a hard time coming up with any relevant questions which could be answered by a 96k or 80k multi-codec test, but that's just me.

Also, 96 kbps (VBR actually goes quite high, 110-120 kbps max) can give a hint of what happens at ~128 kbps.

YouTube used 96kbps for many videos, but switched to 128kbps a year or so ago.

I just checked a few fresh videos on YouTube. The default 360p YouTube videos still use AAC at 96 kbps.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-09 18:46:30
I would actually go one step further. Why not have only one listener for each sample, i.e. give everybody different samples. That would maximize both the statistical significance of the conclusion and the usefulness to the developers (at least for me).

Yes, I had thought of that, but in that case the standard error of each score would be unacceptably big; I mean, each individual score would be unreliable. Different people have different ideas of the score, which would worsen the situation. We could not even say which sample resulted in the worst quality.
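Kamedo2's point is the usual standard-error argument: the uncertainty of a per-sample mean shrinks as sd/sqrt(n) with n listeners. A rough illustration in Python, with an assumed rating spread of 1.0 on the 1-5 scale:

Code:
import math

sd = 1.0  # assumed standard deviation of individual ratings (illustrative)
for n in (1, 4, 10):
    print(n, round(sd / math.sqrt(n), 2))
# n=1 -> 1.0, n=4 -> 0.5, n=10 -> 0.32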
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 18:48:07
I will ask people to please concentrate mainly on the choice of codecs, bitrate, etc. Later we will have time to discuss the samples, their quantity, and other conditions.

First of all we should figure out what we want to test.
Though parallel discussions are ok.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 18:53:39
I would like to change my choice to:

OK, eahm. Updating your choice.

It's worth clarifying that everybody can change his/her choice.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 19:00:54
I would like to change my choice to:

AAC (Apple/qaac) 80 kbps  AAC (Fraunhofer/fhgaacenc) 80 kbps  AAC (Fraunhofer/fdkaac) 80 kbps
Opus (1.1) 80 kbps

Do all these codecs have a VBR mode at 80 kbps?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-09 19:07:15
Well, a lot of stuff is going on.

Kamedo2,
Until now, one of the conditions of HA tests has been "no less than 10 results per sample".
Please have a look through these 10 "full" contributors in the zip file here: http://listening-tests.hydrogenaudio.org/i...-a/results.html (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/results.html)

Sadly, some of them got tired after, let's say, 10 samples, and after that they just rated the low anchor.

I said that "more than 10 results per sample is a bit overkill", but if many are 5.0, around 10 res/sample might be about right. I'll later try to dot-plot it, rather than just an average with a CI95% error bar, so that I can have a better grasp of the score distribution.
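A dot plot of this kind is straightforward to sketch; a minimal example in Python with matplotlib, using invented scores (everything below is illustrative, not from the thread):

Code:
import matplotlib.pyplot as plt

# Hypothetical per-listener scores for three samples; a dot plot shows the
# whole distribution instead of only a mean with a CI95 error bar.
scores = {
    "sample 1": [4.5, 4.0, 3.8, 4.9, 4.2],
    "sample 2": [3.0, 3.5, 2.8, 4.1, 3.3],
    "sample 3": [4.8, 5.0, 4.6, 4.9, 5.0],
}
for y, (name, vals) in enumerate(scores.items()):
    plt.plot(vals, [y] * len(vals), "o")
plt.yticks(range(len(scores)), list(scores))
plt.xlabel("rating (1-5)")
plt.show()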
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 19:19:55
I said that "more than 10 results per sample is a bit overkill", but if many are 5.0, around 10 res/sample might be about right...

That's why a good low anchor with acceptable quality has to be included, so people won't just rate the low anchor and submit the other results as unrated 5.0. I admit our low anchor, ffmpeg at 96 kbps, was very bad.

faac at 96 kbps (CBR) should do a better job.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-09 19:22:01
I don't think the Apple AAC encoders shine at 80k. They shine at 96k. At 80k, I'm almost sure Opus would beat the AAC-LC encoders.
And 96k is likely to be the bitrate many people would use. For smartphones with speed limitations (128kbps is common), broadcasting at 96k seems natural.
http://www.hydrogenaudio.org/forums/index....howtopic=102876 (http://www.hydrogenaudio.org/forums/index.php?showtopic=102876)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 19:23:10
Kamedo2,

If there are at least 10 listeners here, then we can raise the number of samples or even work on what Jean-Marc has proposed.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 19:54:02
I totally agree. Is there no interest in lower bit-rates? 48 kbps perhaps?

Everything is open for discussion. If people prefer 48 kbps or any other rate, so be it.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-09 20:25:11
I don't think the Apple AAC encoders shine at 80k. They shine at 96k. At 80k, I'm almost sure Opus would beat the AAC-LC encoders.
And 96k is likely to be the bitrate many people would use. For smartphones with speed limitations (128kbps is common), broadcasting at 96k seems natural.
http://www.hydrogenaudio.org/forums/index....howtopic=102876 (http://www.hydrogenaudio.org/forums/index.php?showtopic=102876)

Damn it, poor IgorC. I am afraid Kamedo2 is right: AAC shines at 96. I was so excited to see it against Opus at 80 that I didn't even think about the older test already performed, and AAC really hasn't changed much since.

Sorry again; reverting back. Last, definitive choice:

AAC-LC (Apple/qaac) VBR 96 kbps

AAC-LC (Fraunhofer/fhgaacenc) VBR 96 kbps

AAC-LC (Fraunhofer/fdkaac) VBR 96 kbps

Opus (1.1) VBR 96 kbps
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-09 20:27:19
I'm interested in Opus and fdk-AAC, as they are new players with high popularity potential, and in how they compete with Apple-AAC @96 or @80. So:
1. Opus
2. fdk-AAC
3. Apple-AAC
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-09 20:45:15
I think we should probably keep in mind both TVBR and CVBR, because if TVBR ends up at ~94 kbps and the other codecs at ~96-100 kbps, then we should probably go with CVBR at ~100 kbps. Anyway, both Apple TVBR and CVBR are great.


We'll see about that (the bitrate for TVBR). But anyway, TVBR is the recommended mode and it is the most used, so results for other algorithms would not be as useful and informative. This should be a decisive argument.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 21:01:13
eahm, Serge Smirnoff,

Will update later.

Does FhG fdk have a VBR mode at 80 or 96 kbps?


eahm,
...
AAC-LC (Fraunhofer/fhgaacenc) VBR 96 kbps
...

Can I ask you what you expect from testing FhG again? fhgaacenc uses the same FhG Winamp encoder that we tested in the last public listening test. It came 2nd, right after Apple AAC. The result will be the same.
Anyway, it's your choice.


I think we should probably keep in mind both TVBR and CVBR, because if TVBR ends up at ~94 kbps and the other codecs at ~96-100 kbps, then we should probably go with CVBR at ~100 kbps. Anyway, both Apple TVBR and CVBR are great.


We'll see about that (the bitrate for TVBR). But anyway, TVBR is the recommended mode and it is the most used, so results for other algorithms would not be as useful and informative. This should be a decisive argument.

Agree
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-09 21:15:42
It came 2nd, right after Apple AAC. The result will be the same.

First, Apple did not win.

Second, what makes you so sure the results will be identical? I can provide random data for the contenders and not have my results tossed so long as I don't do anything stupid in ranking the anchors with respect to the contenders. These tests are subjective, after all. Also, even if people who participated in both tests gave the same results (don't hold your breath on that), what about people who participated in one test but not the other?

Third, I see no reason to dismiss this test, which didn't give the exact* same result:
http://www.hydrogenaudio.org/forums/index....howtopic=100525 (http://www.hydrogenaudio.org/forums/index.php?showtopic=100525)

(*) Maybe it did, if you actually pay attention to the error bars for both tests (just between Apple and FhG, they are actually statistically tied overall in both tests!).
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: C.R.Helmrich on 2013-12-09 21:20:30
If the Winamp encoder in whatever the latest Winamp release was is current, that's great.

Are there relevant differences between the libfdk_aac that you sold to Google and this encoder?

Yes, Winamp 5.666 has the latest AAC encoder quality-wise; there are no new quality tunings ready for release.

I'll let you know when quality is improved. Or just ask

The Winamp/Sonnox/... encoder has a completely different code-base than fdkaac and is a bit better tuned, especially for VBR.

Chris
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: jensend on 2013-12-09 21:42:22
greynol, what on earth are you getting at? Are you just trying to be a devil's advocate or troll the process?

Apple absolutely did win. p=.002.

IgorC didn't claim the individual test results will be identical, he claimed the overall ordering will be the same. Though exact scores would vary if you ran that test a thousand times, since Apple won the test by a statistically significant margin, it is to be expected that it would place first in a large majority of repeated tests.

There are tons of good reasons to value the last HA listening test higher than SoundExpert's unsound methodology. That has been discussed more than plenty already.

The last test had five codecs incl. anchor, and several of those who did the test said it was too much. Now, at a higher bitrate, people are trying to throw in every AAC encoder and mode under the sun. You won't get enough worthwhile participation to get any useful information out of the test if you do that.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-09 21:46:19
Apple absolutely did win. p=.002.

I see it now, CVBR, not TVBR, though a retest could easily go the other way.

There are tons of good reasons to value the last HA listening test higher than SoundExpert's unsound methodology. That has been discussed more than plenty already.

I've done my share of criticizing SE, TYVM. You apparently aren't familiar with the details of the test I linked, as none of the comments I've seen about SE's unsound methodology apply.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: binaryhermit on 2013-12-09 21:53:04
I have to say that I'm most interested in Opus, Vorbis (aotuv), and mp3 (lame) around 80-96 kbps.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-09 22:06:37
Greynol,

Let me give you one example.
There were 3 public tests: Roberto's, Sebastian's, and mine. All of them have shown that HE-AAC is better than Vorbis.

In the same way, if there is a new public test (well done, with enough people, with a correct methodology), Apple AAC will always be on top of the other AAC encoders.
There can be variations, but the average score of Apple will always be on top.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-09 22:26:26
Can I ask you what you expect from testing FhG again? fhgaacenc uses the same FhG Winamp encoder that we tested in the last public listening test. It came 2nd, right after Apple AAC. The result will be the same.
Anyway, it's your choice.

I remember Chris said he/they did some tuning at the low bitrates, even in the very last release (.16). I wonder if anything changed up to ~96 kbps.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-09 22:26:56
Does FhG fdk have a VBR mode at 80 or 96 kbps?

fdkaac.exe -m 1 (libfdk-aac 3.4.12) gave me 94.1 kbps for the whole Shpongle album
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-09 22:36:40
Here is a document I made a while ago about kbps and settings: http://pastebin.com/4HiD8juZ (http://pastebin.com/4HiD8juZ)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: testyou on 2013-12-09 23:11:39
I'm interested in:
Opus --bitrate 96
Apple -V 36 (target is ~96)
LAME -V 5 (target is ~128)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-10 01:13:53
The whole point of this is that FhG could beat Apple in a re-match, especially when it tied Apple in a perfectly valid test, personal attacks against me aside.

Well, I see you're not familiar enough with the results http://listening-tests.hydrogenaudio.org/i...ous/results.zip (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/miscellaneous/results.zip)

And that's why you can't imagine why somebody (me in this case) can be sure that "FhG can't beat Apple in a re-match" (always speaking of 96 kbps, LC-AAC, stereo, 44.1 kHz).

If you analyse the results closely, you will see that there are two groups of people. First, the bigger one, who preferred the Apple encoder over FhG with statistically significant differences. And the second (a smaller one), who very slightly but still preferred the FhG encoder over Apple's, though with no statistically valid difference. Someone could argue about sample selection. Well, it was a very representative set of 20 (!) samples which were automatically and randomly picked from different groups of samples.

And now you bring up the SE test as an argument.
According to the HA public listening test of 2011, there was not a single person who preferred Nero over any other AAC encoder (excluding, of course, the low anchor). Not a single person.
But according to the SE test, the average score for Nero was higher than Apple's. Sorry, this can't be right. No.
And I can even explain to you what happened. Nero uses a lot of long frames, which makes it quite good for tonal samples; the SE set generally consists of this kind of sample and doesn't have any other kind of problematic samples.
That's why Nero does that well, while HA's balanced set consists of different kinds of samples, including transients, where Nero performs very badly.

P.S. Anyway, I don't mind including FhG. In fact, in my opinion FhG has best-in-class quality among HE-AAC encoders at 64 kbps, and at some point it will shine in some future public test at that rate.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-10 01:26:54
SE's test results didn't agree with yours so they must be wrong, the graphical representation of the overall results of your test doesn't do justice to the test data, and you feel there is no point in looking into it any further; I get it.

The next time I see data from two different tests that aren't in agreement, I'll just ask you to point me in the right direction.  The scientific method of repeating an experiment to confirm the results be damned.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-10 01:28:42
P.S. Anyway, I don't mind including FhG.

I think that would be great.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: jensend on 2013-12-10 01:39:05
Ah yes, editing your posts after people have replied in an effort to cover your tracks. Classy move, that. "'Hey, wasn't Trotsky in that photo last time I saw it?' 'No, Comrade. Nothing to see here.'"

You're dismissing the SE result which doesn't exactly agree.  I guess I'll have to take you at your word as to why that is.
Rather than taking my word for it, you could actually read the thread you linked to, along with any number of other threads on the subject.

AFAIK the only thing done differently for that test as compared to most SE tests was to quit with the artifact amplification. Though that was by far the most obvious problem with SE's methods, there have been many other concerns. A few of those include only having five samples (which have been considered unrepresentative), fine-tuning the rate settings to try to achieve a target on those five samples rather than allowing VBR encoders to make intelligent rate decisions given their target, not following normal ABC/HR or other established test protocol, and some statistical methods concerns.

This isn't about people "having an axe to grind." When you leap to uninformed conclusions and criticize people in an uncivil fashion it's pretty absurd for you to turn around and be hypersensitive - "Oh my, somebody called me out on my bad behavior! how rude of them!"

Without bothering to actually read the test results to find out whether there was a discrepancy, and without bothering to ask about in a civil fashion, you leapt to telling the person who organized and did so much of the work on that test that he was lying about its results. You disparaged scientific statistical methods in an incoherent fashion. Rather than considering what people have said about SE, including on the page you linked, you leapt to telling me I'm just ignorant. And you didn't expect to get any pushback on any of this unless someone "had an axe to grind"? Is that because you expect to usually be able to just bin anybody who dares to disagree with you?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-10 02:50:29
I merely questioned (and am still questioning) a blanket claim. The differences between encoders simply aren't as great as all this posturing implies. Meanwhile we are all expected to dismiss another set of test results as completely worthless. I, for one, find the tight grouping of samples encoded with FhG to be worthwhile to at least a few regular and outspoken members. Perhaps this might be taken into consideration for this current round of testing.

I will compromise my request so that we don't have to worry about the possibility of redundant results:
QAAC 80 kbits TVBR
Whichever FhG AAC encoder will likely get used by the greatest number of people in a post-Winamp world
Opus
Any other non-AAC format that stands a chance at this bitrate.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: testyou on 2013-12-10 03:15:59
QAAC 80 kbits TVBR

I think this would be qaac -V 27.
Quote
Whichever FhG AAC encoder will likely get used by the greatest number of people in a post-Winamp world

fdkaac?
Quote
Any other non-AAC format that stands a chance at this bitrate.

I'm not sure what else there would be.  I remember vorbis being lower than nero here.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: binaryhermit on 2013-12-10 03:18:27
Quote
Any other non-AAC format that stands a chance at this bitrate.

I'm not sure what else there would be.

Opus?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: testyou on 2013-12-10 03:20:05
greynol already included that in his list.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-10 03:23:45
Then I guess that's it for my request.

I'd still love to see FhG from Winamp 5.666 be tested at 96 kbits, but it doesn't seem to make much sense to me and I don't want IgorC to feel obligated to do something that he likely thinks is a waste of time.

Let that conclude my unwelcome visit into this discussion.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TomasPin on 2013-12-10 03:31:47
I'm not sure what else there would be.  I remember vorbis being lower than nero here.

Um... MP3pro? WMA pro? Does anyone use those anymore? Hope not...

My choices:
Apple AAC @ 80 kbps
Opus 1.1 @ 80 kbps (or even less, owing to its mostly-transparent performance at those rates)
LAME 3.99.5 MP3 @ 96 kbps (could be the low anchor?)
FHG AAC or FDK AAC @ 80 kbps

Awaiting further discussion on which Fraunhofer encoder to use. I'd say whichever has a brighter future...
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-10 03:38:58
If the Winamp encoder in whatever the latest Winamp release was is current, that's great.

Are there relevant differences between the libfdk_aac that you sold to Google and this encoder?

Yes, Winamp 5.666 has the latest AAC encoder quality-wise; there are no new quality tunings ready for release.

I'll let you know when quality is improved. Or just ask

The Winamp/Sonnox/... encoder has a completely different code-base than fdkaac and is a bit better tuned, especially for VBR.

Chris

Chris, how are you?
Let's clear up our doubts.

Were there improvements (not bugfixes) that improve the audible quality of your AAC encoder at 96 kbps in the last 2 years?
If the answer is yes, could you please indicate on which samples, because I really fail to find any audible difference.

There was only one sample (which was actually submitted to you by me), "In the roof with Quasimodo", that is coded differently by different versions of your encoder. But there is still no audible difference for me.

Thank you.

This is for our information. The FhG encoder will be included anyway if enough people want it.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-10 04:49:37
Were there improvements (not bugfixes) that improve the audible quality of your AAC encoder at 96 kbps in the last 2 years?
If the answer is yes, could you please indicate on which samples, because I really fail to find any audible difference.

There was only one sample (which was actually submitted to you by me), "In the roof with Quasimodo", that is coded differently by different versions of your encoder. But there is still no audible difference for me.

If the answer is no, I will no longer care to see AAC/FhG/fhgaacenc in the test. Only AAC/Apple, AAC/FhG/FDK and Opus, still @96.

Chris, where was the tuning performed? In which bitrate range, if I can put it that way? Thanks.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: yourlord on 2013-12-10 06:00:39
Best of breed of the modern codecs: Apple AAC to represent AAC, the latest Vorbis reference, Opus 1.1, and the latest LAME at the time of testing.

All at ~96kbps, though 80kbps might be easier to test as most of these codecs (sans mp3) are pushing into transparent territory at 96kbps.

For reference, it would be nice to have a LAME encode at ~128kbps, to drive home the quality advantage, if any, of the modern codecs vs MP3 even when MP3 has a significant bitrate advantage.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-10 08:29:45
SE's test results didn't agree with yours so they must be wrong,


Yes. Either there's a flaw in SE's test, or in ours. It has been pointed out what the issue with the SE test is (limited, non-representative, biased sample selection). It's up to you to point out what could be wrong with IgorC's previous test if you believe the results are invalid. If they aren't invalid, the other test has to be wrong.

Quote
The next time I see data from two different tests that aren't in agreement, I'll just ask you to point me in the right direction.  The scientific method of repeating an experiment to confirm the results be damned.


The idea is to set up the test so it can be repeated and subsequent tests will be in agreement. Do you have any arguments why this would not be the case? It would be nice to verify, but there are already enough candidate codecs and already enough of a shortage of people with time to run the tests, so I see no argument to do it without good reason.

If you believe you saw a flaw in the test setup that invalidates the result and could change the outcome on a repeat, speak up now so we can see if it can be fixed. If not, then what's your argument in the first place?

The entire point of the statistical analysis instead of just reporting mean scores is to ensure that a subsequent test gives the same result even if there is random variance in listener ratings & samples.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: nu774 on 2013-12-10 09:19:02
Sorry for asking a rather foolish question.
I have almost zero knowledge in this area, but considering a listening test as a kind of sampling survey, what is considered the "population" here, in order to compute the reliability / validity of the result?

Considering both the human (subject) and the audio sample (object) as parameters,
1. "population" = all individuals in the world
2. "population" = all songs and non-songs in the world (kind of ridiculous, looks impossible)
or something like that?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-10 09:29:54
For (2) I'd say stuff that's generally considered to be "music". There is quite some codec research regarding speech but our tests steer clear of that.

For (1), yes, although our sample selection is obviously biased towards the HA audience. So probably the population is the generic audio enthusiast nerd with a PC and above-average listening equipment, and in many cases, some training wrt typical encoder artifacts.

So the question is really what the best codec is for the "discerning" listener to encode his music.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: nu774 on 2013-12-10 09:44:30
For (2) I'd say stuff that's generally considered to be "music". There is quite some codec research regarding speech but our tests steer clear of that.

Thanks.
Taking speech or other non-music into consideration makes the population completely indefinable, so it makes sense to me. (Even with music only, I can't imagine how large that class will become ...)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-10 10:37:12
Pardon, the discussion is temporarily stopped.
It will probably be better for everyone if I take it to a different place. It will take us some time.

Please stay in touch. Here is my mail
igoruso at gmail dot com
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-10 11:03:32
For (2) I'd say stuff that's generally considered to be "music". There is quite some codec research regarding speech but our tests steer clear of that.


I appreciated the two spoken samples (3 & 15) in the HA2011 test.  While it's much less likely I'd use this test's bitrates (80, 96, 128) for speech, I would very much like to have at least one sample of single-voice chanting / a cappella singing, like sample 4 in that test.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-10 11:15:31
For (2) I'd say stuff that's generally considered to be "music". There is quite some codec research regarding speech but our tests steer clear of that.


I appreciated the two spoken samples (3 & 15) in the HA2011 test.  While it's much less likely I'd use this test's bitrates (80, 96, 128) for speech, I would very much like to have at least one sample of single-voice chanting / a cappella singing, like sample 4 in that test.


When I say speech I mean just that, i.e. not singing/chanting/acapella. Like what you have in a radio show in-between the music. This can be encoded extremely effectively at much lower bitrates than music, so it's a bit of a different area, codec-wise. (Things like Opus and USAC switch to a different mode to handle it)

I don't actually know if pure speech codecs handle singing.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-10 11:31:25
You're dismissing the best available listening evidence

You're dismissing the SE result which doesn't exactly agree.  I guess I'll have to take you at your word as to why that is.

The whole point of this is that FhG could beat Apple in a re-match, especially when it tied Apple in a perfectly valid test, personal attacks against me aside.

I would like to see such a re-match.

Let me quote the last AAC public listening test. http://listening-tests.hydrogenaudio.org/i...-a/results.html (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/results.html)
Code: [Select]
        Nero      CVBR      TVBR       FhG        CT  low_anchor
       3.698     4.391     4.342     4.253     4.039     1.545

Here, CVBR and TVBR have slightly higher average scores, with unadjusted p-values of 0.002 and 0.059.  So it's not totally unthinkable for FhG to beat Apple in a re-match, although quite unlikely. But even in that case, FhG beating Apple by a significant margin is unlikely; the difference, if it exists, is less than 0.100. The difference is tiny. Do you really care?
(1) Is there a statistically significant difference? (2) Is it a big difference?  These questions are not the same, and typically, (2) is more important.

I'm deeply skeptical about the SE result, because sometimes MP2 wins, Opus is statistically tied with LAME, and then there's this: http://slashdot.org/story/09/03/11/153205/...s-of-mp3-format (http://slashdot.org/story/09/03/11/153205/young-people-prefer-sizzle-sounds-of-mp3-format)

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-10 13:07:15
SE's test results didn't agree with yours so they must be wrong,


Yes. Either there's flaw in SE's test, or in ours. It has been pointed out what the issue with the SE test is (limited, non-representative, biased sample selection). It's up to you to point out what could be wrong with IgorCs previous test if you believe the results are invalid. If they aren't invalid, the other test has to be wrong.


Those HA and SE @96 tests had different versions of the participating codecs, slightly different settings, different sample sets, different ways of presenting stimuli to testers and obviously different types of participants. How can you expect exactly similar results from both tests? IMHO they correlate well with all this in mind.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-10 14:34:31
As it's only about whether or not the FhG AAC encoder should participate: the results of the HA test are so close that IMO FhG could of course win another listening test.
Not even the confidence intervals say that Apple AAC is better than FhG, and these only give a statement on this particular test with the specific samples used and listeners participating.
Sure, if we assume (for good reasons) that the test was conducted well, we would not expect that a codec like Nero, which came out much worse in that test, would win in a new test, but nobody has considered testing Nero here as far as I can see.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-10 14:40:02
Those HA and SE @96 tests had different versions of the participating codecs, slightly different settings,


This is relevant, but IIRC some results are incompatible even with the same or nearly the same versions.

Quote
different sample sets, different ways of presenting stimuli to testers and obviously different types of participants. How can you expect exactly similar results from both tests? IMHO they correlate well with all this in mind.


These should not be relevant. The type of listeners is already a bias, as stated in previous posts in this thread, and one which neither of us can get around. If the selection of samples has an influence, that means a bad bias in their selection that invalidates the test (and it's the exact problem I have with your test!).

The way the stimuli are presented shouldn't affect the result. If it does, that's a flaw again. But you're not amplifying artifacts any more, right?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-10 14:54:09
As it's only about whether or not the FhG AAC encoder should participate: the results of the HA test are so close that IMO FhG could of course win another listening test.


The whole point we've been trying to explain is that this should be impossible if both tests are conducted correctly. We know our selection of listeners is biased and that could affect things. However, I wouldn't expect that to make a difference between two AAC codecs; rather, a test with generic listeners will on average output higher ratings, due to more people not being able to discern differences. And it remains to be seen if the audience on SE is wildly different from the one here.

Let me state it again: if you repeat the test, you should get a compatible result. If someone else runs a similar test, they should get a compatible result. That's the whole point of the test setup. If you can run the same test and get another result, what's the point of running a test in the first place?

Quote
Not even the confidence intervals say that Apple AAC is better than FhG


This is downright false: FhG is worse than CVBR (p=0.005)

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-10 14:57:53
these only give a statement on this particular test with the specific samples used and listeners participating.


No, they don't. They would if they were treated as the entire population, but they're analyzed as a sample of the population. What you say is both wrong and irrelevant. These are really basic things.

http://en.wikipedia.org/wiki/Statistical_sampling (http://en.wikipedia.org/wiki/Statistical_sampling)

To illustrate the difference: we KNOW that CVBR>TVBR for those specific samples and those specific listeners, because that's exactly what was tested and we can see the result. But the variance of the result indicates that this result may possibly not hold for all music samples and every person-with-a-pc-and-interested-in-audio, so this wasn't concluded from the test. On the other hand, CVBR was concluded to be better than FhG because the result indicates that if you rerun the test 200 times, with a similarly representative sample selection and a similar, but not necessarily identical, set of listeners, FhG will only win once and lose 199 times.
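
To make that concrete, here is a minimal Monte Carlo sketch of the interpretation; the mean difference and standard error below are hypothetical stand-ins chosen to give a one-sided p of about 0.005, not the actual 2011 test statistics:
Code: [Select]
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: an observed mean difference of 0.13 rating
# points with a standard error of 0.05 gives a one-sided p of ~0.005.
diff_mean, diff_se = 0.13, 0.05

# Simulate 200 independent reruns of the whole test.
reruns = rng.normal(diff_mean, diff_se, size=200)

# In how many reruns does the trailing codec come out ahead?
print((reruns < 0).sum(), "wins out of 200 reruns")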
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-10 15:55:38
As it's only about whether or not the FhG AAC encoder should participate: the results of the HA test are so close that IMO FhG could of course win another listening test.

All You do is look at an average score and draw conclusions based on that.

You could download the results http://listening-tests.hydrogenaudio.org/i...ous/results.zip (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/miscellaneous/results.zip)
You will find quite a few people who rated Apple significantly higher than FhG, and fewer who preferred FhG, though not significantly.

As far as I can see only Kamedo2 took the job  and had a closer look.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-10 16:25:58
You could download the results http://listening-tests.hydrogenaudio.org/i...ous/results.zip (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/miscellaneous/results.zip)
You will find quite a few people who rated Apple significantly higher than FhG, and fewer who preferred FhG, though not significantly.

As far as I can see only Kamedo2 took the job  and had a closer look.

(http://i44.tinypic.com/2z6cpxv.png)
(http://i40.tinypic.com/e8x1dy.png)
(http://i42.tinypic.com/2chxkox.png)
(http://i41.tinypic.com/1exhf.png)
(http://i42.tinypic.com/2lbdtmb.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-10 16:54:28
Visualization of the last (2011) AAC 96kbps public listening test results.
http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/ (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/)
(http://i42.tinypic.com/6iy7bk.png)
close up of the interesting section:
(http://i40.tinypic.com/24lk4gi.png)
unlike the previous post, each plotted point denotes one music track.
(http://i42.tinypic.com/2a6ml2d.png)

Online visualization tool: http://zak.s206.xrea.com/bitratetest/graphmaker4.htm (http://zak.s206.xrea.com/bitratetest/graphmaker4.htm)
Code: [Select]
Nero	CVBR	TVBR	FhG	CT	low_anchor
3.64 4.22 4.69 4.23 3.71 1.60
4.05 4.47 4.13 4.52 3.46 1.41
3.30 3.51 3.24 3.34 3.20 1.60
3.57 4.52 4.55 4.73 4.41 2.42
4.04 4.53 4.54 3.97 4.43 1.33
4.19 4.58 4.59 4.62 4.65 1.52
3.65 4.10 4.32 4.53 3.85 1.47
3.83 4.62 4.41 4.49 4.18 1.67
3.62 4.27 4.26 4.72 3.91 1.60
3.66 4.30 4.34 4.24 4.26 1.72
3.82 4.28 4.21 3.96 4.13 1.58
3.48 4.67 4.37 4.35 3.81 1.48
4.13 4.54 4.64 4.08 4.24 1.50
3.42 4.32 4.40 4.29 4.10 1.34
3.60 4.54 4.72 4.18 3.69 1.51
3.92 4.70 4.52 3.98 4.26 1.44
3.85 4.41 4.55 4.49 4.57 1.32
3.67 4.79 4.37 5.00 4.83 1.42
3.08 4.26 3.78 4.11 3.96 1.25
3.34 4.72 4.65 3.43 3.88 1.27
%samples 01 - Reunion Blues
%samples 02 - Castanets
%samples 03 - Berlin Drug
%samples 04 - Enola Gay
%samples 05 - Mahler
%samples 06 - Toms Diner
%samples 07 - I want to break free
%samples 08 - Skinny2a
%samples 09 - Fugue Premières notes
%samples 10 - Jerkin Back n Forth
%samples 11 - Blackwater
%samples 12 - Dogies
%samples 13 - Convulsion
%samples 14 - Trumpet
%samples 15 - A train
%samples 16 - Enchantment
%samples 17 - Experiencia
%samples 18 - Male speech
%samples 19 - Smashing Pumpkins - Earphoria
%samples 20 - on the roof with Quasimodo
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-10 17:37:34
...All You do is look at an average score and draw conclusions based on that...

I did in my argumentation here, but personally I am much more interested in the quality of the particular samples.
Look at Kamedo2's graphs for Enola Gay (FhG shines here) and Mahler (FhG has weaknesses here).
In another test, with a sample selection similar to, but not identical with, the samples used here, a small variation in samples can cause a relevant change in the test outcome.
As for the listeners it's similar, especially as far as the percentage of very experienced listeners is concerned (the less experienced listeners smooth differences out, as they often judge sample issues as imperceptible). What we can also learn from Kamedo2's graphs above is that the experienced users are differently sensitive towards the various artifacts. If you have a look at that, it's clear that this is a most relevant factor for variation in test results.
IMO the (scientifically correct) statistics over all the samples (averages and confidence intervals) give a feeling of safe judgement about encoder quality which is misleading. Looking at the outcome of all the listeners for the various samples gives an impression of this.

What I'm trying to say is: these listening tests are meaningful, but we shouldn't take them as gospel (as long as we don't look at the detailed results of all the listeners for all the samples). In case two encoders turn out to have a very similar outcome, we should take them both as participants in a new test, especially as there seems to be serious interest in both of them.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-10 17:55:50
In case two encoders turn out to have a very similar outcome, we should take them both as participants in a new test, especially as there seems to be serious interest in both of them.

Isn't that a contradiction? If two codecs were found to be so close, a re-test won't change anything, because each time there will be people who want to re-test, arguing exactly the same "small difference".

As far as I can see You have some expertise in listening tests, so You can define goals and organize a new test. A different one, that will resolve your doubts.

I'm considering taking the preparation to another place; it won't be here on HA. So it's all yours.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-10 18:09:08
There has been progress in codec development since the last test, hasn't there?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-10 20:38:00
You can download two versions of Winamp and run ABX on at least 2-3 samples encoded by the Fraunhofer AAC encoder.

Now that will really help to organize a test.

I'm not mad at Fraunhofer; it's an excellent encoder in my opinion. And I really mean it.
I'm desperate because nobody wants to corroborate whether there were quality changes at 96 kbps, what to expect, etc.


P.S. hey, why don't we "re-rrrun" to see if some of the MP3 encoders from  here  http://listening-tests.hydrogenaudio.org/s...8-1/results.htm (http://listening-tests.hydrogenaudio.org/sebastian/mp3-128-1/results.htm)  could flip out?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: C.R.Helmrich on 2013-12-10 22:44:37
Well, sorry, but:
Were there improvements (not bugfixes) that improve the audible quality of your AAC encoder at 96 kbps in the last 2 years?

Quote from: eahm
Chris, where was the tuning performed? In which bit-rate range, if I can put it that way?

Counter-question: was the quality of Apple's AAC encoder improved over the last two years, and on which samples? Do you understand why I'm asking this?

Answers to the original question were given here a few months back (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=101580&view=findpost&p=840417) and here a few weeks back (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103483&view=findpost&p=850697). These posts say it all. The remaining doubts were cleared up by Igor's further comment (the one with the Quasimodo sample). I already re-tuned the encoder while the 2011 HA test was still running.

Why should I tell you which samples were improved by my tunings? Judging from my (unfortunately unpleasant) experience with the preparation of the 2011 test (where I mentioned samples on which the Fraunhofer encoder does well), I fear this would have an influence on whether these samples would be considered for inclusion in the test or not. After all, we apparently don't know on which samples Apple's encoder improved, knowledge which is necessary for fairness and which brings me back to the above question. Edit: Actually, I don't think such questions should be asked at all in a discussion of selection of codecs.

Anyway, the 64-kbps and 96-kbps SoundExpert tests give you a hint as to which samples the Winamp encoder handles quite well and which ones the Opus encoder doesn't.

By the way, the overall ranking of the 64-kbps and 96-kbps SoundExpert tests is nearly identical, which indicates that it can't be that wrong. Of course their sample selection is debatable and radically different from the HA tests', but concluding that "the SE test must be wrong" is a bit unfair IMHO. For the record, I got relatively similar results in internal MUSHRA tests with the same samples (Opus scored a bit better due to different bit-rate calibration).

Honestly, I really don't know what to make of this discussion, and I seriously considered leaving it after greynol - a moderator - was addressed with "grow up" and something like "Mr. know-it-all" yesterday (Edit: apparently deleted now, but the deletion doesn't undo it). I only give this last reply because Igor and eahm directly addressed me with a question.

Now my personal wish list, in case anybody cares: I have absolutely no interest in seeing a comparison between two AAC encoders at 96 kbps, and I certainly don't care which AAC encoder should be used (since, like I replied to Garf's statement, I'm sure of what the result would be). I'd rather like to see Opus 1.0.? compared against 1.1 as backup for the claim (http://www.opus-codec.org) of "significantly improved encoding quality, especially for variable-bitrate (VBR)". I think HA is the logical place for such a test.

Chris
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-10 22:56:30
...P.S. hey, why don't we "re-rrrun" to see if some of the MP3 encoders from here...

Nobody has ever asked for a re-run of a listening test.
You want to organize a new listening test and you asked for codec proposals. Sure, @96 kbps AAC plays a major role, and IMO the last AAC listening test does not imply that only Apple AAC is worth testing. That's what my contribution (not only mine) here was about.

But in the end it's best just to collect wishes from HA users, and leave any personal background for or against certain codecs aside.
In case HA users want FhG AAC to participate, it should be done IMO. In case there's no interest in it, it should be left out.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-11 01:27:24
Why should I tell you which samples were improved by my tunings?

What's wrong with that?  The sample selection is automatic and human-independent anyway.

Judging from my (unfortunately unpleasant) experience with the preparation of the 2011 test (where I mentioned samples on which the Fraunhofer encoder does well)

If we accepted your samples, then Apple's developers would be yelling at us.
But anyway, You can punch me. I understand You. It's your encoder that came in second.

When we finish with the future public tests at 64-96 kbps, we'll most probably go to lower rates like 32-48 kbps, where the (HE-)AAC family will most probably beat Opus and Vorbis. Then the Xiph developers will start punching me.
It's kinda already fun for me.

Anyway, the 64-kbps and 96-kbps SoundExpert ...

Haaa ...?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-11 01:32:33
Kamedo2,

Thank You for posting the graphs. What should we look at?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-11 02:26:43
The type of listeners is already a bias as stated in previous posts in this thread, and one which neither of us can get around. If the selection of samples has an influence, that means a bad bias in their selection that invalidates the test (and it's the exact problem I have with your test!).

I can't prove it, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music, especially taking into account the usual practice of using killer samples in codec listening tests. The population of music is extremely huge and diverse. So, the selection of samples is also a bias, like the selection of listeners, in my opinion.

The practice of targeting the bitrate of VBR codecs using a big music library is not ideal either. The bitrates depend on the proportions of the various music genres the library consists of. Different codecs react with different resulting bitrates upon changes of those proportions. As a result the choice of codec settings is also arbitrary to some extent. Moreover, this choice turned out to be completely unrelated to the sound material actually used in the listening test. This was discussed a lot; other solutions (including the SE one) have drawbacks of their own, and there was a consensus that this approach is reasonable and valid, but it is not the only one and there are no indications that it is the best.

I'm pretty sure there is no listening test design that produces final results. Because of those assumptions, conventions and compromises, any listening test shows only a part of the whole picture. Any such test could be perfectly repeatable if it follows the same methodology and corresponding test design. There are simply valid variations of the same methodology which could affect the result. I think if you repeat that HA @96 listening test with different samples (I'm not sure about the representativeness of any such set of samples) and a different way of calculating target bitrates (having different pros and cons), the results will not be the same - some tied contenders could easily change places. But actually I don't recommend doing this; quite the contrary, I have the impudence to give you advice - follow your methodology, which is well established, valid, elaborated inside the HA community and thus accepted by its majority. But, please, stay away from claiming your results are the final word in the comparison of codecs. Such claims are ungrounded and unproductive. Exactly because of this hard-edged approach the initial discussion turned into a hysterical defense of the HA sacred cow - listening test results.

These results are not ideal but perfectly useful, and I am very interested in them because they help both verify SE results (indirectly though) and better understand the limits of the SE methodology. Conducting listening tests with a strict design was never a goal of SE. SE comes from the opposite side - first of all it offers a version of blind listening tests designed to be as simple as possible for ordinary listeners, and afterwards derives as much information as possible from the collected grades. So I'm perfectly aware of the shortcomings of the SE methodology, and yet I still think it produces helpful results - less accurate, but valid.

Quote
The way the stimuli are presented shouldn't affect the result. If it does, that's a flaw again. But you're not amplifying artifacts any more, right?
Stimuli at SE are presented without a non-hidden reference; this affects results near the edge of transparency. Amplification of artifacts was never used for codecs below 100 kbit/s - not a single time since the beginning in 2001; there is simply no need for it at low bitrates.

@IgorC
I think that your listening test agenda should not depend on external and unrelated factors such as SE, its results and possible advocates.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-11 02:30:22
Kamedo2,

Thank You for posting the graphs. What should we look at?

The error bars are bigger than the official ones. And I performed the ANOVA analysis over the 20 samples, and the result was far duller than the official one.
I noticed that .zip/Analysis/results_AAC_2011.txt pastes all 280 individual results in a flat format, and the analysis was made as if there were 280 independent samples.
I have to say it's an incorrect statistical procedure. So I retract my past post that says the likelihood of FhG beating Apple is very small.

There's a minor possibility that FhG wins over Apple. Still, this is a multiformat listening test, and I'd rather see the AAC-vs-AAC battle in a separate public test than in this one.

And the 20 samples are indeed a statistical bottleneck. The sample count is small, and the results are likely to improve if we double it.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kohlrabi on 2013-12-11 05:44:12
Honestly, I really don't know what to make of this discussion, and I seriously considered leaving it after greynol - a moderator - was addressed with "grow up" and something like "Mr. know-it-all" yesterday (Edit: apparently deleted now, but the deletion doesn't undo it). I only give this last reply because Igor and eahm directly addressed me with a question.
FWIW, I moved the offensive statements which weren't pertinent to the original topic into the recycle bin, so nothing was really deleted. I just tried to sanitize this thread and wanted to avoid people picking up on these offensive statements, which wouldn't further the discussion.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: nu774 on 2013-12-11 05:53:26
So far as I understand, how far we can "generalize" things depends on what is called external validity (http://en.wikipedia.org/wiki/External_validity)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: C.R.Helmrich on 2013-12-11 07:41:11
What's wrong with that?  The sample selection is automatic and human-independent anyway.

Yes, but I meant that the samples I thought of weren't even included in the pool from which the samples were randomly drawn.

Quote
If we accepted your samples, then Apple's developers would be yelling at us.

Exactly, Igor. Which is why I fear the same thing would happen in this test if I told you which samples I tuned - or, to prevent such yelling, you'd have to exclude the samples I mention. So I won't tell you. And no, Igor, I'm not punching you.

Quote
... lower rates like 32-48 kbps, where the (HE-)AAC family will most probably beat Opus and Vorbis. Then the Xiph developers will start punching me.

Why should they? Opus already won by some margin at 64 kbps, so why are you so sure that HE-AAC would win there? That's why I would like a listening test at 48 kbps: to show me which coder wins (or is tied with another)!

Quote
Haaa ...?

Quote from: Serge Smirnoff
I think that your listening test agenda should not depend on external and unrelated factors such as SE, its results and possible advocates.

True, true. So forget what I said about the SE test. We're at HA.

Chris
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-11 09:44:27
The error bars are bigger than the official ones. And I performed the ANOVA analysis over the 20 samples, and the result was far duller than the official one.
I noticed that .zip/Analysis/results_AAC_2011.txt pastes all 280 individual results in a flat format, and the analysis was made as if there were 280 independent samples.
I have to say it's an incorrect statistical procedure.


I agree here BTW. The past tests had an issue in that the results were merged per-sample before doing the analysis, but this loses the information on the variability of the listeners and makes the test lose all power (it's the same as if one person had taken the test). The fix was to keep all results, but this conflates the variability of the listeners and the samples. The bootstrap tool should be fixed to block over both samples and listeners, instead of over sample-listener pairs, to give correct results with our test format.
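
A minimal sketch of what blocking over both could look like, assuming a made-up dense listener-by-sample score matrix (the real data is sparser, since not every listener rated every sample):
Code: [Select]
import numpy as np

rng = np.random.default_rng(0)

# scores[listener, sample] for one codec; uniform noise as a stand-in.
scores = rng.uniform(3.5, 5.0, size=(14, 20))

def blocked_bootstrap_mean(scores, n_boot=10000):
    # Resample listeners AND samples with replacement, as blocks.
    n_listeners, n_samples = scores.shape
    means = np.empty(n_boot)
    for b in range(n_boot):
        li = rng.integers(0, n_listeners, n_listeners)
        si = rng.integers(0, n_samples, n_samples)
        means[b] = scores[np.ix_(li, si)].mean()
    return means

means = blocked_bootstrap_mean(scores)
lo, hi = np.percentile(means, [2.5, 97.5])
print("95% CI for the codec mean: [%.2f, %.2f]" % (lo, hi))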
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-11 09:51:09
I can't prove it, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music, especially taking into account the usual practice of using killer samples in codec listening tests. The population of music is extremely huge and diverse. So, the selection of samples is also a bias, like the selection of listeners, in my opinion.


You're claiming classical statistics is wrong?

That said, I agree with the concerns regarding *our* sample selection. We use problem samples, so it's clearly biased.

More practically, we don't have an entire library of music available from which we can make a truly random choice. Ideally, we would draw random numbers from the entire (for example) Spotify catalog and test those samples.

Maybe we can come close to that: we get a list of all songs from MusicBrainz (for example), someone makes a program which outputs a list of randomly picked songs + 30 s excerpts (MusicBrainz has duration info, so it's possible) and publicizes the list, and we start looking from the top to see if anyone actually has the CD so we can get the sample. A sketch of such a picker follows below.

This would still bias towards more popular music, but a) we can probably live with that, as it's arguably a wanted bias, and b) it's better than what we do now.
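
A minimal sketch of such a picker; the catalogue file name, its field names and the list length are assumptions, and the MusicBrainz data would first have to be dumped into this form:
Code: [Select]
import csv, random

random.seed(2014)  # fixed seed so the published list is reproducible

# Hypothetical input: one row per recording with "id", "title" and
# "duration" (seconds) columns, dumped from MusicBrainz beforehand.
with open("musicbrainz_recordings.csv", newline="") as f:
    catalogue = list(csv.DictReader(f))

EXCERPT = 30.0  # seconds
picks = []
while len(picks) < 40:  # publish more than needed, take what people own
    rec = random.choice(catalogue)
    duration = float(rec["duration"])
    if duration <= EXCERPT:
        continue  # too short to cut a 30 s excerpt from
    start = random.uniform(0.0, duration - EXCERPT)
    picks.append((rec["id"], rec["title"], start))

for rec_id, title, start in picks:
    print("%s  %s  excerpt %.1f-%.1f s" % (rec_id, title, start, start + EXCERPT))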
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-11 10:08:32
I think if you repeat that HA @96 listening test with different samples (I'm not sure about the representativeness of any such set of samples) and a different way of calculating target bitrates (having different pros and cons), the results will not be the same


Agreed on that wrt the samples. The way of calculating the target bitrate is a choice. I believe it's the correct one if we don't assume people will reconfigure their encoder for every specific song they encode. If you agree with that assumption, I'd like to see a concrete proposal for another methodology that would be valid, or an argument why ours isn't.

Quote
But, please, stay away from claiming your results are the final word in the comparison of codecs. Such claims are ungrounded and unproductive.


We understand that our tests have flaws which influence the result and introduce error. I think we've done a lot to eliminate them as much as possible.

The problem is people arguing: if you repeat a test you get a different result anyway. This is wrong thinking. It is only true if the test has flaws. That should be the goal of the discussion: to point out flaws and figure out how to eliminate as many of them as possible. If you can point out a flaw, you have an argument why a repeat test will give a different result and why the posted result isn't definite. If you just say you will get a different result, without a valid reason, you're misunderstanding statistics.

The main valid point I've seen raised here was sample selection. That's good. We can try to move to the next level there.

Quote
Stimuli at SE are presented without a non-hidden reference; this affects results near the edge of transparency.


Is this demonstrable, or is it your suspicion? I would worry that a non-hidden reference adds loads of noise to the result and makes it harder to draw conclusions, because of people rating fake differences. Of course this is less of a factor if you have very many listeners.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-11 11:43:13
Stimuli at SE are presented without a non-hidden reference; this affects results near the edge of transparency.

What happens if 50% of the people distinguish and prefer the non-reference? It happens: http://slashdot.org/story/09/03/11/153205/...s-of-mp3-format (http://slashdot.org/story/09/03/11/153205/young-people-prefer-sizzle-sounds-of-mp3-format)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-11 12:25:37
I can't prove it, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music ...

This assumption is false. Any developer or anyone involved in these tests can tell you that.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-11 12:50:20
I can't prove it, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music,

Even for an 'extremely huge and diverse' population of music whose scores fluctuate between 1.0 = Very Annoying and 5.0 = Imperceptible, if we randomly pick 100 samples from the population, we can reliably determine its average to within about 0.1, without ever testing the whole 'extremely huge and diverse' population.
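
A rough simulation of that claim; the 'population' below is an arbitrary clipped normal distribution, purely for illustration:
Code: [Select]
import numpy as np

rng = np.random.default_rng(0)

# A stand-in "population of music": one true score per track, clipped
# to the 1.0 (Very Annoying) .. 5.0 (Imperceptible) rating scale.
population = np.clip(rng.normal(4.2, 1.0, size=1_000_000), 1.0, 5.0)

# Estimate the population mean from 100 random picks, many times over.
estimates = [rng.choice(population, 100).mean() for _ in range(1000)]
print("true mean:", round(float(population.mean()), 3))
# The typical error comes out on the order of 0.1, as argued above.
print("typical error of a 100-sample estimate:",
      round(float(np.std(estimates)), 3))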
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-11 15:57:35
Exactly, Igor. Which is why I fear the same thing would happen in this test if I told you which samples I tuned - or, to prevent such yelling, you'd have to exclude the samples I mention...

Agree

Quote
why are you so sure that HE-AAC would win there?

Well, HE-AAC is very efficient at 32-48 kbps. While I'm not 100% sure what can happen in a public test, personally I prefer HE-AAC in this bitrate range.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-11 16:19:59
For the latest posts of Kamedo2, IgorC, Serge Smirnoff:

I think this is the very problem.
If we have, say, 20 samples, it is possible that they represent the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know, no matter how hard we try to do a good job with sample selection. It can always be that there are tracks out there, not represented in the test sample set, which show that a specific encoder (maybe the winner of the test) can behave poorly.

As soon as you accept that a listening test has an important but necessarily limited meaning, everything is fine. Of course the test should be conducted with best effort to do things right. But I've always hated rating encoders according to statistical analysis alone and thinking that, if it is done correctly (not always the case, as we have seen in this thread), we know with scientific precision that encoder A is better than B.

For encoder choice, the formal statistics of average and confidence interval often have no meaning. I'm thinking of the last mp3@128kbps test. In the light of the overall averages and confidence intervals, all the encoders were tied. But looking at the outcome for the individual samples, Lame 3.97, iTunes and - to a lesser degree - Fraunhofer showed noticeable weaknesses on some samples. So without information from outside the test it is not very reasonable to choose one of these encoders. From the listening test alone, only Lame 3.98.2 and Helix remain as practical candidates for encoder choice.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-11 16:30:25
The error bars are bigger than the official ones. And I performed the ANOVA analysis over the 20 samples, and the result was far duller than the official one.
I noticed that .zip/Analysis/results_AAC_2011.txt pastes all 280 individual results in a flat format, and the analysis was made as if there were 280 independent samples.
I have to say it's an incorrect statistical procedure.


I agree here BTW. The past tests had an issue in that the results were merged per-sample before doing the analysis, but this loses the information on the variability of the listeners and makes the test lose all power (it's the same as if one person had taken the test).

It's noticeably better than if one person had taken the test, and I'm not so pessimistic as to call it 'losing all power'. The error bar is about +/- 0.2 in size, which is enough to get a rough idea of the quality.
(http://i42.tinypic.com/6iy7bk.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-11 16:46:34
I think this is the very problem.
If we have, say, 20 samples, it is possible that they represent the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know, no matter how hard we try to do a good job with sample selection. It can always be that there are tracks out there, not represented in the test sample set, which show that a specific encoder (maybe the winner of the test) can behave poorly.

I believe you are too anxious. I tend to spend a lot of time listening to encoded music, rather than to the collection of WAVs on my HDD. The reason is to report defects to the developer(s) if anything goes wrong. I've already sent a dozen problematic samples to a developer of FFmpeg's native AAC encoder. You don't get any reports because nothing has gone wrong.
If you're still worrying, read this: http://scienceblogs.com/cognitivedaily/200...-dont-understa/ (http://scienceblogs.com/cognitivedaily/2007/03/29/most-researchers-dont-understa/)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: nu774 on 2013-12-11 16:47:14
Quote
If we would accept your samples then Apple developers would be yelling at us.

Exactly, Igor. Which is why I fear the same thing would happen in this test if I tell you which samples I tuned, or to prevent such yelling, you'd have to exclude the sample I mention. So I won't tell you. And no, Igor, I'm not punching you.

Well, accepting those samples would surely make the test dubious in terms of fairness, which is indeed a bad thing, but would codec developers really yell about it?
I guess samples where company A performs worse than others would be more useful to company A's developers than samples where company A performs quite well, and I can even imagine that a developer might be able to "steal" something from others when they are the same codec... but of course I'm not a codec developer and I could be completely wrong.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-11 17:20:57
I think this is the very problem.
If we have, say, 20 samples, it is possible that they represent the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know, no matter how hard we try to do a good job with sample selection. It can always be that there are tracks out there, not represented in the test sample set, which show that a specific encoder (maybe the winner of the test) can behave poorly.

Great. Please inform yourself how the samples were picked for the last HA public test, and then propose how You can improve that.

Make a study of these 20 samples by:
- type of content
- type of possible artifact
- music style, if it was a music sample
- ...

It wasn't just a casual choice. 

Thank You. That will help.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-11 17:27:03
It's noticeably better than one person would take the test, and I'm not that pessimistic to call it 'loosing all power'.


I'm not sure what you are talking about here, but I think you completely misunderstood what I pointed out. If you squash all results per sample *before doing the analysis*, you have *20* results, not *280* as your graph shows. This is exactly the same input as if one person had taken the test. All the information about variability that you get from multiple listeners is forever gone. You might get lucky in that there is now less variability than with an actual test with one person, but how can you even tell?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-11 17:47:03
I think this is the very problem.
If we have, say, 20 samples, it is possible that they represent the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know, no matter how hard we try to do a good job with sample selection.


Why not? A random sample drawn from the population without bias is fine and sufficient. This isn't a case of "failure no matter how hard you try". Why would it be? Statistical sampling isn't magic. It's well understood, just non-trivial to pull off. You're now the second person to make this claim even though it runs directly contrary to well-established mathematics. You *can* correctly infer population statistics from a random, non-biased sample. There's no point in claiming otherwise. If you want to show it's not possible, you should go collect your Fields Medal in the process.

Quote
It can always be that there are tracks out there, not represented in the test sample set, which show that a specific encoder (maybe the winner of the test) can behave poorly.
...
For encoder choice, the formal statistics of average and confidence interval often have no meaning. I'm thinking of the last mp3@128kbps test. In the light of the overall averages and confidence intervals, all the encoders were tied. But looking at the outcome for the individual samples, Lame 3.97, iTunes and - to a lesser degree - Fraunhofer showed noticeable weaknesses on some samples. So without information from outside the test it is not very reasonable to choose one of these encoders. From the listening test alone, only Lame 3.98.2 and Helix remain as practical candidates for encoder choice.


You're making the argument here that the best encoder isn't the one which gives the best quality on average, but the one which is least prone to producing a bad encoding. You can estimate that by looking at the indicated bounds and selecting the codec with the highest lower bound: it's the one that's least likely to give you bad outliers. I have no idea why you claim they have no meaning, as they indicate directly what you want.

The idea that the best encoder is the one determined by that reasoning, rather than the one that gives the highest quality on average, is entirely on you, BTW.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-11 17:57:17
It's noticeably better than if one person had taken the test, and I'm not so pessimistic as to call it 'losing all power'.


I'm not sure what you are talking about here, but I think you completely misunderstood what I pointed out. If you squash all results per sample *before doing the analysis*, you have *20* results, not *280* as your graph shows. This is exactly the same input as if one person had taken the test. All the information about variability that you get from multiple listeners is forever gone. You might get lucky in that there is now less variability than with an actual test with one person, but how can you even tell?

Yes, it's *20* results, but the average result is far more accurate than the result of one person, which comes from the fact that it was tested many times. Humans are whimsical, but less so if the test is conducted multiple times. Even less whimsical if another test is conducted by another person.

If really only one person had taken the test, the accuracy would be gone and the result would be dirty.
(http://i39.tinypic.com/2njeyx1.png)
After squashing all results per sample (14 per sample on average) *before doing the analysis*, the accuracy is indeed improved by the squashing.
(http://i42.tinypic.com/6iy7bk.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-11 18:22:04
Alright, nice. So the variability does already drop a lot due to that. What's the analysis you used for analyzing samples and listeners separately, i.e. the original graph you posted? Multi-way ANOVA? I'd be curious to see the (corrected for multiple comparisons) p-values then. I agree they're overstated in the original results. I have my reservations about ANOVA as well, due to the clipping at 5.0, but doing a bootstrap with dependent samples is out of my league, so I think it's the best we can do for now.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-11 18:25:14
If somebody is interested here is also an IRC channel irc://irc.freenode.net/hydrogenaudio

P.S. I will update the list with codecs later.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-11 18:38:02
It's simply a bootstrapped confidence interval estimation of the averaged, squashed data below.
Code: [Select]
Nero CVBR TVBR FhG CT low_anchor
3.64 4.22 4.69 4.23 3.71 1.60
4.05 4.47 4.13 4.52 3.46 1.41
3.30 3.51 3.24 3.34 3.20 1.60
3.57 4.52 4.55 4.73 4.41 2.42
4.04 4.53 4.54 3.97 4.43 1.33
4.19 4.58 4.59 4.62 4.65 1.52
3.65 4.10 4.32 4.53 3.85 1.47
3.83 4.62 4.41 4.49 4.18 1.67
3.62 4.27 4.26 4.72 3.91 1.60
3.66 4.30 4.34 4.24 4.26 1.72
3.82 4.28 4.21 3.96 4.13 1.58
3.48 4.67 4.37 4.35 3.81 1.48
4.13 4.54 4.64 4.08 4.24 1.50
3.42 4.32 4.40 4.29 4.10 1.34
3.60 4.54 4.72 4.18 3.69 1.51
3.92 4.70 4.52 3.98 4.26 1.44
3.85 4.41 4.55 4.49 4.57 1.32
3.67 4.79 4.37 5.00 4.83 1.42
3.08 4.26 3.78 4.11 3.96 1.25
3.34 4.72 4.65 3.43 3.88 1.27
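
In code, that estimation might look like this minimal sketch (using just the CVBR and FhG columns of the table above, with 10000 resamples and simple percentile intervals; the actual tool used may differ):
Code: [Select]
import numpy as np

rng = np.random.default_rng(0)

# Two columns of the squashed per-sample means quoted above.
cvbr = np.array([4.22, 4.47, 3.51, 4.52, 4.53, 4.58, 4.10, 4.62, 4.27, 4.30,
                 4.28, 4.67, 4.54, 4.32, 4.54, 4.70, 4.41, 4.79, 4.26, 4.72])
fhg = np.array([4.23, 4.52, 3.34, 4.73, 3.97, 4.62, 4.53, 4.49, 4.72, 4.24,
                3.96, 4.35, 4.08, 4.29, 4.18, 3.98, 4.49, 5.00, 4.11, 3.43])

def bootstrap_ci(x, n_boot=10000):
    # Resample the 20 per-sample means with replacement.
    means = [rng.choice(x, len(x)).mean() for _ in range(n_boot)]
    return np.percentile(means, [2.5, 97.5])

print("CVBR 95% CI:", bootstrap_ci(cvbr).round(2))
print("FhG  95% CI:", bootstrap_ci(fhg).round(2))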
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-11 18:58:09
Interesting.

I'd like to restate the (EDIT: sarcastic) comment I made earlier:
the graphical representation of the overall results of [the] test doesn't do justice to the test data

Not to be a pain, but I must question once again whether Apple actually "won", since some appear to be basing it on a p-value from an analysis that has been drawn into question.

EDIT: With the analysis by Kamedo2, I don't feel terribly inclined to believe a p-value over the graphs, where all the error bars indicate a statistical tie.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-11 20:06:25
Alright, nice. So the variability does already drop a lot due to that. What's the analysis you used for analyzing samples and listeners separately, i.e. the original graph you posted? Multi-way ANOVA? I'd be curious to see the (corrected for multiple comparisons) p-values then. I agree they're overstated in the original results. I have my reservations about ANOVA as well, due to the clipping at 5.0, but doing a bootstrap with dependent samples is out of my league, so I think it's the best we can do for now.

I tried the blocked bootstrapping confidence interval estimation, using the 280 raw results.
(http://i40.tinypic.com/op1w1h.jpg)

It's almost the same as the squashed version. You've said that "All the information about variability that you get from multiple listeners is forever gone", but I can say that the data is not lost by the squashing.
As for the p-values, the program would be much harder to write than the CI estimation, but they shouldn't be very different from the ANOVA of the squashed version.

Code: [Select]
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 20
Critical significance:  0.05
Significance of data: 3.91E-013 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total               99          18.63
Testers (blocks)    19           6.48
Codecs eval'd        4           6.87    1.72   24.74  3.91E-013
Error               76           5.28    0.07
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.166

Means:

CVBR     TVBR     FhG      CT       Nero
  4.42     4.36     4.26     4.08     3.69

---------------------------- p-value Matrix ---------------------------

         TVBR     FhG      CT       Nero
CVBR     0.523    0.068    0.000*   0.000*
TVBR              0.229    0.001*   0.000*
FhG                        0.028*   0.000*
CT                                  0.000*
-----------------------------------------------------------------------

CVBR is better than CT, Nero
TVBR is better than CT, Nero
FhG is better than CT, Nero
CT is better than Nero
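
As a cross-check, the same squashed table can be fed to an off-the-shelf tool; a minimal sketch using scipy's Friedman test (a rank-based analogue; note that the ff123 tool above, despite its name, ran a parametric blocked ANOVA on the ratings):
Code: [Select]
import numpy as np
from scipy.stats import friedmanchisquare

# Squashed per-sample means from the table above; columns are
# Nero, CVBR, TVBR, FhG, CT (low anchor omitted), rows are samples.
data = np.array([
    [3.64, 4.22, 4.69, 4.23, 3.71],
    [4.05, 4.47, 4.13, 4.52, 3.46],
    [3.30, 3.51, 3.24, 3.34, 3.20],
    [3.57, 4.52, 4.55, 4.73, 4.41],
    [4.04, 4.53, 4.54, 3.97, 4.43],
    [4.19, 4.58, 4.59, 4.62, 4.65],
    [3.65, 4.10, 4.32, 4.53, 3.85],
    [3.83, 4.62, 4.41, 4.49, 4.18],
    [3.62, 4.27, 4.26, 4.72, 3.91],
    [3.66, 4.30, 4.34, 4.24, 4.26],
    [3.82, 4.28, 4.21, 3.96, 4.13],
    [3.48, 4.67, 4.37, 4.35, 3.81],
    [4.13, 4.54, 4.64, 4.08, 4.24],
    [3.42, 4.32, 4.40, 4.29, 4.10],
    [3.60, 4.54, 4.72, 4.18, 3.69],
    [3.92, 4.70, 4.52, 3.98, 4.26],
    [3.85, 4.41, 4.55, 4.49, 4.57],
    [3.67, 4.79, 4.37, 5.00, 4.83],
    [3.08, 4.26, 3.78, 4.11, 3.96],
    [3.34, 4.72, 4.65, 3.43, 3.88],
])

# Each codec column is one treatment; each row (sample) is one block.
stat, p = friedmanchisquare(*data.T)
print("Friedman chi-square = %.2f, p = %.3g" % (stat, p))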
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-11 20:45:04
...Great. Please, inform yourself how the samples were picked for the last HA public test and then propose how You can improve that. ...

I guess you think I'm criticizing that test. I really don't.
What I was talking about are the intrinsic limitations in generalizing the test results (of any listening test) to the universe of music and listeners, especially if the test's outcome is measured by overall statistics - the same thing Greynol said. But I don't want to continue this discussion, as IMO everything has been said about it.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-11 21:25:00
I guess you think I'm criticizing that test. I really don't.

No, I don't think You're criticizing.

Please understand me correctly. All I'm asking is to stop "shooting into the air" and to start elaborating possible solutions, working on particular parts, as Kamedo2 now does by providing real numbers. He's making a real contribution.


"Look I have made some researchment and have found that we should include those and these samples because of that and this. We should include x number of samples with p, q and r charactersitics. Acording that paper...  " You know, make a real call.

So I ask You: have You figured out how the sample selection was done for the last test? That would be a good place to start.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-12 00:06:49
I can't prove it, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music,

Even for an 'extremely huge and diverse' population of music whose scores fluctuate between 1.0 = Very Annoying and 5.0 = Imperceptible, if we randomly pick 100 samples from the population, we can reliably determine its average to within about 0.1, without ever testing the whole 'extremely huge and diverse' population.

Correct me if I'm wrong.
(1) The variance of the overall means originates from two sources: the variance of listeners' grades and the variance across sound samples.
(2) In order to determine an appropriate number of sound samples, we should perform an analysis of the variance of the sound samples' means for each codec.
(3) Some estimation of the appropriateness can be derived by comparing the confidence intervals of the means of the samples' means.
(4) More precisely, the required number of samples can be determined by means of, for example, Cohen's tables, proceeding from the desired power of the test and the significance level.

Is your rough estimation obtained with (4)? If not, could you make rough calculations, as I'm not sure I can do this correctly.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-12 01:07:22
Quote
Stimuli at SE are presented without a non-hidden reference; this affects results near the edge of transparency.


Is this demonstrable, or is it your suspicion? I would worry that a non-hidden reference adds loads of noise to the result and makes it harder to draw conclusions, because of people rating fake differences. Of course this is less of a factor if you have very many listeners.


What happens if 50% of the people distinguish and prefer the non-reference? It happens: http://slashdot.org/story/09/03/11/153205/...s-of-mp3-format (http://slashdot.org/story/09/03/11/153205/young-people-prefer-sizzle-sounds-of-mp3-format)


Exactly this happened in the SE @96 test; the tables with submitted grades show an increased number of 6-grades (confused reference), which are discarded. IMO this should not affect the final scores, it just prolongs the testing period.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-12 01:42:51
I can't prove it, but I have an intuitive assumption that 10, 20 or even 100 samples can't reliably represent the whole population of music,

Even for an 'extremely huge and diverse' population of music whose scores fluctuate between 1.0 = Very Annoying and 5.0 = Imperceptible, if we randomly pick 100 samples from the population, we can reliably determine its average to within about 0.1, without ever testing the whole 'extremely huge and diverse' population.

Correct me if I'm wrong.
(1) The variance of the overall means originates from two sources: the variance of listeners' grades and the variance across sound samples.
(2) In order to determine an appropriate number of sound samples, we should perform an analysis of the variance of the sound samples' means for each codec.
(3) Some estimation of the appropriateness can be derived by comparing the confidence intervals of the means of the samples' means.
(4) More precisely, the required number of samples can be determined by means of, for example, Cohen's tables, proceeding from the desired power of the test and the significance level.

Is your rough estimation obtained with (4)? If not, could you make rough calculations, as I'm not sure I can do this correctly.

(1) True.
(2) We won't know the variance of the means before the test. Instead, imagine how much accuracy we need. 3.0 = Slightly Annoying, 4.0 = Perceptible but not annoying, 5.0 = Imperceptible, so I feel it's accurate enough if we determine the average score to within an error margin of 0.1. (Can we imagine the difference between 3.3 and 3.4?)
(3) You mean the post-test evaluation?
(4) Rather, we want the SEM to be small enough to meet the requirement.
The rough estimation is done this way. First, a score is between 1.0 and 5.0, so the standard deviation (SD) can't be more than 2.0. An SD of 2.0 is highly unlikely, because the scores would have to be either 1.0 or 5.0, each 50% of the time - and in that case, with 1.0 = Very Annoying, the developers would get tons of bug reports. Let's say SD = 1.0. The standard error of the mean is SEM = SD/sqrt(sample size). If we get 100 independent results, SEM = 1.0/sqrt(100) = 0.1, which is small enough.
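
In code form, that back-of-the-envelope calculation is simply (SD = 1.0 being the assumption stated above):
Code: [Select]
import math

sd = 1.0  # assumed standard deviation of individual results
for n in (20, 100, 280):
    print("%3d results -> SEM = %.3f" % (n, sd / math.sqrt(n)))
# 20 -> 0.224, 100 -> 0.100, 280 -> 0.060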
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-12 03:59:35
If we have, say, 20 samples, it is possible that they represent the universe of music for the encoders tested. But it is also possible that this is not the case. We just don't know, no matter how hard we try to do a good job with sample selection. It can always be that there are tracks out there, not represented in the test sample set, which show that a specific encoder (maybe the winner of the test) can behave poorly.


There were some quality verification tests of the MPEG formats (MP3, AAC) where the different kinds of signals were generally grouped into 3 big categories: transients (1), tonal (2), stereo (3).  Those are the most important groups.

Here (http://www.hydrogenaudio.org/forums/index.php?showtopic=77584&st=50&p=695576&#entry695576) is an example of a representative set of samples.  All three groups have a similar number of samples.

Here (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/all_samples.zip) is the set of samples from the public test (2011) and its classification (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdHNWNWY4T0FFWjF4REtneGN1ekFiZFE&usp=sharing)


It's also an option to enrich the set with some additional samples, like applause, or mixed material combining speech/singing with music (layered and/or in sequence).
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-12 11:12:44
Igor, I understood that very well, and I have no doubt the sample selection for that test was done with care, according to the very reasonable principles you laid out. So no problem at all with test preparation and conducting the test.

The problem is with statements about the test results, especially when they are based on strong statistical aggregation. We're simply not in the world of well-behaved probability distributions; what we're doing is more or less statistical sampling from a world of black swans (as far as sample selection goes, but the same applies to the listeners as well, especially their sensitivity towards the various samples). Classical statistical analysis is misleading here. There's more statistical trouble, like the clipping of values at 5.0, which was mentioned here but is ignored for the sake of getting simple test results. To me it's all over-simplification. And there's much more. For instance, the judgements of the listeners are certainly not invariant in space and time, especially when the deviation from the original is perceptible but close to nothing, that is, for judgements clearly better than 4.0. I can definitely say that for myself. I'm certainly not the perfect listener, but since this applies to me, I'm pretty sure there are listeners out there to whom it applies as well - maybe to a much smaller degree.

And there's nothing really bad about it: listening tests give important information about the strengths and weaknesses of encoders. Quality just can't be reduced to a simple overall result of one number, and things like confidence intervals are more than questionable here.

And to bring things back on topic: FhG AAC is a good candidate for your test, as is Apple AAC.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-12 12:05:11
The problem is with statements about the test results, especially when they are based on strong statistical aggregation. We're simply not in the world of well-behaved probability distributions; what we're doing is more or less statistical sampling from a world of black swans (as far as sample selection goes, but the same applies to the listeners as well, especially their sensitivity towards the various samples). Classical statistical analysis is misleading here.

By your argument, no medicine would be possible, nor public healthcare, investigation of industrial pollution, or even public transportation. Black swans may exist, but they must occur at a rate below 1/sample_number to remain undetected, and a rare problem must be extremely unpleasant to affect the overall user experience. If it were extremely unpleasant, why haven't the developers got any reports like that?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-12 13:21:52
I bootstrapped the last 2011 public listening test of AAC encoders @ 96kbps (280 donated results, 20 samples) to plan this upcoming test.
The past data may not be precisely applicable to another future test, but you may get a 'sense' of 'How much effort do we need to bring the error margin down?' or 'Which plan is likely to make better use of the precious donated time?'. Enjoy!
(http://i43.tinypic.com/2z720c6.png)
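For anyone who wants to repeat this kind of analysis, the core of such a bootstrap can be sketched in a few lines of Python (the grades below are toy numbers, not the 2011 data):
Code: [Select]
import random
import statistics

def bootstrap_ci(grades, n_boot=2000):
    # 95% bootstrap CI of a codec's overall mean: resample samples with
    # replacement, and resample the donated grades within each sample.
    means = []
    for _ in range(n_boot):
        picked = random.choices(grades, k=len(grades))
        per_sample = [statistics.mean(random.choices(g, k=len(g)))
                      for g in picked]
        means.append(statistics.mean(per_sample))
    means.sort()
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# toy data: three samples, three donated grades each
grades = [[4.2, 3.8, 4.5], [3.1, 3.6, 2.9], [4.8, 4.9, 4.7]]
print(bootstrap_ci(grades))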
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kohlrabi on 2013-12-12 13:33:56
Taking the last test as a baseline, even if one assumes that the total number of votes does not increase when we increase the number of samples, there is still a benefit (smaller error) in doing so. E.g. doubling the number of samples and halving the votes/sample still yields a smaller error. The case where votes/sample is held constant is even better. As long as we're on the steep part of that curve there is no harm in increasing the number of samples.
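This can be sanity-checked with a toy two-level variance model (a sketch; the variance numbers are assumptions for illustration, not values fitted to the real data):
Code: [Select]
# grade = codec mean + sample effect + listener noise, so
# var(overall mean) = var_sample/n_samples + var_listener/(n_samples*votes)
def overall_se(n_samples, votes, var_sample=0.25, var_listener=1.0):
    var = var_sample / n_samples + var_listener / (n_samples * votes)
    return var ** 0.5

print(overall_se(20, 14))  # ~0.127  baseline: 20 samples, 14 votes each
print(overall_se(40, 7))   # ~0.099  double samples, halve votes: smaller
print(overall_se(40, 14))  # ~0.090  double samples, same votes: smaller still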

But of course minimizing the error is only one part of the whole picture. As stated earlier, how to select the samples is a major point of debate.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-12 15:10:41
Thank you for your effort, Kamedo2. We'll need some extra time to analyse the statistics. It's all on the to-do list.


As of now, the list of candidates (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing) has been updated.
Most members are interested in testing VBR mode, so the main goal of the test will be a comparison in this mode. In other words, the question is "how do certain codecs perform (quality-wise) in VBR mode at ~96 or 80 kbps?".

The list of votes so far:
1. Apple AAC - 17
2. Opus - 17
3. Vorbis - 8
4. MP3@128 - 8

Possible:
Fraunhofer AAC - 7
MP3@96 - 7

Probably won't be tested:
MPC - 2
WMA Pro - 1
WMA Standard - 0


Bitrate (kbps):
96 - 13
80 - 8
48 - 1

December 18 is the deadline to submit codecs. Then we will move on to bitrate verification, sample selection, and especially the issues that Kamedo2 and halb27 have raised lately.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-12 15:38:10
I bootstrapped the last 2011 public listening test of Multiformat encoders @ 64kbps (531 donated results, 30 samples) as well.
The raw data is from here: http://www.hydrogenaudio.org/forums/index....showtopic=88033 (http://www.hydrogenaudio.org/forums/index.php?showtopic=88033)
Like I said, this data may not be precisely applicable to this new test, but maybe you can get the 'sense'.
(http://i40.tinypic.com/23uyhs8.png)
(http://i43.tinypic.com/xp7i4p.png)

Thank you IgorC for updating and maintaining the table.

For people who voted for 80kbps, I gently ask you to rethink.
80kbps is a bit too low for AAC-LC (but too high for HE-AAC), and as in the past 64kbps test, Opus is likely to win this test too.
http://listening-tests.hydrogenaudio.org/igorc/results.html (http://listening-tests.hydrogenaudio.org/igorc/results.html)
http://www.hydrogenaudio.org/forums/index....showtopic=97913 (http://www.hydrogenaudio.org/forums/index.php?showtopic=97913)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-12 16:54:38
Kamedo2, I don't know if you've noticed, but given your skills and experience, what you're doing is co-organizing. Great.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-12 17:04:13
IgorC, after Chris's reply my vote no longer goes to Fraunhofer/fhgaacenc but only to Fraunhofer/fdkaac. Why isn't fdk in the list? Did I miss something?

Updated:
AAC/Apple 96 VBR
Opus 1.1 96 VBR
Fraunhofer/fdkaac 96 VBR

Thanks.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-12 18:18:31
Eahm,

Please, be patient. I'm updating it from time to time.
It would be hard to include every codec at once; it would be a mess.
Once you mention it, it goes there.

I don't want to influence your choice, but it's worth mentioning that AFAIR there was a comment stating that the Winamp flavor of FhG has the best quality compared to the others. Anyway, there is still value in testing the open-source flavor too. It's up to you.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-12 18:19:08

Correct me if I'm wrong.
(1) The variance of overall means originates from two sources: the variance of listeners' grades and the variance of sound samples.
(2) In order to determine an appropriate number of sound samples, we should perform an analysis of variance of the means of sound samples for each codec.
(3) Some estimate of the appropriateness can be derived by comparing confidence intervals of the means of samples' means.
(4) More precisely, the required number of samples can be determined by means of, for example, Cohen's tables, proceeding from the desired power of the test and significance level.

Is your rough estimate obtained with (4)? If not, could you make the rough calculations, as I'm not sure I can do this correctly.

(1) True.
(2) We won't know the variance of the means before the test. Instead, imagine how much accuracy we need. 3.0=Slightly Annoying, 4.0=Perceptible but not annoying, 5.0=Imperceptible, so I feel it's accurate enough when we determine the average score to within an error margin of 0.1. (Can we imagine the difference between 3.3 and 3.4?)
(3) You mean the post-test evaluation?
(4) Rather, we want the SEM to be small enough to meet the requirement.
The rough estimate is done this way. First, scores lie between 1.0 and 5.0, so the standard deviation (SD) can't be more than 2.0. An SD of 2.0 is highly unlikely, because that would mean every score is either 1.0 or 5.0, each 50% of the time; and in that case, 1.0=Very Annoying, so the developers would be getting tons of bug reports. Let's say SD = 1.0. The standard error of the mean (SEM) = SD/sqrt(sample size). If we get 100 independent results, SEM = 1.0/sqrt(100) = 0.1, which is small enough.

Assuming SD = 1.0 and 100 results, we can go a bit further and calculate the confidence interval of a mean M for sound samples, which is [M - 2*SEM, M + 2*SEM] (http://en.wikipedia.org/wiki/Sample_size#Estimation_of_means). So the width of this 95% interval is 0.4 units (of score). Such an interval allows us to reliably discern means that differ by >= 0.3 units (allowing 25% overlap).
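A quick check of that arithmetic, under the same assumptions (SD = 1.0, 100 results):
Code: [Select]
import math

sd, n = 1.0, 100           # assumed SD and number of results (see above)
sem = sd / math.sqrt(n)    # 0.1
print(4 * sem)             # width of [M - 2*SEM, M + 2*SEM] = 0.4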

Using Cohen's tables (http://en.wikipedia.org/wiki/Sample_size#By_tables) for determining the number of samples gives an even higher minimum discernible distance between means: >= 0.46 (assumptions: SD = 1.0, results = 100, significance level = 0.05, power of test = 0.8, Cohen table for the case of a two-group t-test).

In order to determine a (representative) number of sound samples, we should at least choose the size of the confidence interval for the mean of the sound samples' means. Should it be approximately equal to the confidence intervals of the samples' means themselves? In other words, should the accuracy of estimating sample means (variance of listeners) equal the accuracy of estimating the mean of those sample means (variance of sound samples)? In general, how do we address the uncertainty about overall means caused by the variance of samples?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-12 19:10:49
Winamp flavor of FhG has the most optimal quality comparing to other. Anyway it still has value to test the open source flavor too. It's up to you.

Yes, I understand the Winamp flavor (let me call it fhgaacenc for simplicity) is "better" than the others, but I want to see a real test of how much effort they really put into the open-source one.

I thought it had been removed for some reason I missed, because if I remember correctly I wasn't the only one asking for it, and I didn't see it in the list.

Thanks for YOUR patience
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: darkbyte on 2013-12-12 19:28:48
Although a test @80kbps would be interesting, testing @96kbps is more useful. So my votes are:
- Opus 1.1 @96kbps VBR
- Apple AAC-LC @96kbps TVBR
- FhG AAC-LC @96kbps VBR
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-13 13:33:17
The problem is with statements about the test results, especially when they are based on strong statistical aggregation. We're simply not in the world of well-behaved probability distributions; what we're doing is more or less statistical sampling from a world of black swans (as far as sample selection goes, but the same applies to the listeners as well, especially their sensitivity towards the various samples). Classical statistical analysis is misleading here.


I have no idea where you get this from. Not even remotely. "We can discard physics because I say so!"

What makes you believe this analysis is concerned with exceptionally rare events?

Quote
There's more statistical stuff like the clipping of the values at 5.0 which was mentioned here but which is ignored for the sake of getting  simple test results.


This is a patently false claim, which just illustrates that you haven't looked at past discussions and have no actual idea what you're talking about. We use bootstrap analysis in addition to ANOVA exactly because of this.

Please, give actual arguments. Right now you're just hand-waving with wrong assertions, and I'm not waving back.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-13 14:19:27
It's almost the same as the squashed version. You've said that "All the information about variability that you get from multiple listeners is forever gone", but I can say that the data is not lost by the squashing.


This is a strange result, to me. The multiple listeners per sample give you information on how stable the sample score is, i.e. they tell you the uncertainty on the rating of the samples. So you are concluding this information does not help to establish the uncertainty of the final scores? We know that squashing the scores means the error on those ratings becomes lower, but why does knowing the *distribution of the error* not help you in the conclusion?

Imagine I gave you a list of 30 samples and each sample had been listened to by 1M listeners, i.e. the error on the score would be extremely small. I give you another list of 30 samples and each one has only been listened to by 1 listener. Your confidence on the (eventual) means of both examples is the same as long as the mean values are the same? This is weird.

In the calculation of the variability of the eventual mean, I would expect a weighting term related to the per-sample error. The variability of the eventual mean (i.e. the spread of all samples over the codec average) should not increase as much if we're adding a sample that has a mean that could be pretty far off from reality, compared to when we're adding a sample that we know we measured pretty accurately. I would also expect to weight the mean towards measurements with more certainty. (This is pure intuition speaking - maybe there's a mathematical result that firmly explains why this isn't needed or correct).

Maybe it doesn't end up mattering because the variability for the listeners per sample and the resulting variance is actually fairly equal over all samples?

I want to think a bit more about this and play with some simulations because it seems so strange.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-13 15:10:33
I bootstrapped the last 2011 public listening test of AAC encoders @ 96kbps (280 donated results, 20 samples) to plan this upcoming test.
The past data may not be precisely applicable to another future test, but you may get a 'sense' of 'How much effort do we need to bring the error margin down?' or 'Which plan is likely to make better use of the precious donated time?'. Enjoy!
(http://i43.tinypic.com/2z720c6.png)


If I interpret this correctly, instead of using 20 samples and a bunch (~14) of listeners, we could've used 65 samples with 2 listeners and gotten an equally accurate result (though one far less useful for the developers) with less than half the effort? That's pretty mind-blowing.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lozenge on 2013-12-13 15:44:52
Since my FLACs are generally stored on *nix systems and I transcode there, I'm more interested in seeing how the open-source encoders compare. So my vote goes to:

- Opus 1.1 96 VBR
- Fraunhofer/fdkaac 96 VBR
- Vorbis -q2
- Apple AAC 96 VBR


Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-13 15:54:24
If I interpret this correctly, instead of using 20 samples and a bunch (~14) of listeners, we could've used 65 samples with 2 listeners and gotten an equally accurate result (though one far less useful for the developers) with less than half the effort? That's pretty mind-blowing.

Yes, you're interpreting it right. If you consider it mind-blowing, consider the opposite direction: 28 donators/sample with 10 samples, 56 donators/sample with 5 samples... 56 donators would be slightly more accurate than 14, but if you randomly re-pick just 5 samples, you can easily imagine over-picking transient or tonal samples, which makes the final result unstable.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-13 15:59:16
I have tried Fraunhofer/fdkaac -m 3 on a bunch of albums and couldn't get ~96-100 kbps. The real bitrate is ~110 kbps. ~100 kbps is OK for a test; ~110 kbps isn't.

I will ask people to run Fraunhofer/fdkaac to see if it hits 96-100 kbps on different albums. If not, only CBR is an option.

Apple AAC VBR mode hits the target bitrate on a bunch of albums. LAME, Vorbis and Opus have fine-grained bitrate settings, so there is no issue there. FhG Winamp hits ~100 kbps, no issue.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kohlrabi on 2013-12-13 17:49:51
If I interpret this correctly, instead of using 20 samples and a bunch (~14) of listeners, we could've used 65 samples with 2 listeners and gotten an equally accurate result (though one far less useful for the developers) with less than half the effort? That's pretty mind-blowing.
The point is that you get an equally accurate result regarding the variance of the total sample (all samples together), but using 2 listeners actually loses significant information on a per-sample basis. But if the only question is "how can I minimize the error of the overall result", i.e. find the best encoder on average, you can easily disregard that information. So, semi-intuitively this result seems understandable to me, but still mind-blowing, indeed. That's statistics. :-)

That also means it's important to settle the question of what the aim of this test should be before settling the number of samples or how to select them. Improve encoders by identifying problem samples? Or find the currently best encoder for a large variety of songs? Or both?!
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: 2012 on 2013-12-13 17:58:00
I have tried Fraunhofer/fdkaac -m 3 on a bunch of albums and couldn't get ~96-100 kbps. The real bitrate is ~110 kbps. ~100 kbps is OK for a test; ~110 kbps isn't.

I will ask people to run Fraunhofer/fdkaac to see if it hits 96-100 kbps on different albums. If not, only CBR is an option.


Using '-vbr 2 -cutoff 14k' with ffmpeg should match better.

I don't know if using non-default settings like this is acceptable in a listening test. But maybe they are better than CBR.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-13 18:01:18
I want to think a bit more about this and play with some simulations because it seems so strange.

It's a valid procedure. I simulated it: there are 32 independent samples on the left. Random pairs of samples were squashed together into 16 samples, then squashed again into 8 samples.
Notice the confidence interval doesn't change.
(http://i43.tinypic.com/beh5w3.png)
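A rough re-creation of that squashing simulation, for anyone who wants to play with it (Python, with made-up normally distributed sample means rather than the plotted data):
Code: [Select]
import random
import statistics

random.seed(1)
scores = [random.gauss(4.0, 0.5) for _ in range(32)]  # 32 fake sample means

def ci_halfwidth(xs):
    # normal-approximation 95% CI halfwidth of the overall mean
    return 1.96 * statistics.stdev(xs) / len(xs) ** 0.5

while len(scores) >= 8:
    print(len(scores), "samples -> CI halfwidth:", round(ci_halfwidth(scores), 3))
    random.shuffle(scores)
    # squash random pairs into their average
    scores = [(a + b) / 2 for a, b in zip(scores[::2], scores[1::2])]
# the printed halfwidth stays roughly constant at 32, 16 and 8 samples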
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-13 18:28:39
fdkaac 0.5.1 --help prints this:

Quote
VBR mode is not officially supported, and works only on a certain combination of parameter settings, sample rate, and channel configuration


(fdkaac was built using fdkaac_autobuild script from https://sites.google.com/site/qaacpage/cabinet (https://sites.google.com/site/qaacpage/cabinet) )
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-13 19:47:54
Assuming SD = 1.0 and results = 100 we can go a bit further and calculate confidence interval of mean M for sound samples, which is [M - 2*SEM, M + 2*SEM] (http://en.wikipedia.org/wiki/Sample_size#Estimation_of_means). So width of this 95% interval is 0.4 unit (of score). Such interval allows to reliably discern means that differ >= 0.3 unit (allowing 25% overlap).

Listening test results are typically highly correlated. You can typically discern more, but not always, as a true difference of 0.3 doesn't guarantee an observed mean difference of 0.3. It could be less.

Using Cohen tables (http://en.wikipedia.org/wiki/Sample_size#By_tables) for determining number of samples gives even higher min.discernable distance between means: >= 0.46 (assumptions are as follows: SD = 1.0, results = 100, signif.level = 0.05, power of test = 0.8, Cohen table is for the case of two-group t-test)

I'm skeptical that the Cohen table is useful here.

In other words, should the accuracy of estimating sample means (variance of listeners) be equal to accuracy of estimating mean of those sample means (variance of sound samples)?

That's the goal of your statistical analysis, so it's up to you. My recommendation: rather than chasing "Codec A vs rival Codec B - which performs better, for sure?" and looking for a statistical analysis that guarantees a low p-value, ask "Is Codec A true to the original?" and think about the tolerable error margin of the answer - the margin you can tolerate. You can do the same for Codec B.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-13 22:40:24
If I interpret this correctly, instead of using 20 samples and a bunch (~14) of listeners, we could've used 65 samples with 2 listeners and gotten an equally accurate result (though one far less useful for the developers) with less than half the effort? That's pretty mind-blowing.
The point is that you get an equally accurate result regarding the variance of the total sample (all samples together), but using 2 listeners actually loses significant information on a per-sample basis. But if the only question is "how can I minimize the error of the overall result", i.e. find the best encoder on average, you can easily disregard that information. So, semi-intuitively this result seems understandable to me, but still mind-blowing, indeed. That's statistics. :-)

That's why I advocate doing statistics only per sample, and leaving the interpretation of overall quality to the user. I think this represents reality best, especially as the outcome for the various samples doesn't have the same meaning to every user. A person who is very sensitive towards transients, for instance, will give those samples a much stronger weight than a person who is pretty insensitive to them. I love the diagrams where the samples are shown on the x-axis and their average (and maybe more statistical) outcome on the y-axis, with the outcome of each encoder shown in a different color. That shows it all at a glance without any over-simplification. Even for readers who don't want to go into much detail, this diagram shows which encoders are attractive to use and which are not.
Most important: this way the important information on per-sample performance is kept, and not aggregated into just one average plus additional statistical information whose exact meaning is hardly understood by anybody, turning us all into believers.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-13 22:56:32
That's why I advocate to do statistics only for each sample, and leave the interpretation towards overall quality to the user...

All previous HA tests had an average scores with statistics.
This will have it too.

Period.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: kennedyb4 on 2013-12-13 23:26:48
For my vote I would like to finally, if possible, end the debate of CVBR vs TVBR in AAC, preferably with Apple's encoder. The last test showed only a tendency for CVBR to be rated higher, with no clear winner.

Maybe the bitrate should be pushed lower to do this.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-13 23:51:43
Concerning the number of sound samples, here are my findings.

There are several ways to find this number so that the test is powerful. All of them require knowing the standard deviation of scores in the population and depend heavily on it. Kamedo2 proposed SD = 1 with some basic reasoning. As the value of SD is important for our calculations, it should be grounded a bit better. Imagine an ideal listening test in which a codec processed the whole population of music; the resulting audio was cut into, say, 15s pieces and subjectively evaluated by some panel of listeners. What would the SD of the scores be? Knowing the distribution of those scores would help to find it. Can we guess the distribution? To get some idea, here is the distribution of scores for all samples and codecs in the HA @96 listening test:

(http://imageshack.us/a/img132/8760/p5uf.png)

It seems reasonable to me to assume that all scores in the population are distributed from 5 to 1 normally, with m=5 and sd=1:

(http://imageshack.us/a/img41/244/nmsj.png)

Then 99.99% of all scores fall into the [1, 5] interval. Scores close to 1 and 2 belong to killer samples. Using this purely speculative but reasonable model, we get SD = 0.6, which substantially reduces the number of required samples.

As I already mentioned, a rough estimate of the number of samples (n) can be made using a simple formula (http://en.wikipedia.org/wiki/Sample_size#Estimation_of_means) for estimating a population mean:

n = 16*SD^2/W^2, where W is the width of the 95% confidence interval.

For W = 0.4 and SD = 0.6, n = 36. In other words, using 36 sound samples randomly chosen from the population, we can estimate the population mean with the following accuracy: the width of the 95% confidence interval will be 0.4. This will allow us to reliably discern means which differ by more than 0.3 units of score (D = 0.3). This rough estimate can be considered the most optimistic for n.

A more realistic estimate can be obtained with Cohen's tables (http://www.lrdc.pitt.edu/schneider/P2465/Readings/Cohen,%201988%20%28Statistical%20Power,%20273-406%29.pdf) for multiple comparisons of means. Let's consider the simple case of comparing two means. We have SD = 0.6, p = 0.05, power of test = 0.8 and the same distance between means, D = 0.3. For these inputs the table value is n = 64 (effect size f = D/(2*SD) = 0.25). In other words, using 64 sound samples we can't reliably discern two means which differ by less than 0.3 units. For a comparison of 5 means, n = 84. And all this using the very optimistic SD = 0.6.

Taking into account that n = 40 is about the maximum realistic number of samples in any listening test, there is a choice between two cases:

(1) For the sake of the possibility of generalizing results to the whole population of music, we seriously lose the power of the test. In this case overall means and confidence intervals are calculated across sound samples. Even if the confidence intervals turn out to be small for the selected samples, the test will not be valid because of inappropriate sampling of sound samples from the population.

(2) Dropping that generalization, we increase power because we no longer account for the variability of sound samples. In this case overall means and confidence intervals are calculated over grades (squashed across all sound samples). Results of such a test are much more accurate, but biased towards the particular sample set.

All listening tests that I have seen were of design (2). And sound samples are chosen to be representative not of the population of music but of the population of artifacts produced by codecs (problem samples). Exactly because of this bias, developers of codecs are (knowingly or unknowingly) so serious about sample selection for listening tests. It matters. So this is a real bias and it is unavoidable.

My conclusions: the number of sound samples can be arbitrary; more samples help to reveal more features of the codecs. With a limited number of listeners, the number of samples should provide at least 10-15 grades per sample. Maintaining an equal number of samples from test to test is a plus. The practice of reusing the same samples from test to test is vulnerable to intentional tuning of codecs. It's a good idea to maintain a big bank of problem samples, divided by type of artifact, and to choose samples randomly from it for each next test. Along with the overall means of the codecs, results should include per-sample graphs, because they contain valuable info about codec behavior. So, nothing new, except the absence of the false generalization.
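For what it's worth, a small Monte Carlo sketch reproduces both numbers used above, under one reading of the model (scores piling up at 5 with a half-normal tail below, i.e. score = 5 - |N(0,1)|):
Code: [Select]
import random

random.seed(0)
# half-normal model: ~99.99% of scores land in [1, 5]
scores = [5.0 - abs(random.gauss(0.0, 1.0)) for _ in range(200_000)]
mean = sum(scores) / len(scores)
sd = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5
print(round(sd, 2))        # ~0.6, the SD assumed above

W = 0.4                    # desired width of the 95% CI
n = 16 * sd ** 2 / W ** 2  # the simple sample-size formula above
print(round(n))            # ~36 samples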
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-14 11:12:57
Glad to see I'm not alone.
I can live with giving an overall average once those statistical 'proofs' that encoder 'A' is better than 'B' for the universe of music disappear.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: [JAZ] on 2013-12-14 11:50:05
@halb27: A statistical analysis never says "A is always better than B". It says "There's a high enough probability that A is better than B that picking A is expected to be the correct choice most of the time".

In other words, there's as much simplification in saying only "A is better than B" as in saying "for the universe of music".
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kohlrabi on 2013-12-14 12:18:44
Glad to see I'm not alone.
I can live with giving an overall average once those statistical 'proofs' that encoder 'A' is better than 'B' for the universe of music disappear.
(http://i.somethingawful.com/forumsystem/emoticons/emot-ughh.gif)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-14 12:56:18

@halb27: A statistical analysis never says "A is always better than B". It says "There's a high enough probability that A is better than B that picking A is expected to be the correct choice most of the time". ...

Sure. I didn't care to be that precise, because my point is about the claim 'for the universe of music'. I'd even agree with that if we were randomly choosing track snippets from the universe of music. But we all know that the results of such a listening test would be very dull. So we use a selection of problem samples, at least for a significant percentage of all samples. That's what I called a world of more or less 'black swans', where it is not possible to generalize judgements of the individual samples to the universe of music.

Sure, the more technically detail-oriented people will prefer Serge's arguments.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-14 13:41:53
Sure. I didn't care to be that precise, because my point is about the claim 'for the universe of music'. I'd even agree with that if we were randomly choosing track snippets from the universe of music. But we all know that the results of such a listening test would be very dull. So we use a selection of problem samples, at least for a significant percentage of all samples.


I believe the reason for the heavier representation of problem samples in a listening test is that they can spoil the whole user experience.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-15 03:19:17
Although a test @80kbps would be interesting, testing @96kbps is more useful. So my votes are:
- Opus 1.1 @96kbps VBR
- Apple AAC-LC @96kbps TVBR
- FhG AAC-LC @96kbps VBR

FhG Winamp or libfdk?

Since my FLACs are generally stored on *nix systems and I transcode there, I'm more interested in seeing how the Open Source encoders compare, so: my vote goes to:

- Opus 1.1 96 VBR
- Fraunhofer/fdkaac 96 VBR
- Vorbis -q2
- Apple AAC 96 VBR

Vorbis aoTuv or reference?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-15 08:09:51
IgorC, I vote for FAAC -b 96 as a low anchor (please complete my vote accordingly). A lower bitrate would make the differences audible even to the deaf, but we don't need that in our test.

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-15 08:49:20
If my arguments above are convincing, then it is reasonable to preserve the previous design of the test: 20 samples / 5 codecs + low anchor, and to choose another 20 samples with the same pattern of artifact representation (or correct the pattern if necessary). This will help preserve approximately equal reliability of results from test to test. Also, it seems this design is at the edge of participation capability. So the contenders could be, for example, as follows:
1. Opus
2. Apple AAC TVBR or CVBR
3. FhG commercial
4. FhG free
5. Vorbis or MP3

Nero@64 from previous @64 test as low anchor?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-15 09:12:59
Nero@64 from previous @64 test as low anchor?

FAAC@96 is better. Fair, not misleading and the explanation is easier.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-15 09:39:28
FAAC@96 is better. Fair, not misleading and the explanation is easier.
OK.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-15 11:39:19
Out of curiosity I listened to some AAC@96-encoded tracks (Apple AAC, Winamp FhG), and I'm really impressed. That's why AAC encoders are most prominent on my wish list.

For this reason I'd like to see a ~96 kbps listening test (the target bitrate for a test set of regular music may deviate a bit from 96 kbps, because otherwise the deviation from 96 kbps would be too big for some VBR settings) with

Opus
Apple AAC
Winamp FhG
FhG free version
mp3@128

participating.
I'm about to do some investigation into the exact details of how to use these encoders.

Other than that I support Serge's suggestion for the number of samples and number of encoders.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gainless on 2013-12-15 13:37:49
My vote goes for:

Opus
Vorbis
Apple/FhG AAC
Helix Mp3, 128 kb/s
FAAC at 96 kb/s as low anchor

As for the decision between Apple and FhG AAC, it might be a good idea to do a separate listening test for these with similar samples after the "big" one is finished.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-15 15:34:00
I'm about to do some investigation into the exact details of how to use these encoders.

Great. I'm concerned only about FhG libfdk, as I couldn't get a 96 kbps setting for it on a bunch of heterogeneous albums.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-15 15:36:49
IgorC, I vote for FAAC -b 96 as a low anchor (please complete my vote accordingly). A lower bitrate would make the differences audible even to the deaf, but we don't need that in our test.

Totally agree
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-15 15:39:55
Great. I'm concerned only about FhG libfdk, as I couldn't get a 96 kbps setting for it on a bunch of heterogeneous albums.

Does FhG fdk have a VBR 80 or 96 kbps mode?

fdkaac.exe -m 1 (libfdk-aac 3.4.12) gave me 94.1 kbps for the whole Shpongle album
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: 2012 on 2013-12-15 15:44:56
I'm about to do some investigation into the exact details of how to use these encoders.

Great. I'm concerned only about FhG libfdk, as I couldn't get a 96 kbps setting for it on a bunch of heterogeneous albums.


Did you see my earlier reply ?
http://www.hydrogenaudio.org/forums/index....st&p=852904 (http://www.hydrogenaudio.org/forums/index.php?showtopic=103768&view=findpost&p=852904)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-15 15:59:46
I took several albums from my media library and encoded them with fhgaacenc / fdkaac / qaac. The results are:

1) FhG AAC (quality 1 / 2 / 3 / 4 / 5 / 6):
37 / 68 / 105 / 138 / 206 / 256 kbps

2) FDK AAC (quality 1 / 2 / 3 / 4 / 5):
84 / 91 / 106 / 130 / 220 kbps

3) QAAC (--rate keep --tvbr N, N = 0 / 9 / 18 / 27 / 36 / 45 / 54 / 63 / 73 / 82 / 91 / 100 / 109 / 118 / 127):
46 / 53 / 61 / 70 / 76 / 96 / 111 / 126 / 144 / 161 / 196 / 231 / 264 / 295 / 333 kbps
(--rate keep is the default setting for qaac)

4) QAAC --cvbr 96 gives 101 kbps.

added:
5) Opus 1.1 --bitrate 96 / 104 => 99 / 107 kbps.
6) aotuv 6.03 -q 2 / -q 2.5 / -q 2.99 => 98.4 / 104 / 109 kbps
7) LAME 3.99.5 -V 7 / 6.5 / 6.449 / 6 / 5 / 4.999 => 101 / 106 / 109 / 114 / 129 / 136
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: nu774 on 2013-12-15 16:12:38
As for libFDK, VBR target bitrate seems to be declared in https://github.com/mstorsjo/fdk-aac/blob/ma.../src/aacenc.cpp (https://github.com/mstorsjo/fdk-aac/blob/master/libAACenc/src/aacenc.cpp) as the following:
Code: [Select]
static const CONFIG_TAB_ENTRY_VBR configTabVBR[] = {
  {AACENC_BR_MODE_CBR,   {     0,     0}} ,
  {AACENC_BR_MODE_VBR_1, { 32000, 20000}} ,
  {AACENC_BR_MODE_VBR_2, { 40000, 32000}} ,
  {AACENC_BR_MODE_VBR_3, { 56000, 48000}} ,
  {AACENC_BR_MODE_VBR_4, { 72000, 64000}} ,
  {AACENC_BR_MODE_VBR_5, {112000, 96000}}
};

The first column is the bitrate mode (you can set it with the -m switch in the fdkaac frontend); the second column is for mono and the third for stereo, where you have to multiply by 2 to get the actual target bitrate. So -m 3 should target 96kbps (48000 x 2).
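So, in other words (a tiny illustrative lookup, just restating the stereo column of that table, not part of the fdkaac API):
Code: [Select]
# per-channel stereo targets from configTabVBR, times 2 channels
stereo_per_channel = {1: 20000, 2: 32000, 3: 48000, 4: 64000, 5: 96000}

for mode, per_ch in sorted(stereo_per_channel.items()):
    print(f"-m {mode}: target {2 * per_ch // 1000} kbps stereo")
# -m 3: target 96 kbps stereo, as noted above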
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-15 16:30:27
My result for the AAC encoders (similar to what was written before), achieved with my standard test set of various pop music:

Apple --tvbr 45 (or similar):    93 kbps
Apple --tvbr 54 (or similar):  108 kbps
Apple --cvbr 96 (or similar):  100 kbps
Winamp FhG VBR 3:              102 kbps
fdkaac --bitrate-mode 3:        103 kbps

Because quality levels can be chosen only in steps here, the most adequate settings for the test are IMO

Apple --cvbr 96 (or similar):  100 kbps
Winamp FhG VBR 3:              102 kbps
fdkaac --bitrate-mode 3:        103 kbps

@2012: I don't like seeing the encoders tested with lowpasses <16 kHz. It could alter quality just because of this (not necessarily towards the bad side).
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-15 16:37:13
As for libFDK, VBR target bitrate seems to be declared in https://github.com/mstorsjo/fdk-aac/blob/ma.../src/aacenc.cpp (https://github.com/mstorsjo/fdk-aac/blob/master/libAACenc/src/aacenc.cpp) as the following:
Code: [Select]
static const CONFIG_TAB_ENTRY_VBR configTabVBR[] = {
  {AACENC_BR_MODE_CBR,   {     0,     0}} ,
  {AACENC_BR_MODE_VBR_1, { 32000, 20000}} ,
  {AACENC_BR_MODE_VBR_2, { 40000, 32000}} ,
  {AACENC_BR_MODE_VBR_3, { 56000, 48000}} ,
  {AACENC_BR_MODE_VBR_4, { 72000, 64000}} ,
  {AACENC_BR_MODE_VBR_5, {112000, 96000}}
};

The first column is the bitrate mode (you can set it with the -m switch in the fdkaac frontend); the second column is for mono and the third for stereo, where you have to multiply by 2 to get the actual target bitrate. So -m 3 should target 96kbps (48000 x 2).

I also could not get the declared bitrates. -m1 gives ~94kbps. Is something wrong with the encoder?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-15 16:39:26
Great. I'm concerned only about FhG libfdk, as I couldn't get a 96 kbps setting for it on a bunch of heterogeneous albums.

Does FhG fdk have a VBR 80 or 96 kbps mode?

fdkaac.exe -m 1 (libfdk-aac 3.4.12) gave me 94.1 kbps for the whole Shpongle album


It's just one single album. Have you tried running it on something else? A few other albums?

I'm about to do some investigation into the exact details of how to use these encoders.

Great. I'm concerned only about FhG libfdk, as I couldn't get a 96 kbps setting for it on a bunch of heterogeneous albums.


Did you see my earlier reply ?
http://www.hydrogenaudio.org/forums/index....st&p=852904 (http://www.hydrogenaudio.org/forums/index.php?showtopic=103768&view=findpost&p=852904)

Yes, I've seen it. I was just expecting other people to answer it. And halb27 did.


Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-15 16:45:28
The default lowpass value for fdkaac -m 2 is ~13.1 kHz, for -m 3 it is ~14.3 kHz.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-15 16:54:31
OK, this changes things. Nonetheless I feel uncomfortable modifying the lowpass this way.
But wait: how can we do better with a lowpass at 14 kHz when the default lowpass is 14.3 kHz?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-15 17:12:10
My vote goes for:

Opus
Vorbis
Apple/FhG AAC
Helix Mp3, 128 kb/s
FAAC at 96 kb/s as low anchor


Guys, please specify which versions and settings. FhG Winamp or libfdk? VBR, CBR?


As for the decision between Apple and FhG AAC, it might be a good idea to do a separate listening test for these with similar samples after the "big" one is finished.

What should we expect from pre-testing Apple and FhG AAC? According to the last test they have similar performance. Both encoders have been updated, but as far as I can see those are miscellaneous updates.
So you can run your own blind test and bring the results here.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-15 17:14:42
I finished the Opus test on my test set of various pop music, and it matches my suggestion for the AAC settings perfectly:

Opus --bitrate 96: 101 kbps
Apple --cvbr 96 (or similar): 100 kbps
Winamp FhG VBR 3: 102 kbps
fdkaac --bitrate-mode 3: 103 kbps
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-15 17:38:27
Thank you, halb27, lvqcl and nu774, for the bitrate reports. Later we will make a table of all bitrate reports.

Let's see what bitrates other people are getting.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-15 17:51:46
Let's use one FhG libfdk encoder as a reference.

Where can we get the "official" binaries?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-15 18:26:55
As for MP3, I've just done some listening tests with Helix -V60, lame3995m -V5 and lame3100m -V5.
I'd like to see lame3100m -V5 in the test. 3.100alpha2 has very noticeable advantages over 3.99.5 on tonal problems, and my extension offers a chance of better quality on transient material, or wherever standard LAME might show weaknesses with short or mixed blocks.

So all in all I'd like to see

Opus --bitrate 96
Apple AAC --cvbr 96
Winamp FhG AAC VBR 3
fdkaac --bitrate-mode 3
lame3100m -V5


in the test.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-15 18:40:59
We're considering including 4 codecs, which is already a lot.
That's the average number of codecs proposed by people.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: 2012 on 2013-12-15 18:45:28
As for libFDK, VBR target bitrate seems to be declared in https://github.com/mstorsjo/fdk-aac/blob/ma.../src/aacenc.cpp (https://github.com/mstorsjo/fdk-aac/blob/master/libAACenc/src/aacenc.cpp) as the following:
Code: [Select]
static const CONFIG_TAB_ENTRY_VBR configTabVBR[] = {
  {AACENC_BR_MODE_CBR,   {     0,     0}} ,
  {AACENC_BR_MODE_VBR_1, { 32000, 20000}} ,
  {AACENC_BR_MODE_VBR_2, { 40000, 32000}} ,
  {AACENC_BR_MODE_VBR_3, { 56000, 48000}} ,
  {AACENC_BR_MODE_VBR_4, { 72000, 64000}} ,
  {AACENC_BR_MODE_VBR_5, {112000, 96000}}
};

The first column is the bitrate mode (you can set it with the -m switch in the fdkaac frontend); the second column is for mono and the third for stereo, where you have to multiply by 2 to get the actual target bitrate. So -m 3 should target 96kbps (48000 x 2).

I also could not get the declared bitrates. -m1 gives ~94kbps. Is something wrong with the encoder?


Enable AAC_HE profile with VBR 2
Enable AAC_HE_V2 profile with VBR 1

With ffmpeg, you can do this with -profile:a aac_he{,v2}. Or you can use the -t parameter in aac-enc.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: 2012 on 2013-12-15 18:55:10
OK, this changes things. Nonetheless I feel uncomfortable modifying the lowpass this way.
But wait: how can we do better with a lowpass at 14 kHz when the default lowpass is 14.3 kHz?


Reread what lvqcl wrote.

VBR 3 defaults to 14.3k lowpass.
VBR 2 defaults to 13.1k lowpass.

VBR 3 gives us ~110kbps. That's why I suggested VBR 2 with 14k lowpass which gives us ~96-100kbps in my limited tests.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-15 19:29:32
We're considering to include 4 codecs ...

OK, so:

Opus --bitrate 96
Apple AAC --cvbr 96
Winamp FhG AAC VBR 3
fdkaac --bitrate-mode 3
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: nu774 on 2013-12-16 01:16:12
Enable AAC_HE profile with VBR 2
Enable AAC_HE_V2 profile with VBR 1

With ffmpeg, you can do this with -profile:a aac_he{,v2}. Or you can use the -t parameter in aac-enc.

Yeah, it seems to be the intended usage, although nothing stops us from using a different combination  https://github.com/mstorsjo/fdk-aac/blob/ma...src/qc_main.cpp (https://github.com/mstorsjo/fdk-aac/blob/master/libAACenc/src/qc_main.cpp):
Code: [Select]
static const TAB_VBR_QUAL_FACTOR tableVbrQualFactor[] = {
  {QCDATA_BR_MODE_CBR,   FL2FXCONST_DBL(0.00f)},
  {QCDATA_BR_MODE_VBR_1, FL2FXCONST_DBL(0.160f)}, /* 32 kbps mono   AAC-LC + SBR + PS */
  {QCDATA_BR_MODE_VBR_2, FL2FXCONST_DBL(0.148f)}, /* 64 kbps stereo AAC-LC + SBR      */
  {QCDATA_BR_MODE_VBR_3, FL2FXCONST_DBL(0.135f)}, /* 80 - 96 kbps stereo AAC-LC       */
  {QCDATA_BR_MODE_VBR_4, FL2FXCONST_DBL(0.111f)}, /* 128 kbps stereo AAC-LC           */
  {QCDATA_BR_MODE_VBR_5, FL2FXCONST_DBL(0.070f)}, /* 192 kbps stereo AAC-LC           */
  {QCDATA_BR_MODE_SFR,   FL2FXCONST_DBL(0.00f)},
  {QCDATA_BR_MODE_FF,    FL2FXCONST_DBL(0.00f)}
};

IIRC, setting VBR mode 1 on LC was not possible in the past.
Anyway, I have to note that VBR modes 1-5 are undocumented in aacEncoder.pdf and aacenc_lib.h.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lsn_RU on 2013-12-16 06:43:26
If the bitrate were increased to ~110 kbps, I can't guess right now which encoder would be the winner. The test would be more intricate, but it's interesting.
And if Opus is tested on 48 kHz sources, it could climb the quality rankings; in the final analysis, Opus 1.0 sounds noisy at any bitrate, IMHO.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kohlrabi on 2013-12-16 08:30:23
1) Musepack at 96kbps will have lowpass ~14kHz. That's too low IMHO.

The default lowpass value for fdkaac -m 2 is ~13.1 kHz, for -m 3 it is ~14.3 kHz.
So it's acceptable for AAC, but not for Musepack? I realize that interest in Musepack is low, and that it likely will not perform that well. But at least the arguments should be consistent.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: CoRoNe on 2013-12-16 12:12:33
Although my music is predominantly lossless, I just listened for 2x 2 hours to 20 of my favorite songs, which I had encoded with Opus 1.1 at 80 and 96kbps. Even though it was just casual listening, there actually wasn't a moment where I could tell I was listening to lossy music! I was really surprised by the transparency of 80kbps. This leads me to believe that testing at 96kbps is overkill and will be very hard!

My votes therefore go to:
Opus 1.1 - 80kbps (--bitrate 80)
Aac (Apple) - 80kbps*
Vorbis (aoTuVb6.03) - 80kbps (-q1)
Mp3 (Lame 3.99.5) - 80kbps (-b 80)

*I never encode to AAC, so I know nothing about settings or which version is better, but as the title of this thread is "Multiformat Listening Test", it would only be logical to put forward one AAC contestant.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-16 12:46:32
Enable AAC_HE profile with VBR 2
Enable AAC_HE_V2 profile with VBR 1

With ffmpeg, you can do this with -profile:a aac_he{,v2}. Or you can use the -t parameter in aac-enc.


It looks like without explicitly setting the profile, all VBR modes produce AAC-LC streams. I encoded only two albums (Shpongle and Prodigy) with "-m 1" and got 94.1kbit/s and 95.1kbit/s (AAC-LC) respectively, with a cutoff at 13.1kHz.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-16 13:19:24
lvqcl, darkbyte, Gainless,
Which FhG encoder(s) do You prefer to see in test?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-16 15:16:13
So it's acceptable for AAC, but not for Musepack? I realize that interest in Muepack is low, and that it likely will not perform that well. But at least arguments should be consistent.

1) I didn't say that 14 kHz is good for AAC encoding. For me, it's also too low.
2) From the FFmpeg and AAC Encoding Guide (https://trac.ffmpeg.org/wiki/AACEncodingGuide): "But beware, it defaults to a low-pass filter of around 14kHz. If you want to preserve higher frequencies, use -cutoff 18000. Adjust the number to the upper frequency limit you prefer."
So this guide recommends tuning the encoder rather than using the default lowpass value.


Which FhG encoder(s) do You prefer to see in test?

FDK AAC has some drawbacks: an experimental VBR mode, a too-low lowpass... I prefer the FhG encoder as in Winamp.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lsn_RU on 2013-12-16 15:34:55
Just make "wide open" test with fullband FAAC (120-125 kbps) and results will astonish you. That's is 10 year old anchor. I hesitate what from it someone away off will come off.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: nu774 on 2013-12-16 15:51:09
I'd rather like to hear a comment on libFDK's VBR mode from Chris.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gainless on 2013-12-16 16:07:13
lvqcl, darkbyte, Gainless,
Which FhG encoder(s) do You prefer to see in test?

For FhG I prefer the Winamp version, though for the test the more popular and better-rated Apple encoder seems more sensible. As only 4 codecs shall be included, my vote is now:

Opus (1.1)
Vorbis (AoTuV)
Apple AAC
FAAC at 96 kb/s as low anchor

96 kb/s as overall bitrate.

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-16 16:56:00
Gainless,
It's 4 codecs + low anchor(s).
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gainless on 2013-12-16 18:22:39
Gainless,
It's 4 codecs + low anchor(s).

Nevermind then 

Opus (1.1)
Vorbis (AoTuV)
Apple AAC (CVBR)
+ Helix Mp3, 128 kb/s
FAAC at 96 kb/s (low anchor)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: kennedyb4 on 2013-12-16 21:32:05
Although my music is predominantly lossless, I just listened for 2x 2 hours to 20 of my favorite songs, which I had encoded with Opus 1.1 at 80 and 96kbps. Even though it was just casual listening, there actually wasn't a moment where I could tell I was listening to lossy music! I was really surprised by the transparency of 80kbps. This leads me to believe that testing at 96kbps is overkill and will be very hard!

My votes therefore go to:
Opus 1.1 - 80kbps (--bitrate 80)
Aac (Apple) - 80kbps*
Vorbis (aoTuVb6.03) - 80kbps (-q1)
Mp3 (Lame 3.99.5) - 80kbps (-b 80)

*I never encode to AAC, so I know nothing about settings or which version is better, but as the title of this thread is "Multiformat Listening Test", it would only be logical to put forward one AAC contestant.


I agree. Had a great deal of difficulty with the last 96kbps test using good equipment with old ears though.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-17 12:05:38
For my vote I would like to finally, if possible, end the debate of CVBR vs TVBR in AAC, preferably with Apple's encoder. The last test showed only a tendency for CVBR to be rated higher, with no clear winner.

Maybe the bitrate should be pushed lower to do this.

It will be hard to answer this question. According to this personal test (http://d.hatena.ne.jp/kamedo2/20121116/1353099244), CVBR is on par with TVBR.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: darkbyte on 2013-12-17 12:27:17
lvqcl, darkbyte, Gainless,
Which FhG encoder(s) do You prefer to see in test?



I prefer the FhG Winamp encoder. It's good to have a "higher quality than FAAC" open-source encoder (FDK), but at 96kbps VBR it uses a too low lowpass cutoff, which is very noticeable to me. The Winamp encoder is obviously better in this regard.

I wonder if we could include both Opus @80kbps and Opus @96kbps in the test as well? I think testing AAC-LC @80kbps doesn't make much sense, but it would be really interesting to see whether Opus @80kbps could beat the AAC-LC encoders @96kbps. Maybe Apple HE-AAC CVBR @80kbps or FhG VBR 2 for the same reason?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-17 13:47:59
Opus' outcome will be very interesting of course, but I can't see why expectations for Opus are that high. When I did some listening @96kbps a few days ago, Opus and AAC performed great, on regular music as well as problem samples, with one exception: harp40_1 was encoded pretty badly, but only by Opus.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-17 13:54:42
I think discussion of possible results generally leads to premature prejudice. That doesn't help. Let's not discuss the quality of the competitors. We will see later.

After the end of test  everybody will be welcome to give their opinions and observations.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-17 17:27:26
Bitrate distribution of the major encoders:
(http://i41.tinypic.com/t6bkp0.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-17 17:32:06
I wonder if we could include both Opus @80kbps and Opus @96kbps in the test as well? I think testing AAC-LC @80kbps doesn't make much sense, but it would be really interesting to see whether Opus @80kbps could beat the AAC-LC encoders @96kbps. Maybe Apple HE-AAC CVBR @80kbps or FhG VBR 2 for the same reason?

It's still not clear whether Opus@96k is any good compared to AAC@96.
And now it's evident that there is a need for a new AAC test. But it should be organized separately from this one, as it's too much work.



So far, the list (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing) of the most-voted codecs:
1. Apple AAC - 22 votes
2. Opus - 22
3. Vorbis - 11
4. MP3@128k - 9

(?)+ low anchor - FAAC 96 kbps CBR or ABR

Also I think it will be useful to include a low-middle anchor in addition to the low anchor. Thread. (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=98008&view=findpost&p=815560).
In short, the quality of the collected results should be higher, because listeners should now rank the low-middle anchor above the low anchor. If a particular listener ranks the low anchor higher, it's an indicator that something is wrong.

MP3@96 kbps has received a high number of votes: 8.
Two birds with one stone! MP3@96kbps serves as the low-middle anchor + we test it as one additional codec at 96 kbps. Fortunately, it's actually easy to test MP3@96 kbps. So it could be a good idea to add this codec even though we already have a high number of codecs.

What do You think?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-17 17:42:37
Bitrate distribution of the major encoders:
(http://i41.tinypic.com/t6bkp0.png)


What about LAME 3.99.5 -V 4.99 (lvqcl's  post (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853033)) or halb27's 3.100m?

So it's really MP3@~100 kbps (nominal 96) vs MP3@~135 kbps (nominal 128).

Edit: fixed link
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-17 18:09:51
So if mp3@128kbps is to participate, I'd welcome 3.100alpha2 because of its improved behavior on tonal problems (listen to Angels_Fall_First, for instance).
And because of the improved short-block behavior, I prefer my lame3100m variant over the original.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-17 18:13:48
(http://i43.tinypic.com/mm6c1j.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-17 18:16:47
So if mp3@128kbps is to participate, I'd welcome 3.100alpha2 because of its improved behavior on tonal problems (listen to Angels_Fall_First, for instance).
And because of the improved short-block behavior, I prefer my lame3100m variant over the original.

Yes, we know that. But what about average people pointing out that it was an alpha and not a final release?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-17 18:20:38
I understand that.
We should just collect votes, and I gave mine. To be precise: I vote for lame3100m -V5.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-17 20:02:27
MP3 encoders, including LAME 3.100a2 64-bit and the ultra-fast Helix:
(http://i41.tinypic.com/120klj6.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-17 20:48:25
We should just collect votes, and I gave mine. To be precise: I vote for lame3100m -V5.

Sure.
Also it will be interesting to hear what Robert has to say.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-17 23:18:18
I prefer the FhG Winamp encoder. It's good to have a "higher quality than FAAC" open-source encoder (FDK), but at 96kbps VBR it uses too low a lowpass cutoff, which is very noticeable for me. The Winamp encoder is obviously better in this regard.

I wonder if we could include Opus @80kbps and Opus @96kbps in the test as well?  I think testing AAC-LC @80kbps doesn't make much sense, but it would be really interesting to see whether Opus @80kbps could beat AAC-LC encoders @96kbps. Maybe Apple HE-AAC CVBR @80kbps or FhG VBR 2 for the same reason?

Updated.


Guys, check if your codec choice was submitted appropriately.
There is still one day (tomorrow) to submit changes.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-18 01:56:55
I bootstrapped the last public 2011 listening test of AAC encoders @ 96kbps (280 donated results, 20 samples) to plan this upcoming test.
The past data may not be precisely applicable to another future test, but you can get a sense of 'How much effort do we need to bring the error margin down?' or 'Which plan is likely to make better use of the precious donated time?'. Enjoy!
(http://i43.tinypic.com/2z720c6.png)
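For anyone curious how such a bootstrap works mechanically, here is a minimal sketch on made-up toy grades (the real computation was done on the 280 donated results, which aren't reproduced here):

Code: [Select]
import random

def bootstrap_mean_ci(ratings, n_boot=10000, alpha=0.05):
    # Resample the grades with replacement many times; the spread of the
    # resampled means approximates the error margin of the test.
    means = []
    for _ in range(n_boot):
        resample = [random.choice(ratings) for _ in ratings]
        means.append(sum(resample) / len(resample))
    means.sort()
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

# Toy data: 40 grades for one codec (hypothetical, not the 2011 results).
ratings = [random.gauss(4.2, 0.5) for _ in range(40)]
print(bootstrap_mean_ci(ratings))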


If I interpret this correctly, instead of using 20 samples and a bunch (~14) of listeners, we could've used 65 samples with 2 listeners and gotten an equally accurate result (though way less useful for the developers) with less than half the effort? That's pretty mind-blowing.

Imagine there is 1 listener and an infinite number of samples. Will it be as good as 10-15 listeners and 20 samples?

I think the graph implies an ideal correlation between the results of different listeners. Well, the results are not totally uncorrelated, but the correlation isn't 100% either.
An ideal correlation would imply that all listeners have exactly the same hardware (headphones ...), exactly the same hearing, exactly the same age, etc. That's why the results of one individual can't be representative enough, even on an infinite number of samples.

That's why I've mentioned inter-listener correlation a few times.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-18 03:03:37
Imagine there is 1 listener and infinite number of samples.

In case you missed something: the listener is re-picked from the 25 listeners (2011 AAC@96) for each sample.
So with 1 listener per sample and 1000 samples, the average workload for each listener is 40 samples.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-18 11:14:50
MP3@96 kbps has received a high number of votes: 8.
Two birds with one stone! MP3@96kbps serves as the low-middle anchor, and we also test it as one additional codec at 96 kbps.  Fortunately it's actually easy to test MP3@96 kbps, so it can be a good idea to add this codec even though we already have a high enough number of codecs.

What do You think?


I like this idea.  So I vote for Musepack @ 96kbps with its ~14kHz lowpass as low anchor and MP3@96kbps for the low-middle anchor.

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-18 14:15:03
I like this idea.  So I vote for Musepack @ 96kbps with its ~14kHz lowpass as low anchor and MP3@96kbps for the low-middle anchor.

Seems reasonable, as MPC isn't that good at 96 kbps according to this personal test (http://forum.hardware.fr/hfr/VideoSon/Traitement-Audio/mp3-aac-ogg-sujet_84950_1.htm)

Imagine there is 1 listener and infinite number of samples.

In case you have missed something. The listener is re-picked from the 25 listeners(2011 AAC@96) in each sample.
So if there is 1 listener and 1000 number of samples, The average workload for each listener is 40 samples.

Ah, OK. It's in the context of this particular data set as a whole.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-19 02:13:19
Most people have preferred to test at 96 kbps.

The list of codecs to test:
1.   MP3@128 kbps.
2.   Apple AAC
3.   Opus 1.1
4.   Vorbis  aoTuV
+middle-low anchor  FAAC 96 kbps *1
+low anchor (?) still to be selected; open for discussion. The low anchor should have lower quality than FAAC 96 kbps.

and a first approximation of the settings for ~96-100 kbps:
1. LAME 3.99.5 -V 5 or -V 4.99, halb27 LAME extension 3.100m, Helix?
2. QAAC, highest quality  (CVBR 96 or TVBR 45?)
3.  Opus  --bitrate 96
4.  Vorbis aoTuV -q 2 ... -q 2.5 (?)

The target bitrate is  ~96-100 kbps for Opus, Apple AAC, Vorbis. And MP3 ~130-135 kbps.


Agenda.
A choice of codecs.  December 8 – December 18. DONE.
Bitrate verification, a choice of settings – December 19 – December 23-25.
Sample selection  - December 25-26 – January 5.
Checking all conditions, preparations, dummy packages  - January  6 – January 10

*1  Probably MP3 at 96k is too good to be a middle-low anchor. MP3 at 96k is actually close to Nero AAC, which was still pretty hard to spot for some listeners in the previous test.

It's time to verify bitrates and to choose settings and an encoder for MP3.  Bitrate verification table. (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing)
Please submit the bitrates You get with the encoders. Example. (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853033)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-19 04:13:42
The target bitrate is  ~96-100 kbps for Opus, Apple AAC, Vorbis. And MP3 ~130-135 kbps.

I'd like to see the result for MP3 at 128k, not 135k. MP3 at 135k might be too hard for beginners.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: LithosZA on 2013-12-19 05:01:38
Quote
+low anchor (?) should be selected, discussable. A low anchor should have lower quality than FAAC 96 kbps.

Maybe VisualOn AAC @ 192Kbps?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-19 14:38:41
The target bitrate is  ~96-100 kbps for Opus, Apple AAC, Vorbis. And MP3 ~130-135 kbps.

I'd like to see the result for MP3 at 128k, not 135k. MP3 at 135k might be too hard for beginners.

Agreed. This has been discussed in the #hydrogenaudio channel.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-19 14:52:54
Also I've noticed that not all codecs report bitrate equally. There can be differences of ~2 kbps or so.

So when somebody reports bitrates, it will also be useful to give the filesize or the real bitrate (total size / total duration).
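For illustration, a minimal sketch of that calculation (file names and durations are hypothetical; in practice the durations would come from the decoded files):

Code: [Select]
import os

def real_bitrate_kbps(paths, durations_sec):
    # Real average bitrate: total size in bits over total duration in
    # seconds, with kilo = 1000.
    total_bytes = sum(os.path.getsize(p) for p in paths)
    return total_bytes * 8 / sum(durations_sec) / 1000

# Hypothetical usage with three encoded 30-second samples:
# print(real_bitrate_kbps(["s01.m4a", "s02.m4a", "s03.m4a"], [30.0] * 3))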
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-19 15:14:26
These are the rules from the previous public test.
There are some suggestions to improve/change them.

Code: [Select]
Participants who don't want to worry too much about the grading rules
can simply ignore them. Listeners should do their best to rank the samples and
be careful to identify the hidden references. Listeners should run ABX tests
whenever they are at all unsure.

1) If the low anchor is not graded, or if any hidden reference is graded
below 4.5-5 (see App.) the result is INVALID.

2) For each sample with a ranked reference or an ungraded low anchor the
listener will have a single chance to submit a replacement test run for
that sample. The replacement test must cover all codecs, not just the
codecs with the ranked reference. (This also covers cases where the
reference is ranked but still at or above 4.5)

3) If a listener submits 2/10 (3 for 20 samples submitted) or more INVALID
results then only ABX results will be accepted, or the listener will be excluded
completely in cases of apparently abusive behavior.

App. These rules aren't extremely strict in order to allow for simple human
error while still excluding careless participants.

A stricter procedure that excludes all ranked references risks a systemic
bias against any codec which is very good on a few samples and thus
subject to more reference confusion, by causing those samples to be excluded
and weighting the test towards other samples.
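As a rough illustration of rule 1 only (a sketch; the grade-dictionary layout is an assumption, not the real submission format):

Code: [Select]
def result_valid(grades):
    # `grades` maps codec label -> grade; None means ungraded.
    # INVALID if the low anchor is ungraded or the hidden reference
    # is graded below 4.5 (the lower bound of the 4.5-5 range above).
    if grades.get("low_anchor") is None:
        return False
    if grades.get("reference", 5.0) < 4.5:
        return False
    return True

print(result_valid({"reference": 4.8, "low_anchor": 1.5, "opus": 4.0}))  # True
print(result_valid({"reference": 4.0, "low_anchor": 1.5, "opus": 4.0}))  # False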

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-19 15:20:36
Average Bitrate of 122 songs, Speed in x realtime, command line.
Code: [Select]
91650 43.90 qaac --tvbr 45 -q 2 -o %o %i
99456 43.78 qaac --cvbr 96 -q 2 -o %o %i
95985 50.15 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 96k %o
97517 51.01 lame3.99.5 -V7 %i %o
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-19 16:49:01
Average Bitrate, Speed in x realtime, command line
AAC 96kbps:
91650 43.9 qaac --tvbr 45 -q 2 -o %o %i
99456 43.7 qaac --cvbr 96 -q 2 -o %o %i
95985 50.1 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 96k %o
92048 47.9 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 92k %o

MP3, Opus, Ogg Vorbis 96kbps:
97517 51.0 lame3.99.5 -V7 %i %o
95734 48.1 lame3.99.5 -V7.2 %i %o
92603 44.2 opus-1.1-rc-msvc2013\opusenc --bitrate 90 %i %o
98686 46.0 opus-1.1-rc-msvc2013\opusenc --bitrate 96 %i %o
86710 27.9 venc603(aoTuV) -q1.7 %i %o
95359 29.2 venc603(aoTuV) -q2 %i %o

MP3 128kbps:
124126 49.0 lame3.99.5 -V5 %i %o
130941 48.8 lame3.99.5 -V4.99 %i %o
128701 141.0 hmp3(Helix) %i %o -X2 -U2 -V60
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-19 18:11:41
Average Bitrate of 122 songs, Speed in x realtime (on i7 2.93GHz), command line
AAC 96kbps:
91650 43.9 qaac --tvbr 45 -q 2 -o %o %i
82546 42.9 qaac --cvbr 80 -q 2 -o %o %i
99456 43.7 qaac --cvbr 96 -q 2 -o %o %i
95985 50.1 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 96k %o
92048 47.9 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 92k %o

MP3, Opus, Ogg Vorbis 96kbps:
97517 51.0 lame3.99.5 -V7 %i %o
95734 48.1 lame3.99.5 -V7.2 %i %o
90787 49.4 lame3.99.5 -V7.5 %i %o
91595 44.7 opus-1.1-rc-msvc2013\opusenc --bitrate 89 %i %o
92603 44.2 opus-1.1-rc-msvc2013\opusenc --bitrate 90 %i %o
98686 46.0 opus-1.1-rc-msvc2013\opusenc --bitrate 96 %i %o
86710 27.9 venc603(aoTuV) -q1.7 %i %o
87603 27.3 venc603(aoTuV) -q1.8 %i %o
88547 27.1 venc603(aoTuV) -q1.9 %i %o
95359 29.2 venc603(aoTuV) -q2 %i %o

MP3 128kbps:
124126 49.0 lame3.99.5 -V5 %i %o
130941 48.8 lame3.99.5 -V4.99 %i %o
128701 141.0 hmp3(Helix) %i %o -X2 -U2 -V60
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-19 20:40:11
I think that tuning the q-parameters of encoders for a listening test using some arbitrary set of tracks is not the best way of doing this. As I already mentioned, using sets of tracks with different proportions of genres will result in different target bitrates and corresponding codec settings. It is just another source of variance for the test results. Different sets of tracks will favor different contenders (hopefully not easy to work out whom exactly, but who knows ...).

My suggestion is to tune the q-parameters using the test samples selected for the test. This set of samples is the only unambiguous set, and it relates directly to the test. For example, LAME V7 could be an anchor for this test. On the selected sound samples it will produce some target bitrate (most likely not 96; doesn't matter). Then all other codecs will be tuned to get close bitrates on the same samples. This can be done either by averaging per-sample bitrates or by concatenating all selected samples into one stream and calculating the target bitrate for it (much easier). So all encoders will use an equal number of bits for the stream but will distribute them between samples differently, according to their psymodels. This way the variance of codec settings due to arbitrarily chosen audio material will be completely eliminated. Bitrate verification of the resulting settings will still be necessary, but just for reference purposes – to better understand what final bitrates could be achieved with these settings on different genres. In previous HA listening tests this was done vice versa – settings were tuned using arbitrary sound material, and bitrates for the selected test samples were given for reference.
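A minimal sketch of that tuning loop, assuming a hypothetical helper encoded_kbps(q) that encodes the concatenated stream at quality q and returns the real bitrate, and assuming bitrate grows monotonically with q:

Code: [Select]
def tune_q(encoded_kbps, target_kbps, q_lo, q_hi, tol=0.5):
    # Bisect the q-parameter until the bitrate on the concatenated test
    # samples lands within `tol` kbps of the target set by the anchor.
    while q_hi - q_lo > 1e-4:
        q_mid = (q_lo + q_hi) / 2
        kbps = encoded_kbps(q_mid)
        if abs(kbps - target_kbps) <= tol:
            return q_mid
        if kbps < target_kbps:
            q_lo = q_mid
        else:
            q_hi = q_mid
    return (q_lo + q_hi) / 2

# Toy stand-in for a real encoder run (bitrate roughly linear in q):
print(tune_q(lambda q: 60 + 18 * q, target_kbps=97.0, q_lo=1.0, q_hi=3.0))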
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-19 21:15:34
I prefer using a test set of hopefully representative music for deciding upon which settings to use.
And from the various results given here it looks like Apple --cvbr 96, Opus --bitrate 96, aoTuv -q2 give these contenders equal chances (maybe with a tiny bit stronger aoTuv setting). These are 'natural' choices moreover.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-20 02:15:47
Average Bitrate of 122 songs, Speed in x realtime (on i7 2.93GHz), command line
AAC 96kbps:
91650 43.9 qaac --tvbr 45 -q 2 -o %o %i
106035 39.2 qaac --tvbr 54 -q 2 -o %o %i
82546 42.9 qaac --cvbr 80 -q 2 -o %o %i
99456 43.7 qaac --cvbr 96 -q 2 -o %o %i
92048 47.9 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 92k %o
95985 50.1 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 96k %o
97619 59.8 ffmpeg_r59211 -i %i -vn -c:a libvo_aacenc -b:a 96k %o

MP3, Opus, Ogg Vorbis 96kbps:
97517 51.0 lame3.99.5 -V7 %i %o
95734 48.1 lame3.99.5 -V7.2 %i %o
94860 45.9 lame3.99.5 -V7.3 %i %o
93983 45.9 lame3.99.5 -V7.4 %i %o
90787 49.4 lame3.99.5 -V7.5 %i %o
91595 44.7 opus-1.1-rc-msvc2013\opusenc --bitrate 89 %i %o
92603 44.2 opus-1.1-rc-msvc2013\opusenc --bitrate 90 %i %o
98686 46.0 opus-1.1-rc-msvc2013\opusenc --bitrate 96 %i %o
99691 41.3 opus-1.1-rc-msvc2013\opusenc --bitrate 97 %i %o
86710 27.9 venc603(aoTuV) -q1.7 %i %o
87603 27.3 venc603(aoTuV) -q1.8 %i %o
88547 27.1 venc603(aoTuV) -q1.9 %i %o
88957 28.3 venc603(aoTuV) -q1.95 %i %o
89326 26.3 venc603(aoTuV) -q1.99 %i %o
95359 29.2 venc603(aoTuV) -q2 %i %o
96169 26.1 venc603(aoTuV) -q2.1 %i %o
97673 25.9 venc603(aoTuV) -q2.2 %i %o

MP3 128kbps:
124126 49.0 lame3.99.5 -V5 %i %o
130941 48.8 lame3.99.5 -V4.99 %i %o
128701 141.0 hmp3(Helix) %i %o -X2 -U2 -V60
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-20 08:50:05
I prefer using a test set of hopefully representative music for deciding upon which settings to use.
And from the various results given here it looks like Apple --cvbr 96, Opus --bitrate 96, aoTuv -q2 give these contenders equal chances (maybe with a tiny bit stronger aoTuv setting). These are 'natural' choices moreover.
Agree. Using "natural" (integer) values for the q-settings is another simple approach which has its pros. It makes it more difficult to compare the actual efficiency of codecs, but provides a clear answer for end users as to which codec/settings are better, since they rarely use fractional values for q.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-20 13:08:14
Well, yes. The main reason why bitrate verification is done by different members is to avoid later discussion of how reasonable the choice of bitrate settings was.
Different codecs tend to inflate bitrate (somewhat) on different music genres or particular cases. During the preparation of the last test (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89208&view=findpost&p=761636) we got quite different bitrate reports per member, but when an average bitrate was calculated it was clear which settings to go with.


88547 27.1 venc603(aoTuV) -q1.9 %i %o
95359 29.2 venc603(aoTuV) -q2 %i %o

Then we should probably stay at -q2, maybe -q2.x.


We're open to debate which MP3 encoder to include.
Currently ~50% of people have preferred LAME (sheet "particular encoders" (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing)). If there aren't any further suggestions, the last stable LAME 3.99.5 will be used. If someone is interested in seeing a different MP3 encoder, then post your suggestion here.

It's good to have LAME, as it's the most popular and well-optimized encoder.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-20 15:54:55
I tried to make things as fair as possible, but qaac doesn't have a flexible enough option, and it's not possible to set aoTuV to around 92kbps.
Here are my plans:
Plan A: 92kbps
91650 : qaac --tvbr 45 -q 2 -o %o %i
91595 : opus-1.1-rc-msvc2013\opusenc --bitrate 89 %i %o
89417 : venc603(aoTuV) -q1.9999 %i %o
92048 : ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 92k %o

Plan B: 99kbps
99456 : qaac --cvbr 96 -q 2 -o %o %i
99691 : opus-1.1-rc-msvc2013\opusenc --bitrate 97 %i %o
99763 : venc603(aoTuV) -q2.4 %i %o
99884 : ffmpeg_r59211 -i %i -vn -c:a libfaac -b:a 100k %o

I prefer Plan B. For MP3 128kbps, lame 3.99.5 -V5 is a sweet choice, I believe.


Average Bitrate of 122 songs, Speed in x realtime (on i7 2.93GHz), command line
AAC 96kbps:
91650 43.9 qaac --tvbr 45 -q 2 -o %o %i
106035 39.2 qaac --tvbr 54 -q 2 -o %o %i
82546 42.9 qaac --cvbr 80 -q 2 -o %o %i
99456 43.7 qaac --cvbr 96 -q 2 -o %o %i
92048 47.9 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 92k %o
95985 50.1 ffmpeg_r57288 -i %i -vn -c:a libfaac -b:a 96k %o
98918 46.1 ffmpeg_r59211 -i %i -vn -c:a libfaac -b:a 99k %o
99884 46.7 ffmpeg_r59211 -i %i -vn -c:a libfaac -b:a 100k %o
91614 58.1 ffmpeg_r59211 -i %i -vn -c:a libvo_aacenc -b:a 90k %o
97619 59.8 ffmpeg_r59211 -i %i -vn -c:a libvo_aacenc -b:a 96k %o
100620 56.9 ffmpeg_r59211 -i %i -vn -c:a libvo_aacenc -b:a 99k %o

Opus, Ogg Vorbis 96kbps:
91595 44.7 opus-1.1-rc-msvc2013\opusenc --bitrate 89 %i %o
92603 44.2 opus-1.1-rc-msvc2013\opusenc --bitrate 90 %i %o
98686 46.0 opus-1.1-rc-msvc2013\opusenc --bitrate 96 %i %o
99691 41.3 opus-1.1-rc-msvc2013\opusenc --bitrate 97 %i %o
88547 27.1 venc603(aoTuV) -q1.9 %i %o
89326 26.3 venc603(aoTuV) -q1.99 %i %o
89417 27.2 venc603(aoTuV) -q1.9999 %i %o
95359 29.2 venc603(aoTuV) -q2 %i %o
96169 26.1 venc603(aoTuV) -q2.1 %i %o
97673 25.9 venc603(aoTuV) -q2.2 %i %o
99763 27.7 venc603(aoTuV) -q2.4 %i %o
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-20 16:29:20
As FAAC is the middle-low anchor, there is no need to tune it to the others IMHO. Let it be 96.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-20 16:37:54
Agree. It still sounds noticeably worse than the tested codecs at either 92k or 100k.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-21 16:12:36
Opus 1.1@96k has the same bitrate as Apple CVBR 96: approx. 100 kbps.

On the other hand, Apple TVBR 45 is ~94 kbps. And Vorbis doesn't have a smooth bitrate curve (q1.999 -> q2), as previous posts have shown.

I have run a few settings on the samples from the previous test:  http://listening-tests.hydrogenaudio.org/i...all_samples.zip (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/all_samples.zip)


Real bitrate. Filesize*8bits/duration(s):
Vorbis aoTuv 6.03 -q2 - 97.4 kbps
Apple TVBR 45 - 96.2 kbps
Opus 1.1  92k - 97.0 kbps

I'm good with any bitrate for MP3@128k as long as the bitrate is around ~128-135. It's a kind of high anchor here.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-21 17:05:01
I'm good with any bitrate for MP3@128k as long as the bitrate is around ~128-135. It's a kind of high anchor here.

I like the lame 3.99.5 -V5, even if the bitrate is slightly less than 128. It's the version used by many people. Or CBR 128kbps.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-21 17:34:36
I like the lame 3.99.5 -V5, even if the bitrate is slightly less than 128. It's the version used by many people. Or CBR 128kbps.

Also vote for -V5; it looks like higher bitrates will make the test too hard for listeners. And yes, -V5 is popular.

Vorbis aoTuv 6.03 -q2 - 97.4 kbps
Apple TVBR 45 - 96.2 kbps
Opus 1.1  92k - 97.0 kbps


Maybe we should settle on these settings now. After the new samples are selected, I would like to ask Kamedo2 to plot his beautiful distribution of bitrates (like this one (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853168)) for the selected samples and settings. Thus we'll confirm them finally.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kohlrabi on 2013-12-22 02:00:54
Maybe we should settle on these settings now. After the new samples are selected, I would like to ask Kamedo2 to plot his beautiful distribution of bitrates (like this one (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853168)) for the selected samples and settings. Thus we'll confirm them finally.
I don't think we should settle the settings at all until the final samples have been selected. For example, with the 20 samples from the last AAC listening test I get:
The problem is that you cannot fine-tune the Apple encoder as much as the Xiph encoders, so we should take whatever bitrate close to 96k Apple's encoder achieves and use that as the target bitrate for the other encoders.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-22 07:46:05
Bitrate distribution of my Pops and Jazz albums library, 122 songs.

The Apple CVBR has quite a narrow bitrate distribution, nearly constant.
(http://i39.tinypic.com/ak8r9x.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-22 09:19:05
  • Vorbis aoTuv 6.03 -q2 - 95.3 kbps
  • Apple TVBR 45 - 93.3 kbps
  • Opus 1.1  92k - 97.0 kbps
The problem is that you cannot fine-tune the Apple encoder as much as the Xiph encoders, so we should take whatever bitrate close to 96k Apple's encoder achieves and use that as the target bitrate for the other encoders.

Let's settle on these settings for the time being and start to select samples. We'll return to the settings afterwards. Bitrate distribution of newly selected samples will help us to tune other encoders to Apple one.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: robert on 2013-12-22 10:47:43
IMHO, it doesn't make sense to tune the settings for the selected samples. Determining comparable encoder settings and selecting samples should be done independently of each other. Take your whole music collection to find matching encoder settings.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-22 11:06:15
Take your whole music collection to find matching encoder settings.

Which collection exactly - classical, pop, electronic, folk, jazz ...? If all of them, in what proportion?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: robert on 2013-12-22 11:08:36
Everything in your reach.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gecko on 2013-12-22 12:48:25
1679 lossless tracks, biased towards metal and folk. Bitrates as reported by foobar2000.

Opus libopus 1.1, --bitrate 96
Average bitrate: 100 kbps
Standard deviation: 9.7 kbps

Apple CoreAudioToolbox 7.9.8.3, TVBR 45
Average bitrate: 93 kbps
Standard deviation: 12.6 kbps

Vorbis BS; Lancer(SSE3MT) [20061110] (based on aoTuV b5 [20061024]), -q 2
Average bitrate: 92 kbps
Standard deviation: 7.4 kbps

Opus seems to use above-average bitrates on folk music with highly tonal components (e.g. accordion).
Apple drops bitrates significantly on lo-fi material (e.g. old recordings with low bandwidth and/or mono only).
I hope I'm using the correct versions. Opus and Vorbis are the ones from foobar's "Free Encoder Pack 2013-12-06".
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 13:13:36
Robert,

What would be your suggestion which version of LAME to use in test? A stable 3.99.5, 3.100 or some of  a halb27's extensions?

Thank You.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 13:20:10
1679 lossless tracks, biased towards metal and folk. Bitrates as reported by foobar2000.

Opus libopus 1.1, --bitrate 96
Average bitrate: 100 kbps
Standard deviation: 9.7 kbps

Apple CoreAudioToolbox 7.9.8.3, TVBR 45
Average bitrate: 93 kbps
Standard deviation: 12.6 kbps

Vorbis BS; Lancer(SSE3MT) [20061110] (based on aoTuV b5 [20061024]), -q 2
Average bitrate: 92 kbps
Standard deviation: 7.4 kbps

Opus seems to use above-average bitrates on folk music with highly tonal components (e.g. accordion).
Apple drops bitrates significantly on lo-fi material (e.g. old recordings with low bandwidth and/or mono only).
I hope I'm using the correct versions. Opus and Vorbis are the ones from foobar's "Free Encoder Pack 2013-12-06".


Gecko,

as I've already posted here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853373)
Quote
Opus 1.1@96k has the same bitrate as Apple CVBR 96: approx. 100 kbps.

On the other hand, Apple TVBR 45 is ~94 kbps...

Opus has a few kbps higher bitrate than TVBR 45.
Can You try 92k (or some other values) for Opus?

Oh, and You are using an old version, aoTuV b5.  The current version is b6.03. http://www.geocities.jp/aoyoume/aotuv/ (http://www.geocities.jp/aoyoume/aotuv/)
Encoder http://www.geocities.jp/aoyoume/aotuv/binary/aoTuV_b6.03.zip (http://www.geocities.jp/aoyoume/aotuv/binary/aoTuV_b6.03.zip)

Today I will update the table of bitrate reports (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: robert on 2013-12-22 13:24:12
From those listed, I guess version 3.99.5 is the one most people (especially outside from HA) have an idea about, and can be used as a reference point for interpretation.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 13:25:46
Then it will be LAME 3.99.5 -V 5 unless there're some other suggestins.

Seems like most of people agree with it here.

I like the lame 3.99.5 -V5, even if the bitrate is slightly less than 128. It's the version used by many people. Or CBR 128kbps.


Also vote for -V5, looks like higher bitrates will make the test too hard for listeners. And yes, -V5 is popular.

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-22 14:25:22
1679 lossless tracks, biased towards metal and folk. Bitrates as reported by foobar2000.

Opus libopus 1.1, --bitrate 96
Average bitrate: 100 kbps
Standard deviation: 9.7 kbps

Apple CoreAudioToolbox 7.9.8.3, TVBR 45
Average bitrate: 93 kbps
Standard deviation: 12.6 kbps

I'd be happier if you have the result of Apple CVBR 96, Opus 92, 97, and Vorbis 1.99, 2, 2.4, 2.5.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 14:30:05
Guys,

Kamedo2 will collect the bitrate reports from users.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-22 14:40:21
Opus seems to use above-average bitrates on folk music with highly tonal components (e.g. accordion).
Apple drops bitrates significantly on lo-fi material (e.g. old recordings with low bandwidth and/or mono only).

I noticed that Opus shows "inverted" behavior - for complex and saturated music it uses fewer bits than for minimalistic music. For --bitrate 93:
[blockquote]Prodigy - 96.6 kbit/s
Shpongle - 101.4 kbit/s
Schnittke - 104.8 kbit/s
    [/blockquote]
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-22 14:43:51
(http://i39.tinypic.com/33xkodh.png)
Those who have an empty cell are encouraged to report more to fill the gaps.
And lvqcl and halb27, how many tracks did you use? (Sorry if I missed something)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-22 16:34:58
(http://i40.tinypic.com/11qsjkj.png)

Plan A:
TVBR 45
Opus 91
Vorbis q2

Plan B:
CVBR 96
Opus 94
Vorbis q2

I like Plan B. It is quite close to what's really used.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 16:39:41
1679 lossless tracks, biased towards metal and folk. Bitrates as reported by foobar2000.
...

It would be better to report real bitrates (filesize/duration). I see differences of ~2 kbps because of different container overheads.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-22 16:43:51
1679 lossless tracks, biased towards metal and folk. Bitrates as reported by foobar2000.
...

You are using too many tracks and it is slowing down the entire process. I'd be happier if you picked 50 tracks randomly and quickly filled in the blanks in the table.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-22 16:50:03
Winamp FhG V3: 105 kbps
FDK V3: 106
aotuv -q 1.99: 91.6

Opus 92, 97, aotuv 2.4 - will test later

# of tracks: 86
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 17:06:22
I think there is no need to include bitrates for codecs that won't be tested.

The list of codecs to test (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853223):
Quote
1.   LAME 3.99.5 -V 5
2.   Apple AAC
3.   Opus 1.1
4.   Vorbis aoTuV
+middle-low anchor FAAC 96 kbps *1
+low anchor ...
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gecko on 2013-12-22 17:23:09
It would be better to report real bitrates (filesize/duration). I see differences of ~2 kbps because of different container overheads.

OK, will do. Should I worry about stripping the tags?

You are using too many tracks and it is slowing down the entire process. I'd be happier if you picked 50 tracks randomly and quickly filled in the blanks in the table.

Yes, the aoTuV encode took 1 1/4 hours. I'm working on a reduced but representative set; however, this will take some time.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 17:28:17
OK, will do. Should I worry about stripping the tags?

hehe.

No, that's already hair-splitting. Plus, we store music with tags in the real world.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-22 17:29:43
aotuv -q 2.4: 103 kbps
opus 92: 94.7 kbps
opus 97: 99.8 kbps
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-22 17:38:50
(http://i39.tinypic.com/dz9kpd.png)

Thank you for providing the useful data.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-22 17:56:44
opus 94: 96.8 kbps
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-22 18:05:56
(http://i43.tinypic.com/9jzq7q.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gecko on 2013-12-22 19:32:30
100 tracks
edit: each entry lists two values, computed as:
kbps as: totalsize [bytes] / duration [sec] / 1024 * 8 (first value)
kbps as: totalsize [bytes] / duration [sec] / 1000 * 8 (second value)

Apple CoreAudioToolbox 7.9.8.3, CVBR 96: 98.6 101.0
Apple CoreAudioToolbox 7.9.8.3, TVBR 45: 91.6 93.8 (foobar: 92; 1679 tracks: 91.4 93.6)

FhG Winamp, vbr 3: 102.8

Opus libopus 1.1, --bitrate 92: 94.4 96.6
Opus libopus 1.1, --bitrate 94: 96.3 98.6
Opus libopus 1.1, --bitrate 96: 98.3 100.7 (foobar: 101; 1679 tracks: 98.0 100.3)
Opus libopus 1.1, --bitrate 97: 99.3 101.7

Vorbis aoTuV [20110424], q1.99: 88.2 90.3
Vorbis aoTuV [20110424], q2: 94.2 96.4 (foobar: 96; 1679 tracks: 94.2 96.4)
Vorbis aoTuV [20110424], q2.4: 98.2 100.5

As you can see, the chosen sample of 100 tracks is quite faithful to the bitrates achieved over the 1679 tracks. The median and stddev should also be similar.

edit: How come foobar deviates so much? Assuming foobar is discarding container overhead, shouldn't the reported bitrates be smaller than the ones calculated from the raw filesize? Am I doing something wrong?

edit: Arg! Damn kilo issue!  I hope I didn't mess up the numbers since I had to enter everything again.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-22 19:36:55
Well, kbps is totalsize [bytes] / duration / 1000 * 8
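The gap between the two conventions is about 2.4%, which matches the ~2 kbps discrepancies discussed above. A quick check:

Code: [Select]
size_bytes, duration_sec = 1_250_000, 100.0   # hypothetical 100-second file
print(size_bytes * 8 / duration_sec / 1000)   # 100.0 kbps  (kilo = 1000)
print(size_bytes * 8 / duration_sec / 1024)   # ~97.7 kbps  (kibi = 1024)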
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 19:46:27
lvqcl is right. It's 1000.

Gecko, can You recalculate it again? Thanks.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 20:15:31
(http://i43.tinypic.com/9jzq7q.png)

I will ask people to also report their bitrate for Vorbis -q2.2.

If we go with Plan B (~100 kbps) then we have a fair set of settings:
Apple CVBR 96 - 101.1 kbps
Opus 96 - 100.9 kbps
Vorbis - somewhere near -q2.2 (?)

P.S. I'm updating the same table here (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing#gid=5)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-22 20:18:57
@Kamedo2:
My test set consists of 24 tracks which I chose from old (pre-loudness-war) pop music as well as new, from rather 'hard and wild' pop music as well as rather slow ballads.

My preferred settings are Apple cvbr 96, Opus bitrate 96,  aoTuV q2.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gecko on 2013-12-22 20:20:14
Thanks lvqcl, IgorC. Fixed!

Does using 2^10 for kilo in an IT context make me a nerd?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 20:23:07
Does using 2^10 for kilo in an IT context make me a nerd?

It's rather ICT
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 20:30:28
Update.
Link (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing#gid=5) to table.

(http://s23.postimg.org/sc6ied4m3/bitrates.png) (http://postimage.org/)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Gecko on 2013-12-22 20:30:44
Vorbis aoTuV [20110424], q2.2: 99.6
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-22 20:33:31
aotuv -q 2.2: 100.9 kbps
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 20:44:24
Link (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing#gid=5)
(http://s15.postimg.org/a0pp0928b/bitrates.png) (http://postimage.org/)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 20:46:33
My preferred settings are Apple cvbr 96, Opus bitrate 96,  aoTuV q2.

+1
But aoTuV q2.2
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: LithosZA on 2013-12-22 21:03:05
Based on that table so far I would go with:
Apple TVBR 45 | ~94.9Kbps
Opus 92 | ~96.6Kbps
Vorbis q1.99 | ~91.7Kbps

OR

Apple CVBR 96 | ~101.1Kbps
Opus 97 | ~101.8Kbps
Vorbis q2.4 | ~102.7Kbps
But at these bitrates the tests might be too difficult?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 21:09:39
Based on that table so far I would go with:
Apple TVBR 45 | ~94.9Kbps
Opus 92 | ~96.6Kbps
Vorbis q1.99 | ~91.7Kbps

I'm against this set.
It's possible to lower the bitrate for Opus, e.g. --bitrate 89, but there would still be a significant bitrate advantage for TVBR (+3.5%). No go.
Let's wait for Kamedo2 and Kohlrabi to submit their rates for Vorbis -q2.2.


It's more like
Apple CVBR 96 - 101.1 kbps
Opus 96 - 100.8 kbps
Vorbis -q2.2 - ~101 kbps (?)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-22 21:39:38
Sorry, I still think that unequal average bitrates on the selected sound samples are a flaw of the test, because it means that this combination of settings and samples favors some codecs and is unlucky for the others. The only reasonable excuse for this situation could be the use of natural (integer) settings. The explanation that those "equal" settings are calculated using some big music library is not valid, because we do not use that big library in our test; we do not even use a representative sub-set of it. We use a limited set of somewhat marginal samples which have nothing in common with that big library. So why should we tune codecs with one set of samples and test them with another?

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-22 21:49:33
So basically you think that VBR is a useless thing. Right?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-22 21:51:15
IMO the average bitrate of an encoder setting should be taken from a test set. This is a general strategy which has nothing to do with a specific listening test.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-22 21:56:05
So an encoder with a good VBR algorithm isn't better than an encoder with a bad VBR algorithm?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: eahm on 2013-12-22 22:12:30
IMO the average bitrate of an encoder setting should be taken from a test set. This is a general strategy which has nothing to do with a specific listening test.

I'm not sure I understand this 100%, but I think I'm with you here. I believe the test, if ~96 kbps is chosen, must be done with the settings the developer or encoder gives us.

For example:
AAC-LC Apple CVBR 96 (~96 kbps) or TVBR 36 (~95 kbps)
AAC-LC Fraunhofer/fdk VBR 3 (~96-112 kbps)
Ogg Vorbis Q2 (~96-112 kbps)
Opus VBR 96 (~96 kbps)

If the bitrates are lower or higher, I think it means the encoder doesn't need more bits in the first case, or does need them in the second. It depends on how the developer tunes the encoder, IMO.

If we are looking for a precise bitrate, let's just use CBR or ABR.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-22 22:39:38
IMO the average bitrate of an encoder setting should be taken from a test set. ... .

Why are You saying this now and not a few days ago?


It's contrary to what You said a few days ago:
I prefer using a test set of hopefully representative music for deciding upon which settings to use.
And from the various results given here it looks like Apple --cvbr 96, Opus --bitrate 96, aoTuv -q2 give these contenders equal chances (maybe with a tiny bit stronger aoTuv setting). These are 'natural' choices moreover.

How do You expect people to follow your posts if You change your mind so quickly?  Huh?


...This is a general strategy which has nothing to do with a specific listening test

This is a general ... what?
How many times have we heard You say "bitrate increase on some particular hard sample/samples, but no significant bitrate increase overall" in your LAME extension threads?
And now what You suggest goes completely contrary to that.



@ Serge Smirnoff
Robert answered your question crystal clearly here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853431)
There is nothing to add.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: kennedyb4 on 2013-12-22 22:44:06
I prefer using a test set of hopefully representative music for deciding upon which settings to use.
And from the various results given here it looks like Apple --cvbr 96, Opus --bitrate 96, aoTuv -q2 give these contenders equal chances (maybe with a tiny bit stronger aoTuv setting). These are 'natural' choices moreover.


I'm not sure this is a good idea. Even though the last big 96kbps test consisted of challenging or "killer" samples, it was already quite difficult to detect faults. Simpler samples may make the test longer and more tedious, as well as making opportunities to detect flaws less frequent.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-22 22:58:56
So basically you think that VBR is a useless thing. Right?

No, I think we should use a different set of samples for choosing VBR settings: the samples that actually participate in the test, not some external ones. The encoder will choose how to distribute bitrate among those samples, so there is enough room to test the efficiency of the VBR algorithm.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-22 23:06:01
@ Serge Smirnoff
Robert has crystal clearly answered your question here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853431)
There is nothing to add.

I saw it. I don't agree with that answer, and that's why I posted my opposite opinion with supporting arguments.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: lvqcl on 2013-12-22 23:23:40
so there is enough room to test the efficiency of vbr algorithm.

I don't think so.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-22 23:35:27
so there is enough room to test the efficiency of vbr algorithm.

I don't think so.

What is your reasoning?

If some VBR algorithm chooses lower/higher bitrates for our test set, what does that mean - is it smarter or less smart, more efficient or less efficient?

In other words, how do unequal overall bitrates of codecs on some test set help to reveal the effectiveness of VBR algorithms?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-23 03:06:14
These are just my opinions:

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-23 04:48:45
The average bitrate of my library, 122 tracks, speed (x realtime)
88547 27.1 venc603(aoTuV) -q1.9 %i %o
88957 28.3 venc603(aoTuV) -q1.95 %i %o
89326 26.3 venc603(aoTuV) -q1.99 %i %o
89417 27.2 venc603(aoTuV) -q1.9999 %i %o
95359 29.2 venc603(aoTuV) -q2 %i %o
96169 26.1 venc603(aoTuV) -q2.1 %i %o
97673 25.9 venc603(aoTuV) -q2.2 %i %o
99763 27.7 venc603(aoTuV) -q2.4 %i %o
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-23 05:27:27
Update (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing#gid=5)

Well, Plan B (~100 kbps)?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-23 08:42:49
IMO the average bitrate of an encoder setting should be taken from a test set. ... .

Why are You saying this now and not a few days ago?
It's contrary to what You said a few days ago:
I prefer using a test set of hopefully representative music for deciding upon which settings to use.

How do You expect people to follow your posts if You change your mind so quickly?  Huh?

??? It was/is meant to be exactly the same thing. I must have expressed myself badly.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-23 13:23:34
I split the data from the two past listening tests. The samples used in the 64kbps Multiformat test are more critical than the AAC 96kbps samples.

2011 AAC 96kbps 20 samples
103112 34.5 qaac_2.32 --cvbr 96 -o %o %i
94985 34.1 qaac_2.32 --tvbr 45 -o %o %i
108696 35.5 ffmpeg_r59211 -i %i -c:a libfdk_aac -vbr 3 %o
94132 43.2 0.1.8-win32\opusenc --bitrate 88 %i %o
96191 43.3 0.1.8-win32\opusenc --bitrate 90 %i %o
98257 42.7 0.1.8-win32\opusenc --bitrate 92 %i %o
100366 40.4 0.1.8-win32\opusenc --bitrate 94 %i %o
102434 41.7 0.1.8-win32\opusenc --bitrate 96 %i %o
103480 41.8 0.1.8-win32\opusenc --bitrate 97 %i %o
90855 24.7 venc(aoTuV 6.03) -q1.99 %i %o
97229 24.4 venc(aoTuV 6.03) -q2 %i %o
98033 24.5 venc(aoTuV 6.03) -q2.1 %i %o
99510 24.3 venc(aoTuV 6.03) -q2.2 %i %o
101652 24.3 venc(aoTuV 6.03) -q2.4 %i %o

2011 Multiformat 64kbps 30 samples
104653 33.9 qaac_2.32 --cvbr 96 -o %o %i
101608 34.1 qaac_2.32 --tvbr 45 -o %o %i
115220 35.8 ffmpeg_r59211 -i %i -c:a libfdk_aac -vbr 3 %o
101793 41.9 0.1.8-win32\opusenc --bitrate 88 %i %o
104016 41.7 0.1.8-win32\opusenc --bitrate 90 %i %o
106244 41.8 0.1.8-win32\opusenc --bitrate 92 %i %o
108471 41.9 0.1.8-win32\opusenc --bitrate 94 %i %o
110690 41.9 0.1.8-win32\opusenc --bitrate 96 %i %o
111806 42.0 0.1.8-win32\opusenc --bitrate 97 %i %o
102545 24.5 venc(aoTuV 6.03) -q1.99 %i %o
110032 24.4 venc(aoTuV 6.03) -q2 %i %o
110863 24.4 venc(aoTuV 6.03) -q2.1 %i %o
112330 24.5 venc(aoTuV 6.03) -q2.2 %i %o
114934 24.6 venc(aoTuV 6.03) -q2.4 %i %o
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-23 15:49:09
The bitrate verification is about to close.

A spreadsheet. (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing#gid=5)
(http://s29.postimg.org/w9khp3wzr/bitrates.png) (http://postimage.org/)
Thank You to participants for their help.  Well, it's pretty clear what settings to use.


The following settings will be used:
LAME 3.99.5 -V5
Apple AAC CVBR 96 - 101.5 kbps
Opus 1.1 --bitrate 96 kbps - 101.7 kbps
Vorbis aoTuV b6.03 -q2.2 - 101.5 kbps
+middle-low anchor  FAAC 96 kbps
+low anchor



Agenda (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing#gid=6)
Now the discussion about test samples is open: how to choose them, quantity, etc. You can submit your own samples as well.
The holidays are near, but we will still have time to choose samples, until January 5 or so.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: funkyblue on 2013-12-24 02:18:28
Great work, everyone. I look forward to participating.

Merry Christmas
Scott
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-24 14:14:46
This is what I believe to be a better version, but with the same conclusion:
(http://i40.tinypic.com/246mkiv.png)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-24 14:27:39
Agenda (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing#gid=6)
Now the discussion about test samples is open: how to choose them, quantity, etc. You can submit your own samples as well.
The holidays are near, but we will still have time to choose samples, until January 5 or so.


What length of sample should we target?  I think 30s is a pretty hard upper-limit, but I'd prefer things down around 10-12s.  I personally have a hard time comparing longer samples.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-24 14:56:29
Yes, I would say no less than 8 seconds and no more than 10-12.

P.S. The first 1-2 seconds are cut, so 10 seconds should be fine.

P.S. 2: Anyway, it's open for discussion.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-24 15:57:40
I think a length of 10 seconds would give too much of an advantage to Opus. The file would be around 120KB, and the Vorbis header is a few kilobytes. Make it 20 or 30 seconds. Testers don't need to hear the entire length of the sample in every ABX session; sometimes it's too obvious.
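A back-of-the-envelope check of that header overhead (the 4 kB header size is an assumption for illustration; real Vorbis headers vary with the codebooks):

Code: [Select]
header_bytes = 4000                        # assumed Ogg Vorbis header size
for seconds in (10, 20, 30):
    payload_bytes = 96000 / 8 * seconds    # ~96 kbps audio payload
    overhead_kbps = header_bytes * 8 / seconds / 1000
    print(seconds, "s:", round(payload_bytes / 1000), "KB payload,",
          round(overhead_kbps, 1), "kbps of header overhead")
# 10 s: ~3.2 kbps of overhead; 30 s: ~1.1 kbps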
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-24 16:04:47
It's possible to encode a 30-second sample and then specify a trim to the first 10 seconds as an additional offset in the ABC/HR Java program, or use any other application to cut the decoded .wav.
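For the second option, a minimal sketch using only the Python standard library (file names are hypothetical):

Code: [Select]
import wave

def trim_wav(src, dst, start_sec, length_sec):
    # Copy `length_sec` seconds of `src`, starting at `start_sec`, to `dst`.
    with wave.open(src, "rb") as win:
        rate = win.getframerate()
        win.setpos(int(start_sec * rate))
        frames = win.readframes(int(length_sec * rate))
        with wave.open(dst, "wb") as wout:
            wout.setparams(win.getparams())  # header is re-patched on close
            wout.writeframes(frames)

# trim_wav("sample_30s.wav", "sample_10s.wav", start_sec=0.0, length_sec=10.0)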
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-24 23:15:06
During the previous public test a large list of samples (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89518&view=findpost&p=762952) was made. 
Then 20 samples (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89518&view=findpost&p=762958) were randomly (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89518&view=findpost&p=762351) picked.

Everybody is welcome to submit samples in Samples for a new multiformat public test (http://www.hydrogenaudio.org/forums/index.php?showtopic=103989), an upload thread


Also, a few items to talk about:
Quantity of samples.
Sample duration.
What proportions of new samples and killer samples to include.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-25 02:37:30
During the previous public test a large list of samples (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89518&view=findpost&p=762952) was made. 
Then 20 samples (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89518&view=findpost&p=762958) were randomly (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=89518&view=findpost&p=762351) picked.


I like the way the samples were chosen for the previous test. Specifically I like that there were buckets for the different types of music and a few were randomly selected from each bucket.  I don't think we should include speech buckets this time around as 100kbps seems an unlikely bitrate for speech encoding.  I would like to see a bucket for music without instrumentals though.  A single voice and multi-voice song / chant / a cappella would be nice.

Regarding the number of samples in total, I think we should aim high.  I suspect at 100kbps a large number of the samples will be indistinguishable.  I think having a large total number of samples will increase the chance that we still get usable data.  If we make it clear that not hearing a difference is an okay and expected result, then I don't think it unduly strains the listener to have a large number of test samples.

The area I don't have a good feeling for is the killer samples.  I absolutely want them included. I'd much rather use a codec that produces barely detectable differences 20% of the time than a codec that's indistinguishable 90+% of the time but has clearly audible problems when it does falter.

I just don't have a sense how to include the killer samples fairly.  If we include 4 mp3 killers and 1 opus killer, does that penalize mp3 4 times as much?  Or is that fair if mp3 runs into trouble 4 times as often?  I was thinking maybe we could have killer buckets that we pick from evenly (1 or 2 samples each): opus killers, mp3 killers, aac killers, vorbis killers. 

Do we have aac killers?  Sample #7 from the 2011 test gave both aac encoders trouble while opus and vorbis did well.  Samples #14 and #29 seemed to give Vorbis the most trouble while not causing as many problems for opus and aac.  The harpsichord sample #2 seemed to be opus 1.0's achilles heel.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-25 02:57:58
It's possible to encode a 30 sec sample and then indicate a  trim  to first 10 seconds an additional offset in ABC/HR Java program. Or use any other apllication to cut a decoded .wav

I'm not sure why we wouldn't just trim the sample's wav before encoding.  Is the concern that some encoders "come up to speed" over 1-2s and would be unfairly penalized by including the first couple of seconds in the test samples?  Also, I am assuming all the test clips will be converted back to wav (flac) after encoding, so headers and file sizes should have no effect.

I think we should take care to normalize the volume of the clips over the specific range that will be tested.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: birdie on 2013-12-25 12:39:10
I would like to see:

64k, 96k and 128k for



Testing bitrates at or above 192k seems pointless, since all these codecs provide almost 100% transparency at high bitrates.

I don't want to see LAME MP3, since at bitrates lower than 128k it struggles to provide any decent quality.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-25 14:10:08
I don't think we should include speech buckets this time around as 100kbps seems an unlikely bitrate for speech encoding.  I would like to see a bucket for music without instrumentals though.  A single voice and multi-voice song / chant / a cappella would be nice.

Video streaming sites, e.g. Youtube, use 96kbps for the default resolution (360p). In many cases that's speech.
Also, people were interested to see how well the codecs did on speech during the last test.

I would like to see a bucket for music without instrumentals though.  A single voice and multi-voice song / chant / a cappella would be nice.

Agree.
Good point.

I just don't have a sense how to include the killer samples fairly.  If we include 4 mp3 killers and 1 opus killer, does that penalize mp3 4 times as much?  Or is that fair if mp3 runs into trouble 4 times as often?  I was thinking maybe we could have killer buckets that we pick from evenly (1 or 2 samples each): opus killers, mp3 killers, aac killers, vorbis killers. 

Do we have aac killers?  Sample #7 from the 2011 test gave both aac encoders trouble while opus and vorbis did well.  Sample #14 and #29 seemed to give Vorbis the most trouble while not causing as many problems for opus and aac.  The harpsichord sample #2 seemed to be opus 1.0's achilles heel.

Killer samples have some characteristics in common. They can contain sharp transients, pure tones, wide stereo separation, or any combination of these signals. So it's possible to identify them without targeting one particular codec.
What if, instead of submitting killer samples for particular codecs, we prepare two lists: somewhat-hard samples and killer samples? Later we randomly choose the test samples in a proportion of approx. 80/20 (?) somewhat-hard samples to killer samples.

I think we should take care to normalize the volume of the clips over the specific range that will be tested.

Yes, normalization is always done.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-25 14:14:48
@birdie
A choice of codecs was performed during December 8 - December 18.
https://docs.google.com/spreadsheet/ccc?key...p=sharing#gid=6 (https://docs.google.com/spreadsheet/ccc?key=0AivUr-pp6BuUdDRuSmNGQXphNGdxYjJrbHRFWU42NFE&usp=sharing#gid=6)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-25 14:23:48
And we had only 2 speech samples last time: one male English and one female English (singing), samples 06 and 18. That doesn't hurt.

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-25 18:33:51
Concerning the bias of a listening test due to variance of codec bitrates.

The table of codec bitrates for the previous HA@96 listening test shows that the resulting bitrates of the vbr encoders are not equal on the selected test set of sound samples (the test set).

Code: [Select]
                Nero	CVBR	TVBR	FhG	CT	low_anchor
[per-sample bitrate rows not recoverable: this copy of the post accidentally duplicates the score table below]
------------------------------------------------------------
Mean 94.9 100.9 93.45 100.4 100.0 99.6
It looks like everybody understands that such inequality favors some codecs in the listening test. At least this is not a secret, and IgorC mentioned it here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=90403&view=findpost&p=767234).

Let's define the issue more clearly. We have the table of codec per-sample bitrates (above) and the table of codec per-sample scores:

Code: [Select]
                Nero	CVBR	TVBR	FhG	CT	low_anchor
Sample01 3.64 4.22 4.69 4.23 3.71 1.60
Sample02 4.05 4.47 4.13 4.52 3.46 1.41
Sample03 3.30 3.51 3.24 3.34 3.20 1.60
Sample04 3.57 4.52 4.55 4.73 4.41 2.42
Sample05 4.04 4.53 4.54 3.97 4.43 1.33
Sample06 4.19 4.58 4.59 4.62 4.65 1.52
Sample07 3.65 4.10 4.32 4.53 3.85 1.47
Sample08 3.83 4.62 4.41 4.49 4.18 1.67
Sample09 3.62 4.27 4.26 4.72 3.91 1.60
Sample10 3.66 4.30 4.34 4.24 4.26 1.72
Sample11 3.82 4.28 4.21 3.96 4.13 1.58
Sample12 3.48 4.67 4.37 4.35 3.81 1.48
Sample13 4.13 4.54 4.64 4.08 4.24 1.50
Sample14 3.42 4.32 4.40 4.29 4.10 1.34
Sample15 3.60 4.54 4.72 4.18 3.69 1.51
Sample16 3.92 4.70 4.52 3.98 4.26 1.44
Sample17 3.85 4.41 4.55 4.49 4.57 1.32
Sample18 3.67 4.79 4.37 5.00 4.83 1.42
Sample19 3.08 4.26 3.78 4.11 3.96 1.25
Sample20 3.34 4.72 4.65 3.43 3.88 1.27
------------------------------------------------------------
Mean 3.69 4.42 4.36 4.26 4.08 1.52

For each sound sample and the four vbr encoders (first four columns) we can calculate the coefficient of correlation between bitrates and the corresponding scores. These twenty coefficients are below:

Code: [Select]
Sample01    0.6454
Sample02    0.6352
Sample03    0.7327
Sample04    0.2685
Sample05  -0.3851
Sample06    0.6219
Sample07    0.5927
Sample08    0.2423
Sample09    0.7509
Sample10    0.8660
Sample11  -0.4295
Sample12    0.6259
Sample13    0.6286
Sample14    0.7710
Sample15    0.5018
Sample16    0.1358
Sample17  -0.5315
Sample18    0.8167
Sample19  -0.4780
Sample20    0.2855

And here is bootstrap mean of these coefficients:
(http://img837.imageshack.us/img837/4213/c3i1.png)

We can see strong evidence of correlation between bitrates and scores (all means are significantly far from zero). In simple terms, the final scores depend on the resulting bitrates. This is a bias.
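A sketch of this computation (the pearson() demo uses toy numbers, since the bitrate table isn't fully reproduced here; the bootstrap reuses the twenty coefficients quoted above):

Code: [Select]
import random

def pearson(xs, ys):
    # Correlation between per-codec bitrates and scores for one sample
    # (the first four columns of the tables above).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(pearson([94.0, 101.2, 93.1, 100.5], [3.6, 4.2, 4.7, 4.2]))  # toy row

# The twenty per-sample coefficients quoted above.
coeffs = [0.6454, 0.6352, 0.7327, 0.2685, -0.3851, 0.6219, 0.5927,
          0.2423, 0.7509, 0.8660, -0.4295, 0.6259, 0.6286, 0.7710,
          0.5018, 0.1358, -0.5315, 0.8167, -0.4780, 0.2855]

# Bootstrap the mean coefficient; an interval that excludes zero is the
# "significantly far from zero" evidence mentioned above.
boot = sorted(sum(random.choices(coeffs, k=len(coeffs))) / len(coeffs)
              for _ in range(10000))
print("mean:", sum(coeffs) / len(coeffs), "95% CI:", boot[250], boot[9750])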

Once again, it seems that people here are well aware of this dependence but prefer to think that this bias is acceptable and even justifiable by the “nature of vbr encoding”. It is considered that target bitrates should be calculated using as big and varied a music library as possible, and the inevitable inequality of bitrates on the test set is a consequence of the encoders' natural behavior and should be kept. So if a codec consumes more bits on this particular test set, it is presumably smart enough to spot problem samples and increase the bitrate for them to preserve the required quality. That is a valid hypothesis, but there is an alternative one – the codec requires more bits than the other contenders on this test set because its vbr algorithm is less efficient. You can't choose which hypothesis is true until you get the scores of perceptual quality. The variance of bitrates itself (without scores) can be interpreted both ways – as the smart decision of an efficient vbr codec or as the protective response of a poor one. In other words, the variation of bitrates by itself has no useful meaning; it is just random variation that introduces noise into the results of the test. The noise is so heavy (the max. difference between bitrates is 8%) that all the punctiliousness with calculation of p-values looks almost funny.
 
Consequently, if we want to compare the efficiency of VBR codecs, their target bitrates on the test set should be set as close to each other as possible (s0). If this is not possible (due to discrete q-values), the goals of the listening test should be redefined, because the test no longer compares the efficiency of the algorithms but the perceived quality of particular encoder settings. Such a test can be very useful as well; the only question is how to choose the particular settings. Several options could be proposed:
[blockquote](s1) natural (integer) settings; results are easy to interpret and use

(s2) settings that produce equal bitrates with music of some genre (classic rock, for example) or some predefined mix of genres; while one genre is acceptable to some extent, any mixture of genres makes interpretation of the results less clear.

(s3) settings that produce equal bitrates with personal music library of Bob; results are perfectly useful for Bob.

(s4) settings that produce equal bitrates with the combined personal music libraries of Bob and Alice; results are less useful for both Bob and Alice; increasing the number of participants worsens the usefulness further.

(s5) settings that produce equal bitrates for the whole population of music; the results are useful to nobody, because it's hard to tell how your particular music (the kind you usually deal with) relates to that universe and how your particular bitrates relate to those “global” ones.[/blockquote]
Furthermore, calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like – what size it has, what structure, how it changes over time, or how to get access to all of it. “The whole music universe” is an entirely unscientific quantity; we can only guess at some of its properties. The one thing we can be sure of is that it is not homogeneous – it is structured by genres, at least. And here comes the main problem with the calculation of “global” bitrates. The calculation rests on the assumption that, as the amount of music material gradually increases, the final bitrate of a codec tends to some definite value. That would be perfect ground if we could select tracks randomly from the population, but this is impossible in practice; it would take an enormous amount of research. In reality we calculate bitrates using some limited music material that a few people had at hand at the moment. If we add a good portion of classical music the values will change; if we add a proportional amount of space ambient the values will change again. With only restricted access to the population of music, this process is practically endless and does not converge to any final value. So bitrates calculated this way can safely be considered random, because we can't even estimate how far they are from the true “global” bitrates.

Anyway, even if we could manage to accomplish this task and calculate those “global” bitrates, they would have no practical meaning at all, as already explained. Thus calculating bitrates (and the corresponding encoder settings) from aggregated music material (even all of it) makes no practical sense. It is just a very sophisticated way of choosing a random bias for a listening test.

One more method should be mentioned for completeness (s6): settings can be tuned for each sound sample to provide the same bitrate. Such a test would be perfectly valid, as it would show how efficiently each encoder uses the same amount of bits on each sample. Unfortunately this method is suitable only for encoders with a continuous q-parameter scale.
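
For illustration, a sketch of (s6) under the assumption of a continuous, monotonic q-scale; the encode_bitrate callable is a hypothetical wrapper around an actual encoder, not any specific tool:

Code: [Select]
from typing import Callable

def tune_q(encode_bitrate: Callable[[float], float], target_kbps: float,
           q_lo: float = 0.0, q_hi: float = 10.0,
           tol_kbps: float = 0.5, max_iter: int = 50) -> float:
    """Bisect a continuous q-parameter until one sample's bitrate hits the target.

    encode_bitrate(q) encodes the sample at quality q and returns its bitrate
    in kbps; the bitrate is assumed to increase monotonically with q.
    """
    for _ in range(max_iter):
        q_mid = (q_lo + q_hi) / 2.0
        kbps = encode_bitrate(q_mid)
        if abs(kbps - target_kbps) <= tol_kbps:
            break
        if kbps < target_kbps:
            q_lo = q_mid
        else:
            q_hi = q_mid
    return (q_lo + q_hi) / 2.0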

My conclusions. There are only two reasonable ways of setting vbr encoders for a listening test:
[blockquote](s0) settings that provide equal bitrates for all encoders on the selected test set; in this case the listening test compares the efficiency of the VBR algorithms; the closer the bitrates, the more accurate the results (less noise due to the variance of bitrates).

(s1) natural (integer) settings; in this case the test compares particular (popular) settings of the encoders (in many cases results can be bias-corrected afterwards; if so – this needs research – there is still a chance to make inferences about the efficiency of the encoders).[/blockquote]
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-26 06:26:55
I think Serge's points have merit.

I think making sure every test sample is encoded at the exact same bitrate is an excellent idea for a 96kbps CBR listening test.
I think making sure each encoder averages the same bitrate over the test samples is an excellent idea for a 96kbps ABR listening test.
I think making sure each encoder averages the same bitrate over the superset of all music is an excellent idea for a 96kbps VBR listening test.

I think each of those has value.  The one that I'm most interested in is the unconstrained VBR listening test.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-26 08:18:54
I think Serge's points have merit.

I think making sure every test sample is encoded at the exact same bitrate is an excellent idea for a 96kbps CBR listening test.
I think making sure each encoder averages the same bitrate over the test samples is an excellent idea for a 96kbps ABR listening test.
I think making sure each encoder averages the same bitrate over the superset of all music is an excellent idea for a 96kbps VBR listening test.

I think each of those has value.  The one that I'm most interested in is the unconstrained VBR listening test.

And what is your variant for comparing CBR against VBR? What if, for example, some codec uses CBR and VBR alternately? What if codec developers invent something completely different? Shouldn't we have a common procedure for testing the efficiency of codecs regardless of their internal mechanics, whatever they may be? All the more so since most listening tests are organized precisely to compare alternative coding algorithms. Efficiency, in the end, is a very simple concept – the ratio of allowed bits to resulting perceived quality.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-26 09:00:25
Serge,
Stop polluting the thread. If you have strong disagreements with a lot of people here about testing methodology, maybe you should open a separate thread. This disagreement has already gone on for years, and jumping into the preparation discussion now, after a two-year break (since the last test), is inappropriate given that we have work to do in a short period of time.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-26 09:47:29
Serge,
Stop polluting the thread. If you have strong disagreements with a lot of people here about testing methodology, maybe you should open a separate thread. This disagreement has already gone on for years, and jumping into the preparation discussion now, after a two-year break (since the last test), is inappropriate given that we have work to do in a short period of time.

I see some flaws in your test setup that decrease the accuracy of the test results, so I am doing my best to describe them and provide arguments. You are the conductor of this test, so it's up to you whether to consider them or not. If you can't decide for yourself, ask the community for help. I just want this test to be properly organized, with everybody understanding what we do in this test, why we do it, and what the goal of the test is. And the preparation discussion is the best place for such a controversy, imho.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-26 10:25:40
I don't know if you have noticed, but most of the people who post here were actually listeners in previous tests. They have been involved for a while.
The problem is that we aren't sure about your approach. According to your tests, MP3 at 64 kbps, 22 kHz ranks better than Vorbis at 64 kbps. That is a problem.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Garf on 2013-12-26 10:37:09
I think Serge's points have merit.

[...]
I think making sure each encoder averages the same bitrate over the superset of all music is an excellent idea for a 96kbps VBR listening test.

I think each of those has value.  The one that I'm most interested in is the unconstrained VBR listening test.


You are misunderstanding what Serge is saying. He says that despite a codec having a VBR setting that averages 96kbps over a superset of all music, we should use *another* bitrate/setting calibrated to produce 96kbps over the *test samples* because...VBR codecs that identify which samples need more bits produce better scores. This is somehow considered undesirable....because...don't ask me, it makes no sense.

I think the argument has been made before, and it was just as wrong back then as it is now. The point of VBR is that a codec can spend more bits where it needs to. Serge is now advocating that the working of VBR be "filtered" out of the test? If you want to do that, you do a CBR test.

There is no point in using a mode and then trying to disable exactly the effect of that mode. This is insanity.

Quote
So if a codec consumes more bits on this particular test set, it is presumably considered smart enough to spot problem samples and to raise the bitrate for them to preserve the required quality. That is a valid hypothesis, but there is an alternative one – the codec requires more bits than the other contenders on this test set because its VBR algorithm is less efficient... – as the smart decision of an efficient VBR codec or as the protective response of a poor one.


I don't even get what this is supposed to mean, or why it would matter. The codecs produce the expected VBR bitrates over a large corpus. Why does it matter for what reason they're varying their bitrates over the test set? I can't even make sense of what point your last sentence is supposed to make, as far as I can tell you're making an artificial distinction so you can go on and fail to make any inference from that.

Quote
That would be perfect ground if we could select tracks randomly from the population, but this is impossible in practice; it would take an enormous amount of research.


I outlined such a method, which is not very complicated, earlier in this thread. The problem with the current method is that it biases toward music that is more popular with our audience. The upshot of that flaw is that it makes the results more, not less, meaningful for our readers, although you're free to point out, when discussing the results, that they are biased toward popular rather than unpopular music.

The alternate is to not test (VBR) at all, which is even less useful.

Quote
(s0)...
(s5) settings that produce equal bitrates for the whole population of music; the results are useful to nobody, because it's hard to tell how your particular music (the kind you usually deal with) relates to that universe and how your particular bitrates relate to those “global” ones.


This reasoning is completely and utterly bogus. The result of the test is what an average listener can expect on an average song with the tested codec+settings. In the absence of more information, it's a very useful result to see which codec is best, because the odds are always higher that this codec is also best if you pick a specific sample (genre) and a listener.

If your reasoning were valid (and as just demonstrated, it isn't), then there would be no point in doing any tests, because the listeners *themselves* already vary.

Quote
I see some flaws in your test setup that decrease the accuracy of the test results, so I am doing my best to describe them and provide arguments.


Unfortunately I didn't find any valid argument regarding your stance on the VBR bitrates, and your proposal actively decreases the accuracy of a VBR test. The only point you made that I consider valid is the one regarding sample selection, and that was pointed out and discussed several times already in the past few pages.

Quote
I just want this test to be properly organized, with everybody understanding what we do in this test, why we do it, and what the goal of the test is.


Yes, thank you again for illustrating that the setup is as close to optimal as we can get for now. People who want to understand why the bitrates for the sample set don't average 96kbps will have even more pages to refer to.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-26 11:00:20
Please let's stop the argument between Serge's testing methodology and the one used here.
As for the latter, Serge, you can see that the average bitrates for the various test sets used in this thread don't vary much.
More important, the conclusions about fair settings for the participating encoders are exactly the same for every test set. And in case an encoder chooses a higher bitrate than usual on a problematic spot, it's quite natural that this encoder has a quality advantage there. Good detection of music that needs more bits should be rewarded, as long as the average bitrate over a test set of regular music isn't increased.
That's the idea behind the testing methodology here. There may be disadvantages to this approach too, but this is the way we want to go here.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-26 11:43:45
... The codecs produce the expected VBR bitrates over a large corpus. Why does it matter for what reason they're varying their bitrates over the test set?

Exactly – it absolutely doesn't "matter for what reason they're varying their bitrates over the test set". And if it doesn't matter, then such variation should be removed from the test. Otherwise it is not clear why this variation, which has no meaning, is present in the test setup. It has no meaning – that is the point. It only spoils the results, being just a bias.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: C.R.Helmrich on 2013-12-26 14:12:14
Regarding sample selection:

Can we assume the sample pool of the 2011 test is included? In any case, I (still) recommend the test set I constructed in 2010, which Igor already kindly mentioned here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=852761):
http://www.hydrogenaudio.org/forums/index....st&p=695576 (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=77584&view=findpost&p=695576)
IIRC only BerlinDrug was actually chosen from that list in the 2011 test. One of the samples, CantWait, is stereo-miked a-cappella male singing, which nicely fits the category TheBashar suggested here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853668).

I can provide samples which aren't available on HA any more.

Regarding VBR behavior:

Why worry about the VBR behavior now? In the 2011 96-kbps test most coders behaved identically over the entire sample pool and over the actually tested subset of samples (FhG, CVBR and Dolby all ended up at an average of 100 kbps; TVBR was chosen as close to 100 kbps as possible). Only the Nero encoder differed a bit, but it's not included in the 2014 test.

The point of VBR is that a codec can spend more bits where it needs to. Serge is now advocating that the working of VBR be "filtered" out of the test?

I understood something different, namely that the actually tested samples shall be coded with the target bit-rate on average. Meaning: of course the codecs can still run with VBR, but their average behavior shall be adjusted to the set of test samples. But like I said, in the previous test it didn't matter except for the Nero encoder (meaning no re-adjustment was really necessary), so I recommend focusing on the sample selection for now, and taking a look at the average bit-rates only once the test set is completed.

One question about a special scenario, though. Let's assume that all codecs' VBR settings were calibrated on a very large sample set S, incl. a handful of samples X where a codec greatly increases the bit-rate to a factor of, say, 1.5 times the target (sample-set-average) bit-rate. Let's also assume that the number of samples X is so small compared to S that their removal from S doesn't affect the calculation of the average VBR bit-rate over S. Now let's also assume that one or more of samples X are included in the listening test set L, which - since L is smaller than S - shall lead to the case that the average VBR rate over L (incl. X) be quite a bit larger than the average VBR rate over S (also incl. X).

In such a scenario, I conclude that there is no penalty for a codec which greatly boosts the bit-rate on some item (e.g. up to a factor of 1.5), when compared to another codec which does not boost the VBR rate on the same item (e.g. stays close to 1.0), even if both codecs end up providing (roughly) the same audio quality. Again, the a-priori bit-rate calibration is assumed to not have revealed this behavior, since set X is much smaller than S.
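
A toy numeric check of this scenario (all figures invented purely for illustration):

Code: [Select]
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(96, 8, size=5000)                # large calibration corpus, ~96 kbps average
X = np.full(5, 96 * 1.5)                        # a handful of boosted items (factor 1.5)

print(np.concatenate([S, X]).mean())            # ~96: calibration barely notices X
L = np.concatenate([rng.choice(S, 17), X[:3]])  # small test set that includes some of X
print(L.mean())                                 # ~103: X inflates the small set's average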

Loosely following this thread I also conclude that most contributors find it acceptable that, given such a scenario, there is no such penalty. Is this correct?

Chris
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-26 15:56:02
Concerning sample selection.

Quote
That would be perfect ground if we could select tracks randomly from the population, but this is impossible in practice; it would take an enormous amount of research.

I outlined such a method, which is not very complicated, earlier in this thread. The problem with the current method is that it biases toward music that is more popular with our audience. The upshot of that flaw is that it makes the results more, not less, meaningful for our readers, although you're free to point out, when discussing the results, that they are biased toward popular rather than unpopular music.


So we have the listening test.

[blockquote](1) This test is not aimed at comparing the efficiency of VBR encoding; it compares encoders at specific settings.

(2) What are these settings, and why are they important to us? Because these settings provide almost equal bitrates on the aggregated music material of lvqcl, Gecko, kamedo2, Kohlrabi and the two previous tests. We also believe that this music material is pretty typical for our forumers, and that's why these settings are interesting to compare (s4).
[/blockquote]
If both statements are correct, then the only possible method of choosing sound samples is randomly picking them from that aggregated sound material. As all the material is at hand, there is no problem performing almost perfect sampling (random and sufficient).

Once the test set is properly selected, the bitrates of all codecs will inevitably be equal (magic, isn't it?). And this equality will indicate that the test set is representative. That is the correct design of the test.
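
A toy simulation of this claim (numbers invented for illustration): two encoders calibrated to the same mean bitrate over the aggregated corpus will show that same equality, up to sampling error, on a randomly drawn test set:

Code: [Select]
import numpy as np

rng = np.random.default_rng(7)
corpus_a = rng.normal(96, 10, size=2000)  # encoder A: per-track bitrates over the corpus
corpus_b = rng.normal(96, 14, size=2000)  # encoder B: same mean bitrate, wider spread

idx = rng.choice(2000, size=40, replace=False)     # a random 40-sample test set
print(corpus_a[idx].mean(), corpus_b[idx].mean())  # both close to 96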

If we want to test encoders with some predefined set of samples, then that is another (different) listening test. In that case the use of any external music material is irrelevant in the context of the test. And there are only two options for setting the encoders – providing equal bitrates on the test set (s0) or using natural (integer) settings (s1) – depending on what we want to compare in the test: efficiency or popular settings.

We need to decide what kind of results we want to see.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-26 18:36:59
Well,

Robert, lvqcl and Garf disagree with you.

So my answer to you is: No.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-26 22:02:33
So if a codec consumes more bits on this particular test set, it is presumably considered smart enough to spot problem samples and to raise the bitrate for them to preserve the required quality. That is a valid hypothesis, but there is an alternative one – the codec requires more bits than the other contenders on this test set because its VBR algorithm is less efficient. You can't tell which hypothesis is true until you get the perceptual quality scores.

The beauty of VBR is that it assigns more bits where they are needed while taking bits away from where they are needed less. If the VBR algorithm is less efficient and fruitlessly puts more bits in random places, it will surely take bits away from where they are needed. That's why many immature and poorly-tuned VBR encoders exist. The concern is very real. I'm currently putting a lot of effort into improving FFmpeg's native AAC encoder, both CBR and VBR. The CBR is constantly getting better, but the VBR distributes more bits to random noise and fewer bits to tonal samples (which need more bits). It is a disaster: many tonal samples collapse, or at least degrade. The inefficiency is, of course, detectable by the traditional HA method.


Furthermore, calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like – what size it has, what structure, how it changes over time, or how to get access to all of it. “The whole music universe” is an entirely unscientific quantity; we can only guess at some of its properties.

Likewise, many “global” pharmaceutical and other investigations involving humans don't test North Koreans. If we were to draw generalized conclusions about women, we would have to test all women on the planet, which is impossible, or randomly pick a large enough number of women across the globe to test. So, strictly speaking, they would have to pick one North Korean woman per 300 women if we wanted a generalized statement about women. That step is typically omitted. Still, the results are typically highly applicable to North Korean women, and foreign humanitarian medical aid has produced many positive results. We are all Homo sapiens. We act like humans, we do what humans like to do, and we create what is pleasing to human beings.

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-26 22:40:43
Hi all. I have something like a proposal, and also a question that should be thoroughly discussed (in my opinion).

You know that many recordings today use the full digital scale, up to 0 dBFS, and after lossy encoding we almost always get samples with levels higher than 0 dBFS. First of all, if the encoder itself uses fixed-point processing, these samples will be lost already during encoding. But AFAIK the encoders taking part in our test allow floating point, so encoding will be fine at any level.
But let's see what happens afterwards, especially in the "real world". People simply encode their recordings into some format and then, for example, upload them to their portable players. Again, AFAIK almost all portable equipment uses decoders with fixed-point processing. So if we have samples with levels much higher than 0 dBFS (1.00000), we will get deep clipping on such equipment. And this clipping really can be audible (for example, I once successfully passed an ABX test comparing a clipped MP3 with a peak of about 1.36 against the same MP3 with the clipping removed).

So my question is: maybe we should consider clipping as part of the quality losses and as a flaw of the encoding algorithm? If so, we must not take any action to prevent clipping (such as attenuation before or after encoding).

I think it really makes sense, because it makes our test closer to real-life conditions.

I would like to know what all of you think about it.

add:
On the other hand, we can consider clipping only as a problem of the decoder, not the encoder. In this case we take into account only the irretrievable losses.
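
A minimal sketch (NumPy assumed) of the fixed-point effect described above: a decoded signal that peaks above full scale survives on a floating-point path but is hard-clipped on conversion to 16-bit integer PCM:

Code: [Select]
import numpy as np

# Decoded float samples peaking beyond full scale, e.g. the 1.36 case mentioned above.
decoded = np.array([0.5, 1.36, -1.2, 0.9], dtype=np.float32)

# Float path: over-full-scale samples are preserved and can still be attenuated later.
float_out = decoded

# Fixed-point path: everything beyond +/-1.0 is clamped before int16 conversion.
int16_out = (np.clip(decoded, -1.0, 1.0) * 32767).astype(np.int16)

print(int16_out)  # [ 16383  32767 -32767  29490] - the 1.36 peak is flattened to full scale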
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-26 22:57:22
The inefficiency is, of course, detectable by the traditional HA method.

I'm sure that if the test were properly designed it could detect inefficiency even better.

Furthermore, calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like – what size it has, what structure, how it changes over time, or how to get access to all of it. “The whole music universe” is an entirely unscientific quantity; we can only guess at some of its properties.

Likewise, many “global” pharmaceutical and other investigations involving humans don't test North Koreans. If we were to draw generalized conclusions about women, we would have to test all women on the planet, which is impossible, or randomly pick a large enough number of women across the globe to test. So, strictly speaking, they would have to pick one North Korean woman per 300 women if we wanted a generalized statement about women. That step is typically omitted. Still, the results are typically highly applicable to North Korean women, and foreign humanitarian medical aid has produced many positive results. We are all Homo sapiens. We act like humans, we do what humans like to do, and we create what is pleasing to human beings.

Of course we can get some idea about the music universe using a limited amount of music. The problem with the current listening test design is (to use your analogy) that the pharma company studies the preferences of women on a global scale, sampling them from different countries, and then tests its products on North Korean women.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-26 23:27:59
...AFAIK almost all portable equipment uses decoders with fixed-point processing. So if we have samples with levels much higher than 0 dBFS (1.00000), we will get deep clipping on such equipment. And this clipping really can be audible...

This is a real-life problem which should be covered by the RG mechanism. IMO we shouldn't blame an encoder if its lossy encoding peaks happen to be higher than other encoders'. We should rather use RG on the test samples whenever necessary.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-26 23:43:55
Quote
This is a real life problem which should be covered by the RG mechanism

ReplayGain can prevent clipping only if it receives floating-point data; otherwise the samples are already clipped. That is, if you mean the RG mechanism in hardware players. If you mean foobar2000's RG – of course, it easily prevents clipping. But either way this requires additional processing, not just decoding.

Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-27 00:15:54
Regarding VBR behavior:

Why worry about the VBR behavior now? In the 2011 96-kb test most coders behaved identically over the entire sample pool and over the actually tested subset of samples (FhG, CVBR, Dolby all ended up at an average of 100 kbps, TVBR was chosen as closely to 100 kbps as possible). Only the Nero encoder differed a bit, but it's not included in the 2014 test.

I see a big difference between two cases: encoder settings that have been chosen correctly, and encoder settings that merely turned out to be correct (coincided with the correct ones). The procedure should be clearly defined and have a clear meaning.

One question about a special scenario, though. Let's assume that all codecs' VBR settings were calibrated on a very large sample set S, incl. a handful of samples X where a codec greatly increases the bit-rate to a factor of, say, 1.5 times the target (sample-set-average) bit-rate. Let's also assume that the number of samples X is so small compared to S that their removal from S doesn't affect the calculation of the average VBR bit-rate over S. Now let's also assume that one or more of samples X are included in the listening test set L, which - since L is smaller than S - shall lead to the case that the average VBR rate over L (incl. X) be quite a bit larger than the average VBR rate over S (also incl. X).

In such a scenario, I conclude that there is no penalty for a codec which greatly boosts the bit-rate on some item (e.g. up to a factor of 1.5), when compared to another codec which does not boost the VBR rate on the same item (e.g. stays close to 1.0), even if both codecs end up providing (roughly) the same audio quality. Again, the a-priori bit-rate calibration is assumed to not have revealed this behavior, since set X is much smaller than S.

Loosely following this thread I also conclude that most contributors find it acceptable that, given such a scenario, there is no such penalty. Is this correct?


It seems I'm the only one who thinks there should be a penalty. Such a penalty is called bias correction after the test, but it's much better to avoid the situation altogether. If L is properly sampled from S there will be no such problem – the average VBR rates will be equal (with some error, which can be controlled by varying the size of L). At the moment there is no relation between S and L (L does not belong to the population S); as a result, the VBR rates of the different encoders on set L vary randomly. In the HA@96 listening test the maximum difference between rates is 8% (93-101). This is a fundamentally incorrect test design. You can't configure codecs with one body of music but test them with another.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-27 01:13:31
The inefficiency is, of course, detectable by the traditional HA method.

I'm sure that if the test were properly designed it could detect inefficiency even better.

I don't think fine-tuning the q-value on every sample is "properly designed". Most users don't do that. We can enlarge the set of samples to address the correlation concern.

Furthermore, calculation of the “global” bitrates cannot be implemented in practice. Nobody actually knows what that music universe looks like – what size it has, what structure, how it changes over time, or how to get access to all of it. “The whole music universe” is an entirely unscientific quantity; we can only guess at some of its properties.

Likewise, many “global” pharmaceutical and other investigations involving humans don't test North Koreans. If we were to draw generalized conclusions about women, we would have to test all women on the planet, which is impossible, or randomly pick a large enough number of women across the globe to test. So, strictly speaking, they would have to pick one North Korean woman per 300 women if we wanted a generalized statement about women. That step is typically omitted. Still, the results are typically highly applicable to North Korean women, and foreign humanitarian medical aid has produced many positive results. We are all Homo sapiens. We act like humans, we do what humans like to do, and we create what is pleasing to human beings.

Of course we can get some idea about the music universe using a limited amount of music. The problem with the current listening test design is (to use your analogy) that the pharma company studies the preferences of women on a global scale, sampling them from different countries, and then tests its products on North Korean women.

Your concern could be solved by the concept of 'effect size'. We test North Korean women and non-North Korean women, and if the effect size is zero or small, the study of the rest of the globe can safely be applied to North Korean women as well. By the way, if you believe the effect size is big, it's your job to demonstrate it. Otherwise you could go to a hospital, question the applicability of studies to nerds, queers, immigrants, and amateur golfers, and stop any therapy there.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-27 08:50:22
I don't think fine-tuning the q-value on every sample is "properly designed". Most users don't do that. We can enlarge the set of samples to address the correlation concern.

I mentioned that this scenario is unrealistic (but perfectly valid). Fine-tuning with the test set is the next option.

Your concern could be solved by the concept of 'effect size'. We test North Korean women and non-North Korean women, and if the effect size is zero or small, the study of the rest of the globe can safely be applied to North Korean women as well.

But not vice versa – applying a study of North Korean women to the rest of the globe, as is done in the current design of the listening test.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-27 09:14:24
Finally, the puzzle of equal bitrates for VBR encoders is solved. Here is how.

A set of test samples L for a listening test can be obtained in two ways – (1) sampled from some population of music S, or (2) chosen independently according to some criteria, problem samples for example.

In case (1), if the test set is sampled properly (sufficiently and randomly), the target bitrates of the encoders are equal on both S and L, so it doesn't matter which sound material they are calculated from – the whole population S or the selected samples L. If it is possible to find settings at which the target bitrates are equal, then the results of such a listening test compare the encoders' VBR efficiency. If the bitrates can't be set equal (due to the discontinuity of q-parameters), then the results compare the encoders at specific settings. Such specific settings can be of only one kind – natural (integer) ones (as the bitrates can't be set equal on S, and consequently on L, all other settings are just random, without any meaning).

In case (2), the test set L is already predefined, and the population of music S from which it is sampled is undefined (a population of problem samples would be the best guess). Consequently there is no possibility of calculating bitrates (and the corresponding settings) on S. Any attempt to do this with some other music population leads to random variance of bitrates on the test set L, because the latter is not representative of a music population chosen out of the blue. That random variance in turn leads to variance of the results, making them less accurate. Thus in case (2) target bitrates can be calculated only on the test set L (no other sound material is present in the context of such a listening test). As in the first case, there are two choices – to make the bitrates equal on the test set L (the results then compare VBR encoder efficiency) or to use natural (integer) values (the results then compare popular settings). All other settings are just random, without any meaning.

In case (1) the results of the listening test are biased towards the population of music S chosen for the test (some genre or a mix of genres). In case (2) the results are biased towards the particular test set L.

Case (1) needs many more sound samples in the test set, because the results are meant to generalize to the whole population S. All listening tests ever conducted belong to case (2) – the test set was chosen according to some criteria (problem samples, usual samples ...) but never sampled from a population as in (1). And the reason is quite obvious: with the larger number of samples that case (1) needs, the test becomes labor-intensive, while the results are hardly better than with problem samples.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-27 10:24:37
I don't think fine-tuning the q-value on every sample is "properly designed". Most users don't do that. We can enlarge the set of samples to address the correlation concern.

I mentioned that this scenario is unrealistic (but perfectly valid). Fine-tuning with the test set is the next option.

Fine-tuning on the test set and fine-tuning on a large set of samples both produce roughly the same result anyway. And I like it.

Your concern could be solved by the concept of 'effect size'. We test North Korean women and non-North Korean women, and if the effect size is zero or small, the study of the rest of the globe can safely be applied to North Korean women as well.

But not vice versa – applying a study of North Korean women to the rest of the globe, as is done in the current design of the listening test.

The testers and tested samples came from all over the world. Norway, France, Germany, Argentina, Japan.... Not a single person came from North Korea, though.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-27 13:00:29
Loosely following this thread I also conclude that most contributors find it acceptable that, given such a scenario, there is no such penalty. Is this correct?

Chris,

Yes, we all agree on it. Serge Smirnoff is arguing with himself.




Loosely following this thread I also conclude that most contributors find it acceptable that, given such a scenario, there is no such penalty. Is this correct?


It seems I'm the only one who thinks there should be a penalty.

Finally You have realized it.  Alleluia!





I will ask the moderators to split the thread. This particular test is one thing; the endless disagreement between Hydrogenaudio and Sound Express is another.
We shouldn't have to suffer from it here.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-27 14:00:43
I can provide samples which aren't available on HA any more.

Yes, please. Many samples have gone offline.

Regarding sample selection:

Can we assume the sample pool of the 2011 test is included? In any case, I (still) recommend the test set I constructed in 2010, which Igor already kindly mentioned here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=852761):
http://www.hydrogenaudio.org/forums/index....st&p=695576 (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=77584&view=findpost&p=695576)
IIRC only BerlinDrug was actually chosen from that list in the 2011 test. One of the samples, CantWait, is stereo-miked a-cappella male singing, which nicely fits the category TheBashar suggested here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853668).

Agreed, this set of samples is fantastic. It will be great to see at least a good part of them (if not all of them).
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: halb27 on 2013-12-27 16:38:48
ReplayGain can prevent clipping only if it receives floating-point data; otherwise the samples are already clipped. ...

You make me worry. Do you know if the Rockbox RG mechanism does it right? Is it safe to assume that any player that provides a RG based 'prevent clipping' option does it well?
Sorry for being OT for a moment.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-27 17:04:49
Rather than work off a testimonial that does not satisfy the requirements of this forum, I would like to see proof that this will be a legitimate issue with the samples in question.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: nu774 on 2013-12-27 17:07:29
ReplayGain can prevent clipping only if it receives floating-point data; otherwise the samples are already clipped. ...

You make me worry. Do you know if the Rockbox RG mechanism does it right? Is it safe to assume that any player that provides a RG based 'prevent clipping' option does it well?
Sorry for being OT for a moment.

As far as I can see, Rockbox is using int32 as its internal sample format for DSP.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-27 20:24:17
Rather than work off a testimonial that does not satisfy the requirements of this forum, I would like to see proof that this will be a legitimate issue with the samples in question.

I understand your caution. But something should be done to avoid a flood of 100+ line posts like this one (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853699). I don't even read those. It really slows down the discussion.

The organizers (including me) are open to criticism and suggestions. You can ask the people who are involved in the test.

Though let's face it: watch out where the criticism comes from. Sound Express has received very negative criticism for his tests. And it's not just me.

He speaks here about "mathematical perfection": "a bitrate should be exactly the same".
To begin with, this doesn't take into account that different formats have different overheads, so there will be a 2-3% difference in overhead if samples are short enough (10 seconds). (At 96 kbps a 10-second sample is roughly 120 kB, so a fixed 2-3 kB of container headers and padding already amounts to 2-3%.)
And that was only "to begin with..."

2-3%. So what has happened to mathematical perfection? It's not perfect anymore. Not even close.
Math is a very good tool, but without good context and interpretation it has little value.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: bandpass on 2013-12-27 21:04:24
I think we should take care to normalize the volume of the clips over the specific range that will be tested.

Yes, normalization is always done.

Are the details available of the normalisation method (and any other pre-processing) that will be used?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-27 21:45:24
I understand your caution. But...

Care to elaborate rather than use my comment as an excuse to berate Serge?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-27 22:20:35
I don't get your reaction.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-27 22:29:52
IgorC, in case you missed it: during this discussion I analysed your/HA's listening test setup. This analysis has no connection to the SE tests at all. I did it because I intuitively saw the flaw but couldn't prove my suspicions. Those long posts reflect my progress in the above analysis. It was research in real time, if you like. Finally I found the flaw and disclosed it in this post (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=103768&view=findpost&p=853777). In short, calculating target bitrates from a huge aggregated music library makes no sense in the context of the listening test and leads to inaccurate final codec scores. This became possible due to incorrect use of statistical methods (incorrect sampling). The flaw is serious and affects not only the current test but also previous ones (HA@64 and HA@96 at least); it does not completely invalidate them, but it changes the interpretation of the results and the generalizations that can be drawn. I call for serious examination of the issue while it is not too late. If this is a scientific discussion, let's discuss arguments and figures, not personalities.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-27 23:05:20
Quote
Are the details available of the normalisation method (and any other pre-processing) that will be used?

Normalization of the decoded .wav files is done in ABC/HR Java. http://listening-tests.hydrogenaudio.org/i.../ABC-HR_bin.zip (http://listening-tests.hydrogenaudio.org/igorc/aac-96-a/ABC-HR_bin.zip)
It was mentioned before that there may be better ways to do normalization.
Steve Forte Rio has raised the question of pre-normalizing the source files before encoding as well.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: greynol on 2013-12-28 04:07:34
I don't get your reaction.

Nor I yours.

Does it make sense to take extra precautions to ensure samples do not clip during decoding to fixed-point?
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: C.R.Helmrich on 2013-12-28 13:00:35
Yes, please. Many samples have gone offline.

OK. I'm on vacation right now; I'll provide the necessary samples around Jan 8. Btw, I found this thread (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=4602) and remembered that I also have a backup of the samples from the now defunct/hijacked website ff123.net/samples. I can provide those as well.

Regarding clipping during decoding: I tend to avoid clipping in any case in my own listening test. That's why my 2010 HA set peaks at around -1 dBFS.

Chris
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-28 16:15:55
As one of the organizers, I want to express (just express, not get involved in endless debates) my opinion about some questions that have been discussed here recently.

1. Quality settings selection. What is actually an indicator of encoding effectiveness? It is the "quality/size" ratio of course, or better, the "quality/average bitrate" ratio. If we want to compare the quality of encoders at some bitrate, we first of all need to get absolutely identical bitrates. The problem is how to do this for VBR mode. Here we must first understand who uses encoding, and how; that is, we must know the target audience of our test. Then we must understand that people encode absolutely different kinds of music material. So to get the average bitrate (ABR) for some quality setting, ideally we must analyze all the music contained in our target audience's libraries, or make a selection that represents, for example, the genre distribution of those libraries. This is what we are actually trying to do. And here is the only right way: 1. maximize the number of analyzed tracks; 2. maximize their diversity; 3. bring the genre distribution (I mean the percentage ratio) as close as possible to the actual distribution in our target audience's libraries.

So the greater the number of tracks and their diversity, the closer our result is to the "true value". By the same logic, if we decrease the size of the analyzed material, the random error rises and the results become less accurate. So analyzing the resulting bitrate of just 20-40 samples isn't the way to go, because we need a value that holds on average for a maximally large music library.

On the other hand, I saw some fair comments here about sample selection. But that's another step; see below.

2. Sample selection. Again, ideally we must orient ourselves by our target audience's taste. So, about that comment: we really need a set of samples that represents the variety of music people listen to in real life. But in that case we must clearly understand who our target audience is, because if we analyze the average musical taste of all people in the world, we will get a majority of pop music. So we need to think about who would use these mostly "enthusiast" audio formats, and for what. Then, again, we would make a statistical analysis of their music libraries, group recordings by genre (or some other attributes), and then make a random selection of a small number of samples from each group. But this is the ideal. This way would lead to a proper analysis of the encoders' effectiveness, averaged over the average audio enthusiast's library. In practice it is unimplementable, so we need to go another way.

Above I've mentioned "some other attributes" for grouping the music material. This is also a reasonable decision: we group our samples by their kind – sharp transients, pure tones, wide stereo. Note that we should select really complex samples (but not just killer samples; in particular we should not use completely synthetic, non-musical samples), because in real life most music is easy to encode and will give no audible differences after encoding.
So in this case we have the following requirements:

1) Collecting a large number of hard-to-encode (but musical) samples.
2) Grouping them by some attributes, for example by signal type (pure tone/transient/stereo); in each group we may then include representatives of each music genre to make the testing even more objective. We must also include in each group samples with and without vocals, and so on.
3) After grouping, we make a random selection of samples from each group (subgroup).

This way we will get results that are less objective than in the first case (with analysis of the percentage ratio for each genre), but much more informative. After testing we can present not only average results over all samples, but also results for each group of samples, so we'll be able to evaluate the behaviour of each codec on each group (for example, to compare which codec is better for transients). That's why we need to increase the number of samples – because we must have at least a few samples in each subgroup.

Also, for users who won't test all of the samples, we need to make their selection of samples completely random.
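
A minimal sketch of the grouped (stratified) random selection described in step 3 above; the group labels and counts are purely illustrative:

Code: [Select]
import random

def stratified_pick(pool, k, seed=42):
    """pool maps a group label (e.g. 'transient', 'tonal', 'wide stereo') to a list
    of candidate sample names; draw k samples at random from each group."""
    rng = random.Random(seed)
    picked = []
    for group in sorted(pool):
        picked.extend(rng.sample(pool[group], k))  # raises if a group has fewer than k
    return picked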
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-28 16:33:39
3. And going back to the question of clipping prevention. I raised this question, and now, after deliberation, I want to say what I think about it. I think we must completely eliminate clipping from our test, because we test encoders and encoding, and during encoding, exceeding full scale does not in itself mean quality loss. Clipping is a problem of subsequent decoding and processing, and we must test the maximum potential of a codec, which includes decoding to 32-bit float, for example, in which case there will be no additional quality loss (clipping).

So eventually I suggest encoding the original samples, then decoding to 32-bit float and peak-normalizing every sample to 0 dBFS (an equal gain level across each group of decoded results). For example, we encode Sample 1 to AAC, Opus and MP3. After decoding to 32-bit float we get the maximum peak value on LAME: +2 dBFS. Then we decrease the level of each decoded result by 2 dB and convert the data from 32-bit float to 16-bit (which is supported by ABC/HR). This is an example of how to play lossy audio with maximum quality (e.g. using foobar2000 RG/prevent clipping).
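
A sketch of this equal-gain step (NumPy assumed; decoded maps a codec name to its 32-bit float decode of one sample):

Code: [Select]
import numpy as np

def normalize_group(decoded):
    """Attenuate all decoded versions of one sample by a single gain that brings the
    loudest peak in the group to 0 dBFS, then convert to 16-bit for ABC/HR."""
    peak = max(np.abs(x).max() for x in decoded.values())
    gain = 1.0 / peak  # e.g. a +2 dBFS peak (about 1.26) gives about -2 dB of gain
    return {name: (np.clip(x * gain, -1.0, 1.0) * 32767).astype(np.int16)
            for name, x in decoded.items()}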
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: bandpass on 2013-12-28 20:23:14
we must test the maximum potential of a codec, which includes decoding to 32-bit float, for example

Providing some head room at the input to the coder seems safest. I hope this means that post decoding normalisation would not be needed, but if it is, loudness-based, rather than peak-based seems safer.

In case the input sample is not correctly band-limited (due to loudness-wars smashing, or due to synthesised waveforms), band-limiting the sample before coding is also a good precaution.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Serge Smirnoff on 2013-12-28 20:46:52
1) Collecting a large number of hard-to-encode (but musical) samples.
2) Grouping them by some attributes, for example by signal type (pure tone/transient/stereo); in each group we may then include representatives of each music genre to make the testing even more objective. We must also include in each group samples with and without vocals, and so on.
3) After grouping, we make a random selection of samples from each group (subgroup).

This way we will get results that are less objective than in the first case (with analysis of the percentage ratio for each genre), but much more informative. After testing we can present not only average results over all samples, but also results for each group of samples, so we'll be able to evaluate the behaviour of each codec on each group (for example, to compare which codec is better for transients). That's why we need to increase the number of samples – because we must have at least a few samples in each subgroup.

I like this idea too. Maintaining a properly segmented bank of sound samples would be helpful for many audio researchers and enthusiasts. Along with killer samples, such a bank could contain "ordinary" ones of different types – genres, voices, noisy/dirty, clipped ... . Some system of tags could be sufficient for the purpose. Then, depending on the goal of the test, samples with appropriate tags can be randomly selected. The use of ordinary (usual) sound samples for listening tests is common practice, especially for testing low-bitrate encoders. For tests at 96 kbps and above, ordinary sound material is of little use. And this is another reason for not using, in 96+ tests, sample sets that are representative of some population of music – that ends up with a large set of ordinary samples which are very hard to test at 96+. On the other hand, this approach could be interesting for testing at 24-64 kbps.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-28 21:04:04
we must test the maximum potential of a codec, which includes decoding to 32-bit float, for example

Providing some head room at the input to the coder seems safest.


Normalizing this way requires 2 or more passes to tune the gain value (we can't predict how far each encoder will go over the original peak value). These are just needless difficulties and give no advantage over after-decoding normalisation (which is performed in foobar2000's ABX, for example).

Quote
I hope this means that post decoding normalisation would not be needed, but if it is, loudness-based, rather than peak-based seems safer.

Loudness-based normalizing will be done by ABC HR as well.

Quote
In case the input sample is not correctly band-limited (due to loudness-wars smashing, or due to synthesised waveforms), band-limiting the sample before coding is also a good precaution.

I don't really think this is needed; it may even be wrong (it doesn't correspond to real encoding conditions), because we test the whole encoding algorithm, including low-pass filtering, and must keep the material for encoding in its original form.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-29 00:14:36
In case the input sample is not correctly band-limited (due to loudness-wars smashing, or due to synthesised waveforms), band-limiting the sample before coding is also a good precaution.

I don't really think this is needed; it may even be wrong (it doesn't correspond to real encoding conditions), because we test the whole encoding algorithm, including low-pass filtering, and must keep the material for encoding in its original form.


I agree that we should not be doing band-limiting.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: TheBashar on 2013-12-29 00:17:47
Oops
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Kamedo2 on 2013-12-29 12:19:09
3. And going back to the question of clipping prevention. I raised this question, and now, after deliberation, I want to say what I think about it. I think we must completely eliminate clipping from our test, because we test encoders and encoding, and during encoding, exceeding full scale does not in itself mean quality loss. Clipping is a problem of subsequent decoding and processing, and we must test the maximum potential of a codec, which includes decoding to 32-bit float, for example, in which case there will be no additional quality loss (clipping).

If many decoders clip, my recommendation is to let it clip. LAME deals with it by decreasing the volume (by about 2%), so it won't be painfully bad. If you decode to 32-bit float and decrease the volume by 25%, almost all clipping can be avoided, but I want the test to be close to people's actual listening conditions, including portable players.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Steve Forte Rio on 2013-12-29 13:38:06
3. And going back to the question of clipping prevention. I raised this question, and now, after deliberation, I want to say what I think about it. I think we must completely eliminate clipping from our test, because we test encoders and encoding, and during encoding, exceeding full scale does not in itself mean quality loss. Clipping is a problem of subsequent decoding and processing, and we must test the maximum potential of a codec, which includes decoding to 32-bit float, for example, in which case there will be no additional quality loss (clipping).

If many decoders clip, my recommendation is to let it clip. LAME deals with it by decreasing the volume (by about 2%), so it won't be painfully bad. If you decode to 32-bit float and decrease the volume by 25%, almost all clipping can be avoided, but I want the test to be close to people's actual listening conditions, including portable players.


In real conditions there are also cases where simplified decoders reduce playback quality, and many other things that could affect quality. These are problems of the device's design.

Also, for portable players I recommend using the MP3Gain/AACGain utilities. If users want maximum quality, they should use them before uploading tracks to their portable devices (or other hardware). And again, we must target encoding quality and irreversible quality losses only.

Finally, clipping can introduce unneeded deviations into the quality comparison that will not apply to properly played-back audio (with clipping prevention) in reality.

I think the problem of clipping (and its audibility) should be investigated in separate research.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-29 14:05:17
Hello, Guys.


Some of you have already received an invitation to a new forum, "Audio Folks".
It's a provisional one until we get an official site ready. http://audiofolks.proboards.com/ (http://audiofolks.proboards.com/)

It's good to have an alternative. Now we have more flexibility to organize different topics, more fluid discussion, etc.
We are considering moving everything there in the coming days. So we're waiting for you there. Please register and help us do a good test, as we've done previously.

STATUS of a Public Multiformat Listening Test (http://audiofolks.proboards.com/thread/2/status-public-multiformat-listening-test)
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-29 14:31:48
As people have started to register at the new place, this thread is considered obsolete.

No more discussion here. We are winding it down.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: Alexxander on 2013-12-29 14:37:18
A kind of OT, or not.

...
It's a provisional one until we get an official site ready. http://audiofolks.proboards.com/ (http://audiofolks.proboards.com/)

It's good to have an alternative. Now we have more flexibility to organize different topics, more fluid discussion, etc.
We are considering moving everything there in the coming days. So we're waiting for you there. Please register and help us do a good test, as we've done previously.
...

I'm confused. A new site to continue talking about the same listening test? What can't be done on HA that would therefore need a new site?
I don't have much time and am not eager to spend more time visiting another audio site.

Edit: while I was writing this post, IgorC posted that this discussion thread will be ended right now. I guess I won't participate in the listening test. Very confusing.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-29 14:49:43
Alexxander,

I have sent a PM  to You.

It is the end here.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-29 15:04:40
Alexxander,

There is no change.
We're still the same people (participants, organizers, etc.) who conducted the previous tests.
Just a new place.
Title: New Public Multiformat Listening Test (Jan 2014)
Post by: IgorC on 2013-12-29 15:15:02
I will ask administrators to close this discussion.

Organization of this  test has moved to Audio Folks (http://audiofolks.proboards.com/thread/4/public-multiformat-listening-test-discussion)

Thank You.