Public Listening Test [2010]

Reply #275
Thank you, Alex.

Speaking of the low anchor, Emese is the hardest sample I've ever seen. The 64 kbps low anchor is actually not that awful for the rest of the samples.

Chris has prepared a concatenated reference sample.
Link to download: h*tp://www.mediafire.com/?yhwmwzjgmm3
I've also attached some possible low anchors: iTunes 64 CBR, iTunes 64 CVBR and CT 80. I used CT for the 80 kbps low anchor because Apple's encoder has a bug in LC-AAC at 80-96 kbps.

01 BerlinDrug
02 AngelsFallFirst
03 CantWait
04 CreuzaDeMä
05 Ecstasy
06 FallOfLife_Linchpin
07 Girl
08 Hotel_Trust
09 Hurricane_YouCant
10 Kalifornia
11 Memories
12 RobotsCut
13 SinceAlways
14 Triangle_Glockenspiel
15 Trumpet_Rumba
16 Waiting

Reply #276
I tried the files.

Personally I don't think the low anchor is optimal when the first thing that you hear is the obvious low-pass that makes the encoding entirely different from the others.

I'd like to suggest FAAC (v.1.28 from rarewares) with an adjusted low-pass frequency, for instance:

-q 35 -c 18000

I tried the above and it works pretty well with the concatenated sample.

EDIT

If it turns out to be too good or too bad for a specific sample, the q value could be adjusted for that sample.

Reply #277
iTunes CVBR 64 is noticeably better than FAAC -q 35 -c 18000.

Reply #278
iTunes CVBR 64 is resampled to 32 kHz and low-passed at about 12 kHz; otherwise it sounds pretty "clean". It doesn't really help listeners understand what kinds of artifacts (distortion, noise, pre-echo, etc.) the sample may produce.

If -q 35 is too bad, a higher value can be used.

In addition it would be better to include only 44.1 kHz samples. Sample rate switching may produce additional problems with the ABC-HR program, some operating systems, and/or some sound devices.

Reply #279
Hm, good points indeed.

Then we should encode with FAAC, using a -q value above 35 and -c 18000, for the low anchor.

Reply #280
If we're looking for an anchor emphasizing artifacts to be expected, why not use MP3, e.g. LAME CBR at the lowest setting which doesn't downsample to 32 kHz? I think we could actually use the old "version 1.0" Fraunhofer encoder from 1994(?) with an additional 16-kHz lowpass filter applied before encoding (that should avoid the bug).

Edit: The more I think of it, the more I believe we should use two anchors to stabilize the results: one to define the lower end of the grading scale, the other to define a mid-point of the scale. For the lower end, I just imitated the world's first audio encoder: our test set downsampled to 8 kHz using Audition and saved as an 8-bit µ-Law stereo Wave file. That's a 128-kb encoding. Nice demonstration of how far we've come in the last 40 years or so.
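The 128-kb figure follows from the usual bit-rate arithmetic for a fixed-size sample format (sample rate × bits per sample × channels). A quick check in Python; the function name is only illustrative:

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> float:
    """Bit rate of a fixed-size-sample stream in kbit/s (1 kbit = 1000 bit)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


# 8 kHz, 8 bits per sample, stereo, as described in the post above
anchor_kbps = pcm_bitrate_kbps(8000, 8, 2)

# 44.1 kHz, 16-bit, stereo reference, for comparison
reference_kbps = pcm_bitrate_kbps(44100, 16, 2)

print(anchor_kbps, reference_kbps)
```

So the anchor uses roughly a tenth of the bit rate of the uncompressed 44.1 kHz reference.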

µ-Law file: http://www.materialordner.de/wsRJHTtgLzlgF...TouJw5xomU.html

Edit 2: When using the µ-Law file as anchor, of course it will be upsampled to 44.1 kHz again.

Maybe a 96-kb MP3 would be just fine for an intermediate anchor.

Edit 3: Can someone please upload Fraunhofer's 1994 encoder (l3enc 0.99) here? Roberto's original page expired.

Chris

Reply #281
Regarding the splitting of the concatenated encodes: I think we should use CopyAudio by Kabal et al. from McGill University to simply cut the WAV decode into the appropriate chunks:

http://www-mmsp.ece.mcgill.ca/Documents/Downloads/AFsp/

Reasons:

  • We can cut off the first 2.1 seconds of the test set, i.e. the HA introduction (stabilization part for CBR encoders).
  • We don't have to worry about encoder delay: we can split accordingly since the delay is known in advance.
  • CopyAudio can be sent to the listeners since it's freeware, and it's available for Linux/Mac and Windows. This allows us to provide scripts for Linux/Mac and Windows, run by the listeners, to prepare the entire test from the concatenated .m4a encodes, i.e. decode to WAV and split into separate files.
  • We could even handle resampling for the anchor(s): there's also a tool ResampAudio in the AFsp package.
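To make the splitting idea concrete, here is a minimal Python sketch of how the cut points could be computed before handing them to a tool such as CopyAudio. The item durations and the delay value below are hypothetical placeholders, not the real test set or any real codec's delay, and the function name is made up for illustration.

```python
def split_points(durations_s, fs=44100, intro_s=2.1, delay_samples=0):
    """Return (start, end) sample indices, end exclusive, for each item.

    durations_s   -- item lengths in seconds, in concatenation order
    intro_s       -- leading introduction to skip (2.1 s in the proposal above)
    delay_samples -- known codec delay, in samples, to skip as well
    """
    points = []
    start = round(intro_s * fs) + delay_samples
    for duration in durations_s:
        length = round(duration * fs)
        points.append((start, start + length))
        start += length
    return points


# Hypothetical example: two items of 10 s and 12.5 s, assumed delay of 1024 samples
cuts = split_points([10.0, 12.5], fs=44100, intro_s=2.1, delay_samples=1024)
for first, last in cuts:
    print(first, last)
```

Because the delay is added once to the first start offset, every later item is shifted by the same amount, which is why a delay known in advance is enough to split accurately.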

What do you guys think? If you agree, I'll write "prepare_test.bat" and "prepare_test.sh" Windows and Linux scripts for the ABC/HR package over the weekend.

Chris

Reply #282
OK, Chris, your applications are better.
I'm also fine with any of the low anchors, so FAAC or LAME is just fine.
PM or email me about how you want to proceed.


Reply #284
Thanks a lot, lvqcl! I tried a 112-kb encode with the (apparently bug-free) l3enc version 2.60 (Linux version). The quality is actually too good for a mid-anchor. Unfortunately, 96 kbps doesn't work in the unlicensed version. We are currently investigating LAME at 96 kb and 44.1 kHz sampling rate as anchor.

For the record, the lower anchor will be created and decoded with the following commands. This yields a delay-free anchor.

Code:
ResampAudio.exe -s 8000 -f cutoff=0.087 -D A-law -F WAVE ha_aac_test_sample_2010.wav ha_aac_test_sample_2010_a-law8.wav
ResampAudio.exe -s 44100 -D integer16 -F WAVE ha_aac_test_sample_2010_a-law8.wav ha_aac_test_sample_2010_a-law.wav
del ha_aac_test_sample_2010_a-law8.wav
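For readers unfamiliar with A-law: it is a logarithmic companding scheme that spends the 8 bits more finely on quiet samples than on loud ones. The following Python sketch uses the textbook continuous A-law curve with A = 87.6 followed by a plain uniform quantizer. It is an illustration only and is not bit-exact with G.711 or with whatever ResampAudio does internally; the function names are my own.

```python
import math

A = 87.6  # standard A-law compression parameter
_DENOM = 1 + math.log(A)


def alaw_compress(x: float) -> float:
    """Continuous A-law compressor for x in [-1, 1]."""
    ax = abs(x)
    if ax < 1 / A:
        y = A * ax / _DENOM
    else:
        y = (1 + math.log(A * ax)) / _DENOM
    return math.copysign(y, x)


def alaw_expand(y: float) -> float:
    """Inverse of alaw_compress."""
    ay = abs(y)
    if ay < 1 / _DENOM:
        x = ay * _DENOM / A
    else:
        x = math.exp(ay * _DENOM - 1) / A
    return math.copysign(x, y)


def alaw_roundtrip(x: float, bits: int = 8) -> float:
    """Compress, quantize uniformly to `bits` bits, then expand again."""
    levels = 2 ** (bits - 1) - 1
    quantized = round(alaw_compress(x) * levels) / levels
    return alaw_expand(quantized)


def uniform_roundtrip(x: float, bits: int = 8) -> float:
    """Plain uniform quantization to `bits` bits, for comparison."""
    levels = 2 ** (bits - 1) - 1
    return round(x * levels) / levels


quiet = 0.01
print(abs(alaw_roundtrip(quiet) - quiet), abs(uniform_roundtrip(quiet) - quiet))
```

For a quiet sample like the one above, the companded round trip lands much closer to the input than plain 8-bit quantization does, which is why the scheme was used for telephony.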

Chris

Reply #285
What do you think about getting GXLame in as a low anchor (or even a competitor in a non-AAC test)? It's a low-bitrate MP3 encoder, so it just might fit the bill somewhere between V0-V30 (V20 averages 96kbps and defaults to 44kHz).
Copy Restriction, Annulment, & Protection = C.R.A.P. -Supacon

Reply #286
I don't understand why two low anchors would be needed. Wouldn't it be better to let the "mid" anchor define where the lower end of the scale is? Then there would possibly be a bit wider scale for the contenders. Ideally the low anchor would then get 0-3 and the contenders 2-5. IMHO, it would be enough that there is one low anchor that can be detected more easily than the actual contenders.

Also, I don't understand why some old/mediocre MP3 encoder/setting would make a better low anchor than FAAC. FAAC would nicely represent the basis of the more developed AAC encoders. FAAC can be adjusted freely to provide the desired quality level. "-q 35 -c 18000" worked for me, but perhaps -q 38, -q 40 or so would work as well.

In general, it would be desirable that all encoders, including the low anchor, are easily available so that anyone can reproduce the test scenario (for verifying the authenticity of the results) or test different samples/encoders using/including the tested encoders and settings in order to get comparable personal results. Also the procedure to decode and split the test sample should be reproducible by anyone.

Reply #287
Quote
I don't understand why two low anchors would be needed. Wouldn't it be better to let the "mid" anchor define where the lower end of the scale is? Then there would possibly be a bit wider scale for the contenders. Ideally the low anchor would then get 0-3 and the contenders 2-5. IMHO, it would be enough that there is one low anchor that can be detected more easily than the actual contenders.

Use of two anchors follows the MUSHRA methodology and is an attempt at making the grading scale of this test more absolute. After all, all encoders in this test sound quite good compared to old/simple encoding techniques or lower bit rates. As the name implies, the lower anchor shall define the lower end of the scale and should give the listeners an idea of what we mean by "bad quality" (range 0-1). The hope then is that this reduces the confidence intervals (grade variance) for the other coders in the test, including the mid anchor (which should end up somewhere in the middle of the grading scale).
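As a rough illustration of what "reducing the confidence intervals" means, here is a small Python sketch that computes a mean grade and an approximate 95% interval half-width. The grades are invented for the example, and the normal approximation (factor 1.96) is a simplification that is not necessarily the analysis planned for this test.

```python
import math
import statistics


def mean_and_ci95(grades):
    """Mean grade and approximate 95% confidence half-width (normal approximation)."""
    mean = statistics.mean(grades)
    spread = statistics.stdev(grades)  # sample standard deviation
    half_width = 1.96 * spread / math.sqrt(len(grades))
    return mean, half_width


# Hypothetical grades for one codec, made up purely for illustration
consistent_listeners = [4.0, 4.1, 3.9, 4.0, 4.2, 3.8]
inconsistent_listeners = [3.0, 5.0, 2.5, 4.5, 3.5, 5.0]

print(mean_and_ci95(consistent_listeners))
print(mean_and_ci95(inconsistent_listeners))
```

If anchors lead listeners to use the scale more consistently, grades look more like the first list and the interval shrinks, which makes differences between codecs easier to establish.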

Quote
Also, I don't understand why some old/mediocre MP3 encoder/setting would make a better low anchor than FAAC. FAAC would nicely represent the basis of the more developed AAC encoders. [...]

Actually, it seems it doesn't. In my first informal evaluation, I noticed that FAAC is tuned very differently than the other AAC encoders in the test (less pre-echo, more warbling), and it seems LAME@96kb emphasizes the artifacts of the codecs under test (pre-echo, warbling on tonal sounds, etc.) better than FAAC@64. Btw, the bandwidth of LAME@96 is close enough to the codecs under test (around 15 kHz).

Quote
In general, it would be desirable that all encoders, including the low anchor, are easily available so that anyone can reproduce the test scenario (for verifying the authenticity of the results) or test different samples/encoders using/including the tested encoders and settings in order to get comparable personal results. Also the procedure to decode and split the test sample should be reproducible by anyone.

Agreed. Igor and I are working on scripts, run by the listeners, which do all the decoding and splitting of the bit streams and creation of the (decoded) anchors. My commands for the lower anchor above are a first attempt at this.

Chris

Reply #288
Quote
Ideally the low anchor would then get 0-3 and the contenders 2-5. IMHO, it would be enough that there is one low anchor that can be detected easier than the actual contenders.

Quote
As the name implies, the lower anchor shall define the lower end of the scale and should give the listeners an idea of what we mean by "bad quality" (range 0-1).

The ITU-R five grade impairment scale that is used is between 1 (Very Annoying) and 5 (Imperceptible).
Bad quality would be in range 1-2, probably closer to 1.

Reply #289
Quote
Use of two anchors follows the MUSHRA methodology and is an attempt at making the grading scale of this test more absolute. After all, all encoders in this test sound quite good compared to old/simple encoding techniques or lower bit rates. As the name implies, the lower anchor shall define the lower end of the scale and should give the listeners an idea of what we mean by "bad quality" (range 0-1). The hope then is that this reduces the confidence intervals (grade variance) for the other coders in the test, including the mid anchor (which should end up somewhere in the middle of the grading scale).

In the past 48 and 64 kbps tests, most samples were difficult for me because the low anchor was too bad and the remaining scale wasn't wide enough to correctly express the differences between the easier and more difficult samples. I.e., the low anchor always sounded like a "telephone" and got "1". The actual contenders were considerably better, but never close to transparency. So the usable scale for the contenders was mostly from 2.0 to 3.5. Actually, even then the grade "2" was a bit too low for correctly describing the difference between the low anchor and the worst contender. At the other end of the quality scale, the difference between the reference and the best contender was always significant, and anything above 4 would have been too much for the best contenders.

Of course the situation is different in a 128 kbps AAC test, but there is a danger that the two anchors will occupy the grades 1-4 and the actual contenders will get 4-5 and once again be more or less tied even though the testers actually could hear clear differences between the contenders.

Quote
Actually, it seems it doesn't. In my first informal evaluation, I noticed that FAAC is tuned very differently than the other AAC encoders in the test (less pre-echo, more warbling), and it seems LAME@96kb emphasizes the artifacts of the codecs under test (pre-echo, warbling on tonal sounds, etc.) better than FAAC@64. Btw, the bandwidth of LAME@96 is close enough to the codecs under test (around 15 kHz).

I see. I didn't actually try to do that kind of complex cross-comparison so you know more about this than I. You could have posted the explanation earlier... 

Reply #290
Quote
The ITU-R five grade impairment scale that is used is between 1 (Very Annoying) and 5 (Imperceptible).
Bad quality would be in range 1-2, probably closer to 1.

Oops. That's my mistake and probably Chris just repeated it. I wrote the reply a bit hastily. By default ABC/HR for Java shows five integer grades from 1 to 5 (though that is configurable).

Reply #291
Quote
Of course the situation is different in a 128 kbps AAC test, but there is a danger that the two anchors will occupy the grades 1-4 and the actual contenders will get 4-5 and once again be more or less tied even though the testers actually could hear clear differences between the contenders.

The method of statistical analysis which we will be using this time will take care of this: http://www.aes.org/e-lib/browse.cfm?elib=15021 Getting two MUSHRA-style anchors (one for worst quality, one for intermediate quality, and hidden reference for best quality) into our test allows us to use MUSHRA-style evaluation for our test, as stated in the referenced paper.

Quote
I see. I didn't actually try to do that kind of complex cross-comparison so you know more about this than I. You could have posted the explanation earlier... 

Sorry, I only did these tests a few days ago.

Chris

Reply #292
Quote
What do you think about getting GXLame in as a low anchor (or even a competitor in a non-AAC test)? It's a low-bitrate MP3 encoder, so it just might fit the bill somewhere between V0-V30 (V20 averages 96kbps and defaults to 44kHz).

When I have time, I'll certainly blind-test GXLame against LAME (because I'm interested in your work). However, assuming GXLame sounds better than LAME at low bit rates, I still tend towards LAME as anchor for this test. Here's why: unlike the codecs under test, anchors are supposed to produce certain artifacts, not avoid them.

Chris

Reply #293
OK, I changed my mind and will go along with Alex. The mid anchor will be a "compromised" AAC encoding at 96 kbps VBR. More precisely, one without TNS and short blocks and with a bandwidth of 15.8 kHz. It will be created with FAAC v1.28 and the following command:

Code:
faac.exe --shortctl 1 -c 15848 -q 50 -w ha_aac_test_sample_2010.wav


Decoder-wise, I'm not sure yet. Either NeroAacDec 1.5.1.0 or FAAD2 v2.7. Can someone point me to an Intel MacOS X (fat) binary of the latter?

Chris

Reply #295
Quote
What do you think about getting GXLame in as a low anchor (or even a competitor in a non-AAC test)? It's a low-bitrate MP3 encoder, so it just might fit the bill somewhere between V0-V30 (V20 averages 96kbps and defaults to 44kHz).

When I have time, I'll certainly blind-test GXLame against LAME (because I'm interested in your work). However, assuming GXLame sounds better than LAME at low bit rates, I still tend towards LAME as anchor for this test. Here's why: unlike the codecs under test, anchors are supposed to produce certain artifacts, not avoid them.

Chris


That's perfectly understandable. With its t4 release, I think it's actually quite competitive--I rushed to finish it in time for this test.

Reply #296
In response to www.hydrogenaudio.org/forums/index.php?showtopic=77809:

Quote from: muaddib
Also, it would be beneficial to create a tutorial covering every single small step that a proper test must consist of.

Quote from: C.R.Helmrich
Do you mean a tutorial for the listeners on "what the rules are" and how to proceed before and during the test? That sounds good. Will be done.

I finally found some time for this test again. I've managed to write an instruction sheet, nearly independent of test methodology (ABC/HR or MUSHRA) and user interface, to guide the test participants through a test session. It's based on my own experience and adapted to this particular test with regard to anchor and hidden-reference selection and grading. I've put a draft under

www.ecodis.de/audio/guideline_high.html

A description of said "general test terminology", i.e. an explanation of terms such as anchor, item, overall quality, reference, session, stimulus, and transparency, will follow.

Everything related to listener training, i.e. how to use the test software, what kinds of artifacts to expect, and how to spot artifacts, will also be discussed separately. As mentioned, this instruction sheet is the "final one in the chain" and assumes a methodology- and terminology-informed, trained listener.

If you're an experienced listener and feel that your approach to a high-bit-rate blind test is radically different from my recommendation, please let me know about the difference.

Chris

Reply #297
Quote
If you're an experienced listener and feel that your approach to a high-bit-rate blind test is radically different from my recommendation, please let me know about the difference.


Chris, I'm not an experienced listener at all, and my headphones are really poor. But I would love to know what people think about my approach. I don't actually care about ABX probabilities; I simply mux the encoded and raw audio into the L and R channels so that I can hear both signals simultaneously.

Also since there is some activity related to the test, I'm wondering whether someone could reach Opticom, or just have access to OperaDigitalEar to get advanced PEAQ scores for the test samples.

Reply #298
Quote
I actually don't care about ABX probabilities but simply mux encoded and raw audio into L and R channels so that I can hear both signals simultaneously.

My initial guess is that this is dangerous! You will probably hear artifacts which are inaudible if you just listen to the original and coded version, one after the other, and you might not hear certain artifacts which are clearly audible if you listen to both channels of the codec signal. Example: if the original and coded versions are slightly delayed relative to each other, you'll hear this with your approach because human hearing is very sensitive to inter-aural delay. However, if both coded channels are delayed by the same amount compared to the original two channels, this might be inaudible if you listen to both coded channels (which you should). I've never ABXed this way.

Objective quality measures will be done, but might not be published with the results (don't know if I'm allowed to publish Advanced PEAQ scores, the license is owned by my employer, not by me), especially not before the test.

Chris

Reply #299
Quote
and you might not hear certain artifacts which are clearly audible if you listen to both channels of the codec signal.

What kind of artifacts could be missed? Excluding stereo issues, I can only imagine a very far-fetched example. Anyway, this method can be thought of as a unit test. Here is what I usually do:

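Code:
%% Load both files and time-align them
% (assumes mono input: the trimming below uses linear indexing)
[a, fs] = wavread('sampleA.wav');
[b, fs] = wavread('sampleB.wav');

% Cross-correlate over lags of up to 4096 samples and keep the lag with the largest magnitude
[c, i] = xcorr(sum(a,2), sum(b,2), 4096); % fftfilt in Octave
i = i(abs(c)==max(abs(c)));

% Drop the leading samples of whichever signal is delayed, then trim both to equal length
a(1: i) = [];
b(1:-i) = [];
a(length(b)+1:end) = [];
b(length(a)+1:end) = [];

%% Put one signal in each channel, randomly swapping left and right
j = round(rand);
x = circshift([a(:) b(:)], [0 j]);

wavplay(x, fs, 'async')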
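For anyone without MATLAB, the same idea can be sketched in plain Python. This is my own rough port under stated assumptions, not the poster's code. It works on mono sample lists, uses a brute-force cross-correlation that is far too slow for full-length files at a 4096-sample search range, and leaves file reading and playback out entirely. All function names are made up.

```python
import random


def best_lag(a, b, max_lag):
    """Lag k in [-max_lag, max_lag] that maximizes |sum over n of a[n + k] * b[n]|."""
    best_k, best_val = 0, -1.0
    for k in range(-max_lag, max_lag + 1):
        total = 0.0
        for n in range(len(b)):
            m = n + k
            if 0 <= m < len(a):
                total += a[m] * b[n]
        if abs(total) > best_val:
            best_k, best_val = k, abs(total)
    return best_k


def align(a, b, max_lag=4096):
    """Trim two mono sample lists so they start together and have equal length."""
    k = best_lag(a, b, max_lag)
    if k > 0:        # a lags behind b: drop the first k samples of a
        a = a[k:]
    elif k < 0:      # b lags behind a: drop the first -k samples of b
        b = b[-k:]
    n = min(len(a), len(b))
    return a[:n], b[:n]


def random_lr(a, b, rng=random):
    """Pair the signals as (left, right) frames, swapping sides at random."""
    swap = rng.random() < 0.5
    left, right = (b, a) if swap else (a, b)
    return list(zip(left, right)), swap


# Tiny synthetic check: `delayed` is `original` shifted by three samples
original = [0.0, 1.0, -2.0, 3.0, 0.5, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0] + original
print(best_lag(delayed, original, 5))
```

The random swap keeps the listener from knowing which ear carries the coded signal, which is what the `circshift` call does in the original script.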