Topic: Blind test challenge

• KikeG
• Developer
Blind test challenge
Reply #25 – 28 February, 2003, 05:37:19 AM
Quote
Quote
It's 7/15 + 14/15, which makes 21/30, and I don't know how significant that is.

21/30 is more like 2.1%.

But if you add the other result (6/7), then you get 27/37, which is 0.4%.

BTW, at http://www.kikeg.arrakis.es/winabx you can download a cute coloured Excel table with all binomial distribution p-values up to 100/100.
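
For anyone without the spreadsheet at hand, the figures above can be reproduced with a few lines of Python. This is only a sketch: it assumes a one-sided binomial test against pure guessing (probability 0.5 per trial), and the function name abx_p_value is made up for this illustration.

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """Chance of getting at least `correct` right out of `trials` by pure guessing."""
    # Count the outcomes with `correct` or more hits; each outcome has probability 0.5**trials.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

for correct, trials in [(21, 30), (27, 37)]:
    print(f"{correct}/{trials}: p = {abx_p_value(correct, trials):.2%}")
```

The same function can be fed any other score quoted in the thread.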

• Garf
• Developer (Donating)
Blind test challenge
Reply #26 – 28 February, 2003, 07:36:45 AM
I could not ABX any of the clips in a casual session.

Question to KikeG: if you add up *all* your attempts, is it still significant too? I ask because you stated things like

Quote
'First I tried with my somewhat boomy Sony MDR-7506, and seemed to hear some differences, but I was unable to get any good ABX scores.'

Quote
At first I couldn't really notice any differences, just my imagination, so I didn't get any good scores.

But you ignore those results.

This isn't nitpicking - I abx'ed 2.wav 12/13 but if I add up all my attempts it's not significant.

• Continuum
Blind test challenge
Reply #27 – 28 February, 2003, 10:19:50 AM
The problem is: the probability of reaching 95% confidence at some point during (e.g.) 100 trials is far more than 5%; it's 20%!

So the overall test length (like 16) has to be fixed a priori in some way.

If a listener produces divergent results in different test sessions (like 14/16, 5/12), I tend to think that he could hear a difference in one case and not in the other. Our hearing precision changes. Whatever that means...

http://www.hydrogenaudio.org/forums/index....t=ST&f=1&t=3175
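
The effect can be computed exactly rather than taken on faith. The sketch below assumes a guessing listener (0.5 per trial) who checks the one-sided binomial p-value after every trial and stops at the first p <= 0.05. The function names are invented for this example, and the exact figure for 100 trials depends on these assumptions.

```python
from math import comb

def p_value(correct, trials):
    """One-sided binomial p-value for guessing (0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def chance_of_false_pass(max_trials, alpha=0.05):
    """Probability that a pure guesser hits p <= alpha at some point within max_trials."""
    alive = {0: 1.0}   # alive[c] = probability of c correct so far without having passed
    passed = 0.0
    for n in range(1, max_trials + 1):
        nxt = {}
        for c, prob in alive.items():
            for new_c in (c, c + 1):          # wrong or right guess, 0.5 each
                nxt[new_c] = nxt.get(new_c, 0.0) + prob / 2
        alive = {}
        for c, prob in nxt.items():
            if p_value(c, n) <= alpha:
                passed += prob                # the listener stops and claims success here
            else:
                alive[c] = prob
    return passed

for n in (5, 8, 16, 100):
    print(n, round(chance_of_false_pass(n), 4))
```

With a fixed test length only the final score counts; here every intermediate score gets its own chance to cross the line, which is why the total ends up above the nominal 5%.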

• KikeG
• Developer
Blind test challenge
Reply #28 – 28 February, 2003, 10:30:36 AM
Quote
But you ignore those results.

From a totally strict point of view, yes, I guess I should count all trials to get the overall statistical probability, beyond possible dispute.

However, I think one can follow a more flexible way of doing things: one could consider the first rounds as a "warm-up" or a search for something to latch on to, discard them once you latch on to something audible, and count as valid only the rounds where you have latched on to something, trying to get a significantly low p (<1%). It's also quite significant that in these earlier "warm-up" rounds one gets bad scores, but in the "latched" ones one gets good scores, and I think this means something.

I think that these first rounds can't be counted as having the same significance as the final "latched" ones. I don't know how statistically valid this approach is, but IIRC ff123 used to do something similar.

All this is more valid if you haven't done lots of unsuccessful trials before the "latched" ones. Even if you have, a few successful rounds once you are "latched" are proof of an audible difference for me.

Anyway, my total results are 47/61 (p<0.1%) for 4.wav, and 21/37 (p=25.6%) for 5.wav. For both I did several short warm-up, search-for-something rounds that I didn't consider very significant. For the first, I eventually did several successful rounds that raised the global score, but for the latter I did just one, the last one. I could try more successful rounds of 5.wav, but I'm quite confident the results are significant, and I don't have much time to keep at it. However, if anyone wants absolute proof, I'll try again.

• Pio2001
• Global Moderator
Blind test challenge
Reply #29 – 28 February, 2003, 02:19:20 PM
No need for additional rounds. I think that we can consider the results of different rounds separately.

There are times when I'm tired and can't concentrate... but usually, I keep on anyway, answer eagerly... and get everything wrong.
But after a good rest, concentrating properly, taking breaks, and patiently waiting until I'm sure I hear something before answering, I can get a perfect result on the same samples.
In this case, I discard the first round, and only take the second one into account.

From a statistical point of view, the total score must be taken into account to give an accurate result about my general ability to hear a difference, but that result is meaningless because it's just a fixed number, while my hearing varies depending on whether I listen carefully or not.

The separate results of each round tell my ability to hear the difference 1-when I'm not listening carefully, and 2-when I'm listening carefully. The general result, taking all rounds into account, is somewhat of a compromise between the two.

• Garf
• Developer (Donating)
Blind test challenge
Reply #30 – 01 March, 2003, 11:12:01 AM
I think the concept of 'warm-up' rounds and 'serious' rounds as discussed here has a problem if they are not fixed a priori.

How do you determine when you are done 'warming-up' and ready to go for the real test? You look at the ABX scores you are getting. But the ABX score is also what determines whether we get a significant result or not. So whether or not something 'counts' depends on how well it proves what we're trying to prove. That's not sound.

Even more problematic is the concept of 'bad tests' (when you were tired they don't count). How do you determine that it's a bad test? You see that you're not getting good ABX scores. Oops.

I think neither of these is statistically sound. I'd like to point out that for the MAD Challenge ff123 doesn't allow it either. It may be passable for less serious tests, but I'm starting to be stricter with myself after noticing how easy it is to pass a test 'by accident' if you're not careful.

• Continuum
Blind test challenge
Reply #31 – 01 March, 2003, 01:13:38 PM
Due to the following passage,
Quote
My recommendation is that the moment you achieve 95% confidence, you should stop and claim victory.
the MAD-challenge might be a bad example: Everyone could pass it, theoretically, if he has enough time on his hands (because of the reasons linked above).

But I agree: the point when the "serious" test starts should be well-defined a priori. The best way would be to start the abx-machinery only when you believe you hear a difference.

• Pio2001
• Global Moderator
Blind test challenge
Reply #32 – 01 March, 2003, 01:21:21 PM
It's a question of time for me. I can very well define when an ABX is serious a priori, but this can happen no more than once a week.
When I need to test, I run the ABX whether I'm tired or not, and see the results. So much the better if they match.

• ff123
• Developer (Donating)
Blind test challenge
Reply #33 – 01 March, 2003, 06:38:43 PM
Quote
Due to the following passage,
Quote
My recommendation is that the moment you achieve 95% confidence, you should stop and claim victory.
the MAD-challenge might be a bad example: Everyone could pass it, theoretically, if he has enough time on his hands (because of the reasons linked above).

But I agree: the point when the "serious" test starts should be well-defined a priori. The best way would be to start the abx-machinery only when you believe you hear a difference.

I never went back to change the rules in the MAD challenge after we went through the whole exercise of figuring out the best way to do ABX "profiles."  Fixed profiles would eliminate the bias inherent in being able to see the ABX scores as the test is performed (so you can stop whenever it's to your advantage to do so).

For that matter, I never incorporated the ABX profile concept into ABC/hr.  So much to do and so much laziness preventing me from actually doing it

ff123

• Garf
• Developer (Donating)
Blind test challenge
Reply #34 – 02 March, 2003, 05:15:46 PM
Quote
Due to the following passage,
Quote
My recommendation is that the moment you achieve 95% confidence, you should stop and claim victory.
the MAD-challenge might be a bad example: Everyone could pass it, theoretically, if he has enough time on his hands (because of the reasons linked above).

Please explain, I don't see how having a lot of time allows you to pass the MAD Challenge.

• Pio2001
• Global Moderator
Blind test challenge
Reply #35 – 02 March, 2003, 05:20:25 PM
Here are the present results (Voltron, your results have disappeared, could you recover them please? They were close to success).

The common setup for recording is the analog input of the Sony DTC55ES DAT deck, sample rate = 48 kHz (it only supports 32 kHz and 48 kHz). Optical output, Fostex optical-to-coaxial SPDIF converter, Marian Marc 2 coaxial digital input, clock set to digital input, recording in SoundForge 4.5, 48 kHz 16-bit stereo. The Marian's digital recording has been checked to be error-free with CD playback.
The leveling is done by selecting exactly the same range (with maybe two or three samples of difference over 30 seconds of selection) in the original and the copy, and requesting the statistics. Then a level correction is applied (typically 1.2 dB), with two-digit accuracy. No dither, no floating-point processing (SF 4.5 has a 16-bit engine, I've been told).
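
The level matching described above boils down to comparing RMS levels over the same selection. Here is a rough sketch of that arithmetic in Python, not of what SoundForge does internally: the helper names and the synthetic sine data are invented here, samples are assumed to be plain lists of numbers, and "two digits" is read as two decimals of a dB.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_correction_db(reference, copy):
    """Gain in dB to apply to `copy` so its RMS matches `reference`, rounded to 0.01 dB."""
    return round(20 * math.log10(rms(reference) / rms(copy)), 2)

def apply_gain(samples, gain_db):
    """Scale samples by a gain given in dB."""
    factor = 10 ** (gain_db / 20)
    return [s * factor for s in samples]

# Synthetic check: one second of a 1 kHz sine at 48 kHz and a copy that is 1.2 dB quieter.
reference = [math.sin(2 * math.pi * 1000 * n / 48000) for n in range(48000)]
copy = apply_gain(reference, -1.2)
print(level_correction_db(reference, copy))
```

A real comparison would of course read the two recordings instead of generating a sine.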

File 1 : same as File 2, with a cheap (8 €) 5-meter CINCH extension [2] in addition to the cable [1] used for File 2, leveled.
File 2 : Winamp 2.81, WinXP, WaveOut, Marian Marc 2 analog output, max volume, two-meter cheap TRS-to-CINCH adapter with a loose contact [1]. Oddly, the Marian clock was slaved to the 48 kHz input, but the Winamp 44.1 kHz playback went flawlessly. I don't know if it can set the output to 44.1 kHz while the input is 48 kHz. Leveled.
File 3 : same as File 5, with the 5-meter cheap cable [2] in addition. Leveled.
File 4 : Winamp 2.81, WinXP, WaveOut, SoundBlaster 64 PCI-V, max volume, cable number [1] (see above). Leveled.
File 5 : Yamaha CDX860 CD player from 1991 (450 € at the time). Shows no errors on pressed or burned CDs at the SPDIF output. Custom RG179bu CINCH cable.

Original : CD ripped in secure mode, resampled to 48 kHz, leveled equal to File 5

KikeG : listening on the computer's built-in SoundMax soundcard + Sennheiser HD560 and Sony MDR-7506
Pio2001 : external Sony DTC55ES as converter, Arcam Diva A85 amplifier, Sennheiser HD-600 headphones
Voltron : Turtlebeach SantaCruz with Sony MDR-v250 Headphones

ABX results :

File 1 (Marian with 7 meters cable) :
Pio2001 : Failure
Garf : Failure
Voltron : Failure ?

File 2 (Marian with 2 meters cable) :
Garf : Failures, then 12/13
Pio2001 : Failure
Voltron : Failure ?

File 3 (Yamaha CD Player with 6 meters cable)
Pio2001 : Failure
Garf : Failure
Voltron : Failure ?

File 4 (Soundblaster 64)
Pio2001 : Failure
Garf : Failure
KikeG : Failure then 7/7, 14/17, 10/11, 12/14, total 47/61
Voltron : Success

File 5 (Yamaha CD Player with custom CINCH cables)
Pio2001 : Failure
Garf : Failure
KikeG : Failures, then Success. Total 21/37
Continuum : Failure
Bedeox : 7/15 then 14/15 then 6/7
Voltron : Failure ?

File 3 vs File 5 (Addition of 5 meters of cheap cable on the Yamaha Player)
Bedeox : Failure

In conclusion, this test brings more questions than answers.

The failures tend to show that an SB64 soundcard can sound very close to the original, not to mention the CD player, and that 5 meters of cheap CINCH cables have no effect on the sound (not to mention one meter only).
However, it can be objected that the listening sessions were done on computer soundcards (except Pio2001, and maybe Continuum, Garf and Bedeox), and with headphones (except maybe Continuum and Garf), while audiophile CD players and audiophile CINCH cables are supposed to improve sensitive high-end speaker systems.

The success on File 5 would show that 450 € is not enough for a CD Player (at least back in 1991) in order to get a perfect sound, and that audiophile CD players in the 1000 € range are worth the price.
But before jumping to this interesting conclusion, some obvious flaws must be eliminated.
The difference between the reference file and the number 5 can also come from
-The two processes (resampling and leveling) through which the reference file went, at 16 bits processing.
-The quality of the Sony DTC 55ES recording

The first problem can be tested by passing the same sample through two opposite leveling/resampling processes. One leveling step should be done between the two resampling steps, so as to avoid getting conjugate processes for up/downsampling. It should be 44.1->48, level -1.5 dB, 48->44.1, level +1.5 dB.
If the result is not ABXable, the processes are ruled out as a source of audible differences.
The second problem is more difficult to test. I could record the same as File 5, but with the Marian analog input instead of the Sony. If both recordings sound the same (no ABX possible), and are both ABXable from the reference one, it is likely that the difference comes from the CD player, and not the recording device.

Bedeox and KikeG, you ABXed File 5. Are you interested in going on? Or anyone else? I can provide the samples mentioned, along with a new reference one, if you want. This time, instead of Depeche Mode, which I chose casually, I would rather use Rebecca Pidgeon: an audiophile recording from Chesky Records (but you would need to ABX File 5 again).

In the end, this would possibly prove that audiophile CD players were worth it, but in 1991; who knows whether today's CD players are better... it is said they are.
If one of you has a recent hifi CD player worth at least 300 € and a good ADC, it would be better to use them.

For now, one thing is sure: if a cheap line cable has any effect on the sound, it is very, very little. The RMS level loss is 0.00 +/- 0.01 dB for 5 meters.

Sorry for not providing more informative results. When I started this test, I hoped that no one could ABX the CD player, even on high-end systems, but I see that this is not the case.

• Garf
• Developer (Donating)
Blind test challenge
Reply #36 – 02 March, 2003, 05:26:40 PM
I used SB128 into HD580's for this test.

The 12/13 ABX was gotten by randomly hitting the keys while not wearing the headphones (Case is my witness on IRC). It's got <0.2% significance, something to ponder. (This is the main reason why I questioned the significance of the other tests as well.)

• Pio2001
• Global Moderator
Blind test challenge
Reply #37 – 02 March, 2003, 07:31:29 PM
Did you really hit the keys randomly? Not only A, only B, or an AB pattern? I've noticed in the ABX comparator that the same sample may be played several times in a row (though I didn't compute the probability of that happening).
Are the ABX programs using trustworthy random generators? (Our programming teacher told us "Never use the built-in random generator! Always use the one in the math library...")
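
The probability left uncomputed above is easy to get. A small sketch, assuming X is chosen as A or B with probability 0.5 independently on every trial (an assumption about the ABX tool, not something verified here); the function name is made up:

```python
def chance_of_run(trials: int, run_length: int) -> float:
    """Probability that `trials` fair A/B picks contain a run of >= run_length identical picks."""
    if run_length <= 1:
        return 1.0 if trials >= 1 else 0.0
    if trials < run_length:
        return 0.0
    # counts[r] = number of sequences so far whose current run has length r
    # and that have never contained a run of run_length.
    counts = {1: 2}  # first pick: A or B
    for _ in range(trials - 1):
        nxt = {1: sum(counts.values())}             # switching sample restarts the run
        for r, c in counts.items():
            if r + 1 < run_length:
                nxt[r + 1] = nxt.get(r + 1, 0) + c  # same sample again, the run grows
        counts = nxt
    return 1 - sum(counts.values()) / 2 ** trials

for length in (3, 4, 5):
    print(length, round(chance_of_run(16, length), 3))
```

Repeats of the same sample are therefore expected from a genuinely random choice and are not by themselves a sign of a bad generator.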

• ff123
• Developer (Donating)
Blind test challenge
Reply #38 – 02 March, 2003, 10:55:13 PM
Quote
Did you really hit the keys randomly? Not only A, only B, or an AB pattern? I've noticed in the ABX comparator that the same sample may be played several times in a row (though I didn't compute the probability of that happening).
Are the ABX programs using trustworthy random generators? (Our programming teacher told us "Never use the built-in random generator! Always use the one in the math library...")

abchr, at least, no longer uses rand().  It uses the "Mersenne Twister" Garf found:

http://www.math.keio.ac.jp/~matumoto/ver980409.html

Hans Heijden had found one sequence which showed moderate evidence against randomness on a runs test, which prompted me to change the random function.  However, all of the other runs I tried myself passed for randomness, so I'm not sure any change was really necessary.

ff123
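
For reference, a runs test on an A/B sequence can be sketched as follows. The post doesn't say which variant Hans Heijden used; this assumes the Wald-Wolfowitz runs test with its normal approximation, and the function name is invented for the example.

```python
import math

def runs_test_z(sequence: str) -> float:
    """z-score of the number of runs in a two-symbol sequence (Wald-Wolfowitz)."""
    symbols = sorted(set(sequence))
    if len(symbols) != 2:
        raise ValueError("need exactly two distinct symbols")
    n1 = sequence.count(symbols[0])
    n2 = sequence.count(symbols[1])
    n = n1 + n2
    # A new run starts wherever two neighbouring symbols differ.
    runs = 1 + sum(a != b for a, b in zip(sequence, sequence[1:]))
    mean = 2 * n1 * n2 / n + 1
    variance = (mean - 1) * (mean - 2) / (n - 1)
    return (runs - mean) / math.sqrt(variance)

print(round(runs_test_z("AB" * 10), 2))              # too many runs: strongly positive z
print(round(runs_test_z("A" * 10 + "B" * 10), 2))    # too few runs: strongly negative z
```

A |z| around 2 or more is usually read as evidence against randomness, but a single such sequence among many tested ones is to be expected now and then.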

• KikeG
• Developer
Blind test challenge
Reply #39 – 03 March, 2003, 06:21:59 AM
Quote
abchr, at least, no longer uses rand().  It uses the "Mersenne Twister" Garf found:

Funny, I have been using that same random number generator for some of my internal utilities for some time, but right now I don't remember for sure whether WinABX uses that one or the built-in rand() function of BC++ Builder; I think it uses the latter (I don't have access to the code right now). But it uses it in the "proper" way, the one you use too (not rand()%n), so this shouldn't be a problem.
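
Why rand()%n is frowned upon can be shown with a toy example. The generator below is a textbook linear congruential generator written in Python purely for illustration; it is NOT the actual rand() of BC++ Builder or of any particular library, and the constants and function names are assumptions of this sketch. With a power-of-two modulus and odd multiplier and increment, the lowest bit of such a generator simply alternates, so taking the state modulo 2 gives A, B, A, B...; using the top bit avoids that.

```python
def lcg_states(seed: int, count: int):
    """Yield `count` raw states of a toy LCG with modulus 2**31 (illustrative constants)."""
    state = seed
    for _ in range(count):
        state = (1103515245 * state + 12345) % 2 ** 31
        yield state

def picks_modulo(seed: int, count: int):
    """The 'rand() % 2' style: uses only the lowest bit of the state."""
    return [s % 2 for s in lcg_states(seed, count)]

def picks_high_bit(seed: int, count: int):
    """The scaled style, floor(2 * state / 2**31): uses the highest bit of the state."""
    return [s >> 30 for s in lcg_states(seed, count)]

print("low bit :", "".join("AB"[b] for b in picks_modulo(1, 32)))
print("high bit:", "".join("AB"[b] for b in picks_high_bit(1, 32)))
```

Generators that hide their low bits or use a different construction (such as the Mersenne Twister) don't show this particular weakness.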

• KikeG
• Developer
Blind test challenge
Reply #40 – 03 March, 2003, 07:07:44 AM
Quote
The difference between the reference file and the number 5 can also come from
-The two processes (resampling and leveling) through which the reference file went, at 16 bits processing.
-The quality of the Sony DTC 55ES recording
...
In the end, this would possibly prove that audiophile CD players were worth it, but in 1991; who knows whether today's CD players are better... it is said they are.

I suspect that differences heard are more due to the Sony DAT recorder. Let me explain why:

Analyzing 5.wav against the original using an FFT analyzer, it seems that the 5.wav file has some strange frequency and phase response behaviour, which I think could be due to slight speed-ups and slow-downs of the recording, similar to the wow and flutter of analog recorders. To check this you could repeat the procedure with a single 1 kHz tone signal.

In this 5.wav clip, what I heard is a slight emphasis of the highs at the beginning of the song, which is really easy to hear with "fresh" ears.

By the way, I just ABX'ed it again, in a single round: 16/20, p=0.6%. I quickly (1 minute) got 7/7 (p=0.8%) at the beginning, but I wanted "absolute" proof and kept on; I guess my ears got a little bit tired or stressed, and I then failed some trials on the way to the final score.

Global score is 37/57, p=1.7%.

• KikeG
• Developer
Blind test challenge
Reply #41 – 03 March, 2003, 08:33:48 AM
Looking a bit more into the FFT analyses, it seems that the only difference is a speed-up in the case of 5.wav.

Also, I ABX'ed 2.wav too, in a single round: 25/35, p=0.8%. This time the difference seems to be the opposite: 2.wav sounds a little bit duller, which is confirmed by the FFT analyses; it seems to be slowed down a little bit in comparison with the original.

This is a little bit strange, a detailed objective analysis (measurements) should be used to see what is happening.

• Continuum
Blind test challenge
Reply #42 – 03 March, 2003, 10:16:46 AM
Quote
Please explain, I don't see how having a lot of time allows you to pass the MAD Challenge.

The conventional p-value calculation uses the fact that the number of trials is fixed a priori.
E.g. You decide to perform 8 trials. Then you achieve a score of 7 correct trials.
The p-val then is the probability to get 7 or 8 trials correct.

Now consider a different situation: Instead of fixing the number of trials, you decide on a certain confidence level (calculation based on the current trial in the same way as above) you want to reach.
E.g. You want to reach 95%-confidence (in the classical sense) and stop as soon as this condition is satisfied. Now the following are your win-conditions:
5/5, 7/8, 9/11, 10/13, 12/16, 13/18, ...
So, the probability of passing this test by guessing is not only 0.05 but something like:
P(5/5) + P(7/8 and not 5/5) + P(9/11 and neither 5/5 nor 7/8) + ...
which tends to 1.

If you are interested in more information, check the Statistics for Abx-thread. There are some experimental results and calculations and a proposed compromise between the free-length test and sufficiently-significant-while-not-too-hard tests.
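
The list of win-conditions can be regenerated mechanically. This is a sketch under the same assumptions as the post (guessing at 0.5 per trial, one-sided p-value, 95% confidence meaning p <= 0.05); the function names are invented here. For each number of allowed mistakes it finds the shortest test length at which the score becomes significant.

```python
from math import comb

def p_value(correct, trials):
    """One-sided binomial p-value for guessing (0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def win_conditions(max_mistakes, alpha=0.05):
    """For 0..max_mistakes wrong answers, the first (correct, trials) score with p <= alpha."""
    wins = []
    for mistakes in range(max_mistakes + 1):
        trials = mistakes + 1
        while p_value(trials - mistakes, trials) > alpha:
            trials += 1
        wins.append((trials - mistakes, trials))
    return wins

print(", ".join(f"{c}/{n}" for c, n in win_conditions(5)))
# prints: 5/5, 7/8, 9/11, 10/13, 12/16, 13/18
```

Each of these scores has a p-value below 5% on its own; it is the chance of hitting any one of them along the way that adds up.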

• NumLOCK
• Developer
Blind test challenge
Reply #43 – 03 March, 2003, 11:42:28 AM
Quote
Quote
Did you really hit the keys randomly? Not only A, only B, or an AB pattern? I've noticed in the ABX comparator that the same sample may be played several times in a row (though I didn't compute the probability of that happening).
Are the ABX programs using trustworthy random generators? (Our programming teacher told us "Never use the built-in random generator! Always use the one in the math library...")

abchr, at least, no longer uses rand().  It uses the "Mersenne Twister" Garf found:

http://www.math.keio.ac.jp/~matumoto/ver980409.html

Hans Heijden had found one sequence which showed moderate evidence against randomness on a runs test, which prompted me to change the random function.  However, all of the other runs I tried myself passed for randomness, so I'm not sure any change was really necessary.

ff123

If you don't need speed, the best known pseudo-random generator is B.B.S. (Blum-Blum-Shub). Predicting a single output bit from all the previous ones is proven to be as hard as factoring an arbitrary-sized integer.

If factoring a 500-digit number sounds too easy, it is also possible to make a PRNG based on the discrete logarithm problem.
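
For the curious, Blum-Blum-Shub itself is only a few lines. This is a toy sketch with tiny textbook-sized primes (11 and 23), which are trivially factorable and useless for real security; real use needs large secret primes, both congruent to 3 mod 4. The function name is made up for this example.

```python
def bbs_bits(seed: int, p: int, q: int, count: int):
    """Return `count` output bits of Blum-Blum-Shub with modulus p*q (toy parameters only)."""
    if p % 4 != 3 or q % 4 != 3:
        raise ValueError("p and q must both be congruent to 3 mod 4")
    modulus = p * q
    state = seed % modulus
    bits = []
    for _ in range(count):
        state = state * state % modulus   # x_{n+1} = x_n^2 mod M
        bits.append(state & 1)            # output the least significant bit
    return bits

print(bbs_bits(3, 11, 23, 6))   # states 9, 81, 236, 36, 31, 202 -> [1, 1, 0, 0, 1, 0]
```

One modular squaring per output bit is what makes it slow compared to generators such as the Mersenne Twister.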

• Pio2001
• Global Moderator
Blind test challenge
Reply #44 – 03 March, 2003, 04:22:53 PM
I've compared the reference with File 5. File 5 does indeed run faster... 0.002 % faster (which, from a pitch point of view, is 0.0002 tones, the extreme limit of audibility being 0.01 tone, for very well trained people).

Here's the sonogram of the difference between the two files (offset by 40 samples, so that the symmetry is more visible). You have to subtract the samples and listen to the result to understand the pattern.
It means that both clocks are wow and flutter free; the two clocks (playback and record) just don't run at the same frequency.

Listening to the vanishing point of the differences, where the two clocks are in sync, it seems that there is a difference between the two files in the low frequencies.

But the speed difference can't account for the audible difference, and isn't necessarily the Sony's fault.
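
For anyone who wants to check the conversion from a speed difference to a pitch difference, here is a sketch. It assumes "tone" means a whole tone (two equal-tempered semitones, i.e. 1/6 of an octave); the function names are invented.

```python
import math

def speed_to_tones(percent_faster: float) -> float:
    """Pitch shift in whole tones (6 per octave) for a given speed-up in percent."""
    ratio = 1 + percent_faster / 100
    return 6 * math.log2(ratio)

def shifted_frequency(freq_hz: float, percent_faster: float) -> float:
    """Frequency after playing `freq_hz` back `percent_faster` percent too fast."""
    return freq_hz * (1 + percent_faster / 100)

print(f"{speed_to_tones(0.002):.5f} tones")
for f in (1000, 15000):
    print(f, "Hz ->", round(shifted_frequency(f, 0.002), 3), "Hz")
```

The shift in Hz grows with frequency, but the shift in tones is the same everywhere, since the whole spectrum is scaled by one ratio.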

• TJA
Blind test challenge
Reply #45 – 03 March, 2003, 09:09:11 PM
You just cannot use most random functions in libraries.
The only thing I know of that works, for a SHORT time, is /dev/random on Linux.

Here is part of its man page:

The random number generator gathers environmental noise
from device drivers and other sources into an entropy
pool. The generator also keeps an estimate of the number
of bits of noise in the entropy pool. From this
entropy pool random numbers are created.

When read, the /dev/random device will only return random
bytes within the estimated number of bits of noise in the
entropy pool. /dev/random should be suitable for uses
that need very high quality randomness such as one-time
pad or key generation. When the entropy pool is empty,
reads from /dev/random will block until additional
environmental noise is gathered.

All other implementations, which mostly use a mathematical function and NOT an entropy pool, will not work!
I'm sorry if the library mentioned does have such an entropy pool, but as far as I know, most libraries do NOT!
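
From a program, the operating system's random source doesn't have to be read by hand. As a sketch, Python's standard `secrets` module draws from the randomness provided by the operating system (which source that is depends on the platform; it is not claimed to be the blocking /dev/random quoted above). The function name is invented, and this is not how any existing ABX tool is known to work.

```python
import secrets

def pick_x_assignments(trials: int):
    """Choose, for each trial, whether X is sample 'A' or 'B', using OS-provided randomness."""
    return [secrets.choice("AB") for _ in range(trials)]

print("".join(pick_x_assignments(16)))
```

For a listening test of a few dozen trials the amount of randomness needed is tiny, so speed is not a concern either way.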

• KikeG
• Developer
Blind test challenge
Reply #46 – 04 March, 2003, 03:23:39 AM
Quote
I've compared the reference with File 5. File 5 does indeed run faster... 0.002 % faster (which, from a pitch point of view, is 0.0002 tones, the extreme limit of audibility being 0.01 tone, for very well trained people).

I think the difference is not in the perceived pitch (musical tone), but maybe more in the fact that the 5.wav file has its high frequencies displaced a little bit up the frequency scale due to this faster play, resulting in perceived louder highs. The higher the frequency, the more it is displaced up the frequency scale.

And yes, clock speed differences aren't necessarily the DAT's fault.

• Pio2001
• Global Moderator
Blind test challenge
Reply #47 – 04 March, 2003, 06:50:16 AM
Quote
the fact that the 5.wav file has its high frequencies a little bit displaced up in the frequency scale due to this faster play, resulting into perceived louder highs

I can't believe it.

Do you realize that the whole 1000 to 2000 Hz octave, for example, is just changed into 1000.02 to 2000.04 Hz?

Edit: how much better is our threshold of hearing at 1000.02 Hz compared to 1000.00 Hz?

• KikeG
• Developer
Blind test challenge
Reply #48 – 04 March, 2003, 08:26:00 AM
The effect is higher at, say, 15 kHz. But it's still quite small, so I don't know for sure what is really happening.

• Garf
• Developer (Donating)
Blind test challenge
Reply #49 – 04 March, 2003, 02:53:47 PM
Quote
E.g. You want to reach 95%-confidence (in the classical sense) and stop as soon as this condition is satisfied. Now the following are your win-conditions:
5/5, 7/8, 9/11, 10/13, 12/16, 13/18, ...
So, the probability of passing this test by guessing is not only 0.05 but something like:
P(5/5) + P(7/8 and not 5/5) + P(9/11 and neither 5/5 nor 7/8) + ...
which tends to 1.

Are you sure? It's counterintuitive to me (as are many statistics, but anyway).

It's P(5/5) + P((7/8 or 8/8) and not 5/5) + P((9/11 or 10/11 or 11/11) and none of 5/5, 7/8, 8/8) + ...

The chances are interdependent, failure on the first influences success on the second one and so on.

A silly test is to write a simulation that keeps guessing in ABX, if you are right it has to pass eventually.
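
Such a simulation could look roughly like this. It is a sketch with made-up function names, assuming the stop rule discussed above (stop as soon as the one-sided p-value is <= 0.05) and a fixed random seed. A simulation can't run forever, so each guesser is capped at a maximum number of trials; it can only show how the pass rate grows with the allowed length, not that everyone passes eventually.

```python
import random
from math import comb

def min_correct_to_pass(trials, alpha=0.05):
    """Smallest score out of `trials` with one-sided guessing p-value <= alpha (None if impossible)."""
    tail, needed = 0, None
    for k in range(trials, -1, -1):
        tail += comb(trials, k)
        if tail > alpha * 2 ** trials:
            break
        needed = k
    return needed

def first_pass(max_trials, thresholds, rng):
    """Guess until the running score is significant; return the trial number, or None."""
    correct = 0
    for n in range(1, max_trials + 1):
        correct += rng.random() < 0.5            # a coin-flip answer
        if thresholds[n] is not None and correct >= thresholds[n]:
            return n
    return None

def simulate(listeners=2000, max_trials=200, seed=1):
    """First passing trial (or None) for each of `listeners` simulated guessers."""
    rng = random.Random(seed)
    thresholds = [None] + [min_correct_to_pass(n) for n in range(1, max_trials + 1)]
    return [first_pass(max_trials, thresholds, rng) for _ in range(listeners)]

results = simulate()
for cap in (16, 50, 200):
    passed = sum(1 for n in results if n is not None and n <= cap)
    print(f"within {cap} trials: {passed / len(results):.1%} of guessers pass")
```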