
Overcoming the Perception Problem

Reply #100
Does an untestable hypothesis even have a place in science?


That question is certainly up for discussion. Popper would answer a resounding 'no': untestable hypotheses are unscientific. That is not to say they are unimportant (imagine there is a $DEITY that sentences you to salvation or damnation based on the colour of your shoes – you can hardly dismiss that as unimportant), just to say that they are outside the realm of science. That is, kinda, twisting the question around: does science even have a place in the discussion of untestables?

Then on the other hand, you have cases which are in principle testable, but where you will never find the data – or where you cannot yet know whether enough information will ever be revealed. You form a “We believe that this happened because ...” hypothesis. Likely there is a grey area between “testable”, “will become testable, just wait and see”, “may or may not ever become testable” and “won't ever be testable”. And what if something is “in principle testable”, but you are fully aware that hardly anyone makes a decent attempt at testing it? Is that a failure of science as such? Is it unscientific to base yourself on such a hypothesis? (Warning: potentially leading trick question.)

Overcoming the Perception Problem

Reply #101
I think that's an intelligent response, and this could be an interesting discussion.

However, I'm reminded of the reality of blind and sighted testing, and I think we're way off on a tangent.

Sighted listening tests can have all the problems of "stress" and "altered perception" that item ascribes to blind tests. If item is right, the very fact that we are not listening purely for the enjoyment of the music (or why ever you normally listen to your stereo), and are listening, at least partly, with a view to forming a judgement - that simple fact has nullified our attempt to form a correct judgement related to the practice of normal listening.

Blind, sighted, whatever - irrelevant. Asking the question has made the question unanswerable.


And yet, in practice, it's only expectation bias that seems to be a problem. Everything else has an effect on what I perceive, but not a systematic effect which makes me consistently prefer the wrong thing - not if the test is designed properly.

I think Arny quietly makes the same excellent point over and over again, such that readers miss the power and relevance of it: people who haven't even tried sighted and blind and double-blind testing really don't know what they're talking about. While the philosophical discussions we've had here are interesting, and consideration of the "what ifs" might be great fun, it's a complete and utter waste of time compared to getting some practical experience of the issues surrounding these things - and learning about your own responses to sighted vs blind testing. That's the real eye (ear!) opener. Do that - experience the magnitude of the expectation bias problem vs every other possible problem - and then come back and argue the philosophy of the situation if you still think it's relevant.

Cheers,
David.

Overcoming the Perception Problem

Reply #102
Sighted listening tests can have all the problems of "stress" and "altered perception" that item ascribes to blind tests. If item is right, the very fact that we are not listening purely for the enjoyment of the music (or why ever you normally listen to your stereo), and are listening, at least partly, with a view to forming a judgement - that simple fact has nullified our attempt to form a correct judgement related to the practice of normal listening.


Item's argument is "I hypothesize incorrectly that there is a problem with DBT for audio, based on flawed reasoning and fantasy. Now please demonstrate that this isn't so." It's a cop-out. We cannot learn anything from this line of thought.

Had I not believed that item is sincere, I would have accused him/her of eloquently trolling the board.

Overcoming the Perception Problem

Reply #103
This might explain it, if it's the same Item?


Item Audio


Overcoming the Perception Problem

Reply #104
I for one do not discard the statement “Putting people in a testing lab alters their behaviour”. Not even if “perception” is part of “behaviour”. Greynol (post #9) puts forth the opinion that people are more alert in a testing lab – in which case the behaviour does indeed change.

But it is hardly controversial to claim that the test setup matters. Indeed, DBT procedures are introduced for precisely that reason – it is well documented that placebo is too significant to be ignored.

- If the lab setup misses some (real, as opposed to imagined) differences only every now and then (unsystematically), then it simply requires more trials to pin them down. That is not much of an issue as long as the number is reasonable. If it were so that a sighted test “detects” a difference in two trials but is useless for proving any difference in a thousand, while a DBT detects and reliably establishes it in fifteen, then the latter is superior.
And even then, the practitioners would have a gut feeling for how many trials would suffice. A hidden piece of information saying that under ideal circumstances you could reduce the number of trials from 15 to 8 ...? Well, that is just a matter of cost-efficiency. (A rough sketch of this trial-count arithmetic follows below this list.)

- It is known and completely uncontroversial that a statistical test with a finite number of trials is inherently prone to missing real effects which are sufficiently weak. If the lab setup completely misses some real differences – either because they for some reason show up so much more rarely that we cannot detect them, or because for some mysterious reason they vanish altogether – well, too bad; but compare this to a test setup which we know from the outset distorts the result so much that it is useless from the beginning.
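To make the trial-count remark in the first point concrete: a minimal sketch (the hit rates, significance level and power below are made-up illustration values, not figures from this thread) of how to compute the smallest number of ABX trials at which a listener with a given true hit rate clears a significance threshold with decent probability.

Code:
from math import comb

def p_value(correct, trials):
    # exact one-sided binomial p-value against the "fair coin" (pure guessing) null
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def trials_needed(hit_rate, alpha=0.05, power=0.8, max_trials=200):
    # smallest number of trials at which a listener with the given true hit rate
    # rejects the null with at least the requested probability ("power")
    for n in range(1, max_trials + 1):
        crit = next((k for k in range(n + 1) if p_value(k, n) <= alpha), None)
        if crit is None:
            continue  # too few trials for any score to reach significance
        pw = sum(comb(n, k) * hit_rate ** k * (1 - hit_rate) ** (n - k)
                 for k in range(crit, n + 1))
        if pw >= power:
            return n
    return None

for q in (0.95, 0.9, 0.75, 0.6):
    print("true hit rate", q, "->", trials_needed(q), "trials")

The exact numbers depend entirely on the assumed hit rate; the point is only that the required trial count is a computable cost, not a fundamental obstacle.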


I guess the latter is only food for that typical anti-science stance of “error in one, error in all”, used to lump together every science that has ever made an inaccurate statistical prediction or refrained from making one that by coincidence would have hit. It is the pet argument of denialists of evolution, geosciences and smoking-induced cancer, not to mention Godwin's ineffable application of the infinite monkey theorem. Statistical science gets things wrong. It is a method for reducing errors, for avoiding them by numbers, not for getting rid of them all at one omniscient stroke – that is a luxury reserved for those who have had a G-d-given truth revealed to them once and for all. To those, any misprediction – or any prediction that failed to be made – is just proof that science cannot be trusted any more than the previous doomsday that passed unnoticed.

Overcoming the Perception Problem

Reply #105
I for one do not discard the statement “Putting people in a testing lab alters their behaviour”. Not even if “perception” is part of “behaviour”. Greynol (post #9) puts forth the opinion that people are more alert in a testing lab – in which case the behaviour does indeed change.


The above is an incomplete statement of the claims we're dealing with.

The actual claim is “Putting people in a testing lab always alters their behavior in such a way that they are always transformed from highly sensitive detectors of audible differences into lumps of coal who are deaf to all but the grossest of auditory stimuli".

My point is that the alteration of behavior is always negative for audible differences, according to the DBT critics.

No known instances of even the best of their listeners ever beating the sensitivity-robbing failings of the DBT system, ever.

That would appear to be their story. ;-)

Overcoming the Perception Problem

Reply #106
The actual claim is “Putting people in a testing lab always alters their behavior in such a way that they are always transformed from highly sensitive detectors of audible differences into lumps of coal who are deaf to all but the grossest of auditory stimuli".


And even then one would have to establish that the highly sensitive detectors of audible differences aren't swamped by the highly sensitive detection of placebo. One could in principle be lucky, in that the information revealed in sighted tests overshadows the misinformation, but why bet on it? (It isn't that hard to check, given time and resources and ... a slight willingness to sacrifice honesty – what about a sighted version of http://www.matrixhifi.com/ENG_contenedor_ppec.htm bar a white lie about which gear was playing? Edit at the end of the posting.)

Or we could just put the snakeoil investment plan on hold until we get better (and less uncomfortable) brain scans.


Edit: should have bookmarked this one, but google is my friend:
Quote
We also heard David Wilson's fascinating presentation of his conception of system hierarchy. He compared a pair of Wilson Sophias driven by a Parasound stereo power amplifier with a competitor's flagship speaker and an extremely powerful premium-priced amplifier. Not, as he explained, because he thought the Sophias sounded better, but to prove that meaningful comparisons could be made between systems assembled according to different priorities. This was a demo aimed at his hi-fi dealer clientele, after all (it's a trade show, remember?), but there's a kicker: after we all confirmed that we could hear meaningful differences, Wilson whipped a fake component shell off the digital source and revealed that with the Wilson speakers we weren't listening to the $20,000 CD player that had been used for the competitor's speakers, but an Apple iPod playing uncompressed WAV files!
http://www.stereophile.com/news/011004ces/ , found via http://www.head-fi.org/t/486598/testing-au...laims-and-myths .

Overcoming the Perception Problem

Reply #107
Indeed, DBT procedures are introduced for precisely that reason – it is well documented that placebo is too significant to be ignored.


I am no scientist nor technician (nor a native English speaker either, as you may guess, I am sorry) but I really don't understand how the ABX method is expected to ban “placebo” as a variable of the testing procedure. If I understand the way the tests are run, it is not “blind” at all. People know which are the A's and which are the B's, and they are asked to match them to X's. If your inner belief (possible placebo effect) is that there is an “audible” difference between A's and B's, and the difference is not audible, you will fail to get a result beyond mere random guessing. In this case we can affirm that the placebo effect is avoided (but at the cost of not being able to tell odd from a maybe statistically relevant result). But what if your inner belief is that there is no difference at all? I guess in this case placebo would affect the result of the test. My point is that this method does not seem “scientific” at all to me; it is completely “asymmetric”, as shown by the fact that in the best hypothetical case it can verify something only if it is obvious under test conditions, and it is unable to falsify any hypothesis, from the less than obvious to the absolutely impossible.

Overcoming the Perception Problem

Reply #108
Indeed, DBT procedures are introduced for precisely that reason – it is well documented that placebo is too significant to be ignored.


I am no scientist nor technician (nor a native English speaker either as you may guess, I am sorry) but I really don't understand how the ABX method is expected to ban “placebo” as a variable of the testing procedure. If I understand the way the tests are run it is not “blind” at all.


You certainly have a few points. Brief and itemized:
- The three-letter acronym you quoted was “DBT”, not “ABX” 
- ABXing can be done sighted or not. (“can” means “can”, not “should”)
- Yes, those prejudiced to believe that there is no difference, will be inclined to report no difference. But
(i) As we cannot, strictly speaking, prove negatives any more than we can disprove a “before this universe was born, I was incarnated as a Russell's teapot” (yes this is totally unsymmetric, that is well known ... and widely accepted as appropriate) – the conclusion from a negative is simply “Do not reject the null hypothesis”. As opposed to “Null hypothesis proven”, which is not valid.
(ii) We can remedy by introducing a comparison with known difference. (E.g., if test subject cannot differentiate between first-release and a brickwalled remaster, then you should not be surprised if they cannot identify a 96kb/s lossy of one of them.)

Overcoming the Perception Problem

Reply #109
There is little point to any listening test, blind or sighted, where the listener is pre-disposed to ignore any audible differences.

The onus is on those who say/believe that they hear an audible difference to demonstrate it in a double-blind test. Those who say there is no audible difference have no reason to take the test.

(Though, occasionally, people who don't think they hear an audible difference will take the test for the heck of it, or to satisfy their own curiosity, and in doing so sometimes prove to themselves and others that there is a barely audible difference).

Cheers,
David.

Overcoming the Perception Problem

Reply #110
My main interest in ABX software is to determine if a treatment I am considering is worthwhile or not. This is mainly in doing "restoration" of old recordings. There is generally, if not always, more than one way to deal with a problem.

As an easy example, consider declicking a recording made from an old LP, although declicking is not where I tend to do such comparison testing. At one extreme, one can proceed manually, click by click, trying to get the best result each time. At the other extreme, one can run a number of batch steps on the entire recording, without regard to the particular treatment of any individual click. Batch declicking ALWAYS modifies many transients that are not clicks. This can easily be verified objectively from the data, without regard to what it sounds like.

The point of my making the comparison is that it could mean the difference between twenty-five hours of careful, repetitive-stress-damage-producing work and a half hour of automated computer processing. The preliminary tests, to decide what to do, involve selecting a couple of short passages that seem likely to show differences, then treating them in two or more ways, then testing to find out if I can tell any difference in the finished products. As with testing lossy compression against an original, it is usually easy to show a physical difference. The question is whether or not that difference is audible.

Sometimes I am gratified to find that the fast easy way is just as good as the long hard way, although I always wonder about the possibility that other people might hear something I don't. There have been many times where I cannot identify any difference in the treatments. As far as I can tell, I make choices during the test because I have to say either X or Y in order to go forward. My random scores say the different approaches make for identical results, audibly speaking. This is the result I often want; it lets the work proceed with less effort and time.

However, on a number of occasions, although I can not identify any difference, I have come up with perfect matches every guess. This could happen randomly, even though the probability is very small. But is randomness a reasonable explanation for it to occur every once in a while -- for one individual?

Overcoming the Perception Problem

Reply #111
However, on a number of occasions, although I can not identify any difference, I have come up with perfect matches every guess. This could happen randomly, even though the probability is very small. But is randomness a reasonable explanation for it to occur every once in a while -- for one individual?
About 1-in-20 for p=0.05

Overcoming the Perception Problem

Reply #112
You certainly have a few points. Brief and itemized:
- The three-letter acronym you quoted was “DBT”, not “ABX” 
- ABXing can be done sighted or not. (“can” means “can”, not “should”)
- Yes, those prejudiced to believe that there is no difference, will be inclined to report no difference. But
(i) As we cannot, strictly speaking, prove negatives any more than we can disprove a “before this universe was born, I was incarnated as a Russell's teapot” (yes this is totally unsymmetric, that is well known ... and widely accepted as appropriate) – the conclusion from a negative is simply “Do not reject the null hypothesis”. As opposed to “Null hypothesis proven”, which is not valid.


Yes, I’m sorry, I omitted a step. I was talking about the way double-blind procedures are said to be applied to ABX and other analogous methods of conducting listening tests. The Russell's teapot argument has nothing to do, in my opinion, with how effectively you can purge your test result of influences caused by subjective reactions that are not correlated with the stimulus you are testing. In the case of listening tests, in my opinion, the usual way has logical flaws.
A DBT procedure implies that neither the examiner nor the examinees are aware of information that might influence (consciously or unconsciously) the result of the test. DBT procedures were developed for tests where “the mind” might influence the results but where you can also verify the results by direct observation of the examinee. That is impossible for perceptual listening tests, and their results (being perceptual) may also be influenced by almost everything. What I mean is that the level of “blindness” must be set to maximum, you have to cross-check your results and your test procedures, and if you use a statistical approach you need to be very rigorous.

Quote
(ii) We can remedy by introducing a comparison with known difference. (E.g., if test subject cannot differentiate between first-release and a brickwalled remaster, then you should not be surprised if they cannot identify a 96kb/s lossy of one of them.)


Yes, that is a way I was thinking about. But I have the feeling that even in highly regarded scientific circles (such as the AES) this is not the approach (though I may be proven wrong), and that everybody prefers to run a raw ABX test without too much worry. I think the attitude towards the whole thing is the one expressed, e.g., by 2Bdecided's last post, and in my opinion it is not a useful one.


Overcoming the Perception Problem

Reply #113
Indeed, DBT procedures are introduced for precisely that reason – it is well documented that placebo is too significant to be ignored.


I am no scientist nor technician (nor a native English speaker either, as you may guess, I am sorry) but I really don't understand how the ABX method is expected to ban “placebo” as a variable of the testing procedure. If I understand the way the tests are run, it is not “blind” at all. People know which are the A's and which are the B's, and they are asked to match them to X's.


The test is sighted for the A's and B's, but blind for the X's. Only the X's are scored.
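To illustrate that bookkeeping, a minimal sketch (the function names are invented for illustration; this is not an actual test harness): A and B are openly labelled, X is secretly one of them, and only the X identifications are scored.

Code:
import random

def abx_session(trials, identify_x):
    # identify_x(x_clip) hears only the hidden clip and must answer "A" or "B";
    # only these X identifications are scored
    correct = 0
    for _ in range(trials):
        truth = random.choice(["A", "B"])                       # which clip X really is
        x_clip = {"A": "audio of A", "B": "audio of B"}[truth]  # what the listener gets to hear
        correct += (identify_x(x_clip) == truth)
    return correct

def cannot_hear_it(x_clip):
    # the audio content of X gives this listener nothing to go on, so the answer is
    # effectively a coin flip, whatever they believe about A and B
    return random.choice(["A", "B"])

def hears_it(x_clip):
    # this listener's answer is determined by what X actually contains
    return "A" if x_clip == "audio of A" else "B"

random.seed(0)
print("guessing listener:", abx_session(10, cannot_hear_it), "/ 10 correct")
print("hearing listener: ", abx_session(10, hears_it), "/ 10 correct")

A listener who genuinely cannot hear the difference lands near 50% on the X's no matter what they believe about A and B, which is exactly the sense in which the scored part of the test is blind.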

Quote
If your inner belief is (possible placebo effect) that there is an “audible” difference between A's and B's and the difference is not “audible” you will fail to get a result beyond mere odd.


odd? I hope you mean random.

Random scores for identifying the X's are consistent with the idea that ABX is a DBT.

Quote
In this case we can affirm that the placebo effect is avoided (but at the cost of not being able to tell odd from a maybe statistically relevant result). But what if your inner belief is that there is no difference at all?


Often, the sighted part of the test is sufficient to dispel that notion.

Quote
I guess in this case placebo would affect the result of the test.


You can check that out by not telling the listeners what A & B actually are, or even by simply lying to them. Simple enough, and it's been done many times. Doesn't seem to improve the results.

Quote
My point is that this method does not seem “scientific” at all to me; it is completely “asymmetric”, as shown by the fact that in the best hypothetical case it can verify something only if it is obvious under test conditions, and it is unable to falsify any hypothesis, from the less than obvious to the absolutely impossible.


It appears to me that you don't understand the test. I sense a language problem.

You're staking your argument on the word obvious which seems to be used in a very vague way.

For example, aren't all reliably audible differences in some sense obvious?

The ABX critics will say that ABX works well enough for differences that are very, very obvious, but not at all for differences that are, in their judgement, merely obvious.

Your obvious is my subtle or vice-versa! ;-)

Overcoming the Perception Problem

Reply #114
There is little point to any listening test, blind or sighted, where the listener is pre-disposed to ignore any audible differences.


Right, and the listener training scheme I point out in an earlier post filters those people out.

Quote
The onus is on those who say/believe that they hear an audible difference to demonstrate it in a double-blind test. Those who say there is no audible difference have no reason to take the test.


Agreed.

Quote
(Though, occasionally, people who don't think they hear an audible difference will take the test for the heck of it, or to satisfy their own curiosity, and in doing so sometimes prove to themselves and others that there is a barely audible difference).


Agreed.

The sighted aspects of ABX can help that happen more often.

Overcoming the Perception Problem

Reply #115
odd? I hope you mean random.

Yes, I am sorry, please change the word "odd" in my post to "random guessing". Please do not think mine is an attack on the ABX method in itself; I just have doubts aroused by the way the tests are in many cases actually done, and maybe by the fact that I do not understand the procedure (and the language). So before I reply to this post saying something stupid, please give me a link to this:

Quote
Right, and the listener training scheme I point out in an earlier post filters those people out.

Overcoming the Perception Problem

Reply #116
odd? I hope you mean random.

Yes, I am sorry, please change the word "odd" in my post to "random guessing". Please do not think mine is an attack on the ABX method in itself; I just have doubts aroused by the way the tests are in many cases actually done, and maybe by the fact that I do not understand the procedure (and the language). So before I reply to this post saying something stupid, please give me a link to this:

Quote
Right, and the listener training scheme I point out in an earlier post filters those people out.




http://www.hydrogenaudio.org/forums/index....st&p=811818

Overcoming the Perception Problem

Reply #117
However, on a number of occasions, although I can not identify any difference, I have come up with perfect matches every guess. This could happen randomly, even though the probability is very small. But is randomness a reasonable explanation for it to occur every once in a while -- for one individual?
About 1-in-20 for p=0.05


The statistics course I took was interesting, and kind of fun, but it was a very long time ago. I've had no use for trying to remember any of it since then, except in very simple circumstances. With my usual ten trials for a sample, that comes out to p=0.001 that all 10 of my correct guesses were the result of chance, or likely to happen 1 time in 1000, no? If it happened with two consecutive ten-trial tests, each on a different album, it would be the product, 0.001 x 0.001, or 1 in 1,000,000?

I'm not sure what the calculation would be if, instead of consecutive tests, the two peculiar outcomes of ten-trial tests were separated by ten or twelve different albums for which the tests were more easily comprehended. I.e., instead of ten trials of correct guesses where I am unable to tell how I make the correct choice, each of the intervening album tests seemed reasonably clear: I can't tell any difference and the results are about 50/50, or the results are, more or less, ten perfect guesses, but I can tell why.

Overcoming the Perception Problem

Reply #118
With my usual ten trials for a sample, that comes out to p=0.001 that all 10 of my correct guesses were the result of chance, or likely to happen 1 time in 1000, no? If it happened with two consecutive ten-trial tests, each on a different album, it would be the product, 0.001 x 0.001, or 1 in 1,000,000?


The coin has a 50/50 chance of getting one right. Under the null (i.e., if you were guessing like a fair coin), the probability of getting N out of N right is 1/2^N. With N=10, the coin has a chance of 1/1024. For N=20, the coin has a chance of 1/(1024*1024), i.e. slightly less than one in a million.

If you get 9 of 10 right: there are 1024 different outcomes, one with 10/10 right and ten with precisely 9/10. The probability of getting 9 or more by coin-flipping is 11/1024, or just above 1 percent. (That is: if you are targeting a p=.01 threshold, decide to do 11 trials rather than 10. Also, 8/10 barely misses the .05 threshold.)
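For the record, a minimal sketch that just reproduces the arithmetic above with exact binomial tail sums:

Code:
from math import comb

def p_at_least(correct, trials):
    # probability, under pure guessing (fair coin), of scoring 'correct' or better
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print("10/10    :", p_at_least(10, 10))  # 1/1024
print("9+ of 10 :", p_at_least(9, 10))   # 11/1024, just above 1 percent
print("8+ of 10 :", p_at_least(8, 10))   # 56/1024, barely misses .05
print("20/20    :", p_at_least(20, 20))  # slightly less than one in a million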


I'm not sure what the calculation would be if, instead of consecutive tests, the two peculiar outcomes of ten-trial tests were separated by ten or twelve different albums for which the tests were more easily comprehended.


By “for which the tests were more easily comprehended”, do you mean “for which the differences are easier to catch” or ... ?

If you are testing fourteen albums, score 50/50 on thirteen of them and 9/10 on the fourteenth, that just misses the .05 threshold. Is your question then whether the fourteenth is different from the others?

Overcoming the Perception Problem

Reply #119
To try to state it more clearly.

First, I am talking about samples treated two or more different ways, so there are data differences,  but I don't know if there will be audible differences. I run ABX tests to find out what I can or can't hear.

I do a ten trial test on a sample. I can not (consciously) hear any difference, therefore I have no idea whether X is A or B. I make a choice based on how I feel about it at the moment, which is indeed rather vague and may or may not be identical to flipping a coin and using the result. Anyway, I get all ten correct.

There is some probability that I did this simply through random chance. Now I choose another sample. Let's make it from a different album (also recorded from an LP) to make it less likely that I am somehow biasing things with the album I just used. Also let's choose some different treatment to use on this sample, to make sure there isn't a lurking bias in how I prepared the two copies of the sample.

Now I run an ABX test with ten trials on this new material. Again, I can't hear any difference but I make an attempt to guess something. Again I get all ten correct, I know not how.

If looked at in isolation, the probability for this second test's results is the same as the probability for the first test's results. However, if both are random results, together the probability is much less. It is at least as small as getting 20 random correct guesses on one test of twenty trials, no? If I had run 200 independent tests to get those two peculiar ones, they might seem less peculiar, or perhaps I should say it might seem less likely that there is some unconscious but non-random factor operating.

The question presented in the last post has to do with such 10/10 tests happening every once in a while. How is the probability of N such results computed if there are five, six, or twelve (but not hundreds) of non-weird tests conducted between those getting this peculiar result? Is there any difference, considering all the tests done,  if all the weird ones occur consecutively compared to if they occur at random intervals among more probable results?

The more probable results have been defined, in the last post as (1) I get good scores because I can recognize the differences and (2) I get random scores because I can't recognize the difference. The "weird" tests are those where I can't identify a difference but get all correct.

People often do things without conscious intention and without conscious awareness. Parents respond in a quite unreasonable way to something their child does and are totally unaware that they are repeating the irrational emotional behavior of their own parents, who probably learned it from their parents, etc.

Emotions affect perception. Expectation and belief affect perception. The main concern here at HA is the tendency of these factors to produce perceptions that do not conform to sensory input. The ABX test is supposed to block that.

Too many of the "weird" results described above might be evidence of the opposite effect: the differences are not below the threshold of detection but for some reason are below the threshold of consciousness.

That may be a rather outré hypothesis, but that is the reason for my questions. Having experienced it myself from time to time (with no way to know if it is or is not really just random), I've often wondered about this when people report high identification scores without really saying whether or not they knew what they were hearing -- or maybe I just haven't paid close enough attention to what has been written.


Overcoming the Perception Problem

Reply #120
To try to state it more clearly.

First, I am talking about samples treated two or more different ways, so there are data differences,  but I don't know if there will be audible differences. I run ABX tests to find out what I can or can't hear.

I do a ten trial test on a sample. I can not (consciously) hear any difference, therefore I have no idea whether X is A or B. I make a choice based on how I feel about it at the moment, which is indeed rather vague and may or may not be identical to flipping a coin and using the result. Anyway, I get all ten correct.

There is some probability that I did this simply through random chance. Now I choose another sample. Let's make it from a different album (also recorded from an LP) to make it less likely that I am somehow biasing things with the album I just used. Also let's choose some different treatment to use on this sample, to make sure there isn't a lurking bias in how I prepared the two copies of the sample.

Now I run an ABX test with ten trials on this new material. Again, I can't hear any difference but I make an attempt to guess something. Again I get all ten correct, I know not how.


AFAIK, you are not doing a test with a guaranteed null outcome, like say a comparison of interconnects.

The results you've described could be attributed to listener learning.

The big question is what happens when you run the third and fourth sets of 10?

Overcoming the Perception Problem

Reply #121
If looked at in isolation, the probability for this second test's results is the same as the probability for the first test's results. However, if both are random results, together the probability is much less. It is at least as small as getting 20 random correct guesses on one test of twenty trials, no?


Let me address this first – I know you ask more interesting questions below. You have a null hypothesis that you are equivalent to a fair coin. The p-value is calculated under this hypothesis. But if the stopping time depends on the history so far, then you can no longer think in these 2^n terms. If you have prespecified "do 10 and then do 10 on another sample pair", then the probability that the coin would get everything right is 1/2^20, or about one in a million.

But – and I know this is out of line with your hypothetical results – if you fail the first test, then run a second, and then – because that turned out with a low “standalone” p-value – you stop, then what? That's a different story. To calculate the coin's probability, you need to specify the stopping rule precisely, and the easy way of doing that is to require the user to prescribe a number of experiments in advance: do this, then carry the result to your local statistician.
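To illustrate why the stopping rule matters, a minimal simulation (the "peek after every trial, give up after 30" rule and the run count are arbitrary illustration choices, not anything prescribed above): a pure guesser under a prespecified 10-trial test, versus the same guesser allowed to stop the moment the running score looks significant.

Code:
import random
from math import comb

def p_value(correct, trials):
    # exact one-sided binomial p-value against the fair-coin null
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def fixed_test(n=10, alpha=0.05):
    # prespecified number of trials, evaluated once at the end
    score = sum(random.random() < 0.5 for _ in range(n))
    return p_value(score, n) <= alpha

def peeking_test(max_trials=30, alpha=0.05):
    # stop as soon as the running score would clear the nominal threshold
    score = 0
    for t in range(1, max_trials + 1):
        score += random.random() < 0.5
        if p_value(score, t) <= alpha:
            return True
    return False

random.seed(1)
runs = 20000
print("false positives, fixed 10 trials:", sum(fixed_test() for _ in range(runs)) / runs)
print("false positives, peek-and-stop  :", sum(peeking_test() for _ in range(runs)) / runs)

The guesser's false-positive rate under the peek-and-stop rule comes out well above the nominal 0.05, which is precisely why the number of trials has to be fixed before you start.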




The question presented in the last post has to do with such 10/10 tests happening every once in a while. How is the probability of N such results computed if there are five, six, or twelve (but not hundreds) of non-weird tests conducted between those getting this peculiar result? Is there any difference, considering all the tests done,  if all the weird ones occur consecutively compared to if they occur at random intervals among more probable results?

The more probable results have been defined, in the last post as (1) I get good scores because I can recognize the differences and (2) I get random scores because I can't recognize the difference. The "weird" tests are those where I can't identify a difference but get all correct.



I'll present a setup, and it may or may not be what you have in mind:

Suppose that you have many distinct pairs: signal1A+signal1B, signal2A+signal2B, etc., up to signalNA+signalNB. Suppose that you ABX each pair “10 times” – call that a “pair” in the following. Calculate p-values, pair by pair. Then one question is: how likely is it that at least one of these pairs would have a certain “low” p-value? I.e., you want the distribution – under the null – of the smallest of p1, p2, ..., pN. You then want the appropriate quantile of this distribution as a benchmark. Again it is crucial to specify the test for each pair – and the total number of pairs – in advance. (Otherwise, you would get an N-dependent threshold m(N), and the ignorant or dishonest user could simply stop the first time it looks favourable.)

More generally, you might for each single pair #n form an alternative hypothesis Hn: “Detectable difference between the A and B of pair #n.” Then ask the questions:
- what is the number of “false positive” pairs reported if you calculate each p-value and cherry-pick?
- what should the threshold be to reject the “Every Hn false” null?
- what is the expected number of true Hn's, given the data?


There is a theory for testing multiple hypotheses simultaneously.  One not uncommon way is the Bonferroni correction, see http://en.wikipedia.org/wiki/Bonferroni_correction and the references therein.  Another approach is Schweder / Spjøtvoll (1982) (free version here).
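As a minimal sketch of the simplest of these corrections (using the fourteen-album example from a few posts back purely as an illustration): the chance that pure guessing produces at least one “weird” 10/10 pair, versus a Bonferroni-corrected per-pair threshold.

Code:
from math import comb

def p_at_least(correct, trials=10):
    # exact binomial tail probability under pure guessing
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

n_pairs = 14                         # e.g. fourteen album pairs, each ABXed over 10 trials
alpha = 0.05
per_pair_alpha = alpha / n_pairs     # Bonferroni-corrected per-pair threshold

p_perfect = p_at_least(10)                        # 1/1024 for any single pair
p_any_perfect = 1 - (1 - p_perfect) ** n_pairs    # at least one 10/10 by pure luck

print("single-pair p for 10/10        :", p_perfect)
print("P(some pair is 10/10 by chance):", p_any_perfect)
print("Bonferroni per-pair threshold  :", per_pair_alpha)
print("10/10 still significant after correction:", p_perfect <= per_pair_alpha)

So an isolated 10/10 survives a Bonferroni correction over fourteen pairs, whereas a lone 9/10 (p of about 0.011 on its own) would not – which is the sense in which the number of pairs has to be declared up front.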




People often do things without conscious intention and without conscious awareness.


Yes. They take notice when something “incredible” happens. Surely this cannot be a coincidence? Yes it can, if the number of trials is big – and did you count how many “attempts” Chance made at getting your attention?

If one can test again, then fine: do that. Use random/arbitrary discoveries as a basis for forming hypotheses, and then set up a new single test. (Does not work in the science of history ... unless you are so unfortunate that (i) people don't learn, and (ii) you have a hotline to a dictator who thinks your experiment idea sounds funny.)

Overcoming the Perception Problem

Reply #122
So it really isn't possible to say much, statistically, about some casual observations of what passes one by. One has to adopt some particular statistical model and collect data in accordance with the model's requirements. I suspect, in regard to this question, the most one could say after collecting enough data is that the number of 10/10 scores, where I believe I hear no difference between A and B, is either within normal variation or unusually common, depending upon how the numbers add up. If one wanted to actually get a handle on the idea of whether or not (some) people can hear, and respond to, small differences without being aware that they can hear them, one would have to come up with some more clever experiments.

Most of the time when I am convinced that I do not hear a difference, I give up part way through. Perhaps that indicates some kind of psychological bias against finding that there is (seems to be) a difference in the treatments. I think it just indicates boredom. I don't think I ever even bother to check the score for the trials I did complete. Those that I mentioned, 10 correct out of 10 when I have no idea which is which, are exceptions where I had some particular reason, or whim, to produce a score.

Overcoming the Perception Problem

Reply #123
Most of the time when I am convinced that I do not hear a difference, I give up part way through. Perhaps that indicates some kind of psychological bias against finding that there is (seems to be) a difference in the treatments. I think it just indicates boredom. I don't think I ever even bother to check the score for the trials I did complete. Those that I mentioned, 10 correct out of 10 when I have no idea which is which, are exceptions where I had some particular reason, or whim, to produce a score.


I see a different issue here. When we make a change we are hoping for a difference that we don't have to resort to high effort testing to hear.

I've done ABX tests that were positive for audible differences without ever actually hearing what I thought was really a difference. There was a technical difference that on a good day may have been large enough to hear, but it was on the borderline.  I walked away not exactly a fan of working to obtain that difference. ;-)

Overcoming the Perception Problem

Reply #124
So it really isn't possible to say much, statistically, about some casual observations of what passes one by. One has to adopt some particular statistical model and collect data in accordance with the model's requirements.


Well ... even when you cannot get the one and only true p-value out of a dataset without making explicit or implicit assumptions, the working practitioner need not be completely lost, of course. In sciences where you cannot redo experiments, and have to take the data you get, you can still do statistical analyses, but they will be more vulnerable to the assumptions behind the statistical model. For example, just because you cannot run the dinosaur age over again, that doesn't mean that statistics couldn't be a useful tool in paleontology. And it is hardly controversial to claim that the Wall Street crash 15 years and a week ago is sufficient to reject a hypothesis of random-walk Gaussian log-returns. Among financial analysts, there was a tongue-in-cheek expression, “this month's million-year event”; on the one hand, if you dig through your data looking for something that looks weird, you will find it, but on the other, using “worst day” as a test statistic isn't that far from what you could have chosen ex ante. You may of course argue that if you just pick up the ex post extreme events, you should have taken into account e.g. the mercantile exchange as well (if nothing happened there, there would be – grossly oversimplified – twice as many “normal” days, right?), but that only contributes a minor tweak to an insane p-value.

There are a couple of types of inferences we must avoid. One is “a man produces so many sperm cells that the p-value for you having precisely that genetic combination is one in a hundred million (even given that we know which ejaculation you were conceived from)” [no pun intended for the p here ...]. Fallacy: shouldn't there have been a winner in the raffle/tombola anyway? Another – suppose you have a lottery based on betting on a random draw of one number from 1 to (large) N. It isn't given that anyone will have bet on the correct one, but an ex ante probability (under the null) of someone winning requires some (statistical) knowledge of the bets. One bet? A billion bets? It matters.