## Discussion of NHK ultrasonic test


Link to the paper.

Disclaimer: I'm not a statistician. I'm an engineer who happens to use probability and statistics, probably a little more than other engineers.

First, let's understand their main experiment design. From the paper: "The tests were conducted on 36 subjects who evaluated each sound stimulus once or twice. Thus, each stimulus was evaluated 40 times in total." So there are 720 cells (36x20), and 800 trials were conducted (40x20). The average cell count is ~1.111; in other words, each stimulus was tested by each subject 1.111 times on average. This implies that a subject was tested on 22.222 stimuli on average (not all of them distinct, of course). The paper does not say how the "extra" 80 trials were conducted, but let's assume they were distributed randomly amongst the cells.
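The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch of the bookkeeping; the variable names are mine, and the count of 20 stimuli is taken from the 36x20 and 40x20 products above.

```python
# Design figures quoted from the paper; variable names are illustrative.
subjects = 36
stimuli = 20
evaluations_per_stimulus = 40

cells = subjects * stimuli                    # subject x stimulus combinations
trials = evaluations_per_stimulus * stimuli   # total trials conducted
mean_cell_count = trials / cells              # average trials per cell
trials_per_subject = trials / subjects        # average trials per subject
extra_trials = trials - cells                 # trials beyond one per cell

print(cells, trials, extra_trials)
print(round(mean_cell_count, 3), round(trials_per_subject, 3))
```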

The only reason I can see for a design like the above is to test some omnibus hypothesis. But none is tested in the paper. Instead they are doing row and column tests. The problem is that this (randomized) design seems very awkward for that purpose. Due to the 1.111 average cell count, for any row or column you get a repeated measures test, but with a large number of empty cells for the second measurement.

In a repeated measures test, if you choose to ignore the repeated part, you get less significant results. For an argument, look at the repeated measures t-test or ANOVA in a statistics book. Nothing in the paper indicates that they treated the row and column tests as repeated measures. Given the low ratio of repeated measures (11%), they probably did not.

The other aspect that strikes me is the lack of any attempt at significance level adjustment for this many tests. The probability of “scoring” at least once in the 5% alpha region in 36 independent Bernoulli trials is 1 - 0.95^36 ≈ 0.842 (the geometric distribution's CDF at 36).
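As a quick check of that number, here is the computation in Python. It assumes the 36 tests are independent and each has exactly a 5% chance of a false positive, which is the idealized case described above and not a claim about the actual NHK data.

```python
# Chance of at least one false positive among n_tests independent tests,
# each run at significance level alpha (idealized assumption).
alpha = 0.05
n_tests = 36

p_no_false_positive = (1 - alpha) ** n_tests
p_at_least_one = 1 - p_no_false_positive

print(round(p_at_least_one, 3))
```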

The tough question is: are these tests independent? The answer is no, in the sense that the dependent variables (the outcomes of the trials) depend on both the stimulus used and the subject being tested. So adjusting the alpha value is not reasonable without estimating the correlations.

Estimating the power of an omnibus test in the NHK design goes well beyond my statistical knowledge. You'd have to assume some sort of GLM, possibly semi-parametric. Then you'd have to guesstimate the covariance matrix before any power estimates could be done. So the power approximation would be difficult to get, and its accuracy would depend on the guesstimates.

What follows is an attempt to calculate power for one sound stimulus test, namely stimulus number 1. Since we don't know how many subjects were actually tested twice on that stimulus, I'll do the computation for the extreme case where each of the 40 trials comes from a different subject. This is actually impossible with 36 subjects, so it's more than the best case.

The observed success frequency for stimulus 1 was 0.625, i.e. 25/40 (if I got that number right from their graph). We can use this as an estimator for the true p, but it's better to vary p and get an idea of how it influences power. PASS gives:

Numeric Results when H0: P = P0 versus Ha: P = P1 > P0.

| Power | N | P0 | P1 | Target Alpha | Actual Alpha | Beta | Reject H0 If R>=This |
|---|---|---|---|---|---|---|---|
| 0.04614 | 40 | 0.5000 | 0.5050 | 0.05000 | 0.04035 | 0.95386 | 26 |
| 0.13260 | 40 | 0.5000 | 0.5500 | 0.05000 | 0.04035 | 0.86740 | 26 |
| 0.31743 | 40 | 0.5000 | 0.6000 | 0.05000 | 0.04035 | 0.68257 | 26 |
| 0.44057 | 40 | 0.5000 | 0.6250 | 0.05000 | 0.04035 | 0.55943 | 26 |
| 0.57208 | 40 | 0.5000 | 0.6500 | 0.05000 | 0.04035 | 0.42792 | 26 |
| 0.80745 | 40 | 0.5000 | 0.7000 | 0.05000 | 0.04035 | 0.19255 | 26 |
| 0.99208 | 40 | 0.5000 | 0.8000 | 0.05000 | 0.04035 | 0.00792 | 26 |

Report Definitions

Power is the probability of rejecting a false null hypothesis. It should be close to one.

N is the size of the sample drawn from the population. To conserve resources, it should be small.

Alpha is the probability of rejecting a true null hypothesis. It should be small.

Beta is the probability of accepting a false null hypothesis. It should be small.

P0 is the value of the population proportion under the null hypothesis.

P1 is the value of the population proportion under the alternative hypothesis.

So power is about 0.8 for p=0.7, and only 0.44 for the observed frequency of 0.625. Stimulus 1 had the highest observed frequency, so power is even lower for the rest. We used Bernoulli trials, so we assumed that there is one “true” probability that each trial on a given stimulus is successful, regardless of the ability of the subject attempting the trial. So this is just an approximation.
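For readers without PASS, the same kind of numbers can be obtained from an exact one-sided binomial test using only the Python standard library. This is a sketch under the same simplifying assumption (40 independent Bernoulli trials with a single p); the function names are mine, and I am assuming PASS computes exact binomial tail probabilities here, which is consistent with the "actual alpha" it reports.

```python
from math import comb

def upper_tail(n, r, p):
    """P(X >= r) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

def critical_value(n, p0, alpha):
    """Smallest r such that P(X >= r | p0) <= alpha."""
    for r in range(n + 2):  # r = n + 1 gives an empty tail, so this always returns
        if upper_tail(n, r, p0) <= alpha:
            return r

n, p0, target_alpha = 40, 0.5, 0.05
r_crit = critical_value(n, p0, target_alpha)
actual_alpha = upper_tail(n, r_crit, p0)

# Power is the probability of landing in the rejection region when p = p1.
power = {p1: upper_tail(n, r_crit, p1)
         for p1 in (0.505, 0.55, 0.60, 0.625, 0.65, 0.70, 0.80)}

print("reject H0 if R >=", r_crit, "actual alpha", round(actual_alpha, 5))
for p1, pw in power.items():
    print(p1, round(pw, 5))
```

Varying `n` in this sketch is an easy way to see how many trials per stimulus would be needed to reach a given power for a given true p.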

EDIT: removed the Chernoff bound since it was solving the wrong problem.