
Discussion of NHK ultrasonic test

Link to the paper.

Disclaimer: I'm not a statistician. I'm an engineer who happens to use probability and statistics, probably a little more than most engineers.

First, let's understand their main experiment design. From the paper: "The tests were conducted on 36 subjects who evaluated each sound stimulus once or twice. Thus, each stimulus was evaluated 40 times in total." So there are 720 cells (36x20), and 800 trials were conducted (40x20). The average cell count is ~1.111; in other words, each stimulus was tested by each subject 1.111 times on average. This implies that a subject was tested on ~22.222 stimuli on average (not all of them distinct, of course). There is no word on how the "extra" 80 trials were distributed, but let's assume they were assigned randomly amongst the cells.
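The design arithmetic above is easy to check; a small sketch (the 36, 20, and 40 figures are from the paper, the variable names are mine):

```python
# Design counts from the NHK paper: 36 subjects x 20 stimuli,
# each stimulus evaluated 40 times in total.
subjects, stimuli, evals_per_stimulus = 36, 20, 40

cells = subjects * stimuli               # 720 subject/stimulus cells
trials = evals_per_stimulus * stimuli    # 800 trials overall
avg_cell_count = trials / cells          # ~1.111 evaluations per cell
trials_per_subject = trials / subjects   # ~22.222 trials per subject
extra_trials = trials - cells            # 80 "extra" repeated trials

print(cells, trials, round(avg_cell_count, 3),
      round(trials_per_subject, 3), extra_trials)
```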

The only reason I can see why someone would come up with a design like the above is to test some omnibus hypothesis. But none is tested in the paper. Instead they are doing row and column tests. The problem is that this (randomized) design seems very awkward for that purpose: due to the 1.111 average cell count, any row or column test is a repeated measures test, but with a large number of empty cells for the 2nd measurement.

In a repeated measures test, if you choose to ignore the repeated part, you get less significant results; for the argument, look up the repeated measures t-test or ANOVA in a statistics book. There is no word in the paper that they treated the row and column tests as repeated measures. Given the low ratio of repeated measures (11%), they probably did not.

The other aspect that strikes me is the lack of any attempt at significance level adjustment for this many tests. The probability of "scoring" at least once in the 5% alpha region in one of 36 independent tests is 1 - 0.95^36 ≈ 0.842.
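That familywise false-positive probability is one line to compute (a sketch, assuming the 36 tests really were independent, which the next paragraph questions):

```python
alpha, n_tests = 0.05, 36
# P(at least one rejection under H0) = 1 - P(no rejections in any test)
familywise = 1 - (1 - alpha) ** n_tests
print(round(familywise, 3))  # ~0.842
```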

The tough question is: are these tests independent? The answer is no: the dependent variables (the outcomes of the trials) depend on both the stimulus used and the subject being tested. So adjusting the alpha value is not reasonable without estimating the correlations.

Estimating the power of an omnibus test in the NHK design goes well beyond my statistical knowledge. You'd have to assume some sort of GLM, possibly semi-parametric, then you'd have to guesstimate the covariance matrix before any power estimates can be done. So the power approximation would be difficult to get, and its accuracy would depend on the guesstimates.

What follows is an attempt to calculate power for one sound stimulus test, namely stimulus number 1. Since we don't know how many subjects were actually tested twice on that stimulus, I'll do the computation for the extreme case where each of the 40 trials came from a distinct subject. With only 36 subjects this is actually impossible, so the result is even better than the best case.

The observed success frequency for stimulus 1 was 0.625, i.e. 25/40 (if I read that number correctly from their graph). We can use this as an estimator for the true p, but it's better to vary p and get an idea of how it influences power. PASS gives:

Code:
Numeric Results when H0: P = P0 versus Ha: P = P1 > P0.
                                     Target     Actual                Reject H0
Power       N    P0        P1        Alpha      Alpha      Beta       If R>=This
0.04614    40    0.5000    0.5050    0.05000    0.04035    0.95386    26
0.13260    40    0.5000    0.5500    0.05000    0.04035    0.86740    26
0.31743    40    0.5000    0.6000    0.05000    0.04035    0.68257    26
0.44057    40    0.5000    0.6250    0.05000    0.04035    0.55943    26
0.57208    40    0.5000    0.6500    0.05000    0.04035    0.42792    26
0.80745    40    0.5000    0.7000    0.05000    0.04035    0.19255    26
0.99208    40    0.5000    0.8000    0.05000    0.04035    0.00792    26

Report Definitions
Power is the probability of rejecting a false null hypothesis. It should be close to one.
N is the size of the sample drawn from the population. To conserve resources, it should be small.
Alpha is the probability of rejecting a true null hypothesis. It should be small.
Beta is the probability of accepting a false null hypothesis. It should be small.
P0 is the value of the population proportion under the null hypothesis.
P1 is the value of the population proportion under the alternative hypothesis.


So power is 0.8 for p=0.7, and only 0.44 for the observed frequency of 0.625. Stimulus 1 had the highest observed frequency, so power is even lower for the rest. We used Bernoulli trials, i.e. we assumed there is one "true" probability that each trial on this stimulus succeeds, regardless of the ability of the subject attempting it. So this is just an approximation.
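The PASS table can be reproduced with an exact one-sided binomial power calculation; a sketch using scipy rather than PASS (the threshold search and variable names are mine):

```python
from scipy.stats import binom

n, p0, target_alpha = 40, 0.5, 0.05

# Smallest rejection threshold r with P(X >= r | p0) <= target alpha.
r = next(r for r in range(n + 1) if binom.sf(r - 1, n, p0) <= target_alpha)
actual_alpha = binom.sf(r - 1, n, p0)
print(r, round(actual_alpha, 5))  # 26, 0.04035

# Power at each alternative p1 is P(X >= r | p1).
for p1 in (0.505, 0.55, 0.60, 0.625, 0.65, 0.70, 0.80):
    print(f"p1={p1:.3f}  power={binom.sf(r - 1, n, p1):.5f}")
```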

EDIT: removed the Chernoff bound since it was solving the wrong problem.
The earth is round (P < 0.05).  -- Cohen J., 1994


Reply #1
Does this mean that the ultrasonic stimulus did not affect most of the subjects?
Can you explain the conclusion of your analysis?


Reply #2
My understanding of what gaboo is saying is that:

- The type II error of the tests is high, which (very roughly) means that there is a high chance of falsely claiming no detectable difference
- The relatively haphazard statistical methods reduce the significance of the results considerably

IOW, NHK says more or less nothing one way or another. Did I interpret you correctly gaboo?


Reply #3
Quote
My understanding of what gaboo is saying is that:

- The type II error of the tests is high, which (very roughly) means that there is a high chance of falsely claiming no detectable difference
- The relatively haphazard statistical methods reduce the significance of the results considerably

IOW, NHK says more or less nothing one way or another. Did I interpret you correctly gaboo?


Yes. The power I've guesstimated (big time) seems too low to detect subtle effects, especially if we take the observed ABX success frequencies as indicators of the true probability of detecting a difference. The guesstimated power does look good for busting any night-and-day kind of claims.

I've been looking at a way to get a stronger result in that direction from the published data, something like "the probability that any new individual testing those 20 stimuli [with that equipment] may guess with an average frequency above 0.9 (MFV) is less than 0.05 (p-value)". The H0 for this would be: mf1 >= MFV or mf2 >= MFV or ... or mfi >= MFV, with i=1:20; the alternative hypothesis H1 would be: for all i=1:20, mfi < MFV. I think I can do this with an omnibus test: a multivariate GLM using a Dunnett contrast, thus avoiding any Bonferroni p-value adjustments. To be able to make the claim about any other subject, I'd have to treat the subjects as a random effect; this only affects how the sums of squares are calculated.

Ignoring the partial multiple measurements issue, I'd have to fabricate a random 36x20 matrix with elements in {0,1} whose row and column averages (marginal means) equal the per-subject and per-stimulus observed frequencies from the paper(*). This would take some work. But this is not the main problem.

I have a methodology problem with this approach. I'd have to put a constant vector MF = (MFj = MFV), j=1:20 among the observed values, i.e. it would be the 37th factor level, and the Dunnett contrast would compare all means against the mean of MF, which is exactly MFV. The issue is precisely that MFV is a constant, not drawn from a random variable. None of the references I looked at says this is plainly wrong, but I didn't find any examples doing it either. I suspect I'd have to do something special, e.g. adjust the degrees of freedom. I'm not willing to do the work and post the result unless I'm sure I'm doing it right. If anyone reading this has a clue on this issue, please let me know.

_______
(*) To be even closer to the NHK paper, I'd need to make the cells have a value in {0, 0.5, 1, 2} and a count in {1, 2}. The means would become weighted means, and I’d have to impose the additional constraint that all cell counts must sum up to 800. Given the low repeated measures (~11%) in their experiment, dealing with weighted sums is probably not worth the hassle.


Reply #4
I had the wrong approach to the problem. But so did the authors. This is proof that engineers need to take a (graduate-level) course in multivariate analysis before being allowed to play social scientist.

Here is how to model the problem as a multivariate variable. The individual responses to the d=20 stimuli are samples from a d-dimensional distribution. We cannot allow partial responses, i.e. each subject must be tested on all 20 stimuli an equal number of times. So we can only use 36 samples with the NHK data. It is possible to use this model with repeated measures, but those have to be complete across the 20 stimuli.

In order to simplify the discussion, I'll assume the data has MVN'ity, i.e. the samples come from a multivariate normal distribution with a (d-dimensional) mean vector m and a dxd variance-covariance matrix s.

In the multivariate case, the confidence interval becomes a confidence region (CR). For MVN, this is a d-dimensional ellipsoid with axes parallel to the eigenvectors of s, and with elongations along these axes proportional to the square roots of the eigenvalues of s. The volume of the ellipsoid depends on the confidence level alpha.
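A small sketch of that geometry (the 2x2 covariance matrix below is made up for illustration): the ellipsoid's axes and relative half-lengths come straight from the eigendecomposition of s.

```python
import numpy as np

# A toy 2x2 covariance matrix (made up) to illustrate the CR geometry.
s = np.array([[2.0, 0.8],
              [0.8, 1.0]])

# Symmetric matrix -> real eigenvalues, orthogonal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(s)

# Ellipsoid axes lie along the eigenvectors; half-lengths are
# proportional to the square roots of the eigenvalues.
half_lengths = np.sqrt(eigvals)
print(half_lengths)
```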

The equivalent of the t-test in d dimensions is Hotelling's T2 (T-square) test. It simply tests whether a vector m0 falls within the CR ellipsoid. To do the basic ABX test ("is this by chance?"), we have to test whether the 20-dimensional ellipsoid includes the d-dimensional point with all coordinates equal to 0.5. This is only one test, so it does not require any Bonferroni adjustments.
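A minimal sketch of that test, assuming complete data in an n x d matrix X (the function name is mine; for the NHK case X would be the 36x20 matrix of per-cell outcomes and m0 a vector of twenty 0.5s):

```python
import numpy as np
from scipy import stats

def hotelling_t2(X, m0):
    """Test H0: the population mean vector equals m0, assuming MVN data."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)            # unbiased sample covariance
    diff = xbar - np.asarray(m0, float)
    t2 = n * diff @ np.linalg.solve(S, diff)
    f_stat = (n - d) / (d * (n - 1)) * t2  # T2 converts to an F(d, n-d) statistic
    p_value = stats.f.sf(f_stat, d, n - d)
    return t2, p_value
```

If the sample mean is exactly m0, the statistic is 0 and the p-value is 1, as expected.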

In order to calculate confidence intervals for each component of the mean vector, we have to enclose the ellipsoid in the tightest-fitting d-dimensional bounding box with sides parallel to the d axes. This bounding box encloses a volume with a lot of "false negatives", i.e. it has less power than the ellipsoid. The nice thing about it is that its dimensions yield confidence intervals our brains can comprehend. These are called T2 intervals. They also let us check whether the CR is included in a certain region of the d-dimensional space, for instance to verify that all components of the mean are less than a given value. Such verifications do not require Bonferroni adjustments, because the confidence level for the entire T2 box has already been set.

The bad news is that T2 intervals are very conservative; even in the two-dimensional case they are more conservative than Bonferroni-adjusted intervals. I don't know how the relationship evolves as the dimension increases.
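The 2D claim can be checked numerically; a sketch comparing the two interval half-width multipliers for d=2 and n=36 (n chosen to match the NHK subject count, the rest is standard textbook formulas):

```python
import math
from scipy import stats

n, d, alpha = 36, 2, 0.05

# Simultaneous T2 interval multiplier: sqrt(d(n-1)/(n-d) * F_{1-alpha}(d, n-d)).
t2_mult = math.sqrt(d * (n - 1) / (n - d) * stats.f.ppf(1 - alpha, d, n - d))

# Bonferroni multiplier: two-sided t quantile at alpha/d, df = n-1.
bonf_mult = stats.t.ppf(1 - alpha / (2 * d), n - 1)

# The T2 multiplier is larger, i.e. T2 intervals are wider (more conservative).
print(round(t2_mult, 3), round(bonf_mult, 3))
```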

There are non-parametric equivalents of Hotelling's T2.

I have to go to work now, to be continued.


Reply #5
Well, that's just great; looks like my evil designs for running my own complicated listening tests are shot.  Could you explain in more detail why you decided to go with such a multivariate analysis, besides the fact that it does appear to be a bit cleaner (which I do agree with)? And what do I need to learn in order to be able to follow you?


Reply #6
Quote
Well, that's just great; looks like my evil designs for running my own complicated listening tests are shot.  Could you explain in more detail why you decided to go with such a multivariate analysis, besides the fact that it does appear to be a bit cleaner (which I do agree with)? And what do I need to learn in order to be able to follow you?


I don't have time to post much now, but the above model is not as good as I thought. MVN'ity is violated big time if the cells are in {0,1}, and the T2 intervals are too large to be useful. If you do repeated measurements in each cell, in the 2D case the distribution is going to look like a bunch of 4 little mounds instead of a single big one. So, we have to consider (yes) even more dimensions: one for each cell, so dxk dimensions. The extra step is to notice that in each cell we are estimating a probability. I think these were the reasons that led Rasch to formulate the problem in log odds. Rasch's logit model for counts is simply logit(outcome_ij) = bi - dj, where bi is the ability of subject i and dj is the difficulty of stimulus j. This is a GLM with the logit link function. It's an MLE problem, so I have no idea right now whether, besides the cell estimates, you can get any CRs. It may be possible to get CRs indirectly, via CRs for the parameters. I'll get a Rasch model book tomorrow.
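To make the model concrete, a minimal sketch of the cell probability it implies (the parameter values below are made up for illustration):

```python
import math

def rasch_p(ability, difficulty):
    """P(success) in one cell under the Rasch model: logit(p) = b_i - d_j."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Ability equal to difficulty -> a coin flip.
print(rasch_p(1.0, 1.0))  # 0.5
# A more able subject on the same stimulus succeeds more often.
print(rasch_p(2.0, 1.0) > rasch_p(0.5, 1.0))  # True
```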


Reply #7
Problem solved. Look at the graph here for n=800 and unequal probabilities.

For an explanation of why this is correct, look here.


Reply #8
So if I understand your conclusion right:

1) The fact that each listener is working off a different p-value is not a problem, as long as the sample size is large (as described on the graph)
2) No overly complicated multidimensional test is necessary - something like repeated measures t-test or ANOVA ought to be OK
3) The test still needs more samples per listener to get really good results


Reply #9
Quote
So if I understand your conclusion right:

1) The fact that each listener is working off a different p-value is not a problem, as long as the sample size is large (as described on the graph)


The probability can be (and I suspect is) different for each trial, that is, for each listener/stimulus combination.

Quote
2) No overly complicated multidimensional test is necessary - something like repeated measures t-test or ANOVA ought to be OK


The power I calculated is for an omnibus binomial test of all 800 trials against p0=0.5. It does not account for replication at all; accounting for replication would yield more power, but that's not easily tractable for their design. If the test succeeds (that is, H0: pi=p0 for all i=1..800 is rejected), it follows that at least one listener/stimulus pair had a pi greater than 0.5. It won't tell you which pair(s) though.

Keep in mind that no such test was performed on the NHK data, but looking at their marginal graphs I have reason to believe it would actually fail to reject this omnibus null hypothesis. To actually perform the test, one would have to reverse engineer the raw numbers from the graph in fig. 4, i.e. map from 0-100 back to 0-40, sum them to get the total success count x, and perform a one-sided binomial test of H0: p = 0.5 (expected x = 400) for n=800. The hard part is getting the raw numbers accurately from that tiny graph!
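A sketch of what that test would look like once x is recovered (the value of x below is purely hypothetical, a placeholder for the count that would be read off fig. 4):

```python
from scipy.stats import binom

n, p0 = 800, 0.5
x = 410  # HYPOTHETICAL total success count; the real one must come from fig. 4

# One-sided exact binomial test: P(X >= x | n, p0).
p_value = binom.sf(x - 1, n, p0)
print(p_value > 0.05)  # True: this made-up x fails to reject H0
```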

Quote
3) The test still needs more samples per listener to get really good results


Depends on what you mean by good results. The omnibus binomial is the equivalent of ANOVA, if you like. If the omnibus is rejected, you know that something is going on "not by chance", but you cannot tell which pi(s) is/are the reason. In order to identify those (with statistical significance), you do indeed need more trials in each "cell" (each subject/stimulus pair), or you need to make additional assumptions and apply a Rasch model.

An example: the two "cell" results 13/20 and 14/20 would both fail to reject at alpha=0.05, but the omnibus 27/40 rejects at alpha=0.05.
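These three p-values are easy to verify with an exact one-sided binomial test (a scipy sketch):

```python
from scipy.stats import binom

# One-sided p-values P(X >= x) under H0: p = 0.5.
p_13_of_20 = binom.sf(12, 20, 0.5)  # ~0.132, fails to reject at 0.05
p_14_of_20 = binom.sf(13, 20, 0.5)  # ~0.058, fails to reject at 0.05
p_27_of_40 = binom.sf(26, 40, 0.5)  # ~0.019, rejects at 0.05

print(p_13_of_20 > 0.05, p_14_of_20 > 0.05, p_27_of_40 < 0.05)
```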

 