
p-values: Sum up + proposal

Hi.

I've been looking into statistics of ABX tests under different conditions. What we refer to as p-value only gives correct results for the "probability to reach a certain score (or better) by random guessing" if the number of trials is fixed before the test starts. In this thread there's more information.

Let's repeat some basics:

This table shows the classical p-values. Moving to the right means more total trials, moving down means more wrong trials.

[Picture (1)]

The p-values are calculated using Pascal's triangle. For every trial there are 2 possibilities, right or wrong (imagine tossing a coin). For 2 trials there are 4 possibilities (r-r, r-w, w-r, w-w), ..., for n trials there are 2^n possibilities, represented by the blue numbers in the next picture. A correct trial ("r") is represented by the green arrow, a wrong one by the red arrow. These two arrows can be regarded as the only allowed directions of 'movement' through the triangle. The blue line is one possible way to reach a 4/6 score. The number 15 at the end of this line shows that there are 15 possible ways to reach 4/6, out of 64 total 'ways' for 6 'movements'. So the probability of reaching exactly 4/6 is 15/64. The p-value for a 4/6 score is calculated by adding this to the probabilities of all x/6 results with x>4, i.e. 5/6 and 6/6, so p-value(4/6) = (15+6+1)/64 = 22/64 ≈ 0.344.


[Picture (2)]
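
The Pascal's triangle calculation above can be sketched in a few lines of Python. This is only an illustration; the function name `p_value` is my own choice and not taken from any ABX program.

```python
from math import comb

def p_value(correct, trials):
    """Probability of scoring `correct` or better in `trials` coin-flip guesses."""
    # Row `trials` of Pascal's triangle: comb(trials, k) paths end with k correct.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

print(p_value(4, 6))  # (15 + 6 + 1) / 64
```

This value is only valid if the number of trials (here 6) was fixed before the test started.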

So far, so good. The explanation why this doesn't work as it should follows soon in a separate post.
Let's suppose that rain washes out a picnic. Who is feeling negative? The rain? Or YOU? What's causing the negative feeling? The rain or your reaction? - Anthony De Mello

Reply #1
These 3 pictures will help to explain the problem:



Edited: "probability that you're guessing" replaced with "probability that you could get that score by guessing"

Reply #2
One solution that several of us discussed in 2001 was to create ABX "profiles" designed to give a reasonable number of max trials (for example 28), and a reasonable number of places where the program automatically stops.

See my summary post from the massive thread here:

http://www.hydrogenaudio.org/forums/index....indpost&p=32170

Quote
1. The test will automatically stop if the following points are reached:

6 of 6
10 of 11
10 of 12
14 of 17
14 of 18
17 of 22
17 of 23
20 of 27
20 of 28

2. The program will display overall alpha values after each of the above stop points has been achieved. Also, the overall alpha values will be displayed regardless of whether the test stops or not at the following (look) points: trials 6, 12, 18, 23, and 28.

(The earlier the test is terminated when the listener passes, the lower the overall alpha is.)

3. The program will display the number correct after each trial is completed.

4. The test will automatically stop if 9 incorrect are achieved.


ff123
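
For readers who want to check a profile like the one quoted above, here is a rough sketch of how the overall alpha (the chance that a pure guesser passes) could be computed. It assumes that a stop point "k of n" means "k or more correct after n trials" and that reaching the wrong-answer limit ends the test as a fail; the function name `overall_alpha` and the `PROFILE` constant are made up for this example.

```python
def overall_alpha(stop_points, max_wrong=None):
    """Chance that a coin-flipping guesser passes a sequential ABX test.

    stop_points: (correct, trials) pairs; the test is passed as soon as the
    score after `trials` trials is `correct` or better (an assumption).
    max_wrong: the test is failed once this many wrong answers are reached.
    """
    needed = {}
    for correct, trials in stop_points:
        needed[trials] = min(correct, needed.get(trials, correct))
    alive = {0: 1.0}  # number correct -> probability of still being in the test
    passed = 0.0
    for n in range(1, max(needed) + 1):
        nxt = {}
        for c, pr in alive.items():
            for c2 in (c, c + 1):  # next answer wrong / right, 50% each
                nxt[c2] = nxt.get(c2, 0.0) + pr / 2
        alive = {}
        for c, pr in nxt.items():
            if n in needed and c >= needed[n]:
                passed += pr          # stop point reached: test passed
            elif max_wrong is None or n - c < max_wrong:
                alive[c] = pr         # keep going
    return passed

# The 28-trial profile quoted above, with the "9 incorrect" fail rule.
PROFILE = [(6, 6), (10, 11), (10, 12), (14, 17), (14, 18),
           (17, 22), (17, 23), (20, 27), (20, 28)]
print(overall_alpha(PROFILE, max_wrong=9))
```

Passing only the first few stop points gives the overall alpha for a test that ends early, which is why an early pass corresponds to a lower value.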

Reply #3
Quote
The goal is: no matter how long the test is going to take, the c-value must not become higher than e.g. 0.05. Every stop point will 'consume' a part of this c-value. It's necessary to make sure that adding the probabilities of each stop point, the sum can never be bigger than the c-value we want to reach (here 0.05). A simple approach for something like this:

2^(-1) + 2^(-2) + 2^(-3) + ... + 2^(-n) < 1 , no matter how big n gets.

What will happen if the listener does, say 6 failed ABX trials, then (almost) all following trials are successful? Would it ever be possible to bring the c-value down again?
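
To make the quoted 2^(-1) + 2^(-2) + ... idea concrete: if the i-th stop point may consume at most the target c-value times 2^(-i), the total can never exceed the target, however many stop points there are. A small sketch (the name `geometric_budget` is just for illustration):

```python
def geometric_budget(target, stop_points):
    """Share of `target` the i-th stop point may consume: target * 2^-i."""
    return [target * 2.0 ** -i for i in range(1, stop_points + 1)]

shares = geometric_budget(0.05, 20)
print(shares[:3])    # first stop point gets half, the next a quarter, ...
print(sum(shares))   # 0.05 * (1 - 2**-20), always below 0.05
```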

Reply #4
Quote
Quote
The goal is: no matter how long the test is going to take, the c-value must not become higher than e.g. 0.05. Every stop point will 'consume' a part of this c-value. It's necessary to make sure that adding the probabilities of each stop point, the sum can never be bigger than the c-value we want to reach (here 0.05). A simple approach for something like this:

2^(-1) + 2^(-2) + 2^(-3) + ... + 2^(-n) < 1 , no matter how big n gets.

What will happen if the listener does, say 6 failed ABX trials, then (almost) all following trials are successful? Would it ever be possible to bring the c-value down again?

Sure. How low the c-value can become after a large number of trials depends only on the 'stop points'. E.g. if you want to reach a c-value < 0.01 and start with 6 wrong trials, it could look like this (this example is not calculated with the 2^(-1) + ... method, but the result is similar):

Maximum number of trials: 40
Stop points with p-value < 0.003:
9/9
12/13
14/16
16/19
18/22
20/25
21/27
23/30
25/33
26/35
28/38

In your case, if you reach
26/35 (6 wrong out of 6, then 26/29) or
28/38 (6 wrong out of 6, then 28/32)
your final c-value is still < 0.01

With the "2^(-1) + ..." method, you can reach the c-value you want but the number of trials is not limited. For a final c-value < 0.01 the stop points would be:
8/8
11/12
13/15
...
(I have to calculate these values manually because I haven't had time yet to add this to my little program.)

Reply #5
Quote
Schnofler's thought is probably right. If a tester is allowed to watch c-values and stop the test based on them, we would need 'corrected c-values', 'corrected corrected c-values' ...

Would the corrected, corrected, corrected × 10 value approach a particular value? Could this not be an asymptote? Couldn't we use calculus to find this out instead of using simplistic hacks? Lazy?
gentoo ~amd64 + layman | ncmpcpp/mpd | wavpack + vorbis + lame

Reply #6
Quote
Quote
Schnofler's thought is probably right. If a tester is allowed to watch c-values and stop the test based on them, we would need 'corrected c-values', 'corrected corrected c-values' ...

Would the corrected, corrected, corrected × 10 value approach a particular value? Could this not be an asymptote? Couldn't we use calculus to find this out instead of using simplistic hacks? Lazy?

The ABX "profile" sidesteps this issue by specifying maximum trials allowable.  If the ABX does not pass after this max, then it is automatically failed.

28 trials max was one profile design, chosen to allow a reasonable number of trials, but other profiles can be designed with higher max trials if desired.  Keep in mind that the higher the max trials in the profile, the more difficult that profile will be to pass.

ff123

Reply #7
Ok, I guess I should say something on this subject, too. The problem is, the really clean solutions always make the whole testing procedure less comfortable or more complicated.

Not showing the listener his results until some point he specified in advance would make it extremely easy to calculate a precise "probability that you were guessing" (just the p-value we use now), but it would also be a major pain in the ass for the listener.

ff123's ABX "profiles" are a much better solution, but they would still make testing more complicated than it is now. Especially in ABC/HR tests I like the possibility to just start an ABX, try a few times, give up or try some more, stop whenever I want to, etc. First choosing a profile, not knowing your score until you reach the next stop point, having to stop if max trials is reached: all this would make the test a lot less comfortable for the listener.

tigre, I haven't really made up my mind about the approach you describe in the second half of your second post. I understand how you do what you want to do, but I didn't understand how this solves the problem. Could you try to clarify?

So, since my contribution to this discussion so far mainly consists of indecisiveness, I decided to make something "useful": a program that can calculate the corrected-corrected-corrected-etc. c-value. You specify the number of total and correct trials and a "depth", that is, the number of "corrections" (where a depth of 1 is the normal p-value). To answer music_man_mpc's question: Yes, of course the values approach a certain limit (they have to, since the sequence is monotonically increasing and has 1 as an upper bound). It would be nice to have a closed form of the limit function, but I guess that won't be easy (in the current form the definition of the sequence is terribly recursive). However, empirically, it seems that after a certain number of correction iterations the value actually remains constant, so it's possible to calculate the limit even if we don't have a nice function for it.
The limit function p(n,c) is characterized by the following property: p(n,c) is exactly the probability of reaching a point (n',c') with n'<=n and p(n',c')<=p(n,c). That's why the argument "but the listener could have stopped as soon as he got a value <=p and continued otherwise" doesn't hold here. Sure, he could have stopped, but the chances of reaching such a point with the same or a better c-value (meaning corrected, corrected, etc. p-value) than he has now, are exactly the same as the c-value that is shown at the moment.
That would kind of solve the problem, since we could freely show the listener his c-value all the time, and ABXing would be the same as before, only the p-values would be a bit higher than usual.
The obvious problem is, what the heck *are* these values? I don't have a clue. They are the result of some mysterious calculations, but do they have anything to do with the "probability that you were guessing"? Well, I don't know, maybe someone more knowledgeable can shine some light on this.
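
One way the iterated correction described in this post might be coded is sketched below. This is a guess at the recursion, not the actual program: it assumes that the value at depth d+1 for a score is the probability that a guesser reaches, within the same number of trials, any score whose depth-d value is less than or equal to the depth-d value of the current score. The function names are invented, and exact fractions are used so that the "less than or equal" comparison is not disturbed by rounding.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb

def p_value(correct, trials):
    """Classical p-value as an exact fraction (depth 1)."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return Fraction(tail, 2 ** trials)

def hit_probability(is_stop, max_trials):
    """Chance that a guesser reaches a score with is_stop(n, c) true in time."""
    alive = {0: Fraction(1)}  # number correct -> probability of still running
    stopped = Fraction(0)
    for n in range(1, max_trials + 1):
        nxt = {}
        for c, pr in alive.items():
            for c2 in (c, c + 1):
                nxt[c2] = nxt.get(c2, Fraction(0)) + pr / 2
        alive = {}
        for c, pr in nxt.items():
            if is_stop(n, c):
                stopped += pr
            else:
                alive[c] = pr
    return stopped

@lru_cache(maxsize=None)
def corrected_value(correct, trials, depth):
    """Depth 1 is the p-value; each further depth applies one 'correction'."""
    if depth == 1:
        return p_value(correct, trials)
    target = corrected_value(correct, trials, depth - 1)
    return hit_probability(
        lambda n, c: corrected_value(c, n, depth - 1) <= target, trials)

for depth in (1, 2, 3):
    print(depth, float(corrected_value(11, 14, depth)))
```

Under these assumptions the limit could be approximated by raising the depth until the printed value stops changing.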

Reply #8
Thanks for feedback so far.

To clarify/mention an aspect that hasn't been made totally clear so far:

The c-values / corrected c-values / ... are all calculated the same way:
They use the stop points (i.e. the ABX scores where the test would have stopped) and the actual score that is reached. What differs, depending on different approaches (c-value, corrected^n c-value, "asymptote approach", ...) are the stop points.

The problem is, that without any information before the test starts, the ABX software has to make assumptions about the stop points. Example:

Let's say a score of 11/14 is reached in an ABX test. The tester can see the scores + p-values he has reached and decides based on them when to stop the test (basic c-values approach).
1st case: His stop condition is a p-value of <= 0.031. The stop points are:
6/6, 8/9, 10/12, 11/14, the c-value is 0.047
2nd case: stop condition = p-value <= 0.032. Stop points:
5/5, 8/9, 10/12, 11/14, the c-value is 0.059
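
The two cases above can be reproduced with a short sketch. It assumes the tester stops at the first score whose classical p-value is at or below his stop condition; the c-value is then the chance that a guesser hits such a score within the given number of trials. The function names are illustrative only.

```python
from math import comb

def p_value(correct, trials):
    """Classical p-value: chance of `correct` or better out of `trials`."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def c_value(stop_p, trials):
    """Chance that a guesser sees a p-value <= stop_p within `trials` trials."""
    alive = {0: 1.0}  # number correct -> probability of not having stopped yet
    stopped = 0.0
    for n in range(1, trials + 1):
        nxt = {}
        for c, pr in alive.items():
            for c2 in (c, c + 1):
                nxt[c2] = nxt.get(c2, 0.0) + pr / 2
        alive = {}
        for c, pr in nxt.items():
            if p_value(c, n) <= stop_p:
                stopped += pr   # the tester would have stopped here
            else:
                alive[c] = pr
    return stopped

print(round(c_value(0.031, 14), 3))  # 1st case: 0.047
print(round(c_value(0.032, 14), 3))  # 2nd case: 0.059
```

The tiny change in the stop condition adds 5/5 as a stop point, and that alone moves the result from below 0.05 to above it.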

If the listener doesn't specify a p-value that will stop the test, the results will vary depending on the software's assumptions about at what score the tester would have stopped. Because of this, IMO ABX software *must* ask for some information before the test starts to produce reliable p-/c-values.

My "asymptote approach" (2^-1 + 2^-2 + ...) is one way to get correct c-values with an unlimited number of trials (and an unlimited number of wrong trials). The tester must specify what c-value he wants to reach at the beginning.

Maybe there is a way to calculate corrected values without the tester giving information before the test starts, but I doubt this, since the software always has to make assumptions that might be wrong. Imagine a listener wants to reach a c-value of < 0.01, but after 15 trials with some mistakes he decides that 0.05 is enough this time. This would change the stop points, no matter what method is used to calculate them, and therefore the c-values. Without the user giving some information about this to the software, there's no way to get correct results here.

Reply #9
I wonder if the limit of the probability to have guessed, in a sequential test, is 1. Maybe one day I'll try to calculate it.

Reply #10
I've created a DOS-box program (attached to this post) that simulates the "2^(-1) + 2^(-2) + 2^(-3) + ..." method. I've extended it a bit; now it works like this:

An aimed c-value is entered. The stop points are chosen by the program so that the c-value at any stop point stays below the aimed c-value, no matter how many trials are performed. The number of total trials can be limited by the user to make the program stop after a reasonable number of trials. Every stop point is allowed to 'consume' a certain percentage (or less) of the remaining "aimed c-value reservoir". This percentage can be chosen by the user as the 3rd input (0.01 - 0.99). Example:
The aimed c-value is 0.05. The percentage is 0.4.
The c-value for the 1st stop point must be smaller than 0.05*0.4 = 0.02, this is the case for
6/6, c-value = 0.0156. The "reservoir" is now 0.05-0.0156 = 0.0344.
What's added by the next stop point to the c-value must be smaller than 0.0344*0.4 = 0.0138.  This is the case for
8/9, c-value = 0.0273. "reservoir": 0.0227. Next stop point must add 0.0091 or less:
10/12, c-value = 0.0354
...
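
For anyone who wants to experiment without the attached program, the reservoir rule described above might look roughly like this in Python. This is a re-implementation sketch, not the attached program's source: it assumes that at each trial count the lowest score whose added probability stays below the allowed share becomes a stop point, and that any better score at that trial count also stops the test. The name `reservoir_stop_points` is made up.

```python
def reservoir_stop_points(aimed, percentage, max_trials):
    """Return a list of (correct, trials, cumulative c-value) stop points."""
    alive = {0: 1.0}  # number correct -> probability of not having stopped yet
    used = 0.0        # part of the aimed c-value consumed so far
    points = []
    for n in range(1, max_trials + 1):
        nxt = {}
        for c, pr in alive.items():
            for c2 in (c, c + 1):
                nxt[c2] = nxt.get(c2, 0.0) + pr / 2
        budget = (aimed - used) * percentage  # share of the remaining reservoir
        tail, stop_at = 0.0, None
        # Walk down from the best reachable score while the tail fits the budget.
        for c in sorted(nxt, reverse=True):
            if tail + nxt[c] >= budget:
                break
            tail += nxt[c]
            stop_at = c
        if stop_at is not None:
            used += tail
            points.append((stop_at, n, used))
        alive = {c: pr for c, pr in nxt.items()
                 if stop_at is None or c < stop_at}
    return points

for correct, trials, c_val in reservoir_stop_points(0.05, 0.4, 12):
    print(f"{correct}/{trials}  c-value = {c_val:.4f}")
```

With an aimed c-value of 0.05 and a percentage of 0.4 this reproduces the first three stop points of the example (6/6, 8/9, 10/12).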

Here's an example showing how the percentage value affects the stop points:
For comparison the number of trials is limited to 50, but there's no limit in practice (besides limits caused by overflow in software etc.):
Aimed c-value = 0.01.

1. Percentage = 0.1:
Code:
1. Stop point: (10/10)   C-Value: 0.000976563
2. Stop point: (13/14)   C-Value: 0.00158691
3. Stop point: (15/17)   C-Value: 0.00223541
4. Stop point: (17/20)   C-Value: 0.00282192
5. Stop point: (19/23)   C-Value: 0.00332022
6. Stop point: (21/26)   C-Value: 0.00373085
7. Stop point: (23/29)   C-Value: 0.0040638
8. Stop point: (24/31)   C-Value: 0.00459897
9. Stop point: (26/34)   C-Value: 0.00496011
10. Stop point: (28/37)   C-Value: 0.00523142
11. Stop point: (29/39)   C-Value: 0.00564917
12. Stop point: (31/42)   C-Value: 0.00592204
13. Stop point: (32/44)   C-Value: 0.00632385
14. Stop point: (34/47)   C-Value: 0.00657883
15. Stop point: (36/50)   C-Value: 0.00676343


2. Percentage = 0.3:
Code:
1. Stop point: (9/9)   C-Value: 0.00195313
2. Stop point: (11/12)   C-Value: 0.00415039
3. Stop point: (14/16)   C-Value: 0.00511169
4. Stop point: (16/19)   C-Value: 0.00601006
5. Stop point: (18/22)   C-Value: 0.00676394
6. Stop point: (20/25)   C-Value: 0.00737441
7. Stop point: (22/28)   C-Value: 0.00786117
8. Stop point: (24/31)   C-Value: 0.00824657
9. Stop point: (26/34)   C-Value: 0.00855083
10. Stop point: (28/37)   C-Value: 0.00879084
11. Stop point: (30/40)   C-Value: 0.00898025
12. Stop point: (31/42)   C-Value: 0.00927956
13. Stop point: (33/45)   C-Value: 0.00947899
14. Stop point: (35/48)   C-Value: 0.00962775


3. Percentage = 0.5:
Code:
1. Stop point: (8/8)   C-Value: 0.00390625
2. Stop point: (11/12)   C-Value: 0.00585938
3. Stop point: (13/15)   C-Value: 0.00769043
4. Stop point: (16/19)   C-Value: 0.00844574
5. Stop point: (18/22)   C-Value: 0.00913858
6. Stop point: (21/26)   C-Value: 0.00942713
7. Stop point: (23/29)   C-Value: 0.00969638
8. Stop point: (26/33)   C-Value: 0.00981075
9. Stop point: (29/37)   C-Value: 0.00986504
10. Stop point: (31/40)   C-Value: 0.00991874
11. Stop point: (34/44)   C-Value: 0.0099426
12. Stop point: (36/47)   C-Value: 0.00996601


4. Percentage = 0.8:
Code:
1. Stop point: (7/7)   C-Value: 0.0078125
2. Stop point: (11/12)   C-Value: 0.00952148
3. Stop point: (16/18)   C-Value: 0.00973511
4. Stop point: (19/22)   C-Value: 0.00986528
5. Stop point: (22/26)   C-Value: 0.00993642
6. Stop point: (25/30)   C-Value: 0.00997436
7. Stop point: (28/34)   C-Value: 0.00999449
8. Stop point: (33/40)   C-Value: 0.00999717
9. Stop point: (36/44)   C-Value: 0.00999893
10. Stop point: (40/49)   C-Value: 0.00999945


5. Percentage = 0.9:
Code:
1. Stop point: (7/7)   C-Value: 0.0078125
2. Stop point: (11/12)   C-Value: 0.00952148
3. Stop point: (15/17)   C-Value: 0.00994873
4. Stop point: (21/24)   C-Value: 0.00997794
5. Stop point: (25/29)   C-Value: 0.00998824
6. Stop point: (28/33)   C-Value: 0.00999505
7. Stop point: (31/37)   C-Value: 0.00999912
8. Stop point: (36/43)   C-Value: 0.0099997
9. Stop point: (40/48)   C-Value: 0.0099999

Reply #11
tigre: Just to clarify, with your method, the c-value that is shown to the user will be the probability to reach one of the stop points (calculated as you described above) or his current score, right?

Reply #12
Quote
tigre: Just to clarify, with your method, the c-value that is shown to the user will be the probability to reach one of the stop points (calculated as you described above) or his current score, right?

1. The user tells the software what c-value ("true probability that you could get a score by guessing") he wants to reach, e.g. 0.01.

2. Software calculates stop points (can be made configurable -> "probability" value).

3. There are several possibilities what can be shown to the user, e.g.:
a) the c-value based on the stop points and the actual score
b) simply either "not yet passed, if you stop now you've failed" or "passed, stop now"
c) the actual score and the next few reachable stop points
d) the stop points that have been missed already

My favourite would be a combination of a) and c), e.g. like this:

Quote
The "probability that you could get a score by guessing" (c-value) you want to reach is 0.01.
Your current score is 7 correct trials out of 8.

Actual c-value: 0.0195

The next stop points you can reach are:
11/12; 4/4  correct trials needed
14/16; 7/8  correct trials needed
16/19; 9/11 correct trials needed

You've missed these stop points:
8/8


Calculating and showing the probability to reach one of the stop points wouldn't make much sense IMO.

Edit: "probability you're guessing" replaced with "probability that you could get a score by guessing."

Reply #13
Quote
a) the c-value based on the stop points and the actual score

Yes, that's what I meant in my previous post, sorry if I didn't make it clear enough (the c-value is, after all, calculated as the probability to reach one of the earlier stop points or your current score).

The problem with your approach, as I see it, is still the following: you're using two different kinds of "c-values" in your method. First you use the "traditional c-value" calculation to find the stop points, but then you use a different way of calculating the value that is actually shown to the user, because here you use your new "custom" stop points.
This results in the same problem as the transition from p-values to c-values: what you show to the user is something different than you used for your assumptions about user behaviour. The problem with the original c-value approach was this: you assume that the user will stop at a certain p-value, but then you don't even show him the p-value but rather a different value, the c-value, so the assumptions don't make sense.
In your new approach the problem is similar. First you use "normal" c-values to find out what the stop points are. But then you don't show these "normal" c-values to the user, but you show him a different kind of c-value, namely the ones based on your new stop points.

Or maybe I got it all wrong?

Reply #14
Quote
Or maybe I got it all wrong?

Somewhat, I'd say.
Based on the user input before the test starts, all stop points are fixed. The results can be shown, but that's not necessary. The software must have control over the stop points, i.e. when one of them is reached, the software stops the test. Therefore, no assumptions about user behaviour have to be made, because this 'behaviour' is replaced by the stop points calculated by the software. The c-values that are calculated now using these stop points are correct, no matter what the user can see during testing. You can show him even the 'ordinary' p-values as additional information. Since the user can't decide to change stop conditions after the test has started, c-value calculation can't be messed up.

There's only one way to calculate c-values. The only thing that can change and therefore influence the results are the stop points. This is no problem if the stop points are fixed before the test starts. You could even give the user the possibility to set every stop point manually before testing starts. The resulting c-values would be different from c-values based on "equal p-value stop points" of course, but still valid since the stop points are known without any doubt and not calculated based on assumptions about user behaviour.

Reply #15
Quote
Quote
The "probability you're guessing" (c-value) you want to reach is 0.01.


Just a small wording thing that Continuum pointed out in the big thread:  It isn't really the "probability you're guessing" that's being calculated, but the "probability that you could get that score by guessing."

I like the idea of asking the listener what he wants to try for before he starts.

ff123

Reply #16
Quote
Quote
Quote
The "probability you're guessing" (c-value) you want to reach is 0.01.


Just a small wording thing that Continuum pointed out in the big thread:  It isn't really the "probability you're guessing" that's being calculated, but the "probability that you could get that score by guessing."

You're right, thanks (edited now in my posts). In my 1st post I called it "probability to reach a certain score (or better) by random guessing", but when writing the other posts I must have become less aware of it.

Quote
I like the idea of asking the listener what he wants to try for before he starts.

I do as well. This way there could be even an option to keep the 'old' p-values. (The tester would have to choose a fixed number of trials - and the test stops then, no matter what.)