I've noticed several double-blind testing tools have (or give the option of) ABXY testing. For those who haven't seen this, it involves randomized duplication of the A and B samples as X and Y. Therefore, the options to the test subject are:
A is X, B is Y or A is Y, B is X
Under conventional ABX, the A and B are compared with X, with X being a duplicate of either A or B chosen at random. The options for ABX are therefore:
A is X or B is X
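A minimal sketch of the randomization described above, assuming X is a copy of A or B with equal probability and Y is always the other sample. The helper name `make_trial` is hypothetical and not taken from any particular ABX tool.

```python
import random

def make_trial(rng: random.Random) -> dict:
    """Randomly assign the hidden samples for one trial.

    X is a copy of A or B with equal probability; Y is always the other one.
    An ABX tool would expose only A, B and X; an ABXY tool also exposes Y.
    """
    x = rng.choice(["A", "B"])
    y = "B" if x == "A" else "A"
    return {"X": x, "Y": y}

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
trials = [make_trial(rng) for _ in range(1000)]
share_x_is_a = sum(t["X"] == "A" for t in trials) / len(trials)
```

Either way the subject answers one binary question per trial ("is X the copy of A?"); in ABXY the answer for Y follows automatically.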
I'm struggling to see the logic of using ABXY over ABX. Statistically speaking, I can't see how it has any effect on alpha or power.
Is there a particular methodological/statistical reason why it is used, for example, by the ABX Comparator tool?
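For the alpha/power question, here is a rough sketch under the usual assumption that each trial is an independent binary choice. Guessing is 50/50 in both formats, so the passing score for a given alpha is the same; any gain from ABXY would have to show up as a higher per-trial hit rate, which feeds into power. The hit rates of 0.70 and 0.85 below are hypothetical, not measured values.

```python
from math import comb

def tail(trials: int, at_least: int, hit_rate: float) -> float:
    """P(score >= at_least) when each trial is right with probability hit_rate."""
    return sum(
        comb(trials, k) * hit_rate**k * (1 - hit_rate) ** (trials - k)
        for k in range(at_least, trials + 1)
    )

def critical_score(trials: int, alpha: float = 0.05) -> int:
    """Smallest passing score whose false-positive rate under guessing is <= alpha."""
    return next(k for k in range(trials + 2) if tail(trials, k, 0.5) <= alpha)

trials = 16
needed = critical_score(trials)           # identical for ABX and ABXY
power_at_70 = tail(trials, needed, 0.70)  # hypothetical listener, right 70% of the time
power_at_85 = tail(trials, needed, 0.85)  # hypothetical listener, right 85% of the time
```

So the test format cannot change alpha, but anything that makes the listener more reliable per trial raises the power for the same number of trials.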
Not sure about the statistics, but ABXY is more efficient than ABX during a listening test.
If the differences are subtle, then during an ABX test you need to listen to A-B-X-A-B-X (or B-A-X-B-A-X) several times, because human memory is pretty short.
I generally need just one pass of A-B-X-Y to spot the difference quickly.
P.S. In other words, it's critical for me to have fast access to both X and Y and to switch between them quickly, so I can spot the difference in a very short amount of time (less than 1 second).
Apologies IgorC if this seems a stupid question (! :) ), I'm just trying to get my head around how other people approach the two tests:
When you click 'A-B-X-Y', you're effectively playing 'A-B-A-B' or 'A-B-B-A' depending on how it's been randomised. If you're identifying subtle differences between the B and the X (as they are sequential in your pattern), then would that same inference not be drawn by playing A-B-X-A or A-B-X-B under regular ABX conditions?
When I do ABX, for example, I tend to play A, B and then X. If B and X sound similar, I'll immediately play A again to gauge whether the difference with X lies there. If they do not sound similar, I'll still play A again, only this time to confirm that it is similar to X (which was just playing), thus establishing that the difference must exist between B and X.
Interested to hear what you and others think of the two approaches, and how you personally sequence your testing. :)
then would that same inference not be drawn by playing A-B-X-A or A-B-X-B under regular ABX conditions?
I think you have answered your own question.
I get a 2x time gain. Instead of playing 2 combinations (ABXA and ABXB), I just play one ABXY, which is NOT the same as ABXA and/or ABXB (because of short sound memory).
And that really matters, because less time means less fatigue -> a more reliable blind test outcome.
I strongly support ABXY.
Let's say:
A score is 5.0 - the original.
B score is 4.9 - the encoded.
X score is 5.0 - actually the original.
The ABXABXABXABX sequence is perceived like:
5.0 4.9 5.0 5.0 4.9 5.0 5.0 4.9 5.0 5.0 4.9 5.0
The AXAXAXBXBXBX sequence is perceived like:
5.0 5.0 5.0 5.0 5.0 5.0 4.9 5.0 4.9 5.0 4.9 5.0
The XYXYXY sequence is perceived like:
5.0 4.9 5.0 4.9 5.0 4.9
Then take the FFT of those perceived score sequences. The XYXYXY sequence has the strongest non-DC power, and is therefore the easiest to detect.
(The DC is very hard to detect because, as IgorC said, human memory is short, and in a very high-rate listening test the listener gets paranoid and starts to find tiny recording flaws in the original; now both of them seem encoded.)
Another advantage is that XYXYXY has the greatest Hamming distance between the case where X was 5.0 and the case where X was 4.9.
If X is A, the ABXABXABXABX sequence is
ABAABAABAABA, therefore:
5.0 4.9 5.0 5.0 4.9 5.0 5.0 4.9 5.0 5.0 4.9 5.0
If X is B, the ABXABXABXABX sequence is
ABBABBABBABB, therefore:
5.0 4.9 4.9 5.0 4.9 4.9 5.0 4.9 4.9 5.0 4.9 4.9
The Hamming distance is
0.0+0.0+0.1+0.0+0.0+0.1+0.0+0.0+0.1+0.0+0.0+0.1+...
If X is A, the XYXYXY sequence is
ABABAB, therefore:
5.0 4.9 5.0 4.9 5.0 4.9
If X is B, the XYXYXY sequence is
BABABA, therefore:
4.9 5.0 4.9 5.0 4.9 5.0
The Hamming distance is
0.1+0.1+0.1+0.1+0.1+0.1+...
Thus ABXY is a great tool to detect tiny flaws on the verge of perceptibility.
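The two comparisons in this post can be reproduced with a short sketch. It assumes the 5.0/4.9 scores from the post and compares sequences of equal length (12 plays). What the post calls the Hamming distance is computed here as a sum of absolute score differences, and the "non-DC power" is taken as the summed squared DFT magnitude of every bin except bin 0, normalised by length squared (by Parseval's theorem this equals the variance of the sequence). The function names are hypothetical.

```python
import cmath

SCORE = {"A": 5.0, "B": 4.9}  # perceived scores taken from the post

def scores(seq: str, x: str) -> list[float]:
    """Perceived scores for a play sequence, given which sample X really is."""
    y = "B" if x == "A" else "A"
    actual = {"A": "A", "B": "B", "X": x, "Y": y}
    return [SCORE[actual[s]] for s in seq]

def l1_distance(seq: str) -> float:
    """Sum of absolute score differences between the X=A and X=B cases."""
    return sum(abs(a - b) for a, b in zip(scores(seq, "A"), scores(seq, "B")))

def non_dc_power(values: list[float]) -> float:
    """Summed squared magnitude of DFT bins 1..n-1, divided by n squared."""
    n = len(values)
    total = 0.0
    for k in range(1, n):  # skip k = 0, the DC bin
        bin_k = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(values))
        total += abs(bin_k) ** 2
    return total / n**2

sequences = ("ABX" * 4, "AXAXAXBXBXBX", "XY" * 6)
distance = {seq: l1_distance(seq) for seq in sequences}
power = {seq: non_dc_power(scores(seq, "A")) for seq in sequences}
```

Under these assumptions the XY alternation comes out highest on both measures, which matches the ordering claimed in the post; whether perceived scores behave this way for a real listener is a separate question.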
I think that you will lose a lot of resolution.
A skilled listener does:
1] A : X same or different?
2] B : X same or different?
It's easier than trying to remember both A & B when listening to X.
I get a 2x time gain. Instead of playing 2 combinations (ABXA and ABXB), I just play one ABXY, which is NOT the same as ABXA and/or ABXB (because of short sound memory).
And that really matters, because less time means less fatigue -> a more reliable blind test outcome.
Why would one play ABXA when you can play AXB or AXBX and get essentially the same thing you seem to desire: close comparisons of A and X and of B and X?
I see Y as an interfering or, worse, irrelevant variable.
Why would one play ABXA ...
And how does a listener know beforehand how A and B sound, huh?
Let's not just concentrate on pure sequences; let's try to think about what happens with a listener during the test, because we aren't robots who perceive certain sequences and have perfect memory. That's a huge oversimplification.
Why would one play ABXA when you can play AXB or AXBX
AXB can be AAB or ABB. So I will waste my concentration on the "AA" part and only then move to "AB" in 50% of cases. Hm. No deal for me. It's a big deal to have access to X and Y in a short time when differences are subtle.
I think that you will lose a lot of resolution.
A skilled listener does:
1] A : X same or different?
2] B : X same or different?
It's easier than trying to remember both A & B when listening to X.
Simply not true.
Then take the FFT of those perceived score sequences. The XYXYXY sequence has the strongest non-DC power, and is therefore the easiest to detect.
(The DC is very hard to detect because, as IgorC said, human memory is short, and in a very high-rate listening test the listener gets paranoid and starts to find tiny recording flaws in the original; now both of them seem encoded.)
Another advantage is that XYXYXY has the greatest Hamming distance between the case where X was 5.0 and the case where X was 4.9.
If X is A, the ABXABXABXABX sequence is
ABAABAABAABA, therefore:
5.0 4.9 5.0 5.0 4.9 5.0 5.0 4.9 5.0 5.0 4.9 5.0
Kamedo, could you please tell me what you mean by FFT and non-DC power? I'm not familiar with the terms in this context
I think that you will lose a lot of resolution.
A skilled listener does:
1] A : X same or different?
2] B : X same or different?
It's easier than trying to remember both A & B when listening to X.
I can remember both A & B and answer accordingly in low-rate listening tests. However, in high-rate listening tests, the difference between A and B is so subtle that human memory can be washed out quickly. I like ABXY. It holds up in a noisy channel.
Why would one play ABXA ...
And how does a listener know beforehand how A and B sound, huh?
Let's not just concentrate on pure sequences; let's try to think about what happens with a listener during the test, because we aren't robots who perceive certain sequences and have perfect memory. That's a huge oversimplification.
Why would one play ABXA when you can play AXB or AXBX
AXB can be AAB or ABB.
Right, so if you can hear the difference between A & B the difference is there to hear because in either case you have an A versus B comparison.
Following the same logic, AXBX is the same as AABA if X is A and ABBB if X is B. You have an A versus B comparison, either way.
So I will waste my concentration on the "AA" part and only then move to "AB" in 50% of cases.
There is no such thing as wasting concentration. When you listen you obtain information that is valuable to the identification process. Hearing that X sounds the same as A is just as valuable as hearing that it is different. If it is the same, then it is not different, and if it is different, then it is not the same.
Hm. No deal for me. It's a big deal to have access to X and Y in a short time when differences are subtle.
Whatever makes you feel good! I suspect that you are so prejudiced in favor of the need for the Y variable that lacking it will probably hurt your concentration, even though that is not generally true.
Why would one play ABXA ...
And how does a listener know beforehand how A and B sound, huh?
Ever hear of listener training?
Whatever makes you feel good! I suspect that you are so prejudiced in favor of the need for the Y variable
Are you just prejudiced in favor of ABX test?
I get a 2x time gain. Instead of playing 2 combinations (ABXA and ABXB), I just play one ABXY, which is NOT the same as ABXA and/or ABXB (because of short sound memory).
And that really matters, because less time means less fatigue -> a more reliable blind test outcome.
Why would one play ABXA when you can play AXB or AXBX and get essentially the same thing you seem to desire: close comparisons of A and X and of B and X?
I see Y as an interfering or, worse, irrelevant variable.
As our high-accuracy memory is short, the time during which our judgement is accurate is also very short.
IMHO, in order to manage our moments of very high accuracy efficiently, we should introduce a sort of inaccurate flow of data in order to extract the high-accuracy moments.
I think that you will lose a lot of resolution.
A skilled listener does:
1] A : X same or different?
2] B : X same or different?
It's easier than trying to remember both A & B when listening to X.
Simply not true.
Don't ever have a listening contest with a skilled listener. They can notice differences way too small to identify just what the differences are. They go with 'same or different', not 'sounds like A or B', which requires a much larger difference.
How I use ABXY, after playing A and B.
When the bitrate is very low....
I play X, and answer.
When the bitrate is fairly low, but if I just want to confirm my guess, I play the opposite.
I play X, Y, and answer.
When the bitrate is very high and the artifacts are very subtle, I tend to find the minuscule recording flaw of the original, so the question will be more like "X and Y, which is more damaged?".
I am a human, so I have a confirmation bias. The first guess, be it right or wrong, is more likely to be reinforced on the next play.
To reduce this bias, I first let myself believe that X is better than Y, and play X, Y, X, Y, X, Y, ...
Next, I let myself believe that Y is better than X, and play X, Y, X, Y, X, Y, ...
Finally, think to myself, "Which confirmation was stronger?"
That's the way I perform, and pass, the high rate ABXing.
For me, it's really simpler to find subtle differences when I can do XYXYXY... comparison.
Perhaps it would be interesting to try
ABXZ (where Z is a random A or B and I am told which one it is), unless that breaks the stats in some way.
The sequence "ABXY" by itself, versus "AXBX" isn't really much different, as Arnold said.
What IgorC and others seem to emphasize is the fact that ABXY lets the tester forget completely about A and B and focus on locating the difference between X and Y, if there is any that is audible.
Once the difference is found, the next step is seeing if A has the difference, or B has it, and be able to make the A->X, (and so B->Y) or B->X (and so A->Y) connection.
With ABX, you can do exactly the same thing: Let's forget completely about X, and focus on locating the difference between A and B, if there is any which is audible.
Once the difference is found, the next step is seeing if X has the difference, or if X doesn't have it, and be able to make the A->X (and so B-> not X) or B->X (and so A-> not X) connection.
Obviously, the objective is determining the sample under the X, either with ABX or with ABXY, and for that, the difference alone is not enough.
Once the difference is found, the next step is seeing if A has the difference, or B has it,
If my understanding is right, you play AXAXAX and see if A and X have the difference. If it's inconclusive, then you play BXBXBX and see if B and X have the difference.
If the initial guess "A has the difference" is right, A has a higher fidelity than X. If the guess is wrong, then A is same as X. That's not as dramatic as having fidelity opposite to the expectation.
The concept of "same" is a tricky one. If you can deduce a difference early in the evaluation, you're lucky. But if you can't, it could be either: a) there is indeed no difference, or b) there is a difference but you missed it for whatever reason, including but not limited to forgetting the "knack" as time elapsed.
That's a devil's proof that consumes the tester's precious time.
In ABXY testing, if you play X, you are always guaranteed that the one you play next will be "different". It can be of higher fidelity or lower fidelity, but it will never be the same as X.
Of course, you can switch the guess in an ABX test, but the reason you switch is that the initial guess was inconclusive. The time you spent pondering whether it's the same or not (the devil's proof) is wasted. Actually, the devil's proof does appear in an ABXY test: when memorizing A and B, if you don't find any differences, you may think of giving up the test. But that's a consideration you face only once in a single ABXY session, not something you do 5, 8 or 20 times in a single session.
In conclusion, an ABXY session is faster to perform because it eliminates the devil's proof.
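The "always different" property can be checked by enumerating both equally likely assignments. This is only a sketch with hypothetical helper names; it says nothing about how audible the difference is, only whether the two buttons hold different samples.

```python
def identity(button: str, x: str) -> str:
    """Resolve a button label to the real sample, given what X is."""
    y = "B" if x == "A" else "A"
    return {"A": "A", "B": "B", "X": x, "Y": y}[button]

def chance_pair_differs(first: str, second: str) -> float:
    """Fraction of the two equally likely assignments (X=A, X=B)
    in which the two buttons hold different samples."""
    hits = sum(identity(first, x) != identity(second, x) for x in ("A", "B"))
    return hits / 2

pairs = {
    "A-X": chance_pair_differs("A", "X"),
    "B-X": chance_pair_differs("B", "X"),
    "X-Y": chance_pair_differs("X", "Y"),
    "A-B": chance_pair_differs("A", "B"),
}
```

A-X and B-X each hold different samples in half of the assignments, while X-Y and A-B always do, which is the structural point being argued in this exchange.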
@Kamedo2 I don't follow you. I said that if you are interested in locating the difference, you play A versus B, not A versus X or B versus X. That's the part that I said is the same as playing X vs Y. There's no need for Y to locate the difference.
I really cannot understand why finding a difference between A vs B is harder than finding a difference between X vs Y.
I really cannot understand why finding a difference between A vs B is harder than finding a difference between X vs Y.
It isn't, of course. But finding this difference is only the first step.
I still don't get it... If finding the difference is just as easy with ABX or ABXY, then, what makes ABXY better? I am still not buying the "I do less tests with ABXY " sentence.
I see that people still ignore the human factor. We don't have perfect memory.
It's not about sequences but about timing.
Some people here (including me) claim that ABXY is useful for subtle differences. That's important.
A good level of concentration is actually _very short_, and you need a high level of concentration for subtle differences. Say you spot a subtle difference on A-B; by this point your concentration is no longer at its best, and now imagine you have to move on to A-X and/or B-X. You have a 50% chance (2 possibilities: A-X or B-X) of landing on the pair that differs. But you get 100% if you move directly to X-Y.
And, again, all through the test your concentration is degrading at a very high rate. It's not about logic, mathematics, or statistics. None of them can describe it. It's about the interaction between the test and the human.
I think this is a case where common sense doesn't apply. There are scientific theories which go against common sense but were proven to be correct.
Why would one play ABXA ...
And how does a listener know beforehand how A and B sound, huh?
Ever hear of listener training?
This is an important point. Kamedo2 and I claim that ABXY is useful for subtle differences.
You can't "memorize" a subtle enough difference.
Training doesn't apply here.
I really cannot understand why finding a difference between A vs B is harder than finding a difference between X vs Y.
It isn't, of course. But finding this difference is only the first step.
Exactly. Evaluating A and B is something you do only once in a single ABX session. After that, evaluating X is something you do 5, 8, or 20 times.
Step by step:
=ABX=
Listen to A
Listen to B
If the difference is subtle and wasn't caught on first listen, listen to A and B again until you determine a difference.
Determining a difference does not mean remembering A and/or remembering B. It means one sample does not sound like the other in a concrete part of the audio. And since you found the difference while listening to A and B, you know which of them has the "different part".
Listen to X on that specific part where you located the difference previously, in order to see if it plays as the "bad" one or as the "good" one.
If X happens to be the same as your last listening (let's say you last listened to B, and X is B), you will not hear the difference, and might wonder whether listening to A will show the difference (and determine that X is B).
=ABXY= (as I understand you describe it)
Listen to A?
Listen to B?
Listen to X
Listen to Y
If the difference is subtle and wasn't caught on first listen, listen to X and Y again until you determine a difference.
Determining a difference does not mean remembering A and/or B, and/or X and/or Y. It means one sample does not sound like the other in a concrete part of the audio. And since you found the difference while listening to X and Y, you know which of them has the "different part".
Now, what?
Listen to A? Listen to B? Don't listen since now you somehow remember what is A and what is B?
Or maybe you simply mean that rather than listening to A/B A/B A/B and then determining X, you continually listen to ABXY ABXY ABXY until you are sure B is X or B is not X?
Why would one play ABXA ...
And how does a listener know beforehand how A and B sound, huh?
Ever hear of listener training?
This is an important point. Kamedo2 and I claim that ABXY is useful for subtle differences.
You can't "memorize" a subtle enough difference.
Please provide a relevant reference in a peer reviewed paper or other independent authoritative evidence beyond your personal say-so confirming that exceptional claim.
Training doesn't apply here.
Again, this looks like an attempt to fabricate much-needed facts on the spot.
And the word subtle itself is very vague. Got a formal, generally agreed upon definition of it?
=ABXY= (as I understand you describe it)
I often did this when I participated in listening tests:
1) listen to samples A and B (A B A B A B...) to find encoding artifact (pre-echo, ringing, etc)
2) listen to samples X and Y (X Y X Y...) to find what sample has this artifact (for example, what sample has smearing of sharp transients)
3) optionally - confirm it using this sample and samples A and B
If you listen to XYXYXY then you not only need to tell that they are different (they are), but which is which.
If you listen to AXAXAX or BXBXBX then all you need to know is whether you can hear a difference or not.
If you listen to XYXYXY then you not only need to tell that they are different (they are), but which is which.
Or rather how they are different. If you can tell how they are different, then the difference is greater than JND.
If you listen to AXAXAX or BXBXBX then all you need to know is whether you can hear a difference or not.
Yes, it'd be much easier to answer a same-or-different question.
I will explain why ABXY is superior to ABX on high rate listening tests.
In a high-rate test, only the relative quality change between two adjacent plays is usable as a hint.
Sadly, the observed (perceived) quality is subject to Gaussian noise.
(Finding artifacts typically doesn't work in a high-rate test, because it starts to seem like both tracks have the artifacts. I can still say which one is dirtier or cleaner, though with limited reliability.)
In the spreadsheet B3:C3, the quality of track A and B is set. A is original, so it will be 5.0. B is set to 4.0 as a convenience.
In an ideal world where humans are noiseless machines, a quality drop of exactly 1.0 would be felt by the human when playing A -> B, never more and never less.
That ideal-world result is shown in the spreadsheet at D6:G7.
In the real world, humans are moody animals. The same quality change might be perceived as -1.2, -0.2, or sometimes even -2.8 (though a value this far off is not so frequent).
Let's say the human perceived it as -1.6. Look at D18:G18. The graph says it is modestly likely (7.5%) that the perception came from A->B. A less likely scenario is that you played identical tracks twice and the perceived -1.6 drop was actually noise (1.4%). Even less likely is that it was actually an improvement (0.1%); probably there was a noise outside the listening room.
Apply Bayes' rule here. In A->X, it can be A->A or A->B, but not B->B or B->A, so we can exclude those possibilities. The likelihood of it being A->B is D18*0.5/(D18*0.5+E18*0.5) = D18/(D18+E18) = 7.5%/(7.5%+1.4%) = 0.85 = 85% after the observation. Very simple. The 0.5 is there because blind test software is known in advance to set the probability of X=A to 50%. So, according to Bayesian inference, if a quality drop of -1.6 is observed, it is 85% likely that it was A->B.
Likewise, in the X->Y test, we can exclude the A->A and B->B possibilities and look only at A->B and B->A. Bayes' rule similarly says it was A->B with 7.5%/(7.5%+0.1%) = 99% likelihood, and B->A with 1% likelihood (H18:I18).
Both tests have used two plays so far, but X->Y is far more confident in saying which is which.
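The Bayes step above can be sketched in a few lines. The noise standard deviation `SIGMA = 0.8` is an assumption on my part (the post does not state it); it was picked because it gives posteriors close to the 85% and 99% quoted above. The function names are hypothetical.

```python
from math import exp, pi, sqrt

SIGMA = 0.8  # ASSUMPTION: noise standard deviation, not stated in the post
DROP = 1.0   # true quality change for A -> B (5.0 -> 4.0, from the post)

def likelihood(observed: float, true_change: float) -> float:
    """Gaussian density of perceiving `observed` when the real change is `true_change`."""
    z = (observed - true_change) / SIGMA
    return exp(-0.5 * z * z) / (SIGMA * sqrt(2 * pi))

def posterior(observed: float, hypothesis: float, alternative: float) -> float:
    """Bayes' rule with equal (50/50) priors on the two possible transitions."""
    h = likelihood(observed, hypothesis)
    return h / (h + likelihood(observed, alternative))

observed = -1.6
abx = posterior(observed, -DROP, 0.0)     # A->X: was it A->B rather than A->A?
abxy = posterior(observed, -DROP, +DROP)  # X->Y: was it A->B rather than B->A?
```

The gap comes from the alternatives being twice as far apart in the X->Y case (-1.0 versus +1.0) as in the A->X case (-1.0 versus 0.0), so the same noisy observation discriminates between them more sharply.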
There are minor exceptions, such as H26:K26, where ABXY is less confident, but according to this spreadsheet simulation, in an average situation, after two plays, ABXY has an 89.2% likelihood of being correct, while ABX has a 73.3% likelihood of being correct (L:S).
(https://ss1.coressl.jp/listening-test.coresv.net/img2/abxy-abx-comparison.png)
Another spreadsheet simulation says ABX doesn't catch up by playing tracks more, if ABXY is also allowed to play more.
(https://ss1.coressl.jp/listening-test.coresv.net/img2/abxy-abx-comparison2.png)
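A closed-form sketch of the same idea, under assumptions that are mine rather than the spreadsheet's: each comparison yields one independent Gaussian reading with standard deviation `SIGMA = 0.8` (not stated in the post), the listener averages the readings, and then picks the more likely of two equally probable hypotheses. For a single comparison this gives roughly 73% for ABX and 89% for ABXY, close to but not identical with the spreadsheet figures.

```python
from math import erf, sqrt

SIGMA = 0.8  # ASSUMPTION: per-comparison noise, not stated in the post
DROP = 1.0   # true quality change for A -> B, from the post

def phi(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def accuracy(separation: float, comparisons: int) -> float:
    """Chance the best guess is right after averaging `comparisons` independent
    noisy readings whose two possible means differ by `separation`."""
    return phi(separation * sqrt(comparisons) / (2 * SIGMA))

def abx_accuracy(comparisons: int) -> float:
    return accuracy(DROP, comparisons)      # "same" (0) versus "drop" (-DROP)

def abxy_accuracy(comparisons: int) -> float:
    return accuracy(2 * DROP, comparisons)  # "drop" (-DROP) versus "gain" (+DROP)

table = [(n, abx_accuracy(n), abxy_accuracy(n)) for n in range(1, 9)]
```

In this simplified model ABX needs four comparisons to match one ABXY comparison, so it does not catch up when both are given the same number of plays. Real listening noise is unlikely to be independent or exactly Gaussian, so this is an illustration of the argument, not evidence for it.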
I don't actually take the spreadsheet software to my listening room. Instead I use an analog 'sense' of something similar to H11:I41 and let it accumulate in my mind as I hear more tracks.
From a programming perspective, it's a lot easier to just add the "Y Button" and cover both test scenarios. You don't have to ever click it, and you would be doing an ABX test, or use it and you would be doing ABXY.
And if you think that one test is easier than the other, then good! There should be no difficulty inherent to the test itself.
One program to rule them all !
EDIT: Personally I think I do AXY more than anything.
I will explain why ABXY is superior to ABX on high rate listening tests.
In a high-rate test, only the relative quality change between two adjacent plays is usable as a hint.
Sadly, the observed (perceived) quality is subject to Gaussian noise.
(Finding artifacts typically doesn't work in a high-rate test, because it starts to seem like both tracks have the artifacts. I can still say which one is dirtier or cleaner, though with limited reliability.)
In the spreadsheet B3:C3, the quality of track A and B is set. A is original, so it will be 5.0. B is set to 4.0 as a convenience.
Sighted evaluation, right?
You believe that ABXY is superior, and so it is for you.
From a programming perspective, it's a lot easier to just add the "Y Button" and cover both test scenarios. You don't have to ever click it, and you would be doing an ABX test, or use it and you would be doing ABXY.
And if you think that one test is easier than the other, then good! There should be no difficulty inherent to the test itself.
One program to rule them all !
EDIT: Personally I think I do AXY more than anything.
I find all the stuff in the FOOBAR2000 ABX add-on that is uniquely related to ABXY a distraction, and I think it hurts my results.
I use Foobar's ABX plugin, and what I usually do is first listen to A and B and figure out how they sound different. In other cases I won't even need to listen to A and B, because I already know what the difference is. This goes for something like volume level. So if I know that A is 0.2 dB louder than B, I will just listen to X and Y and see which one is louder. Then that must be A.
In the same way, after listening to A and B and thinking e.g. that A is brighter than B, then I just listen to X and Y from then onwards and then pick the brightest one as being A.
I hope this makes sense. I find using X and Y is very easy for me. I have basically used this method in all the blind tests I've done, although I also go back to A and B now and again if need be.
As you can see in the listening test forum, I passed with 15 out of 16 correct for a volume level difference of 0.2 dB, and that was done purely using X and Y.
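For reference, the chance of scoring at least 15 out of 16 by pure guessing can be worked out with a one-sided binomial tail, assuming independent 50/50 trials. The function name is hypothetical.

```python
from math import comb

def guessing_p_value(correct: int, trials: int) -> float:
    """Probability of getting at least `correct` of `trials` right by coin-flip guessing."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2**trials

p = guessing_p_value(15, 16)  # (16 + 1) / 65536
```

That is 17 chances in 65536, or roughly 0.03%, so guessing is a very unlikely explanation for that score.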
I will explain why ABXY is superior to ABX on high rate listening tests.
In a high-rate test, only the relative quality change between two adjacent plays is usable as a hint.
Sadly, the observed (perceived) quality is subject to Gaussian noise.
(Finding artifacts typically doesn't work in a high-rate test, because it starts to seem like both tracks have the artifacts. I can still say which one is dirtier or cleaner, though with limited reliability.)
In the spreadsheet B3:C3, the quality of track A and B is set. A is original, so it will be 5.0. B is set to 4.0 as a convenience.
Sighted evaluation, right?
You believe that ABXY is superior, and so it is for you.
Criticism is good when mixed with suggestions, ideas ... anything.
But you aren't helping either. Can you see that?