List of typical problems and shortcoming of "common" audio t

2015-12-15 22:41:11

First things first: if this topic has already been discussed somewhere, feel free to point me there (references to publications are welcome).

There are a few things that make me feel uneasy about typical audio tests. For instance (in no particular order):

1) Switching between sample A and B seamlessly is hard in SW/HW if the decoder/bitrate/sample-rate is different (cf. past and present problems with foo_abx).
2) Self-reported results are hard to trust: people can easily cheat (e.g. changing the logs is usually enough).
3) Codec comparison is fragile and hard to do: for instance, fdk_aac low-passes at 15000 Hz at 96 kbit/s while opus/vorbis/etc. don't. Do my OGG and AAC files at 96 kbit/s sound different because of the codecs themselves or because of the (optional, but enabled by default in fdk_aac) frequency cutoff?
4) Buggy or outdated codecs/pipelines could degrade the audio for specific formats only. People could hear a difference, but it would be completely unrelated to what we want to test.
5) Preference tests seem to be rarely preceded by ABX tests, so they are hard to trust.
6) Most people just don't do formal testing.
7) When people choose correctly 15 times out of 20, the test is usually considered positive (i.e. "this person can differentiate A and B"); however, there is no notion like "the difference in audio is so small that people cannot pick the right one every time".
8) It is usually easy to identify which sample is X from its waveform/spectrum.
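
To put a number on the 15-out-of-20 case from the list above, here is a minimal Python sketch. It assumes every trial is an independent 50/50 guess under the null hypothesis; the function name `abx_p_value` is made up for illustration. It computes the one-sided binomial probability of scoring at least that well by guessing alone.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Probability of at least `correct` hits in `trials` coin-flip guesses."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Count all outcomes with `correct` or more hits, out of 2**trials outcomes
    hits = sum(comb(trials, k) for k in range(correct, trials + 1))
    return hits / 2 ** trials


if __name__ == "__main__":
    # 15 of 20 correct: roughly a 2% chance of doing this well by guessing
    print(f"{abx_p_value(15, 20):.4f}")
    # 12 of 20 correct: guessing does this well about a quarter of the time
    print(f"{abx_p_value(12, 20):.4f}")
```

The p-value only says how surprising the score would be for someone who is guessing. It says nothing about how large or how reliable the audible difference is, which is the gap that list item points at.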

The purpose of this thread is to get a more exhaustive problem list.
I really want to get the big picture and I am sure this will be useful in the near future.

IMPORTANT NOTE: I'm not looking for solutions (yet). The time for solutions will definitely come, so please refrain from giving them now.

Reply #1 – 2015-12-15 23:37:25

I guess you are talking about blind tests. My comments:

1) ABX software may just decode all files before the test even starts, so the problem is mainly a difference in the decoded format (samplerate, bit depth, could even be channels or PCM vs DSD).
2) Yes, but you don't even need to forge logs. See spectrum analyzers.
3) I consider cutoffs as part of the codec. Different codecs make different tradeoffs, but that's what you want to test ... if there is an audible difference.
4) Yeah and I think this is commonly overlooked.
5) Hmm, you mean like ABC/HR? Have you checked out the multiformat listening tests, exclusion criteria (like getting N samples wrong), statistical analysis? It's pretty solid if you have enough people and samples.

7) There's been some discussion about p-values and ABX comparator results. Without some idea about statistics the results are indeed not trivial to understand.

8) That's why it is supposed to be a blind test: you shouldn't "see" which is which.

Another one: sometimes files are created by an incompatible codec (or decoded even, that would be point 1 again) which can e.g. result in added silence at the beginning.

Another big one: differences between the files that you didn't even want to test for. For example, there have been several invalid resampler tests in the past. The guys wanted to test whether the resampler's filter caused audible differences, but didn't notice that some resamplers introduced a time delay. On fast switching between tracks this could give away which is which ... this could have even happened without the participants noticing it as such, but just perceiving some vague difference when switching.
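
The resampler delay described above is the kind of thing a quick check before the test can surface. A minimal sketch in plain Python, assuming both decoded files are available as lists of float samples at the same sample rate; `estimate_delay` and the noise test signal are made up for illustration, and a brute-force cross-correlation like this is only practical on short excerpts.

```python
import math
import random


def estimate_delay(reference, candidate, max_lag):
    """Return the lag (in samples) at which `candidate` best matches `reference`.

    A positive result means `candidate` is delayed relative to `reference`.
    """
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation at this lag, ignoring samples that fall outside
        score = 0.0
        for i, ref in enumerate(reference):
            j = i + lag
            if 0 <= j < len(candidate):
                score += ref * candidate[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


if __name__ == "__main__":
    rng = random.Random(0)
    original = [rng.uniform(-1.0, 1.0) for _ in range(256)]
    # Simulate a resampler that adds 3 samples of latency
    resampled = [0.0] * 3 + original
    print(estimate_delay(original, resampled, max_lag=10))  # prints 3
```

Any non-zero result means the two files are not time-aligned, so fast switching could reveal which is which regardless of what the test was meant to measure.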

It's a tragedy when such mistakes are not detected in the preparation of the test but only much later, or not at all, but conclusions are still drawn as if it was a valid test. Some dishonest people won't even acknowledge the flaws...

Reply #2 – 2015-12-16 14:33:49

Quote from: MMime on 2015-12-15 22:41:11

There are a few things that make me feel uneasy about typical audio tests.

Most people just don't do formal testing.

What's wrong with not wearing a tuxedo when testing??
Actually, the latter should make you more uneasy. Entire industries like High Fashion audio and acoustic "treatments", etc. are based specifically on not doing any perceptual testing whatsoever. The house of cards collapses with their inclusion.

Quote from: MMime on 2015-12-15 22:41:11

The purpose of this thread is to get a more exhaustive problem list.

That or you are seeking a solution in search of a problem.
For example:

Quote from: MMime on 2015-12-15 22:41:11

Preference tests seem to be rarely preceded by ABX tests, so they are hard to trust.

Ummm, why would an ABX be needed for a preference test?

cheers,

AJ

Reply #3 – 2015-12-16 15:12:14

Preference tests need a lot of participants in order to have much hope in providing statistically significant results.
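
As a rough illustration of why the participant count matters, here is a sketch of a two-sided sign test in Python. It assumes each listener gives one independent forced choice between A and B and that, with no real preference, either choice is equally likely; the function names are hypothetical.

```python
from math import comb


def two_sided_p(votes_for_a: int, listeners: int) -> float:
    """Two-sided sign-test p-value, assuming no real preference (50/50)."""
    extreme = max(votes_for_a, listeners - votes_for_a)
    tail = sum(comb(listeners, k) for k in range(extreme, listeners + 1))
    return min(1.0, 2 * tail / 2 ** listeners)


def votes_needed(listeners: int, alpha: float = 0.05):
    """Smallest majority that reaches significance, or None if impossible."""
    for votes in range((listeners + 1) // 2, listeners + 1):
        if two_sided_p(votes, listeners) <= alpha:
            return votes
    return None


if __name__ == "__main__":
    for n in (5, 10, 20, 100):
        print(n, votes_needed(n))
```

With 5 listeners even a unanimous vote does not reach the 5% level, with 10 listeners 9 have to agree, and with 20 listeners 15 have to agree. Small panels can only confirm very lopsided preferences.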

Reply #4 – 2015-12-16 16:00:33

Please, please, let's focus on finding problems with the current usual ways of doing audio tests (and I'm talking about audio testing in general, not just DBT).

Discussions about each specific problem (even to say that they actually aren't) will come. Solution finding will also come. But please refrain from doing them now.

But thank you, you three, for your input. @anjinfa: are you really working on loudspeakers? If so, I'm sure that you can think of problems related to hardware and how people set it up for listening tests.

As for the motivation behind such a list, it has been given by @xnor in his last paragraph.
So, back to the point...

As an extra (psychological bias) problem, people tend to think that original audio *always* sounds better.

HW/SW-related problems are hard to spot (only by ABXing recordings of the sound card output)

Stereo recordings can exhibit differences due to phase interactions in loudspeaker systems.

Reply #5 – 2015-12-17 00:02:34

Quote from: xnor on 2015-12-15 23:37:25

It's pretty solid if you have enough people and samples.

I don't agree. I think that statistics are the 'soft' scientist's version of hard mathematics. Statistics mean nothing unless the original premises of the experiment are valid i.e. that certain mathematical criteria concerning the test data are met. In the case of people listening to music - a cultural, aesthetic judgement - that isn't going to happen.

Audiophile music is 'niche', shall we say. Audiophile music and audiophile systems enjoy a 'symbiotic' relationship. Audiophile music is chosen because it sounds 'good' (acceptable) on standard audiophile systems. The music selected by audiophiles for testing audiophile systems is not capable of revealing the weaknesses that plague audio systems.

One definition of science says that it is not valid for subjective judgements. Quite so. You will never be able to conclusively say that technical parameter X is inaudible or otherwise because your experiment is already biased towards the anodyne music that audiophiles 'enjoy' - because their systems sound acceptable when playing it. You can't meaningfully test listeners with synthesised test signals or music they are not familiar with, either.

This is not science.

Reply #6 – 2015-12-17 01:15:57

Please, once again, stop arguing!!! This is ABSOLUTELY NOT the right time for this. Only list problems for now.

You WILL be given plenty of time (at the very least a whole week) and opportunities to discuss them starting on next week. Until then, please refrain from arguing (at least here, feel free to open your own thread if you want to start now).

Reply #7 – 2015-12-17 01:35:25

Though, Green Marker expressed problems (valid or not, don't argue!) :-)

- Preference tests seem non-scientific
- Test samples are not randomly chosen (biased toward familiar music and "audiophile" music on which gear was tuned).
- The applicability of the usual stats should be checked (leaving aside the "subjective therefore ascientific" stuff, the normality of the distribution of values is never checked, for instance)

I would also add:

- Making sure people can differentiate sample A from sample B before asking for which they prefer is not obvious
- What is "part of the codec" (for codec comparison) is hard to determine.
- People basically have the program and the files: they can tamper with them as much as they want.

Reply #8 – 2015-12-17 01:41:56

Regarding #5, I'm just trying to help clear up a potential source of confusion. I'm not a mind reader; I don't know what you know and what you don't know. If you think this is arguing then count me out of this discussion.

I wish you luck in whatever it is that you're trying to accomplish.

Reply #9 – 2015-12-17 02:24:34

If you are referring to my first warning, this is not related to your contribution at all: what you did was exactly what I'm asking for.

What I want to achieve for now is a list of problems related to audio testing. And to finish clearing up your doubts, this is not about what I know or not: absolutely everything is fine as long as it is a related problem. Everything. I really don't want to be in a situation as described by xnor (figuring out flaws, restarting, figuring out a new flaw, restarting, etc.).

However, comments like the ones of ajinfla or Green Maker are not what I want: they are reacting, arguing and starting side discussions. What they did is perfectly normal, expected and any one of us would have done it. That's why I'm reminding everyone about the purpose of this thread (again, I want to nail it, if someone else started such a thread, I would be the first to react, argue and start side discussions!).

Think of this thread as a wiki page named "List of typical problems and shortcomings in audio testing". Would you like to see people arguing and discussing directly within a Wikipedia article? That's not to say people can't discuss: they just have to wait until next week to do it here or they can start a new thread.

Reply #10 – 2015-12-17 22:51:30

People can use high volumes and use the noise floor to discriminate samples

Crossfading leads to a lot of problems (think crossfading to an inverted signal, samples with different gains, etc.)

Fading is necessary to avoid an audible Gibbs effect and possible high-frequency components (a Hann window over a few ms seems fine)

Samples should be perfectly aligned.

ReplayGain (or similar) should be used to guarantee similar gain (0.2 dB sufficient?)

Some sound card/drivers seem to force fade in when changing sample rates (third sample rate needed?)

People tend to spend too much time testing (they get accustomed and can't hear differences anymore)

Bad DAC: clicks with different sample (related: HW issue/SC force fade)
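
Two of the items above, level matching to within about 0.2 dB and a Hann fade over a few milliseconds, can be sketched in plain Python. This assumes samples are floats in the range -1.0 to 1.0; the function names are hypothetical, and plain RMS is used as a simple stand-in for level (ReplayGain itself relies on a perceptually weighted loudness estimate).

```python
import math


def rms_db(samples):
    """RMS level of a block of float samples, in dB relative to full scale 1.0."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return -math.inf  # digital silence
    return 10 * math.log10(mean_square)


def gain_mismatch_db(a, b):
    """Level difference between two clips in dB (positive: `a` is louder)."""
    return rms_db(a) - rms_db(b)


def hann_fade_in(samples, sample_rate, fade_ms=5.0):
    """Return a copy whose first `fade_ms` ms follow a half-Hann (raised-cosine) ramp."""
    n = min(len(samples), int(sample_rate * fade_ms / 1000))
    out = list(samples)
    for i in range(n):
        # Ramp rises smoothly from 0 at i = 0 towards 1 at i = n
        out[i] *= 0.5 * (1 - math.cos(math.pi * i / n))
    return out


if __name__ == "__main__":
    rate = 44100
    tone = [0.5 * math.sin(2 * math.pi * 1000 * i / rate) for i in range(rate)]
    quieter = [0.977 * s for s in tone]  # about 0.2 dB lower
    print(f"mismatch: {gain_mismatch_db(tone, quieter):.2f} dB")
    faded = hann_fade_in(tone, rate)
    print(f"first sample after fade: {faded[0]}")
```

A fade-out is the same ramp applied in reverse to the end of the clip.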

Reply #11 – 2015-12-18 01:53:15

Double blind tests are not understood at all ("another fantasy that proves nothing")

People need solid and obvious proof that they can actually hear differences in a blind test (they think facts are hidden by the setup, cheap gear, etc.)

Some people tend to think that the "subjectivist" way of doing things (basically training to have golden ears and then listening to the music either once or, on the contrary, very very often) is far more accurate.

In extension, not allowing prior or parallel semi-blind test can be seen as an issue. (Note2ms: could UI be improved? Hybrid approach from a simple preference test to a differentiation test? "How would you define this sound? That one? Is this one more like this or that?")

The engagement phenomenon is really really really strong.

People tend not to be able to differentiate "different" and "better"

Some people have absolutely no notion of aliasing and are astonished that "objectivists" say there are no differences while they can clearly hear a difference. (Note to myself: should make high res test files with high freq noise or funny frequency spec.)

Naming tests and results as "Proving that there is no difference between X and Y" is not perceived well by a whole category of people

According to some people, it takes days, even weeks to transition from the lowly CD world to the glorious HD world... (The people that did the switch may want to be "recognised")

"Subjectivist" test should last a long time (not one-shot, "would you like to get back to a previous test?").

There are groups of self-recognised expert listeners: only results provided by the group seem good enough for the people in that group

ABX testing is not performed "in situation" (e.g. with everyday noise: not everyone listens to their music in a soundproof room)

People/Gear capabilities are not properly tested to make sure that everything is fine (hearing curve on given HW for instance)

Results are often too binary "difference can't be heard" (vs maybe "there might be differences but imagine, it took X minutes on average (and Y minutes min) for the people to correctly choose and yet, they were still wrong Z times out of T")

People could stress that they are not right (positive reinforcement? Sth else?)

Don't you forget the fg ref group john!!!

Oh, good one: ABX box makes A and B sound the same.

In extension to "the difference cannot be heard": the null hypothesis is that the listener cannot tell the samples apart. Failing to reject it only means that we couldn't prove that they were different, not that they are the same.

Training and all the other user actions should be reported (not as something wrong)

ABX is not about measuring differences in audio but differences in perceived audio

Reply #12 – 2015-12-18 12:16:50

I think you need to distinguish between different kinds of "problems" here. Some points are just myths or audiophile rumors or misunderstandings.

Reply #13 – 2015-12-18 12:20:04

I also fail to see what the point of all this is supposed to be.

You fail to distinguish between problems the ABX testing method has, problems that people have with ABX testing because of their misunderstanding or prejudice, and problems that are wrongly attributed to ABX testing.

You say you want to get the big picture, but what you show is a method of obfuscating any big picture there may be. It looks more like a thinly veiled attempt at discrediting ABX, not the least because you don't seem to be interested in any counterbalancing fact, i.e. the problems of alternatives of ABX.

Reply #14 – 2015-12-18 13:49:17

All that you have described, MMime, are reasons that good evidence collection is difficult and why no single piece of evidence can be considered proof in a vacuum.

You have not actually pointed out problems with ABX or double-blind testing in general but rather attributed problem with how certain people at certain times use the results (or lack thereof).

Reply #15 – 2015-12-18 15:20:05

It may be the time for a small explanatory note...

As I repeated already at least thrice, this topic is about listing problems related to audio testing in general. Not only ABX, not only blind testing, all of them.
What I did not specify is that as soon as something is perceived by someone as a problem, it logically becomes a problem (think about it, even if just psychological). Example: an old lady throws peas all around her at Hyde Park to protect herself from lions. You consider it a problem (for whatever reason). Yet, acknowledging this problem does not mean that you consider that the lions are real.
I guess you really don't see the point of doing this without classification because you (I should really say we) are used to solving problems locally. So naturally, if you are given such a list of problems, which starts to be huge, you are thinking "how the hell does he want to 'solve' all of that?! some problems aren't even real!!!" because in your mind, I'll take problem #1, discuss, find a solution and then problem #2, discuss, find a solution, etc.
Starting tomorrow (or Monday depending on my free time), I'll start to order the problems in a tree. In practice, most problems are caused by deeper problems (and reciprocally, most problems induce other, shallower problems). This is what I call the big picture (of the current situation... there will be other 'big pictures').
Once this is done, I'll leave some time to find other cause-/consequence- problems and I'll start step 2: identifying which specific problems to solve.
Once this is done (this should be quick), I'll start step 3 with a new 'big picture' and I expect you to discuss, comment, argue, cry... a lot.

Does that seem clearer?

@xnor: I hope you understand now why I did not classify the problems: there is no need because even if the lions are an illusion, that would not change the fact that the lady thinks this is a problem. And our own problem is that {the lady thinks lions are a problem} (not the lions). And that's even one of the toughest problems! From your own experience, for instance, if you told her there are no lions in the UK, would she just reply "ohhhh! how silly of me, you are right! oh my bad, now that I think of it, what peas would have done to it? oh oh ah ah! have a cuppa tea?"... Nah... You'd get the usual "but there are zoos!!! And what prevent them from taking the train???!!!!"

@pelmazo: your crystal ball might be slightly broken, your comfort zone might be slightly bruised and I understand that you find that frustrating, but I really don't like your tone and won't tolerate any more of it. So quit freaking out and chill down. You can question, but the part about 'I see through you and your pathetic attempt at undermining my holy ABX' was unneeded.
You would see how wrong your accusations are just by looking at my previous comments in other threads (one about FLAC, particularly).
And don't you feel any shame, telling me that I am not "interested in any counterbalancing fact, i.e. the problems of alternatives of ABX" while I AM ASKING FOR THAT EXACTLY!!! Read again, if you assumed this was about the shortcomings of ABX, this is in your mind only! I'd LOVE to hear the shortcomings of the alternatives!!!

@Soap: I pointed out problems related to ABX and DBT only because that's what I know the most. That's why I'm asking you guys. And I repeat, you absolutely don't have to focus on ABX or DBT, that's not the purpose. But I also listed problems that are linked to statistics and psychology.
As well, you may consider that "people do not trust ABX" is not a problem with ABX itself. Depending on the perspective, it is perfectly true. However the final effect is the same: if people don't trust ABX testing, they don't trust ABX testing and won't do them or be interested in them to guide their choice.
The workflow that I presented above has another advantage: "people do not trust ABX" is awfully vague. As such, it cannot be solved by a single action (well, usually). But by representing the problems in a graph, you will figure out what "people do not trust ABX" actually means! Because the cause-problems that link to that particular problem ARE what actually make people feel distrust about the technique. And chances are high that you can fix these to some extent.

Reply #16 – 2015-12-18 15:39:14

Quote from: MMime on 2015-12-18 15:20:05

It may be the time for a small explanation note...
As I repeated already at least thrice, this topic is about listing problems related to audio testing in general. Not only ABX, not only blind testing, all of them.

Does that seem clearer?

@Soap: I pointed out problems with ABX and DBT only because that's what I know the most. That's why I'm asking you guys. And I repeat, you absolutely don't have to focus on ABX or DBT, that's not the purpose. But I also listed problems that are links to statistics and psychology.

Again, this appears to be listing "problems" (peas / lions example is a classic one) which are only Problems if you play the game.

The old lady spreading peas is only a Problem if one thinks it is a good use of time to try to force others to view the world the way you do.

So let's address your original points one by one.

1 - Switching tests are hard: So what? Lots of evidence collecting is hard and not everyone is equipped to study gene splicing either.

2 - Can't trust self-reported tests: So what? Don't. Your happiness shouldn't rely on trusting others making outrageous claims.

3 - Codec comparison is hard: See point #1

4 - Pipelines might cause flaws: See point #1

5 - You can't trust preference tests: Broken record, you don't need to trust others. If something you need to accomplish relies upon knowledge determined through preference testing it probably needs done by you anyway.

6 - Most people don't do formal testing: See point #2. It Doesn't Matter For Your Needs And Wants Unless You Want To Pick Internet Fights.

7 - People misunderstand statistics and draw faulty conclusions from one round of test: This is you allowing the mistakes of others to make you feel angry. Anger is the killer.

8 - Spectrum: This is a restatement of #2.

Reply #17 – 2015-12-18 16:01:08

Sorry, I updated the part for you while you were answering.

Quote from: Soap on 2015-12-18 15:39:14

Again, this appears to be listing "problems" (peas / lions example is a classic one) which are only Problems if you play the game.

The old lady spreading peas is only a Problem if one thinks it is a good use of time to try to force others to view the world the way you do.

What are the problems if you don't play the game? It would be much more helpful to give examples of such problems instead of staying vague.

Quote from: Soap on 2015-12-18 15:39:14

So let's address your original points one by one.
[...]

That's exactly what I don't want.
Could you try to understand what I wrote, not only in the first post (which indeed is not that good), and not just focus on saying that these are not actually problems.

On the day you are fired and going to jail because of debts, with no one coming for you because your whole family has died in an accident, would it be perfectly OK to be told "what the hell are you talking about, you don't have problems... You are fired, so what? Don't see the problem: a lot of people do not have a job. You have debts? So what?! Your happiness should not rely on material things such as this. Going to jail? See point #2. Your family died, see point #2 but with 'others' instead of 'material things'. You don't like my answer? This is you allowing the mistakes of others to make you feel angry. Anger is the killer."?

So yeah, with such reasoning, there is no problem whatsoever... And you can safely not come into this thread again. I'm ultimately looking for flaws and ways to address them to design (or help people design) better tests that better represent the reality. If you don't see the point, fine, but refrain from posting then.

Reply #18 – 2015-12-18 16:18:24

Quote from: MMime on 2015-12-18 16:01:08

That's exactly what I don't want. Why? Because I can counter each of your counter arguments and we can repeat that ad nauseam... In the meanwhile, nothing is done and we don't have more insight, nor a larger picture. That's true that, now that I see the result, my first post has not been presented the right way. This was intended to give an example of list (still, with points I consider valid) to kick off the listing. This was not intended to actually represent the flaws of ABX testing that plague every day of my life.

But OK, let's consider these are not problems in audio testing for a moment, like you suggest. What such a listening test would be like? Isn't that strictly equivalent to saying "don't do nor trust tests, they are not worth it, just randomly choose something and shut up".

I'm ultimately looking for flaws and ways to address them to design (or help people design) better tests that better represent the reality.
You are saying: there is no point in doing any test at all, people, choose something random, that won't change anything anyway.

Huh? They aren't problems because none of them stop people who want to learn from learning.

All of those "problems" only exist when one chooses to argue with someone who doesn't want to do things right.

When you ask "What such a listening test would be like?" what do you mean?

"Isn't that strictly equivalent to saying "don't do nor trust tests, they are not worth it, just randomly choose something and shut up"." NO. Do proper tests and control for the variables you can control for and understand the limits of your collected data. THE SAME AS ANY OTHER SCIENCE. Nowhere did I suggest something as fundamentally stupid and insulting as "randomly choose something and shut up".

Reply #19 – 2015-12-18 16:27:50

Science is also about improving the methods.

Improving the methods also means identifying flaws.

With your reasoning, non-blind tests would still be performed for drug tests for instance.

But someone figured out that there was a psychological, YES, a PSYCHOLOGICAL flaw with that.

This flaw had NOTHING to do with the drug test by itself (from your perspective anyway).

Yet, DBTs allow us to buy better drugs with a minimum of confidence that they are working.

What would you have said at the time? "Patients would get subconscious hints from the doctor? So what? Don't trust patients anyway."...

Reply #20 – 2015-12-18 16:33:23

In essence you are asking us to "describe a test that will fail because it is flawed". I would rather describe tests that will succeed because we have eliminated potential flaws.

Reply #21 – 2015-12-18 16:35:49

Think of these "problems" as "things I should think and care about if I were to design a proper audio test".

I called these "problems" because these are things to think about, not "natural" things that would "naturally" go well without a single thought to handle it.

Reply #22 – 2015-12-18 16:40:05

Quote from: pdq on 2015-12-18 16:33:23

In essence you are asking us to "describe a test that will fail because it is flawed". I would rather describe tests that will succeed because we have eliminated potential flaws.

So does that mean that there is an audio test methodology, without trade-offs, accepted by absolutely everybody as "valid"?

If not, I want to know why "the tests that will succeed" according to you are not enough for person X or in situation Y.

AND I understand that there is prior work, hence the very first line of my very first post.

Reply #23 – 2015-12-18 16:40:06

Quote from: MMime on 2015-12-18 16:27:50

Science is also about improving the methods.

Improving the methods also means identifying flaws.

But you haven't identified any flaws in audio testing.
You've identified flaws in people who you can't trust! And you've identified that test setups which have not been properly designed may lead to inaccurate results.

Quote from: MMime on 2015-12-18 16:27:50

With your reasoning, non-blind tests would still be performed for drug tests for instance.

How so? That charge does not follow what I've said. This is not the first time you've charged me (without support!) with claims I did not make, but it will be the last.

Reply #24 – 2015-12-18 16:47:47

Quote from: MMime on 2015-12-18 16:40:05

Quote from: pdq on 2015-12-18 16:33:23
In essence you are asking us to "describe a test that will fail because it is flawed". I would rather describe tests that will succeed because we have eliminated potential flaws.

So does that mean that there is an audio test methodology, without trade-offs, accepted by absolutely everybody as "valid"?

If not, I want to know why "the tests that will succeed" according to you are not enough for person X or in situation Y.

AND I understand that there is prior work, hence the very first line of my very first post.

If you want to learn about bad audio testing, you should go someplace else. You won't find it here.
