Topic: Jeff Atwood's "Great MP3 Bitrate Experiment"

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #25
Like someone else said, I think some testers got a bit confused at how they were supposed to be rating these things.

Though I find it amazing that anyone could perfectly rate them (except by accident or cheating). I guess I just have cloth ears!

Cheers,
David.

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #26
One of these:


Tch.  Kids these days... obviously never read Billboard.
Regards,
   Don Hills
"People hear what they see." - Doris Day


Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #28
There is a fundamental difference between data that are unfavourable and data that do not meet the requirements of the test.

Many users submitted their data under the incorrect assumption that the scale of 1–5 was a rank of their preference for each individual sample, with each value being useable only once. In actuality, the scale was supposed to be used as their rating of perceived quality for each sample, with no limit to the number of occurrences.

So, I don’t think your reference is relevant.

Whether or not it’s possible to confidently identify the data that do not meet the actual specification, to discard them, and to retain sufficient numbers to draw a useful conclusion is another question entirely.

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #29
You shouldn't cherry pick raw data under any circumstances in a properly unbiased, double-blind test. It makes the test suspect, regardless of the test conductor's intentions, good or evil. If there was poor wording or a misunderstanding in the instructions, then one needs to conduct a fundamentally new test, not discard raw data one "believes" to be compromised.

[In different circumstances, however, I'd accept using one test in an attempt to find certain "gifted" test subjects, who are then retested. This could be used, for instance, to find "golden-eared" listeners.]

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #30
Disregarding all listeners who rated WAV as less than 5 gives us this chart (based on his Excel file, can't be bothered to make it more "accurate")



And our original:



Note how that completely removes much of the preference for the first option, and brings all the other options roughly in line with what we would expect.
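For anyone who wants to reproduce this from the Excel file, the filter itself is trivial; here's a rough Python sketch (the column names and file name are invented for illustration, the real spreadsheet is laid out differently):

import pandas as pd

# One row per listener, one ratings column per sample; "wav" marks the uncompressed sample.
# These names are placeholders -- adjust them to however the spreadsheet actually labels things.
df = pd.read_excel("bitrate-experiment.xlsx")
cols = ["wav", "sample_b", "sample_c", "sample_d", "sample_e"]

kept = df[df["wav"] == 5]    # drop every listener who rated the uncompressed WAV below 5
print(kept[cols].mean())     # per-sample mean ratings over the remaining listeners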

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #31
Disclaimer: meandering musings

You can't cherry pick raw data under any circumstances. It makes the test invalid regardless of your intentions, good or evil. If there was poor wording or a misunderstanding in the instructions, then you need to conduct a fundamentally new test, not discard raw data you "believe" to be compromised.

I don’t disagree in principle. Hail science! I was just pointing out that, however scientifically tenuous it might be, excluding data because they were submitted in the wrong format is not exactly equivalent to excluding data because they aren’t conducive to someone’s ulterior motive(s). At the very least, it’s not equivalent ethically: one is done in an effort to improve the reliability of a conclusion, whereas the other is done merely out of cynical self-interest.

Scientific ethics aside (just for a moment!), is such filtering of incorrectly calibrated data even likely to be possible in any real-life study with any probability of preserving its objective reliability? I lack the experience to answer either way, and I suspect that it’s better avoided anyway due to the same concerns that you’ve raised – but in this case, I don’t think it’s very likely that one could do it. That was what I meant by my closing sentence, although I should have given it more consideration.

Of course, as you implied, this question should never arise: collection of data should be designed so as to preclude any of them being ‘incorrect’ or ambiguous. In this specific case, the take-home message is that instructions must be clear and unambiguous, so that respondents can provide useful data. It’s a shame how this test is somewhat marred by its shortcomings in that area and, as I said, how this confounding factor can’t be removed post hoc.

Disregarding all listeners who rated WAV as less than 5
Since you’ve just reminded me of something I wondered about earlier: how about disregarding all respondents whose data sets included each number only once? Or am I getting desperate here?

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #32
Since you’ve just reminded me of something I wondered about earlier: how about disregarding all respondents whose data sets included each number only once? Or am I getting desperate here?

I also excluded all data sets consisting of one number for all entries.

Combining our approaches (restricted to WAV=5, only entries with duplicates) does not provide good results either: [4.11, 3.79, 5, 3.79, 3.52]
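In code, the combined filter would look roughly like this (same caveat as before: placeholder column and file names, not the spreadsheet's actual layout):

import pandas as pd

# Same placeholder layout: one row per listener, "wav" = the uncompressed sample.
df = pd.read_excel("bitrate-experiment.xlsx")
cols = ["wav", "sample_b", "sample_c", "sample_d", "sample_e"]
sub = df[cols]

has_duplicates = sub.nunique(axis=1) < len(cols)   # reused at least one rating value
not_all_same = sub.nunique(axis=1) > 1             # didn't give every sample the same number

kept = df[(df["wav"] == 5) & has_duplicates & not_all_same]
print(kept[cols].mean())                           # per-sample means after both filters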

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #33
You shouldn't cherry pick raw data under any circumstances in a properly unbiased, double-blind test. It makes the test suspect, regardless of the test conductor's intentions, good or evil. If there was poor wording or a misunderstanding in the instructions, then one needs to conduct a fundamentally new test, not discard raw data one "believes" to be compromised.


Well ... opinions certainly differ on that one. As far as I know, there is no universally agreed-upon treatment of outliers.

However, if the null is random ranking, then various statistical models could cope with those who rank the other way around. You could formulate the alternative hypothesis to be H1: after possibly switching order of rankings, they are still more concordant with bitrate than what is consistent with the null. But if you start looking at data, you are mining, and that is not without issues either.


Now for designing a new test, you are of course free to look at your old data with any creativity you can imagine. You are essentially looking for any pattern that could be tested.
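To make that concrete, here is one way the per-listener statistic could be scored against that null, sketched in plain Python with made-up ratings. Taking |Kendall tau| means a consistently reversed ranking counts as concordant too, which is the "after possibly switching order" part:

import random
from itertools import combinations

def kendall_tau(x, y):
    # Plain Kendall tau: (concordant - discordant) / number of pairs; ties count as zero.
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        if prod > 0:
            s += 1
        elif prod < 0:
            s -= 1
    return s / len(pairs)

bitrates = [1, 2, 3, 4, 5]    # samples in ascending-bitrate order
ratings = [2, 3, 3, 4, 5]     # one listener's ratings (made-up numbers)

observed = abs(kendall_tau(ratings, bitrates))

# Null: the listener rates without regard to bitrate, so any rearrangement
# of their own ratings is equally likely.
trials, hits = 10000, 0
for _ in range(trials):
    shuffled = ratings[:]
    random.shuffle(shuffled)
    if abs(kendall_tau(shuffled, bitrates)) >= observed:
        hits += 1
print("p =", hits / trials)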

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #34
http://news.change.org/stories/cherry-pick...ientific-method

Quote
As far as I know, there is no universally agreed-upon treatment of outliers
You count them.

As always, if you discover there is a flaw in the test design, then you chuck ALL the data in the trash bin and design a new test. You don't go back and cherry pick out (keep) only the data you feel, with your "completely objective and unbiased view", is "legit".

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #35
Shall I just repeat what I’ve already said about your allegation that the exclusion of incorrectly formatted data – which was not done by the actual researcher, it must be emphasised – is equivalent to cynical cherry-picking in favour of an ulterior motive? Or are you the only one who gets to repeat yourself?

I don’t disagree in principle that one should always endeavour to solve problems at the earliest/proper point, i.e. the experimental design in this case. I was just musing hypothetically. That last word is important, since it’s me who’s twittering away to myself here, rather than the researcher having actually done this or anything like it! Looking back, I do not agree with the filtering suggested by nevermind, which began all of this, but again: that’s different from asking whether one can filter data that were not formatted correctly. Which, again, isn’t something I think can be done reliably – but it was just a hypothetical question about the possibility of putting a Band-Aid on a less than optimally designed test, not prodding something in a direction according to self-interest.

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #36
I think there is a belief here that, as long as the motives are pure and unmotivated by the desired outcome, cherry picking is "OK". I don't feel that way. There could be things which are unforeseen by all of us.

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #37
Data are data. If there has been some kind of procedural error and it's not feasible to re-run the experiment, it's entirely legit to restrict your data down to the valid subset, if there is some easy way to do so. If, due to some error, only 10% of your data are actually valid, and you can identify that 10% post hoc, there is no reason not to analyze that 10%. It might redeem the entire experiment.

Ideally, yes, you re-run the experiment and try to ensure that 100% of your data are valid. This is not always feasible, nor should it be absolutely required.

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #38
As always, if you discover there is a flaw in the test design, then you chuck ALL the data in the trash bin and design a new test.


While I understand your motivation, this is basically just your unsupportable opinion.

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #39
basically just your unsupportable opinion
Careful, we've stepped below science into its superstructure: philosophy of science. Here there be dragons: terrible things that could render all the lovely objectivity around here into little more than "unsupportable opinion"...

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #40
If there has been some kind of procedural error and it's not feasible to re-run the experiment, it's entirely legit to restrict your data down to the valid subset, if there is some easy way to do so.
...

Huh? I suspect you don't really mean this, unless I am just completely mis-reading it. The feasibility or ease of re-running a test doesn't make a difference as to the legitimacy of the original test.

To paraphrase what you have written, one could say "If it is difficult to re-run a test, then we should accept at least the subset of the data that we believe wasn't compromised by the known error" [as long as we still have a large enough sample left over to make the results statistically significant, I guess]. "If it is easy to re-run the test, however, then the original data is suspect, should be ignored, and we should do the re-test."

The difficulty in re-running a test, but this time without the design flaw, doesn't change whether the original test data is legit or not. It either is or it isn't, regardless of the time needed/ease/difficulty in conducting a new test without the design flaw. Right?
---

"Cherry picking" is a type of confirmation bias, more accurately called a "fallacy of suppressed evidence" and may very well be unconscious in nature, despite its sinister sounding name. I wasn't, however, trying to speak poorly of anyone here or question their motives, but I seem to be alone here in thinking that claims of "pure and unbiased" motivation, which of course all scientists think applies to them  , doesn't suddenly make cherry picking "acceptable". Everyone thinks their selection process is "sound, pure, and motivated only by the unbiased pursuit of truth".

As it says here, one's motivation may indeed be pure and honest, but the fallacy name, even if not a very good name, still applies:

"If the relevant information is not intentionally suppressed by rather inadvertently overlooked, the fallacy of suppressed evidence also is said to occur, although the fallacy’s name is misleading in this case. The fallacy is also called the Fallacy of Incomplete Evidence and Cherry-Picking the Evidence.."

I unfortunately don't have any more time on my hands to devote to this, so I'm outta here.
Happy July 4th everyone!

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #41
"Cherry picking" is a type of confirmation bias, more accurately called a "fallacy of suppressed evidence" and may very well be unconscious in nature, despite its sinister sounding name. I wasn't, however, trying to speak poorly of anyone here or question their motives, but I seem to be alone here in thinking that claims of "pure and unbiased" motivation, which of course all scientists think applies to them  , doesn't suddenly make cherry picking "acceptable". Everyone thinks their selection process is "sound, pure, and motivated only by the unbiased pursuit of truth".


I don't think anyone is saying that cherry picking doesn't exist. I think the point is that your remarks about cherry picking are not really relevant in this particular instance.

basically just your unsupportable opinion
Careful, we've stepped below science into its superstructure: philosophy of science.


Which is why it is incorrect to make universal assertions about how things must be done.

Jeff Atwood's "Great MP3 Bitrate Experiment"

Reply #42
Quote
As far as I know, there is no universally agreed-upon treatment of outliers
You count them.

Before or after you have defined them?


As always, if you discover there is a flaw in the test design, then you chuck ALL the data in the trash bin and design a new test.

Well, go tell that to a paleontologist!


No, seriously: have a look at http://en.wikipedia.org/wiki/Meta-analysis..._and_weaknesses .