Sample-specific discussions: sample #2
Reply #12 – 2008-12-02 05:00:07
The problem is that as long as the reference isn't ranked, I cannot simply refuse to accept results even if the low anchor is rated higher than a contender. Otherwise people might blame me for selecting only the results I like. This would be fatal for me in an AAC test, for example, because people have already told me that I am biased since I work for Nero.

Have you looked at any statistical methods for gleaning information by considering subsets of your data? http://en.wikipedia.org/wiki/Resampling_(s...tics)#Jackknife

I'm familiar with bootstrapping in phylogenetics, where you regenerate an evolutionary tree with various random subsets of the species you're interested in. That seems fundamentally different from considering random subsets of the samples, or of the submitters, though, since nothing non-linear or unpredictable happens after discarding some samples. I think it might be interesting to see, though, what fraction of the subsets of submitters still have all codecs tied within a 95% confidence interval.

BTW, what's so magic about 95%? What is the p-value for Helix >= LAME 3.98.2 on the whole test? Even if you can't say Helix >= LAME, maybe you can say there's a >70% chance that Helix did better on this test, and a 30% chance that that's not the case.

The only thing I did was to discard all results if a user had a very high number of results with ranked references (like 9 out of 14). In that case, I contacted the submitter and asked why this happened. Some people replied that they had simply guessed; after asking them to redo the test with ABX if possible, I included only the new results. Others were affected by the ABC/HR problem and wrote down the results on paper first, without knowing that reloading the configuration files re-randomizes the contenders, and others didn't reply at all. However, only a very small number of people were affected (I think a total of maybe 3 submitters).

My results didn't get counted for sample 2.
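The subsets-of-submitters idea can be sketched as a leave-one-submitter-out jackknife: recompute the codec ranking with each submitter removed and count how often the full ranking survives. All names and scores below are made-up placeholders, not results from this test:

```python
# Leave-one-submitter-out jackknife over listening-test ratings.
# ratings[submitter][codec] = score on one sample (hypothetical numbers).
import statistics

ratings = {
    "s1": {"LAME": 4.5, "Helix": 4.2, "FhG": 3.9},
    "s2": {"LAME": 4.0, "Helix": 4.4, "FhG": 3.5},
    "s3": {"LAME": 4.8, "Helix": 4.1, "FhG": 4.0},
    "s4": {"LAME": 4.2, "Helix": 4.6, "FhG": 3.8},
}
codecs = ["LAME", "Helix", "FhG"]

def ranking(subset):
    """Codecs ordered by mean rating over the given submitters, best first."""
    means = {c: statistics.mean(ratings[s][c] for s in subset) for c in codecs}
    return tuple(sorted(codecs, key=means.get, reverse=True))

full = ranking(list(ratings))
# Drop each submitter in turn and check whether the order is unchanged.
stable = sum(ranking([s for s in ratings if s != left]) == full
             for left in ratings)
print(f"{stable}/{len(ratings)} leave-one-out subsets preserve the full ranking")
```

With real data you would also recompute the confidence intervals per subset rather than just the point ranking, but the loop structure is the same.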
I didn't know that ABC/HR grays out the reference after a successful ABX in trial mode (not training mode), so I just used training mode to find an artifact, but then picked the wrong slider on one sample (based on the effectively single-trial ABX of listening to both sliders and the reference). I sent you an updated results file, but I guess you didn't use it because it arrived after the test deadline?

My scores:
iTunes: 5.0
LAME 3.98.2: 5.0
l3enc: 2.1 (higher than on the other three samples I did; it sucks, but wasn't as bad here, so it got a higher score)
FhG: 4.0
LAME 3.97: 2.7
Helix: 4.4 (note the comment, though)

Hardware: Koss TD/60 headphones and Logitech Z-5500 speakers. Sound card: Intel HDA (STAC9271D codec with a 105 dB DAC SNR), driven by Ubuntu GNU/Linux. I think I tried all encodes on both the phones and the speakers, to see if there was something I could hear with one of them. My phones aren't bad, but they have no bass compared to my lovely speakers.

Testname: Sample02
Tester: pcordes
General Comments: re-rated sample #5 (the one I previously rated as the reference). I could ABX it, but I was being too harsh.
---
1L File: Sample02/Sample02_3.wav
1L Rating: 2.1
1L Comment: warbly
---
4L File: Sample02/Sample02_5.wav
4L Rating: 2.7
4L Comment: horn isn't smooth, sounds clicky
---
5R File: Sample02/Sample02_4.wav
5R Rating: 4.0
5R Comment: second horn sounds a little off
---
6R File: Sample02/Sample02_6.wav
6R Rating: 4.4
6R Comment: Can only ABX based on the difference in noise floor in the opening half second. The sample has less background hiss.
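On the single-trial point: an ABX run is usually scored as the binomial probability of getting every trial right by guessing, which is why one trial is worthless as evidence. A minimal sketch for the all-correct case (the function name is mine, not from any ABX tool):

```python
# p-value for getting every trial right in an n-trial ABX test by pure guessing.
# Each trial is a 50/50 guess, so n correct answers have probability 0.5**n.
def abx_p_value(n_trials: int) -> float:
    return 0.5 ** n_trials

# One trial proves nothing (p = 0.5); the common 8-of-8 criterion
# reaches p = 0.5**8, about 0.004.
for n in (1, 8, 16):
    print(n, abx_p_value(n))
```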