64 kbit/s Test has started!

Reply #50
Since the idea of the test (and probably the only valid way to make conclusions) is to make a statement about quality over a wide range of samples, the idea of using -q0 is simply to make sure that over all samples Ogg indeed averaged 64kbps. If you want to look at near-CBR, the managed Ogg clip will do the trick.

Forcing Ogg to keep closely to 64kbps is handicapping it for the sole reason that its competitors can't do VBR.

If you are dragging the 'using the results for a certain kind of music' thing into the debate, you are *totally* on the wrong track. This test was *not* designed to make statements about codecs on certain kinds of music. Trying to draw any conclusion like that from it will give you unreliable results, so don't do it, period.

--
GCP

Reply #51
Quote
Originally posted by tw101

I think the equal quality part is just a goal, but far from reality. Otherwise if, let's say, -q5 is transparent for one clip, it should be transparent for all music, shouldn't it?


Your ears are also a variable factor, so this analogy is flawed.

--
GCP

Reply #52
Quote
Originally posted by ff123

I am just very uncomfortable about changing any data.  In the case of someone incorrectly pulling down the slider for the original, it is obvious that a real difference was not heard, but for the case of somebody failing to ABX but pulling down the correct slider, there is a chance that he actually heard a difference at the time he pulled the slider.


There's a 50% chance of pulling down the correct slider if you hear no difference. There's a <=5% chance of not hearing a difference, ABX-ing the clip and then pulling down the wrong slider. What you propose above makes no sense whatsoever.
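For reference, both numbers in this argument fall out of a one-sided binomial tail. This is a sketch of the arithmetic only; the helper name is my own, not part of any test tool:

```python
from math import comb

def binom_tail(correct, trials, p=0.5):
    """P(at least `correct` successes in `trials` independent guesses)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(correct, trials + 1))

# Guessing which slider to pull down is a single fair coin flip:
print(binom_tail(1, 1))               # 0.5

# A "passed" ABX run keeps the guessing probability at or below 5%,
# e.g. 12 correct out of 16 trials:
print(round(binom_tail(12, 16), 3))   # 0.038
```

So a listener who fails ABX but pulls the wrong slider down has at most a roughly 5% chance of having done so despite genuinely hearing a difference, while a pure guesser picks the right slider half the time.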

Quote
In any case, probably the safest route is to discard all inconsistent data.  Very draconian, but eminently fair.


Yes

--
GCP

Reply #53
Quote
Originally posted by Garf
There's a 50% chance of pulling down the correct slider if you hear no difference. There's a <=5% chance of not hearing a difference, ABX-ing the clip and then pulling down the wrong slider. What you propose above makes no sense whatsoever.
But how should the majority of results without ABX backup (because the difference was too obvious for the test person to bother with the time-consuming and exhausting ABX routine) be handled then? Those results could have been guessed as well (with 50% conditional probability).

Of course, giving less than e.g. 4.0 for a sample which one couldn't ABX is inadequate. But 4.8 is OK for me, if the test person is sure that he heard a difference but is too exhausted to ABX it.

Quote
Originally posted by ff123
In any case, probably the safest route is to discard all inconsistent data. Very draconian, but eminently fair.
Agreed.

Reply #54
Quote
Originally posted by Continuum
But how should the majority of results without ABX backup (because the difference was too obvious for the test person to bother with the time-consuming and exhausting ABX routine) be handled then? Those results could have been guessed as well (with 50% conditional probability).


There is no way to deal with this directly - you either hope they 'fall through' on some other part of that test sample, or that you get enough submissions that their noise disappears in the statistical analysis.

--
GCP

Reply #55
Quote
Originally posted by tw101
If WMA has VBR and it's better than its CBR, then bring it on (WMA 9 might do this).


It will.

Some of the features I found in the WMA9 beta:
- CD audio (16-bit/44100 Hz/stereo) goes up to 320 kbps
- Various VBR modes, from 40-80 kbps up to 160-400 kbps
- There are now three WMA codecs:
  - Standard: seems like the old one, but with more bitrate options.
  - Voice: low bitrates, mono, low sampling rates. (It doesn't seem to be ACELP, BTW.)
  - Professional: for high bit depths, high sampling rates, and multichannel. Goes up to 768 kbps for 96 kHz/5.1/24-bit.

Files are backwards compatible with WMP7. Even VBR ones.

And, according to a quick test Dibrom did, the quality at 64kbps was even worse than WMA7. (Using the sample thear1 from ff123's latest test)

It wasn't tested with VBR, though, since I found no way to encode from WAV to VBR; it seems this beta can only encode to VBR while ripping from a CD.

Regards;

Roberto.

Reply #56
Quote
Of course, giving less than e.g. 4.0 for a sample which one couldn't ABX is inadequate. But 4.8 is OK for me, if the test person is sure that he heard a difference but is too exhausted to ABX it.

The whole point is that if Mithrandir couldn't ABX it, then he has no evidence that he *can* hear a difference, and hence should give it a 5. Anything else is just plain wrong.

I'll be taking part in these tests in the next couple of days, and happily anticipate giving 5s out when I cannot hear a problem.

Reply #57
To JohnV and Garf (and whoever else I bugged):

The test has started, and I don't want to bug you anymore. I don't completely understand all your points yet, but I'll shut up for now. Peace.
tw101

Reply #58
My POV:
Grey out (disable) the sliders for the corresponding sample until the ABX test has been passed (the minimum number of trials could be set in a config file).

E.g., for this test a minimum of 5 ABX trials could be required. Listening is needed anyway, so why not start with the ABX? On easy samples this should be easy to pass; on difficult ones ABX is needed anyway. An additional advantage: ears are less fatigued when doing ABX at the beginning.

This way you can't give a score lower than 5 until the ABX test has been passed, so no incorrect results can be given.
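A minimal sketch of that gating rule. The names, the config value, and the all-correct pass criterion are my own assumptions for illustration, not anything from ABC/HR:

```python
MIN_ABX_TRIALS = 5  # hypothetical value read from a config file

def slider_enabled(abx_correct: int, abx_trials: int) -> bool:
    """Unlock the rating slider only after the listener has completed at
    least MIN_ABX_TRIALS ABX trials and answered all of them correctly
    (5/5 corresponds to a 1/32, i.e. about 3.1%, guessing probability)."""
    return abx_trials >= MIN_ABX_TRIALS and abx_correct == abx_trials

print(slider_enabled(5, 5))   # slider active, scores below 5 allowed
print(slider_enabled(4, 5))   # slider stays greyed out
```

Any agreed pass criterion (e.g. a p-value threshold instead of all-correct) would slot into the same check.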

Reply #59
Quote
Originally posted by Jon Ingram

The whole point is that if Mithrandir couldn't ABX it, then he has no evidence that he *can* hear a difference, and hence should give it a 5. Anything else is just plain wrong.

I continue to disagree. I stopped the test at 16 trials because I wanted to move on, but that doesn't mean the samples were unquestionably the same for me. You can say the files sounded essentially the same if I scored 9/16, but I am not comfortable saying that the lossy file, in that case, was 100% totally perfectly transparent. It wasn't.

My 16-trial ABX test demonstrated that the encoder did a fine job maintaining a high degree of transparency. But perfect? No. Hence, I graded it over 4.0, but not a 5.0. I also performed the ABX test late at night, shortly before I went to bed, and that probably wasn't a good time for such critical testing.

I now feel pressured to redo the ABX test for that sample and perform 50+ trials just so the granularity is substantially reduced.

This same thing happened with Klemm's 1.02 mppenc encoder. I wrote here that something didn't sound right with it but I simply could not produce the ABX evidence supporting my claims of differences. Therefore, what I said was "brushed under the carpet". Look, I know the placebo effect is something that should be minimized but I notice a religious fervor for the ABX test among these forums. There are many times where it is a great tool, but I do not kneel at the ABX altar all the time without reservation.

I don't know what the answer is. I hope we don't get too worked up over a minor detail when I'm sure there will be listener results where everything is rated a 1.0. Is that better than somebody who failed a 16 sample ABX test but who rated a sample a 4.6 instead of a 5.0? At least my results demonstrate a ranking and differentiation among encoders.

Reply #60
Quote
Originally posted by mithrandir
My 16-trial ABX test demonstrated that the encoder did a fine job maintaining a high degree of transparency. But perfect? No. Hence, I graded it over 4.0, but not a 5.0.
What distinguishes you from a person who would have rated it 4.6 just by guessing? There's a 50% chance of pulling down the correct slider. What makes you, after failing ABX, stand outside the group who would have guessed it?

Quote
This same thing happened with Klemm's 1.02 mppenc encoder. I wrote here that something didn't sound right with it but I simply could not produce the ABX evidence supporting my claims of differences. Therefore, what I said was "brushed under the carpet".
Where was this? Maybe nobody else could hear the problem. If nobody else can hear it, then it's hard to fix...
Juha Laaksonheimo

Reply #61
If you can hear a difference, you should be able to ABX it. It's possible that the ABX methodology stressed you so much that your sensitivity decreased. If so, my advice is to take more time over the ABX test, and also to get more accustomed to the methodology (try easier samples) so that you become more familiar with it.

Separately, I think it would be a good idea to extend the typical ABX test into something like an ABXY test. In this test X and Y would be A and B hidden, and you could do ABX tests, XY (=AB) tests, AXY (=ABC) tests, etc., so that each person could use the method that is most comfortable for them.

As to the ABC/HR program, I think it would be more useful to skip the picking of the modified (versus reference) file in the main ABC/HR window once you've successfully ABXed it. I mean, for every file you'd have to ABX it against the original reference file, and once you had done so successfully, you shouldn't need to identify it again in the main ABC/HR window; just rating it would be enough. On the other hand, if you couldn't ABX the file, the only possible rating would be 5.

Another good addition would be that for every ABX test you could choose between "hidden" trial results, in which case you could go forwards and backwards through all the trials and take as much time (and as many attempts at the same test) as you want to pass the ABX test, or the way it is now, where you can only go forward through the trials but know how well you are actually doing.

I think that with these possibilities the test would suit everyone's preferences better. I know this is not immediate or easy to implement; these are just my ideas.

I'm actually working on some of these in my own ABX comparators, when I have some time.

Reply #62
Quote
My 16-trial ABX test demonstrated that the encoder did a fine job maintaining a high degree of transparency. But perfect? No. Hence, I graded it over 4.0, but not a 5.0. I also performed the ABX test late at night, shortly before I went to bed, and that probably wasn't a good time for such critical testing.

Your 16-trial ABX test demonstrated that you *could not* hear a difference, and hence you should have rated the sample as a 5. You've already provided a good explanation as to why you were unable to hear a difference (late night, fatigue), but this does not remove the fact that you were unable to distinguish the sample from the original in a blind test.

It is not a matter of granularity, but a matter of providing evidence. Your ABX test provided *no* evidence that you can reject the null hypothesis that the sample is indistinguishable from the original.

Reply #63
Lots of good suggestions in this thread.  I have added many of them to the to-do list on the abchr web page.  Most of them I'm going to make optional.

ff123

Reply #64
Quote
Originally posted by mithrandir

I also performed the ABX test late at night, shortly before I went to bed and that probably wasn't a good time for such critical testing.


Besides the excellent point Jon Ingram makes, I'd like to point out that if you say you did the test at a time when you couldn't do critical testing, there's little point in making assumptions about what the results would have been had you been capable of critical testing, and grading based on, well, sheer guessing.

Quote
Look, I know the placebo effect is something that should be minimized but I notice a religious fervor for the ABX test among these forums.


Curiously, it exists exactly to prevent people who 'are sure they hear something', like you, from producing false positives.

--
GCP

Reply #65
Quote
Originally posted by mithrandir

This same thing happened with Klemm's 1.02 mppenc encoder. I wrote here that something didn't sound right with it but I simply could not produce the ABX evidence supporting my claims of differences. Therefore, what I said was "brushed under the carpet".


Of course it was brushed under the carpet. There was/is no evidence whatsoever anything was/is wrong.

Do you understand that you aren't going to be taken seriously on the basis of 'something didn't sound right'?

Edit:

To understand why I *am* so anal about this, consider the following. In the 64kbps test, I've had (out of 7 samples tried) to correct my score from 4.5 to 5.0 twice after ABXing. I was sure I was hearing a difference, but even if I concentrated well, I didn't pass the ABX test. I've done a lot of these tests, gotten a lot of experience, and I _still_ can't trust myself when doing them. So why should I trust *you* ?

It's possible that if I concentrate really hard and try when I'm less fatigued, that I do manage to distinguish those samples from the originals. But *right now*, I *don't*. Maybe I won't even when I'm rested. So 5.0 is the only correct score to give them now.

--
GCP

Reply #66
>What distinguishes you from a person who would have rated it 4.6 just by guessing? There's a 50% chance of pulling down the correct slider. What makes you, after failing ABX, stand outside the group who would have guessed it?

What are you going to do with results from people who did not perform the ABX tests at all? You have no proof that they guessed or not. At least with my results, you could "correct" my 4.6 to 5.0 if that satisfies people's objectivist sensibilities. But if someone just reports "4.0" and has no ABX results, you cannot prove that they would have passed or failed the ABX test.

I think we've found that this whole test is flawed to a certain degree. That's not a put-down, just that if I am going to be criticized for not choosing 5.0 when I failed a particular occurrence of an ABX test, don't accept others' results with open arms if they don't offer any supporting ABX results whatsoever.

Frankly, I think all results should be accepted as is and then we can filter and/or adjust responses afterwards, if it proves beneficial.

>Where was this? Maybe nobody else could hear the problem. If nobody else can hear it, then it's hard to fix...

Sure, it is hard to fix when the problem is unspecified, but my comment was a "call to arms", inviting people to agree or disagree with me by listening to the output themselves. I've used mppenc since 0.90, but when 1.02 was released I noticed that the output simply sounded different. The differences stemmed not necessarily from particular artifacts but from the way the output made me feel from an emotional standpoint. I know this kind of talk goes contrary to objectivist thinking, but I'm suggesting that perhaps there are differences that cannot be easily or quickly verified by an ABX test that uses 3-10 second samples. Maybe it takes an entire song, movement or album to render a perceptual opinion. Not that someone's long-term perception is equivalent to objective ABX evidence, but the lack of such ABX evidence does not mean two entities are equal.

ABX tests are popular because we are limited in the amount of time we can dedicate to serious comparison listening. If we had an endless resource of well-trained ears and plenty of time to use them, we could test a lot more samples than we can now. Because time is finite, a developer will say "the ABX test failed so I'm not worrying about it...there are other things to fix." OK, fine, that's valid, but I think some get too comfortable with failed ABX tests. I think we must admit that everything is a compromise. You oil the squeaky wheel first, as you should, but you need to be careful with labelling a lossy entity "transparent" before we've gotten around to testing it completely and thoroughly.

>It is not a matter of granularity, but a matter of providing evidence. Your ABX test provided *no* evidence that you can reject the null hypothesis that the sample is indistinguishable from the original.

What if I ran the 16 sample test tonight and scored a 13/16 and then repeated the same test the following night and scored 9/16? Are the differences real or not?

I guess you have to ask what exactly is ff123's test trying to find. I could re-grade all of the samples I tested and score them differently. Much differently? No. But 3.5's might be 3.2's and 1.2's might be 1.5's. The numbers are going to have some play in them. Everyone's saying "your 4.6 should have been a 5.0", but that's probably within the expected standard deviation.

>It's possible that if I concentrate really hard and try when I'm less fatigued, that I do manage to distinguish those samples from the originals. But *right now*, I *don't*. Maybe I won't even when I'm rested. So 5.0 is the only correct score to give them now.

Then perhaps we should submit results from the same sound clips multiple times. Maybe I hear differently in the morning than at night. I don't know. But if people are clamoring for making failed ABX tests default to 5.0, then a person should be able to rerun those tests over and over (and submit the results over and over) until they are satisfied that the test results are stable. I am not convinced my 9/16 score is a stable result because I don't believe the sample is transparent. I failed in one moment in time. Is that sample going to be transparent for me in every future case? I don't think you can safely say that. You need more samples, but like I said before, time is finite.
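The hypothetical 13/16 versus 9/16 question above has a concrete statistical answer. A sketch of the usual one-sided binomial test, assuming the conventional 5% significance level:

```python
from math import comb

def abx_p_value(correct, trials):
    """Chance of scoring at least `correct`/`trials` by pure guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(round(abx_p_value(13, 16), 3))  # 0.011 -> passes at the 5% level
print(round(abx_p_value(9, 16), 3))   # 0.402 -> consistent with guessing
```

So 13/16 would count as hearing a real difference, while 9/16 is what a coin flipper scores about 40% of the time; the two sessions would simply not carry equal evidential weight.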

Reply #67
Quote
Originally posted by mithrandir
I think we've found that this whole test is flawed to a certain degree. That's not a put-down, just that if I am going to be criticized for not choosing 5.0 when I failed a particular occurrence of an ABX test, don't accept others' results with open arms if they don't offer any supporting ABX results whatsoever.


I purposely kept ABX results and ABC/HR results separate.  The ABX was meant (in this test) purely as a training aid / confidence builder.  I didn't want to force ABX because it doesn't make sense for the majority of listeners on most encodes at this bitrate.  It would just add extra test burden at too great a cost, IMO.

Quote
Frankly, I think all results should be accepted as is and then we can filter and/or adjust responses afterwards, if it proves beneficial.


This is probably sensible for this test, where it is unlikely that people guessed too far off, if they guessed wrong at all.  But I can also discard inconsistent results provided enough people respond (I think this will be the case).  One of the benefits of ABC/HR is that it reduces noise merely by threatening to show the listener he guessed wrong.

ff123

Reply #68
Quote
Originally posted by mithrandir

What are you going to do with results from people who did not perform the ABX tests at all? You have no proof that they guessed or not. At least with my results, you could "correct" my 4.6 to 5.0 if that satisfies people's objectivist sensibilities. But if someone just reports "4.0" and has no ABX results, you cannot prove that they would have passed or failed the ABX test.


The issue isn't whether someone's scores are 'right' or 'wrong' (a silly concept), but to determine which codec does best! If we are able to correct your scores, because you were nice enough to include ABX results, and that allows us to reach more solid conclusions, then that's a good thing. I say again: this is not about getting your scores right or wrong, it's about getting the maximum out of the test.

Quote
I think we've found that this whole test is flawed to a certain degree. That's not a put down, just that if I am going to be criticized for not choosing 5.0 when I failed a particular occurance of an ABX test, don't accept others' results with open arms if they don't offer any supporting ABX results whatsoever.


Why do you assume this makes the test flawed?

Quote
Everyone's saying "your 4.6 should have been a 5.0", but that's probably within the expected standard deviation.


You sort of got the point. Now note that if we correct the 4.6 score to 5.0, we may be lowering that same standard deviation. This allows us to draw more (or more solid) conclusions. This is why it's good.

Quote
Then perhaps we should submit results from the same sound clips multiple times. Maybe I hear differently in the morning than at night. I don't know. 


You could do this, and it would be interesting if the results differed.

Quote
But if people are clamoring for making failed ABX tests default to 5.0, then a person should be able to rerun those tests over and over (and submit the results over and over) until they are satisfied that the test results are stable. I am not convinced my 9/16 score is a stable result because I don't believe the sample is transparent. I failed in one moment in time. Is that sample going to be transparent for me in every future case? I don't think you can safely say that. You need more samples, but like I said before, time is finite.


My issue with this is that the evidence you got (the sample *is* transparent) is at odds with the score you gave (that it isn't). Yes, you may be right that with more careful listening the sample falls through, but there is no evidence for that. The only evidence we do have points to exactly the opposite. So the score you gave makes no sense to me.

--
GCP

Reply #69
Quote
There's a 50% chance of pulling down the correct slider if you hear no difference. There's a <=5% chance of not hearing a difference, ABX-ing the clip and then pulling down the wrong slider. What you propose above makes no sense whatsoever.
Average Joe's view: there is no such thing as a 50% chance here; that's just because you don't know the number of tests someone did. I mean, you can't tell how many times he actually pressed that play button in ABC tests (whereas you know that in ABX). IMHO, ABC and ABX should be merged together (somehow) so they don't cancel each other out.
PANIC: CPU 1: Cache Error (unrecoverable - dcache data) Eframe = 0x90000000208cf3b8
NOTICE - cpu 0 didn't dump TLB, may be hung

Reply #70
Quote
Originally posted by smok3
Average Joe's view: there is no such thing as a 50% chance here; that's just because you don't know the number of tests someone did. I mean, you can't tell how many times he actually pressed that play button in ABC tests (whereas you know that in ABX). IMHO, ABC and ABX should be merged together (somehow) so they don't cancel each other out.


If they do not hear the difference, the chance of pulling down the correct slider is 50%. This is a mathematical result that's independent of anything they did before.

--
GCP

Reply #71
The test is now on the frontpage of slashdot.org

--
GCP

Reply #72
Quote
Originally posted by Garf
The test is now on the frontpage of slashdot.org

-- 
GCP


Yeah, I saw that.  There might be a lot of people sending in submissions now...

Reply #73
Quote
Originally posted by Garf
You could do this, and it would be interesting if the results differed.

I just spent 20 minutes redoing the test for WMA Bach. I had to study the clip until I found the segment with the most memorable artifact. The artifact I found was a minor collapsing of the soundstage. With my ears honed in on that artifact, I ran the ABX test 30 times and had 20 correct trials. That's not a high success rate, but it's statistically significant.

When I got the 9/16 score, I had ABXed a different part of the clip. With a lossy file with very good performance like this one (to my ears), you have to pick the part that contains its worst relative performance. I suppose I didn't pick the most problematic part the first time around.
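The 20/30 claim checks out under the same one-sided binomial arithmetic (a sketch; the function name is my own):

```python
from math import comb

def tail(correct, trials):
    # one-sided probability of doing at least this well by pure guessing
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2**trials

print(round(tail(20, 30), 3))  # 0.049 -> just under the usual 5% cutoff
print(round(tail(19, 30), 3))  # 0.1   -> 19/30 would not have been enough
```

So 20/30 is the smallest score out of 30 trials that clears the conventional 5% threshold, which is exactly why it reads as "not a high success rate, but statistically significant".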

Reply #74
@ff123:

How long will you be accepting results?

I did my first test today, and I already see that I won't be able to do more than one test a day, probably not even every day (as I think I'd better ABX everything, to be sure).

Also, would you like people to send each result in one by one, or all of them together once they are all done?

-----------------------

and a suggestion for abchr:
it could be a good idea to add an option to save the current session to a (humanly unreadable) binary file, so that one could continue an interrupted test later.

regards
konstantin