
Objective difference measurements to predict listening test results?

I started a new live research project on the subject - http://soundexpert.org/news/-/blogs/object...g-test-results-

Abstract. This research aims to explore the relationship between waveform degradation of audio material and its auditory perception. Such a relationship can be investigated by analyzing the results of already finished listening tests. We will examine several listening test cases, comparing levels of waveform degradation with subjective quality scores.

The first part of the research is based on Kamedo2's listening test. Some preliminary results were mentioned in this post – https://www.hydrogenaud.io/forums/index.php...st&p=904976. Now this case has been examined more thoroughly, and the results are promising.



Your critique and opinions are welcome. Comments on the SE site work as well.
keeping audio clear together - soundexpert.org


Objective difference measurements to predict listening test results?

Reply #2
I checked the modeling procedure.

If you do the quadratic approximation from only the four "good" points (ffv7, ffv9b, lame, nero), after excluding the one "bad" point (opus), you will almost certainly get an approximation with very small errors.
But the usefulness of the approximation is very questionable: you have only 4-2-1 = 1 degree of freedom in the approximation, and the seemingly good fit can be pure chance.
And according to the quadratic curve, a score of 1.0 (Very Annoying) or 2.0 (Annoying) should sound very good, with a very low Df value; that can't be right.

Objective difference measurements to predict listening test results?

Reply #3
I checked the modeling procedure.

If you do the quadratic approximation from only the four "good" points (ffv7, ffv9b, lame, nero), after excluding the one "bad" point (opus), you will almost certainly get an approximation with very small errors.
But the usefulness of the approximation is very questionable: you have only 4-2-1 = 1 degree of freedom in the approximation, and the seemingly good fit can be pure chance.
And according to the quadratic curve, a score of 1.0 (Very Annoying) or 2.0 (Annoying) should sound very good, with a very low Df value; that can't be right.

You can safely use the linear approximation; it gives just slightly higher errors but avoids the uncertainty related to quadratic curve fitting.

Code:
                        opus     nero     lame     ffv9b    ffv7     fdk    fhg

True Quality Score      4.31     3.93     3.65     3.42     3.13

Modeled QS (quad)                3.93     3.66     3.40     3.19     3.86    3.99
Error                           -0.05%   +0.27%   -0.53%   +1.89%

Modeled QS (linear)              3.98     3.57     3.33     3.25     3.86    4.08
Error                           +1.15%   -2.25%   -2.55%   +3.97%


For qualitative analysis, a linear fit is enough in our case. The second-order model just increases accuracy, but should be used with care outside the initial points. Usually, the higher the order of the fitting curve, the more unpredictable it is outside the initial points.
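
For illustration only, here is a minimal sketch of this kind of curve fitting using numpy; the function names and arguments below are illustrative and not part of the article's actual procedure.

Code:
import numpy as np

def fit_df_to_scores(df_values, scores, order=1):
    """Fit a polynomial of the given order to (Df, subjective score) pairs.

    order=1 (linear) is the safer choice with only a handful of points;
    order=2 (quadratic) fits the known points slightly better but can
    behave unpredictably outside their range.
    """
    return np.polyfit(np.asarray(df_values, dtype=float),
                      np.asarray(scores, dtype=float), deg=order)

def predict_score(coeffs, df_new):
    """Predict a subjective score for a new Df value from the fitted curve."""
    return np.polyval(coeffs, df_new)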

Thanks for the link.
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #4


Your critique and opinions are welcome. Comments on the SE site work as well.

Interesting investigation. Some comments / thoughts:
  • Seems much cheaper computation-wise than e.g. the PEAQ method. But probably the same disadvantage, see below.
  • How does it compare against PEAQ scores for the same codecs and waveforms? You can use McGill's free PQEvalAudio implementation.
  • Edit: ah, one more: what about the lower audibility of content above ~16 kHz? Does your model account for this, e.g. by some kind of low-pass filtering?
  • If I understand correctly, your measure is based on waveform differences (delay-aligned decoded minus original waveform). This only works for high-bit-rate coding.

The last bullet point, which also applies to PEAQ (which does not provide useful data on parametrically coded low-bit-rate audio), might explain why your approach does not deliver comparable values for the Opus and iTunes TVBR coded stimuli.
But I don't know for sure because I don't know how those two encoders operate or whether the other encoders (esp. nero) use parametric coding or not. I only know that Opus can use a parametric tool called "spectral folding" at high frequencies.
So ~100 kbit/s stereo might be the lowest bit-rate you can use this approach on. At lower bit-rates all modern audio codecs use parametric coding (because you just don't obtain good quality without that).

Chris
If I don't reply to your reply, it means I agree with you.

Objective difference measurements to predict listening test results?

Reply #5
Interesting investigation. Some comments / thoughts:
  • Seems much cheaper computation-wise than e.g. the PEAQ method. But probably the same disadvantage, see below.
  • How does it compare against PEAQ scores for the same codecs and waveforms? You can use McGill's free PQEvalAudio implementation.
  • Edit: ah, one more: what about the lower audibility of content above ~16 kHz? Does your model account for this, e.g. by some kind of low-pass filtering?
  • If I understand correctly, your measure is based on waveform differences (delay-aligned decoded minus original waveform). This only works for high-bit-rate coding.

Well, not quite so. During computation of Df values, not only delays are removed but also phase shifts and time stretching/shrinking, so the procedure is time consuming (one histogram takes 12 hours on my notebook). The resulting Df values measure only differences in the shape of the signals, regardless of their amplitudes, sampling frequencies and phase deviations. It does not matter what kind of processing was used to produce the output signal, digital or analog; it is pure "black box" analysis: we have only the input and output signals.

Degradation of output signals correlates well with subjective scores only if the type/nature of their degradation is similar. Cluster analysis helps to discover such groups of output signals with similar distortions. The reason for possible dissimilarity is beyond the scope of this approach; it can be low-pass filtering, parametric encoding or anything else. The method just determines whether these particular Df values can be used for predicting subjective scores or not.

In other words, the method is a companion to a listening test, not a substitute. If it is reliably proven, the number of required listening tests can be reduced substantially, but not eliminated completely; in order to operate, the method/model requires some number of listening tests. For analog signals it works even better, because analog degradation is much more similar among various devices than digital processing, and especially psy-processing.
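
Just to give an idea of what comparing only the shape of signals means, here is a heavily simplified sketch. This is not the actual Df algorithm (the real computation also compensates phase shifts and time stretching/shrinking); the function and its steps are illustrative only.

Code:
import numpy as np

def simple_shape_difference(reference, degraded):
    """Heavily simplified shape-only difference measure (illustration only)."""
    # 1. Remove the delay by aligning via cross-correlation.
    corr = np.correlate(degraded, reference, mode="full")
    lag = corr.argmax() - (len(reference) - 1)
    degraded = np.roll(degraded, -lag)

    # 2. Remove the amplitude difference with a least-squares gain,
    #    so that only differences in shape remain.
    gain = np.dot(reference, degraded) / np.dot(degraded, degraded)
    residual = reference - gain * degraded

    # 3. Residual-to-reference energy ratio in dB (more negative = more similar).
    return 10 * np.log10(np.sum(residual ** 2) / np.sum(reference ** 2))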

To be honest, I don't want to compare my results with PEAQ at this stage, simply because it will not help to improve my method. Comparison with real listening test results is more interesting and productive. Maybe later.
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #6
  • What's the motivation behind doing a quadratic/high order polynomial fit? Like Kamedo2 pointed out, e.g. a quadratic fit doesn't make sense. It just fits really nicely (quite unsurprisingly) to some of the data points.
  • Like last time, the most interesting data points are left out because they are outliers. The method which is supposed to motivate their removal is just a roundabout way of saying "these data points are removed, because they look so different to all my other points", which is the characteristic of an outlier.
  • I get the impression that this study is meant to show that the method can give some sort of result for selected data, rather than to fit real data: "If I take just the data points which fit the model, my model works."
What's the improvement over the last Df study you did? Again, I think you should concentrate on finding a way to incorporate Opus, rather than fine tune the method for results which are in good agreement anyway (probably because they use more similar technology, as opposed to Opus).

Scientifically, the most interesting data are those which are vastly different from the average, but can still not be attributed to errors in experimental methodology. The robustness/generality of a hypothesis or method is tested best against these outliers, not the average data. So if we assume ABX testing is useful and works, and Opus is a "real" data point, the focus should be to find a way to incorporate this result. It's not unlikely that the other data points will "fall into place" in the end, anyway.
It's only audiophile if it's inconvenient.

Objective difference measurements to predict listening test results?

Reply #7
  • What's the motivation behind doing a quadratic/high order polynomial fit? Like Kamedo2 pointed out, e.g. a quadratic fit doesn't make sense. It just fits really nicely (quite unsurprisingly) to some of the data points.
  • Like last time, the most interesting data points are left out because they are outliers. The method which is supposed to motivate their removal is just a roundabout way of saying "these data points are removed, because they look so different to all my other points", which is the characteristic of an outlier.
  • I get the impression that this study is meant to show that the method can give some sort of result for selected data, rather than to fit real data: "If I take just the data points which fit the model, my model works."
What's the improvement over the last Df study you did? Again, I think you should concentrate on finding a way to incorporate Opus, rather than fine tune the method for results which are in good agreement anyway (probably because they use more similar technology, as opposed to Opus).

Scientifically, the most interesting data are those which are vastly different from the average, but can still not be attributed to errors in experimental methodology. The robustness/generality of a hypothesis or method is tested best against these outliers, not the average data. So if we assume ABX testing is useful and works, and Opus is a "real" data point, the focus should be to find a way to incorporate this result. It's not unlikely that the other data points will "fall into place" in the end, anyway.


(1) The quadratic polynomial just improves the errors slightly; a linear polynomial can safely be used instead.
(2+) This new method, at this stage of development at least, doesn't predict subjective scores for "unusual" Df sequences, and it has a mechanism to discover such "unusualness" and exclude these sequences before the modeling process. In order to incorporate such unusual/outlying points into the model, some listening tests would have to be conducted first. Then subjective scores for other Df sequences from this new group could be predicted with a known error. If the Kamedo2 listening test had included other codecs with Df sequences similar to Opus, another model could have been built for that group. Thus, the method just helps to reduce the number of listening tests (3-4 tests per group); it does not substitute for them, which will hardly ever be possible.
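
For illustration, here is one possible way to group codecs by the similarity of their Df sequences; the article does not prescribe this exact algorithm, and the Euclidean distance, average linkage and cut-off threshold are merely assumptions of this sketch.

Code:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def group_df_sequences(df_matrix, codec_names, max_distance):
    """Group codecs whose Df sequences look similar (illustration only).

    df_matrix    : array of shape (n_codecs, n_items), one Df value per codec per test item
    max_distance : cut-off distance below which codecs fall into the same group
    """
    # Hierarchical clustering on pairwise distances between Df sequences.
    tree = linkage(np.asarray(df_matrix, dtype=float), method="average", metric="euclidean")
    labels = fcluster(tree, t=max_distance, criterion="distance")

    groups = {}
    for name, label in zip(codec_names, labels):
        groups.setdefault(label, []).append(name)
    return list(groups.values())
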
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #8
The linear approximation

Objective difference measurements to predict listening test results?

Reply #9
From the post on SoundExpert.org: "As the decoded files have different sampling rates (32, 44.1, 48 kHz) they all are up-sampled to 96 kHz before Df calculations."

This up-sampling may cause additional errors due to the clipping of inter-sample overs. We have observed a high incidence of inter-sample overs at the output of MP3 decoders. These overs will clip the DSP in the up-sampler and produce a number of intermod products. These could result in artificially high error levels.

This inter-sample clipping problem can be avoided by reducing the amplitude by at least 3.1 dB prior to up-sampling. We generally use 3.5 dB in our processing as this is convenient. Obviously you would also need to reduce the reference track by the same amount before running the DF calculations.
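
A minimal sketch of the suggested pre-attenuation, assuming the tracks are already loaded as floating-point arrays and that the pipeline would otherwise clip at full scale; the helper names and the use of scipy's resample_poly are illustrative only.

Code:
import numpy as np
from math import gcd
from scipy.signal import resample_poly

HEADROOM_DB = 3.5  # attenuation suggested above (at least 3.1 dB)

def attenuate_db(signal, db):
    """Scale a floating-point signal down by `db` decibels."""
    return signal * 10.0 ** (-db / 20.0)

def upsample_with_headroom(reference, decoded, fs_in, fs_out=96000):
    """Attenuate both tracks identically, then up-sample, so inter-sample
    overs cannot clip and create intermodulation products."""
    ref = attenuate_db(np.asarray(reference, dtype=np.float64), HEADROOM_DB)
    dec = attenuate_db(np.asarray(decoded, dtype=np.float64), HEADROOM_DB)
    g = gcd(int(fs_out), int(fs_in))
    return (resample_poly(ref, fs_out // g, fs_in // g),
            resample_poly(dec, fs_out // g, fs_in // g))
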
John Siau
Vice President
Benchmark Media Systems, Inc.

Objective difference measurements to predict listening test results?

Reply #10
The linear approximation

I get slightly different values for the linear approximation (post #4); can you check once again?

This up-sampling may cause additional errors due to the clipping of inter-sample overs.

Thanks for your note. To avoid this, I decoded the reference tracks to 32-bit, and all other operations, including up-sampling, were performed in floating-point arithmetic.
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #11
The problem with Df, as mentioned in the other thread, is still the same. It's a measure of waveform (dis)similarity that may or may not relate to audibility.

One way I could see it working is interpolating between data points of the same codec at different bitrates, given that the codec does not somehow use different compression methods for those bitrates. That way you'd only measure the extent of one form of "degradation" of the waveform.
"I hear it when I see it."

Objective difference measurements to predict listening test results?

Reply #12
The problem with Df, as mentioned in the other thread, is still the same. It's a measure of waveform (dis)similarity that may or may not relate to audibility.

And one of the main goals of my research is to address this problem. The preliminary solution is (1) to group similar "degradations" by means of cluster analysis, (2) to build an approximation model and (3) to find the error of the model. Then you can predict, with a known error, the subjective scores of any "degradations" falling into that group.

One way I could see it working is interpolating between data points of the same codec at different bitrates, given that the codec does not somehow use different compression methods for those bitrates. That way you'd only measure the extent of one form of "degradation" of the waveform.

If cluster analysis shows that a different quality setting (and its corresponding bitrate) falls into the same group, then Df can be used for prediction.
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #13
The problem with Df, as mentioned in the other thread, is still the same. It's a measure of waveform (dis)similarity that may or may not relate to audibility.


Totally agreed.

If I sent $5 to everybody who has advised DF's proponents of this obviously incomplete thinking, it would probably be their only temporal reward, because the very point that DF as stated is inherently and grievously incomplete, and doomed to failure until a reasonably complete perceptual model is added, seems to have completely flown over the OP's head.

In terms of difficulty, the critical success factor for the basic task is the perceptual model. The numerical refinements are far less important or problematical.

It would appear that we are dealing with a one-note carillon: numerical, numerical, numerical. That doesn't change the fact that the most important problem is one of data modelling and collection.

Most of the so-called refinements to DF attempt to address numerical nits, and overlook the typically far larger inherent failings due to the lack of an adequate perceptual model.

Objective difference measurements to predict listening test results?

Reply #14
... the critical success factor for the basic task is the perceptual model.


I agree. Without a perceptual model, the numbers have little value other than identifying sources that are identical.

The perceptual model must determine the audibility of the difference signal while it is masked by the main signal.
John Siau
Vice President
Benchmark Media Systems, Inc.

Objective difference measurements to predict listening test results?

Reply #15
A perceptual model is included in the method in its most native form - a set of listening tests. Without them the method doesn't work.
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #16
That raises the question of whether a method that can only reliably tell you what is already known is of any use at all. Of course this assumes you can make it work with the new generation of encoders, because you haven't even gotten that far.

Perpetual Ketchup Machine: Dead on Arrival.

Objective difference measurements to predict listening test results?

Reply #17
A perceptual model is included in the method in its most native form - a set of listening tests. Without them the method doesn't work.


The inadequacies of listening tests for evaluating audio gear have been well known for about 90 years.

Time to hit the history books!

Objective difference measurements to predict listening test results?

Reply #18
A perceptual model is included in the method in its most native form - a set of listening tests. Without them the method doesn't work.
Your model does not, and cannot, include the test data itself, because the aim of your model is to predict test results. You obviously only test your model against the data, but that's it. If your model only predicted results after they've been measured, it'd serve no purpose at all, as greynol already pointed out.
It's only audiophile if it's inconvenient.

Objective difference measurements to predict listening test results?

Reply #19
The inadequacies of listening tests for evaluating audio gear have been well known for about 90 years.

Inadequacies to what?

A perceptual model is included in the method in its most native form - a set of listening tests. Without them the method doesn't work.
Your model does not, and cannot, include the test data itself, because the aim of your model is to predict test results. You obviously only test your model against the data, but that's it. If your model only predicted results after they've been measured, it'd serve no purpose at all, as greynol already pointed out.

The aim of the model is to predict listening test results on the basis of other listening test results; the model is built on the latter. The resulting Df-SQ curve, which approximates the relationship between objective measurements and subjective scores, is in fact a psychometric function that reflects characteristics of auditory perception (of Kamedo2 in our case). In this sense it can be considered a perception model. This model can be used even after the listening test has finished for predicting subjective scores of other codecs (assuming they belong to the same Df cluster). As finding codecs with similar Df sequences is vital for the accuracy of the model, the article was expanded with some details of the cluster analysis.

In order to test the predictive potential of the model there are two choices: (1) to conduct an additional listening test with the same conditions and new codecs, or (2) to compute/predict already existing subjective scores one by one, using the other subjective scores as a basis. For example, if we have 4 scores, we can compute each score using the other 3 as a basis and compare the real and computed ("missing") ones. Obviously, case (2) is much easier and sometimes the only possible solution. Results of such an experiment with "missing" values are in Table 7 of the article.
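
A minimal sketch of this leave-one-out procedure, assuming a linear Df-to-score fit; the function below is illustrative, not the exact computation behind Table 7.

Code:
import numpy as np

def leave_one_out_errors(df_values, scores):
    """Predict each subjective score from the remaining ones and return
    the relative prediction errors, as in option (2) above."""
    df_values = np.asarray(df_values, dtype=float)
    scores = np.asarray(scores, dtype=float)
    errors = []
    for i in range(len(scores)):
        mask = np.arange(len(scores)) != i              # hold out point i
        coeffs = np.polyfit(df_values[mask], scores[mask], deg=1)
        predicted = np.polyval(coeffs, df_values[i])
        errors.append((predicted - scores[i]) / scores[i])
    return np.array(errors)                             # one relative error per point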

Only one listening test case has been examined so far (thanks to Kamedo2). This is certainly not enough. If I have enough resources (the idea of Arnold B. Krueger to send $5 was well timed), I will examine another 5-7 listening test cases to find out whether the model works reliably. Suggestions of already completed listening tests would be very welcome. The only strict condition for such tests is that the codecs (probably old versions) must still be available somewhere. Availability of the original sound material is a plus. The more codecs tested, the better. And of course, the test must have been properly conducted.
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #20
In order to test the predictive potential of the model there are two choices: (1) to conduct an additional listening test with the same conditions and new codecs, or (2) to compute/predict already existing subjective scores one by one, using the other subjective scores as a basis. For example, if we have 4 scores, we can compute each score using the other 3 as a basis and compare the real and computed ("missing") ones. Obviously, case (2) is much easier and sometimes the only possible solution. Results of such an experiment with "missing" values are in Table 7 of the article.

For (2), I have to say I am not satisfied with the current low number of degrees of freedom. If you draw a line from only 4 points, only 2 degrees of freedom are left. Removing one further point results in 1 degree of freedom. It means the prediction will be very unreliable.

Objective difference measurements to predict listening test results?

Reply #21
For (2), I have to say I am not satisfied with the current low number of degrees of freedom. If you draw a line from only 4 points, only 2 degrees of freedom are left. Removing one further point results in 1 degree of freedom. It means the prediction will be very unreliable.

In our particular case we already have 4 predictions and can roughly assess their reliability (Table 7): the max error is 5.86% and the RMS error is 0.10. With only 3 points as a basis for each prediction (and one of them - lame - is not quite suitable according to the cluster analysis), this is not a bad result. Thus, we have an instrument for assessing the reliability of predicted scores, and we can research how this reliability depends on, for example, the number of points or the average distance between Df sequences. For that purpose, other listening tests should be examined (with more codecs tested).
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #22
4 predictions?  More like 1.  You're effectively holding AAC,  hitting MP3 sloppy and missing OPUS.

Creature of habit.

Objective difference measurements to predict listening test results?

Reply #23
4 predictions?  More like 1.  You're effectively holding AAC,  hitting MP3 sloppy and missing OPUS.

There are 4 of them (Table 7). At the moment they are of little practical use to the general public; they are here to show the method of error assessment for the model/predictions, the method that will be used during the research.
keeping audio clear together - soundexpert.org

Objective difference measurements to predict listening test results?

Reply #24
4 predictions?  More like 1.  You're effectively holding AAC,  hitting MP3 sloppy and missing OPUS.

There are 4 of them (Table 7). At the moment they are of little practical use to the general public; they are here to show the method of error assessment for the model/predictions, the method that will be used during the research.

That's way more words than needed to admit the fact that you're talking about 3 minor variants of AAC, not 3 unique predictions, no?

Creature of habit.