HydrogenAudio

Hydrogenaudio Forum => Scientific Discussion => Topic started by: Serge Smirnoff on 2010-11-24 12:27:35

Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-24 12:27:35
I found this thread (http://www.hydrogenaudio.org/forums/index.php?showtopic=82292) among SoundExpert referrals and was a bit surprised by the almost complete misunderstanding of SE testing methodology, particularly of how the diff signal is used in SE audio quality metrics. The discussion of the topic from 2006 (http://www.hydrogenaudio.org/forums/index.php?showtopic=50548) actually seems more meaningful. So I decided to post some SE basics here for reference purposes. I will use a thought experiment, though it is close to reality.

Suppose we have two sound signals – the main one and the side one. They could be, for example, a short piano passage and some noise. We can prepare several mixes of them in different proportions.

After normalization all mixes have equal levels, and we can evaluate the perceptibility of the side signal in each mix. Here at SE we found that this perceptibility is a monotonic function of the side signal level and looks like this:

Figure: Side signal perception (http://soundexpert.org/image/image_gallery?uuid=6ddb409c-2096-407f-9125-e6d06fd36686&groupId=10136&t=1350692359358)

(1) In other words, there is a relationship between the objectively measured level of the side signal and its subjectively estimated perceptibility in the mix. And what is more:
[blockquote](a) this relationship is well described by a second-order curve (assuming levels are in dB);
(b) the relationship holds for any sound signals, whether they are correlated or not; only the position and curvature of the curve differ.[/blockquote]
(2) These side-stimulus perceptibility curves are the core of the SE rating mechanism. Each device under test has its own curve, plotted on the basis of SE online listening tests.
(3) Side signals are the difference signals of the devices being tested. Levels of side signals are expressed in dB via the Difference Level parameter, which in our case is exactly equal to the RMS level of the side signal.
(4) Subjective grades of perceptibility are the anchor points of the 5-grade impairment scale.
(5) The audio metric beyond the threshold of audibility is determined by extrapolation of those second-order curves. Virtual grades in the extrapolated area can be considered objective quality parameters that account for the peculiarities of human hearing.

So, yes, the difference signal is used in SE testing. We take into account both its level and how the human auditory system perceives it together with the reference signal. Some difference signals with fairly high levels still remain almost imperceptible against the background of the reference signal, and vice versa; the perceptibility curves reflect this.
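
For illustration, here is a rough Matlab sketch of that mechanism (all signals and grades are made up, and referencing Diff.Level to the RMS of the excerpt is an assumption of the sketch, not necessarily the exact SE definition):
Code: [Select]
% Rough sketch (hypothetical data): Diff.Level and the second-order curve
ref  = randn(1, 44100);                    % stand-in for the reference excerpt
test = ref + 0.01*randn(1, 44100);         % stand-in for the device's output
side = test - ref;                         % difference (side) signal
diff_level = 20*log10(sqrt(mean(side.^2)) / sqrt(mean(ref.^2)))   % Diff.Level, dB

% listening-test points: the side signal amplified to clearly audible levels
lvl   = [-10 -15 -20 -25];                 % Diff.Level of the graded mixes, dB
grade = [2.1 3.0 3.8 4.4];                 % mean grades on the 5-grade scale

c = polyfit(lvl, grade, 2);                % the second-order perceptibility curve
virtual_grade = polyval(c, diff_level)     % extrapolated grade of the device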

This is the concept. Many parts of it still need thorough verification in carefully designed listening tests, which are beyond SE's capabilities. All we can do is analyze the grades collected from SE visitors. This will be done for sure, and yet it can't replace properly organized listening tests.

SE testing methodology is new and questionable, but all assumptions look reasonable and SE ratings – promising, at least to me. Time will tell.
Title: SoundExpert explained
Post by: drewfx on 2010-11-24 17:20:39
What is the justification for the "dashed" portion of the curve?

Shouldn't it be a flat line once you reach "imperceptible"? If not, once something is imperceptible, how can it become "more imperceptible"?
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-24 19:00:26
What is the justification for the "dashed" portion of the curve?

Shouldn't it be a flat line once you reach "imperceptible"? If not, once something is imperceptible, how can it become "more imperceptible"?

The overarching goal is to create an audio metric that can assess the quality margin of devices and technologies while taking human auditory characteristics into account. THD+N assesses margin, but without regard to the latter. Quality margin exists objectively; we need an instrument for measuring it. By extrapolating those psychometric curves we create such an instrument. The dashed line could be first-order or a more complex curve; this is purely a convention. It seems to me that the most natural and simplest way to prolong the curve is to extrapolate it. Without that dashed section, assessment of quality beyond perception is just impossible.
Title: SoundExpert explained
Post by: drewfx on 2010-11-24 19:24:40
Without that dashed section, assessment of quality beyond perception is just impossible.


Exactly! Which is as it should be - there is no change in "quality" beyond the point of perception, unless you're defining "quality" to mean something imperceptible.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-24 20:49:11
Exactly! Which is as it should be - there is no change in "quality" beyond the point of perception, unless you're defining "quality" to mean something imperceptible.

There are subjective and objective quality parameters. Subjective grades can be obtained only in listening tests. They evaluate perceived audio quality most accurately but are helpless for assessing quality margins: an encoder at 300 kbit/s and one at 400 kbit/s are the same to them, as are an ADC with -90 dB THD+N and one with -123 dB THD+N, and a lot of other audio equipment and technologies. Objective quality parameters like THD, IMD, frequency response etc. cover many aspects of quality margin estimation but correlate poorly with subjective parameters because they do not take human hearing into account. Also, subjective and objective parameters “live” in separate worlds; you can't “translate” one into the other.

I am sure it is possible to create a quality parameter that can assess quality margin with regard to human hearing. And this is the goal of SE efforts. We propose such a parameter, which combines objective measurement with evaluation by humans.

You can call this quality parameter whatever you like; I prefer simply “quality rating”. It's a combination of subjective and objective parameters.
Title: SoundExpert explained
Post by: drewfx on 2010-11-24 21:17:13
Just to be clear - I am not necessarily questioning your goals here.

The problem I see is that when you extrapolate to infinity a curve that attempts to quantify human perception, you are also implicitly asserting that human perception itself extends infinitely. You are redefining "imperceptible" to mean "less perceptible". You need the curve (and math) to match the realities of human perception, or else any conclusions you attempt to draw from the extrapolations are essentially meaningless. What's the point of trying to create an objective measure that only applies to a hypothetical world?

I repeat my original assertion - the curve should be a flat line when it reaches the point labeled "imperceptible".
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-24 23:24:33
If you want to build a human-hearing-oriented audio metric for the area beyond the perception point (p-point), you will need some psychometric relationship in that area, which is impossible to obtain by definition – you can't research perception beyond the p-point (*). So any relationship in that area will be artificial or hypothetical. The task is to find the hypothetical relationship that serves the purpose best. In SE metrics it is the extrapolation of the real psychometric curve. In other words, the dashed line is what we actually need for our metric, and the only purpose of the real part of this curve is to be a basis for plotting that dashed line.

So, by extrapolating the curve I assume that the relationship revealed by the real part of the curve holds beyond the p-point. You can't prove or disprove this assumption directly because of (*), but it can be done indirectly by comparing SE quality ratings with the results of traditional listening tests on audio material with very small impairments.
Title: SoundExpert explained
Post by: alexeysp on 2010-11-25 10:35:23
If you want to build a human-hearing-oriented audio metric for the area beyond the perception point (p-point), you will need some psychometric relationship in that area, which is impossible to obtain by definition – you can't research perception beyond the p-point.


> human-hearing-oriented
> beyond perception point

Does not compute.


You are talking about "quality margin", but there is no such thing as absolute quality. Quality is, essentially, a measure of fitness for a particular purpose. That is, the notion of quality is always related to a particular application, or a defined set of applications.

So, what kind of application does your extrapolated curve relate to?

If the purpose is simply to compare two codecs or devices with respect to the perceived quality of audio reproduction, then the part of the curve below the "imperceptible" threshold should be sufficient. What purpose does the extrapolated part serve then?
Title: SoundExpert explained
Post by: 2Bdecided on 2010-11-25 11:30:03
Just to be clear, your graph example shows grades where the default noise level (0dB) is quite objectionable, and reducing the noise makes it less and less so - correct?

But with codec testing, you do kind of the opposite. The default noise level (0dB) is usually indistinguishable/transparent, or very nearly so, and to build the "worse quality" part of the curve (the part where people can hear the noise), you have to amplify the coding noise - correct?


People in this thread are saying the scale beyond "imperceptible" makes no sense. I'm not sure if that's true or not. What you're "measuring" (I put that in quotes - see later) is how far the coding noise sits below the threshold of audibility. (or above, if it's audible at the default level). If the second-order curve theory holds true, then to do this you only need sufficient points on the curve where the difference is audible. Points on the curve where the difference is inaudible don't help because it does become a flat line there.


There are several accepted ways to judge the threshold of audibility. I used this one...
Quote
Each masking threshold was determined by a 3-interval, forced choice task, using a one up two down transformed staircase tracking method. This procedure yields the threshold at which the listener will detect the target 70.7% of the time [Levitt, 1971]. The process is as follows.
For each individual measurement, the subject is played three stimuli, denoted A, B, and C. Two presentations consist of the masker only, whilst the third consists of the masker and target. The order of presentation is randomised, and the subject is required to identify the odd-one-out, thus determining whether A, B, or C contains the target. The subject is required to choose one of the three presentations in order to continue with the test, even if this choice is pure guesswork, hence the title “forced choice task.” If the subject fails to identify the target signal, the amplitude of the target is raised by 1 dB for the next presentation. If the subject correctly identifies the target signal twice in succession, then the amplitude of the target is reduced by 1 dB for the next presentation. Hence the amplitude of the target should oscillate about the threshold of detection, as shown in Figure 6.5. In practice, mistakes and lucky guesses by the listener typically cause the amplitude of the target to vary over a greater range than that shown. A reversal (denoted by an asterisk in Figure 6.5) indicates the first incorrect identification following a series of successes (upper asterisks), or the first pair of correct identifications following a series of failures (lower asterisks). The amplitudes at which these reversals occur are averaged to give the final masked threshold. An even number of reversals must be averaged, since an odd number would cause a +ve or –ve bias. Throughout these tests, the final six (out of eight) reversals were averaged to calculate each masked threshold.
The initial amplitude of the target is set such that it should be easily audible. Before the first reversal, whenever the subject correctly identifies the target twice, the amplitude is reduced by 6 dB. After the first reversal, whenever the subject fails to identify the target, the amplitude is increased by 4 dB. After the second reversal, whenever the subject correctly identifies the target twice, the amplitude is reduced by 2 dB. After the third reversal, the amplitude is always changed by 1 dB, and the following six reversals are averaged to calculate each masked threshold. This procedure allows the target amplitude to rapidly approach the masked threshold, and then finely track it. If the target amplitude were changed in 1 dB steps initially, then the descent to the masked threshold would take considerably longer, and add greatly to listener fatigue. In the case where the listener fails to identify the target initially, then the target amplitude is increased by 6 dB for each failed identification, up to the maximum allowed by the replay system (90 dB peak SPL at the listener’s head).

This is normally used for simple noise masking tone experiments. It seems to work OK with coding noise, but repetition of a moment of coded audio over and over again is quite mind numbing and makes people listen in a very different way to normal music listening. Whether it pushes their thresholds up or down I don't know. Quite a fascinating subject IMO!
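
For illustration, a toy Matlab simulation of the quoted one-up two-down staircase (an idealized listener with an assumed cumulative-Gaussian detection curve, and fixed 1 dB steps instead of the 6/4/2/1 dB schedule):
Code: [Select]
% Toy simulation of the one-up two-down staircase (3AFC), idealized listener
true_thresh = -20; spread = 3;             % assumed "true" threshold and spread, dB
pcorrect = @(L) 1/3 + (2/3)*0.5*(1 + erf((L - true_thresh)/(spread*sqrt(2))));

level = 0; step = 1;                       % start well above threshold, 1 dB steps
ncorrect = 0; lastdir = 0; reversals = [];
while numel(reversals) < 8
    if rand() < pcorrect(level)            % listener picks the odd one out
        ncorrect = ncorrect + 1;
        if ncorrect == 2                   % two correct in a row -> lower the target
            ncorrect = 0;
            if lastdir == 1, reversals(end+1) = level; end
            level = level - step; lastdir = -1;
        end
    else                                   % a miss -> raise the target
        ncorrect = 0;
        if lastdir == -1, reversals(end+1) = level; end
        level = level + step; lastdir = 1;
    end
end
threshold_estimate = mean(reversals(3:8))  % average of the final six reversals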


It seems to me that your method is far kinder to listeners. If your second order curve fitting can be justified, then it's a really neat way of finding the threshold of audibility (the crossover from 5.0 "imperceptible" to 4.9 "just perceptible but not annoying" on the usual scale) without even having to test at that (difficult) level.



So far so good. What I'm less convinced of is the implication that a given codec has so much "headroom", and that this is a "good thing".

e.g. on the range of content tested, at a given bitrate/setting, a given codec might be transparent even with the noise elevated by 12dB. It scores well in your test. Fair enough. IMO it would be wrong to draw too much from this conclusion. e.g.
1. It's tempting to think this means it's suitable for transcoding, but it might not be - it might fall apart when transcoded.
2. It's tempting to think this means that audible artefacts will be rarer (and/or less bad) with this codec than with one where the noise becomes audible when elevated by 3dB, but this might be very wrong - this wonderful codec which keeps coding noise 12dB below the threshold of audibility on the content tested might fall apart horribly on some piece of content that hasn't been tested.


I'm sure you know all this! I'm just thinking aloud.

Anyway, I find it fascinating. Thanks for the explanation.

Cheers,
David.
Title: SoundExpert explained
Post by: knutinh on 2010-11-25 18:15:43
I repeat my original assertion - the curve should be a flat line when it reaches the point labeled "imperceptible".

How wide is a Gaussian distribution?

If an encoder produces an audible flaw for 1/1000 people, for 1/1000 source materials, 1/1000 of the time, that is a lot of tests to sort through blindly in order to find that one audible corner-case. And you never know if it is there until you find it.

If (and that is a big if) subjective scores can be modelled as simple functions, then one could do simple, small-scale listening tests designed to extract those parameters, instead of determining the absolute threshold of audibility. If this extrapolation is sane (I have no idea if it is), then one could predict the outcome of exhaustive, expensive listening experiments from small ones, and say something clever about the likelihood of a given flaw ever being detected, right?
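
A rough back-of-the-envelope for that corner-case argument (numbers purely illustrative):
Code: [Select]
% If a flaw is audible for 1/1000 listeners on 1/1000 samples 1/1000 of the time,
% even a million blind trials will almost certainly miss it
p = (1/1000)^3;            % chance that one random trial exposes the flaw
N = 1e6;                   % number of blind trials
p_hit = 1 - (1 - p)^N      % about 0.001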

-k
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-25 18:33:13
> human-hearing-oriented
> beyond perception point

Does not compute.

The p-point differs from person to person and depends on training. It is determined conventionally, by fixing the measurement procedure. The area beyond the p-point is not a completely deaf one. Back to your contradiction: “human-hearing-oriented audio metrics for the area beyond perception point” simply means that such a metric should evaluate audio quality as well as it would be evaluated by golden ears in perfectly designed listening tests.

You are talking about "quality margin", but there is no such thing as absolute quality. Quality is, essentially, a measure of fitness for a particular purpose. That is, the notion of quality is always related to a particular application, or a defined set of applications.

So, what kind of application does your extrapolated curve relate to?

If the purpose is simply to compare two codecs or devices with respect to the perceived quality of audio reproduction, then the part of the curve below the "imperceptible" threshold should be sufficient. What purpose does the extrapolated part serve then?

Any applications with small impairments, which are difficult (or expensive) to evaluate in standard listening tests: amplifiers with high THD values; noise-shaping, pitch-shifting and other sound processing algorithms; high-bitrate encoders …
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-25 18:49:48
If this extrapolation is sane (I have no idea if it is), then one could predict the outcome of exhaustive, expensive listening experiments from small ones, and say something clever about the likelihood of a given flaw ever being detected, right?

This is exactly what this new audio metric was designed for.
Title: SoundExpert explained
Post by: Kees de Visser on 2010-11-25 20:39:48
In the recently closed thread which the OP referred to I mentioned SoundExpert, and greynol replied:
This has been discussed on the forum on more than one occasion. While Serge may take his method seriously, HA does not.
If you're reading this, greynol, would you be so kind as to summarize the (HA) objections?
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-25 22:50:52
Just to be clear, your graph example shows grades where the default noise level (0dB) is quite objectionable, and reducing the noise makes it less and less so - correct?

But with codec testing, you do kind of the opposite. The default noise level (0dB) is usually indistinguishable/transparent, or very nearly so, and to build the "worse quality" part of the curve (the part where people can hear the noise), you have to amplify the coding noise - correct?

You're right in principle, but the real figures are slightly different because for quantitative estimation of the differences introduced by a device under test we use the Diff.Level parameter. The Diff.Level scale is shifted by 3 dB relative to the one on the graph (0dB on the graph = -3dB on the Diff.Level scale).

Yes, in order to build the "worse quality" part, the diff signal is amplified. For high-bitrate encoders this part usually occupies the range between -10dB and -30dB, depending on the test sample and the particular encoder. Low-bitrate encoders are tested as is, without building the curves.
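
Roughly, the stimuli for that part of the curve could be built like this (a sketch with made-up signals and gains; the exact SE amplification and normalization procedure may differ):
Code: [Select]
% Sketch: building the "worse quality" stimuli by amplifying the diff signal
ref   = randn(1, 44100);               % stand-in for the reference excerpt
coded = ref + 0.001*randn(1, 44100);   % stand-in for the decoded codec output
side  = coded - ref;                   % difference signal
gains_db = [10 15 20 25];              % example amplifications of the diff signal, dB
for k = 1:numel(gains_db)
    stim = ref + 10^(gains_db(k)/20) * side;   % version with unmasked artefacts
    stim = stim / max(abs(stim));              % normalize before presentation
    % each stim is graded in a listening test; the grades define the curve
end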

People in this thread are saying the scale beyond "imperceptible" makes no sense. I'm not sure if that's true or not. What you're "measuring" (I put that in quotes - see later) is how far the coding noise sits below the threshold of audibility. (or above, if it's audible at the default level). If the second-order curve theory holds true, then to do this you only need sufficient points on the curve where the difference is audible. Points on the curve where the difference is inaudible don't help because it does become a flat line there.

It seems you missed my point here, or I missed yours. We "measure" exactly how far the coding noise sits above the threshold of audibility on the subjective (vertical) scale. We can measure the amount of that coding noise with Diff.Level, but in order to map it onto the subjective scale we need some curve above the threshold.

There are several accepted ways to judge the threshold of audibility. I used this one...
[...]
It seems to me that your method is far kinder to listeners. If your second order curve fitting can be justified, then it's a really neat way of finding the threshold of audibility (the crossover from 5.0 "imperceptible" to 4.9 "just perceptible but not annoying" on the usual scale) without even having to test at that (difficult) level.

Yes, the method could be used for that purpose (if it holds). Multiple iterations around the target threshold would be replaced with several easy listening tests, needed for building the "worse quality" part of the curve. Then you just need to extend it a little bit.

So far so good. What I'm less convinced of is the implication that a given codec has so much "headroom", and that this is a "good thing".

e.g. on the range of content tested, at a given bitrate/setting, a given codec might be transparent even with the noise elevated by 12dB. It scores well in your test. Fair enough. IMO it would be wrong to draw too much from this conclusion. e.g.
1. It's tempting to think this means it's suitable for transcoding, but it might not be - it might fall apart when transcoded.
2. It's tempting to think this means that audible artefacts will be rarer (and/or less bad) with this codec than with one where the noise becomes audible when elevated by 3dB, but this might be very wrong - this wonderful codec which keeps coding noise 12dB below the threshold of audibility on the content tested might fall apart horribly on some piece of content that hasn't been tested.

1. Hard to say. It's not only the noise headroom that matters, but also how this headroom maps onto the vertical scale, and that depends on the curve. In any case, a codec with a greater margin should be more suitable for transcoding than one with a lower margin (I do hope so).
2. I think this applies to normal listening tests as well.
Title: SoundExpert explained
Post by: Woodinville on 2010-11-26 07:25:40
SE testing methodology is new and questionable, but all assumptions look reasonable and SE ratings – promising, at least to me. Time will tell.


Seeing as all this will be entirely dependent on the short-term spectrum of both signal and interferer, I wonder how you can develop any "metric" that is not specifically designed for one track, or one short bit of music.

In your example, I see no accounting for spectra, which is a key factor for the human auditory system.

If we're talking "one kind of instrument music" vs. "white noise" we have nothing useful at hand. So what is  your point?
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-26 15:25:17
Seeing as all this will be entirely dependent on the short-term spectrum of both signal and interferer, I wonder how you can develop any "metric" that is not specifically designed for one track, or one short bit of music.

The metric works as long as you can measure Diff.Level (always) and estimate the annoyance of the diff signal in some sound excerpt (not always; for long excerpts the term "basic audio quality" could be inapplicable). In short, if listening tests are valid for the excerpt, the metric is valid too.

In your example, I see no accounting for spectra, which is a key factor for the human auditory system.

If it is a key factor, the human auditory system will account for it during the listening tests, which are an integral part of the metric.
Title: SoundExpert explained
Post by: Woodinville on 2010-11-27 06:17:47
Seeing as all this will be entirely dependent on the short-term spectrum of both signal and interferer, I wonder how you can develop any "metric" that is not specifically designed for one track, or one short bit of music.

The metric works as long as you can measure Diff.Level (always) and estimate the annoyance of the diff signal in some sound excerpt (not always; for long excerpts the term "basic audio quality" could be inapplicable). In short, if listening tests are valid for the excerpt, the metric is valid too.


Um, I don't think so. I can measure a difference level that is exactly the same, i.e. the same exact SNR, and have enormously different perceived quality.

See "13 dB miracle", please.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-27 07:29:30
Um, I don't think so. I can measure a difference level that is exactly the same, i.e. the same exact SNR, and have enormously different perceived quality.

See "13 dB miracle", please.

Exactly, the "different perceived quality" will be revealed during listening tests, and this will be reflected by the psychometric curve above. So the same Diff.Level will be mapped to different points on the subjective scale because the curves differ.

The design of this metric was heavily inspired by the "13 dB miracle". It was clear that an audio quality metric can't rely only upon objective measurements (like THD, SNR...); those objective parameters have to be corrected (weighted) by some psychometric relationships – in our case, by the side signal perception curves.
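
As a toy illustration of that mapping (coefficients invented for the example):
Code: [Select]
% The same Diff.Level maps to different grades because the curves differ
curveA = [-0.0017 -0.18 1.0];          % second-order curve of a well-masked codec
curveB = [-0.0007 -0.12 1.0];          % curve of a codec with more annoying artefacts
diff_level = -40;                      % identical objective Diff.Level, dB
gradeA = polyval(curveA, diff_level)   % about 5.5 - already past "imperceptible"
gradeB = polyval(curveB, diff_level)   % about 4.7 - still audible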
Title: SoundExpert explained
Post by: Porcus on 2010-11-27 14:49:29
What is the justification for the "dashed" portion of the curve?

Shouldn't it be a flat line once you reach "imperceptible"? If not, once something is imperceptible, how can it become "more imperceptible"?



Matter of definition, interpretation and use.

1) Consider three chess games which are all "theoretically lost". One is a simple mate in one, the second is so hard that if you put 1000 chess players at the task, you won't be able to distinguish it from the starting position by statistical analysis of the outcome. And the third is so hard that it won't be solved in fifty years. To make the logic clear-cut, assume that the second is like the third, except with 70 intermediary "only moves" (which do not constitute any learning curve for the subsequent ones).

Now, everything else equal, you will still have a clear strict preference, because you could risk meeting one of the very few chess players that can actually win this. You might not know that it is "humanly winnable", but you will absolutely want to insure against the uncertainty if it is free.

Now consider a step-by-step sequence of chess positions, starting from the "third" one above. We index them by "# of very hard moves until the win is clear, as measured by statistics within confidence level [say, p]". How do you define the human-winnability threshold?


2) Consider a 32-bit sound file, then a 31-bit (LSB-truncated) file, etc. Rank these. You may claim that every file above a "hearing threshold" of slightly below T bits is equivalent. However, what if it is an unfinished product? Are you sure that the final mix is going to have the same hearing threshold? If not, then the high-resolution file could very well be more robust -- there might be manipulations which would enable you to hear a difference between the final mix and its T-bit version, although not between the original and its T-bit version. Most 16-bit CDs are mixed at higher word length, right?
Solution? A "robustness-to-manipulations" measure?


Of course:
- if no such issues apply, then assigning zero value to superfluous information is at least as good a measure as any other
- if anyone makes a selling claim, then they have the burden of proof. Then "inaudible difference" is the null hypothesis. You would grab the extra measured quality if it is free, as an insurance against audibility, but you would frown upon someone trying to sell you an insurance against a disaster which no one has ever substantiated has ever happened or could ever happen. (... well ...: http://en.wikipedia.org/wiki/Alien_abduction_insurance (http://en.wikipedia.org/wiki/Alien_abduction_insurance) )
- even if we assume that there is some worth to this not-justified-as-generally-audible quality, it is hard to quantify. Justifying that it exists (by measurement) does not mean we can justify a reasonably narrow confidence interval for a particular point on the graph.
Title: SoundExpert explained
Post by: Woodinville on 2010-11-27 22:05:48
Exactly, the "different perceived quality" will be revealed during listening tests, and this will be reflected by the psychometric curve above. So the same Diff.Level will be mapped to different points on the subjective scale because the curves differ.


So, then, this curve of yours is only useful to compare like to like. This is, simply put, not very useful.  I don't get your point here.
Title: SoundExpert explained
Post by: knutinh on 2010-11-28 18:24:04
So, then, this curve of yours is only useful to compare like to like. This is, simply put, not very useful.  I don't get your point here.

I understood the author as wanting to be able to do limited, inexpensive tests on exaggerated errors, and then use this method to extend those results to smaller errors that would normally need large, expensive listening tests.

If this works and can be verified, then it sounds like a good thing.

-k
Title: SoundExpert explained
Post by: greynol on 2010-11-28 19:14:37
That's a mighty big if.

For years people have requested verification and none has been forthcoming.
Title: SoundExpert explained
Post by: Kees de Visser on 2010-11-28 20:35:11
The technique isn't new, according to this AES paper from 1997: Measuring the Coding Margin of Perceptual Codecs with the Difference Signal (http://www.aes.org/e-lib/browse.cfm?elib=7362)
Quote
Inaudible impairments or impairments near the threshold of audibility require a new method to assess the quality. A variable amplification of the impairments to provide the detection can be realized with the help of the difference signal. In a large listening test, the coding margin for 14 test items was measured. A time varying filter bank to modify the difference signal and to enhance the listening conditions is described.

NB: Just after posting I found an old HA post from Serge (http://www.hydrogenaudio.org/forums/index.php?showtopic=50548&view=findpost&p=458641). Apparently it's his own paper, although the paper states "Feiten, Bernhard" as the author from Deutsche Telekom.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-28 21:47:08
NB: Just after posting I found an old HA post from Serge (http://www.hydrogenaudio.org/forums/index.php?showtopic=50548&view=findpost&p=458641). Apparently it's his own paper, although the paper states "Feiten, Bernhard" as the author from Deutsche Telekom.

The author of the paper is Bernhard Feiten, for sure. He is in the references both in my own paper and on the SE site (http://soundexpert.org/authors). The SE metric could be considered a further development of his approach.
Title: SoundExpert explained
Post by: 2Bdecided on 2010-11-29 10:49:50
That's a mighty big if.

For years people have requested verification and none has been forthcoming.
I think something like it is justified.

I think it's commonly accepted* that signal detection (e.g. artefact detection in these tests) is a psychometric function (http://en.wikipedia.org/wiki/Psychometric_function) - an S-curve, generated by integrating a Gaussian distribution...

(http://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/320px-Logistic-curve.svg.png)

The x-axis is level, and the y-axis is the chance of detecting the artefact.

If you know the function takes this shape, then it's apparent that you don't need to test at the threshold. You can test at several levels somewhat above threshold, and fit the resulting data to this graph/shape, thus giving you the actual threshold value.

The major problem with this is that, if you are testing only a long way above threshold, then very minor errors in the data will give huge errors in the threshold estimate because the fit to the graph could be wildly wrong.
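
A small Matlab sketch of that fitting idea, with made-up detection data (using a cumulative Gaussian and a plain least-squares fit; real procedures differ):
Code: [Select]
% Fit a cumulative Gaussian to detection data gathered above threshold
lvl  = [-6 -3 0 3 6];                  % artefact level, dB re nominal
pdet = [0.55 0.72 0.86 0.95 0.99];     % observed detection probabilities (made up)
cgauss = @(p, x) 0.5*(1 + erf((x - p(1))./(p(2)*sqrt(2))));   % p = [mu sigma]
p = fminsearch(@(q) sum((cgauss(q, lvl) - pdet).^2), [0 3]);
threshold = p(1)                       % level of 50% detection on the fitted S-curve
% when all data sit far above threshold, small errors in pdet move this a lot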


Now, Sound Expert isn't doing this - it's at least one step away from it, testing at levels where people can hear the artefact all the time, and asking them how bad it sounds.

As you say, we have no proof that a graph of these results can be extrapolated back to find the threshold.


An obvious criticism is that two different kinds of artefacts, 12dB above threshold, might give very different results - i.e. one might be far more annoying than the other. But that's not necessarily a failing - it just means the curve might be steeper for one than the other - which would become apparent with more points on the curve (at 6dB and 18dB, for example), so it could be accounted for by the method.


It would be interesting to try to prove/disprove all this. A good starting point might be to take one of the archived listening tests from HA with known results, and use exactly the same samples on SoundExpert. The results should speak for themselves.

Cheers,
David.
Title: SoundExpert explained
Post by: Porcus on 2010-11-29 12:00:40
I think it's commonly accepted* that signal detection (e.g. artefact detection in these tests) is a psychometric function (http://en.wikipedia.org/wiki/Psychometric_function) - an S-curve, generated by integrating a Gaussian distribution...


Think you thought of a footnote text corresponding to that asterisk?

Anyway, the "signoid" curve need not be the Gaussian cumulative distribution, though that is one common choice; another is the logistic distribution (more: http://en.wikipedia.org/wiki/Link_function#Link_function (http://en.wikipedia.org/wiki/Link_function#Link_function) ).  I'd guess a "signoid" in this context would mean any positive smooth strictly increasing convex-and-then-concave function symmetric about (0,1/2), i.e., corresponding to a unimodal symmetric distribution, absolutely continuous and of full support.

Each of these choices will constitute parametric models, meaning that you make the assumption that a certain (parametric) family of functions will be a good fit to reality. Then you fit the parameters to find the best-fit-within-the-family. Then if a model fits well from level A to level B (where all your observations are), then it is common practice to infer that it should perform acceptably  at least from somewhere below A to somewhere above B as well. How far you can extrapolate, does of course depend on circumstances.


Now I think -- though this is outside my field of expertise -- that the choice of link function is more crucial for wider extrapolations. Again, this depends a bit on circumstances; for example, in an ABX listening test, the interesting issue is whether you guess better than 50%, while in diagnosis of rare diseases -- or default of sovereign bonds -- you are already in the tail of the distribution.
Title: SoundExpert explained
Post by: 2Bdecided on 2010-11-29 15:27:33
I think it's commonly accepted* that signal detection (e.g. artefact detection in these tests) is a psychometric function (http://en.wikipedia.org/wiki/Psychometric_function) - an S-curve, generated by integrating a Gaussian distribution...


Think you thought of a footnote text corresponding to that asterisk?
Yes! Must have deleted it by mistake...

King-Smith, P. E., & Rose, D. (1997). Principles of an adaptive method for measuring the slope of the psychometric function. Vision Research, 37(12), 1595-1604. [PubMed]

...though I've lost my copy of the article - I cited it a decade ago so I must have thought it made sense back then.

Quote
Anyway, the "signoid" curve need not be the Gaussian cumulative distribution, though that is one common choice; another is the logistic distribution (more: http://en.wikipedia.org/wiki/Link_function#Link_function (http://en.wikipedia.org/wiki/Link_function#Link_function) ).  I'd guess a "signoid" in this context would mean any positive smooth strictly increasing convex-and-then-concave function symmetric about (0,1/2), i.e., corresponding to a unimodal symmetric distribution, absolutely continuous and of full support.

Each of these choices will constitute parametric models, meaning that you make the assumption that a certain (parametric) family of functions will be a good fit to reality. Then you fit the parameters to find the best-fit-within-the-family. Then if a model fits well from level A to level B (where all your observations are), then it is common practice to infer that it should perform acceptably  at least from somewhere below A to somewhere above B as well. How far you can extrapolate, does of course depend on circumstances.

Now I think -- though this is outside my field of expertise -- that choice of link function is more crucial for wider extrapolations.
Yep, I agree with all of that. I think the "slightly" different curve shapes don't matter as much as you might expect in practice here since the psychometric data is likely to be rather rough anyway. If you get data on the steep part of the curve, you can probably do quite well even if you're not sure of the shape. If you get data on the shallow part of the curve, you're in more trouble if you don't know the exact shape, but you were already way off anyway.


Measured psychoacoustic thresholds are often 70% (because the procedure is often the one that I quoted on the previous page).

The 50% point on the S-curve doesn't correspond to the "getting better than 50% means it's not just chance" in ABX. In ABX, if you can't hear a thing, you'll (on average) score 50%. That's way off to the left on the S-curve. I guess that 50% on the S-curve gives a 75% score on ABX (???), which can give a very low p (depending on the number of trials).
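
A quick sanity check of that guess: if the listener truly detects the difference on a fraction d of trials and guesses the rest, the expected ABX score is d + (1 - d)/2:
Code: [Select]
d = 0.5;                      % 50% true-detection point on the S-curve
abx_score = d + (1 - d)/2     % expected ABX proportion correct = 0.75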

Cheers,
David.
Title: SoundExpert explained
Post by: Porcus on 2010-11-29 15:47:24
[Heavily edited]

The 50% point on the S-curve doesn't correspond to the "getting better than 50% means it's not just chance" in ABX. In ABX, if you can't hear a thing, you'll (on average) score 50%.


Well, as I have access to that reference of yours, I looked it up. A brief review after only one reading:

- they are using a logistic model which runs not from 0 to 1, but from false positive rate to 1 minus false negative rate. Detection threshold is set at 50% chance of being detected. This is not the same as coinflipping (they use a "Zippy Estimation by Sequential Testing" method, referencing one of King--Smith's earlier works).

- Experiment: One "more detectable" (stronger light in their experiment, could be "more distorted" in ours) signal A and one "less detectable" signal B are displayed, order randomized (subject knows it is either AB or BA, with 50/50 chance). Subject asked to identify. Difference between A and B in their case is difference in log luminosity. I.e., they have one explanatory variable. In assessing psychoacoustic lossy encoding, you are rather interested in how to minimize audibility of reduction down to given filesize, but that is another issue: here we are assuming that this job is done.



As for your "75%", it may -- or not, I have not checked the ZEST reference of theirs -- refer to Gini coefficient vs. area under ROC curve. The Gini coefficient measures Pr[subject's stated ordering matches true ordering]. AUROC measures Pr[subject's stated ordering matches true ordering] - Pr[subject's stated ordering does not match true ordering] (I have assumed no ties here). Coinflipping (works here, as they are randomized at probability 50/50) yields Gini coefficients of .5 respective 0. In some applications, one targets a value for this statistic, in others one targets a significance level for better-than-coinflipping. Basically, it is Mann--Whitney's U and its properties.
Title: SoundExpert explained
Post by: drewfx on 2010-11-29 17:43:56
What is the justification for the "dashed" portion of the curve?

Shouldn't it be a flat line once you reach "imperceptible"? If not, once something is imperceptible, how can it become "more imperceptible"?



Matter of definition, interpretation and use.

1) Consider three chess games which are all "theoretically lost". One is a simple mate in one, the second is so hard that if you put 1000 chess players at the task, you won't be able to distinguish it from the starting position by statistical analysis of the outcome. And the third is so hard that it won't be solved in fifty years. To make the logic clear-cut, assume that the second is like the third, except with 70 intermediary "only moves" (which do not constitute any learning curve for the subsequent ones).

Now, everything else equal, you will still have a clear strict preference, because you could risk meeting one of the very few chess players that can actually win this. You might not know that it is "humanly winnable", but you will absolutely want to insure against the uncertainty if it is free.

Now consider a step-by-step sequence of chess positions, starting from the "third" one above. We index them by "# of very hard moves until the win is clear, as measured by statistics within confidence level [say, p]". How do you define the human-winnability threshold?


2) Consider a 32-bit sound file, then a 31-bit (LSB-truncated) file, etc. Rank these. You may claim that every file above a "hearing threshold" of slightly below T bits is equivalent. However, what if it is an unfinished product? Are you sure that the final mix is going to have the same hearing threshold? If not, then the high-resolution file could very well be more robust -- there might be manipulations which would enable you to hear a difference between the final mix and its T-bit version, although not between the original and its T-bit version. Most 16-bit CDs are mixed at higher word length, right?
Solution? A "robustness-to-manipulations" measure?


I would certainly agree it is fair to allow for a reasonable margin of error near the threshold of perception.

Quote
- if anyone makes a selling claim, then they have the burden of proof. Then "inaudible difference" is the null hypothesis.


And this really was my concern - if you have a "quality factor" metric that seems to imply one product is "better" than another based on the extrapolated portion of the curve, it is ripe for someone to misuse. For this reason, I think the information on the threshold of perception needs to be preserved.
Title: SoundExpert explained
Post by: greynol on 2010-11-29 18:18:56
And this really was my concern - if you have a "quality factor" metric that seems to imply one product is "better" than another based on the extrapolated portion of the curve, it is ripe for someone to misuse. For this reason, I think the information on the threshold of perception needs to be preserved.

Precisely (and on this forum, SE results do get misused)!

There has been a lot of talk about psychometrics, but little to none about psychoacoustics.  When it comes to perceptual coding, it is the latter that is king.

Someone, anyone, provide some data showing a direct correlation between across-the-board "artifacts" amplification and the real-world application of lossy audio compression.  I've seen claims that SE results are good for those interested in applications such as surround-sound processing, transcoding and equalization.  Evidence, please!!!

NB: the word artifacts was put in silly quotes for a reason.  We already had the discussion about what constitutes artifacts and the role masking plays.  I am not denying that they can become unmasked through typical real-world usage, but I am denying that across-the-board amplification of a difference signal that is subsequently added back in constitutes real-world usage.

AFAICT none of the criticisms put forth by people like Garf, Sebastian, Woodinville and Saratoga have been sufficiently addressed since they've been raised.  It seems we've made no progress over the last four years.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-29 19:21:13
Someone, anyone, provide some data showing a direct correlation between across-the-board "artifacts" amplification and the real-world application of lossy audio compression.  I've seen claims that SE results are good for those interested in applications such as surround-sound processing, transcoding and equalization.  Evidence, please!!!

I can well agree that a parameter such as quality margin might not be very useful in the practice of lossy codec usage. The metric was developed for assessing a wider class of low impairments. Eventually it could be a substitute for (a further development of) the current audio metrics based on THD, SNR, IMD ... parameters. As opposed to the current metrics, the new one has to be sensitive to the psychoacoustic features of human hearing. That's why lossy coders are perfect for a test drive of the metric. Also, they produce time-accurate output, and the diff signal is easy to extract.

So I prefer to separate the questions:
Title: SoundExpert explained
Post by: greynol on 2010-11-29 19:27:02
If we aren't going to consider real-world usage of perceptual audio coding then why does one need margin?

What role does an indirect method of measuring artifact detection play when there is direct method of measurement available, especially when the direct method can be applied to real-world usage?

I think these are poignant questions when considering whether SE results are to be used as an acceptable means of support for judging sound quality on this forum.

I found this thread (http://www.hydrogenaudio.org/forums/index.php?showtopic=82292) among SoundExpert referrals and was a bit surprised by the almost complete misunderstanding of SE testing methodology, particularly of how the diff signal is used in SE audio quality metrics.
I feel this needs to be addressed.  The thread in question has nothing to do with SE.  When SE was raised I don't believe there was any misunderstanding.  Speaking only for myself, Serge, I believe I do understand your testing methodology.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-29 19:36:59
What role does an indirect method of measuring artifact detection play when there is direct method of measurement available, especially when the direct method can be applied to real-world usage?

The lower the impairments, the more expensive and less reliable the results of listening tests. This is a problem.
Title: SoundExpert explained
Post by: greynol on 2010-11-29 19:45:34
Breaking masking by amplifying a difference signal by a fixed, arbitrary amount will not guarantee real-world performance.  This is a problem.

Individual ABX testing has always taken precedence over group testing on this forum.  If someone feels that he or she is (or has become) more sensitive to artifacts then a new test can always be performed.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-29 20:19:40
Breaking masking by amplifying a difference signal by a fixed, arbitrary amount will not guarantee real-world performance.  This is a problem.

This is not a problem, this is an opportunity.
Title: SoundExpert explained
Post by: greynol on 2010-11-29 20:31:52
How so?
Title: SoundExpert explained
Post by: SebastianG on 2010-11-29 21:04:30
[...]
So, yes, the difference signal is used in SE testing.
[...]
This is the concept.
[...]
SE testing methodology is new and questionable,

Yes, very questionable. I said this 4 years ago and I'm still saying this now.

but all assumptions look reasonable and SE ratings

Not really. It's not hard to imagine the possibility of signal pairs (main,side) where you can't hear any difference between main and main+side but you can easily hear a difference between main and main+0.5*side. Hint: phase is a bitch. ;-) Your implicit assumption is that both signals are independent. But this is not necessarily the case with perceptual audio coders. Take for example the MPEG4 tool called PNS (perceptual noise substitution). It just replaces some high frequency noise with synthetically generated noise of the same level. This is done by transmitting the noise level only. Obviously, we can use this tool in cases when the main perceptual feature is the energy level and anything else is not important. Then, we have the following properties: Noise level of original matches the noise level of the encoded result, so energy(main) = energy(main+side). Probability theory tells us that main and main+side are orthogonal. This implies a coherence between main and side of 0.7 -- ZERO POINT SEVEN. Hardly independent. This also implies that a 50/50 mix -- main+0.5*side -- would lose 3dB power. You can easily compute this via
Code: [Select]
main = [1 0];                    % original signal (two-sample toy example)
side = [0 1] - main;             % decoded output is [0 1]; side = decoded - original
20*log10(norm(main+0.5*side))    % level of the 50/50 mix: about -3 dB

(Matlab code)

So, by attenuating the sample-by-sample difference we actually amplify the perceived difference (since we lose power) in this case! What does that tell us? It tells us that you overrate sample-by-sample differences. Perceptual audio coders try to retain certain things so it sounds similar and tolerate other losses. And you're focussing on the "other losses" (as well). What you're doing is basically violating some of a perceptual encoder's principles (like keeping energy levels similar no matter how large the sample-by-sample difference will be). By amplifying the difference you could destroy some signal properties the encoder and our HAS care about much more than you do. Sound perception is not as simple as you want us to believe. Sample-by-sample differences are not important. And "extrapolating artefacts" this way is nothing but a big waste of time. Even testing with "attenuated artefacts" doesn't tell you anything. Your methodology breaks down because you're assuming that the difference is independent from the original. It is not.

Cheers!
SG
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-29 21:10:08
How so?

Because controlled breaking of masking may turn out to be a powerful instrument for audio research and a basis for an audio metric.

It just has to be proved or disproved. IMHO this can't be done in discussions.
Title: SoundExpert explained
Post by: Woodinville on 2010-11-29 21:14:02
Using a difference signal as a signal-detection test probe (using variable gain) is very seriously broken.

Consider. Most codecs remove lots of high frequencies. If you add, say, twice the difference signal back to the original, YOU ADD THE PROPER HF ENERGY BACK.

This leads, and yes, I have confirmed it with a simple, unpublished experiment, to "interesting" results in terms of perceived quality.

Title: SoundExpert explained
Post by: Porcus on 2010-11-29 22:00:02
Using a difference signal as a signal-detection test probe (using variable gain) is very seriously broken.

Consider. Most codecs remove lots of high frequencies. If you add, say, twice the difference signal back to the original, YOU ADD THE PROPER HF ENERGY BACK.


Do they "remove" the high frequencies, or do they remove the information -- i.e., replace it with something which may or may not have the same energy, but has less information content? If you take (1) a sawtooth signal, (2) dither it down to 24 bits, (3) mp3-encode -- what will each signal's Fourier coefficients look like?

(Not a rhetorical question. I don't know.)
Title: SoundExpert explained
Post by: Kees de Visser on 2010-11-29 22:21:31
Breaking masking by amplifying a difference signal by fixed arbitrary amount will not guarantee real-world performance.
A lot of radio stations use an Orban processor to juice up the signal (EQ, multi-band compression, limiting). The Orban is pretty good at breaking masking. Lossy audio that might pass a transparency ABX test can become quite bad after "Orbanisation". If your real world is rather predictable (e.g. personal use on a known system), then ABX is probably fine.
If the SoundExpert test is flawed, what kind of stress test would be able to reveal sub-threshold differences that ABX can't?
Title: SoundExpert explained
Post by: Woodinville on 2010-11-29 23:26:53
Using a difference signal as a signal-detection test probe (using variable gain) is very seriously broken.

Consider. Most codecs remove lots of high frequencies. If you add, say, twice the difference signal back to the original, YOU ADD THE PROPER HF ENERGY BACK.


Do they "remove" the high frequencies, or do they remove the information -- i.e., replace it with something which may or may not have the same energy, but has less information content? If you take (1) a sawtooth signal, (2) dither it down to 24 bits, (3) mp3-encode -- what will each signal's Fourier coefficients look like?

(Not a rhetorical question. I don't know.)


It depends on the codec. MP3 and AAC (no plus) simply remove the high frequencies. The "plus" series adds in non-signal that sounds kinda-sorta, maybe "ok".
Title: SoundExpert explained
Post by: greynol on 2010-11-30 07:19:41
If the SoundExpert test is flawed, what kind of stress test would be able to reveal sub-threshold differences that ABX can't?

You already gave the answer: "Orbanisation".  If you find some means of testing that provides a direct correlation (perhaps SE will do it, though I'm not holding my breath) then great, you have an alternative; otherwise there will not be another equitable substitute for comparing lossless->orbanisation and lossless->lossy->orbanisation, where you choose the encoder, settings and samples.  ABX or ABC(/HR) will always play a role.  If the differences are "sub-threshold" then for the sake of perceived audio quality there is no difference.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-30 08:09:07
It's not hard to imagine the possibility of signal pairs (main,side) where you can't hear any difference between main and main+side but you can easily hear a difference between main and main+0.5*side.

In practice - never. In all cases the perception of gradually unmasked artifacts is a monotonic function. That was also confirmed by B. Feiten in the already mentioned "Measuring the Coding Margin of Perceptual Codecs with the Difference Signal" (AES Preprint # 4417). This is the main point of the SE metric that was stated in the first post (above the graph). Once again - not a single case where the curve was not monotonic, and numerous cases of monotonic behavior. So I treat this as a fact.

Hint: phase is a bitch. ;-) Your implicit assumption is that both signals are independent. But this is not necessarily the case with perceptual audio coders. Take for example the MPEG4 tool called PNS (perceptual noise substitution). It just replaces some high frequency noise with synthetically generated noise of the same level. This is done by transmitting the noise level only. Obviously, we can use this tool in cases when the main perceptual feature is the energy level and anything else is not important. Then, we have the following properties: Noise level of original matches the noise level of the encoded result, so energy(main) = energy(main+side). Probability theory tells us that main and main+side are orthogonal. This implies a coherence between main and side of 0.7 -- ZERO POINT SEVEN. Hardly independent. This also implies that a 50/50 mix -- main+0.5*side -- would lose 3dB power. You can easily compute this via
Code: [Select]
main = [1 0];
side = [0 1] - main;
20*log10(norm(main+0.5*side))

(Matlab code)

So, by attenuating the sample-by-sample difference we actually amplify the perceived difference (since we lose power) in this case! What does that tell us? It tells us that you overrate sample-by-sample differences. Perceptual audio coders try to retain certain things so it sounds similar and tolerate other losses. And you're focussing on the "other losses" (as well). What you're doing is basically violating some of a perceptual encoder's principles (like keeping energy levels similar no matter how large the sample-by-sample difference will be). By amplifying the difference you could destroy some signal properties the encoder and our HAS care about much more than you do. Sound perception is not as simple as you want us to believe. Sample-by-sample differences are not important. And "extrapolating artefacts" this way is nothing but a big waste of time. Even testing with "attenuated artefacts" doesn't tell you anything. Your methodology breaks down because you're assuming that the difference is independent from the original. It is not.

I didn't make such an assumption – quite the opposite; see (1b) in the first post. Nevertheless, the case you describe is really interesting. Exaggerated and simplified a bit, it looks like the following:

We have a sound excerpt with a time interval (between tonal parts) that consists purely of, say, white noise. We also have a coder that simply substitutes that noise with uncorrelated noise whenever it detects that there are no tonal parts during the interval. The diff signal will then contain an amplified noise portion (being uncorrelated, the two noises add rather than subtract). So the version of our excerpt with amplified differences will have a stronger noise part, which can be detected in listening tests, while in practice this is not important to the HAS.
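
A quick numerical check of that case (uncorrelated noises of equal level):
Code: [Select]
% Uncorrelated substitution: the diff carries 3 dB more power than the noise itself
n    = 1e5;
orig = randn(1, n);                    % noise in the original excerpt
repl = randn(1, n);                    % uncorrelated substitute from the coder
side = repl - orig;                    % difference signal over that interval
20*log10(sqrt(mean(side.^2)) / sqrt(mean(orig.^2)))   % about +3 dB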

Is this the case you wanted to produce? If so, I will examine it more carefully. It is really interesting, as it helps to determine the limits of the metric.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-30 08:20:28
Consider. Most codecs remove lots of high frequencies. If you add, say, twice the difference signal back to the original, YOU ADD THE PROPER HF ENERGY BACK.

We filter out such frequencies from the resulting test signals and mentioned that in the Diff. Level paper. It is a known problem and we addressed it from the beginning. 
Title: SoundExpert explained
Post by: knutinh on 2010-11-30 08:53:51
I think it's commonly accepted* that signal detection (e.g. artefact detection in these tests) is a psychometric function (http://en.wikipedia.org/wiki/Psychometric_function) - an S-curve, generated by integrating a Gaussian distribution...


Did you have a footnote in mind for that asterisk?
Yes! Must have deleted it by mistake...

King-Smith, P. E., & Rose, D. (1997). Principles of an adaptive method for measuring the slope of the psychometric function. Vision Research, 37(12), 1595-1604. [PubMed]

...though I've lost my copy of the article - I cited it a decade ago so I must have thought it made sense back then.

Quote
Anyway, the "sigmoid" curve need not be the Gaussian cumulative distribution, though that is one common choice; another is the logistic distribution (more: http://en.wikipedia.org/wiki/Link_function#Link_function (http://en.wikipedia.org/wiki/Link_function#Link_function) ).  I'd guess a "sigmoid" in this context would mean any positive smooth strictly increasing convex-and-then-concave function symmetric about (0,1/2), i.e., corresponding to a unimodal symmetric distribution, absolutely continuous and of full support.

Each of these choices constitutes a parametric model, meaning that you make the assumption that a certain (parametric) family of functions will be a good fit to reality. Then you fit the parameters to find the best fit within the family. Then if a model fits well from level A to level B (where all your observations are), it is common practice to infer that it should perform acceptably at least from somewhere below A to somewhere above B as well. How far you can extrapolate does, of course, depend on circumstances.

Now I think -- though this is outside my field of expertise -- that the choice of link function is more crucial for wider extrapolations.
Yep, I agree with all of that. I think the "slightly" different curve shapes don't matter as much as you might expect in practice here since the psychometric data is likely to be rather rough anyway. If you get data on the steep part of the curve, you can probably do quite well even if you're not sure of the shape. If you get data on the shallow part of the curve, you're in more trouble if you don't know the exact shape, but you were already way off anyway.
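As a concrete illustration of that fit-then-extrapolate step, here is a minimal Matlab/Octave sketch: made-up detection rates at a few difference levels are fitted with a logistic psychometric function, and the fitted curve is then evaluated outside the measured range. The numbers are invented, and the logistic is only one possible link function.

Code: [Select]
x = [-30 -24 -18 -12 -6];                         % difference levels in dB (made up)
p = [0.05 0.15 0.45 0.80 0.97];                   % observed detection rates (made up)
logistic = @(b,xx) 1./(1+exp(-(xx-b(1))/b(2)));   % b(1) = midpoint, b(2) = spread
sse = @(b) sum((logistic(b,x)-p).^2);             % least-squares objective
b = fminsearch(sse, [-18 4]);                     % fit the two parameters
logistic(b, -36)                                  % extrapolated detection rate at -36 dB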


Measured psychoacoustic thresholds are often 70% (because the procedure is often the one that I quoted on the previous page).

The 50% point on the S-curve doesn't correspond to the "getting better than 50% means it's not just chance" in ABX. In ABX, if you can't hear a thing, you'll (on average) score 50%. That's way off to the left on the S-curve. I guess that 50% on the S-curve gives a 75% score on ABX (???), which can give a very low p (depending on the number of trials).
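A quick numeric sketch of that guess (illustrative numbers only, not from the post): if the listener truly detects the artefact on half the trials and guesses on the rest, the expected ABX score is 0.5 + 0.5*0.5 = 75%, and 12 of 16 correct already gives a one-sided binomial p of roughly 0.04.

Code: [Select]
p_detect  = 0.5;                                  % 50% point on the psychometric curve
p_correct = p_detect + (1 - p_detect)*0.5         % expected ABX score: 0.75
n = 16; k = round(p_correct*n);                   % e.g. 12 of 16 trials correct
p_value = sum(arrayfun(@(i) nchoosek(n,i), k:n)) * 0.5^n   % about 0.038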

Cheers,
David.

Why are such tests not used more with monotonically degrading stuff like lossy encoders? I have made a simple matlab-script for adaptively "honing in" on the most interesting part of the degradation, but reading a paper about the statistics in such tests made me remember how much I have forgotten from my statistics classes.

-k
Title: SoundExpert explained
Post by: Porcus on 2010-11-30 10:28:54
Why are such tests not used more with monotonically degrading stuff like lossy encoders? I have made a simple matlab-script for adaptively "honing in" on the most interesting part of the degradation, but reading a paper about the statistics in such tests made me remember how much I have forgotten from my statistics classes.


Because audiophiles tend to shun science? Ooops, did I even say that?
Title: SoundExpert explained
Post by: knutinh on 2010-11-30 10:34:21
Why are such tests not used more with monotonically degrading stuff like lossy encoders? I have made a simple matlab-script for adaptively "honing in" on the most interesting part of the degradation, but reading a paper about the statistics in such tests made me remember how much I have forgotten from my statistics classes.


Because audiophiles tend to shun science? Ooops, did I even say that?

That may be the reason why so few listening tests are done in general, but not why this particular type is so seldom used.

Estimating "just the right" bitrate for a given mp3 encoder seems to be a common issue, and using an adaptive test that spends most of the trials near where the point of interest along the "PMF" turns out to be seems sensible.
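A minimal sketch of what such an adaptive test could look like (a toy illustration, not knutinh's actual script; Matlab/Octave with an invented listener model): a 2-down/1-up staircase raises the artefact level after a miss and lowers it after two hits, so trials pile up near the ~71% detection point.

Code: [Select]
true_mu = -18; true_s = 3;                          % hidden psychometric parameters (made up)
p_detect = @(x) 1./(1+exp(-(x-true_mu)/true_s));    % simulated listener
level = -6; step = 2; hits = 0; history = zeros(1,40);
for trial = 1:40
    history(trial) = level;
    if rand < p_detect(level)                       % simulated detection at this level
        hits = hits + 1;
        if hits == 2, level = level - step; hits = 0; end   % two hits -> make it harder
    else
        level = level + step; hits = 0;             % one miss -> make it easier
    end
end
mean(history(21:end))                               % rough threshold estimate, in dB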

-k
Title: SoundExpert explained
Post by: Porcus on 2010-11-30 11:03:55
Joking aside: I'd be surprised if MPEG didn't do things like this during development. Anyone know?
Title: SoundExpert explained
Post by: 2Bdecided on 2010-11-30 15:24:45
In all cases perception of gradually unmasked artifacts is a monotonic function.
How can you say this when SebG and Woodinville both gave you examples to the contrary?

I hit the exact problem Woodinville describes using the method I posted on the first page of this thread - a listener gets stuck in a "false" minimum of audibility, because doubling the difference gives you the original signal back (with the part "removed" by the codec inverted in phase, but that difference is not usually audible). Hardly monotonic - the chance of hearing the artefact becomes zero at a single gain setting (+6dB), and (with the specific audio I used - YMMV!) leaps back to the "expected" function very quickly either side of that.
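To make that concrete, here is a toy construction (not David's actual clip; Matlab/Octave): a "codec" that simply drops a quiet 15 kHz tone. The RMS difference keeps growing with the gain applied to the difference signal, yet at gain 2 the tone is back at its original level (phase-inverted), so the test signal is spectrally identical to the reference again and the audible difference can vanish - a false minimum.

Code: [Select]
fs = 44100; t = (0:fs-1)'/fs;
ref   = sin(2*pi*440*t) + 0.01*sin(2*pi*15000*t);   % main tone + quiet HF component
coded = sin(2*pi*440*t);                            % "codec" output: HF component removed
d = coded - ref;                                    % difference signal
for k = [0.5 1 2 4]
    test = ref + k*d;                               % artefact scaled by k
    hf = 2*abs(sum(test.*exp(-2i*pi*15000*t)))/numel(t);   % 15 kHz amplitude in the test signal
    fprintf('k=%.1f  diff RMS %7.5f  HF amplitude %.4f (reference has 0.0100)\n', ...
            k, norm(k*d)/sqrt(numel(t)), hf);
end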

Cheers,
David.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-11-30 16:38:05
How can you say this when SebG and Woodinville both gave you examples to the contrary?

I hit the exact problem Woodinville describes using the method I posted on the first page of this thread - a listener gets stuck in a "false" minimum of audibility, because doubling the difference gives you the original signal back (with the part "removed" by the codec inverted in phase, but that difference is not usually audible). Hardly monotonic - the chance of hearing the artefact becomes zero at a single gain setting (+6dB), and (with the specific audio I used - YMMV!) leaps back to the "expected" function very quickly either side of that.

In many papers devoted to "coding margin" a special filtering is recommended to eliminate those "ghost" frequencies. We also use it.
Title: SoundExpert explained
Post by: Woodinville on 2010-12-01 02:11:37
How can you say this when SebG and Woodinville both gave you examples to the contrary?

I hit the exact problem Woodinville describes using the method I posted on the first page of this thread - a listener gets stuck in a "false" minimum of audibility, because doubling the difference gives you the original signal back (with the part "removed" by the codec inverted in phase, but that difference is not usually audible). Hardly monotonic - the chance of hearing the artefact becomes zero at a single gain setting (+6dB), and (with the specific audio I used - YMMV!) leaps back to the "expected" function very quickly either side of that.

In many papers devoted to "coding margin" a special filtering is recommended to eliminate those "ghost" frequencies. We also use it.



How do you know what "it" is? You have to work specifically to every bit rate, every bandwidth, every sampling rate, every different encoder?

This is not useful.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-12-01 08:17:21

In many papers devoted to "coding margin" a special filtering is recommended to eliminate those "ghost" frequencies. We also use it.



How do you know what "it" is? You have to work specifically to every bit rate, every bandwidth, every sampling rate, every different encoder?

This is not useful.

Subtracting a portion of the reference signal from the output one, it's not hard to figure out which frequencies are "ghosted" and remove them with an FIR filter. So, yes, we do it for every test sample with amplified artifacts. This helps to get smoother perception curves. Every item tested at SE has its own unique curve plotted from the results of SE listening tests. Extrapolating that curve, we get the resulting quality rating for each tested item.
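For illustration only, here is a rough sketch of that kind of filtering as described above (a guess at the idea, not SE's actual processing; Matlab with the Signal Processing Toolbox for fir1; file names and the amplification factor are hypothetical, and the files are assumed time-aligned and equally long):

Code: [Select]
[ref, fs] = audioread('reference.wav');            % hypothetical file names
[out, ~ ] = audioread('coded.wav');
k = 4;                                             % artefact amplification factor
test = ref(:,1) + k*(out(:,1) - ref(:,1));         % test signal with amplified artefacts
P = abs(fft(out(:,1))).^2;                         % crude spectrum of the coded signal
f = (0:length(P)-1)'*fs/length(P);
cutoff = max(f(f < fs/2 & P > max(P)*1e-6));       % highest frequency with real energy in the coded file
h = fir1(510, cutoff/(fs/2));                      % FIR low-pass at that cutoff
test_filtered = filter(h, 1, test);                % "ghost" HF content above the codec bandwidth removed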
Title: SoundExpert explained
Post by: 2Bdecided on 2010-12-01 15:26:50
I can see how this could work for a simple low pass filter, but not how it could work for SBR.

With SBR, there's nothing you can usefully present to a listener that's "just like the coded version, but with the faults a bit louder" or "just like the coded version, but with the faults a bit quieter".

It's like me singing the same song twice. You can't figure out how close the two different versions are by subtracting them or amplifying differences. Subjectively (if I was a very consistent singer) the two versions could sound basically identical, but mathematically every sample would be very different, and I can't see how what you propose could work. SBR isn't so different from this example!

Cheers,
David.
Title: SoundExpert explained
Post by: Woodinville on 2010-12-01 21:03:21

In many papers devoted to "coding margin" a special filtering is recommended to eliminate those "ghost" frequencies. We also use it.



How do you know what "it" is? You have to work specifically to every bit rate, every bandwidth, every sampling rate, every different encoder?

This is not useful.

Subtracting a portion of the reference signal from the output one, it's not hard to figure out which frequencies are "ghosted" and remove them with an FIR filter. So, yes, we do it for every test sample with amplified artifacts. This helps to get smoother perception curves. Every item tested at SE has its own unique curve plotted from the results of SE listening tests. Extrapolating that curve, we get the resulting quality rating for each tested item.


So, it's "by clip". This still seems useless.
Title: SoundExpert explained
Post by: Kees de Visser on 2010-12-01 22:47:58
This still seems useless.
So which options are available to reveal sub-threshold differences in a listening test ?
Title: SoundExpert explained
Post by: Woodinville on 2010-12-01 22:55:23
This still seems useless.
So which options are available to reveal sub-threshold differences in a listening test ?


This leads to a very simple question: What does "sub-threshold differences in a listening test" mean?

Therein lies, perhaps, the underlying philosophical problem here.
Title: SoundExpert explained
Post by: greynol on 2010-12-02 05:47:24
sub-threshold differences

Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-12-02 07:53:23
This leads to a very simple question: What does "sub-threshold differences in a listening test" mean?

Therein lies, perhaps, the underlying philosophical problem here.

No, the question stands without the "in a listening test" part. What do sub-threshold differences mean?
It is probably something that distinguishes, say, aac@192 from aac@256.

I'm not sure about philosophical, but a problem of definitions certainly exists in this case.

EDIT: ... or maybe the contradiction between objective and subjective plays some role here ...
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-12-02 08:01:52
So, it's "by clip". This still seems useless.

Yes, by clip, like in ordinary listening tests.
Title: SoundExpert explained
Post by: Kees de Visser on 2010-12-02 08:35:00
This leads to a very simple question: What does "sub-threshold differences in a listening test" mean?
Differences that can be proven to exist with technical means, but are undetectable with a standard listening test.

Let me try this analogy:
Someone has to leave the next day on a 6-month boat trip. He has to prepare canned food and can choose between two unlabeled lots that look identical. Someone told him that the lots have different "best before" dates: one expires in 1 month, the other in 10 months. He tastes a bit from each, but both taste absolutely identical. He knows that best before dates don't mean that the food will be bad the day after, but his chances to survive the trip are probably bigger when he picks the fresher one.
(btw, the boat is too small to take both)
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-12-02 08:41:31
With SBR, there's nothing you can usefully present to a listener that's "just like the coded version, but with the faults a bit louder" or "just like the coded version, but with the faults a bit quieter".

Why not? If there is a difference from the main signal, then there is something to present. The main question is how well such differences represent the drawbacks that really matter to the HAS. Probably there are some psychoacoustic tricks which are badly covered by the metric. Then the usual question is: to what extent will such cases affect the final rating? All metrics have their limits.
It's like me singing the same song twice. You can't figure out how close the two different versions are by subtracting them or amplifying differences. Subjectively (if I was a very consistent singer) the two versions could sound basically identical, but mathematically every sample would be very different, and I can't see how what you propose could work. SBR isn't so different from this example!

Why not as well?
Title: SoundExpert explained
Post by: greynol on 2010-12-02 09:34:32
Differences that can be proven to exist with technical means, but are undetectable with a standard listening test.

...and the intended purpose of transparent perceptual compression is to satisfy the latter.  The rest is little more than mental masturbation.
Title: SoundExpert explained
Post by: 2Bdecided on 2010-12-02 10:25:15
This leads to a very simple question: What does "sub-threshold differences in a listening test" mean?
Differences that can be proven to exist with technical means, but are undetectable with a standard listening test.

Let me try this analogy:
Someone has to leave the next day on a 6-month boat trip. He has to prepare canned food and can choose between two unlabeled lots that look identical. Someone told him that the lots have different "best before" dates: one expires in 1 month, the other in 10 months. He tastes a bit from each, but both taste absolutely identical. He knows that best before dates don't mean that the food will be bad the day after, but his chances to survive the trip are probably bigger when he picks the fresher one.
(btw, the boat is too small to take both)
The best before date is a simple function - an apples to apples comparison - you know that 6 months is better than 5 months. You also know that what you want to do (go out longer in the boat) relates to what you are measuring (how long the food will last).

Comparing codecs isn't like this at all. Comparing codecs is an apples to oranges comparison - you don't know that artefacts 6dB below threshold are better than artefacts 5dB below threshold - 1) because the characteristic of the artefacts could be different, and 2) you haven't said what "better" means. Better for what? Not for just listening (either is fine), so for what?

Cheers,
David.
Title: SoundExpert explained
Post by: 2Bdecided on 2010-12-02 10:32:13
It's like me singing the same song twice. You can't figure out how close the two different versions are by subtracting them or amplifying differences. Subjectively (if I was a very consistent singer) the two versions could sound basically identical, but mathematically every sample would be very different, and I can't see how what you propose could work. SBR isn't so different from this example!
Why not as well?
There must be some disconnect here, because this doesn't make sense to me. Either I don't understand what you mean, or you don't understand what I mean.

If I sing the same thing twice, what do you do to these two files to present them on SoundExpert.com?

Cheers,
David.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-12-02 11:18:36
If I sing the same thing twice, what do you do to these two files to present them on SoundExpert.com?

There is nothing to present on SE in this case.
Or we would have to define the first recording as the reference and ask how different (bad) the second one is. Then why not amplify the difference to some extent?
Title: SoundExpert explained
Post by: Kees de Visser on 2010-12-02 12:09:59
Comparing codecs isn't like this at all. Comparing codecs is an apples to oranges comparison - you don't know that artefacts 6dB below threshold are better than artefacts 5dB below threshold - 1) because the characteristic of the artefacts could be different, and 2) you haven't said what "better" means. Better for what? Not for just listening (either is fine), so for what?
Do we agree that there are 3 types of quality levels, from better to worse:
1- artefacts are non-existent (-inf), like in lossless coding
2- artefacts are below the hearing threshold
3- artefacts are audible, by at least one listener for at least one (killer)sample

In my view the better codec is the one that will remain in category 2 in any situation (e.g. inserting an Orban in the monitoring chain).

Example: original master is 24/96. Two lossy copies are made, one 16/44.1 and one mp3 320kbs. Both sound identical to the master.
I would say the 16/44.1 is better than the mp3, but if you can give arguments for the contrary, I'm all ear.

If I sing the same thing twice, what do you do to these two files to present them on SoundExpert.com?
SoundExpert won't work for this, nor will ABX since there's a huge risk for false positives. A lot depends on where you switch from A to B. Small tempo and pitch differences will remain unnoticed when heard in isolation, but as soon as you jump from one to the other they can become apparent. This is the daily job of an audio editor, to find the best spot to inaudibly switch from one take to another. (hint: it's not always easy and I'm glad to be paid per hour)
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-12-02 12:10:15
Comparing codecs isn't like this at all. Comparing codecs is an apples to oranges comparison - you don't know that artefacts 6dB below threshold are better than artefacts 5dB below threshold - 1) because the characteristic of the artefacts could be different, and 2) you haven't said what "better" means. Better for what? Not for just listening (either is fine), so for what?

The SE metric is trying to find a way to get at this. And "better" means "as if it were judged by golden ears in a perfect listening environment".
What for? I don't know, but all the audio pro guys want huge quality margins for their equipment, and most listeners want flac while aac@192 is transparent. Maybe they are just not clever enough.
Title: SoundExpert explained
Post by: 2Bdecided on 2010-12-02 15:04:10
Comparing codecs isn't like this at all. Comparing codecs is an apples to oranges comparison - you don't know that artefacts 6dB below threshold are better than artefacts 5dB below threshold - 1) because the characteristic of the artefacts could be different, and 2) you haven't said what "better" means. Better for what? Not for just listening (either is fine), so for what?
Do we agree that there are 3 types of quality levels, from better to worse:
1- artefacts are non-existent (-inf), like in lossless coding
2- artefacts are below the hearing threshold
3- artefacts are audible, by at least one listener for at least one (killer)sample
You can certainly define 3 such categories. It also sounds like a thing that's theoretically true (whatever that means). I suspect your categories are completely useless though...

In practice, it's hard to find a codec in category 2 that gives a significant bitrate saving over those in category 1.

It's rather difficult to prove that the codec is in category 2 rather than 3. You've got to get everyone in the world to listen carefully to every possible audio signal.

Quote
In my view the better codec is the one that will remain in category 2 in any situation (e.g. inserting an Orban in the monitoring chain).
Ah, good, so now we have everyone in the world listening to every possible audio signal via every possible piece of audio processing. Excellent.

Now, seriously, even if we put the "every person" and "every audio signal" parts to one side, you must realise that for any codec which changes the signal (let's assume the change is inaudible), there must be some audio processing we can do to make that change audible. So no codec can remain in category 2 "in any situation".


Quote
Example: original master is 24/96. Two lossy copies are made, one 16/44.1 and one mp3 320kbs. Both sound identical to the master.
I would say the 16/44.1 is better than the mp3, but if you can give arguments for the contrary, I'm all ear.
If the mp3 is made from a 16/44.1 file (as is normal) then this is silly - of course it can't be better, since it's a copy of a copy.

However, if the mp3 is made from the 24/96 by resampling to 44.1 but maintaining 24-bits, then it's kind of trivial to find a situation where the mp3 is "better":
The original master contains a signal at -110dB
The mp3 is decoded to 24-bits
The "processing" applied to the 16/44.1 wav and the decoded 320kbps mp3 is... increasing the level by 80dB.

Oh look - both sounded identical to the master before processing, but with my highly advanced processing in place (well, OK, it was a volume control!) the mp3 is revealed to be far closer to the master than the 16/44.1 version.
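For what it's worth, a numeric sketch of that thought experiment (illustrative numbers and simplifications: a plain 1 kHz tone, 44.1 kHz throughout, undithered quantisation, and a 24-bit quantiser standing in for the decoded 320 kbps mp3):

Code: [Select]
fs = 44100; t = (0:fs-1)'/fs;
x   = 10^(-110/20)*sin(2*pi*1000*t);        % signal at -110 dBFS in the master
q16 = round(x*2^15)/2^15;                   % 16-bit path: the tone falls below one LSB and vanishes
q24 = round(x*2^23)/2^23;                   % 24-bit path: stand-in for the decoded mp3
g = 10^(80/20);                             % the "processing": +80 dB of gain
rms_dB = @(s) 20*log10(norm(s)/sqrt(numel(s)) + eps);
fprintf('16-bit path after gain: %6.1f dBFS\n', rms_dB(g*q16));   % effectively silence
fprintf('24-bit path after gain: %6.1f dBFS\n', rms_dB(g*q24));   % about -33 dBFS

(With proper dither the 16-bit version would instead show a raised noise floor around -93 dBFS, which still buries the -110 dBFS signal.)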


These are all silly examples, but I think they prove the point - there's far too much assumption in the SoundExpert methods, or the "this thing sounds the same but must be better" statements.


Quote
If I sing the same thing twice, what do you do to these two files to present them on SoundExpert.com?
SoundExpert won't work for this, nor will ABX since there's a huge risk for false positives. A lot depends on where you switch from A to B. Small tempo and pitch differences will remain unnoticed when heard in isolation, but as soon as you jump from one to the other they can become apparent. This is the daily job of an audio editor, to find the best spot to inaudibly switch from one take to another. (hint: it's not always easy and I'm glad to be paid per hour)
But SBR is "singing along" with the music without tempo and pitch differences, yet re-creating it from scratch (the original waveform is discarded). ABX works fine. Amplifying the sample-by-sample differences is meaningless.


I don't see any explanation of why the SoundExpert approach works for SBR, or accurately quantifies the subjective quality of SBR wrt "traditional" coding.


It's funny - we've seen a second revolution in audio coding. The first was when basic psychoacoustics came in, and suddenly having a waveform that was "closest" to the original was no longer the way to judge quality. With two codecs, the one which had a greater error signal could sound better.

Now with SBR and PS we have another revolution where the waveform isn't an (inaudibly) distorted version of the original, but actually bears no resemblance to the original. So any measurements that include psychoacoustics while assuming that the waveform should be at least vaguely similar are also broken.

I'm not convinced that the SoundExpert method actually survived the first revolution, but it's difficult to see how it survived the second.

ABX will survive whatever happens.


I'll eat my words if someone can provide a detailed explanation of how SoundExpert works, and prove a correlation - but if it relies on sticking plasters to undo or account for each new coding trick, it's no good generally.

Cheers,
David.
Title: SoundExpert explained
Post by: Kees de Visser on 2010-12-02 16:52:16
In my view the better codec is the one that will remain in category 2 in any situation (e.g. inserting an Orban in the monitoring chain).
Ah, good, so now we have everyone in the world listening to every possible audio signal via every possible piece of audio processing. Excellent.
Exactly, that's not very practical.
And that's the very reason why so many audio professionals prefer to offer lossless formats and let the customer decide how to process it for his/her personal use.
I remember numerous complaints from HA members about online music being only available in lossy formats. Deutsche Grammophon offers both flac and 320 kbps mp3, which makes a lot of sense IMO, even if they sound identical.
Title: SoundExpert explained
Post by: greynol on 2010-12-02 18:15:31
2- artefacts are below the hearing threshold

I'm sure you won't be surprised to hear from me that this is an oxymoron.
Title: SoundExpert explained
Post by: Serge Smirnoff on 2010-12-02 18:24:44
Now with SBR and PS we have another revolution where the waveform isn't an (inaudibly) distorted version of the original, but actually bears no resemblance to the original. So any measurements that include psychoacoustics while assuming that the waveform should be at least vaguely similar are also broken.

Below are Diff. Levels of 9 SE samples processed by HE and LC profiles of CT encoder (@128 kbit/s):

Code: [Select]
           aac+ CBR@128.9 (Winamp 5.21)       aac CBR@129.6 (Winamp 5.21)
----------------------------------------------------------------------------------
BAH:       -34.4139 dB                           -33.6044 dB          
BAS:       -35.8823 dB                           -36.6633 dB
CST:       -9.9989 dB                            -21.8093 dB
FMS:       -30.0811 dB                           -36.1838 dB      
GLK:       -19.6055 dB                           -36.1699 dB      
HRP:       -14.6460 dB                           -21.5798 dB
LOB:       -16.6801 dB                           -22.8038 dB
MOF:       -21.3063 dB                           -31.3638 dB
QRT:       -33.6662 dB                           -33.6656 dB


The same for encoders @192 kbit/s:

Code: [Select]
           aac+ CBR@192.7 (Winamp 5.24)       AAC VBR@190.9 (NeroRef 1530)
----------------------------------------------------------------------------------
BAH:       -37.1624 dB                           -33.0144 dB          
BAS:       -39.2936 dB                           -32.5532 dB
CST:       -23.2628 dB                           -28.0893 dB
FMS:       -39.2991 dB                           -33.3562 dB      
GLK:       -33.8942 dB                           -37.5733 dB      
HRP:       -20.2250 dB                           -26.7197 dB
LOB:       -29.0020 dB                           -29.4662 dB
MOF:       -34.7739 dB                           -36.6439 dB
QRT:       -37.0264 dB                           -32.7181 dB


As you can see, the waveforms of both profiles differ from the reference waveforms to approximately the same degree. So it is an illusion that with SBR the waveform "bears no resemblance to the original". The illusion is inspired by the knowledge of how SBR works. In reality both waveforms are changed to a similar degree (@192 the HE versions are even closer to the references than the LC ones, though the encoders and modes are different). The main question is how they are changed.
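In case it helps to see where such numbers could come from, here is a minimal sketch under one reading of the Diff. Level definition from the first post (RMS level of the difference signal, here expressed in dB relative to the reference RMS; file names are hypothetical, the files are assumed time-aligned, equally long and level-matched, and the real SE processing surely involves more care):

Code: [Select]
[ref, fs] = audioread('reference.wav');          % hypothetical file names
[out, ~ ] = audioread('decoded.wav');
d = out - ref;                                   % difference (side) signal
diff_level = 20*log10(norm(d(:))/norm(ref(:)))   % in dB; more negative = closer to the reference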
Title: SoundExpert explained
Post by: Woodinville on 2010-12-02 23:32:58
This leads to a very simple question: What does "sub-threshold differences in a listening test" mean?
Differences that can be proven to exist with technical means, but are undetectable with a standard listening test.


Thanks to Quantum Mechanics, a good enough measurement will always find a difference if this is an analog signal.  If it's a digital signal, well, you have a tiny leg to stand on, but still: let's take a 120-second log sweep from 20 Hz to 15 kHz. 40 dB below that I put a 4 kHz tone.

Now, the difference is going to be a constant 4 kHz tone. The "noise" is stationary and exactly predictable.  Its audibility is going to vary enormously over the time of the sweep.
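For reference, a short sketch of that construction (Matlab/Octave; parameters not specified above are chosen arbitrarily):

Code: [Select]
fs = 44100; T = 120; t = (0:1/fs:T-1/fs)';
f0 = 20; f1 = 15000; k = log(f1/f0)/T;           % exponential ("log") sweep parameters
sweep = sin(2*pi*f0*(exp(k*t)-1)/k);             % 120 s sweep from 20 Hz to 15 kHz
tone  = 10^(-40/20)*sin(2*pi*4000*t);            % 4 kHz tone, 40 dB below the sweep
ref   = sweep + tone;
coded = sweep;                                   % hypothetical device that drops the tone
d     = coded - ref;                             % difference: a constant (inverted) 4 kHz tone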

We have to use how many different exemplars or whatever to decide the audibility of this noise? The scale is continuous, so ...??????