_Decibeled_ human hearing
Reply #6 – 2009-10-27 14:28:52
One more doubt then. Given a signal (acoustic, of course), its mask (noise or tone mask) has no correlation with the perceptibility of a variation in the intensity (say an increase or decrease of x dB) of the signal at a given frequency, right? The exceptions are that a change made while the signal is far below the mask is probably not perceptible, but only because the signal itself is not audible, and that if the resulting intensity crosses the mask, the signal at that point will become audible (if it was below the mask and we increased it) or inaudible (if it was above the mask and we decreased it). By "correlation" I mean that the mask tells us nothing about the effect that a change at a given frequency will have.

It's only occasionally that a lossy encoder will completely silence a frequency component, because it deems it to be either fully masked or completely devoid of influence on human hearing. This is how I originally, naively imagined lossy coders worked: producing a frequency spectrum with a load of zeroes (easy to store in little space) and a few spikes for the small number of sinusoids present, which might be encoded efficiently thanks to harmonics. Real-world music signals, transform overlapping and sampling resolution, plus a good deal of broadband noise and transients, mean that sampled digital audio rarely looks that simple when frequency-analyzed. My naive way of envisaging it was wrong.

If we consider an MP3 encoder, for example (somewhat crudely, but to make the point), aiming for just-transparent encoding:

• The audio frame is divided into many sub-bands in the time domain, each of which has fewer samples and represents a frequency band (with soft edges).

• Each sub-band is transformed by a DCT to produce DCT coefficients (effectively each is a frequency component, though it's not quite so simple).

• Each DCT coefficient starts as a floating point number but eventually has to be expressed to a certain binary precision (e.g. 2-bit, 3-bit, ... 8-bit, 16-bit) with some scaling factor. This limited precision creates rounding error, which is mathematically equivalent to adding noise to the true value of that coefficient. The scaling factor adjusts the absolute level of this noise. It is called quantization noise, and its statistics and spectral characteristics are well understood in relation to the bit-depth applied.

• The precision (bit-depth) chosen to encode the DCT coefficients in each sub-band is such that the frequency components of the quantization noise are kept just below the lowest masking level within that sub-band (the masking level having been determined by the analysis process, using an FFT, according to the psychoacoustic model of human hearing). Reducing the bit-depth like this means the audio can be encoded using fewer bits (the aim of lossy coding). The masking implies that the original signal with the added quantization noise and all its spectral variations should be indistinguishable from the lossless original.

Unlike MP3, some other codec working in its "transparent" region might not use sub-bands but instead choose the precision with which individual frequency components are stored after, say, a full-spectrum transform into the frequency domain. Or, like MP2 and MusePack, it might not use the DCT, but instead use time-domain samples within the sub-bands and quantize those to the minimum number of bits required to keep the added noise masked. Essentially this is dividing the signal into frequency bands and encoding each band separately with the lowest bit-depth you can get away with, so you can save bits. The decoder then reconstitutes the numerous bands into the full spectrum, which will have added noise that varies somewhat over the frequency spectrum but should remain below the masking threshold.
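The "limited precision = added noise" point is easy to demonstrate. Here's a minimal sketch, assuming a plain uniform quantizer (not any real codec's actual quantizer, and with arbitrary signal choices of my own): round a test tone to a given bit-depth and measure the error that rounding adds. Each extra bit buys roughly 6 dB of signal-to-noise ratio, which is why shaving bits in a well-masked sub-band saves so much space.

```python
import numpy as np

def quantize(x, bits):
    """Uniformly quantize x (values in [-1, 1]) to the given bit-depth."""
    levels = 2 ** (bits - 1)            # step count for a signed sample
    return np.round(x * levels) / levels

fs = 44100                              # sample rate (my assumption)
t = np.arange(fs) / fs                  # one second of samples
signal = 0.9 * np.sin(2 * np.pi * 1000 * t)   # 1 kHz test tone

for bits in (4, 8, 16):
    noise = quantize(signal, bits) - signal   # the added quantization noise
    snr_db = 10 * np.log10(np.mean(signal**2) / np.mean(noise**2))
    print(f"{bits:2d} bits: SNR about {snr_db:.1f} dB")
```

A codec doesn't measure SNR this way, of course; it compares the spectrum of that noise against the masking levels and picks the smallest `bits` that keeps it hidden.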
LossyWAV goes a step further in simplification by reducing the bit-depth of the PCM signal directly, having looked at the spectrum and determined the "noise floor" for each block, which can be directly related to the level of quantization noise that can be added without significantly raising that noise floor. But because the noise it adds is essentially white, or only lightly shaped, lossyWAV cannot efficiently choose to hide more noise in frequency bands where there's a good masking tone, so it won't come close to matching the low bitrates of MP2, MP3 or MPC, for example, when they achieve transparency. The added noise could be calculated in the linear domain and compared to the mask in the linear domain, or they could both be expressed in decibels relative to some reference.

Think of test signal A as a strong sinusoid at, say, 3.0 kHz, with another fairly strong and audible sinusoid at 2.5 kHz, where the masking effect is strong. For test signal B, the second sinusoid is instead at 2.0 kHz, where masking is somewhat weaker. Signals A and B sound different, but now we can modify each of them by adjusting the amplitude of the quieter sinusoid while leaving the 3 kHz one the same. Now consider varying the amplitude of the lower-frequency sinusoid by 0, 1, 2, 3, 4 or 5 dB at random, and ABXing the difference between the original and the changed amplitude of the lower-frequency tone, to determine the threshold of audible change. You can generate tones in various programs to try this out for yourself. The masking effect would lead one to expect a higher threshold of audible change for the 2.5 kHz tone (ABXed against the original 2.5 kHz one) than for the 2.0 kHz tone (ABXed against the original 2.0 kHz tone). We could test this to find out.
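If anyone wants to try it, the test signals above are easy to synthesize. A quick sketch (function name, sample rate and the 0.5/0.25 tone amplitudes are all my own arbitrary choices, not part of any standard test): a fixed 3.0 kHz tone plus a second tone at 2.5 kHz (signal A) or 2.0 kHz (signal B), with the second tone's level adjustable in dB for the ABX comparison.

```python
import numpy as np

FS = 44100  # sample rate, an assumption

def two_tone(f2_hz, delta_db=0.0, seconds=2.0, fs=FS):
    """A 3.0 kHz tone plus a second tone at f2_hz, offset by delta_db."""
    t = np.arange(int(seconds * fs)) / fs
    gain = 10 ** (delta_db / 20)        # dB change -> linear amplitude factor
    return (0.5 * np.sin(2 * np.pi * 3000 * t)
            + 0.25 * gain * np.sin(2 * np.pi * f2_hz * t))

signal_a_ref = two_tone(2500)               # A: second tone at 2.5 kHz
signal_a_mod = two_tone(2500, delta_db=3)   # same, second tone raised 3 dB
signal_b_ref = two_tone(2000)               # B: second tone at 2.0 kHz
signal_b_mod = two_tone(2000, delta_db=3)
```

Write each ref/mod pair out as WAV files and ABX them; if the masking prediction holds, you'd need a larger delta_db to hear the change in signal A than in signal B.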