Why is it that lossy audio encoders, which try to model the human auditory system in order to avoid encoding perceptually irrelevant information, tend to use uniform frequency resolution (AFAIK)?
If you base your representation on a uniform block transform and combine coefficients to selectively widen the higher-frequency bands, you ought to be losing temporal resolution compared to a true non-uniform filterbank, wavelet decomposition, or whatever.
Would not e.g. a gammatone filterbank be a better mapping to our hearing, and therefore a better starting point for deciding what information to throw away (if CPU cycles were of no concern, and any redundancy in the representation could be efficiently compressed)?
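To make it concrete, here is a minimal numpy sketch of the kind of analysis I have in mind, using the standard Glasberg & Moore ERB formulas (the filter order, bandwidth factor, and band count are just my arbitrary choices):

```python
import numpy as np

def erb(fc):
    # Equivalent rectangular bandwidth in Hz (Glasberg & Moore)
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    # Impulse response of a 4th-order gammatone filter centered at fc
    t = np.arange(int(duration * fs)) / fs
    return t**(order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def erb_centers(fmin, fmax, n_bands):
    # Center frequencies spaced uniformly on the ERB-number scale, not in Hz
    e = np.linspace(21.4 * np.log10(1 + 0.00437 * fmin),
                    21.4 * np.log10(1 + 0.00437 * fmax), n_bands)
    return (10**(e / 21.4) - 1) / 0.00437

fs = 44100
bank = [gammatone_ir(fc, fs) for fc in erb_centers(100, 12000, 32)]
# Analysis: convolve the signal with each band's impulse response, e.g.
# y_band = np.convolve(x, bank[i], mode="same")
```

Note that the bands get wider (and hence temporally sharper) toward high frequencies, which is exactly the non-uniform resolution I'm asking about.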
I seem to remember that the neuron firing rate is limited to something like an effective bandwidth of 2 kHz. If critical bands of higher bandwidth are encoded only as a half-wave rectified envelope, this would put a limit on the temporal resolution needed to model the system, and perhaps this is some "lucky coincidence" that ensures the success of (perceptually modified) uniform block transforms?
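Something like this toy sketch is what I mean by envelope coding; the 2 kHz cutoff is just my assumed neural bandwidth limit, and the filter choice is arbitrary:

```python
import numpy as np
from scipy.signal import butter, lfilter

fs = 44100
t = np.arange(fs) / fs
# 8 kHz carrier with slow amplitude modulation
x = np.sin(2 * np.pi * 8000 * t) * (1 + 0.5 * np.sin(2 * np.pi * 50 * t))

rectified = np.maximum(x, 0.0)        # half-wave rectification
b, a = butter(4, 2000 / (fs / 2))     # ~2 kHz lowpass: the assumed neural bandwidth limit
envelope = lfilter(b, a, rectified)
# 'envelope' keeps the amplitude modulation but discards the 8 kHz fine structure
```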
Another (only somewhat related) question: when 50% overlapped block transforms are used, would this not introduce some redundancy into the parameters (coefficients)?
You're at a resort in Turkey and you're reading HA? This is obviously some definition of "relaxation" that I'm not familiar with.
Quote: "I seem to remember that the neuron firing rate is limited to something like an effective bandwidth of 2 kHz. If critical bands of higher bandwidth are encoded only as a half-wave rectified envelope, this would put a limit on the temporal resolution needed to model the system, and perhaps this is some 'lucky coincidence' that ensures the success of (perceptually modified) uniform block transforms?"

I'm not familiar with this research, nor am I sure I understand what you're saying.
Quote: "Another (only somewhat related) question: when 50% overlapped block transforms are used, would this not introduce some redundancy into the parameters (coefficients)?"

The MDCT takes 2N samples in and puts N coefficients out, with 50% overlap. So there is no redundancy: it's critically sampled. (Another reason why it's so popular!)
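To make the critical-sampling point concrete, here is a quick numpy sketch of a direct (slow, matrix-form) MDCT with a sine window; a real codec would obviously use a fast algorithm:

```python
import numpy as np

def mdct_basis(N):
    # N x 2N cosine basis with the standard MDCT phase term
    n = np.arange(2 * N)
    k = np.arange(N)
    return np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))

def mdct_ola_roundtrip(x, N):
    # Frame with 50% overlap, MDCT each frame, IMDCT, overlap-add
    w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # sine window (Princen-Bradley)
    C = mdct_basis(N)
    y = np.zeros_like(x)
    for t in range(len(x) // N - 1):
        X = C @ (x[t * N : t * N + 2 * N] * w)   # 2N samples in -> N coefficients out
        y[t * N : t * N + 2 * N] += (2.0 / N) * (C.T @ X) * w  # aliasing cancels in the sum
    return y

rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(20 * N)
y = mdct_ola_roundtrip(x, N)
assert np.allclose(x[N:-N], y[N:-N])  # perfect reconstruction except at the unoverlapped edges
```

You get roughly one coefficient per input sample overall, yet overlap-add still reconstructs the signal exactly; that is the time-domain aliasing cancellation (TDAC) trick.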
Kutinh, you are missing one of the issues in coding. There are two:

1) We must implement the masking thresholds well.
2) We must squeeze as much coding gain out of the codec as we can afford in latency, without 1) falling out from under the time window.

The reason uniform banks are used is that we get a lot of coding gain.
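To put a number on "coding gain": for an orthogonal transform T and source covariance R, the transform coding gain is the ratio of the arithmetic to the geometric mean of the coefficient variances diag(T R T^T). A quick sketch for an AR(1) toy source (the model and parameters are my choice, not anything codec-specific):

```python
import numpy as np

N, rho = 32, 0.95                              # block size, AR(1) correlation (assumed)
i = np.arange(N)
R = rho ** np.abs(i[:, None] - i[None, :])     # AR(1) autocovariance matrix

def coding_gain(T, R):
    v = np.diag(T @ R @ T.T)                   # coefficient variances
    return v.mean() / np.exp(np.log(v).mean()) # arithmetic mean / geometric mean

# Orthonormal DCT-II matrix
n, k = np.meshgrid(np.arange(N), np.arange(N))
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
D[0, :] = 1.0 / np.sqrt(N)

print(coding_gain(np.eye(N), R))  # identity transform: gain 1 (no compaction)
print(coding_gain(D, R))          # DCT: large gain on this correlated source
```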
It was suggested to me some years ago that communication from the ears to the brain operates at an effective sampling rate of a couple of kHz. It was also suggested that the reason why we can still "hear" stuff at higher frequencies is that (at some point) the signalling starts to track the envelope instead, sampled at the same rate.
Quote from: Woodinville on 21 April, 2012, 12:32:29 PM: "Kutinh, you are missing one of the issues in coding. There are two: 1) We must implement the masking thresholds well."

This is the "lossy" part, right? Introducing error/inaccuracy in such a way as to decrease bitstream bandwidth without introducing too much audible distortion (i.e. irrelevancy removal).
I had just assumed that since lossless codecs (such as FLAC) typically operate at about half the bitrate of LPCM, this is a good estimate of the redundancy in typical PCM, and that the remainder has to be removed through removal of irrelevancy.
Would not (any orthogonal transform + ideal vector quantization) do as well as (an ideal orthogonal transform)? If PCA or the KLT gives you the most energy compaction (hopefully something close to the DCT), could not (in principle, and at unreasonable computational cost) a raw time-sample representation (identity transform) + vector quantization do the same thing?
This is where I step in over my head, but my guess is no. I think the DCT is somewhat special in that its decomposition lines up fairly neatly with the actual time/frequency response of the human ear (except that it is linear in frequency). I think if you went with something like PCA you would get better energy compaction, but would have a lot more trouble computing the masking thresholds than with the DCT. However, I admit I am much more familiar with decoders than with actual encoding.
I think that the DCT (or something similar) is special in that it offers: a) a representation in a form suitable for perceptually guided quantization, b) energy compaction, and c) low implementation cost. I agree that PCA would probably give a (small) benefit with regard to energy compaction, but a big hit with regard to perceptually sensible quantization.
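FWIW, you can check how close the DCT's compaction gets to PCA/KLT on an AR(1) toy source; a self-contained sketch (the model and parameters are my arbitrary choices):

```python
import numpy as np

N, rho = 32, 0.95
i = np.arange(N)
R = rho ** np.abs(i[:, None] - i[None, :])      # same AR(1) toy source as above

def coding_gain(T, R):
    v = np.diag(T @ R @ T.T)
    return v.mean() / np.exp(np.log(v).mean())

# KLT/PCA basis: eigenvectors of the source covariance
eigvals, V = np.linalg.eigh(R)

# Orthonormal DCT-II matrix
n, k = np.meshgrid(np.arange(N), np.arange(N))
D = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
D[0, :] = 1.0 / np.sqrt(N)

print(coding_gain(V.T, R))  # KLT: the best any orthogonal transform can do here
print(coding_gain(D, R))    # DCT: comes out very close for rho near 1
```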
But if there is some transform that does the perceptual mapping better than the DCT (probably some non-uniform (non-linear?) wavelet/filterbank), and another operation that can squeeze out the redundancy of any finite-length correlated, transformed signal (vector quantization?), then those two together would seem to be able both a) to convert all irrelevancy into redundancy, and b) to convert all redundancy into fewer transmitted bits?
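As a toy illustration of vector quantization soaking up redundancy that a scalar quantizer cannot (the source and codebook size are my arbitrary choices):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
# Strongly correlated pairs of samples: the redundancy a VQ can exploit
cov = [[1.0, 0.9], [0.9, 1.0]]
x = rng.multivariate_normal([0, 0], cov, size=10000)

codebook, _ = kmeans(x, 16)                  # 16 codevectors = 4 bits per pair
codes, dists = vq(x, codebook)
print("per-sample MSE:", np.mean(dists**2) / 2)
# A 2-bit-per-sample scalar quantizer (4 levels per axis) does worse on this
# source, because it cannot exploit the correlation between the two samples.
```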
It strikes me as unfortunate that the transform has to be chosen to satisfy both energy compaction and masking thresholds at the same time, if those two are somewhat different goals?

-h
You could use DCT over a few wavelet bands. The ATRAC codec mentioned above implements this strategy and thus achieves nonuniform time-frequency resolution.
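Something like this toy sketch (just a Haar split, not ATRAC's actual QMF) shows the idea: long DCT blocks on the low band for fine frequency resolution, short blocks on the high band for fine time resolution:

```python
import numpy as np

def haar_split(x):
    # One-stage two-band split with Haar filters, downsampled by 2
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)
    return lo, hi

def dct_blocks(x, B):
    # Orthonormal DCT-II applied to consecutive length-B blocks
    n, k = np.meshgrid(np.arange(B), np.arange(B))
    D = np.sqrt(2.0 / B) * np.cos(np.pi * (2 * n + 1) * k / (2 * B))
    D[0, :] = 1.0 / np.sqrt(B)
    return [D @ x[i:i + B] for i in range(0, len(x) - B + 1, B)]

x = np.random.default_rng(0).standard_normal(4096)
lo, hi = haar_split(x)
coeffs_lo = dct_blocks(lo, 256)  # long blocks: fine frequency resolution at low frequencies
coeffs_hi = dct_blocks(hi, 32)   # short blocks: fine time resolution at high frequencies
```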
JJ, are you referring to the "overall performance" of the ATRAC codec or to the energy compaction property of its filter bank? I would be surprised if its energy compaction were worse than that of a single-resolution MDCT (not considering block switching here).
You seem to imply that aliasing in MP3 has a strong effect on coding efficiency. Is that really true for real-world signals? I thought the sparsity of transform coefficients was not strongly affected by aliasing.