I understand that most psychoacoustic-based transform coders have difficulty coding signals like speech or music clips with strong vocals. In my listening tests, these clips seemed to lose some of their "original" quality.
One possible explanation is the mismatch between a masking threshold calculated over a long block and a signal that changes rapidly within that block. Switching to short blocks isn't a good solution either, as it involves too much block switching. AAC has the TNS tool, which flattens the temporal envelope (by applying linear prediction across the spectral coefficients) and so provides a better match between the masking threshold and the quantization noise.
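To make the TNS idea concrete, here is a toy sketch of forward prediction across the spectral coefficients of one block. All function names are mine (not from the AAC spec or any codec source), and real TNS works on selected frequency bands with quantized filter coefficients; this just shows the core mechanism: LPC over frequency, so the residual's temporal envelope is flatter after the inverse transform.

```python
# Toy TNS-style sketch: linear prediction across spectral (MDCT-like)
# coefficients. Names are illustrative, not taken from any codec.

def autocorr(x, order):
    """Autocorrelation of x up to the given lag (biased estimate)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(order + 1)]

def levinson(r, order):
    """Levinson-Durbin recursion: LPC coefficients a[0..order], a[0] = 1."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def tns_filter(spec, order=4):
    """Filter the spectrum with its own prediction filter; the residual
    corresponds to a temporally flattened envelope after the inverse MDCT."""
    r = autocorr(spec, order)
    a, _ = levinson(r, order)
    out = []
    for i, s in enumerate(spec):
        pred = sum(a[j] * spec[i - j] for j in range(1, order + 1) if i - j >= 0)
        out.append(s + pred)
    return out, a
```

On a spectrum with a strong envelope (which corresponds to a transient-like time signal), the residual energy drops sharply, which is the prediction gain TNS exploits.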
Still, it is NOT good enough. The vocals sounded a little flat, sometimes like someone singing with a blocked nose! A pitch-related problem??
I wonder whether the LTP tool would provide even better modelling of these kinds of signals.
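The intuition behind LTP, as I understand it, is that pitched material repeats with the pitch period, so the current frame can be predicted from earlier reconstructed samples at the right lag. A minimal sketch of such a lag/gain search (names and simplifications are mine, not from the AAC spec; it only considers lags that keep the matching segment fully inside the history):

```python
# Rough LTP-style sketch: find the (lag, gain) that best predicts the
# current frame from previously reconstructed samples. Illustrative only.

def ltp_search(history, frame, min_lag, max_lag):
    n = len(frame)
    best_lag, best_gain, best_metric = 0, 0.0, 0.0
    for lag in range(min_lag, max_lag + 1):
        if lag < n or lag > len(history):
            continue  # matching segment would run outside the history
        start = len(history) - lag
        past = history[start:start + n]
        num = sum(f * p for f, p in zip(frame, past))
        den = sum(p * p for p in past)
        # Normalized correlation metric: prediction gain of this lag
        if den > 0.0 and num * num / den > best_metric:
            best_metric = num * num / den
            best_lag, best_gain = lag, num / den
    return best_lag, best_gain
```

On a strongly periodic vocal segment this kind of search should lock onto the pitch period (or a multiple of it), which is why LTP might help exactly the "pitchy" material where TNS alone seems to fall short.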