PNS and speech-like signals

I noticed that PNS seems to fail for speech-like signals. The decoded signal seems to lose some pitch information. Is it possible that PNS is best used together with the Long Term Predictor? The ISO spec says that PNS is activated for a scalefactor band if the signal is noise-like and its energy does not change very much between frames. Speech spectra are noise-like, but their energy changes very quickly (they contain pitch information). I wonder if this is what it meant?

Reply #1
PNS should not be enabled for frequencies below 4 kHz, and better still, not below 6 kHz. This is well above the speech frequencies.

To measure the "pitch" level you can apply either SFM (spectral flatness measure) or LPC to the spectrum. If the flatness measure is low, or LPC gives a big gain (signal/error), you do not need to activate PNS.
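
As a rough illustration of the flatness test, here is a minimal sketch of a spectral flatness measure for one band, computed as the ratio of the geometric mean to the arithmetic mean of the bin powers. The function name, the `eps` guard and the two test bands are assumptions made for this example, not taken from any particular encoder.

```python
import math

def spectral_flatness(power):
    """Geometric mean divided by arithmetic mean of the bin powers (range 0..1)."""
    eps = 1e-12  # hypothetical guard against log(0) on empty bins
    n = len(power)
    log_mean = sum(math.log(p + eps) for p in power) / n
    arith_mean = sum(power) / n + eps
    return math.exp(log_mean) / arith_mean

# Noise-like band: equal power in every bin, flatness close to 1.
flat_band = [1.0] * 16
# Pitched band: one dominant harmonic, flatness close to 0.
peaky_band = [100.0] + [0.01] * 15

print(spectral_flatness(flat_band))
print(spectral_flatness(peaky_band))
```

A band whose flatness is low would be kept out of PNS. Where to put the threshold is a tuning decision that this sketch does not settle.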

If your psychoacoustic model implementation calculates both short and long block windows at the same time, you can also use the energy disturbance between short blocks to decide whether the signal is steady or not for the long block.
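
The short-block check can be sketched like this, assuming you already have the energy of one band for each short block in the frame. The max/min ratio and the threshold of 2.0 are hypothetical choices for illustration. A real encoder may use a different disturbance measure.

```python
def energy_disturbance(short_block_energies):
    """Ratio between the loudest and the quietest short block of a band."""
    eps = 1e-12  # hypothetical guard against division by zero
    return max(short_block_energies) / (min(short_block_energies) + eps)

def is_steady(short_block_energies, max_ratio=2.0):
    """True if the band energy barely moves across the short blocks."""
    return energy_disturbance(short_block_energies) <= max_ratio

steady_band = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.0]
attack_band = [0.01, 0.01, 5.0, 2.0, 0.5, 0.1, 0.02, 0.01]

print(is_steady(steady_band))
print(is_steady(attack_band))
```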

Reply #2
I don't think that speech frequencies are that low at a sampling rate of 44100 Hz.
Its pitch information extends all the way to the top of the spectrum. Human speech is commonly assumed to be band-limited to below 4 kHz, but that is not always the case.

It is a great pity that PNS cannot be used for human speech. Many music clips contain human voices, and as a result the efficiency of PNS is not as great as I had hoped. Noise substitution would just destroy the pitch structure of the spectrum.

Reply #3
Quote
I don't think that speech frequencies are that low at a sampling rate of 44100 Hz.
Its pitch information extends all the way to the top of the spectrum. Human speech is commonly assumed to be band-limited to below 4 kHz, but that is not always the case.


That's mostly true, but:

- Above some frequency (say, 10 kHz) these "pitches" are less perceptually important and PNS artifacts should not be that offensive. Sometimes they can even be more pleasing than the signal without PNS.

- The TNS algorithm should deal with glottis-induced time-domain "pitches" (such as in es02.wav), and the post-TNS energy is relatively free of these disturbances if TNS is employed correctly.

Quote
It is a great pity that PNS cannot be used for human speech. Many music clips contain human voices, and as a result the efficiency of PNS is not as great as I had hoped. Noise substitution would just destroy the pitch structure of the spectrum.


Like I said, there is also the TNS tool for that.

Reply #4
I don't think that TNS is good enough at completely modelling the pitch structure.
For pure speech signals, the resulting noise-substituted samples sounded terrible. However, it is not so bad for music clips that contain human voices.

wkwai

Reply #5
PNS, like IS, is a lossy tool, and switching it on and off depends purely on a psychoacoustic decision: are there benefits or not.

Modeling pitchy structure with PNS is not possible, but PNS usually shouldn't be triggered for such samples. Depending on the switching criteria, the pitchy structure would be detected and the sfb would not be considered a candidate for PNS. Of course, TNS should be applied beforehand, just for testing.

Reply #6
I think substituting the TNS-filtered residual with noise is a bad idea. In theory, TNS filtering would "whiten" the temporal envelope. But in practice, the TNS implementation is just an approximation: first, the filter length is limited, and then there are quantization errors in the reflection coefficient coding.

I remember that in speech coding techniques for LPC-class codecs such as GSM and CELP, the LPC-filtered residual must be handled VERY carefully, as it affects the coded speech quality. In fact there is a lot of ongoing research into the modelling of the LPC residual. In CELP, such as the US Department of Defense 4.8 kbps CELP, the LPC residual is just a look-up table containing a series of 1s and 0s. In the old GSM and the ITU G.723 6.3 kbps codec, it is a series of impulses with variable amplitude and width.

Of course, in speech coding, prediction is done in the time domain whereas TNS works in the frequency domain. However, I doubt there is much difference when it comes to the residual.

Reply #7
Quote
Of course, in speech coding, prediction is done in the time domain whereas TNS works in the frequency domain. However, I doubt there is much difference when it comes to the residual.


The key difference is that TNS relies on the fact that reflection coefficient quantization errors and residual quantization errors will be somehow masked by pre- and post-masking in the time domain, and that the resulting signal will still be perceptually better (fewer time-domain errors) than the signal without TNS. This is not always the case, but on the other hand, there are better methods to control TNS than the one proposed in the ISO standard.

Like I said, PNS will almost never give perfect reconstruction below the audibility threshold. The key idea behind PNS is to improve quality when the psychoacoustic constraints can't be met. Some classes of signals would benefit from TNS filtering of the PNS signal as well, and it is up to the analysis algorithm to figure that out.

Reply #8
OK, I attached a couple of test files to test PNS performance with TNS.

Castanets clip, without short blocks @128 kbps

TNS ON, PNS ON
TNS OFF, PNS ON
TNS ON, PNS OFF
TNS OFF, PNS OFF

As you can see, TNS definitely improves PNS performance on such clips. However, since the bit rate is high enough for the perceptual model, the PNS versions sound clearly inferior to the "PNS OFF, TNS ON" version.

Note: the samples are encoded without short blocks, and the clip contains a considerable amount of pre-echo which is impossible to filter out without short windows @128 kbps.

Reply #9
Perhaps an even better example is the 'Fatboy' clip.

Attached... 

Again, no short blocks. Even considering the fact that the clip is impossible to code properly without short blocks, TNS greatly improves perceptual quality with and without PNS.

Reply #10
Quote
Some classes of signals would benefit from TNS filtering of the PNS signal as well, and it is up to the analysis algorithm to figure that out.


I have not evaluated your sample files yet, but are you suggesting that at the encoder, noise detection and noise substitution are done BEFORE TNS filtering, and that TNS filtering is then done on the newly substituted noise scalefactor bands to better reflect the same process at the decoder? What I did was:

1) TNS filtering first
2) then noise detection, and zeroing the scalefactor bands

My technique would generate terrible results for the same experiment you did. Somehow, temporal noise shaping and PNS noise do not mix.

Another problem: I was wondering whether the random number generator at the decoder is standardized? The Final Committee Draft just mentions that the random number generator is constructed as a series of add and multiply operations. It does not specify exactly how. I suppose that the PNS decoder in the reference source code is the true implementation? If it is, then it is possible to simulate the same PNS noise regeneration at the encoder as in the decoder. My concern is that it is not standardized.

I do not really understand the theory behind PNS. It is stated somewhere that PNS is based on the fact that one cannot differentiate one noise from another: all noise sounds the same. I would dispute this. If a scalefactor band is classified as noise, then quantizing this band injects additional quantization noise into it. According to this theory, we would not be able to notice any difference? That is not the case if the quantization is very coarse. The masking principles still hold!

Reply #11
Quote
I have not evaluated your sample files yet, but are you suggesting that at the encoder, noise detection and noise substitution are done BEFORE TNS filtering, and that TNS filtering is then done on the newly substituted noise scalefactor bands to better reflect the same process at the decoder? What I did was:

1) TNS filtering first
2) then noise detection, and zeroing the scalefactor bands


Noise substitution is usually done before quantization and >after< TNS, in order to capture the temporal structure of the original signal.

Noise detection is naturally done before TNS, or alternatively with some kind of threshold correction before TNS. It depends on how you implemented TNS and the perceptual model. Some of the things we do I can't comment on, because they are trade secrets.

You may, however, exclude PNS bands from the bit stats calculation, or from anything else that your AAC encoder does in connection with PNS bands.

Quote
My technique would generate terrible results for the same experiment you did. Somehow, temporal noise shaping and PNS noise do not mix.


You are making a mistake somewhere. Maybe an encoder bug?

Quote
Another problem: I was wondering whether the random number generator at the decoder is standardized? The Final Committee Draft just mentions that the random number generator is constructed as a series of add and multiply operations. It does not specify exactly how. I suppose that the PNS decoder in the reference source code is the true implementation? If it is, then it is possible to simulate the same PNS noise regeneration at the encoder as in the decoder. My concern is that it is not standardized.


I think that the ISO noise generator always generates the same sequences. Try decoding one file a couple of times to check; there are conformance criteria anyway. By using the ISO method in the encoder, I think you will end up with a good simulation of the result.
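
To show why an add-and-multiply generator repeats itself exactly, here is a minimal sketch of a 32-bit linear congruential generator. The constants 1664525 and 1013904223 are the widely published "Numerical Recipes" values and are used here purely as an assumption. Check the ISO reference source for the values and seeding that a conforming decoder actually uses.

```python
MASK32 = 0xFFFFFFFF

def lcg_next(state):
    """One multiply-and-add step, wrapped to 32 bits (constants are an assumption)."""
    return (state * 1664525 + 1013904223) & MASK32

def noise_sequence(seed, count):
    """Return `count` signed 32-bit pseudo-random values starting from `seed`."""
    out = []
    state = seed
    for _ in range(count):
        state = lcg_next(state)
        # Reinterpret the unsigned word as signed so the values are centred on zero.
        signed = state - (1 << 32) if state & 0x80000000 else state
        out.append(signed)
    return out

first_run = noise_sequence(1, 8)
second_run = noise_sequence(1, 8)
print(first_run == second_run)
```

With the same seed, an encoder running the same recurrence sees the same noise values as the decoder, which is what makes the simulation possible.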

Quote
I do not really understand the theory behind PNS. It is stated somewhere that PNS is based on the fact that one cannot differentiate one noise from another: all noise sounds the same. I would dispute this. If a scalefactor band is classified as noise, then quantizing this band injects additional quantization noise into it. According to this theory, we would not be able to notice any difference? That is not the case if the quantization is very coarse. The masking principles still hold!


We are talking about a couple of different things. PNS is triggered if:

a) the signal is considered non-tonal (without pitches)

AND

b) the energy disturbance is small in the time domain (the signal is relatively "flat")
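
The two conditions above, plus the lower frequency limit suggested in Reply #1, can be combined into one decision, roughly like this. The function name and all three default thresholds are hypothetical; only the structure (non-tonal AND steady, and not below about 4 kHz) comes from this thread.

```python
def should_use_pns(flatness, disturbance, band_start_hz,
                   min_flatness=0.6, max_disturbance=2.0, min_freq_hz=4000.0):
    """Decide whether one scalefactor band is a PNS candidate.

    flatness      -- spectral flatness of the band (0..1, high means noise-like)
    disturbance   -- energy variation over time (low means steady)
    band_start_hz -- lower edge of the band in Hz
    All default thresholds are illustrative assumptions.
    """
    non_tonal = flatness >= min_flatness   # condition a)
    steady = disturbance <= max_disturbance  # condition b)
    return band_start_hz >= min_freq_hz and non_tonal and steady

print(should_use_pns(0.9, 1.2, 8000.0))   # noise-like, steady, high band
print(should_use_pns(0.1, 1.2, 8000.0))   # tonal band
print(should_use_pns(0.9, 10.0, 8000.0))  # energy jumps in time
```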


If you undercode a particular sfb (coarse quantization) in the MDCT quantizer, what you get is a signal which is quieter, with envelope loudness that varies over time and frequency (i.e. violating both PNS requirements), which is the main reason for the "artifact" character on noise-like signals.
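
The energy loss from undercoding can be seen in a small sketch of an AAC-style power-law quantizer. The 0.75 exponent with its 4/3 inverse is the usual AAC non-uniform quantizer shape. The rounding offset 0.4054, the step sizes and the band values are assumptions for this illustration only.

```python
import math

ROUNDING = 0.4054  # rounding offset often seen in AAC encoders (assumption)

def quantize(x, step):
    """Power-law quantizer: small magnitudes collapse to zero when step is large."""
    q = int(math.pow(abs(x) / step, 0.75) + ROUNDING)
    return -q if x < 0 else q

def dequantize(q, step):
    """Inverse power law, as a decoder would apply it."""
    mag = math.pow(abs(q), 4.0 / 3.0) * step
    return -mag if q < 0 else mag

def coded_energy(coeffs, step):
    """Energy of the band after a quantize/dequantize round trip."""
    return sum(dequantize(quantize(c, step), step) ** 2 for c in coeffs)

band = [0.8, -0.6, 0.7, -0.9, 0.5, -0.75, 0.65, -0.85]  # noise-like sfb
original = sum(c * c for c in band)
fine = coded_energy(band, 0.05)   # small step: energy roughly preserved
coarse = coded_energy(band, 4.0)  # large step: coefficients fall to zero

print(original, fine, coarse)
```

With the large step every coefficient quantizes to zero, so the band goes silent instead of staying noisy, which is the "quieter" effect described above.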

PNS replaces the sfb noise with artificial noise that is evenly distributed in time and frequency (if the PNS criteria are met), which is the key difference between noise substitution in PNS and noise injection in the quantizer.
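
A minimal sketch of the substitution itself, assuming the only thing kept from the original band is its total energy: generate random values and scale them so that the band energy matches. Python's `random.Random` stands in for the decoder's generator here, and the helper names are made up for the example.

```python
import math
import random

def band_energy(coeffs):
    """Sum of squared spectral coefficients in the band."""
    return sum(c * c for c in coeffs)

def substitute_noise(coeffs, rng):
    """Replace a band with random values carrying the same total energy."""
    target = band_energy(coeffs)
    noise = [rng.uniform(-1.0, 1.0) for _ in coeffs]
    noise_energy = band_energy(noise)
    if noise_energy == 0.0:
        return [0.0] * len(coeffs)
    gain = math.sqrt(target / noise_energy)  # scale noise to the band energy
    return [gain * x for x in noise]

band = [0.5, -1.0, 0.25, 2.0]
replaced = substitute_noise(band, random.Random(0))

print(band_energy(band), band_energy(replaced))
```

The individual coefficients differ from the original, but the band energy is kept, unlike the undercoded case where energy is lost.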

Reply #12
Quote
If a scalefactor band is classified as noise, then quantizing this band injects additional quantization noise into it. According to this theory, we would not be able to notice any difference? That is not the case if the quantization is very coarse.

You don't quantize PNS band frequencies, you just supply the decoder with the band's total energy (noise_nrg)... Or?
daniel

Reply #13
No, of course not. I'm just commenting on wkwai's remark about the nature of PNS noise and quantization noise: they are very different.

 

Reply #14
Dear Ivan,


I have studied your results, and I got similar results to yours. It seems that even though TNS does improve on PNS for clips like castanets, it is still not as good as NOT using PNS at all. Also, using PNS on the TNS residual can cause the reconstruction TNS filter at the decoder to become unstable. I have uploaded 3 files which contain a lot of tiny, very closely spaced attacks. Using all short blocks would not be an optimal solution, so most of the time TNS is used instead.

You can download these files from the uploads section.


wkwai