Topic: Music over GSM (Read 7440 times)

Music over GSM

Hi,

I am working on a project that requires music to be played to mobiles over GSM. Obviously, the quality is not going to be great, but in trials some music sounds fine and some sounds unrecognisable.

Does anyone have any experience of this, or any advice on prefiltering techniques to get the best out of the GSM codecs? E.g. compression, attack levels, limiting, etc.

Thanks,

Andy

Music over GSM

Reply #1
GSM codecs model the human vocal tract. They are highly unsuitable for music!
Not only will you fail to reproduce many of the frequency components; worse,
the LPC reconstruction filter is very likely to become unstable!!

Music over GSM

Reply #2
Quote
GSM codecs model the human vocal tract. They are highly unsuitable for music!


Right. But it's also adequate for coding monophonic music (playing just one note at a time).
But I guess that's not going to please the OP.

Quote
Not only will you fail to reproduce many of the frequency components; worse, the LPC reconstruction filter is very likely to become unstable!!


Why do you think it'll become unstable?
Is there any reason it should?

bye,
Sebastian

Music over GSM

Reply #3
I thought there are already a number of song identification services that work via cell phone. Just hold it up to the speaker for a couple of seconds (maybe 15) and you'll receive an SMS with the artist, title, etc.

UK: http://www.shazam.com/uk/do/home
Germany (same technology as UK): http://www.vodafone.de/kundenbetreuung_ser...ment/31669.html

They FFT the signal and then look at the peaks in the spectrum. One peak (for example the lowest) is chosen as the reference peak, and from there they construct vectors to the other peaks. Presto: a musical thumbprint. I'm fairly certain they also run BPM analysis and other gizmos, but basically the above seems to allow sufficient song discrimination.
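The peak-pair idea described above can be sketched in a few lines (plain Python, purely illustrative; the names and peak count are my own, not Shazam's actual scheme):

```python
# Hypothetical sketch of peak-pair fingerprinting: pick the strongest
# spectral peaks, anchor on the lowest one, and store the offsets to
# the others. The offsets survive overall gain changes, which is part
# of what makes such schemes robust over a lossy channel like GSM.

def spectral_peaks(magnitudes, count=4):
    """Return the bin indices of the `count` largest magnitudes."""
    return sorted(range(len(magnitudes)),
                  key=lambda i: magnitudes[i], reverse=True)[:count]

def fingerprint(magnitudes, count=4):
    """Anchor on the lowest-frequency peak; return offsets to the rest."""
    peaks = sorted(spectral_peaks(magnitudes, count))
    anchor = peaks[0]
    return tuple(p - anchor for p in peaks[1:])

spectrum = [0.1, 0.2, 5.0, 0.3, 4.0, 0.2, 3.5, 0.1, 2.0, 0.1]
print(fingerprint(spectrum))  # (2, 4, 6): bin offsets from the lowest peak
```

A real system would hash many such tuples per recording and look them up in a database, but the invariance argument is the same.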

That sounds like a very robust system to me. So depending on what you want to do, you may not need prefiltering at all.

Music over GSM

Reply #4
I could be wrong about the stability problem. To be honest, I have never worked on the GSM codec in detail before, only on other variants of LPC-based codecs.

The closest parallel between LPC and music coding that I can think of is the MPEG-4 TwinVQ codec. It uses time-domain LPC analysis of the music signal to model the FFT spectral shape; you can actually construct that spectral shape from the LPC coefficients. Since convolution in the time domain equals multiplication in the frequency domain, TwinVQ divides the spectral decomposition by the LPC-derived spectral shape, which is equivalent to filtering in the time domain. Tone components are "supposed" to be flattened out so that only noise remains. That is the theory of LPC filtering: the residual is supposed to be white noise.

However, with complex musical tones this is not that simple. If you plot the spectral shape (the frequency response of the LPC filter) against an FFT decomposition, you will notice that while some tones are modelled correctly, others can be missed entirely. Tones that are too closely spaced can be mistaken for a single tone. As a result, the LPC-filtered residual isn't white noise; it may still contain tonal components.

This is where the complication comes in: the modelling of the LPC residual assumes that it is white noise! What happens if it is not? I am not very sure about GSM's technique for residual modelling, but I think it uses a series of pulses (I cannot be sure). From Fourier theory, since a pulse consists of a long series of sines, I think the reconstruction LPC filter can handle monophonic tones rather well. However, problems will arise with complex tonal music:
you will very likely lose a lot of tonal components.

In TwinVQ this is the case; as a result it is used only at low bitrates, mostly for speech-like signals.

Then there is the issue of the quality of the reconstructed tones, since the slopes of the LPC spectral shape aren't "sharp" or steep enough.

Try plotting the frequency response of the LPC filter against the FFT decomposition to find out.

As for the stability issue: the Levinson-Durbin technique guarantees stability as long as the magnitude of every reflection coefficient stays below 1, which I think is always the case.
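The recursion behind that stability claim can be sketched in plain Python (illustrative only): given a valid (positive-definite) autocorrelation sequence, every reflection coefficient k comes out with |k| < 1, which keeps the synthesis filter 1/A(z) stable.

```python
# Minimal Levinson-Durbin recursion: solve for the LPC coefficients
# a[0..order] (a[0] = 1) from an autocorrelation sequence r, collecting
# the reflection coefficients k along the way.

def levinson_durbin(r, order):
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]                      # prediction error energy
    ks = []                         # reflection coefficients
    for i in range(1, order + 1):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err
        ks.append(k)
        a_new = a[:]                # symmetric coefficient update
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)        # error shrinks while |k| < 1
    return a, ks

# autocorrelation of a smoothly decaying signal (positive definite)
r = [1.0, 0.8, 0.5, 0.2]
a, ks = levinson_durbin(r, 3)
print(all(abs(k) < 1.0 for k in ks))  # True
```

If the autocorrelation estimate is corrupted (e.g. by fixed-point arithmetic), a |k| >= 1 can slip through, which is one way an LPC decoder's filter can blow up in practice.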

Music over GSM

Reply #5
@ Gecko:

That's interesting. I wasn't aware of such services. I would be very surprised if their song recognition systems manage to classify 99% of the songs in their databases correctly. Classifying sound is hard/complex enough already.
I once read about a classification scheme (for MPEG-7) that calculates a "tonality" factor per subband per time slice and compares it to other tonality patterns.
That approach would probably fail to classify music sent over GSM, because GSM's model is NOT suited for music and would therefore introduce a lot of noise.


@ wkwai:

In my humble opinion, LPC is a good tool for coding time-domain samples. It won't produce a perfectly white residual for every input, but it's white enough in most cases.
I think it's overkill and inappropriate to use an LPC filter for transform codecs such as AAC/Vorbis/VQF. Its frequency response is "too wavy" and won't match the source signal well. So you end up with frequency regions that exceed the average energy level and frequency regions that fall below it. A better alternative would be to directly code the frequency response of a synthesis filter instead of coding LSP coeffs and computing the response at the decoder side (what a waste of time that is). But this curve would not be very smooth, due to tonal peaks, and would therefore be hard to encode compactly. That's why MP3/AAC/Vorbis and MPC (starting with stream version 8) encode the noise floor instead and adaptively select an appropriate codebook for each frequency region. The noise floor is much smoother and easier to encode using differential Huffman coding and such - but I guess you're aware of that. :-)

Anyhow, one can take care of the instability problem by applying a Hann window to the time samples before running the Levinson-Durbin algorithm. The probability of an unstable LPC filter is then very, very low. Even if it did become unstable, special care can be taken by adjusting the LSPs afterwards (they should not be too close together, and not equal to 0 or pi).
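That windowing step is simple to sketch (plain Python, illustrative only; the frame length and sine test signal are arbitrary): taper the analysis frame with a Hann window, then compute the autocorrelation that feeds Levinson-Durbin.

```python
import math

def hann_window(n):
    """Hann window of length n; tapers smoothly to zero at both ends."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def autocorrelation(frame, max_lag):
    """Biased autocorrelation estimate r[0..max_lag] of a frame."""
    return [sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
            for lag in range(max_lag + 1)]

frame = [math.sin(0.3 * i) for i in range(160)]          # 20 ms at 8 kHz
windowed = [x * w for x, w in zip(frame, hann_window(160))]
r = autocorrelation(windowed, 10)
print(r[0] > abs(r[1]))  # True: r[0] dominates, as it must for stability
```

The taper removes the sharp frame edges that otherwise distort the autocorrelation estimate, which is why the resulting LPC filter is almost always stable.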


@ andymac:

I may be wrong (sometimes I am) - but in my opinion the chance is low that any pre-processing would improve the sound quality of GSM-coded music. The codec just is not suited for that kind of use.


bye,
Sebastian

Music over GSM

Reply #6
Quote
I think it's overkill and inappropriate to use an LPC filter for transform codecs such as AAC/Vorbis/VQF. Its frequency response is "too wavy" and won't match well with the source signal.

VQF uses an LPC filter to model the FFT spectral shape. Dividing the MDCT spectral values by this shape in the frequency domain is equivalent to filtering (convolution with the analysis filter) in the time domain, and conversely, convolution in the frequency domain is equivalent to multiplication in the time domain.

So, structurally, VQF is a hybrid LPC-transform coder: transform, because the residuals are coded as MDCT coefficients.

It is the same thing: you can linearly filter the time-domain music signal and then transform the residual into the frequency domain with the MDCT.

Or you can construct the LPC frequency response shape, transform the unfiltered music into the frequency domain, and then divide it by that LPC frequency response shape.

In both cases, it is the same process.

So the GSM and CELP codecs are very similar to VQF; only the residual is coded differently. Because VQF's residual is in the frequency domain, it has the advantage of being able to apply psychoacoustic principles when coding the residuals.
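The encoder-side division and decoder-side multiplication described above can be sketched as follows (plain Python, illustrative only; a real VQF coder quantizes the residual in between, and uses a higher-order filter):

```python
import cmath
import math

def lpc_envelope(a, n_bins):
    """Magnitude response 1/|A(e^jw)| of the LPC synthesis filter 1/A(z),
    sampled at n_bins frequencies from 0 up to (but excluding) pi."""
    env = []
    for b in range(n_bins):
        w = math.pi * b / n_bins
        A = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        env.append(1.0 / abs(A))
    return env

def flatten(spectrum, a):
    """Encoder side: divide the spectrum by the LPC envelope."""
    return [s / e for s, e in zip(spectrum, lpc_envelope(a, len(spectrum)))]

def reconstruct(residual, a):
    """Decoder side: multiply the residual by the same envelope."""
    return [r * e for r, e in zip(residual, lpc_envelope(a, len(residual)))]

a = [1.0, -0.9]                     # toy first-order LPC filter A(z)
spectrum = [4.0, 3.0, 2.0, 1.0]     # toy magnitude spectrum
roundtrip = reconstruct(flatten(spectrum, a), a)
print(all(abs(x - y) < 1e-9 for x, y in zip(spectrum, roundtrip)))  # True
```

This also shows why the decoder needs the filter coefficients (or the envelope itself) every frame: without them, the flattened residual cannot be re-shaped.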

For the GSM and CELP codecs, perhaps someone could implement a "modified" encoder that uses a psychoacoustic model to model the LPC residuals more accurately, without sacrificing compatibility at the decoder? You would most likely need to transform the residuals into the frequency domain to select the best time-domain sequences from a set of tables or codebooks, taking masking principles into account!!

As you can hear from 8 kbps VQF music clips, it is very good, outperforming GSM and ITU-T G.723. I suppose that is due to the psychoacoustic model?

Music over GSM

Reply #7
Quote
VQF uses an LPC filter to model the FFT spectral shape. Dividing the MDCT spectral values by this shape in the frequency domain is equivalent to filtering (convolution with the analysis filter) in the time domain, and conversely, convolution in the frequency domain is equivalent to multiplication in the time domain.

So, structurally, VQF is a hybrid LPC-transform coder: transform, because the residuals are coded as MDCT coefficients.


Yeah, I'm aware of all this.

Quote
It is the same thing: you can linearly filter the time-domain music signal and then transform the residual into the frequency domain with the MDCT.

Or you can construct the LPC frequency response shape, transform the unfiltered music into the frequency domain, and then divide it by that LPC frequency response shape.

In both cases, it is the same process.

Yes, but why should I calculate LSP coeffs for an LPC analysis/synthesis filter if I'm operating in the frequency domain anyway? The only thing you need to know is the filter's response - not the LSP values or anything like that. So IMHO calculating & coding LSP coeffs is inappropriate for T/F coders.

The other thing is - as you already pointed out - an LPC analysis filter does not produce a perfectly white residual in most cases. This is bad, because the energy will still vary somewhat with frequency - meaning the spectral samples have frequency-varying variance - so you have to use different codebooks with a) different SNRs (to do proper noise shaping based on psychoacoustics) and b) different variances, because the spectrum has not been flattened well.

Quote
For the GSM and CELP codecs, perhaps someone could implement a "modified" encoder that uses a psychoacoustic model to model the LPC residuals more accurately, without sacrificing compatibility at the decoder? You would most likely need to transform the residuals into the frequency domain to select the best time-domain sequences from a set of tables or codebooks, taking masking principles into account!!

(Frequency-varying) noise shaping could be done in the time domain, too (the same way Foobar2000 does its ATH noise shaping, for example). I think Speex does noise shaping to increase subjective quality. And I agree with you that this is something that could be added to a GSM encoder (if it's not already built in) without losing backwards compatibility.


bye,
SebastianG

Music over GSM

Reply #8
Quote
Yes, but why should I calculate LSP coeffs for an LPC analysis/synthesis filter if I'm operating in the frequency domain anyway? The only thing you need to know is the filter's response - not the LSP values or anything like that. So IMHO calculating & coding LSP coeffs is inappropriate for T/F coders.


Because at the decoder you need this LPC response to reconstruct the flattened tone components. At the encoder, the spectral values are divided by this LPC response, and at the decoder you have to multiply the residual spectral values by the same LPC response. That is why you need to transmit these LSPs to the decoder: only then can you reconstruct the LPC response envelope.

In the frequency domain, tones are usually very bit-consuming, especially when Huffman coded. The idea is to flatten some of these tone components first before coding them losslessly; and since LSPs are highly efficient, you end up with extra coding gain!

I cannot be sure about this, as I have not done any measurements of the LPC + Huffman combination vs. Huffman alone. But I can see from AAC's Huffman coding that tones usually result in quantized spectral values > 16, which require attaching extra codes to "extend" the Huffman codes. These are very bit-consuming. I am not sure whether there is a better overall solution to this within Huffman coding alone, but if some of these tones can first be represented by a more efficient method, it will certainly help.
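A toy illustration of that coding-gain argument (plain Python; the spectrum, the crude envelope, and the escape threshold of 16 are used loosely here, not taken from any real encoder):

```python
# A strong tonal peak quantizes to a large integer (AAC's Huffman
# tables need escape coding above 16), while the same spectrum divided
# by a roughly matching envelope stays small and cheap to code.

def quantize(spectrum, step=1.0):
    """Crude uniform quantizer: round each value to the nearest step."""
    return [round(s / step) for s in spectrum]

tonal = [0.5, 0.8, 40.0, 0.7, 0.4]       # one strong tone in the spectrum
envelope = [1.0, 1.0, 32.0, 1.0, 1.0]    # crude envelope matching the tone
flattened = [s / e for s, e in zip(tonal, envelope)]

print(max(quantize(tonal)))      # 40 -> needs escape coding, many bits
print(max(quantize(flattened)))  # 1  -> small symbols, cheap to code
```

The bits spent transmitting the envelope (as LSPs or otherwise) pay off only if they cost less than the escape codes they remove, which is exactly the measurement the post says has not been done.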

Anyway, in VQF the residual spectral values are vector quantized.

Then again, AAC has an almost equivalent tool: the prediction tool, which also flattens "stationary" signals (not necessarily tones in the MDCT domain, as tones are not stationary in the MDCT. I think there might be some error in VQF in assuming that the MDCT decomposition is identical to the FFT decomposition; there is a slight difference between the spectral shapes of the two transforms).

It is also true that even in VQF the spectral values are not properly flattened (unless the input is speech); that is why the residual is not white noise, and why a psychoacoustic model is needed to select the correct vectors, even though it might lose some closely spaced tone components.

In comparison, CELP/GSM simply assumes that the residual spectrum is already white/flattened!! That is why it has problems handling complex music. GSM/CELP is just a simplified VQF coder structure.

Music over GSM

Reply #9
Quote
I am working on a project that requires music to be played to mobiles over GSM.  Obviously, the quality is not going to be great but in trials some music sounds fine and some sounds unrecognisable.

Does anyone have any experience of this or any advice on any prefiltering techniques to get the best out of the GSM codecs?  ie compression, attack levels, limiting etc.

The only pre-filter I can suggest is cutting some of the low frequencies, which speech codecs tend not to like. Otherwise, the differences in quality are probably mainly due to the content. For example, when the music signal is periodic (only harmonics of the same fundamental), the codec's pitch predictor can help a lot, but if you have many notes, it won't be able to do anything.
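A minimal version of that low-cut prefilter can be sketched as a first-order high-pass (plain Python; the 200 Hz cutoff and the filter form are illustrative choices of mine, not a recommendation from the post):

```python
import math

def highpass(x, sample_rate=8000, cutoff=200.0):
    """First-order high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1])."""
    a = 1.0 / (1.0 + 2 * math.pi * cutoff / sample_rate)
    y = [0.0] * len(x)
    prev_x = 0.0
    for n, xn in enumerate(x):
        y[n] = a * ((y[n - 1] if n else 0.0) + xn - prev_x)
        prev_x = xn
    return y

# DC (0 Hz) is removed entirely: a constant input decays toward zero
out = highpass([1.0] * 100)
print(abs(out[-1]) < 0.05)  # True
```

In practice you would pick the cutoff by ear against the target codec, since cutting too high starts removing musical content rather than just the energy the codec wastes bits on.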

Music over GSM

Reply #10
Maybe it's a language problem, wkwai. We seem to talk at cross-purposes.

I see your point in the similarities of VQF and a CELP codec and I totally agree with you.

We both said, that in VQF we need the filter's response for frequency domain (de)convolution (not its LSP or LPC representation).
The thing I'm trying to say is: I don't like the LSP representation of such a filter for T/F coders. Why code LSP coeffs and calculate the response within the decoder instead of directly (de)coding the response? Is the LSP representation more compact? I don't think so. It just complicates everything...

The LSP representation of such a spectral filter makes sense for time-domain coding: we can easily compute the LPC coeffs and apply the filter in the time domain. We don't have to compute the response there...

By "AAC predictor tools" I guess you mean TNS, which is the dual of a CELP-like scheme. It is applied in the frequency domain, affecting not the spectral shape but the temporal shape. We don't need the filter's response here either, but rather the LPC coeffs, to be able to apply the filter.

bye,
Sebastian

Music over GSM

Reply #11
Quote
The thing I'm trying to say is: I don't like the LSP representation of such a filter for T/F coders. Why code LSP coeffs and calculate the response within the decoder instead of directly (de)coding the response? Is the LSP representation more compact? I don't think so. It just complicates everything...


Because you need the LSPs to reconstruct the LPC spectral envelope at the decoder. At the encoder, the MDCT spectral values are divided by this LPC envelope; this corresponds to filtering (convolution with the analysis filter) in the time domain.

Therefore, at the decoder, you need to multiply the flattened MDCT spectral values by the same LPC envelope; this corresponds to the inverse (synthesis) filtering in the time domain.
Since this LPC envelope varies from frame to frame, it is necessary to transmit the LSP/LPC coefficients to the decoder.

Of course, the LPC computation is done in the time domain, but it also has a frequency-domain representation: you can approximate the FFT spectral shape by conducting LPC analysis in the time domain.

I do think the LSP representation is more compact.

As for TNS vs. spectrum normalization (VQF): they are not the same, I am sorry about that. Spectrum normalization is the CELP-like type of scheme. Remember that convolution in the time domain equals multiplication in the frequency domain; TNS is the opposite.

Anyway, all this discussion about VQF is just to show how GSM, CELP, and other LPC-based speech coders can be analysed in the frequency domain, to better understand the limitations of these speech coders in handling music.

Music over GSM

Reply #12
Quote
Does anyone have any experience of this or any advice on any prefiltering techniques to get the best out of the GSM codecs?  ie compression, attack levels, limiting etc.


I think you might have pre-echo problems with music that has a lot of attacks. At an 8 kHz sampling rate and 160 time samples per frame, the noise spread is about 20 msec.

Anyway, since GSM is already so lousy for music, you probably won't notice the pre-echo noise.

In VQF, there is a need to switch to shorter blocks, just like in AAC/MP3.
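The 20 msec figure above is simply the frame duration, which bounds how far quantization noise can smear around a transient:

```python
# Frame duration = samples per frame / sample rate
frame_samples = 160      # samples per GSM frame
sample_rate = 8000       # Hz
frame_ms = 1000.0 * frame_samples / sample_rate
print(frame_ms)  # 20.0 -> noise from one frame can spread across ~20 ms
```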

Music over GSM

Reply #13
Quote
Because, you need the lsps to reconstruct the lpc spectral envelope at the decoder..

LSPs can be used for that. But it's not the only possible representation of a filter, so strictly speaking you are wrong! I don't want to repeat for the 3rd time why I think the LSP representation of such a filter is inappropriate for T/F encoders. The only thing you keep telling me is that convolution in time is the same as multiplication in the frequency domain. (Yes, I was aware of this even before you mentioned it more than once.)

Quote
Therefore, at the decoder, you need to multiply the flattened MDCT spectral values by the same LPC envelope; this corresponds to the inverse (synthesis) filtering in the time domain.
Since this LPC envelope varies from frame to frame, it is necessary to transmit the LSP/LPC coefficients to the decoder.

I have to say: you are stuck in the LSP/LPC world.
(I may be stuck in the AAC/Vorbis world, thinking: coding the noise floor via scalefactors or a piecewise linear function, instead of a spectrum-flattening filter, is the ultimate solution.)

BTW: This is similar to the discussion about floor type 0 versus floor type 1 (Vorbis).
Type 0 lost the battle, and I'm quite happy with that.

Quote
I do think the LSP representation is more compact.

Thanks for a new statement. I know this is a common assumption. But I believe it is not true when we are not interested in the LP coeffs, as in VQF. (We don't need the LP coeffs of a flattening filter; we need the filter's response for a component-wise multiplication with the MDCT samples. Agree?)

Quote
As for TNS vs. spectrum normalization (VQF): they are not the same, I am sorry about that. Spectrum normalization is the CELP-like type of scheme. Remember that convolution in the time domain equals multiplication in the frequency domain; TNS is the opposite.

You call it "the opposite"; I call it "the dual in another domain". TNS is done via convolution in the frequency domain, and CELP coders do convolution in the time domain.
The spectral filter in a VQF decoder is applied via multiplication in the frequency domain. That's why I said TNS and CELP have something in common: they both need the LP coeffs in order to apply the filter via convolution.

Quote
Anyway, all this discussion about VQF is just to show how GSM, CELP, and other LPC-based speech coders can be analysed in the frequency domain, to better understand the limitations of these speech coders in handling music.

Right. I'm very sorry about being rather off-topic now, but I had to defend my position on the LSP / T/F-coder issue.

To be not fully off-topic:
I believe that CELP-based speech codecs in general would benefit from a more advanced use of psychoacoustics, via sophisticated time-domain spectral noise-shaping techniques, without losing compatibility.
Long-term prediction (aka pitch prediction) is an important tool for CELP coders that greatly improves sound quality in single-voice situations. For music this usually won't work well, which will lead to very noisy results due to the low bitrate.

About the pre-echo stuff: CELP coders transmit more than one scalefactor per 20 ms frame, so there are no pre-echo issues. (I'm not 100% sure about this.)

bye,
Sebi

Music over GSM

Reply #14
Quote
LSPs can be used for that. But it's not the only possible representation of a filter. So, strictly speaking you are wrong! [...] We don't need the LP coeffs of a flattening filter, we need the filter's response for a component-wise multiplication with the MDCT samples, agree? [...] TNS is done via convolution in the frequency domain and CELP coders do convolution in the time domain.

Well, the mathematics from LPCs to LSPs is rather elaborate (I cannot remember it after 4 years), but that is what VQF uses: the LPCs are converted to LSPs before vector quantization.

I am sorry, I may not be sure exactly what you did not understand!! But the key point here is that from time-domain LPC analysis we can approximate the FFT spectral shape, which is the MAGNITUDE RESPONSE in the frequency domain of the FIR LPC filter (NOT the IMPULSE RESPONSE!).

TNS is the dual of ATRAC3 gain control, NOT of CELP!!

Music over GSM

Reply #15
Hi, wkwai !

We're going around in circles. You keep telling me things I already know, and I'm unable to put what I'm trying to say into words you can understand. I apologize for that.

(language problem I guess)

I followed the whole thread again and noticed I should have used the term "magnitude response" instead of just "response". You might have thought I was talking about the impulse response. I apologize for that one, too.

It's not that I'm unaware of how VQF works and want you to teach me.
It's just that I dislike the LSP representation of a spectral flattening/reconstruction filter for transform codecs.

Anyway, it's off-topic, so I'll keep my mouth shut from now on.

bye,
Sebastian

Music over GSM

Reply #16
Quote
It's just that I dislike the LSP representation of a spectral flattening/reconstruction filter for transform codecs.

Anyway, it's off-topic, so I'll keep my mouth shut from now on.

bye,
Sebastian

You are right. I finally remember the plot of the 10 LSP coefficients vs. frame number. For speech, there is a high level of correlation between them; they have almost identical shapes, resulting in high coding gains.

However, this may NOT be true for other types of audio signals. (I have not plotted them for music!!)

Thank you!