Why DCT for MFCC?

Hello,
To obtain MFCC coefficients, we do the following:
(sound signal frame in time domain)->FFT->mel freq. scale filter->log->DCT

Our goal here is to extract a characteristic vector from the signal frame while reducing the number of samples.
For example, when the frame has 400 samples, taking the FFT leaves about 200 (the unique frequency bins). After mel frequency scale filtering, the number is, say, 40.
It seems to me that taking the DCT does not compress the information at all, as the number of points after the DCT is still 40...

Am I missing something here?
If the DCT is for data compression, how do I get that effect?
If the DCT is not for data compression, what is it doing?

Thanks!

Reply #1
Hmmm...

Lots missing here. Look up "prediction gain", "spectral flatness measure", and "transform gain" for starters.
-----
J. D. (jj) Johnston

Reply #2
Although you get 40 MFCC coefficients, the leading 12 or so are enough to capture the signal's essential characteristics.

In my humble opinion, that is where the DCT's data compression property resides.
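
To make the "leading 12" point concrete, here is a minimal sketch in Python with NumPy. The 40 log-mel values are synthetic (a smooth, made-up envelope, not real speech), and `dct2_matrix` and `truncate_cepstrum` are names invented for this example. It keeps only the first 12 DCT coefficients and measures how well they rebuild the 40 inputs.

```python
import numpy as np

def dct2_matrix(n):
    """Return the orthonormal n x n DCT-II matrix (each row is one basis vector)."""
    k = np.arange(n)[:, None]   # coefficient index
    m = np.arange(n)[None, :]   # input sample index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    mat[0, :] /= np.sqrt(2.0)   # scale the DC row so the matrix is orthonormal
    return mat

def truncate_cepstrum(log_mel, n_keep):
    """Keep the first n_keep DCT coefficients; return them and the rebuilt input."""
    d = dct2_matrix(len(log_mel))
    coeffs = d @ log_mel
    kept = coeffs.copy()
    kept[n_keep:] = 0.0          # discard the high-order coefficients
    approx = d.T @ kept          # an orthonormal transform is inverted by its transpose
    return coeffs[:n_keep], approx

# Synthetic, smooth "log-mel" envelope with two formant-like bumps (assumption).
band = np.arange(40)
log_mel = (5.0 - 0.05 * band
           + 1.5 * np.exp(-((band - 8) / 4.0) ** 2)
           + 1.0 * np.exp(-((band - 22) / 5.0) ** 2))

mfcc12, approx = truncate_cepstrum(log_mel, 12)
rel_error = np.linalg.norm(log_mel - approx) / np.linalg.norm(log_mel)
print(f"kept {len(mfcc12)} of {len(log_mel)} values, relative error {rel_error:.4f}")
```

The output count only stays at 40 if you keep everything. The reduction comes from dropping the tail, which costs little as long as the log spectrum is smooth across bands.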

Reply #3
Sigh.

Some terms to look up

"transform gain"
"Diagonalization"
"matched filtering"

This may lead you both in the right direction.

To explain:

If we have a sine wave of maximum amplitude, you have something whose average amplitude (i.e. mean absolute value) is roughly .5 of full scale (strictly 2/pi, about .64). While what I'm saying is not mathematically exact and is only approximate, that means that for 16-bit samples you have, say, 15 bits on average per sample.

Now, if we take a 65536-point transform that happens to exactly match the sine wave in one particular basis vector (please look up transforms to see what a basis vector is!), you will have 65535 lines with ZERO information, and 1 line with an amplitude of 65536.

This gives you 16 bits above 1 (and 16 below), for a total of 32 bits in the signal representation, spread over 65536 samples.

That's hardly 15 bits per sample.

This is, of course, a massively extreme case of transform gain. In practice, with windows etc., you cannot achieve this kind of gain. It does, however, explain the basic idea.

Basically, log_2(n) is a lot smaller than n for most n.
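
The extreme case above can be checked numerically. This is only a sketch in Python with NumPy: the bin index `k0 = 1000` and the helper name `count_significant_bins` are arbitrary choices for illustration. It uses `numpy.fft.rfft`, which returns only the non-negative frequencies, so a real sine shows up as a single line (a full complex FFT would also show its mirror image). The line's magnitude depends on the scaling convention; with NumPy's unnormalized FFT and a unit-amplitude sine it is n/2.

```python
import numpy as np

def count_significant_bins(signal, rel_tol=1e-6):
    """Return (number of FFT bins above rel_tol * peak, index of the peak bin)."""
    mags = np.abs(np.fft.rfft(signal))
    return int(np.sum(mags > rel_tol * mags.max())), int(np.argmax(mags))

n = 65536
k0 = 1000                                   # hypothetical bin-aligned frequency index
t = np.arange(n)
sine = np.sin(2.0 * np.pi * k0 * t / n)     # exactly k0 cycles fit in the frame

n_bins, peak = count_significant_bins(sine)
print(f"{n_bins} significant line(s) out of {n // 2 + 1}, peak at bin {peak}")
```

If the sine does not complete a whole number of cycles in the frame, the energy leaks into neighbouring bins, which is one reason real signals never reach this gain.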
-----
J. D. (jj) Johnston

Reply #4
The first thing to consider is that the DCT is actually just a shortcut: it is equivalent to computing the FFT of the "full spectrum" (i.e. the band values mirrored to include the negative frequencies). As for compression, it depends on how you see things. You simply cannot obtain a smaller number of samples without throwing away information. What the DCT does, however, is concentrate most of the information in the first few points (i.e. with 10 MFCCs, you have almost as much information as with the 40 mel bands).
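
The "shortcut" claim can be verified with a short sketch in Python with NumPy. The function names are made up for this example, and the DCT here is the unnormalized type II; other DCT types and scalings need a different mirroring or phase factor. Mirroring the input makes it symmetric, so its FFT carries no extra information beyond a known phase term.

```python
import numpy as np

def dct2_direct(x):
    """Unnormalized DCT-II computed straight from its cosine-sum definition."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * k * (2 * m + 1) / (2 * n)) @ x

def dct2_via_fft(x):
    """Same DCT-II, obtained from the FFT of the mirrored (even-extended) input."""
    n = len(x)
    mirrored = np.concatenate([x, x[::-1]])        # length 2n, symmetric
    spectrum = np.fft.fft(mirrored)[:n]
    k = np.arange(n)
    phase = np.exp(-1j * np.pi * k / (2 * n))      # half-sample shift correction
    return 0.5 * np.real(phase * spectrum)

x = np.cos(np.linspace(0.0, 3.0, 40)) + 0.1 * np.arange(40)   # arbitrary test data
print(np.allclose(dct2_direct(x), dct2_via_fft(x)))
```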


Reply #6
IIRC the aim of MFCC is to characterize the properties of the human vocal tract during speech (outputting a small set of aggregated representative data). The basic idea is that you want to separate the driving force (breath) from the signal, since it is basically the same for everyone and is thus useless information, and keep only the things that differ among humans (so you're analyzing the vocal tract "filter"). The ->log->DCT part serves, I guess, this purpose (since you usually discard the portion of the coefficients that carries the "breath" information).
It has been a long time since I last looked at this, so I may be wrong.
PS: shouldn't the formula be like this?
signal->FFT->magnitude spectrum (absolute values of the complex frequency samples)->square each sample->mel...
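
For reference, the whole chain, including the magnitude and squaring steps from the PS, might be sketched like this in Python with NumPy. Everything specific here is an assumption rather than something stated in the thread: the mel formula 2595*log10(1 + f/700) (one common convention), the Hamming window, the 16 kHz sample rate, the 512-point FFT, the triangular filters, and keeping 13 coefficients. The function names are invented for this example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to Nyquist."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_points[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc_frame(frame, sample_rate, n_filters=40, n_coeffs=13, n_fft=512):
    """signal -> window -> FFT -> |.|^2 -> mel filterbank -> log -> DCT-II (truncated)."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2   # zero-padded to n_fft
    mel_energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_mel = np.log(mel_energies + 1e-10)                # small floor avoids log(0)
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_filters))
    return basis @ log_mel                                # only the first n_coeffs rows

# Usage on a synthetic 400-sample frame (two sines, purely illustrative).
sample_rate = 16000
t = np.arange(400) / sample_rate
frame = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1800.0 * t)
coeffs = mfcc_frame(frame, sample_rate)
print(coeffs.shape)
```

Computing only the first `n_coeffs` rows of the DCT is where the 40 values become 13. The slowly varying vocal tract envelope lands mostly in those low-order coefficients, while the fine harmonic structure from the excitation lands mostly in the ones that are dropped.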