MP3 High Frequency Reconstruction Help

I have been trying to perform efficient spectral enhancement to convert low-bitrate MP3 files to high-bitrate ones using machine learning techniques. I have been successful in recreating a good part of the audio spectrum, as seen in the image below (showing 80 kbps vs. 160 kbps vs. 80 kbps upscaled to 160 kbps).



My problem is that the reconstructed audio still doesn't sound any different from the original 80 kbps file, in spite of decent spectrogram, phase, and magnitude spectrum plots.





I also plotted the magnitude-squared coherence estimate between the upscaled and the high-bitrate audio, and the graph isn't even close to what I had expected (a flat line at one for every frequency).



Could this be the reason the reconstructed audio does not sound the same? If so, is there any way to turn the scipy coherence function into a differentiable function that I can optimize to get better results, or is there perhaps a different function I could use to optimize coherence?
My current scores are as follows:
MSE between the high-bitrate and upscaled STFTs (unnormalized) = 0.07
SSIM score = 0.0024
MSE between magnitude spectra (scaled) = 3e-9
MSE between phase spectra (scaled) = 0.008

At this point I am extremely confused about what more I could do to make them sound similar. If you wish to hear the two-second sample above, I have attached the three files below. Some advice or guidance would be really helpful. Thanks in advance!

Re: MP3 High Frequency Reconstruction Help

Reply #1
I see two different problems here. 

First, the difference in high frequency content you are trying to reconstruct is small, so if everything works you're probably not going to hear much difference because there isn't much difference to hear.

Second, from the coherence function it looks to me like you're not even close to reconstructing the actual signal, so it seems like your approach isn't working. 

The first is pretty easy to fix. Pick test signals where there is some meaningful difference in frequency content to reconstruct. For example, take a lossless file and low-pass filter it at 4 kHz. Then you can listen to your reconstruction and hear whether it's working. For the second (figuring out how to do what you want), I can't really comment since I have no idea what you're doing.
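A minimal sketch of building such a test pair, assuming the lossless file has already been decoded to a NumPy array (a synthetic two-tone signal stands in for it here). The windowed-sinc design, the 255-tap length, and the names `lowpass_fir` and `tone_level` are illustrative choices, not something from this thread.

```python
import numpy as np

def lowpass_fir(signal, sample_rate, cutoff_hz=4000.0, num_taps=255):
    """Low-pass filter `signal` with a Hamming-windowed sinc FIR filter."""
    # Ideal low-pass impulse response, centred so the filter is linear phase.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    fc = cutoff_hz / sample_rate  # cutoff as a fraction of the sample rate
    taps = 2 * fc * np.sinc(2 * fc * n)
    taps *= np.hamming(num_taps)
    taps /= taps.sum()  # unity gain at DC
    # mode="same" keeps the output aligned with the input (odd tap count).
    return np.convolve(signal, taps, mode="same")

# Synthetic stand-in for a lossless file: a 1 kHz tone plus an 8 kHz tone.
sample_rate = 44100
t = np.arange(sample_rate) / sample_rate
reference = np.sin(2 * np.pi * 1000 * t) + np.sin(2 * np.pi * 8000 * t)
band_limited = lowpass_fir(reference, sample_rate)

def tone_level(x, freq_hz):
    """Magnitude of the FFT bin nearest `freq_hz`, normalised by length."""
    spectrum = np.abs(np.fft.rfft(x)) / len(x)
    return spectrum[int(round(freq_hz * len(x) / sample_rate))]
```

Here `reference` plays the role of the target and `band_limited` the degraded input; the 8 kHz tone is clearly audible in one and missing from the other, so a working reconstruction should be easy to hear.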




Re: MP3 High Frequency Reconstruction Help

Reply #2
I haven't listened to a lot of low-bitrate MP3s but it seems like the loss of high frequencies isn't the most noticeable issue.

An exciter effect can "regenerate" lost higher frequencies but it's only an approximation.   You could probably use other processing to "fill-in" other frequencies, and it shouldn't be that hard to match the spectrum of the lossless (or less lossy) format but I'd guess you'd only make the sound worse.  

Quote
using Machine Learning techniques.
Of course it's impossible to know what information was thrown-away.   You (or somebody) might be able to make a good "guessing machine" but I don't believe anyone has achieved that yet.   And a few bad guesses in a song/program could sound horrible.  

Some added noise & distortion should give you a "better spectrum" at the expense of degraded sound quality. ;) 

Re: MP3 High Frequency Reconstruction Help

Reply #3
Quote
I see two different problems here. 

First, the difference in high frequency content you are trying to reconstruct is small, so if everything works you're probably not going to hear much difference because there isn't much difference to hear.

Second, from the coherence function it looks to me like you're not even close to reconstructing the actual signal, so it seems like your approach isn't working. 

The first is pretty easy to fix. Pick test signals where there is some meaningful difference in frequency content to reconstruct. For example, take a lossless file and low-pass filter it at 4 kHz. Then you can listen to your reconstruction and hear whether it's working. For the second (figuring out how to do what you want), I can't really comment since I have no idea what you're doing.





Yes, the coherence function is exactly my problem, but my real issue is that I do not know how to optimize it. I have managed to achieve a decent amount of similarity between the STFTs, and hence between all the other spectra apart from coherence. I cannot figure out why coherence remains poor, since the MSE between the time-series representations of the upscaled and high-bitrate waves is on the order of 0.05, and the coherence is calculated from the time-series representation itself. What am I missing here?
To put it in simple terms, I am trying to minimize the MSE between an array of all ones (perfect coherence) and coherence(high, upscaled). If the MSE becomes zero, the upscaled audio is perfectly coherent with the high-bitrate audio. The problem is that I cannot use the Welch-based implementation provided by scipy for this minimization, since it is not differentiable. So I was wondering whether there is any other way (one that supports derivatives) to calculate the coherence between the two audio waves.
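A possible starting point, sketched in NumPy: Welch-style magnitude-squared coherence is only framing, windowing, FFTs, averaging, and a division, so it can be written without scipy. The same array operations have counterparts in autodiff frameworks, so porting this should give a differentiable loss; that port is an assumption and is not shown here. The function names, frame length, and hop are hypothetical, and unlike scipy's default this sketch does not detrend each segment.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames of length frame_len."""
    num_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(num_frames)[:, None]
    return x[idx]

def welch_coherence(x, y, frame_len=256, hop=128, eps=1e-12):
    """Magnitude-squared coherence averaged over Hann-windowed frames."""
    window = np.hanning(frame_len)
    X = np.fft.rfft(frame_signal(x, frame_len, hop) * window, axis=-1)
    Y = np.fft.rfft(frame_signal(y, frame_len, hop) * window, axis=-1)
    # Welch averaging: mean over frames of the auto- and cross-spectra.
    pxx = np.mean(np.abs(X) ** 2, axis=0)
    pyy = np.mean(np.abs(Y) ** 2, axis=0)
    pxy = np.mean(np.conj(X) * Y, axis=0)
    # Scale factors cancel in the ratio, so none are applied.
    return np.abs(pxy) ** 2 / (pxx * pyy + eps)

def coherence_loss(target, estimate):
    """MSE between the coherence and an all-ones vector (perfect coherence)."""
    return np.mean((1.0 - welch_coherence(target, estimate)) ** 2)
```

The averaging over frames matters: with a single frame the estimate is identically one. Coherence is also unchanged by any linear time-invariant filtering of either signal, so a loss of zero does not by itself guarantee that the two signals sound the same.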

Re: MP3 High Frequency Reconstruction Help

Reply #4
Quote
I haven't listened to a lot of low-bitrate MP3s but it seems like the loss of high frequencies isn't the most noticeable issue.

An exciter effect can "regenerate" lost higher frequencies but it's only an approximation.   You could probably use other processing to "fill-in" other frequencies, and it shouldn't be that hard to match the spectrum of the lossless (or less lossy) format but I'd guess you'd only make the sound worse.  

Quote
using Machine Learning techniques.
Of course it's impossible to know what information was thrown-away.   You (or somebody) might be able to make a good "guessing machine" but I don't believe anyone has achieved that yet.   And a few bad guesses in a song/program could sound horrible.  

Some added noise & distortion should give you a "better spectrum" at the expense of degraded sound quality. ;) 

I am actually doing research on this, trying to find out whether ML techniques can recover the lost data, but I have been stuck here for a while now. The articles I have read and the code I have gone through from the LAME library led me to believe that the deletion of higher frequencies was the biggest problem with the non-recoverable nature of MP3 lossy compression. I do know that there are other methods to upscale and achieve a decent spectrum, but I wanted to figure out why exactly it is so hard to achieve this restoration. That led me to this point, where I have realized that just reconstructing the higher frequencies doesn't do the job. If it doesn't, then what does? What factors should I be looking at to get a reconstruction that is audibly closer to the original?

Re: MP3 High Frequency Reconstruction Help

Reply #5
Quote
Yes, the coherence function is exactly my problem, but my real issue is that I do not know how to optimize it. I have managed to achieve a decent amount of similarity between the STFTs, and hence between all the other spectra apart from coherence. I cannot figure out why coherence remains poor, since the MSE between the time-series representations of the upscaled and high-bitrate waves is on the order of 0.05, and the coherence is calculated from the time-series representation itself. What am I missing here?

What is the MSE between your original two files (before you applied the processing)?  How much are you actually improving the error?
  
Quote
The articles I have read and the code I have gone through from the LAME library led me to believe that the deletion of higher frequencies was the biggest problem with the non-recoverable nature of MP3 lossy compression.

Not sure where you got that idea.  Deletion of higher frequencies is not a big problem for lossy compression, or even really a problem at all.  You can test this easily yourself.  You have the higher frequency spectrum for those files.  Add it back exactly and listen to what it sounds like. 
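One way to run this test, as a minimal sketch: take everything above a crossover frequency from the high-bitrate decode and everything below it from the low-bitrate decode. It assumes both files are already decoded to equal-rate, time-aligned NumPy arrays; `splice_high_band` and `crossover_hz` are hypothetical names, and the crossover should be set to wherever the low-bitrate spectrum cuts off.

```python
import numpy as np

def splice_high_band(low_quality, high_quality, sample_rate, crossover_hz):
    """Keep low_quality below crossover_hz and take high_quality above it."""
    n = min(len(low_quality), len(high_quality))
    low_spec = np.fft.rfft(low_quality[:n])
    high_spec = np.fft.rfft(high_quality[:n])
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    # Bin-by-bin selection: exact high band, untouched low band.
    spliced = np.where(freqs >= crossover_hz, high_spec, low_spec)
    return np.fft.irfft(spliced, n=n)
```

The result is the best case any high-frequency reconstruction could reach, so listening to it against the plain low-bitrate decode shows how much there is to gain from the missing band alone.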

Re: MP3 High Frequency Reconstruction Help

Reply #6
Quote
The articles I have read and the code I have gone through from the LAME library led me to believe that the deletion of higher frequencies was the biggest problem with the non-recoverable nature of MP3 lossy compression.
It's the easiest thing to see in the spectrum.  ;)

Re: MP3 High Frequency Reconstruction Help

Reply #7
Quote
What is the MSE between your original two files (before you applied the processing)?  How much are you actually improving the error?

The MSE between the low and high files is 0.002, and between high and upscaled it is 0.001. I agree there is not a lot of difference here.

Quote
Not sure where you got that idea.  Deletion of higher frequencies is not a big problem for lossy compression, or even really a problem at all.  You can test this easily yourself.  You have the higher frequency spectrum for those files.  Add it back exactly and listen to what it sounds like.
There is clearly an audible difference between the high- and low-bitrate files, especially during segments with high-pitched cymbals. This is even more noticeable between a 64 kbps and a 160 kbps file. If you are saying that the higher frequencies aren't really the problem, then what is? After all, it is lossy, so I assume something must be getting lost or deleted.

Re: MP3 High Frequency Reconstruction Help

Reply #8
Quote
What is the MSE between your original two files (before you applied the processing)?  How much are you actually improving the error?

The MSE between the low and high files is 0.002, and between high and upscaled it is 0.001. I agree there is not a lot of difference here.

You said 0.05 for the upsampled file before, so processing makes it worse?

Quote
Not sure where you got that idea.  Deletion of higher frequencies is not a big problem for lossy compression, or even really a problem at all.  You can test this easily yourself.  You have the higher frequency spectrum for those files.  Add it back exactly and listen to what it sounds like.
There is clearly an audible difference between the high- and low-bitrate files, especially during segments with high-pitched cymbals. This is even more noticeable between a 64 kbps and a 160 kbps file. If you are saying that the higher frequencies aren't really the problem, then what is? After all, it is lossy, so I assume something must be getting lost or deleted.

 Lossy compression generally works by performing time/frequency analysis and then adaptively quantizing each bin in the distribution based on its calculated audibility with the goal of making all error equally (in)audible, so for a good codec artifacts are not specific to any one frequency band.  If you found a codec where the problem was too much quantization error at high frequencies the fix would be simple:  adjust the encoder to allocate more bits to higher frequencies until the errors were perceived as more evenly distributed.  Most remotely usable codecs will do this by default since if they don't they won't work very well.

MP3 has well-known limitations in how it performs time/frequency analysis that make it encode transients poorly, so signals are often displaced in time (so-called pre-echo).  This is probably what you're hearing on symbols.  Part of the impulse from the symbols is reconstructed too early, which is pretty easy to notice since it sounds strange.
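A rough way to see this effect numerically, sketched on synthetic data: measure the energy just before a transient's onset in the reference and in the coded signal. Everything here is an illustrative assumption, including the function names, the 1024-sample block (not MP3's actual framing), and the added noise, which is only a crude stand-in for quantization error spread across a transform block.

```python
import numpy as np

def short_time_energy_db(x, frame_len=64, floor=1e-12):
    """Energy of consecutive non-overlapping frames, in dB."""
    num_frames = len(x) // frame_len
    frames = x[: num_frames * frame_len].reshape(num_frames, frame_len)
    return 10 * np.log10(np.mean(frames ** 2, axis=1) + floor)

def pre_onset_energy_db(x, onset_sample, lookback=512, floor=1e-12):
    """Mean energy (dB) in the `lookback` samples just before the onset."""
    segment = x[max(0, onset_sample - lookback): onset_sample]
    return 10 * np.log10(np.mean(segment ** 2) + floor)

# Reference: silence, then a decaying noise burst starting at `onset`.
rng = np.random.default_rng(0)
length, onset = 8192, 4600
clean = np.zeros(length)
decay = np.exp(-np.arange(length - onset) / 500)
clean[onset:] = rng.standard_normal(length - onset) * decay

# Stand-in for coding error: low-level noise over the whole block that
# contains the onset, so part of it lands before the transient starts.
block_len = 1024
block_start = (onset // block_len) * block_len
coded = clean.copy()
coded[block_start: block_start + block_len] += 0.05 * rng.standard_normal(block_len)
```

Comparing `pre_onset_energy_db(coded, onset)` with `pre_onset_energy_db(clean, onset)`, or plotting `short_time_energy_db` for both, shows energy appearing ahead of the attack in the coded signal only. On real material the onset position has to be found first, and a soft attack in the original can look similar.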

Re: MP3 High Frequency Reconstruction Help

Reply #9
Quote
You said 0.05 for the upsampled file before, so processing makes it worse?

No, that was actually the MSE between the two STFTs. The time-series MSE is 0.001 for high and upscaled, and 0.002 for high and low (so slightly better than it was before). I must have mistyped that before, sorry.

Quote
Lossy compression generally works by performing time/frequency analysis and then adaptively quantizing each bin in the distribution based on its calculated audibility with the goal of making all error equally (in)audible, so for a good codec artifacts are not specific to any one frequency band.  If you found a codec where the problem was too much quantization error at high frequencies the fix would be simple:  adjust the encoder to allocate more bits to higher frequencies until the errors were perceived as more evenly distributed.  Most remotely usable codecs will do this by default since if they don't they won't work very well.

MP3 has well known limitations in how it performs time/frequency analysis that make it encode transients poorly, so often signals are displaced in time (so called pre-echo).  This is probably what you're hearing on symbols.  The part of the impulse from the symbols is reconstructed too early, which is pretty easy to notice since it sounds strange. 
So the real problem is that there is no way of reversing this signal displacement (pre-echo)? Can I visualize (or get some representation of) this displacement in any way?

Re: MP3 High Frequency Reconstruction Help

Reply #10
Yep, lots of pictures on Google. As already said, altered high-frequency content is not much of an issue on decent-bitrate MP3 (and at low bitrates it was alleviated by Spectral Band Replication in the mp3PRO format); artifacts like pre-echo and ringing are always worse.

@saratoga you misspelled "cymbals" as "symbols" a couple of times in your previous post ;)

Re: MP3 High Frequency Reconstruction Help

Reply #11
Quote
Yep, lots of pictures on Google. As already said, altered high-frequency content is not much of an issue on decent-bitrate MP3 (and at low bitrates it was alleviated by Spectral Band Replication in the mp3PRO format); artifacts like pre-echo and ringing are always worse.

@saratoga you misspelled "cymbals" as "symbols" a couple of times in your previous post ;)
Spectral Band Replication is exactly what I am trying to do, but using machine learning instead, just as a research project to see if it is achievable. Is there any metric (like coherence) that I can use to recognize these artifacts, or is it just perceptual?
By the way, when I said visualizing, I meant: where can I find this phenomenon in the audio? Will it be visible in the time-series representation, or will I have to look for some specific disturbances in the spectral representations? I got the answer here

Re: MP3 High Frequency Reconstruction Help

Reply #12
Do Google for pictures of pre-echo; it can be seen on both waveforms and spectrograms. It is a time-domain artifact, but after encoding it is not really possible to tell whether something is just a sound with a longer attack or indeed an artifact (you may end up sharpening all the transients, even the ones that are intentionally soft). Ringing is a frequency-domain artifact and seems even harder to identify reliably...

Re: MP3 High Frequency Reconstruction Help

Reply #13
Okay, then I guess it's just a perceptual thing rather than a metric I could optimize to get better results. Thanks for the clarification!

Re: MP3 High Frequency Reconstruction Help

Reply #14
Think about subtracting encodes from original files so you get the actual differences for the training...

Re: MP3 High Frequency Reconstruction Help

Reply #15
Yes I will try that too. Thanks for the advice :)

Re: MP3 High Frequency Reconstruction Help

Reply #16
Quote
Think about subtracting encodes from original files so you get the actual differences for the training...
That may be useful but it can be very misleading...     The "sound of the difference" isn't the same as "the difference in sound".  For example, MP3 adds silence to the beginning of the file.     Even without MP3 compression, if you add a few milliseconds of silence (creating a delay), subtraction will indicate a "big loud" difference file even though there is really no difference in the sound.    You may recognize the comb filtering as a result of delay but you don't hear the actual delay.

So make sure you time-align before subtracting!

Or if you invert the polarity ("phase") that doesn't change the sound either but if you subtract, you are "subtracting a negative" which results in a 6dB volume increase (and possibly clipping).   

Those are just a couple of simple examples and there may be other slight phase shift issues or other changes that create a sample-data difference without creating a sound difference.
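A minimal sketch of handling the two simple cases above before subtracting, assuming both signals are NumPy arrays at the same sample rate. It estimates a constant whole-sample delay and a polarity flip from the cross-correlation peak; `align_and_subtract` and `max_lag` are hypothetical names, and sub-sample delays or frequency-dependent phase shifts are not handled.

```python
import numpy as np

def align_and_subtract(reference, decoded, max_lag=4096):
    """Remove a constant delay and polarity flip from `decoded`, then subtract."""
    n = min(len(reference), len(decoded))
    ref, dec = reference[:n], decoded[:n]
    # Cross-correlate via FFT, zero-padded to avoid circular wrap-around.
    size = 2 * n
    xcorr = np.fft.irfft(
        np.fft.rfft(dec, size) * np.conj(np.fft.rfft(ref, size)), size
    )
    # Non-negative lags sit at the start of xcorr, negative lags at the end.
    lags = np.concatenate([np.arange(0, max_lag + 1), np.arange(-max_lag, 0)])
    candidates = np.concatenate([xcorr[: max_lag + 1], xcorr[-max_lag:]])
    best = np.argmax(np.abs(candidates))
    lag = int(lags[best])  # positive: decoded is delayed relative to reference
    polarity = 1.0 if candidates[best] >= 0 else -1.0
    if lag >= 0:
        ref_part, dec_part = ref[: n - lag], dec[lag:]
    else:
        ref_part, dec_part = ref[-lag:], dec[: n + lag]
    # Residual after undoing delay and polarity.
    return lag, polarity, ref_part - polarity * dec_part
```

For a copy that is only delayed and inverted, the residual comes out as zero, whereas a naive sample-by-sample subtraction would report a large difference.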

And...   We know it's lossy so there WILL be "data" differences...    But the idea with (high-quality) lossy compression is to minimize or eliminate the audible differences.

DTS HD Master Audio actually does save the difference file which is used to get lossless playback.     This makes it compatible so people with a regular DTS decoder get the lossy audio and if you have the newer decoder you get lossless.

 