Regarding this topic:
My question is: why do we lose audio quality when we re-transcode it? I thought it should have the same quality every time.
My assumption is that when we transcode audio, we first sample it. For example:
16 bit/sample × 44100 samples/second = 705600 bit/second
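As a sanity check on the arithmetic, the uncompressed bit rate is just bit depth times sample rate (the helper name here is made up for illustration, nothing MP3-specific):

```python
# Uncompressed (PCM) bit rate: bit depth times sample rate,
# times the channel count for stereo.
def pcm_bitrate(bits_per_sample, sample_rate, channels=1):
    """Bits per second needed to store raw PCM samples."""
    return bits_per_sample * sample_rate * channels

print(pcm_bitrate(16, 44100))     # 705600 (the figure above, mono)
print(pcm_bitrate(16, 44100, 2))  # 1411200 (stereo CD audio)
```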
We take those samples (red arrows) as shown here:
The samples (red arrows) are spaced with Ts = 1/Fs where Fs is our sampling frequency (For example 44100 samples/second).
We use bits to store those samples; the number of bits per sample is called the bit depth (for example, 16 bit/sample).
Here is the part I don't understand:
If we reconstruct the samples with the track-and-hold method (also called sample and hold), we should be able to get the same quality of audio when we re-transcode it.
If we are only storing the samples, then there shouldn't be any loss the next time we sample the audio again at the same quality.
For example, when we convert MP3 128 kbps to MP3 128 kbps, we have the samples and we know where they are. We take the same samples again and again.
We only have to take samples from where those red bars start, which takes us to the same red bars again.
We also had exactly the same amount of memory (number of bits) to store those samples in the first place, so there shouldn't be any problem with the number of bits or memory.
Here are my assumptions (I'm not sure):
Although we target a constant bit rate (let's say 128 kbps), we would have to change the cutoff frequency, because otherwise the numbers needed for the bit depth don't add up:
bit_depth bit/sample × 44100 samples/second = 128000 bit/second
Then bit_depth would be about 2.9, which can't be right; it should be an integer.
I guess that since we don't have a perfect brick-wall filter, every time we re-transcode the audio we filter it and lose some high-frequency data. And if that is true, what type of low-pass filter do we use (Hann window, Hamming window, Blackman window, etc.)?
[Type of windows: https://en.wikipedia.org/wiki/Window_function]
It's good that you're trying to make sense of all of it, but you're looking in entirely the wrong place, I'm afraid. Lossy compression isn't just about lowpassing or resampling. It effectively adds new distortion and new noise to make the compression work better.
If you feed these small distortions to the encoder a second time, it takes the new data, which now contains the artifacts, and tries to preserve them in some way, because the encoder can't distinguish "original audio" from "compression artifacts". It's all data. Your distortions have now been encoded with some more distortion.
Repeat as many times as you like.
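A toy way to see this: model the codec as rounding to a coarse grid. If a second pass made exactly the same decisions as the first, it would change nothing; as soon as the decisions differ (different grid, shifted frames, artifacts treated as signal), new error lands on top of the old. The `quantize` helper and the step sizes are invented purely for illustration:

```python
def quantize(samples, step):
    """Round every sample to a grid of the given step size --
    a crude stand-in for everything lossy a real encoder does."""
    return [round(s / step) * step for s in samples]

x = [0.07, -0.32, 0.58, 0.91, -0.66]

once = quantize(x, 0.1)
twice_same = quantize(once, 0.1)    # same decisions: nothing changes
twice_diff = quantize(once, 0.13)   # different decisions: new error

print(once == twice_same)  # True  -- identical decisions would be free
print(once == twice_diff)  # False -- real re-encodes distort the distortion
```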
High-frequency content is usually pretty noisy, hard to compress, and often masked by the rest of the music anyway, which is the only reason it's cut off. It's like trimming fat before applying your meat hammer.
Each encoding step retransforms the audio and then requantizes it, so more and more error accumulates. If the software used isn't fully gapless, the transform boundaries also shift, which makes things worse.
First, I would like to correct you on the sampling part.
Your first graphic, the one with arrows, reflects quite faithfully what sampling really is, although it probably needs some explanation:
Sampling (capturing it) is commonly described as taking the value of a signal in a periodic fashion. That does not mean that you reconstruct it by putting out those periodic values.
Concretely, sampling is taking the value of a signal at an infinitesimal moment in time, and virtually storing the value "0" until the next sampling instant.
In other words, what gets stored are Dirac pulses of the signal at the periodic instants. And Dirac pulses tend to be represented as arrows ;)
Then, you mention sample and hold. Sample and hold is the worst way to regenerate a continuous signal from a periodically sampled one. It is sometimes used in software players (mostly in the 90s, when CPUs were slow) and when graphically representing a digital signal where this "defect" doesn't matter.
Any properly designed resampler (even if not of the highest quality) interpolates, filters, or otherwise reconstructs the signal: cubic splines, zero-order hold followed by a step filter (not to be confused with sample and hold), or a sinc-function approximation.
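To make the difference concrete (a deliberately crude sketch; real resamplers approximate a windowed-sinc filter, not either of these toys):

```python
def sample_and_hold(samples, factor):
    """The 'staircase' reconstruction: just repeat each sample."""
    out = []
    for s in samples:
        out.extend([s] * factor)
    return out

def linear_interpolate(samples, factor):
    """A step up: straight lines between samples. Still far from
    the sinc reconstruction an actual resampler approximates."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for i in range(factor):
            out.append(a + (b - a) * i / factor)
    out.append(samples[-1])
    return out

ramp = [0.0, 1.0, 2.0]
print(sample_and_hold(ramp, 2))     # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
print(linear_interpolate(ramp, 2))  # [0.0, 0.5, 1.0, 1.5, 2.0]
```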
As such, you should mostly forget about the graphic with straight lines. It does not exist, except when doing it wrong.
Now, that's only the part that we could call "lossless" audio. You need to understand what lossy audio is in order to understand what type of signal you can obtain from it.
The first thing to know is that most lossy codecs do not look at the samples we are talking about here, but instead at a transformed signal that represents the frequencies of that signal. The basic concept is the Fourier transform, although codecs use different transforms depending on what works best for each codec. A Fourier transform should be lossless (i.e., one should be able to convert from samples to frequencies and back), but by itself a Fourier transform does not gain a single bit of compression.
Once in the frequency domain, the codec analyzes the strengths of the bands, applies some rounding, modifies the data according to how human auditory perception works, applies some other conversions that might help compression, and finally, in most cases, applies a lossless compression to that data.
When decoding that signal, the codec decompresses the data, gets back the frequencies as they were encoded, and transforms those frequencies back into samples. But these frequencies are not exactly the same as the original ones, so the samples are not the same either, even though both the frequencies and the samples might look similar when not examined in detail and, most importantly, can sound the same or almost the same.
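That round trip can be sketched minimally with a plain DFT as the "transform" and coarse rounding of the coefficients as the "lossy" step (real codecs use an MDCT plus a psychoacoustic model, so this only shows the principle; the `step` parameter is an invented stand-in for the quantizer):

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (fine for a 32-sample toy)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(coeffs):
    """Inverse DFT; the input is real, so keep only the real part."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def toy_codec(x, step):
    """Transform to frequencies, round each coefficient to a grid
    of size `step`, transform back."""
    quantized = [complex(round(c.real / step) * step,
                         round(c.imag / step) * step) for c in dft(x)]
    return idft(quantized)

x = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]

fine = toy_codec(x, step=1e-9)   # almost no rounding -> near-perfect
coarse = toy_codec(x, step=5.0)  # heavy rounding -> clearly altered samples

print(max(abs(a - b) for a, b in zip(x, fine)) < 1e-6)    # True
print(max(abs(a - b) for a, b in zip(x, coarse)) > 0.01)  # True
```

The transform itself is invertible; all the loss comes from the rounding of the coefficients, exactly as described above.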
Now, what effect does another compression of these already-encoded samples have?
The first step, converting the samples to frequencies, is still the same, and you should obtain the same frequencies that were stored by the first encode.
Then come the lossy operations. You might think that since the codec already threw away information, it doesn't need to throw away anything else, but most codecs don't work that way. The filters applied might introduce some difference, and the rounding might have added content that didn't exist before, which the codec may now decide to keep, probably increasing that initial rounding error.
Also, saratoga mentions another thing, which is whether the codec is sample-exact: the codec might generate more samples than the original, so when re-encoding, the analysis is done over a different group of samples than in the first encoding, and so on.
It should be possible to build a lossy encoder that does not degrade after multiple passes, but to do so the codec would probably be limited in what it could do to reduce the size of the signal or to maintain the audio quality.
I think what everybody is trying to say here is that we still add distortion or noise in the high frequencies (as dhromed mentioned), and when we requantize the audio again we add some error again (as saratoga mentioned).
I was after the reason this error forms and where it comes from after the first time we encode the audio.
[JAZ], Thank you for correcting me.
Now, when is that low-pass filter applied?
I guess a low-pass filter is applied at the beginning of the second encoding pass, or, to be more accurate, when decoding the encoded audio.
Only the samples (Dirac pulses with their magnitude and Fs) are stored. Reconstruction through interpolation and filtering only happens when we want to convert the digital signal to an analog one. I don't think we store the reconstructed audio even with properly designed resamplers. How could we store it in bits, and why should we? After all, not only would it need more space, it isn't digital anyway.
Surely when we want to hear the audio we use a DAC with a more advanced method than sample and hold, but we do not store the result of the reconstruction. If the simple, plain sample-and-hold method were used for reconstruction, we would get the same results every time. Surely the decoder uses a more sophisticated method.
I guess we only have the samples in an MP3 file; then we reconstruct the audio when we want to hear it or encode it a second time.
But the distortion or noise that was mentioned is caused by a low-pass filter applied at the beginning of the second encoding pass, and that low-pass filter (the reconstructor) is not perfect and works with a limited number of samples, thus adding some interpolated values to the signal.
And I agree with you on the last paragraph: if we didn't use that low-pass filter and didn't try to reconstruct the signal on the second run, we would obtain the same results every time.
I guess we only have the samples on an MP3 file
No, MP3 doesn't store this information. It looks like you described PCM, which is just regular uncompressed digital audio.
MP3 compression does not include a low-pass filter. Encoders can choose to include one prior to encoding, but that has nothing to do with MP3.
Let me try this again. You get progressively worse quality because each encoding/decoding cycle transforms, quantizes, requantizes, and then inverse transforms. Each time you do this, the error is compounded. It is really not much more complicated than that.
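This compounding can be simulated: treat one encode/decode cycle as "the input plus a little noise the codec couldn't represent" and repeat it. The noise level is an arbitrary illustrative number, not a real MP3 figure:

```python
import math
import random

def lossy_pass(samples, rng, noise_amp=0.01):
    """Toy model of one encode/decode cycle: output is the input
    plus a small error the codec could not represent exactly."""
    return [s + rng.uniform(-noise_amp, noise_amp) for s in samples]

def rms_error(a, b):
    """Root-mean-square difference between two equal-length signals."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

rng = random.Random(42)
original = [math.sin(2 * math.pi * t / 50) for t in range(500)]

signal = original
errors = []
for _ in range(25):
    signal = lossy_pass(signal, rng)
    errors.append(rms_error(original, signal))

# Each pass works on an already-noisy signal, so the error keeps growing.
print(f"after  1 pass:   {errors[0]:.4f}")
print(f"after 25 passes: {errors[-1]:.4f}")
```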
Maybe this can get him pointed in the right direction...
I don't think he understands what an FFT is, especially in light of his presentation of the topic of windowing; and it appears to me that he still believes that mp3 stores time domain data, despite being told otherwise.
it appears to me that he still believes that mp3 stores time domain data, despite being told otherwise.
Yes, I was wrong on that part.
I was also wrong in thinking that the whole encoding process is done in the time domain (that is why I kept mentioning windowing).
I get it now!
The loss in quality after 100 recompressions is due to repeated application of the psychoacoustic model followed by quantization error.
No time-domain samples are stored here.
There is no convolution with a window in the time domain.
First the algorithm divides the audio into smaller pieces and applies the MDCT (the MDCT is invertible, so no loss there), alongside an FFT (which takes the signal into the frequency domain with enough resolution); then the psychoacoustic model throws out some of the data (which reduces the file size and acts like a low-pass filter), and quantization (inevitable after producing a lot of fractional numbers) does the rest.
Sorry for all the trouble, I just wanted to make it sensible with what I knew.
This was also useful: