
What is the current status of wavpack 4?

Reply #25
While making a post that mentions WavPack lossy in the Musepack forum, I think I got a better feeling for how WavPack works.

(Please correct me if I'm wrong, bryant)

1. Generate a single prediction based on numerous preceding samples. This guesses the value of the current sample. (A later post gave a simple example with a linear prediction.)

2. Work out the difference between the actual sample and the prediction, and store this "error" (the residual), allowing perfect reconstruction.

3. Pack the stored errors as compactly as possible (e.g. Huffman-style coding, like Zip) to make use of redundancies.

With the lossy hybrid mode, after the prediction is made, you can choose to knock the least significant few bits off the error. Of course, error could build up, since each sample is predicted from previous ones and those are now slightly incorrect, so it must be a little more complicated than that, presumably ensuring that the net error stays close to zero over a reasonable time period. Reducing the error term to fewer bits creates more redundancy and makes the packing more efficient, so reducing the bitrate. The removed bits are compressed separately in the correction file, to enable perfect reconstruction once again.
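To make that concrete, here's a rough Python sketch of the loop I have in mind. The two-tap predictor and the fixed number of dropped bits are just illustrative guesses on my part, not what WavPack actually does; note how the next prediction is based on the lossy reconstruction, so the error can't accumulate.

Code:
def encode_hybrid(samples, drop_bits=4):
    """Toy predict / residual / truncate loop (not WavPack's real algorithm)."""
    lossy, correction = [], []
    prev1 = prev2 = 0                       # history as the *decoder* will see it
    for s in samples:
        prediction = 2 * prev1 - prev2      # simple linear extrapolation
        error = s - prediction              # residual to be stored
        coarse = error >> drop_bits         # knock the least significant bits off
        correction.append(error - (coarse << drop_bits))  # goes in the correction file
        lossy.append(coarse)                # would be entropy-coded in practice
        # predict the next sample from the lossy reconstruction, so the small
        # errors can't build up from sample to sample
        prev2, prev1 = prev1, prediction + (coarse << drop_bits)
    return lossy, correction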

Now, I can see that there's only one prediction for each sample, and the simple mathematical error (a subtraction) is all that remains to be stored.

So, I can imagine how it's quite possible to look at the error terms for a series of samples and round them in such a way that the correction file contains noise that follows the shape of the ATH curve pretty well but has an average amplitude of zero over a certain timescale, causing no DC shift.

From the udial.ape thread (test your soundcard for clipping), where someone damaged his tweeters, it seems one ought to be careful about putting too much ultrasonic noise content in files (though in that case it was full-scale ultrasound against a quiet audible tone, which led him to turn the volume up!).

Soft ATH noise shaping might have lower ultrasonic content, for example.

Then again, the decoder could be required to apply a lowpass filter to protect the user if we're pushing to very low bitrates with lots of error in the ultrasonic range (but not so much as to cause regular clipping). The filter would be turned off, of course, if the correction file is being used to restore lossless playback, but in lower-bitrate lossy modes a flag in the lossy stream could indicate the attenuation required for frequencies above, say, 19-20 kHz. This would safeguard against the potential tweeter risk without breaking the predictor.

It is also plausible to shape the correction file noise in different ways, e.g. to follow some calculated frequency-dependent masking threshold based on simple psychoacoustics, but this would require some frequency analysis. Maybe the analysis could be greatly simplified compared with full-blown lossy coders (e.g. using the RMS amplitude of sub-bands instead of using transforms) and the masking threshold could be ultra-conservative, but this is a lot more work than a consistent noise shape.

It doesn't seem (on limited evidence) that splitting the signal into reconstructable bands before running the predictor is viable. It might be possible to use such a method to shape the noise, however.

It does seem plausible to make a rough measurement of the loudness (e.g. the RMS value of the signal is very easy) and modulate the allowed noise that way, with no consideration of the frequency-dependence of masking, simply the loudness. That might make a reasonably easy "standard" lossy mode, which remains audibly transparent for the vast majority of samples at non-painful volume.
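Something along these lines is all I'm imagining; the block length and the 40 dB noise-to-signal margin below are plucked out of the air purely for illustration.

Code:
import math

def allowed_noise_from_loudness(samples, block_len=1024, margin_db=40.0):
    """Allowed noise amplitude per block, tracking only the signal's RMS loudness."""
    out = []
    for i in range(0, len(samples), block_len):
        block = samples[i:i + block_len]
        rms = math.sqrt(sum(x * x for x in block) / len(block))
        out.append(max(rms, 1.0) * 10 ** (-margin_db / 20.0))   # e.g. 40 dB below RMS
    return out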

Clipping might be a plausible concern if strong noise shaping has a high enough amplitude in the high frequencies.

Just some further thoughts on the subject, which hopefully narrow my contribution to this thread down to stuff that's reasonably viable to implement (rather than some of my sub-band-with-separate-predictor ideas, which don't look too promising).

What is the current status of wavpack 4?

Reply #26
Cheers bryant B)
I know an open source project can become quite a lot of pressure once it starts becoming popular; you should read the flames we receive when a new eMule version screws something up and people lose their partial Anime XXX downloads.

Best wishes to your family, stay away from public transportation, and I hope you find a job (one that allows some free time for WavPack though :) )

What is the current status of wavpack 4?

Reply #27
sony666:
Yeah, I have thought about what kind of e-mail I might receive if a WavPack bug trashed all of somebody's original music. In fact, I could even imagine a rude knock at the door! Somehow I don't think that "Dude, didn't you read the disclaimer!?" would make them very happy. 

DickD:
Your description of WavPack's lossless mode is pretty much on the mark. The predictor first makes a prediction based on some number of previous samples. In the case of the "high quality" mode this is a polynomial applied to the last 16 samples; however, because the polynomial terms adapt to the changing audio, in a sense it's really looking at hundreds of previous samples. The difference between the prediction and the actual sample is called the error (or residual) and this is stored using Rice Coding (which is a kind of Huffman code for Gaussian numbers).
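For anyone who hasn't seen Rice coding before, a toy version of the idea looks like this. The zig-zag sign mapping and the fixed parameter k here are only illustrative, and WavPack's actual entropy coder differs in detail.

Code:
def zigzag(n):
    """Map signed residuals to non-negative integers: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (n << 1) if n >= 0 else (-n << 1) - 1

def rice_encode(value, k):
    """Return the bit-string for one residual with Rice parameter k."""
    u = zigzag(value)
    quotient, remainder = u >> k, u & ((1 << k) - 1)
    bits = '1' * quotient + '0'                    # unary quotient plus terminator
    if k:
        bits += format(remainder, 'b').zfill(k)    # k low-order bits in binary
    return bits

# e.g. rice_encode(-3, 2) -> '1001'; small residuals get short codes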

In the hybrid mode the user's kbps number is converted to a number of bits per sample (for example 320 kbps = 3.63 bits/sample) and we only store the residual with as much resolution as we can given that average number of bits. So, if the error is running with an average magnitude of 100 and we are allowed 3.63 bits per sample, then we can store the errors with an accuracy of about +/-20. Note that if a big error comes along we use more bits to store that sample while samples close to zero require fewer bits, but every sample is stored with the same accuracy and we achieve the average bitrate. If a transient comes along and the average residual value goes up suddenly, we will store the first few with a lot of extra bits to maintain the accuracy, but then the exponentially lagging average will start going up and we will start storing with less and less accuracy until we hit the target bitrate again. When the average is falling (after the transient) we will be storing fewer bits because the average will be high (it always lags) and this will balance the extra bits we stored at the beginning. It's actually pretty interesting how it can maintain the average bitrate to within about 1% over the long term even though it's completely open-loop (no feedback).
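Roughly, the open-loop control above could be sketched like this. The decay constant and the ~1.3 bits of coding overhead are illustrative assumptions chosen so that an average magnitude of 100 at 3.63 bits/sample gives about +/-20, matching the example; they are not the numbers WavPack actually uses.

Code:
def allowed_error(residual_magnitudes, target_bits=3.63, decay=0.995, overhead=1.3):
    """Allowed quantization error per sample from a lagging average of |residual|."""
    avg, out = 1.0, []
    for m in residual_magnitudes:
        avg = decay * avg + (1.0 - decay) * m            # exponentially lagging average
        out.append(avg / 2 ** (target_bits - overhead))  # coarser as residuals grow
    return out

# a steady average magnitude of 100 settles at an allowed error of roughly 20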

So, the mode is essentially CBR (with a little "slop") and the only factors affecting the noise level are the target bitrate and the accuracy of the predictor. The reason that the "high" mode works so much better than the default mode on the Furious sample is simply because the predictor works better on the high frequency signals. In fact, the predictor is the only difference between the modes.

Which brings me to a clarification on one of your points. The noise really shouldn't be any more audible in quiet parts than in louder parts because the noise is always scaled to the signal (lower the level 6 dB and the added noise drops 6 dB) and at very low levels the coder will actually go lossless if it has enough bits to do so. What I think is that the noise is more audible when there's less going on in the music (more "air" around the instruments) and the noise is less audible when there's stuff going on all up and down the spectrum. In this way it really works the opposite from conventional codecs which have the worst time with complex music but shine with simple stuff because they can pour all bits into the "active" subbands. Perhaps den can comment on this as well.

Without filtering, the quantization noise added is perfectly flat in frequency and no dithering is required because the quantization size is always small compared to the residual size. I think it would be easy to implement the ATH noise shaping curves and it would be interesting to see if they lower the audibility of the noise. I don't think that ultrasonic noise would be a problem because it would always be much lower in amplitude than the signal (unless the predictor failed, I guess). I also agree with you that it would be possible to use simple subband level checking to both determine the optimum noise shaping algorithm and to implement a VBR mode to achieve a lower average bitrate for the same quality. The advantage of all this stuff is that it can be done solely on the encode side and therefore does not burden the decode side with aggressive CPU usage and can be implemented after the spec is complete.
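As a sketch of what the simplest kind of noise shaping looks like, a single error-feedback tap tilts the quantization noise toward the high end of the spectrum; a real ATH-shaped curve would need a longer feedback filter, and nothing below is taken from WavPack itself.

Code:
def quantize_shaped(residuals, step, weight=0.5):
    """Uniform quantizer with one error-feedback tap (noise tilted toward high freq.)."""
    out, prev_err = [], 0.0
    for r in residuals:
        target = r - weight * prev_err      # subtract a fraction of the last error
        q = round(target / step) * step     # plain uniform quantization
        prev_err = q - target               # error to be fed back next sample
        out.append(q)
    return out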

I have also thought some about actually using subband coding directly like you describe. One issue here is that to be efficient in lossless mode you cannot increase the number of samples that need to be encoded (this is not an issue in EQ where you sum up everything before you're done). The type of filters that I am familiar with that could work this way are the symmetrical type that split the band exactly in half: frequencies below half the Nyquist frequency go into the lower part and frequencies between half Nyquist and Nyquist go into the upper part. Then you throw out every other sample in both bands (because half are redundant) and you have the original number of samples. You can do this as many times as you like and generate 1-octave-wide bands all the way down to 20 Hz (or even 1 Hz), although I think that probably 4 bands would be the most that would be useful.
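Just to illustrate the sample-count bookkeeping (not the symmetric filters themselves), even a trivial integer "average and difference" split has the key property: two half-rate bands, the same total number of samples, and exact reconstruction.

Code:
def split(samples):
    """One lossless 'half the band, half the samples' step (assumes even length)."""
    low  = [(samples[i] + samples[i + 1]) >> 1 for i in range(0, len(samples), 2)]
    high = [ samples[i] - samples[i + 1]       for i in range(0, len(samples), 2)]
    return low, high                # two half-rate bands, same total sample count

def merge(low, high):
    """Exact integer reconstruction of the original samples."""
    out = []
    for s, d in zip(low, high):
        a = s + ((d + 1) >> 1)      # recover the first sample of the pair
        out.extend([a, a - d])      # then the second
    return out

# merge(*split(x)) == x for any even-length list of integer samples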

This would make it easier to move the quantization noise around where it couldn't be heard. For example, if you had little or no signal over 11 kHz you would have very little noise up there, and I think this is impossible to achieve without breaking into subbands. (I considered pre-emphasis in this case, but am not sure if it would work).

On the other hand, I have the concern that even though the filters can sum to recreate the signal losslessly, what happens when you encode a signal at the crossover point and have different quantization levels on either side of the divide? I am afraid that all of the problems of subband coding will come out and I'll lose the "characterless" nature of the noise that makes it so easy to live with.

At some point I would like to experiment with subband coding for my own curiosity, but I definitely don't want to start down the path of creating an inferior MPC! And my real interest is lossy encoding of high-resolution audio (like 24/96) and this is to some extent directly opposed to psychoacoustic modeling. After all, according to the current models with which I am familiar, the first step would be to downsample to 44.1!

Anyway, thanks for the input and I hope this clears things up a little and gives you some more ideas...

What is the current status of wavpack 4?

Reply #28
@sony666
Quote
I know an open source project can become quite a lot of pressure once it starts becoming popular; you should read the flames we receive when a new eMule version screws something up and people lose their partial Anime XXX downloads.

Best wishes to your family, stay away from public transportation, and I hope you find a job (one that allows some free time for WavPack though :) )


Certainly Wavpack is getting a bit more attention, but hopefully David doesn't have anything too much to worry about. His cat's appearance is getting quite well known though, so it may have to hide from public view, should any problems arise... 

@bryant
Quote
The noise really shouldn't be any more audible in quiet parts than in louder parts because the noise is always scaled to the signal (lower the level 6 dB and the added noise drops 6 dB) and at very low levels the coder will actually go lossless if it has enough bits to do so. What I think is that the noise is more audible when there's less going on in the music (more "air" around the instruments) and the noise is less audible when there's stuff going on all up and down the spectrum. In this way it really works the opposite from conventional codecs which have the worst time with complex music but shine with simple stuff because they can pour all bits into the "active" subbands. Perhaps den can comment on this as well.


I agree with the above. It's only in quiet sections with solo instruments that the noise becomes noticeable, particularly in the gaps between the individual notes being played. Once you get a few more instruments and/or some vocals on board, the noise gets covered and I can't usually hear it.

The hiss reminds me of that heard from cassette recordings with the Dolby NR switched off, just not as obvious. If you recall back to those days, the hiss would stand out at the beginning of a track, and perhaps appear in quiet solo sections mid track, but generally disappear the rest of the time, particularly with "busy" genres of music. Wavpack is similar, except that I find the hiss much less annoying than the cassette example with my music, even at the lowest available bit rate setting.

If I hadn't used other formats previously, and didn't ABX Wavpack lossy against the original, I wouldn't have even picked it up, as it is virtually identical to background circuit hiss you get from many audio devices just by turning up the volume without any signal playing.

Den.

 

What is the current status of wavpack 4?

Reply #29
Quote
Certainly Wavpack is getting a bit more attention, but hopefully David doesn't have anything too much to worry about. His cat's appearance is getting quite well known though, so it may have to hide from public view, should any problems arise... 

Yeah, especially since she runs the QA department! 

What is the current status of wavpack 4?

Reply #30
David,

From what I understood, the noise is not exactly scaled to the signal, but to the average residual (and how many samples do you use for calculating the average?). This should give exactly the effect Den hears (more noise in quiet parts right after louder parts). I'm not an expert, but I'd suggest scaling the error to the actual (or a few-sample average) sample magnitude and using non-uniform quantization (more bits for smaller values).

-Eugene
The  greatest  programming  project of all took six days;  on the seventh  day  the  programmer  rested.  We've been trying to debug the !@#$%&* thing ever since. Moral: design before you implement.

What is the current status of wavpack 4?

Reply #31
Eugene:
There are a couple reasons that I use the residual level rather than the signal level in determining the quantization level. The first is that I wanted to create something that was close to CBR, and this can only be done using the residual. Second, I found that the residual level is more closely tied to the masking properties of the signal (at least in most cases). For example, low frequencies and regular tones (as opposed to noise) are much worse at masking quantization noise, and these generate small residuals because they're more predictable. On the other hand, broadband noisy signals are very good at masking quantization noise, and these generate large residuals. However, you did give me an idea that by simply looking for cases where the original signal level was high compared to the residual, I might be able to detect those cases (like Furious) where the predictor is not working well.

Once I decided to use the residual for the quantization level, I needed to put in the averaging because there are times when the spectral characteristics of the signal suddenly change and the predictor takes some time to adjust. During these times the residual level spikes, which was actually one of the problems with the old WavPack lossy mode. The time constant of the averaging is less than 6 ms though, so I don't think it could be contributing to more audible noise after transients (it falls faster than most transients decay and so only lags by a tiny amount).

As for the non-uniform quantization, the old WavPack lossy mode also had that (because it seemed more intuitive to me at the time). What I discovered was that because the perceived noise level is based on the RMS value of the errors, a few big values really drive up the average. Even though it may not be intuitive, uniform quantization is the most efficient way of storing these values. In fact, this is why ADPCM is at such a disadvantage here. By restricting the coding of each sample to a fixed number of bits, you get a much higher RMS error level than with a variable-bit scheme, and you also get distortion. If you listen to the error generated by ADPCM you can clearly hear the music playing (and this, of course, means distortion). Listen to the difference signal from WavPack lossy and it's pure noise.

What is the current status of wavpack 4?

Reply #32
Thanks for the detailed response, David.

I'd read about Rice Coding before, probably on the Monkey's Audio site. Having pretty much Gaussian distributions of residuals is reassuring when you're trying to treat them in a noiselike manner.

The 16-sample span for the polynomial is useful info, as is the fact that the polynomial terms of the predictor adapt. Presumably the terms are stored somewhere every so often and it's a case of some sort of best-fit algorithm (e.g. least-squares) over a reasonable timescale.

For lossless, I can see that the exact previous 16 samples are known. For lossy, I presume that despite the inaccuracy of the previous 16 samples, the inaccuracies cancel out pretty well in aggregate so that the predictor of the next sample is pretty close to what it would have been had the previous 16 samples been stored losslessly, so the decoder is still pretty accurate. (Or perhaps, and more likely, you base the predictor on the previous 16 lossy samples in the first place so the reconstruction is sure to be as accurate as possible). I also presume that an exact sample value may be stored on a periodic basis to enable seeking or so that data corruption doesn't cause loss of all audio after the error. (E.g. FLAC usually does this in 4608-sample or 1152-sample blocks)

Regarding this bit:
Quote
In the hybrid mode the user's kbps number is converted to a number of bits per sample (for example 320 kbps = 3.63 bits/sample) and we only store the residual with as much resolution as we can given that average number of bits. So, if the error is running with an average magnitude of 100 and we are allowed 3.63 bits per sample, then we can store the errors with an accuracy of about +/-20. Note that if a big error comes along we use more bits to store that sample while samples close to zero require fewer bits, but every sample is stored with the same accuracy and we achieve the average bitrate. If a transient comes along and the average residual value goes up suddenly, we will store the first few with a lot of extra bits to maintain the accuracy, but then the exponentially lagging average will start going up and we will start storing with less and less accuracy until we hit the target bitrate again. When the average is falling (after the transient) we will be storing fewer bits because the average will be high (it always lags) and this will balance the extra bits we stored at the beginning. It's actually pretty interesting how it can maintain the average bitrate to within about 1% over the long term even though it's completely open-loop (no feedback).


Right, so you measure the average magnitude of the prediction error (either by its average absolute magnitude or by its RMS value) with a rolling average over a longish period (e.g. tens to hundreds of samples), then calculate how much error you'd typically allow for the next sample given the typical efficiency of Rice coding.

I see, so it's a feed-forward control mechanism based on recent history. Actually, isn't that the same as negative feedback?
When the transient hits, you've still got the low average from the non-transient section forcing you to retain a small error in each residual and spend a long Rice code on it. If the predictor doesn't get better, the average residual (RMS or absolute) increases, so you naturally begin allowing greater error shortly afterwards, and this greater error naturally starts using fewer bits once the predictor adapts to the new sound (i.e. after at least 16 samples).

In a sense this takes advantage of temporal masking - specifically post-masking, albeit accidentally and with no calculated masking threshold. The loud transient (burst of loud sound) causes the ear to become less sensitive to noise/distortion for a period, and it happens that the codec also becomes noisier shortly after the transient because the transient didn't meet the predictor's predictions.

The beauty is in the lack of pre-echo distortion. The error before the transient doesn't increase. In contrast, frame-based lossy codecs that use an FFT or DCT for analysis spread the error/distortion across the analysis frame if a transient occurs within the frame and they decide it will mask the whole frame. In some cases, where the time resolution isn't fine enough, this noise begins before the onset of pre-masking (the short window in which the ear doesn't notice increased distortion just before a transient). The effect is of sudden hiss occurring before the main sound of the transient - and where the transient is noise-like too, it sounds like an echo occurring before the main sound - pre-echo.

With knowledge of temporal post-masking, e.g. from the link in the last-but-one paragraph, it may be possible to make use of this effect more smartly (e.g. in VBR mode) to deliberately allow greater error after transients, that decays away and also follows some sort of known post-masking decay threshold. It does seem that the length of the masker burst affects the post-masking effect, with a 200 ms burst providing longer-lived masking than a 5 ms burst, so for conservatism, the 5 ms burst's post-masking profile may be the one to aim for, even though more bits could be saved if one could work out the difference.

In fact, a very sudden cessation of a loud sound would also cause a transient rise in predictor residual, so it would also require more bits. This coder wouldn't distinguish it from a transient onset of sound, so it's important that the allowed error reduces within a sufficiently short time. That time currently depends on the decay of the trailing average of the previous so-many residuals (either RMS or absolute).

Imagine a loud sound with typical prediction residuals of 500 (and allowed error of about 100) that suddenly stops, giving way to a quieter sound (perhaps 20 dB down).
16 samples after the cessation, the predictor ought to be getting close to predicting the quieter sound most of the time. Let's say the typical residuals after these 16 samples come to about 50. Eventually the allowed error should come down to about 10. The allowed error will gradually come down from 100 towards 10 as the moving average residual comes down from 500 to 50. The allowed error will actually remain higher than the typical residual of 50 until about half of the pre-cessation residuals have dropped out of the moving-average window (assuming it's a rectangular, unweighted average, not one with a time decay). Actually, the extra-large error at the sudden cessation transient will push the average up higher still for longer.
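A quick back-of-the-envelope run of that scenario, assuming purely for illustration a 64-sample rectangular averaging window and an allowed error of one fifth of the average:

Code:
residuals = [500] * 200 + [50] * 200          # magnitudes before and after the cessation
window = 64                                   # rectangular averaging window (assumed)
for n in range(200, 200 + window + 1, 8):
    avg = sum(residuals[n - window:n]) / window
    print(n - 200, round(avg), round(avg / 5))   # samples since cutoff, average, allowed error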

So, the audibility of noise after cessation of loud sounds will depend on the speed of decay, which depends on the length of the moving average window for the residual. If the decay curve happens to cross the post-masking threshold, the noise will audibly increase.

I wonder if this is what's happening in Den's example of Mandela Day by Simple Minds. Perhaps a shorter averaging time, or, better still, some sort of weighting on the average that decays the further back in time you go, would help reduce the audibility of the noise.

I guess with the right tools it's possible to analyze the residuals in that sample, the curve of the average residual and the estimated error being allowed after cessation transients, assuming there are some. (I'm trying to think if I can remember the character of the track from an old cassette I haven't heard in years)

Ah, I just read David's post, which says the decay time is about 6 ms, so this does seem small enough not to be a problem (unless of course the quiet sound is hard to predict). (Or even crazy ultrasound, like the tweeter-frying udial.ape test sample.)

DickD

What is the current status of wavpack 4?

Reply #33
Thanks for the response, David.

The difference signal from most lossy codecs sounds like music, so I'm not quite sure that it's a disadvantage.

I just think that if you assign more bits to lower residual values you can always store low residuals better, even if the signal becomes more predictable "suddenly" - faster than your average adapts. Maybe this will help. But if the adaptation time is close to the post-masking time, it's not very likely.

We could be more constructive if we took a short sample that den can ABX and actually looked at what's wrong with it.

-Eugene
The  greatest  programming  project of all took six days;  on the seventh  day  the  programmer  rested.  We've been trying to debug the !@#$%&* thing ever since. Moral: design before you implement.