Topic: New FLAC compression improvement (Read 4579 times)

New FLAC compression improvement

2020-11-04 20:47:09
Hi all,

Last summer I spent some time thinking of ways to improve FLAC's compression within the specification, without resorting to variable blocksizes, which are in the spec but might not be properly implemented by many decoders. I discussed this old post with its writer: https://hydrogenaud.io/index.php?topic=106545.50

He wrote a small Python script to explore the idea of an integer least-squares solver; it's included as ils.py. I explored the idea, read a lot of literature, and came up with a different solution.

The current FLAC LPC stage is, so to speak, a classic textbook implementation; the method was originally designed for speech encoding. The ILS method SebastianG proposed tries to find a more optimal predictor. While the current approach comes close to generating least-squares solutions, it could perhaps be improved by a more direct approach.

However, FLAC's entropy encoder doesn't encode 'squared' values; its cost is more linear. That is why I developed an iteratively reweighted least squares solver which, through its weighting, converges not to a least-squares solution but to a so-called least absolute deviation solution. A proof-of-concept is also attached as irils-calculate-improvement.py. It gives mostly equal results to the current approach on most material, but improves on certain, mostly synthetic material like electronic music.
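To illustrate the reweighting idea, here is a hypothetical one-coefficient sketch (not the attached script itself): weights of 1/|r| steer repeated weighted least-squares fits toward the least absolute deviation solution, which is far less sensitive to a single outlier.

```python
# Hypothetical one-coefficient sketch of IRLS, fitting y ~ a*x.
# Weights of 1/|residual| turn the squared cost into an absolute-value cost.

def irls_lad(xs, ys, iterations=30, eps=1e-6):
    # start from the ordinary least-squares estimate
    a = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    for _ in range(iterations):
        # reweight by 1/|residual| (capped via eps to avoid division by zero)
        ws = [1.0 / max(abs(y - a * x), eps) for x, y in zip(xs, ys)]
        a = (sum(w * x * y for w, x, y in zip(ws, xs, ys))
             / sum(w * x * x for w, x in zip(ws, xs)))
    return a

xs = list(range(1, 11))
ys = [2 * x for x in xs]   # y = 2x everywhere...
ys[-1] = 100               # ...except one large outlier

a_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
a_lad = irls_lad(xs, ys)
# least squares is dragged toward the outlier; IRLS settles near the true 2
```

Each iteration solves an ordinary weighted least-squares problem whose weights come from the previous residuals; that is all "iteratively reweighted" means here.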

I even got as far as implementing it in FLAC, which can be found here: https://github.com/ktmf01/flac/tree/irls However, this implementation doesn't perform as well as the python proof-of-concept, and as my free time ran out last summer, I haven't been able to improve on it.

Maybe some people here with love for both FLAC and math might be interested in looking into this. I'll look at this again when I have some time to spare and a mind ready for a challenge.

I'd love to hear questions, comments, etc.
Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

Well, did you test it in real world scenarios and what were the results? Don't just drop this here and go away
Error 404; signature server not available.

Re: New FLAC compression improvement

That is a very good question. I probably didn't explain myself well enough in that regard.

The C code on GitHub is not working properly, I think; it isn't an improvement over the existing FLAC codebase. So, I haven't achieved any real-world compression gains yet.

However, the Python proof-of-concept is promising, albeit with a caveat. To keep the proof-of-concept simple, it only performs a plain rice bits calculation, without partitioning, and it does not do stereo decorrelation. So, I can only compare against single-channel FLAC files with rice partitioning turned off. This is why the gains in the proof-of-concept might be larger than what would be achievable with rice partitioning and stereo decorrelation enabled.

For most material, my FLAC files got about 0.2% smaller (which would be 0.1% of the original WAV size). In one case, with electronic music (Infected Mushroom - Trance Party) I got an improvement of 1.2% (which would be 0.6% of the original file size).

So, I think this is promising, but I haven't been able to achieve this in C yet.

Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

Hi all,

Last week I had some spare time and a mind ready for a challenge, so I took a look at the code (https://github.com/ktmf01/flac/tree/irls). Sadly, it seems it still needs quite a bit of work before it performs as well as current FLAC. I'll share my findings here, so I can read them back later, and for anyone interested.

How FLAC has worked since the beginning
So, FLAC uses LPC (linear predictive coding) to predict a sample based on previous samples, and stores the error/residual with a so-called rice code. Finding a good predictor is done by modelling the input samples as autoregressive, which means there is some correlation between the current sample and past samples. There is a 'textbook way' of calculating the model parameters (which we take as the predictor coefficients) using the Yule-Walker equations. These equations form a Toeplitz matrix, which can be solved quickly with Levinson-Durbin recursion. While a few shortcuts are taken, this should result in a solution which is close to a least-squares solution of the problem, and it is very fast to compute.
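As a rough sketch of that textbook route (heavily simplified: no windowing and no coefficient quantization, plain Python rather than the actual FLAC code):

```python
import random

def autocorrelation(x, max_lag):
    # r[l] = sum over n of x[n] * x[n-l]
    return [sum(x[n] * x[n - l] for n in range(l, len(x)))
            for l in range(max_lag + 1)]

def levinson_durbin(r, order):
    # Solve the Yule-Walker (Toeplitz) system for the LPC coefficients,
    # raising the predictor order one step at a time.
    err = r[0]
    lpc = []
    for i in range(order):
        acc = r[i + 1] - sum(lpc[j] * r[i - j] for j in range(i))
        k = acc / err  # reflection coefficient
        lpc = [lpc[j] - k * lpc[i - 1 - j] for j in range(i)] + [k]
        err *= 1.0 - k * k  # modelling error shrinks with each order
    return lpc

# AR(1) test signal: x[n] = 0.9 * x[n-1] + noise
rng = random.Random(42)
x = [0.0] * 4096
for n in range(1, len(x)):
    x[n] = 0.9 * x[n - 1] + rng.uniform(-1.0, 1.0)

lpc = levinson_durbin(autocorrelation(x, 8), 1)
# lpc[0] should land close to the true coefficient 0.9
```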

While this is all good and fun, the predictor resulting from this process is not optimal, for several reasons. First, the process minimizes the squared error, while for the smallest file size we want the shortest rice code. Second, as FLAC input is usually audio and not some steady-state signal, the optimal predictor changes over time, so it might be better to ignore a (short) part of the input when trying to find a predictor. In other words: it might be better to find a good predictor for half the signal than a mediocre predictor for all of the signal. Third, minimizing the squared error puts an emphasis on single outlier samples, which messes up the prediction of all other samples, while such a single sample will not fit any predictor anyway.
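A toy example of the third point, using two hypothetical residual sequences: the squared cost prefers the uniformly mediocre predictor, while a linear cost (closer to how rice codes charge bits) prefers the predictor that is excellent apart from one outlier.

```python
# Two hypothetical residual sequences from two candidate predictors:
res_a = [4] * 16          # mediocre everywhere
res_b = [1] * 15 + [40]   # excellent everywhere, one outlier

sq_a = sum(r * r for r in res_a)    # 256
sq_b = sum(r * r for r in res_b)    # 1615
abs_a = sum(abs(r) for r in res_a)  # 64
abs_b = sum(abs(r) for r in res_b)  # 55

# Least squares strongly prefers predictor A (256 vs 1615),
# but the linear cost prefers predictor B (55 vs 64).
```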

Ignoring a short part of the signal is exactly what a patch I submitted a few years ago does. It added the partial_tukey and punchout_tukey windows/apodizations, which ignore part of the signal. This is a brute-force approach: the exact same windows are tried on every block of samples.

What I have been trying last year
Now, last summer I wanted to try a new approach to finding a predictor. This started out as a different way to find a least-squares solution to the problem (without taking shortcuts), but I began to realize that least squares is not the same as the smallest rice code. As the rice code takes up the most space in a FLAC file, minimizing it should be the ultimate aim if I want compression gains. I figured that, compared to a least-squares solution, a least absolute deviation (LAD for short) solution is more resistant to 'outliers'. As these outliers mess up prediction, I thought this might work well, so I implemented code in FLAC to do this, through iteratively reweighted least squares, or IRLS for short. This code works, but it is very slow and does not (yet) compress better.

What is wrong with this approach
After reviewing my code last week, I realised the LAD (least absolute deviation) IRLS implementation I made is still not minimizing the rice code. For one, the rice code in FLAC is partitioned, which means that a large error at the beginning of a block can grow the rice code length differently than, for example, a large error at the end of the block. Also, depending on the rice parameter, a slightly larger error might not cost any extra bits at all (when the error is smaller than 2^(rice parameter - 1)), it might cost one extra bit (when the error is about the size of 2^(rice parameter)), or it might cost several bits (when the error is much larger than 2^(rice parameter)). I could not find any existing documented IRLS weighting (a so-called norm) that works with rice codes.
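For reference, a sketch of the rice-code cost per residual (assuming the usual zigzag folding of signed values); it shows the thresholds described above: with parameter k, every residual below 2^(k-1) costs the same, and beyond that the cost grows in linear steps.

```python
def rice_bits(residual, k):
    # fold signed residual to unsigned (zigzag: 0, -1, 1, -2 -> 0, 1, 2, 3)
    u = residual * 2 if residual >= 0 else -residual * 2 - 1
    # unary-coded quotient + 1 stop bit + k remainder bits
    return (u >> k) + 1 + k

# with k = 4, all residuals below 2^(k-1) = 8 cost the same 5 bits,
# while much larger residuals are charged roughly linearly
costs = [rice_bits(r, 4) for r in (0, 7, 8, 100)]  # [5, 5, 6, 17]
```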

What I want to improve about it
So, the next step is to improve the IRLS weighting. This weighting procedure should ideally incorporate knowledge of the rice parameter (as this says something about whether bits can be saved or not), but the rice parameter is not known during weighting. I think using a moving average* of the residual might be a good way to guess the rice parameter during the weighting stage. I could also use a partitioned average*, but as the number of rice partitions is not known beforehand (just like the rice parameter), the size of the partitions to average over will probably not match the rice partition size, and we might get strange behaviour on partition boundaries. A moving average window has a similar problem (choosing the size of the window): the optimal window size correlates with the rice partition size, which is not known during the weighting stage, but at least a moving average doesn't 'jump' at arbitrary boundaries.
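A sketch of the moving-average idea; both the window size and the log2 rule of thumb for guessing the rice parameter are my assumptions here, not something taken from the encoder:

```python
import math

def moving_average_abs(res, window):
    # trailing moving average of |r|; the window size is a free choice here
    out, acc = [], 0.0
    for i, r in enumerate(res):
        acc += abs(r)
        if i >= window:
            acc -= abs(res[i - window])
        out.append(acc / min(i + 1, window))
    return out

def guess_rice_k(mean_abs):
    # rule of thumb (an assumption): k grows with log2 of the local mean |r|
    return max(0, int(math.log2(mean_abs))) if mean_abs >= 1 else 0

local = moving_average_abs([16] * 64, 16)
k = guess_rice_k(local[-1])  # 4
```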

Using a moving average*, the weighting procedure can incorporate knowledge about which errors are considered too large to try to fix, and which are too small to bother with. If the error on a sample is much smaller than the moving average it can be ignored, and if it is much larger than the moving average it can be ignored too. Only if the error is within a certain band should it be weighted such that the next iteration tries to decrease it.

The current IRLS weighting scheme is least absolute deviation: to get from the default least squares to least absolute deviation, the weights are the inverse of the residual (1/r). For very small r, the weight 1/r is capped at some value, and I can change that cap depending on the moving average. Above a certain value of r, I will try using a weight of 1/r². The effect I think this will have is sketched in the image I have attached.
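A sketch of such a weight function; the thresholds relative to the moving average (lo_frac, hi_frac) are hypothetical constants, chosen only to make the shape continuous:

```python
def irls_weight(r, local_avg, lo_frac=0.1, hi_frac=4.0):
    # thresholds relative to the local average |residual| (assumed constants)
    t_lo = lo_frac * local_avg
    t_hi = hi_frac * local_avg
    a = abs(r)
    if a <= t_lo:
        return 1.0 / t_lo      # cap: don't over-weight tiny residuals
    if a <= t_hi:
        return 1.0 / a         # LAD band: weight 1/|r|
    return t_hi / (a * a)      # outliers: fall off as 1/r^2, continuous at t_hi
```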

If this doesn't work, maybe a moving average can instruct the algorithm to ignore part of a block. This is how the partial_tukey and punchout_tukey windows work, but here the method would not be applied brute-force but with some knowledge of the samples themselves: if the moving average over some part is much larger than the average of the whole block, that part is ignored. This way, short bursts of outliers (for example, the attack of a new tone) can be ignored when modelling the predictor.

*when I say average, I mean average of the absolute value of the residual from the last iteration.

Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

The last few days I've been busy working on this, and there is some progress. Sadly, the code is very slow, and I don't think this will improve much. This improvement fits a certain niche of FLAC users who want maximum compression within the FLAC format and don't care how long encoding takes, perhaps reencoding in the background.

Attached you'll find a 64-bit Windows executable, compiled with MinGW using -march=native on an Intel Kaby Lake-R processor, and a PDF with a graph of my results. I haven't used -march before, but if I understand correctly, this code should run on fairly recent 64-bit CPUs with AVX2 and FMA3. To make testing a little easier, I have added a new preset, -9, which uses the new code.

The following graph shows the results of encoding with (from left to right) setting -5, -8, -8e, -8ep and finally -9.

The new preset is -8e plus a new 'apodization function', irls(2 11). It is technically not an apodization, but this way the integration into the FLAC tool is pretty clean. The function has two parameters: the number of iterations per LPC order, and the number of orders. So, with irls(2 11), the encoder does two iterations at order 2, two iterations at order 3, and so on up to two iterations at order 12. Sadly, there is still something wrong in the code, so I would recommend not using any setting other than 11 orders at this time. Using more iterations is very well possible, and gives a little extra gain at the cost of a considerable slowdown.

Last time I improved FLAC compression the improvement was about 0.2% with no slowdown. This one is another 0.2%, but at the cost of a 60x slowdown.

NOTE: Please use the attached binary with caution! Little testing has been done on it!
Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

Not convinced of the usefulness, but interesting, yes. Do you have any idea which orders actually account for the improvement?

But I am curious why your graph wasn't visible here. Hotlinking forbidden?  Anyway, it is at
http://www.audiograaf.nl/misc_stuff/irls-compression.png if any other HA kittehs should think the same:

High Voltage socket-nose-avatar

Re: New FLAC compression improvement

Very interesting stuff, but even with a 5900X this is a hard nut.
flac-native -9 indeed compresses the albums I tried better than CUEtools flake -8, while being slightly better on average than flac -8ep.
The speed is too slow for my taste to consider it for use in my workflows, but it is a very nice idea.
Many thanks for the effort and the working version
With 24bit music you can listen to silence much louder!

Re: New FLAC compression improvement

Too late to edit the above, sorry. I configured my frontend wrong by keeping -ep in the command.
The above applies to flac-native -9 -ep.
With 24bit music you can listen to silence much louder!

Re: New FLAC compression improvement

But I am curious why your graph wasn't visible here. Hotlinking forbidden?  ...
Graph shows for me, so I suspect it's something in your browser settings preventing it being shown. It's happened to me before over the years.

Re: New FLAC compression improvement

Some unscientific numbers for 29 random CD format albums:

CUEtools flake -8
7.591.331.615 Bytes

flac-native -8
7.623.712.629 Bytes

flac-native -9
7.586.738.858 Bytes

flac-native -9 -ep
7.581.988.737 Bytes
With 24bit music you can listen to silence much louder!

Re: New FLAC compression improvement

The hotlinking is broken because the link is http and the forum is https, and any browser which enforces security rules will block http resources, rather than downgrade the page security.

Re: New FLAC compression improvement

Some unscientific numbers for 29 random CD format albums:
[...]
I was shocked to see that the difference between flake and this new LPC analysis method was so small, but I just realised that is probably because flake uses a smaller padding block by default. I tried to get CUEtools.Flake working here, but for some reason I can't. Wombat, can you check whether padding is indeed smaller with CUEtools.Flake? If there are no tracks longer than 20 minutes in your test, the difference should be 4096 bytes per track.
Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

You can use my old compile. If you want to use the encoder from the recent CUEtools download you need to copy the additional file Newtonsoft.Json.dll
Not too sure about padding. With foobar2000's Optimize file layout + minimize I get this for a single album:
total file size decreased by 29953 bytes for the CUEtools -8 version, and by 62721 bytes for the flac-native -9 version. I doubt this is much.
With 24bit music you can listen to silence much louder!

Re: New FLAC compression improvement

8pe vs 9  (+0.14% compression gain on 1 album).
~55-60x vs ~20x on my laptop with 6 cores/12 threads.

As for me, 9 is ok.

P.S. 9pe brings +0.08% compression gain at ~16x comparing to -9. Ok too.

Re: New FLAC compression improvement

You can use my old compile.
[...]
Not to sure about padding. With foobars Optimize file layout + minimize i get this for a single album:
[...]
You are right. I did a comparison and CUEtools.Flake does a much better job than regular flac. It seems my time would have been better spent figuring out why CUEtools.Flake compresses so much better, but perhaps (hopefully) these gains stack, or else my work will have been for nothing.

Anyway, I've also rewritten my code so irls can be used as post-processing. So, when FLAC has found the best predictor in the normal way (through the regular LPC process, with the partial_tukey and punchout_tukey windows), it will try using that result as a starting point for IRLS, instead of starting over. In the graph this is called irlspost, with the number of iterations in round brackets. Using the current -8e plus 4 extra windows (punchout_tukey()) like CUEtools.Flake does, and adding 3 iterations, comes close to using -9 but is 4x faster.

Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

I found out why Flake compressed so much better than FLAC. Apparently, this has been the case for at least 14 years: Flake computes autocorrelation with double precision, and FLAC with single precision. This makes quite a difference on some tracks, especially piano music.
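The effect is easy to reproduce without any FLAC code: rounding every accumulation step to single precision (here via struct, on a toy signal, not the actual encoder path) shifts the lag-0 autocorrelation relative to a double-precision sum.

```python
import math
import struct

def f32(v):
    # round a Python float to the nearest IEEE single-precision value
    return struct.unpack('f', struct.pack('f', v))[0]

def autoc_lag0_single(x):
    acc = 0.0
    for s in x:
        acc = f32(acc + f32(s * s))  # every step rounded to 32 bits
    return acc

def autoc_lag0_double(x):
    acc = 0.0
    for s in x:
        acc += s * s                 # full double-precision accumulation
    return acc

# toy 16-bit-scale signal
x = [30000.0 * math.sin(0.01 * n * n) for n in range(4096)]
single = autoc_lag0_single(x)
double = autoc_lag0_double(x)
rel_err = abs(single - double) / double
```

The relative error is small in absolute terms, but the LPC coefficients derived from these sums are sensitive to it on tonal material.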

See the flac-dev mailinglist for the complete story: http://lists.xiph.org/pipermail/flac-dev/2021-June/006470.html

See the graph below for the results.

Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

Nice catch!

So the worst-slowdown CPUs for going double, would be those with SSE (which get the dark-blue benefits over the green) but no SSE2 (thus getting the light-blue penalty vs the red) - they will at -8 have a slowdown by a factor of 2.5 or so?
But at a benefit.

Thinking aloud about near-equivalences between "new" and "old" for those worst-case SSE-but-not-SSE2 CPUs:
If I may presume that the most interesting (to end-users!) compression levels are -0, -8 and the default  - signifying "speed please!", "compression please!" and "I don't care, it works":
* For the SSE-but-not-SSE2, going from dark-blue -6 to light-blue -5 would get as good as the same outcome. Of course you don't change the default out of the [light|dark]blue, but for those who choose default from the "I don't care, it works" it shouldn't matter much if a new build gives them the performance equivalent of bumping -5 to -6.
* And if the using-whatever-defaults users start to care, they can start using options. You didn't include any -3, but what would the light-blue -3 be? Hunch: close to dark-blue -5?
* It is also a stretch to redefine -8 or reassign "--best" to something else than -8 (even if the documentation always said "Currently" synonymous with -8) but the -7 suddenly looks damn good - and that goes for the red as well.

So maybe at the conclusion of your efforts one should rethink what the numbers - and "--best" (and "--fast"?) stand for. -6 to -8 use -r 6 which is not the subset maximum; and "4096" is not the maximum subset blocksize.

(And dare I say it, but as impressive as TAK performs, one may wonder how much of it would be feasible within other formats too.)
High Voltage socket-nose-avatar

Re: New FLAC compression improvement

So the worst-slowdown CPUs for going double, would be those with SSE (which get the dark-blue benefits over the green) but no SSE2 (thus getting the light-blue penalty vs the red) - they will at -8 have a slowdown by a factor of 2.5 or so?
Yes. SSE2 has been around for 20 years and is present on all 64-bit capable CPUs, so this should not affect many users. I'm not sure about the VSX part (for POWER8 and POWER9 CPUs), but I think the penalty will be comparable to the SSE-but-not-SSE2 case if nobody updates these routines before the next release. I don't have the hardware, so I can't develop for it.

Quote
* And if the using-whatever-defaults users start to care, they can start using options. You didn't include any -3, but what would the light-blue -3 be? Hunch: close to dark-blue -5?
-3 disables stereo decorrelation and is in a completely different ballpark. See http://www.audiograaf.nl/losslesstest/Lossless%20audio%20codec%20comparison%20-%20revision%204.pdf Graph from that PDF is below, see red circle for -3. Using -3 is pretty much pointless, but perhaps some archaic decoders can only decode without stereo decorrelation. I think I'd rather leave that untouched.

Quote
* It is also a stretch to redefine -8 or reassign "--best" to something else than -8 (even if the documentation always said "Currently" synonymous with -8) but the -7 suddenly looks damn good - and that goes for the red as well.
Yes, but the name "best" does imply that it is the best. It isn't: -8e is better and -8ep is even better, but -8 is the best preset. I wouldn't want to change that.
Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

the name "best" does imply that is the best. It isn't,

Let's hear it for the good(?) old  --super-secret-totally-impractical-compression-level !
High Voltage socket-nose-avatar

Re: New FLAC compression improvement

But on -3 I think you turned two arguments on their head, or am I just having a bad hair day?

* -3 improves over -2 at both speed and compression simultaneously. That is not pointless, that is good.
If anything of those settings is pointless from a performance perspective, it is -2.

* if your new -3 would happen to turn out as good as the old -5, it would mean that users who want "old -5" performance can happily go "new -3".
But that does not prevent anyone with a special need for -3 to keep using -3.

High Voltage socket-nose-avatar

Re: New FLAC compression improvement

Presets -0, -1 and -2 only use fixed subframes. These are really easy to decode. If one wants maximum performance with only fixed subframes, -2 is the way to go. It is the odd one out performance-wise, but I think it is like using TAK's -p0m instead of -p1: slower encoding, same decoding speed and less compression, but with some limitation in the encoding process which should help decoding. I don't know for sure, I've never studied the TAK format.

Anyway, from a performance perspective, I think -0, -1, -2 and -3 are all pointless: the gain from -4 to -8 is less than 0.5 percentage points (~1%), while the gain from -3 to -4 is 2 percentage points (~4%). However, in the very early days of FLAC support on hardware devices through Rockbox, some levels gave a longer battery life than others, and some devices with very limited hardware didn't run fast enough to decode LPC. There is probably some documentation around referring to FLAC presets in terms of decoding performance. That's why I think the presets should only be changed as long as decoding performance is not affected.

Very archaic indeed, but FLAC has been broadly accepted for quite a long time now, so I guess it's part of the deal
Music: sounds arranged such that they construct feelings.

Re: New FLAC compression improvement

But again, allowing users to migrate to a lower-number setting does not hurt any compatibility.

Admittedly, I only speculated through gut-feeling extrapolation, as of what your new -3 would be able to do.
But let me just point at the graph you posted. If I understand your light-blue -5 vs dark-blue -6 correctly, it means that an SSE (no SSE2) user who as of today prefers -6 could get a performance hit - which could be as good as eliminated by switching to -5 in the new version.

Going -6 to -5 is ... fine. Going the other way, forcing users to choose a higher number, carries a risk - but going from -6 on an old version to -5 on a new version means they keep their performance and maybe even get slightly better battery life upon decoding. So that is a case for just adopting the double.

Of course it isn't that straight - not everyone has a collection represented by your graphs.
High Voltage socket-nose-avatar

Re: New FLAC compression improvement

In the very early days of FLAC support on hardware devices through Rockbox, some levels gave a longer battery life than others, and some devices with very limited hardware didn't run fast enough to decode LPC
Can someone name at least one such slow device? According to these results FLAC -8 decoding is even faster than mp3 decoding on all tested hardware.

Re: New FLAC compression improvement

Can someone name at least one such slow device? According to these results FLAC -8 decoding is even faster than mp3 decoding on all tested hardware.
Seeing that table, I think I remembered wrong. That table was what I remembered but couldn't find: FLAC compression levels being referred to in a benchmark of decoding performance.

Anyway, I'd like the opinion of the people reading this on the following (quote from the mailing list: http://lists.xiph.org/pipermail/flac-dev/2021-June/006470.html)

Quote
Code is here: https://github.com/ktmf01/flac/tree/autoc-sse2 Before I send a push request, I'd like to discuss a choice that has to be made.
I see a few options
- Don't switch to autoc[] as doubles, keep current speed and ignore possible compression gain
- Switch to autoc[] as doubles, but keep current intrinsics routines. This means some platforms (with only SSE but not SSE2 or with VSX) will get less compression, but won't see a large slowdown.
- Switch to autoc[] as doubles, but remove current SSE and disable VSX intrinsics for someone to update them later (I don't have any POWER8 or POWER9 hardware to test). This means all platforms will get the same compression, but some (with only SSE but not SSE2 or with VSX) will see a large slowdown.

So, switching from single precision to double precision, I'm faced with a choice: should I keep the fast but single-precision routines for SSE and VSX, so those platforms keep their current speed but get no compression benefit? Or should these routines be removed/disabled, so all platforms (ARM, POWER, ia32, AMD64) get the same compression, but at varying cost?
Music: sounds arranged such that they construct feelings.