Topic: FLAC v1.5.x Performance Tests (topic derived from FLAC v1.4.x Performan...)

FLAC v1.5.x Performance Tests

Today we received FLAC 1.5.0.
Here are my recent Exact Rice AVX2 builds, in addition to the default AVX2 builds in the 1.5.0 release thread.
https://hydrogenaud.io/index.php/topic,127408.msg1059172/topicseen.html#new

Edit: I posted this in the FLAC v1.4.x Performance Tests thread, but a moderator must have split it into its own thread.



Moderator: changed title (was: FLAC v1.5.0 (Exact Rice AVX2 builds))
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: FLAC v1.5.x Performance Tests

Reply #1
Today we received FLAC 1.5.0.
Here are my recent Exact Rice AVX2 builds, in addition to the default AVX2 builds in the 1.5.0 release thread.
https://hydrogenaud.io/index.php/topic,127408.msg1059172/topicseen.html#new

Edit: I posted this in the FLAC v1.4.x Performance Tests thread, but a moderator must have split it into its own thread.

Thank you for this! It’s an exciting day.

In layperson’s terms, how exactly does this build differ from the ones in the other thread and why might somebody want to choose one over the other?

I understand that one of the builds there is optimized for 16bit material at the expense of 24bit performance, but aside from that…


Re: FLAC v1.5.x Performance Tests

Reply #3
In layperson’s terms, how exactly does this build differ from the ones in the other thread and why might somebody want to choose one over the other?

I understand that one of the builds there is optimized for 16bit material at the expense of 24bit performance, but aside from that…
Rather than overthinking it, I ran a quick speed test. The decoder seems to be a bit faster, but the encoder is noticeably slower, and only on CPUs with AVX2 support.
3,209,871,510 bytes (86 tracks, 2 channels, 16 bit, 48 kHz), AMD CPU (AVX2), single thread

FLAC Version         Encode      Decode
FLAC 1.4.3           21.252 s    16.564 s
FLAC 1.5.0           28.376 s    14.752 s
FLAC 1.5.0 Disasm    29.688 s    14.500 s

Re: FLAC v1.5.x Performance Tests

Reply #4
The "Exact Rice" build will be slower, as it does extra work. It brute-forces the optimal Golomb-Rice parameter instead of estimating one that is at worst very close. If you compare the encoded files, they should be ever so slightly smaller.
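To make the difference concrete, here is a minimal Python sketch of the two approaches for a single Rice partition. It is not FLAC's code: the function names, the `max_k` limit and the mean-based estimate are assumptions chosen for illustration, and the real encoder's estimate may be computed differently.

```python
def zigzag(r):
    # Map a signed residual to an unsigned value: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4
    return 2 * r if r >= 0 else -2 * r - 1

def rice_bits(residuals, k):
    # Per sample: unary quotient, one stop bit, and k remainder bits
    return sum((zigzag(r) >> k) + 1 + k for r in residuals)

def exact_parameter(residuals, max_k=14):
    # Brute force: count the bits for every candidate k and keep the smallest
    return min(range(max_k + 1), key=lambda k: rice_bits(residuals, k))

def estimated_parameter(residuals, max_k=14):
    # Illustrative heuristic (an assumption): pick k near log2 of the mean mapped value
    mean = sum(zigzag(r) for r in residuals) / len(residuals)
    k = 0
    while k < max_k and (1 << (k + 1)) <= mean:
        k += 1
    return k

example = [3, -7, 12, 0, -1, 25, -40, 8]
k_exact = exact_parameter(example)
k_est = estimated_parameter(example)
print(k_exact, rice_bits(example, k_exact), k_est, rice_bits(example, k_est))
```

The brute-force search can never produce more bits than the estimate, but it has to evaluate every candidate parameter, which is where the extra encoding time goes.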

Re: FLAC v1.5.x Performance Tests

Reply #5
A quick test, 4 passes for every encoder build (results were posted in a collapsed spoiler):


edit: 16 bit 44.1 kHz

Re: FLAC v1.5.x Performance Tests

Reply #6
I can't test AVX-512 builds myself. Very interesting that it compresses the same as the Exact Rice version.

Re: FLAC v1.5.x Performance Tests

Reply #7
The "Exact Rice" build will be slower, as it does extra work. It brute-forces the optimal Golomb-Rice parameter instead of estimating one that is at worst very close. If you compare the encoded files, they should be ever so slightly smaller.
I double checked but the compressed sizes seem to be the same. I'm using the default settings directly.
Also without AVX2 the processing speeds of 1.5.0 are almost the same as 1.4.3. I don't think I made a mistake. It would be nice if you could confirm this.

Re: FLAC v1.5.x Performance Tests

Reply #8
Thank you all for the clarification, this is something I might consider using in cases where speed is not really a concern (EAC ripping, for example)

With multithreading even intensive -8ep encodes happen practically within the blink of an eye. I could stand to slow things down a bit

Re: FLAC v1.5.x Performance Tests

Reply #9
I can't test AVX-512 builds myself. Very interesting that it compresses the same as the Exact Rice version.
I’m really sorry for the mix-up. It turns out my system doesn’t support AVX512 either, and I accidentally relied on automated data in the report. My bad! Here’s the accurate quick-test data to fix this (posted in a collapsed spoiler).

Re: FLAC v1.5.x Performance Tests

Reply #10
Looking at the source code, all vectorized sections explicitly use AVX2, so enabling AVX512 will give you the larger register file (32 registers instead of 16) but most likely won't actually generate AVX512 instructions. I expect that will not make much difference in performance.

Has anyone profiled FLAC? I'm curious how much time it spends in the vectorized functions vs. scalar x86.

Re: FLAC v1.5.x Performance Tests

Reply #11
@Porcus once ran some benchmarks, and at least at higher compression levels GCC can create AVX-512 code that is up to 10% faster than AVX2.
https://hydrogenaud.io/index.php/topic,123025.msg1030848.html#msg1030848
So I thought I would add it. Maybe we will get some more numbers showing whether it is worth it.

Re: FLAC v1.5.x Performance Tests

Reply #12
Rather than overthinking it, I ran a quick speed test. The decoder seems to be a bit faster, but the encoder is noticeably slower, and only on CPUs with AVX2 support.
3,209,871,510 bytes (86 tracks, 2 channels, 16 bit, 48 kHz), AMD CPU (AVX2), single thread

FLAC Version         Encode      Decode
FLAC 1.4.3           21.252 s    16.564 s
FLAC 1.5.0           28.376 s    14.752 s
FLAC 1.5.0 Disasm    29.688 s    14.500 s
When I looked at the output of the FLAC 1.5.0 AVX2 version, I saw that it was 109,184 bytes smaller than normal. Since that difference is negligible across 3.2 GB of data, it escaped my attention. Personally, I had higher expectations for speed (despite AVX2). I thank the developers for their efforts.


Re: FLAC v1.5.x Performance Tests

Reply #14
That was profiling of decoding.

Which functions take most time is heavily dependent on context: 16-bit or 24-bit, preset 0, 5 or 8, architecture, etc.

Here is a profile run for 16-bit input data, preset 8, x86-64 with AVX2, showing all functions that take more than 1% of the time:

Code:
  27.76%  flac     flac              [.] FLAC__lpc_compute_autocorrelation_intrin_fma_lag_16
  13.12%  flac     flac              [.] FLAC__lpc_compute_residual_from_qlp_coefficients_16_intrin_avx2
   6.60%  flac     flac              [.] FLAC__MD5Transform
   6.54%  flac     flac              [.] FLAC__bitwriter_write_rice_signed_block
   6.17%  flac     flac              [.] find_best_partition_order_.isra.0
   5.37%  flac     flac              [.] FLAC__precompute_partition_info_sums_intrin_avx2
   4.93%  flac     flac              [.] FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
   4.50%  flac     flac              [.] FLAC__lpc_window_data_partial
   4.30%  flac     flac              [.] FLAC__fixed_compute_best_predictor_wide_intrin_avx2
   2.94%  flac     flac              [.] FLAC__stream_encoder_process
   2.26%  flac     flac              [.] FLAC__lpc_compute_lp_coefficients
   1.38%  flac     flac              [.] process_subframes_
   1.09%  flac     flac              [.] format_input.constprop.0
   1.07%  flac     flac              [.] FLAC__MD5Accumulate

So, here calculation of autocorrelation coefficients takes the most time. However, if I use preset 8p, I get this:

Code:
  36.18%  flac     flac              [.] FLAC__lpc_compute_residual_from_qlp_coefficients_16_intrin_avx2
  14.09%  flac     flac              [.] find_best_partition_order_.isra.0
  12.68%  flac     flac              [.] FLAC__precompute_partition_info_sums_intrin_avx2
  12.30%  flac     flac              [.] FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
   8.29%  flac     flac              [.] FLAC__lpc_compute_autocorrelation_intrin_fma_lag_16
   1.96%  flac     flac              [.] FLAC__MD5Transform
   1.92%  flac     flac              [.] FLAC__bitwriter_write_rice_signed_block
   1.84%  flac     flac              [.] FLAC__lpc_quantize_coefficients
   1.47%  flac     flac              [.] FLAC__lpc_window_data_partial
   1.23%  flac     flac              [.] FLAC__fixed_compute_best_predictor_wide_intrin_avx2

If I use -8p on 24-bit material, I get this:

Code:
  79.79%  flac     flac              [.] FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual
   4.52%  flac     flac              [.] FLAC__precompute_partition_info_sums_intrin_avx2
   3.38%  flac     flac              [.] find_best_partition_order_.isra.0
   2.93%  flac     flac              [.] FLAC__lpc_compute_residual_from_qlp_coefficients_intrin_avx2
   2.44%  flac     flac              [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
   1.49%  flac     flac              [.] FLAC__lpc_compute_autocorrelation_intrin_fma_lag_16
Music: sounds arranged such that they construct feelings.

Re: FLAC v1.5.x Performance Tests

Reply #15
Interesting.
I don't know if -p re-partitions the subframe during its search (and if not, whether that is done first or last), but if I have understood correctly:
It starts from "maximum" precision and, for each successive right-shift of the predictor, computes a new residual for the entire subframe, choosing the best one in the end?
And then with 24-bit signals, the speed takes a hit because it has to be done with 64-bit words?

If it is easy to run such profilings: How does it do fixed predictors?
Like, for example:
 * -0r0 --no-md5 for speed. -0r0 --no-md5 -b4096 I think is even faster, YMMV, but is that the bit-reading or what?
 * -2er8 because that is the most brute-forcing (subset) setting with fixed predictors

The -e also makes it compute a bunch of residuals I suppose - but with -l0 it should remain in the 32-bit world even for 24 bit signals?
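For readers unfamiliar with fixed predictors: in FLAC they are the polynomial predictors of order 0 to 4, so the residual of order n is the n-th finite difference of the signal. Below is a minimal Python sketch. The function names and the sum-of-magnitudes selection rule are assumptions for illustration, not how the encoder actually scores candidates.

```python
def fixed_residual(samples, order):
    # Order-n fixed predictor residual = n-th finite difference.
    # The first `order` warm-up samples are stored verbatim and dropped here.
    res = list(samples)
    for _ in range(order):
        res = [res[i] - res[i - 1] for i in range(1, len(res))]
    return res

def best_fixed_order(samples, max_order=4):
    # Illustrative selection rule (an assumption): smallest sum of |residual|.
    # Ties resolve to the lowest order.
    return min(range(max_order + 1),
               key=lambda o: sum(abs(e) for e in fixed_residual(samples, o)))

squares = [1, 4, 9, 16, 25, 36]
for order in range(5):
    print(order, fixed_residual(squares, order))
print(best_fixed_order(squares))
```

Because the difference weights are binomial coefficients, their magnitudes sum to 2^order, so an order-4 residual is at most 4 bits wider than the input. That is consistent with the guess that fixed-predictor subframes can stay in 32-bit arithmetic for 24-bit input.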


@Hakan Abbas, do you have anything similar for HALAC, which apparently does the residual writing very fast?

Re: FLAC v1.5.x Performance Tests

Reply #16
Kudos to all the developers for their work on FLAC 1.5.

@Porcus: Below is the profile data of the HALAC encoder and decoder.
[Images: HALAC encoder and decoder profile data]
The biggest bottleneck for the encoder is Rice encoding, i.e. binarization. Theoretically it should be close to Huffman in terms of speed, so it could run a bit faster. Obtaining (FORWARD_LINEAR_PREDICTION) and using the LPC coefficients is quite fast, as there is no dependency.

In fact, there is much less processing in the decode stage than in the encode stage. Again, the Rice decoder is not yet fully optimized. At this stage, what actually takes a long time is getting the required values back using the LPC coefficients (UNCOMPRESS). The main reason for the slowness here is dependency: each decoded value is needed to compute the next one, so you cannot move on to the next sample without resolving the current one. In this case, parallelization is not very efficient. SIMD is not used in HALAC, but the compiler can try to parallelize automatically where possible.
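The dependency described above can be sketched in a few lines of Python. This is a generic integer LPC example, not HALAC or FLAC code. The function names, the coefficients and the shift value are made up for illustration.

```python
def lpc_residual(samples, coeffs, shift):
    # Encoder side: every prediction reads only original samples,
    # so the iterations are independent of each other.
    order = len(coeffs)
    res = []
    for n in range(order, len(samples)):
        pred = sum(c * samples[n - 1 - j] for j, c in enumerate(coeffs)) >> shift
        res.append(samples[n] - pred)
    return res

def lpc_restore(warmup, residual, coeffs, shift):
    # Decoder side: every prediction reads samples that were just reconstructed,
    # so sample n cannot be computed before sample n-1 is finished.
    out = list(warmup)
    for e in residual:
        n = len(out)
        pred = sum(c * out[n - 1 - j] for j, c in enumerate(coeffs)) >> shift
        out.append(e + pred)
    return out

signal = [10, 12, 15, 19, 24, 30, 37]
coeffs = [2, -1]          # hypothetical already-quantized coefficients
res = lpc_residual(signal, coeffs, 0)
assert lpc_restore(signal[:2], res, coeffs, 0) == signal
```

The loop in `lpc_residual` could be split across SIMD lanes because all its inputs are known up front, whereas the loop in `lpc_restore` forms a serial chain.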

24-bit data is much larger than 16-bit data (and so are the error values). This can inherently add a bit more processing overhead to the linear prediction stage. I haven't used 24-bit data yet, but I wouldn't want any slowdown in terms of speed. However, the 79.79% processing load I see for FLAC doesn't seem normal.

Note: I'm currently working on a different project so I'm taking a break from HALAC.

Re: FLAC v1.5.x Performance Tests

Reply #17
24-bit data is much larger than 16-bit data (and so are the error values). This can inherently add a bit more processing overhead to the linear prediction stage. I haven't used 24-bit data yet, but I wouldn't want any slowdown in terms of speed. However, the 79.79% processing load I see for FLAC doesn't seem normal.
That happens when employing the "-p" switch, which brute-forces the predictor quantization. So if I am right, it calculates residuals over and over again rather than estimating new coefficients.

Also, using estimated coefficients rather than the fixed predictors, means you need to go up so many bits that there is not much chance that a 24 bit signal can be treated with a 32 bit datatype. But I think fixed-predictor subframes can.
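A rough worked calculation shows why the accumulator width becomes an issue. The sketch below uses a deliberately crude worst-case bound (every coefficient and every sample at maximum magnitude). The order and coefficient precisions passed in are assumed example values, not settings taken from this thread.

```python
def accumulator_bits(sample_bits, coeff_bits, order):
    # Crude worst case: `order` products of a full-scale sample
    # and a full-scale quantized coefficient, all with the same sign.
    max_sample = 1 << (sample_bits - 1)
    max_coeff = 1 << (coeff_bits - 1)
    worst_sum = order * max_sample * max_coeff
    return worst_sum.bit_length() + 1  # +1 for the sign bit

# Assumed example settings: predictor order 12, 12-bit and 15-bit coefficients
for sample_bits, coeff_bits in [(16, 12), (24, 15)]:
    bits = accumulator_bits(sample_bits, coeff_bits, 12)
    print(sample_bits, coeff_bits, bits, "fits in 32 bits" if bits <= 32 else "needs 64 bits")
```

A real encoder can use a tighter bound based on the actual coefficient values, so this only illustrates the order of magnitude: with 24-bit input there is little headroom left for coefficient precision in a 32-bit accumulator.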

Note: I'm currently working on a different project so I'm taking a break from HALAC.
It's of course up to you how much you want to share, but chances are your code could speed up the fastest FLAC. 

 

Re: FLAC v1.5.x Performance Tests

Reply #18
Also, using estimated coefficients rather than the fixed predictors, means you need to go up so many bits that there is not much chance that a 24 bit signal can be treated with a 32 bit datatype. But I think fixed-predictor subframes can.
In the case of FLAC, I don't know whether linear prediction is done on 24-bit data as it is. The residuals from transformations or predictions applied to 24-bit data can be mathematically larger. Overflow is then a possibility, though an unlikely one, and it can be avoided: each process has a rough range of desired or required results, so unnecessary overflows can be caught with a quick check. However, each of these checks requires branching.

Quote
It's of course up to you how much you want to share, but chances are your code could speed up the fastest FLAC.
FLAC is already quite fast and offers a good compression ratio. It is therefore always the standard codec of choice. It doesn't need to prove anything more. HALAC (and HALIC) is experimental and aims to improve the speed/compression ratio as much as possible. It does not claim to be a standard format. Without losing control I would like to add 24-bit and multi-channel support.

Re: FLAC v1.5.x Performance Tests

Reply #19
FLAC is already quite fast and offers a good compression ratio. It is therefore always the standard codec of choice. It doesn't need to prove anything more. HALAC (and HALIC) is experimental and aims to improve the speed/compression ratio as much as possible. It does not claim to be a standard format. Without losing control I would like to add 24-bit and multi-channel support.
I don't mean to change the format - but there must be something about your algorithm for reading bits from file and decoding them.

As for the LPC'ing, you probably know it better than I do, but some of the number-crunching is done in double-precision float (which explains why ffmpeg could sometimes compress better until 1.4.0). But then each coefficient is taken down to q bits, where q can be so big that the "predicted value before right-shift" exceeds the 32-bit range. Of course you can "right-shift the coefficients first" (potentially at the cost of a weaker prediction), and FLAC-the-format allows that too. The "-p" switch, which spends so much time computing residuals, works by brute-forcing that part: calculate with high precision, then reduce the precision one bit at a time and recalculate the residual, rinse and repeat, and select the overall smallest.
And that - I think, ktf has corrected me more than twice - is THE reason for the "-8p" profiling numbers.
(Turns out you save more than "one bit per coefficient", because randomly rounding off sometimes helps the prediction.)
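The search loop described here could look roughly like the following Python sketch. It is only an illustration of the idea under stated assumptions: the quantization rule, the cost proxy and all function names are made up, and FLAC's actual implementation differs in its details.

```python
import math

def quantize(coeffs, precision):
    # Scale float coefficients so the largest fits in `precision` signed bits.
    cmax = max(abs(c) for c in coeffs)
    shift = max(precision - 1 - (math.floor(math.log2(cmax)) + 1), 0)
    limit = (1 << (precision - 1)) - 1
    q = [max(-limit - 1, min(limit, round(c * (1 << shift)))) for c in coeffs]
    return q, shift

def residual(samples, qcoeffs, shift):
    order = len(qcoeffs)
    return [samples[n] - (sum(c * samples[n - 1 - j] for j, c in enumerate(qcoeffs)) >> shift)
            for n in range(order, len(samples))]

def cost_bits(res, precision, order):
    # Crude size proxy (an assumption): magnitude bits per residual
    # plus the side information for the stored coefficients.
    return sum(abs(e).bit_length() + 1 for e in res) + precision * order

def search_precision(samples, coeffs, max_precision=15, min_precision=5):
    # Brute force: requantize at every precision, recompute the whole residual,
    # and keep whichever candidate is estimated to be smallest.
    best = None
    for p in range(max_precision, min_precision - 1, -1):
        q, shift = quantize(coeffs, p)
        bits = cost_bits(residual(samples, q, shift), p, len(coeffs))
        if best is None or bits < best[0]:
            best = (bits, p, q, shift)
    return best

samples = [round(1000 * math.sin(0.1 * n)) for n in range(64)]
coeffs = [2 * math.cos(0.1), -1.0]   # ideal two-tap predictor for this sinusoid
print(search_precision(samples, coeffs))
```

Each candidate precision costs one full residual computation over the subframe, which fits the profile above where the residual functions dominate under -8p.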

Re: FLAC v1.5.x Performance Tests

Reply #20
First of all, I have a serious lack of knowledge about what FLAC does. So I can talk about the things I do, but this subject belongs to FLAC. So, I will speak more generally without going too far off topic.

You can see a graph of the LPC in the image on the left below. Here the dots show the sample values. The colored line shows how close the LPC gets to them. At the bottom you can see the LPC coefficients and average error values. The image on the right is something more special. It is an adaptive solution, so it works symmetrically, but the prediction accuracy is much better. And interestingly, the parameters obtained for a small block of audio data can be valid for a very large number of other different blocks. I have not used it in a real application as it might be a bit slow in terms of speed (2x or 3x).
   
With the coefficients from the linear estimation stage we can roughly represent the graph of our audio data. The higher the number of coefficients, the better the results, but they have to be kept as side information, which is a penalty. The resulting coefficients are 64-bit (8-byte) double-precision floats, although 32-bit (4-byte) floats can also be obtained. In my early work I stored these parameters in 4 bytes each, but later I reduced this to 2 bytes. There was no significant difference between the compression results, because although the prediction became slightly less accurate, the amount of side information per block decreased as well. This balance needs to be adjusted well. Of course, the conversion from 8 bytes to 2 bytes adds extra processing overhead.

As you mentioned, doing enough processing to be able to select among the parameters is a burden in terms of speed. On the other hand, the difference between FLAC -8 and -8p doesn't seem to be a significant gain in compression efficiency.

Re: FLAC v1.5.x Performance Tests

Reply #21
That was profiling of decoding.

Which functions take most time is heavily dependent on context: 16-bit or 24-bit, preset 0, 5 or 8, architecture, etc.

Here is a profile run for 16-bit input data, preset 8, x86-64 with AVX2, showing all functions that take more than 1% of time

Thanks. 

I did not look at the disassembly, but is FLAC__lpc_compute_autocorrelation_intrin_fma_lag_16 not worth explicit vectorization?

Re: FLAC v1.5.x Performance Tests

Reply #22
When I introduced the FMA-accelerated functions, I also wrote explicitly vectorized versions, but I found that the improvement did not outweigh the added complexity.

There are still a lot of FLAC things demanding my attention, so I cannot say whether I'll have time to look at improving them in the near future.

Re: FLAC v1.5.x Performance Tests

Reply #23
Nice work on the new FLAC release! Two questions from my side:
1. With the "official" Windows-64 binary and single-threaded operation, I get bit-identical encodings when not using the changed -M behavior, at least for some randomly selected CDDA audio lying around on my laptop. So no efficiency changes over version 1.4.3, other than the "exact Rice" thing (which isn't used by default), am I right?
2. Are you planning to make another encoder speed-performance figure incl. multithreaded FLAC, similar to http://audiograaf.nl/losslesstest/Lossless%20audio%20codec%20comparison%20-%20revision%206%20-%20cdda.html ?

Chrid
If I don't reply to your reply, it means I agree with you.

Re: FLAC v1.5.x Performance Tests

Reply #24
Nice work on the new FLAC release! Two questions from my side:
1. With the "official" Windows-64 binary and single-threaded operation, I get bit-identical encodings when not using the changed -M behavior, at least for some randomly selected CDDA audio lying around on my laptop. So no efficiency changes over version 1.4.3, other than the "exact Rice" thing (which isn't used by default), am I right?
I can't say that; it could very well be that a different compiler produces a binary that gives you ever so slightly smaller or bigger FLAC files. See https://xiph.org/flac/faq.html#tools__different_sizes, from which I quote:
Quote
Why doesn't the same file compressed on different machines with the same options yield the same FLAC file?

It's not supposed to, and neither does it mean either encoding was bad. There are many variations between different machines or even different builds of flac on the same machine that can lead to small differences in the FLAC file, even if they have the exact same final size. This is normal.

Quote
2. Are you planning to make another encoder speed-performance figure incl. multithreaded FLAC, similar to http://audiograaf.nl/losslesstest/Lossless%20audio%20codec%20comparison%20-%20revision%206%20-%20cdda.html ?
At some point, maybe, if time allows.