FLAC v1.4.x Performance Tests

Re: FLAC v1.4.x Performance Tests

Reply #475 – 2024-05-07 16:09:35

Quote from: cid42 on 2024-05-07 09:58:37

It'll be interesting to see how APX (the biggest thing being doubled registers) improves things. Probably ~10% (pure speculation) in general but every workload will be different.

GCC 14.1 was released today with some APX support. MSYS2 for Windows has not updated yet, but some Linux enthusiasts may help out.
https://www.phoronix.com/news/GCC-14.1-Released

Reply #476 – 2024-05-07 17:26:23

There are no CPUs with APX support yet, so I wouldn't be in too much of a hurry to build APX binaries.

Reply #477 – 2024-05-07 18:20:52

Intel CPUs with APX should arrive late this year at the earliest. I doubt it'll be in Zen5; it might be in Zen6, since designs are done years in advance and Intel will have kept it back for competitive advantage. I doubt the GCC support is ready yet; they have to start adding it now to make sure it's in good shape by next year.

Reply #478 – 2024-05-07 23:24:21

ODD SMALL BLOCK SIZES (and a couple more normal-sized)

Because I didn't pay sufficient attention to my own testing, I ran something more - testing blocksizes that nobody wants to use. (Reason for that: make sure there is no partitioning for the residual, so there are no differences in that.)

Took the 219 minute CDDA .wav file and made 30 .flac files:
-0 and -5 at fifteen distinct block sizes, all odd numbers to be sure I didn't test different handling of partitioning.
Most are too low to be interesting! (Why? I'll explain at the end.)
17, 23, 31, 45, 65, 97, 147, 223, 341, 523, 803 <in between here is the -0 to -2 preset blocksize!> 1237, 1907, 2941 <in between here are the other presets> 4537 (ffmpeg uses the subset upper bound of 4608)
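
The reasoning behind the odd block sizes can be sketched as follows. In FLAC, a residual partition order p splits the block into 2^p partitions, so the block size must be divisible by 2^p. An odd block size therefore leaves only order 0, i.e. a single partition. This is a minimal sketch of that divisibility rule only, not the encoder's actual code: the function name is made up for illustration, and it ignores the other limits an encoder applies (predictor order, the format's cap on partition order).

```python
def max_partition_order(blocksize: int) -> int:
    """Largest p such that blocksize is divisible by 2**p (divisibility rule only)."""
    if blocksize <= 0:
        raise ValueError("blocksize must be positive")
    order = 0
    while blocksize % 2 == 0:
        blocksize //= 2
        order += 1
    return order


# The fifteen block sizes from the post: all odd, so none can be partitioned.
tested = [17, 23, 31, 45, 65, 97, 147, 223, 341, 523, 803, 1237, 1907, 2941, 4537]
assert all(max_partition_order(b) == 0 for b in tested)

# Common block sizes for comparison.
for b in (1152, 4096, 4608):
    print(b, max_partition_order(b))  # 1152 -> 7, 4096 -> 12, 4608 -> 9
```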

Those were encoded repeatedly using different FLAC releases (Xiph builds ... I hope I didn't get that wrong!). Timings for the -0b<N> and -5b<N>:

To the right end of both diagrams, you see that 1.4.3 has gotten rid of the slowdown around the 4096 blocksize, which is good because that is a default.

Then decoding. Note, here the flac versions are different! Because the results were a bit surprising, I dug up 1.3.1 too (a tinytiny bit slower than 1.3.4, I deleted the latter to get the number of curves down) and also 1.2.1. Omitted is also 1.4.2 which is as good as tied to 1.4.3. Kept in both the 32-bit and 64-bit version of 1.4.3 (it being the current release).
And here the block sizes are visible, I lost them from the encoding charts ...

1.3 is much faster at the small block sizes! At the very smallest, 17 samples, it decodes in 22 seconds, while 1.4 needs 40 and 1.4 32-bit needs 60. That is a little bit of difference?!
The observation that 1.4.3 32-bit crosses 1.2.1 this way also indicates that something got lost since 1.3.

Note the difference between 64-bit builds is pretty much nothing at the reasonable block sizes. Don't overinterpret this.
Imgur album: https://imgur.com/a/JDAccOD

* Comparison with ffmpeg - which spends bonkers amounts of time at small blocks - at https://hydrogenaud.io/index.php/topic,125791.msg1044102.html#msg1044102

* Why these block sizes? I tried the minimum block size over in that thread, and since ffmpeg did so horribly, I set out to check (1) where it would start to behave normally, and (2) what is then ... normal? Hence the reference implementation for comparison. And then it turned out there were differences between versions ... hence a new run.

Corpus: I took twenty-one of the CDs in my signature (7 in each broad subdivision), the first ten and a half minutes of each, merged to a 219 minute WAVE file.
Timing was done overnight with hyperfine, warmup + 5 runs with 30 seconds cooldown in between (ffmpeg last), and times are median of 5.
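
For anyone who wants to reproduce the timing method without hyperfine, the procedure described (warmup, 5 timed runs, cooldown, median) can be sketched in Python. This is an illustration under assumptions, not the setup actually used: the function name and the placeholder command are made up, and the cooldown is assumed to sit between timed runs.

```python
import statistics
import subprocess
import sys
import time


def median_runtime(cmd, runs=5, warmup=1, cooldown=0.0):
    """Run cmd `warmup` times untimed, then `runs` times timed; return the median in seconds."""
    def run_once():
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

    for _ in range(warmup):
        run_once()  # warm the file cache; result discarded

    samples = []
    for i in range(runs):
        if i > 0 and cooldown > 0:
            time.sleep(cooldown)  # pause between timed runs
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; substitute the real
    # decoder invocation and set cooldown=30 to mirror the description above.
    cmd = [sys.executable, "-c", "pass"]
    print(f"median: {median_runtime(cmd):.3f} s")
```

The median is less sensitive than the mean to a single run disturbed by background activity, which matters for an overnight batch.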

Reply #479 – 2024-05-08 10:37:40

Quote from: cid42 on 2024-05-07 18:20:52

Intel CPUs with APX should arrive late this year at the earliest.

And fragmentation continues. AVX10 is hailed to "clean up the mess of AVX-512" but all I see is even more fragmentation. flac on x86 will be an incredibly fat binary with SSE2, SSSE3, SSE4.1, AVX2, AVX512 and AVX10 code paths. Because while Intel is dropping AVX512 for AVX10, AMD just started on AVX512 and CPUs without any AVX are still being sold, so it is not like SSE can be dropped anytime soon.
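
The "fat binary" idea is that the binary carries every code path and picks the widest one the CPU supports at run time, which is why the older paths cannot be dropped while such CPUs are still sold. A minimal sketch of that selection logic, assuming feature flags spelled as in Linux's /proc/cpuinfo; this is not libFLAC's dispatch code, the path labels are hypothetical, and AVX10 is left out because no flag name is assumed for it here.

```python
# Widest path first. Left: flag spelling as in Linux /proc/cpuinfo.
# Right: hypothetical labels for the code paths named in the post.
PREFERENCE = [
    ("avx512f", "AVX-512"),
    ("avx2", "AVX2"),
    ("sse4_1", "SSE4.1"),
    ("ssse3", "SSSE3"),
    ("sse2", "SSE2"),
]


def pick_code_path(cpu_flags):
    """Return the label of the widest code path whose flag is present."""
    flags = set(cpu_flags)
    for flag, path in PREFERENCE:
        if flag in flags:
            return path
    return "generic C"  # fallback when no listed extension is available


# A CPU without any AVX still gets a SIMD path, just a narrower one.
print(pick_code_path({"sse2", "ssse3", "sse4_1"}))
print(pick_code_path({"sse2", "ssse3", "sse4_1", "avx2"}))
```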

Quote from: Porcus on 2024-05-07 23:24:21

1.3 is much faster at the small block sizes! At the very smallest, 17 samples, it decodes in 22 seconds, while 1.4 needs 40 and 1.4 32-bit needs 60. That is a little bit of difference?!

I think it has to do with the extensive checks that were added in FLAC 1.4.0 for handling corrupt audio. Also, I think the binary is currently too fat, so it impacts branch prediction, which is most pronounced at small block sizes.

Reply #480 – 2024-05-08 11:38:29

Quote from: ktf on 2024-05-08 10:37:40

I think it has to do with the extensive checks that have been added to FLAC 1.4.0 for handling corrupt audio.

Fair enough if so.
(I thought of that possibility, but it didn't strike me as something that should depend on the number of blocks. Like, if you decode something that exceeds 16 bit when it shouldn't, I was thinking that you probably check that per sample. But what do I know.)

Reply #481 – 2024-05-08 13:25:33

I am really just guessing here. The bounds check is indeed per sample, but there are also additional checks for missing blocks and such. Even the per-sample checks can affect the per-frame overhead if the per-frame "loop" no longer fits the branch prediction history.
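
To put a rough scale on the per-frame overhead, the figures quoted earlier in the thread (219 minutes of CDDA, 17-sample blocks, 22 s for 1.3 versus 40 s and 60 s for 1.4) can be turned into a back-of-envelope estimate. It assumes the file is exactly 219 minutes at 44.1 kHz and that the whole time difference is per-frame cost, so treat the result as an order of magnitude only.

```python
SAMPLE_RATE = 44_100        # CDDA sample rate, Hz
DURATION_S = 219 * 60       # the "219 minute" file, taken as exact (assumption)
BLOCK_SIZE = 17             # smallest block size tested

total_samples = SAMPLE_RATE * DURATION_S                # samples per channel
frames = (total_samples + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division


def extra_ns_per_frame(slow_s, fast_s, n_frames):
    """Extra decode time per frame in nanoseconds, if all of it is per-frame cost."""
    return (slow_s - fast_s) / n_frames * 1e9


print(f"frames: {frames:,}")
print(f"1.4 64-bit vs 1.3: {extra_ns_per_frame(40, 22, frames):.0f} ns per frame")
print(f"1.4 32-bit vs 1.3: {extra_ns_per_frame(60, 22, frames):.0f} ns per frame")
```

With roughly 34 million frames, the gap works out to about half a microsecond per frame for the 64-bit build, small enough that a handful of extra checks or mispredicted branches per frame could plausibly account for it.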

Reply #482 – 2024-05-08 15:20:13

Quote from: ktf on 2024-05-08 10:37:40

Also, I think the binary is currently too fat, so it impacts branch prediction, which is most pronounced at small block sizes.

Too fat in binary size or amount of code? For me, FLAC built with GCC 12.x produces the smallest binary, and at least on my system has the best performance, even if by a very small amount.

Static binary size for x86_64:
GCC-11 = 701.4k
GCC-12 = 689.4k
GCC-13 = 733.4k
GCC-14 = 733.4k

Reply #483 – 2024-05-08 15:25:09

Do you see any speed improvement using GCC 14 over GCC 13?

Reply #484 – 2024-05-08 15:53:52

Quote from: ktf on 2024-05-08 10:37:40

Quote from: cid42 on 2024-05-07 18:20:52
Intel CPUs with APX should arrive late this year at the earliest.
And fragmentation continues. AVX10 is hailed to "clean up the mess of AVX-512" but all I see is even more fragmentation. flac on x86 will be an incredibly fat binary with SSE2, SSSE3, SSE4.1, AVX2, AVX512 and AVX10 code paths. Because while Intel is dropping AVX512 for AVX10, AMD just started on AVX512 and CPUs without any AVX are still being sold, so it is not like SSE can be dropped anytime soon.

APX is mostly unrelated to AVX10. Being a general core improvement instead of another SIMD category, I quite like the idea of APX; it seems long overdue. But it does raise the question: can you have a fat binary containing both an APX and an x86_64 non-APX path? I've just assumed there'll have to be 3 binaries: x86, x86_64 and APX. With luck it's not as fragmented as it sounds; anything with APX should have AVX10.256 or AVX10.512, right? Surely...

For Intel, AVX10 is about hitting the snooze button on SIMD, given that they failed to make AVX512 good for years, to the point that AMD ate their lunch with Zen4. Technically AVX10 256-bit is superior to AVX2; however, all projects for which SIMD is relevant already have an AVX2 path, so its impact will be minimal for many years. There should be some workloads that do benefit: the same ones that already benefit from AVX512 for its non-512-bit additions.

For full coverage there would have to be an AVX10.256 and an AVX10.512 path, although OTOH I can't remember if there's much difference between AVX10.512 and AVX512. Thanks, Intel, clearly this wasn't an obfuscating table flip move at all. At least the AVX10.256 path can probably be adapted relatively easily from the AVX512 or AVX10.512 paths, in the same way that everything up to and including AVX2 follows the same kind of patterns.

Intel CPUs haven't had a very good half decade really. They're screwing up SIMD; heterogeneous CPUs are IMO a mistake, at least the way they're doing it; and the CPUs run at insane power draw OOTB, which is starting to prematurely degrade the latest generations, all to gain a few poxy points in benchmarks. Not exactly ideal.

Reply #485 – 2024-05-08 16:09:07

Quote from: Wombat on 2024-05-08 15:25:09

Do you see any speed improvement using GCC 14 over GCC 13?

I haven't compared builds with 13 and 14 directly. I can run some tests later if you have anything specific in mind.

I can also upload Linux binaries later. I don't have GCC 14 to build Windows binaries.

Reply #486 – 2024-05-08 16:25:37

No hurry. Msys2 will update soon enough.

Reply #487 – 2024-05-09 16:13:03

MSYS2 just updated and, using the same compiler settings, the GCC 14.1.0 AVX2 build is slightly bigger and slower.

Reply #488 – 2024-05-09 16:45:34

I did some limited testing. FLAC built with gcc 12 seems to beat gcc 13 & 14 builds for 16-bit audio, but FLAC built with gcc 14 does better with 24-bit audio.

Reply #489 – 2024-05-09 16:54:56

I will have to play around more. If someone wants a 14.1.0 AVX2 version, I attached one.

Reply #490 – 2024-05-10 01:50:32

FLAC built with GCC 12, 13 & 14. Times are the average of 3 runs.

Wav: 16-bit/44.1 kHz 2h41m 1.60GiB

                Generic Build  |  No ASM Optimizations
GCC-12 1 thread:    1m25.343s  |  1m9.162s
GCC-13 1 thread:    1m24.958s  |  1m9.218s
GCC-14 1 thread:    1m25.622s  |  1m9.629s

GCC-12 8 threads:   0m17.890s  |  0m14.816s
GCC-13 8 threads:   0m18.637s  |  0m15.656s
GCC-14 8 threads:   0m18.854s  |  0m15.820s

GCC-12 16 threads:  0m16.536s  |  0m13.916s
GCC-13 16 threads:  0m17.404s  |  0m14.802s
GCC-14 16 threads:  0m17.598s  |  0m14.985s

Wav: 24-bit/48 kHz 2h33m 2.47GiB

GCC-12 1 thread:    6m14.467s
GCC-13 1 thread:    6m13.127s
GCC-14 1 thread:    6m12.464s

GCC-12 8 threads:   1m17.415s
GCC-13 8 threads:   1m17.558s
GCC-14 8 threads:   1m17.088s

GCC-12 16 threads:  1m16.173s
GCC-13 16 threads:  1m15.234s
GCC-14 16 threads:  1m14.989s
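
The differences are easier to judge as percentages. A small sketch that parses the `XmY.ZZZs` stamps and compares GCC 14 against GCC 12; the figures are copied from the tables above (generic build), and the helper names are made up for illustration.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a stamp such as '1m25.343s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def percent_change(baseline: str, other: str) -> float:
    """Change of `other` relative to `baseline` in percent; positive means slower."""
    base = to_seconds(baseline)
    return (to_seconds(other) - base) / base * 100.0


# (GCC-12, GCC-14) pairs, generic build, copied from the tables above.
rows = {
    "16-bit,  1 thread ": ("1m25.343s", "1m25.622s"),
    "16-bit, 16 threads": ("0m16.536s", "0m17.598s"),
    "24-bit,  1 thread ": ("6m14.467s", "6m12.464s"),
    "24-bit, 16 threads": ("1m16.173s", "1m14.989s"),
}

for label, (gcc12, gcc14) in rows.items():
    print(f"{label}: GCC 14 vs GCC 12 {percent_change(gcc12, gcc14):+.2f}%")
```

With averages of only 3 runs, sub-percent differences such as the single-threaded ones are probably within run-to-run noise.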

Reply #491 – 2024-05-10 02:01:44

Thanks for the info. GCC 14 loses the most here with 16-bit audio and disabled asm optimizations.
24-bit files encode way too slowly with disabled asm optimizations with all versions, IMHO.

Reply #492 – 2024-05-10 02:24:55

Quote from: Wombat on 2024-05-10 02:01:44

Thanks for the info. GCC 14 loses the most here with 16-bit audio and disabled asm optimizations.
24-bit files encode way too slowly with disabled asm optimizations with all versions, IMHO.

That's why I didn't bother with results for the 24-bit audio without asm optimizations. The single threaded runs take about 3 minutes longer to encode.

Reply #493 – 2024-05-10 07:59:24

Quote from: Porcus on 2024-05-07 23:24:21

ODD SMALL BLOCK SIZES (and a couple more normal-sized)

Regarding small block sizes and ffmpeg, I wrote an explanation that you keep ignoring.

Reply #494 – 2024-05-10 13:20:21

Yeah, that is in the other thread - you explained why it does worse, but that raises the question of when it starts doing horribly.
The good news is that it behaves sanely at the reference encoder's defaults. Also in the other thread.

Reply #495 – 2024-05-16 05:41:49

Binaries are generic x86_64 builds, built with GCC 12.3.0
Wav: 24-bit/48 kHz 2h33m 2.47GiB
CPU: Ryzen 5850U

flac git-04532802 (2024-05-02)

     1 thread    4 threads   8 threads
-5   0m14.505s   0m4.750s    0m4.733s
-5p  0m34.664s   0m10.664s   0m7.061s
-8   0m54.564s   0m16.920s   0m10.359s
-8p  6m13.161s   1m57.957s   1m14.785s

flac git-1ab3c8e7 (2024-05-15)
"Improve calculation of when to use wide residual computation. This change should make 24-bit encoding faster, because the limit_residual variant of residual computation is used less often"

     1 thread    4 threads   8 threads
-5   0m13.705s   0m4.586s    0m4.719s
-5p  0m24.876s   0m7.702s    0m5.618s
-8   0m42.542s   0m13.355s   0m8.253s
-8p  3m49.137s   1m14.035s   0m47.345s

Reply #496 – 2024-05-16 06:39:11

Great that you're also seeing improvements. The size of the improvement from this change is highly dependent on the source material, so it is probably not going to show up for everyone. For your source material the improvement is more pronounced with presets 8 and 8p, whereas in the tests I did it was specifically with 5 and 5p.

Reply #497 – 2024-05-16 16:05:07

I am still puzzled that GCC 14 is slower than the older GCC 13 with all option variants on my AMD 5900X.
If someone wants the current git-cfe3afca built with 14.1.0 for testing, here it is as generic, AVX2, and AVX2 with disabled asm.

Reply #498 – 2024-05-16 17:13:07

Same setup as my previous post.

flac git-cfe3afca (2024-05-16)
"Further improve calculation of when to use wide residual computation"

     1 thread    4 threads   8 threads
-5   0m13.561s   0m4.579s    0m4.598s
-5p  0m23.162s   0m7.237s    0m5.396s
-8   0m37.276s   0m11.841s   0m7.287s
-8p  2m52.772s   0m57.365s   0m36.769s
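
Putting the three revisions side by side, the cumulative single-threaded speedup per preset can be computed from the 1-thread columns of the three tables in this thread. A minimal sketch; the times are copied from those tables, and the dictionary layout is just for illustration.

```python
# 1-thread times as (minutes, seconds) for, in order:
# git-04532802 (2024-05-02), git-1ab3c8e7 (2024-05-15), git-cfe3afca (2024-05-16)
ONE_THREAD = {
    "-5":  [(0, 14.505), (0, 13.705), (0, 13.561)],
    "-5p": [(0, 34.664), (0, 24.876), (0, 23.162)],
    "-8":  [(0, 54.564), (0, 42.542), (0, 37.276)],
    "-8p": [(6, 13.161), (3, 49.137), (2, 52.772)],
}


def speedups(times):
    """Speedup of each revision relative to the first (first entry is 1.0)."""
    secs = [minutes * 60 + seconds for minutes, seconds in times]
    return [secs[0] / t for t in secs]


for preset, times in ONE_THREAD.items():
    _, after_first, after_second = speedups(times)
    print(f"{preset:4s} {after_first:.2f}x after 1ab3c8e7, {after_second:.2f}x after cfe3afca")
```

On this 24-bit file, -8p ends up a bit more than twice as fast as before the two changes, while -5 gains well under 10%.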

Reply #499 – 2024-05-16 17:20:03

These are almost impressive numbers
