Topic: FLAC v1.4.x Performance Tests

Re: FLAC v1.4.x Performance Tests

Reply #475
It'll be interesting to see how APX (the biggest change being the doubled number of general-purpose registers) improves things. Probably ~10% in general (pure speculation), but every workload will be different.
The GCC 14.1 compiler was released today with some APX support. MSYS2 for Windows has not updated yet, but some Linux enthusiasts may help out.
https://www.phoronix.com/news/GCC-14.1-Released
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: FLAC v1.4.x Performance Tests

Reply #476
There are no CPUs with APX support yet, so I wouldn't be in too much of a hurry to build APX binaries.

Re: FLAC v1.4.x Performance Tests

Reply #477
There should be Intel CPUs with APX late this year at the earliest. I doubt it'll be in Zen 5; it might be in Zen 6, since designs are done years in advance and Intel will have kept it back for competitive advantage. I doubt the GCC support is ready; they have to start adding it now to make sure it's in good shape by next year.

 

Re: FLAC v1.4.x Performance Tests

Reply #478
ODD SMALL BLOCK SIZES (and a couple more normal-sized)

Because I didn't pay sufficient attention to my own testing, I ran something more - testing blocksizes that nobody wants to use. (Reason for that: make sure there is no partitioning for the residual, so there are no differences in that.)

Took the 219 minute CDDA .wav file and made 30 .flac files:
-0 and -5 at fifteen distinct block sizes, all odd numbers to be sure I didn't test different handling of partitioning.
Most are too low to be interesting! (Why? I'll explain at the end.)
17, 23, 31, 45, 65, 97, 147, 223, 341, 523, 803 <in between here is the -0 to -2 preset blocksize!> 1237, 1907, 2941 <in between here are the other presets> 4537 (ffmpeg uses the subset upper bound of 4608)
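
For readers wondering why odd block sizes rule out residual partitioning: FLAC splits a subframe's residual into 2^order partitions, so the block size has to be divisible by 2^order. Below is a minimal Python sketch of that divisibility constraint only. The function name is made up for illustration, and it ignores the tighter limits an actual encoder applies (predictor order, the preset's maximum partition order).

```python
def max_partition_order(blocksize: int, cap: int = 15) -> int:
    """Largest order such that 2**order divides blocksize, capped at `cap`.

    The partition order is stored in a 4-bit field, hence the default
    cap of 15. This models the divisibility constraint and nothing else.
    """
    if blocksize <= 0:
        raise ValueError("blocksize must be positive")
    order = 0
    while order < cap and blocksize % (2 ** (order + 1)) == 0:
        order += 1
    return order


# The fifteen odd block sizes from the post.
ODD_SIZES = [17, 23, 31, 45, 65, 97, 147, 223, 341, 523, 803,
             1237, 1907, 2941, 4537]

# Odd sizes next to the preset block sizes (1152, 4096) and ffmpeg's 4608.
for b in ODD_SIZES + [1152, 4096, 4608]:
    print(b, max_partition_order(b))
```

Every odd size comes out at order 0, i.e. a single partition, so the encoder has no partitioning choice that could differ between versions.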

Those were encoded repeatedly using different FLAC releases (Xiph builds ... I hope I didn't get that wrong!). Timings for the -0b<N> and -5b<N>:

To the right end of both diagrams, you see that 1.4.3 has gotten rid of the slowdown around the 4096 blocksize, which is good because that is a default.



Then decoding. Note, here the flac versions are different! Because the results were a bit surprising, I dug up 1.3.1 too (a tinytiny bit slower than 1.3.4, I deleted the latter to get the number of curves down) and also 1.2.1. Omitted is also 1.4.2 which is as good as tied to 1.4.3. Kept in both the 32-bit and 64-bit version of 1.4.3 (it being the current release).
And here the block sizes are visible, I lost them from the encoding charts ...

1.3 is much faster at the small block sizes! At the very smallest, 17 samples, it decodes in 22 seconds, 1.4 needs 40, 1.4 32-bit needs 60. That is a little bit of difference?!
The observation that 1.4.3 32-bit crosses 1.2.1 this way also indicates that something got lost since 1.3.
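
To put those decode times in perspective, here is a rough back-of-the-envelope sketch of how many frames the decoder has to handle at each block size, and what the 22 s vs 40 s gap works out to per frame. It assumes the file is exactly 219 minutes of 44.1 kHz audio; the real duration is only known to the minute, so treat the result as an order of magnitude.

```python
SAMPLE_RATE = 44_100      # CDDA sample rate
DURATION_S = 219 * 60     # assumption: exactly 219 minutes


def frame_count(blocksize: int) -> int:
    """Number of frames needed to hold the whole file at this block size."""
    total_samples = SAMPLE_RATE * DURATION_S
    return -(-total_samples // blocksize)  # ceiling division


frames_17 = frame_count(17)
frames_4096 = frame_count(4096)

# Extra decode time of 1.4.3 (64-bit) over 1.3.x at block size 17,
# spread over all frames, in microseconds per frame.
extra_us_per_frame = (40 - 22) / frames_17 * 1e6

print(f"frames at blocksize 17:   {frames_17}")
print(f"frames at blocksize 4096: {frames_4096}")
print(f"extra time per frame at 17: {extra_us_per_frame:.2f} microseconds")
```

With tens of millions of frames at block size 17, even a sub-microsecond cost per frame adds up to many seconds, while at 4096 the same per-frame cost would be lost in the noise.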

Note the difference between 64-bit builds is pretty much nothing at the reasonable block sizes. Don't overinterpret this.
Imgur album: https://imgur.com/a/JDAccOD


* Comparison with ffmpeg - which spends bonkers amounts of time at small blocks - at https://hydrogenaud.io/index.php/topic,125791.msg1044102.html#msg1044102

* Why these block sizes? I tried the minimum block size over in that thread, and since ffmpeg did so horribly, I set out to check (1) where it would start to behave normally, and (2) what is then ... normal? Reference implementation for comparison. And then it turned out, differences between versions ... and a new run.


Corpus: I took twenty-one of the CDs in my signature (7 in each broad subdivision), the first ten and a half minutes of each, merged to a 219 minute WAVE file.
Timing was done overnight with hyperfine, warmup + 5 runs with 30 seconds cooldown in between (ffmpeg last), and times are median of 5.
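
For anyone who wants to reproduce this kind of measurement without hyperfine, the scheme (warmup, a fixed number of timed runs, a pause between runs, median of the timings) can be sketched in a few lines of Python. This is not what hyperfine does internally; the function name and the exact placement of the cooldown are assumptions made for illustration.

```python
import statistics
import subprocess
import sys
import time


def median_runtime(cmd, runs=5, warmup=1, cooldown=0.0):
    """Run `cmd` untimed `warmup` times, then time it `runs` times.

    Sleeps `cooldown` seconds between timed runs and returns the median
    wall-clock time in seconds.
    """
    quiet = {"stdout": subprocess.DEVNULL, "stderr": subprocess.DEVNULL}
    for _ in range(warmup):
        subprocess.run(cmd, check=True, **quiet)
    timings = []
    for i in range(runs):
        if i > 0 and cooldown > 0:
            time.sleep(cooldown)
        start = time.perf_counter()
        subprocess.run(cmd, check=True, **quiet)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in command so the sketch runs anywhere; swap in the real
# encode or decode command line (and cooldown=30) for an actual test.
demo_cmd = [sys.executable, "-c", "pass"]
print(f"median: {median_runtime(demo_cmd):.3f} s")
```

The median is less sensitive than the mean to a single run disturbed by background activity, which matters for unattended overnight runs.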


Re: FLAC v1.4.x Performance Tests

Reply #479
There should be Intel CPUs with APX late this year at the earliest.
And fragmentation continues. AVX10 is hailed to "clean up the mess of AVX-512" but all I see is even more fragmentation. flac on x86 will be an incredibly fat binary with SSE2, SSSE3, SSE4.1, AVX2, AVX512 and AVX10 code paths. Because while Intel is dropping AVX512 for AVX10, AMD just started on AVX512 and CPUs without any AVX are still being sold, so it is not like SSE can be dropped anytime soon.

1.3 is much faster at the small block sizes! At the very smallest, 17 samples, it decodes in 22 seconds, 1.4 needs 40, 1.4 32-bit needs 60. That is a little bit of difference?!
I think it has to do with the extensive checks that were added in FLAC 1.4.0 for handling corrupt audio. Also, I think the binary is currently too fat, so it impacts branch prediction, and that effect is most pronounced at small blocksizes.
Music: sounds arranged such that they construct feelings.

Re: FLAC v1.4.x Performance Tests

Reply #480
I think it has to do with the extensive checks that have been added to FLAC 1.4.0 for handling corrupt audio.
Fair enough if so.
(I thought of that possibility, but it didn't strike me as something that should depend on the number of blocks. Like, if you decode something that exceeds 16 bit when it shouldn't, I was thinking that you probably check that per sample. But what do I know.)

Re: FLAC v1.4.x Performance Tests

Reply #481
I am really just guessing here. The bounds check is indeed per sample, but there are also additional checks for missing blocks and such. Even the per-sample checks can affect the per-frame overhead if the per-frame "loop" no longer fits the branch prediction history.

Re: FLAC v1.4.x Performance Tests

Reply #482
Also, I think the binary is currently too fat, so it impacts branch prediction, which is most profound on small blocksizes.

Too fat in binary size or amount of code?  For me, FLAC built with GCC 12.x produces the smallest binary, and at least on my system has the best performance, even if by a very small amount.

Static binary size for x86_64:
 GCC-11 = 701.4k
 GCC-12 = 689.4k
 GCC-13 = 733.4k
 GCC-14 = 733.4k


Re: FLAC v1.4.x Performance Tests

Reply #483
Do you see any speed improvement using GCC 14 over GCC  13?

Re: FLAC v1.4.x Performance Tests

Reply #484
There should be Intel CPUs with APX late this year at the earliest.
And fragmentation continues. AVX10 is hailed to "clean up the mess of AVX-512" but all I see is even more fragmentation. flac on x86 will be an incredibly fat binary with SSE2, SSSE3, SSE4.1, AVX2, AVX512 and AVX10 code paths. Because while Intel is dropping AVX512 for AVX10, AMD just started on AVX512 and CPUs without any AVX are still being sold, so it is not like SSE can be dropped anytime soon.

APX is mostly unrelated to AVX10. Being a general core improvement instead of another SIMD category, I quite like the idea of APX; it seems long overdue. But it does raise the question: can you have a fat binary containing both an APX and an x86_64 non-APX path? I've just assumed there'll have to be three binaries: x86, x86_64 and APX. With luck it's not as fragmented as it sounds; anything with APX should have AVX10.256 or AVX10.512, right? Surely...

For Intel, AVX10 is about hitting the snooze button on SIMD, given that they failed to make AVX512 good for years, to the point that AMD ate their lunch with Zen 4. Technically, 256-bit AVX10 is superior to AVX2; however, all projects for which SIMD is relevant already have an AVX2 path, so its impact will be minimal for many years. There should be some workloads that do benefit: the same ones that already benefit from AVX512 for its non-512-bit additions.

For full coverage there would have to be an AVX10.256 and an AVX10.512 path, although OTOH I can't remember if there's much difference between AVX10.512 and AVX512. Thanks, Intel, clearly this wasn't an obfuscating table-flip move at all. At least the AVX10.256 path can probably be adapted relatively easily from the AVX512 or AVX10.512 paths, in the same way that everything up to and including AVX2 follows the same kind of patterns.

Intel CPUs haven't had a very good half decade, really. They're screwing up SIMD; heterogeneous CPUs are IMO a mistake, at least the way they're doing it; and the CPUs run at insane power draw OOTB, which is starting to prematurely degrade the latest generations, all to gain a few poxy points in benchmarks. Not exactly ideal.

Re: FLAC v1.4.x Performance Tests

Reply #485
Do you see any speed improvement using GCC 14 over GCC  13?

I haven't compared builds with 13 and 14 directly.  I can run some tests later if you have anything specific in mind.

I can also upload Linux binaries later.  I don't have GCC 14 to build Windows binaries.

Re: FLAC v1.4.x Performance Tests

Reply #486
No hurry. Msys2 will update soon enough.

Re: FLAC v1.4.x Performance Tests

Reply #487
MSYS2 just updated, and with the same compiler settings the GCC 14.1.0 AVX2 build is slightly bigger and slower.

Re: FLAC v1.4.x Performance Tests

Reply #488
I did some limited testing.  FLAC built with gcc 12 seems to beat gcc 13 & 14 builds for 16-bit audio, but FLAC built with gcc 14 does better with 24-bit audio.

Re: FLAC v1.4.x Performance Tests

Reply #489
I will have to play around more. If someone wants a 14.1.0 AVX2 version, I attached one.

Re: FLAC v1.4.x Performance Tests

Reply #490
FLAC built with GCC 12, 13 & 14.  Times are the average of 3 runs.

Wav:  16-bit/44.1 kHz  2h41m  1.60GiB
                Generic Build  |  No ASM Optimizations
GCC-12 1 thread:    1m25.343s  |  1m9.162s
GCC-13 1 thread:    1m24.958s  |  1m9.218s
GCC-14 1 thread:    1m25.622s  |  1m9.629s

GCC-12 8 threads:   0m17.890s  |  0m14.816s
GCC-13 8 threads:   0m18.637s  |  0m15.656s
GCC-14 8 threads:   0m18.854s  |  0m15.820s

GCC-12 16 threads:  0m16.536s  |  0m13.916s
GCC-13 16 threads:  0m17.404s  |  0m14.802s
GCC-14 16 threads:  0m17.598s  |  0m14.985s

Wav: 24-bit/48 kHz  2h33m  2.47GiB
GCC-12 1 thread:    6m14.467s
GCC-13 1 thread:    6m13.127s
GCC-14 1 thread:    6m12.464s

GCC-12 8 threads:   1m17.415s
GCC-13 8 threads:   1m17.558s
GCC-14 8 threads:   1m17.088s

GCC-12 16 threads:  1m16.173s
GCC-13 16 threads:  1m15.234s
GCC-14 16 threads:  1m14.989s

Re: FLAC v1.4.x Performance Tests

Reply #491
Thanks for the info. GCC 14 loses the most here with 16-bit audio and disabled asm optimizations.
24-bit files encode way too slowly with asm optimizations disabled, with all versions, IMHO.

Re: FLAC v1.4.x Performance Tests

Reply #492
Thanks for the info. GCC 14 loses the most here with 16-bit audio and disabled asm optimizations.
24-bit files encode way too slowly with asm optimizations disabled, with all versions, IMHO.

That's why I didn't bother with results for the 24-bit audio without asm optimizations.  The single threaded runs take about 3 minutes longer to encode.

Re: FLAC v1.4.x Performance Tests

Reply #493
Regarding small block sizes and ffmpeg, I wrote an explanation that you ignore all the time.
Please remove my account from this forum.

Re: FLAC v1.4.x Performance Tests

Reply #494
Yeah, that is in the other thread - you explained why it does worse, but that raises the question of when it starts doing horribly.
The good news is that it behaves sanely at the reference encoder's defaults. Also in the other thread.

Re: FLAC v1.4.x Performance Tests

Reply #495
Binaries are generic x86_64 builds, built with GCC 12.3.0
Wav: 24-bit/48 kHz  2h33m  2.47GiB
CPU: Ryzen 5850U

flac git-04532802 (2024-05-02)
     1 thread    4 threads   8 threads
-5   0m14.505s   0m4.750s    0m4.733s
-5p  0m34.664s   0m10.664s   0m7.061s
-8   0m54.564s   0m16.920s   0m10.359s
-8p  6m13.161s   1m57.957s   1m14.785s

flac git-1ab3c8e7 (2024-05-15)
"Improve calculation of when to use wide residual computation. This change should make 24-bit encoding faster, because the limit_residual variant of residual computation is used less often"
     1 thread    4 threads   8 threads
-5   0m13.705s   0m4.586s    0m4.719s
-5p  0m24.876s   0m7.702s    0m5.618s
-8   0m42.542s   0m13.355s   0m8.253s
-8p  3m49.137s   1m14.035s   0m47.345s

Re: FLAC v1.4.x Performance Tests

Reply #496
Great that you're also seeing improvements. This is a change for which the improvement is highly dependent on the source material, so it is probably not going to show up for everyone. For your source material the improvement is more pronounced with presets 8 and 8p; in the tests I did, it showed up specifically for 5 and 5p.

Re: FLAC v1.4.x Performance Tests

Reply #497
I am still wondering why GCC 14 is slower than the older GCC 13 with all option variants on my AMD 5900X.
If someone wants the current git-cfe3afca built with 14.1.0 for testing, here it is as generic, AVX2, and AVX2 with asm disabled.

Re: FLAC v1.4.x Performance Tests

Reply #498
Same setup as my previous post.

flac git-cfe3afca (2024-05-16)
"Further improve calculation of when to use wide residual computation"
     1 thread    4 threads   8 threads
-5   0m13.561s   0m4.579s    0m4.598s
-5p  0m23.162s   0m7.237s    0m5.396s
-8   0m37.276s   0m11.841s   0m7.287s
-8p  2m52.772s   0m57.365s   0m36.769s
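
To make the three single-thread columns easier to compare, here is a small Python sketch that converts the `time`-style stamps to seconds and reports how much encode time each later commit saves relative to git-04532802. The helper names are made up for illustration; the numbers are copied from the tables posted in this thread.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a stamp such as '6m13.161s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if m is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def reduction(before: str, after: str) -> float:
    """Fraction of encode time saved going from `before` to `after`."""
    return 1.0 - to_seconds(after) / to_seconds(before)


# Single-thread timings from the three posted tables.
RESULTS = {
    "git-04532802": {"-5": "0m14.505s", "-5p": "0m34.664s",
                     "-8": "0m54.564s", "-8p": "6m13.161s"},
    "git-1ab3c8e7": {"-5": "0m13.705s", "-5p": "0m24.876s",
                     "-8": "0m42.542s", "-8p": "3m49.137s"},
    "git-cfe3afca": {"-5": "0m13.561s", "-5p": "0m23.162s",
                     "-8": "0m37.276s", "-8p": "2m52.772s"},
}

baseline = RESULTS["git-04532802"]
for commit in ("git-1ab3c8e7", "git-cfe3afca"):
    for preset, stamp in RESULTS[commit].items():
        saved = reduction(baseline[preset], stamp)
        print(f"{commit} {preset}: {saved:.1%} less time than git-04532802")
```

The same helper works on the 4-thread and 8-thread columns if those stamps are substituted in.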

Re: FLAC v1.4.x Performance Tests

Reply #499
These are almost impressive numbers :)