FLAC v1.4.x Performance Tests

Re: FLAC v1.4.x Performance Tests

Reply #125 – 2022-10-11 21:09:14

I'm planning something to keep y'all busy: https://github.com/xiph/flac/pull/476

This might make compiling with -march=native much more rewarding when combined with --disable-asm-optimizations. I've changed the code in such a way that it is much easier for a compiler to vectorize. Currently the intrinsics routines cannot really be tuned by a compiler, but with this change a compiler can use the C code to get an even better result.

I've seen improvements of over 10% with preset 8, when run with -march=native. One could then use AVX512 for example. I don't have access to hardware with AVX512, so I can't say whether that would make sense.

Also, as a bug was found in libFLAC that affects playback with gstreamer, I won't wait long with releasing.

Re: FLAC v1.4.x Performance Tests

Reply #126 – 2022-10-11 22:30:56

Quote from: ktf on 2022-10-11 21:09:14

much easier to vectorize by a compiler.

Speaking of which, GCC 12 vectorizes even at -O2:

Quote from: https://gcc.gnu.org/gcc-12/changes.html

Vectorization is enabled at -O2 which is now equivalent to the original -O2 -ftree-vectorize -fvect-cost-model=very-cheap.

Not sure how much of an impact this particular change has here given that pretty much everyone just builds FLAC with -O3, but it's interesting nonetheless.

Re: FLAC v1.4.x Performance Tests

Reply #127 – 2022-10-12 01:47:09

I took the flags GrieverV used back with version 1.3.3. I left out the math flags because they are already part of -Ofast.
The effect of each compile option on its own is hard to measure here, but using them all together produces clearly smaller binaries with a small speed advantage.
A -8p single-file encode is now at ~110x vs ~108x, or ~1070x vs ~1058x for multiple files in foobar.
I have also attached a GCC build tuned for skylake. No difference for me on the 5900X against the haswell tuning, but others may test.
And now let's see what ktf's git offers.

Re: FLAC v1.4.x Performance Tests

Reply #128 – 2022-10-12 02:53:43

Quote from: ktf on 2022-10-11 21:09:14

I'm planning something to keep y'all busy: https://github.com/xiph/flac/pull/476

I compiled reference libFLAC git-3d55a9dc 20221009 in 3 ways but left all additional flags in, sorry.
Nonetheless strange results.
I added --disable-asm-optimizations to ../configure
Compare the numbers to my posts above.

mtune=native is really slow!
408x
40x
Since I have a Zen 3 5900X, I tried mtune=znver3 and it crawls along exactly as slowly.

Finally, with mtune=haswell the numbers are almost normal, but not fast.
981.51x
104x

It may be that it collides with the additional flags, but I wonder why mtune=haswell works.

Edit: the same slowness for mtune=native without fancy additional flags

Re: FLAC v1.4.x Performance Tests

Reply #129 – 2022-10-12 06:36:56

Quote from: Wombat on 2022-10-12 02:53:43

Quote from: ktf on 2022-10-11 21:09:14
I'm planning something to keep y'all busy: https://github.com/xiph/flac/pull/476
I compiled reference libFLAC git-3d55a9dc 20221009 in 3 ways but left all additional flags in, sorry.

You've compiled the wrong branch. The branch you're compiling is one without the mentioned optimizations. Checkout branch libFLAC-fast-math

Re: FLAC v1.4.x Performance Tests

Reply #130 – 2022-10-12 07:41:07

@Wombat : Tested your builds here:

Code:

Reference:
FLAC Binary: flac141-case-haswell.exe (860160 bytes)
FLAC Option: -7
 Average time =  25.384 seconds (5 rounds), Encoding speed = 425.94x
 FLAC size = 1.167.014.383 bytes (= 61,188% of WAV size, ~863 kbps)

FLAC Binary: flac141-wombat-manyflags.exe (718848 bytes)
FLAC Option: -7
 Average time =  25.283 seconds (5 rounds), Encoding speed = 427.65x
 FLAC size = 1.167.014.383 bytes (= 61,188% of WAV size, ~863 kbps)

FLAC Binary: flac141-wombat-manyflags-skylake.exe (712192 bytes)
FLAC Option: -7
 Average time =  26.346 seconds (5 rounds), Encoding speed = 410.39x
 FLAC size = 1.167.014.383 bytes (= 61,188% of WAV size, ~863 kbps)

So your "manyflags" build with GrieverV settings is a little faster than Case's Haswell build here. But, oddly enough, your Skylake build is slower although my 8th gen i7 is a family member...

Re: FLAC v1.4.x Performance Tests

Reply #131 – 2022-10-12 11:15:14

Quote from: ktf on 2022-10-11 21:09:14

I'm planning something to keep y'all busy: https://github.com/xiph/flac/pull/476

This might make compiling with -march=native much more rewarding when combined with --disable-asm-optimizations. I've changed the code in such a way that it is much easier for a compiler to vectorize. Currently the intrinsics routines cannot really be tuned by a compiler, but with this change a compiler can use the C code to get an even better result.

I've seen improvements of over 10% with preset 8, when run with -march=native. One could then use AVX512 for example. I don't have access to hardware with AVX512, so I can't say whether that would make sense.

Also, as a bug was found in libFLAC that affects playback with gstreamer, I won't wait long with releasing.

Intel's way of dealing with AVX-512 in 12th gen Core i is completely unfair to the non-K i5 and i3, as they don't use E-cores, so there should be no compatibility issue with AVX-512. @Porcus' CPU should support AVX-512?

Quote from: Porcus on 2022-09-15 15:59:57

On this CPU, an 11th generation i7 mobile

Also thanks for looking into the -ffast-math issue.

Re: FLAC v1.4.x Performance Tests

Reply #132 – 2022-10-12 11:57:22

Quote from: bennetng on 2022-10-12 11:15:14

...
Intel's way of dealing with AVX-512 in 12th gen Core i is completely unfair to the non-K i5 and i3, as they don't use E-cores, so there should be no compatibility issue with AVX-512. @Porcus' CPU should support AVX-512?

Earlier 12th gen chips didn't have AVX512 fused off, so some motherboard and BIOS combinations allowed you to enable AVX512 if you disabled the E-cores. It was disabled across the board so that they didn't have to validate it, and because otherwise cheaper models would outperform more expensive models in some situations, which would not be a good look (marketing nonsense). AFAIK newer 12th gen production runs have unfortunately disabled AVX512 properly.

Muddying the waters a bit more is that Zen 4's AVX512 implementation differs in some key areas (some better some worse, some instruction-chaining performs well/poorly on one arch but not the other, etc), adds to the benchmarking fun: https://mersenneforum.org/showthread.php?t=28102

Re: FLAC v1.4.x Performance Tests

Reply #133 – 2022-10-12 12:17:00

Heck, this sounds like "fun" ...

Question:
As for differences between compiles: if build X is faster than build Y, is that
* due to "fewer instructions" executed (--> less heat generated)
or
* due to "instructions queued more efficiently" and some CPU-internal parallelization (--> same heat generated in shorter time)

- or a combination of both?

On a cooling-constrained setup (laptop!) that makes a difference, and how much depends critically on how much you actually FLAC in one run.
Take someone who acquires a lossless album, does the tagging, and then (re-)compresses it, getting anything from a tiny to a large improvement depending on the source file. That is when you will actually watch the thing run to the end, right? Then one might be pretty much done with the album before the CPU needs to wipe off the sweat. The long-term energy that would need to be dissipated during an overnight job is simply not the yardstick then.

Re: FLAC v1.4.x Performance Tests

Reply #134 – 2022-10-12 14:49:31

Quote from: sundance on 2022-10-12 07:41:07

So your "manyflags" build with GrieverV settings is a little faster than Case's Haswell build here. But, oddly enough, your Skylake build is slower although my 8th gen i7 is a family member...

Nice. It may be that GCC 12.2.0 handles skylake differently than older versions did, and even though your 8700 is a Coffee Lake, it does better with the haswell optimizations.

Quote from: bennetng on 2022-10-12 11:15:14

Also thanks for looking into the -ffast-math issue.

What exactly was this math issue?

Re: FLAC v1.4.x Performance Tests

Reply #135 – 2022-10-12 15:58:24

Quote from: ktf on 2022-10-12 06:36:56

You've compiled the wrong branch. The branch you're compiling is one without the mentioned optimizations. Checkout branch libFLAC-fast-math

WOW! Great job!
This time really the fast-math files

Compiled with
haswell
~1290x
~135x

and

native
~1280x
~136x

The difference is most likely just measurement tolerance. It identifies as reference libFLAC 1.4.1 20220922. Is it OK to offer it here?

Re: FLAC v1.4.x Performance Tests

Reply #136 – 2022-10-12 16:12:37

Quote from: Wombat on 2022-10-12 14:49:31

What exactly was this math issue?

As mentioned by ktf:
https://github.com/xiph/flac/pull/476
There are a lot of online resources on this denormal topic, for example this interactive demo:
https://www.h-schmidt.net/FloatConverter/IEEE754.html
You can toggle the checkboxes to see the numeric representations. Specifically, when all "Exponent" checkboxes are empty (and the mantissa is nonzero), the represented values are called denormals (or subnormals). One of the things -ffast-math does is flush denormals to zero. Depending on the programmer's intent, this may break some code, as the values are no longer the intended ones.

Even if it is safe for this version of flac to do its math this way, the issue mentioned is that the -ffast-math behaviour could affect other code that is unrelated to flac, and that code may require proper denormal support.

Running flac.exe as a separate process (e.g. launched from foobar2000) should be safe, as the floating-point settings of one process do not carry over to another.
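
To make the "all exponent bits empty" description above concrete, here is a minimal Python sketch. Note the assumptions: Python floats are 64-bit doubles (the linked demo shows 32-bit floats), and `flush_to_zero` is a hypothetical helper that only imitates what a flush-to-zero mode would do; it does not change any real CPU setting.

```python
import struct
import sys


def exponent_bits(x: float) -> int:
    """Return the 11-bit biased exponent field of an IEEE 754 double."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return (raw >> 52) & 0x7FF


def is_subnormal(x: float) -> bool:
    """True when x is nonzero but its exponent field is all zeros."""
    return x != 0.0 and exponent_bits(x) == 0


def flush_to_zero(x: float) -> float:
    """Hypothetical helper: imitate flush-to-zero by replacing subnormals with 0.0."""
    return 0.0 if is_subnormal(x) else x


smallest_normal = sys.float_info.min   # smallest positive normal double
subnormal = smallest_normal / 4        # still representable, but subnormal

print(is_subnormal(smallest_normal))   # False
print(is_subnormal(subnormal))         # True
print(flush_to_zero(subnormal))        # 0.0
```

Code that relies on such tiny values staying nonzero (for example as a divisor) is the kind that could misbehave once they are flushed.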

Re: FLAC v1.4.x Performance Tests

Reply #137 – 2022-10-12 16:29:12

Thanks! I haven't had a problem with my frontends yet, but ktf's effort is surely most welcome.

Re: FLAC v1.4.x Performance Tests

Reply #138 – 2022-10-12 19:14:58

Quote from: Porcus on 2022-10-12 12:17:00

As for differences between compiles: if build X is faster than build Y, is that
* due to "fewer instructions" executed (--> less heat generated)
or
* due to "instructions queued more efficiently" and some CPU-internal parallelization (--> same heat generated in shorter time)

- or a combination of both?

It is a combination. Which instructions execute more efficiently varies widely between CPUs. In fact, the resource you linked on AVX512 in Zen 4 lists quite a few such issues. Certain instructions are executed directly on a specific part of the CPU, while others need to be decoded into several instructions. On another CPU, other instructions might have dedicated silicon. This dedicated silicon might be more power hungry, like AVX512.

Quote from: Wombat on 2022-10-12 15:58:24

WOW! Great job!
This time really the fast-math files

If I read this correctly, you're seeing a 20% speedup, right?

Quote

The difference is most likely just measurement tolerance. It identifies as reference libFLAC 1.4.1 20220922. Is it OK to offer it here?

Yes, sure. You probably downloaded a tarball instead of checking out git. It can only generate the proper version string when checked out with git. No worries though, this is very close to libFLAC 1.4.1.
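
For anyone wanting to check the roughly 20% figure from this exchange, here is a small Python sketch. It assumes the approximate -8p speeds posted earlier in the thread (~1070x and ~110x for the "manyflags" build, ~1290x and ~135x for the fast-math haswell build) come from comparable runs; `speedup_percent` is just an illustrative helper name.

```python
def speedup_percent(before_x: float, after_x: float) -> float:
    """Percentage gain in encoding speed, given two 'x realtime' figures."""
    return (after_x / before_x - 1.0) * 100.0


# Approximate figures quoted earlier in the thread (-8p, haswell tuning)
multi = speedup_percent(1070, 1290)    # multiple files in foobar2000
single = speedup_percent(110, 135)     # single-file encode

print(f"multi-file:  {multi:.1f}%")    # 20.6%
print(f"single file: {single:.1f}%")   # 22.7%
```

Since the inputs are rounded "~" values, anything beyond "about 20%" is not meaningful.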

Re: FLAC v1.4.x Performance Tests

Reply #139 – 2022-10-12 19:50:08

Indeed ~20%! The several additional flags optimize C code further and they seem to work well here.

I used https://github.com/ktmf01/flac.git, so it gave me the wrong files, but the zip from fast-math downloaded manually worked.
Attached is the version I tested above.

Re: FLAC v1.4.x Performance Tests

Reply #140 – 2022-10-12 20:18:19

Quote from: ktf on 2022-10-12 06:36:56

You've compiled the wrong branch. The branch you're compiling is one without the mentioned optimizations. Checkout branch libFLAC-fast-math

I compiled this (flac git-cb822660 20221012) on Linux using -march=znver3 -Ofast. I get the same performance with or without asm optimizations.

Re: FLAC v1.4.x Performance Tests

Reply #141 – 2022-10-12 20:33:35

Tested ktf's fastmath build:
(sorry, test results were corrupted, will re-test asap)

Re: FLAC v1.4.x Performance Tests

Reply #142 – 2022-10-12 20:49:34

Quote from: bennetng on 2022-10-10 17:07:28

Added more electronic and loudness war contents, hand-picked to only include the highest bitrate files, but does not contain noise music. Around 74.5% compression ratio.

1.3.1 (Xiph)

-8 -b2304
3200412387 bytes

-8
3202131236 bytes

1.3.2 (Xiph)

-8 -b2304
3200203911 bytes

-8
3201989505 bytes

1.4.1 (Case GCC 12.2.0)

-8 -b2304
3199429338 bytes

-8
3201122995 bytes

-8 -A "tukey(5e-1);partial_tukey(2);punchout_tukey(3)"
3201407279 bytes

Yes, somewhat bigger file size, see the quoted data for comparison.
-8 -b2304
The two speeds are single and multi-thread results.

Case GCC 12.2.0
Total encoding time: 1:39.094, 245.88x realtime
Total encoding time: 0:29.609, 822.91x realtime
3199429338 bytes

ktf-fast-math-noasm-manyflags-haswell
Total encoding time: 1:23.563, 291.58x realtime
Total encoding time: 0:25.078, 971.59x realtime
3200178267 bytes

[EDIT] Added -8p -b2304 tests, only multi-thread:

ktf-fast-math-noasm-manyflags-haswell
Total encoding time: 1:21.890, 297.54x realtime
3196718833 bytes

Case GCC 12.2.0
Total encoding time: 1:30.437, 269.42x realtime
3196159402 bytes

Re: FLAC v1.4.x Performance Tests

Reply #143 – 2022-10-12 21:33:50

... now the corrected results for ktf's fastmath build:
(somehow an orphaned flac file wasn't deleted before starting the test and was accounted in the total FLAC size)

Code:

FLAC Binary: flac141-case-haswell.exe (860160 bytes) = Reference
FLAC Option: -7
 Average time =  25.392 seconds (3 rounds), Encoding speed = 425.80x
 FLAC size = 1.167.014.383 bytes (= 61,188% of WAV size, ~863 kbps)

FLAC Binary: flac141-ktf-fastmath.exe (665600 bytes)
FLAC Option: -7
 Average time =  21.760 seconds (5 rounds), Encoding speed = 496.87x	  <= faster encoding (429x -> 497x)
 FLAC size = 1.167.045.858 bytes (= 61,189% of WAV size, ~863 kbps)		<= on-par compression: -0.001 percent points

Re: FLAC v1.4.x Performance Tests

Reply #144 – 2022-10-12 21:36:26

@bennetng : 1/40th of a percent bigger files. Actually, if you want that much compression improvement by tweaking parameters, you will likely have to pay more than that nineteen percent time penalty? Going to the Case compile seems to be the cheapest way to save bytes?
That is about tenfold the savings in @sundance 's test run?

How does it fare with -8p [and your fave -b]? Asking because "p" brute-forces "a certain task", so there is something it does particularly much of.
(That can be said about -8e as well, and -8 --lax -r 12 also? In the latter case, no other -b please. Not saying they are useful for anything but testing.)
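
The two figures above (about 1/40th of a percent and about nineteen percent) can be reproduced from the -8 -b2304 single-thread results in Reply #142. This is a minimal Python sketch; the helper names `to_seconds` and `percent_more` are made up for illustration.

```python
def to_seconds(stamp: str) -> float:
    """Convert an 'm:ss.mmm' time as printed in the results to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def percent_more(value: float, baseline: float) -> float:
    """How many percent 'value' exceeds 'baseline'."""
    return (value / baseline - 1.0) * 100.0


# -8 -b2304, single-thread figures from Reply #142
case_time, case_size = to_seconds("1:39.094"), 3199429338
fast_time, fast_size = to_seconds("1:23.563"), 3200178267

size_penalty = percent_more(fast_size, case_size)   # extra bytes, fast-math build
time_penalty = percent_more(case_time, fast_time)   # extra time, Case build

print(f"fast-math files are {size_penalty:.4f}% bigger")
print(f"Case build takes {time_penalty:.1f}% longer")
```

This gives roughly 0.023% larger files against roughly 18.6% more encoding time, which matches the rounded numbers in the post.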

Re: FLAC v1.4.x Performance Tests

Reply #145 – 2022-10-12 21:58:49

Quote from: Porcus on 2022-10-12 21:36:26

@bennetng : 1/40th of a percent bigger files. Actually, if you want that much compression improvement by tweaking parameters, you will likely have to pay more than that nineteen percent time penalty? Going to the Case compile seems to be the cheapest way to save bytes?
That is about tenfold the savings in @sundance 's test run?

How does it fare with -8p [and your fave -b]? Asking because "p" brute-forces "a certain task", so there is something it does particularly much of.
(That can be said about -8e as well, and -8 --lax -r 12 also? In the latter case, no other -b please. Not saying they are useful for anything but testing.)

As mentioned in the quoted box of my previous test, the corpus used was heavily biased to the very high bitrate files (~74.5% compression ratio). I just conveniently reused this corpus because it is still in my foobar playlist, so it can be considered as a special case, and -b2304 is suitable for this brutal set of files.

I think the significance of ktf's latest tweak is it offers obvious speed boost for different types of CPUs, and makes -8 much cheaper. -8 is an important preset that many people actually use.

Re: FLAC v1.4.x Performance Tests

Reply #146 – 2022-10-13 02:39:17

I tried compiling without any flags except -Ofast -m64 -march=haswell to check whether the increased size of the resulting files is due to the GCC optimizations. The resulting files are identical, so it must be the new flac code itself.
My single WAV test file is 2.729.717.132 bytes, consisting of several CD images of different genres.
It compresses to 1.526.366.181 bytes and 1.526.597.886 bytes, so a 0,015% increase in file size.

I was also asked for the additional flags. They are no secret; I copied them more or less from Case and GrieverV.
With the new fast-math code, everything after -fno-stack-protector taken together makes almost no difference: less than with the older flac code, or even none at all.

-Ofast -m64 -march=haswell -fipa-pta -funroll-loops -fno-stack-protector -fno-common -fno-plt -fno-semantic-interposition -falign-functions=32 -fdevirtualize-at-ltrans -fgraphite-identity -floop-nest-optimize -flto -ffat-lto-objects -pipe

Re: FLAC v1.4.x Performance Tests

Reply #147 – 2022-10-13 05:13:01

Oops. The post above fails to mention that this 0,015% file size increase is for -8p. I must have forgotten to mention it because I only use -8p for every test here.

Re: FLAC v1.4.x Performance Tests

Reply #148 – 2022-10-13 08:14:51

Another corpus with a more typical compression ratio. Faster overall speed than the previous corpus with an extreme ratio.

PCM (17 files)
4223331916 bytes

-8

ktf-fast-math-noasm-manyflags-haswell
multi: 0:17.453, 1371.78x realtime
single: 1:17.250, 309.92x realtime
2511293691 bytes
59.462%

Case GCC 12.2.0
multi: 0:20.859, 1147.79x realtime
single: 1:33.891, 254.99x realtime
2510290651 bytes
59.439%

-8 -A subdivide_tukey(3/2e-1)

ktf-fast-math-noasm-manyflags-haswell
multi: 0:17.422, 1374.22x realtime
single: 1:17.203, 310.11x realtime
2511287164 bytes
59.462%

Case GCC 12.2.0
multi: 0:20.984, 1140.95x realtime
single: 1:34.297, 253.89x realtime
2510265971 bytes
59.438%

-8p

ktf-fast-math-noasm-manyflags-haswell
multi: 0:53.156, 450.40x realtime
single: 3:44.875, 106.46x realtime
2509518160 bytes
59.420%

Case GCC 12.2.0
multi: 1:01.547, 389.00x realtime
single: 4:17.797, 92.87x realtime
2508762363 bytes
59.402%
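
The percentages and "x realtime" figures above can be cross-checked with a short Python sketch. It rests on one loud assumption that is not stated in the post: that the PCM data is CD format (44.1 kHz, 16-bit, stereo), so audio duration can be estimated from the byte count. The function names are illustrative only.

```python
PCM_BYTES = 4223331916                 # total PCM size of the 17 files
CD_BYTES_PER_SECOND = 44100 * 2 * 2    # assumption: 44.1 kHz, 16-bit, stereo


def to_seconds(stamp: str) -> float:
    """Convert an 'm:ss.mmm' time as printed in the results to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def compression_percent(flac_bytes: int, pcm_bytes: int = PCM_BYTES) -> float:
    """FLAC size as a percentage of the PCM size."""
    return 100.0 * flac_bytes / pcm_bytes


def realtime_factor(encode_seconds: float, pcm_bytes: int = PCM_BYTES) -> float:
    """Estimated audio duration divided by encoding time."""
    audio_seconds = pcm_bytes / CD_BYTES_PER_SECOND
    return audio_seconds / encode_seconds


ratio = compression_percent(2511293691)             # fast-math build, -8
factor = realtime_factor(to_seconds("0:17.453"))    # fast-math build, -8, multi

print(f"{ratio:.3f}%")
print(f"{factor:.2f}x realtime")
```

Under that assumption the estimates land within rounding of the reported 59.462% and 1371.78x, which suggests the corpus is indeed CD-format audio.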

Re: FLAC v1.4.x Performance Tests

Reply #149 – 2022-10-13 08:53:06

Quote from: Wombat on 2022-10-13 05:13:01

Oops. The post above fails to mention that this 0,015% file size increase is for -8p.

Are we killing the double precision here?!?
