FLAC v1.4.x Performance Tests

Re: FLAC v1.4.x Performance Tests

Reply #100 – 2022-10-06 19:50:38

In my case, single or multi-thread does not affect speed ranking. For example Case's GCC 7.3.0 Haswell compile is always the slowest in both single and multi-thread tests.

For RAM, I am using a budget motherboard which only supports DDR4, even though the CPU supports DDR5. DDR4 has been mainstream for more than 5 years. I am using 2x8GB DDR4 3200.

As for AVX, AVX2 and FMA3, the 2013 Intel Haswell (4th gen) already supports all of them, and I was using i3-4160 before February this year.

In fact, Intel 12th gen does not officially support AVX-512, while some older Core i models do. In any case, flac 1.4.x does not seem to use AVX-512 at all.

I am using this RAM disk:
https://sourceforge.net/projects/imdisk-toolkit/

Re: FLAC v1.4.x Performance Tests

Reply #101 – 2022-10-06 20:20:05

Quote from: bennetng on 2022-10-06 19:50:38

I am using this RAM disk:
https://sourceforge.net/projects/imdisk-toolkit/

Since you use (and like) it, I'm gonna give it a try.
Does this RAM Disk hold your WAVs and FLACs during your performance tests?

Re: FLAC v1.4.x Performance Tests

Reply #102 – 2022-10-06 20:48:46

Quote from: sundance on 2022-10-06 20:20:05

Does this RAM Disk hold your WAVs and FLACs during your performance tests?

Yes, all files are in the RAM drive, but I don't use timer64, I use foobar's console for timing. To enforce a single encoder instance, either combine everything into a single file, or do this in foobar's converter dialog:
https://hydrogenaud.io/index.php/topic,123025.msg1016809.html#msg1016809

Also, if relevant, I always format the RAM drive as FAT32, since NTFS is a more complex file system and occupies more space on a freshly formatted drive. The limitation is that FAT32 caps a single file at 4 GB (2^32 - 1 bytes). With 32 GB of RAM it should be no problem to create a RAM drive of at least 24 GB, but no single file on it can exceed that cap.
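A quick way to check a set of files against that limit before copying them to a FAT32 RAM drive. This is only a sketch: the function name, file names and sizes are made up for illustration; the only fact used is the FAT32 maximum file size of 2^32 - 1 bytes.

```python
FAT32_MAX_FILE_BYTES = 2**32 - 1  # largest file FAT32 can hold (4 GiB minus one byte)


def too_big_for_fat32(sizes_by_name):
    """Return the names whose size exceeds the FAT32 single-file limit."""
    return sorted(name for name, size in sizes_by_name.items()
                  if size > FAT32_MAX_FILE_BYTES)


# Hypothetical sizes in bytes: a single track fits, a huge combined file does not.
example = {
    "track01.wav": 45_000_000,
    "combined.w64": 5_000_000_000,
}
oversized = too_big_for_fat32(example)
print(oversized)
```

When combining everything into one file to force a single encoder instance, this is the check that decides whether FAT32 is still an option.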

Make sure "Create virtual disk in physical memory" is selected when creating the RAM disk.

Re: FLAC v1.4.x Performance Tests

Reply #103 – 2022-10-06 21:13:56

Tried in foobar2000 like you suggested (single thread, 40 WAVs):
-> Total encoding time: 0:39.531, 273.51x realtime (single thread)
-> Total encoding time: 0:06.688, 1616.65x realtime (allow multiple threads), around 6x faster (matches the 6 cores)
But the single thread encode is way slower compared to flac.exe started in a console window: 0:25.288
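As a sanity check on the two foobar2000 figures above, the sketch below recomputes the speedup and the audio length implied by each "x realtime" value. Only the numbers come from the post; the helper name is made up.

```python
def to_seconds(clock):
    """Convert an 'M:SS.mmm' string, as printed in the results above, to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


single = to_seconds("0:39.531")   # single encoder instance
multi = to_seconds("0:06.688")    # multiple threads allowed

speedup = single / multi
# encoding time * realtime factor = length of the encoded audio; both runs should agree
audio_single = single * 273.51
audio_multi = multi * 1616.65

print(f"speedup: {speedup:.2f}x")
print(f"implied audio length: {audio_single:.0f} s vs {audio_multi:.0f} s")
```

Both runs imply roughly three hours of audio, and the speedup comes out at about 5.9x, which fits the "around 6x" on 6 cores.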

Btw, I tested the RAM disk (NTFS) and the encoding time improved by around 200 ms (0.8%), plus you extend your SSD's lifetime...
Going to repeat the test with FAT32...

Re: FLAC v1.4.x Performance Tests

Reply #104 – 2022-10-06 21:36:37

My previous tests with CDDA including -7 and other settings:
https://hydrogenaud.io/index.php/topic,123025.msg1016652.html#msg1016652

The important thing is relative speed ranking, for example, is Case's GCC v7.3.0 Haswell compile still the fastest when the test method is changed?

Re: FLAC v1.4.x Performance Tests

Reply #105 – 2022-10-06 21:43:38

Quote from: bennetng on 2022-10-06 19:50:38

In fact, Intel 12th gen does not officially support AVX-512, while some older Core i models do. In any case, flac 1.4.x does not seem to use AVX-512 at all.

Sundance's 8th-gen i7 doesn't either, it seems. For your 12th-gen i3, the same instruction set extensions are listed.

However the 12th generation boasts the fancy name of "Gaussian & Neural Accelerator" which, at the risk of just parroting marketing spin, "is an ultra-low power accelerator block designed to run audio and speed-centric AI workloads. Intel® GNA is designed to run audio based neural networks at ultra-low power, while simultaneously relieving the CPU of this workload."
Not sure if anything will utilize that?!

Re: FLAC v1.4.x Performance Tests

Reply #106 – 2022-10-06 21:49:26

I disabled GNA in BIOS in all tests.

Re: FLAC v1.4.x Performance Tests

Reply #107 – 2022-10-07 09:25:44

Seems I'm quite limited here with the HP BIOS of my EliteDesk 800 G4.
There are no settings to disable any of the extended CPU features; I can only toggle "Multithreading" and "VTx"...

Re: FLAC v1.4.x Performance Tests

Reply #108 – 2022-10-07 09:30:46

Quote from: ktf on 2022-10-05 18:05:16

Quote from: Vladeimir on 2022-10-05 17:44:10
The FMA intrinsics are compiled with "-ffast-math".
[...]
I am not sure why the SSE and AVX ones are not.
Because the SSE and AVX code is written with intrinsics, whereas the FMA code is plain C compiled for an FMA target. With the SSE and AVX intrinsics the instructions need not be reordered, but with FMA they do, so -fassociative-math is needed, which is part of -ffast-math.

Does it mean using -Ofast globally can affect something completely irrelevant like progress indicator and such? Are there inline directives to prevent such global optimizations in certain parts of the code?

My experience with vectorization is limited to GPU shaders and game engines, without touching low-level stuff like intrinsics.

Re: FLAC v1.4.x Performance Tests

Reply #109 – 2022-10-07 15:14:34

Quote from: sundance on 2022-10-06 21:13:56

Btw, I tested the RAM disk (NTFS) and the encoding time improved by around 200 ms (0.8%), plus you extend your SSD's lifetime...
Going to repeat the test with FAT32...

I use the SoftPerfect RAM Disk, and exFAT is clearly the fastest file system with it, but it has a very big overhead due to its 64k cluster size. That shouldn't matter unless you put lots of small files on it.
Until recently Windows had an uppercase renaming bug with exFAT. That has since been fixed.
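To put a number on that cluster overhead: every file occupies a whole number of clusters, so the last cluster of each file is partly wasted. A small sketch, with made-up file counts and sizes, comparing a 4k and a 64k cluster size:

```python
def allocated_bytes(file_size, cluster_size):
    """Space a file occupies once rounded up to whole clusters."""
    clusters = -(-file_size // cluster_size)  # integer ceiling division
    return clusters * cluster_size


def slack_bytes(file_sizes, cluster_size):
    """Total space lost to partially filled final clusters."""
    return sum(allocated_bytes(size, cluster_size) - size for size in file_sizes)


small_files = [1_000] * 10_000   # hypothetical: ten thousand files of 1000 bytes
one_big_file = [10_000_000]      # hypothetical: the same amount of data in one file

for cluster in (4 * 1024, 64 * 1024):
    print(cluster, slack_bytes(small_files, cluster), slack_bytes(one_big_file, cluster))
```

With 64k clusters the small files waste hundreds of megabytes, while the single big file wastes under one cluster, which is why it hardly matters for a handful of WAVs and FLACs.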

Re: FLAC v1.4.x Performance Tests

Reply #110 – 2022-10-07 16:10:30

Quote from: sundance on 2022-10-07 09:25:44

Seems I'm quite limited here with the HP BIOS of my EliteDesk 800 G4.
There are no settings to disable any of the extended CPU features; I can only toggle "Multithreading" and "VTx"...

Motherboards sold separately, such as those from Asus, Gigabyte and MSI, usually offer more options.

Re: FLAC v1.4.x Performance Tests

Reply #111 – 2022-10-07 17:51:16

Quote from: Porcus on 2022-09-29 10:00:36

Size         seconds  saved per second  setting
11969604531      833                    -8
11968502388      940             10300  -8 -A "tukey(666e-3);subdivide_tukey(3/333e-3)"
11967556575     1155              4399  -8 -A "subdivide_tukey(4)"
11966463371     1555              2733  -8 -A "subdivide_tukey(5)"
11961291433     3003              3572  -8p (note jump in time when using -p)
11960179719     3350              3204  -8p -A "tukey(666e-3);subdivide_tukey(3/333e-3)"
11959250164     5131               522  -8p -A "subdivide_tukey(4)"
11958125424     7796               422  -8p -A "subdivide_tukey(5)"
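The "saved per second" column is not defined in the quote, but the figures are reproduced if it is read as bytes saved per extra second of encoding time relative to the row directly above. The sketch below checks that reading against the quoted numbers; the function name is made up.

```python
# (size in bytes, seconds, setting) in the order given in the table above
rows = [
    (11969604531, 833, '-8'),
    (11968502388, 940, '-8 -A "tukey(666e-3);subdivide_tukey(3/333e-3)"'),
    (11967556575, 1155, '-8 -A "subdivide_tukey(4)"'),
    (11966463371, 1555, '-8 -A "subdivide_tukey(5)"'),
    (11961291433, 3003, '-8p'),
    (11960179719, 3350, '-8p -A "tukey(666e-3);subdivide_tukey(3/333e-3)"'),
    (11959250164, 5131, '-8p -A "subdivide_tukey(4)"'),
    (11958125424, 7796, '-8p -A "subdivide_tukey(5)"'),
]


def saved_per_second(table):
    """Bytes saved per additional second, each row compared with the previous row."""
    result = []
    for (prev_size, prev_secs, _), (size, secs, _) in zip(table, table[1:]):
        result.append(round((prev_size - size) / (secs - prev_secs)))
    return result


column = saved_per_second(rows)
print(column)
```

Under that reading, each row measures the marginal return of stepping up from the next-cheaper setting, not the return compared with plain -8.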

How about this setting on your corpus?
-8 -A "tukey(75e-2);subdivide_tukey(3/25e-2)"
Of course I asked this because it works better on my corpus (about one day of duration), and I adjusted the corpus weighting so that the compression ratio is roughly 55%. You can also try other values which don't require rounding, for example 666e-3 may mean something like 0.66600000858306884765625 in single float.
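The rounding mentioned above can be checked directly. This sketch assumes the parameter ends up in an IEEE 754 single-precision float, as the post suggests; it says nothing about how flac parses the string. The helper name is made up.

```python
import struct
from decimal import Decimal


def as_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


for text in ("666e-3", "75e-2", "25e-2"):
    exact = Decimal(as_float32(float(text)))  # exact decimal expansion of the stored value
    print(text, "->", exact)
```

0.75 and 0.25 are sums of powers of two and survive unchanged, while 0.666 does not.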

Re: FLAC v1.4.x Performance Tests

Reply #112 – 2022-10-07 19:15:32

Quote from: Porcus on 2022-10-06 21:43:38

GNA [...] Not sure if anything will utilize that?!

It's like a GPU but much smaller. FLAC won't use it.

Quote from: bennetng on 2022-10-07 09:30:46

Does it mean using -Ofast globally can affect something completely irrelevant like progress indicator and such?

Potentially yes, but in practice most floating-point code doesn't rely on the compiler precisely following the floating-point standards. If the progress indicator is affected, you probably wouldn't be able to see what's different.

The real problem with -Ofast is that it can insert code that switches the CPU into a faster but not standards-compliant mode, and this can affect any program that loads a library compiled with -Ofast.
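A small illustration of what is at stake, written in Python rather than C so it can be run anywhere. The first part shows why reordering (what -fassociative-math permits) can change a result: floating-point addition is not associative. The second part shows a subnormal number; my assumption is that the "faster but not standards-compliant mode" refers to flush-to-zero handling of such values, which would turn it into 0.

```python
import sys

a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # the order written in the source
reassociated = a + (b + c)    # an order a compiler may pick when reassociation is allowed

print(repr(left_to_right), repr(reassociated))
print("same result:", left_to_right == reassociated)

# Subnormal value: below the smallest normal double, yet still nonzero under IEEE 754 rules.
subnormal = sys.float_info.min / 2
print("subnormal is nonzero:", subnormal > 0)
```

The two sums differ only in the last bit, which is harmless for a progress indicator but can matter for code that expects bit-exact results.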

Re: FLAC v1.4.x Performance Tests

Reply #113 – 2022-10-07 20:13:51

I somehow understand why this Audition bug happened and what the OptimFROG author wanted to correct:
https://hydrogenaud.io/index.php/topic,114816.msg1009053.html#msg1009053
My program will definitely fail if done in the -Ofast way.

Re: FLAC v1.4.x Performance Tests

Reply #114 – 2022-10-08 10:01:44

Quote from: bennetng on 2022-10-07 17:51:16

How about this setting on your corpus?
-8 -A "tukey(75e-2);subdivide_tukey(3/25e-2)"
Of course I asked this because it works better on my corpus (about one day of duration),

Improves - because of the 25e-2. The difference between 75e-2 and 666e-3 in the single tukey is ambiguous over genre (the latter is better in the classical section, the former in the "other"), but the overall impact is less than a part per million.
Tested same with "-p" added.

But lowering the subdivide_tukey tapering parameter helps and I think it should be even lower. I tested and found -A subdivide_tukey(24e-2) to be a good one without the additional -A tukey, but preliminary testing indicates that 25e-2 is "too high" in the presence of that.

The 666 & 333 were not "optimal" choices - they were picked more out of the idea that if I wanted to deviate from 1/2 and 1/2 parameters, then "2/3 and 1/3" would be the next idea. I surely tested both 666&333 and 333&666, but I didn't do any exhaustive testing. So why then state with this three-decimal "accuracy"? Hey, 3/333e-3 is easy to remember. (And then the metal swine selected 666 over 667 for kinda the same reason.)

Quote from: bennetng on 2022-10-07 17:51:16

You can also try other values which don't require rounding, for example 666e-3 may mean something like 0.66600000858306884765625 in single float.

The predictor is rounded off to integer, so decimals beyond some kth will in the very least not matter very often. Quick testing on 11 CD images, starting from your 0.75 & 0.25, I got bit-identical files if I tweaked the fifth decimal, but the fourth would matter. I mean, not "matter" much, but yield different files.

Re: FLAC v1.4.x Performance Tests

Reply #115 – 2022-10-09 11:49:11

Tested: To "-8" and above, added a tukey to a subdivide_tukey, various taperings tested. Do these choices make (much) different impact across genres? (No!)
-8p -A "tukey(P);subdivide_tukey(N/Q)" for N=3, 4, 5 and various P and Q.
Also without "-p".

Of course it doesn't matter much! On one hand, you can shrug it off as nothing by saying that for N=3, the extra tukey - with "optimal" parameters - saves 0.01 percent over standard -8p, and good/bad parameters make for only half of this. Nothing to care about? On the other hand, it is only slightly less than going up to N=4, and slightly more than going from N=4 to N=5. Each of those cost much more time.
So if standard -8p is not enough for you - well for the sports of it I guess - and you are ready to type in some -A manually, you might as well consider this. Same if you want to go up from -8 but without going all the way to -8p; then you can just remove the "p" from the below, as your material is likely to make more difference than that.

tl;dr: if adding an additional tukey to get ~half the benefit of higher subdivide_tukey at a fraction of the extra time, make its tapering parameter bigger than default (well maybe default if you are at very high compression) and the subdivide_tukey taper parameter very small.
If you like to think in 1/16ths terms: after a bit tweaking, you could try something like 11, 10, 9 or 10, 9, 8 combined with a 1/8 as follows:
N=3: -8p -A "tukey(6875e-4);subdivide_tukey(3/125e-3)" <---- 11/16ths & 1/8th, or reduce the first to 10/16ths for classical music
N=4: -8p -A "tukey(6250e-4);subdivide_tukey(4/125e-3)" <---- 10/16ths & 1/8th, or reduce the first to 9/16ths for classical music. Yes keep the 1/8th.
N=5: -8p -A "tukey(5625e-4);subdivide_tukey(5/125e-3)" <---- 9/16ths & 1/8th, or reduce the first to 5e-1 for classical music. Again keep the 1/8th.
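For anyone who wants to try other sixteenths, the three lines above can be generated mechanically. A sketch only: the function names are made up, and it does nothing more than format the fractions in the same integer-times-power-of-ten notation used in this thread.

```python
from fractions import Fraction


def e_notation(value, digits):
    """Write an exact fraction as '<integer>e-<digits>', e.g. 11/16 -> '6875e-4'."""
    scaled = value * 10**digits
    if scaled.denominator != 1:
        raise ValueError(f"{value} needs more than {digits} decimal digits")
    return f"{scaled.numerator}e-{digits}"


def apodization(n, tukey_sixteenths):
    """Build the option string for subdivide_tukey(n/1/8) plus a single tukey."""
    tukey_p = e_notation(Fraction(tukey_sixteenths, 16), 4)
    subdivide_p = e_notation(Fraction(1, 8), 3)
    return f'-8p -A "tukey({tukey_p});subdivide_tukey({n}/{subdivide_p})"'


for n, sixteenths in ((3, 11), (4, 10), (5, 9)):
    print(f"N={n}: {apodization(n, sixteenths)}")
```

Calling `apodization(3, 10)` gives the variant with the first parameter reduced to 10/16ths, as suggested for classical music.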

But the genre differences between classical, heavier/metal and "other" didn't cause much drama - not even "relatively" to the very small impact of it all. That is kinda reassuring; even if classical music could use N/2e-1, it gained virtually nothing going down to N/<one eighth>.

So just to explain what I did here:

Quote from: Porcus on 2022-10-06 11:22:46

Also checked (this preliminary): as in Reply 48, combining with a bigger single tukey.
Hypothesis: because single tukey has always had the default parameter 0.5 - this after quite a bit of testing back in the day - there is no good reason that this small tapering should be good for a single tukey run, ==> reason why it works is for the subdivisions ==> if you want to improve, try one with a bigger taper parameter like -5 uses.
This to be tested with -p

I first made the "arbitrary" selection (files with "j" in the name) and then ran the test on the remainder, distinguishing between the classical music, the heavy rock/metal and the "other".
The P and Q are "7e-2", "14e-2" etc., i.e. 0.07 apart, though only the "most reasonable" ones tested on the big corpus. Then tweaked the parameters slightly from the "best", if only to see if small tweaks led to unexpectedly big changes. (They did not.)

Results: Well not unexpected given Reply 48: Make the Q and P tapering parameters quite far from each other as tukey(<big P>);subdivide_tukey(N/<small Q>). The "big" does not mean close to 1, though.
Genre differences: Nothing dramatic - nothing "relatively dramatic" relative to the .01 percent impact either. Sure there is a clear pattern in that the heavier music wants smaller Q, down below 0.1, and also slightly bigger P, but not much - and the classical music calls for slightly lower P. But the "overall" minimum is not far (in kilobytes) from each genre's minimum.

So the first runs ended up with
N=3: -8p -A "tukey(70e-2);subdivide_tukey(3/14e-2)"
N=4: -8p -A "tukey(56e-2);subdivide_tukey(4/14e-2)"
N=5: -8p -A "tukey(49e-2);subdivide_tukey(5/14e-2)"
Tweaking it and looking at genre differences, I ended up with something like up there with the tl;dr. It was the classical music section that made the "56" and "49" win, and it is the heavier section that pulls the other direction. The 14 was a bit too high except for classical music where it mattered very very little, like a few kb on 4 giga.

Re: FLAC v1.4.x Performance Tests

Reply #116 – 2022-10-09 16:22:48

With only -8 subdivide_tukey(3/x) I got these figures, from best to worst compression:

Difficult content (~70.64% compression ratio)

3/1875e-4
2531723115 bytes

3/2e-1
2531723128 bytes

3/22e-2
2531723205 bytes

3/25e-2
2531724460 bytes

3/125e-3
2531724619 bytes

-8
2531763292 bytes

Difficult contents are the usual electronic music in my collection, and some loudness war songs. Simple contents include speech, classical, ethnic and songs with simple accompaniment.

Simple content (~43.13% compression ratio)

3/25e-2
1799030423 bytes

3/2e-1
1799035729 bytes

3/22e-2
1799046777 bytes

3/1875e-4
1799054794 bytes

3/125e-3
1799080973 bytes

-8
1799116764 bytes

Re: FLAC v1.4.x Performance Tests

Reply #117 – 2022-10-09 18:45:33

Quote from: bennetng on 2022-10-09 16:22:48

With only -8 subdivide_tukey(3/x) I got these figures, from best to worst compression:

Here is where I actually got a weirdness: .21 was worse than both .20 and .22. And the effect was not due to one genre. Tested a few more because in Reply #93 I found .32 to be better than .16 over all three, so the below results point at a parameter slightly less than expected.

Anyway, disregarding .21 and doing (nearly) only your parameters, results are not outrageously far from yours, but slightly different - I suspect your speech content makes some impact?
2e-1 was the best for both classical music and the "other" section. .22 was better than .1875 in these two genre sections. .2 also won the overall.

For my heavier material, go lower: 3/125e-3 is better than .1875 better than .2 better than .22 better than .25
Also checked 1e-1, which narrowly lost to 125e-3.

Impact of choosing "wrong": With your "simple" content, even the difference between the two best was like 3 parts per million. For my classical music, everything from .1875 and up would be within that interval, and same for the "other" genre.
But for your "difficult" material, everything from 125e-3 and up fell within one ppm, and my material needed 4ppm.
Not much still.
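The ppm figures for bennetng's two sets can be recomputed from the byte counts in Reply #116. A sketch with a made-up helper name; only the sizes are taken from that post.

```python
def ppm_gap(sizes):
    """Gap between the smallest and largest size, in parts per million of the smallest."""
    best, worst = min(sizes), max(sizes)
    return (worst - best) / best * 1e6


# "Simple" content: the two best settings (3/25e-2 and 3/2e-1)
simple_best_two = [1799030423, 1799035729]
# "Difficult" content: all five subdivide_tukey(3/x) settings, 125e-3 and up
difficult_all = [2531723115, 2531723128, 2531723205, 2531724460, 2531724619]

simple_ppm = ppm_gap(simple_best_two)
difficult_ppm = ppm_gap(difficult_all)
print(f"simple, best two: {simple_ppm:.2f} ppm")
print(f"difficult, all five: {difficult_ppm:.2f} ppm")
```

This gives just under 3 ppm between the two best settings on the simple content and about 0.6 ppm across all five settings on the difficult content.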

The low tapering parameter I found in Reply 115 just underlines that with an additional tukey, you want the two tukeys to be different.

Re: FLAC v1.4.x Performance Tests

Reply #118 – 2022-10-10 04:11:56

Guess this is my last try.
I am now also using the ./configure options Case suggested, plus -Ofast and -fipa-pta, which were suggested elsewhere. -fipa-pta optimizes a tiny bit and shaves a few kB off the binaries, at the cost of compile time only.

Re: FLAC v1.4.x Performance Tests

Reply #119 – 2022-10-10 08:22:47

On my set of test files, your latest (really hopefully not last) build is right between Case's gcc v12.2 and gcc v7.3 builds:

FLAC Binary: flac141-case-haswell.exe (860160 bytes) = gcc v7.3
FLAC Option: -7
 Average time =  25.268 seconds (3 rounds), Encoding speed = 427.89x
 FLAC size = 1.167.014.383 bytes (= 61,188% of WAV size, ~863 kbps)

FLAC Binary: flac141-wombat2.exe (784384 bytes)
FLAC Option: -7
 Average time =  25.710 seconds (3 rounds), Encoding speed = 420.54x
 FLAC size = 1.167.014.381 bytes (= 61,188% of WAV size, ~863 kbps)

FLAC Binary: flac141-case-gcc12.exe (781312 bytes)
FLAC Option: -7
 Average time =  26.100 seconds (3 rounds), Encoding speed = 414.26x
 FLAC size = 1.167.014.383 bytes (= 61,188% of WAV size, ~863 kbps)

And, fwiw, I was able to get some speed gain compared to plain -7 (on my test set [classic rock music]) @ almost no cost with smaller block size:

FLAC Binary: flac141-case-haswell.exe (860160 bytes)
FLAC Option: -7 -b3584
 Average time =  23.949 seconds (3 rounds), Encoding speed = 451.46x	<= faster encoding (428x -> 451x) [compared to -7]
 FLAC size = 1.167.032.442 bytes (= 61,189% of WAV size, ~863 kbps)		<= minimally worse compression: 0.001 percentage points
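The trade-off of -b3584 against plain -7 in relative terms, recomputed from the figures above (same binary, flac141-case-haswell.exe). Variable names are made up.

```python
base_time, base_size = 25.268, 1_167_014_383      # -7
b3584_time, b3584_size = 23.949, 1_167_032_442    # -7 -b3584

time_saved_pct = (1 - b3584_time / base_time) * 100
size_cost_pct = (b3584_size / base_size - 1) * 100
extra_bytes = b3584_size - base_size

print(f"encoding time saved: {time_saved_pct:.2f}%")
print(f"file size increase:  {size_cost_pct:.4f}% ({extra_bytes} bytes)")
```

So about 5% less encoding time for an increase of roughly 18 kB on 1.17 GB of FLAC output.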

Re: FLAC v1.4.x Performance Tests

Reply #120 – 2022-10-10 17:07:28

Added more electronic and loudness-war content, hand-picked to include only the highest-bitrate files but no noise music. Around 74.5% compression ratio.

1.3.1 (Xiph)

-8 -b2304
3200412387 bytes

-8
3202131236 bytes

1.3.2 (Xiph)

-8 -b2304
3200203911 bytes

-8
3201989505 bytes

1.4.1 (Case GCC 12.2.0)

-8 -b2304
3199429338 bytes

-8
3201122995 bytes

-8 -A "tukey(5e-1);partial_tukey(2);punchout_tukey(3)"
3201407279 bytes

Re: FLAC v1.4.x Performance Tests

Reply #121 – 2022-10-10 18:22:45

Yikes, I suck at PowerShell ...

Can anyone hack together for me a script that does the following:

FOR every *.flac IN (D:\given path pattern...\*.flac) DO flac <parameters> with output <same filename except that in E: rather than D>
and measures total CPU time and total time including I/O?

Point being: how much "compression effort" is "free in time" because it compresses while busy writing?
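Not PowerShell, but here is a rough Python sketch of the loop described above. Everything in it is an assumption except the flac options -f (overwrite) and -o (output file name). The helper names are made up, the paths in the usage comment are placeholders, and the target folders on E: must already exist. One important caveat: the child CPU times from `os.times()` are only filled in on Unix-like systems; on Windows they are reported as zero, so there the CPU figure would need another source.

```python
import os
import subprocess
import sys
import time
from pathlib import PureWindowsPath


def swap_drive(path, new_drive):
    """Return the same Windows path with its drive replaced (new_drive like 'E:')."""
    parts = PureWindowsPath(path).parts
    return str(PureWindowsPath(new_drive + "\\", *parts[1:]))


def flac_command(src, dst, flac_args):
    """Build a flac command line; -f overwrites, -o names the output file."""
    return ["flac", *flac_args, "-f", "-o", dst, src]


def time_batch(pairs, make_command):
    """Run one command per (src, dst) pair; return (wall_seconds, child_cpu_seconds)."""
    cpu_start = os.times()
    wall_start = time.perf_counter()
    for src, dst in pairs:
        subprocess.run(make_command(src, dst), check=True)
    wall = time.perf_counter() - wall_start
    cpu_end = os.times()
    cpu = ((cpu_end.children_user - cpu_start.children_user)
           + (cpu_end.children_system - cpu_start.children_system))
    return wall, cpu


# Hypothetical real usage (paths and flac arguments are placeholders):
#   import glob
#   sources = glob.glob(r"D:\music\**\*.flac", recursive=True)
#   pairs = [(s, swap_drive(s, "E:")) for s in sources]
#   wall, cpu = time_batch(pairs, lambda s, d: flac_command(s, d, ["-8"]))

# Dry run with the Python interpreter standing in for flac, so the timing code can be tried anywhere.
wall, cpu = time_batch([("in.flac", "out.flac")],
                       lambda s, d: [sys.executable, "-c", "pass"])
print(f"wall {wall:.3f} s, child CPU {cpu:.3f} s")
```

The difference between wall time and CPU time is then the time spent waiting, mostly on I/O, which is the "free" budget asked about. That reading only holds while the encoder runs single-threaded.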

Re: FLAC v1.4.x Performance Tests

Reply #122 – 2022-10-11 17:50:16

lol. Why look so far.

When searching the web for compiler options guess where it leads to?
far far away

Re: FLAC v1.4.x Performance Tests

Reply #123 – 2022-10-11 20:18:51

Quote from: Wombat on 2022-10-11 17:50:16

lol. Why look so far. When searching the web for compiler options guess where it leads to?
far far away

"This is likely to be my last build"

Cue 2022:

Quote from: Wombat on 2022-10-10 04:11:56

Guess this is my last try.

Porcus quoting self:

Quote from: Porcus

rehab is for quitters

Re: FLAC v1.4.x Performance Tests

Reply #124 – 2022-10-11 21:04:20

You got me

but somehow it is too much fun

Guess I have to try some more, and maybe build a 'skylake' version for sundance to test when I am at my PC later.
