Topic: FLAC v1.4.x Performance Tests

Re: FLAC v1.4.x Performance Tests

Reply #100
In my case, single- vs. multi-threaded encoding does not affect the speed ranking. For example, Case's GCC 7.3.0 Haswell compile is always the slowest in both single- and multi-threaded tests.

For RAM, I am using a budget motherboard which only supports DDR4, even though the CPU supports DDR5. DDR4 has been mainstream for more than 5 years. I am using 2x8GB DDR4 3200.

As for AVX, AVX2 and FMA3, the 2013 Intel Haswell (4th gen) already supports all of them, and I was using an i3-4160 before February this year.

In fact, Intel 12th gen does not officially support AVX-512, but some older Core i CPUs do, even though flac 1.4.x does not seem to use AVX-512 at all.

I am using this RAM disk:
https://sourceforge.net/projects/imdisk-toolkit/


Re: FLAC v1.4.x Performance Tests

Reply #102
Does this RAM Disk hold your WAVs and FLACs during your performance tests?
Yes, all files are on the RAM drive, but I don't use timer64; I use foobar's console for timing. To enforce a single encoder instance, either combine everything into a single file, or do this in foobar's converter dialog:
https://hydrogenaud.io/index.php/topic,123025.msg1016809.html#msg1016809

Also, if relevant, I always use FAT32 to format the RAM drive, as NTFS is a more complex file system and occupies more space when formatted. The limitation is that FAT32 only allows files up to 4GB. With 32GB of RAM it should be no issue to create a RAM drive of at least 24GB, but no single file on it can exceed 4GB.
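For a rough feel of what the 4GB limit means when you combine everything into a single file, here is a small Python sketch (Python is just my choice for illustration). It assumes plain 16-bit/44.1 kHz stereo PCM with a canonical 44-byte WAV header, and the function names are made up:

```python
from pathlib import Path

# FAT32 keeps each file's size in a 32-bit field, so the hard limit is 4 GiB minus 1 byte.
FAT32_MAX_FILE = 2**32 - 1

# Assumption: 16-bit stereo PCM at 44.1 kHz with a canonical 44-byte WAV header.
CD_BYTES_PER_SECOND = 44100 * 2 * 2
WAV_HEADER = 44

def max_cd_wav_hours():
    """Longest CD-quality WAV that still fits in one FAT32 file, in hours."""
    return (FAT32_MAX_FILE - WAV_HEADER) / CD_BYTES_PER_SECOND / 3600

def too_big_for_fat32(folder):
    """List (path, size) for every file under `folder` that is larger than FAT32 allows."""
    return [(p, p.stat().st_size) for p in Path(folder).rglob("*")
            if p.is_file() and p.stat().st_size > FAT32_MAX_FILE]

print(f"max single CD-quality WAV on FAT32: {max_cd_wav_hours():.2f} h")
```

So a combined CD-quality WAV of roughly 6.7 hours is about the most one FAT32 file can hold; anything longer has to be split.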

Make sure "Create virtual disk in physical memory" is selected when creating the RAM disk.

Re: FLAC v1.4.x Performance Tests

Reply #103
Tried in foobar2000 like you suggested (single thread, 40 WAVs):
-> Total encoding time: 0:39.531, 273.51x realtime (single thread)
-> Total encoding time: 0:06.688, 1616.65x realtime (allow multiple threads), around 6x faster (matches the 6 cores)
But the single-thread encode is much slower than flac.exe started in a console window: 0:25.288

Btw, I tested the RAM disk (NTFS) and the encoding time improved by around 200 ms (0.8%), plus you extend your SSD's lifetime...
Going to repeat the test with FAT32...
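The realtime factor is just audio length divided by wall-clock time, so the numbers above can be cross-checked against each other. Here is a quick Python sketch (my own arithmetic, nothing taken from foobar2000 or flac) that back-computes the length of the test set from the single-thread result and derives the other ratios:

```python
def realtime_factor(audio_seconds, wall_seconds):
    """Seconds of audio encoded per second of wall-clock time."""
    return audio_seconds / wall_seconds

# Figures from this post (40 WAVs, foobar2000 vs. console flac.exe)
single_thread = 39.531   # s, reported as 273.51x realtime
multi_thread = 6.688     # s, reported as 1616.65x realtime
console_flac = 25.288    # s, flac.exe in a console window

audio_seconds = single_thread * 273.51          # back-computed length of the test set
speedup = single_thread / multi_thread          # multi-thread gain on 6 cores
console_speed = realtime_factor(audio_seconds, console_flac)
ram_disk_gain = 0.200 / console_flac            # the ~200 ms RAM-disk improvement, as a fraction

print(f"audio: {audio_seconds / 3600:.2f} h, speedup: {speedup:.2f}x, "
      f"console: {console_speed:.1f}x realtime, RAM disk: {ram_disk_gain:.1%}")
```

This works out to about 3 hours of audio, a multi-thread speedup of about 5.9x, roughly 428x realtime for the console run, and about 0.8% for the RAM disk, which is consistent with the figures reported.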


Re: FLAC v1.4.x Performance Tests

Reply #105
In fact, Intel 12th gen does not officially support AVX-512, but some of the older Core i does, even though flac 1.4.x does not seem to use AVX-512 at all.

Sundance's 8th-gen i7 doesn't either, it seems. For your 12th-gen i3 here, the same instruction set extensions are listed.

However, the 12th generation boasts a block with the fancy name "Gaussian & Neural Accelerator", which, at the risk of just parroting marketing spin, "is an ultra-low power accelerator block designed to run audio and speed-centric AI workloads. Intel® GNA is designed to run audio based neural networks at ultra-low power, while simultaneously relieving the CPU of this workload."
Not sure if anything will utilize that?!

Re: FLAC v1.4.x Performance Tests

Reply #106
I disabled GNA in BIOS in all tests.

Re: FLAC v1.4.x Performance Tests

Reply #107
It seems I'm quite limited with the HP BIOS on my EliteDesk 800 G4.
There are no settings to disable any of the extended CPU features; I can only toggle "Multithreading" and "VTx"...  :(

Re: FLAC v1.4.x Performance Tests

Reply #108
The FMA intrinsics are compiled with "-ffast-math".
[...]
I am not sure why the SSE and AVX ones are not.
Because the SSE and AVX code is written with intrinsics, while the FMA code is plain C targeted at FMA. With SSE and AVX the instructions need not be reordered, but with FMA they do, so -fassociative-math is needed, which is part of -ffast-math.
Does this mean that using -Ofast globally can affect something completely unrelated, like the progress indicator? Are there in-source directives to prevent such global optimizations in certain parts of the code?

My experience with vectorization is limited to GPU shaders and game engines, without touching low-level stuff like intrinsics.
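This example shows why the compiler needs explicit permission, as in the -fassociative-math answer above. Floating-point addition is not associative, and a fused multiply-add rounds once instead of twice, so both reordering and fusing can change the last bits of a result. Here is a small Python sketch. The FMA is emulated with exact rational arithmetic (fractions.Fraction) rather than a real FMA instruction, and the helper name is hypothetical:

```python
from fractions import Fraction

# 1) Reassociation: the same three numbers, grouped differently.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

# 2) Fusing: a*b + c with one rounding (FMA) vs. two roundings (multiply, then add).
def fused_multiply_add(a, b, c):
    """Emulated FMA: exact product and sum, rounded to a float only once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

separate = 0.1 * 10.0 - 1.0                   # product rounds to exactly 1.0 first
fused = fused_multiply_add(0.1, 10.0, -1.0)   # keeps the tiny representation error of 0.1

print(left == right, separate, fused)
```

Neither result is "wrong", they are just different roundings. That is why such transformations are opt-in, and why code like a progress indicator usually would not visibly care.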

Re: FLAC v1.4.x Performance Tests

Reply #109
Btw. Tested the RAM disk (NTFS) and the encoding time improved by around 200 msec (0.8%) + you extend your SSDs lifetime...
Going to repeat the test with FAT32...
I use the SoftPerfect RAM Disk, and exFAT is clearly the fastest with it, but it has a very big overhead due to its 64k cluster size. That shouldn't matter unless you put lots of small files on it.
Until recently, Windows had an uppercase-renaming bug with exFAT. That has since been fixed.
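Here is roughly how big that overhead is. Each file occupies a whole number of clusters, so the waste is the unused tail of its last cluster. The Python sketch below uses the 64k cluster size mentioned above. The file sizes are made up, and file system metadata is ignored:

```python
def allocated_bytes(size, cluster=64 * 1024):
    """Space a file occupies on disk: its size rounded up to whole clusters."""
    return -(-size // cluster) * cluster   # integer ceiling division

def slack_bytes(sizes, cluster=64 * 1024):
    """Total space wasted in partially filled last clusters."""
    return sum(allocated_bytes(s, cluster) - s for s in sizes)

# Hypothetical sizes: 1000 small 1 kB files vs. 40 files of ~30 MB each
small_files = [1000] * 1000
big_files = [30_000_000] * 40

print(slack_bytes(small_files), slack_bytes(big_files))
```

With many tiny files the slack dominates (about 64 MB wasted for 1 MB of data), while for a few dozen large WAV/FLAC files it stays below one cluster per file.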
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: FLAC v1.4.x Performance Tests

Reply #110
Seems I'm quite limited here with my hp BIOS @ Elitedesk 800 G4.
There are no such settings like to disable some of the extended CPU features, I only can toggle "Multithreading" and "VTx"...  :(
Motherboards sold separately, like the ones from Asus, Gigabyte, MSI and such, usually offer more options.

Re: FLAC v1.4.x Performance Tests

Reply #111
Size (bytes)  Seconds  Saved/s*  Setting
11969604531       833         -  -8
11968502388       940     10300  -8 -A "tukey(666e-3);subdivide_tukey(3/333e-3)"
11967556575      1155      4399  -8 -A "subdivide_tukey(4)"
11966463371      1555      2733  -8 -A "subdivide_tukey(5)"
11961291433      3003      3572  -8p (note jump in time when using -p)
11960179719      3350      3204  -8p -A "tukey(666e-3);subdivide_tukey(3/333e-3)"
11959250164      5131       522  -8p -A "subdivide_tukey(4)"
11958125424      7796       422  -8p -A "subdivide_tukey(5)"
* Saved/s = bytes saved per extra second of encoding time, relative to the row above
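The "saved per second" column can be reproduced from the first two columns. It is the bytes saved divided by the extra seconds spent, with each row compared to the row directly above it. Here is a quick Python check using the numbers copied from the table:

```python
# (size in bytes, seconds, setting) copied from the table above
ROWS = [
    (11969604531, 833, "-8"),
    (11968502388, 940, '-8 -A "tukey(666e-3);subdivide_tukey(3/333e-3)"'),
    (11967556575, 1155, '-8 -A "subdivide_tukey(4)"'),
    (11966463371, 1555, '-8 -A "subdivide_tukey(5)"'),
    (11961291433, 3003, "-8p"),
    (11960179719, 3350, '-8p -A "tukey(666e-3);subdivide_tukey(3/333e-3)"'),
    (11959250164, 5131, '-8p -A "subdivide_tukey(4)"'),
    (11958125424, 7796, '-8p -A "subdivide_tukey(5)"'),
]

def saved_per_second(rows):
    """Bytes saved per extra second of encoding, each row vs. the row above."""
    result = []
    for (prev_size, prev_secs, _), (size, secs, setting) in zip(rows, rows[1:]):
        result.append((setting, round((prev_size - size) / (secs - prev_secs))))
    return result

for setting, rate in saved_per_second(ROWS):
    print(f"{rate:>6}  {setting}")
```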
How about this setting on your corpus?
-8 -A "tukey(75e-2);subdivide_tukey(3/25e-2)"
Of course I asked this because it works better on my corpus (about one day of duration), and I adjusted the corpus weighting so that the compression ratio is roughly 55%. You can also try other values that don't require rounding; for example, 666e-3 may mean something like 0.66600000858306884765625 in single-precision float.
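This sketch shows what such a value turns into. It assumes, as suggested above, that the parameter ends up stored as an IEEE 754 single-precision float. Python's struct module rounds a double to single precision, and decimal.Decimal shows the exact value that gets stored:

```python
import struct
from decimal import Decimal

def as_single(x):
    """Round a Python float (a double) to IEEE 754 single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

for text in ["666e-3", "75e-2", "25e-2", "6875e-4", "125e-3"]:
    stored = Decimal(as_single(float(text)))
    print(f"{text:>8} -> {stored}  exact: {stored == Decimal(text)}")
```

Values that are sums of small powers of two (0.75, 0.25, 11/16, 1/8) survive unchanged, while 0.666 does not.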


Re: FLAC v1.4.x Performance Tests

Reply #112
GNA [...] Not sure if anything will utilize that?!
It's like a GPU but much smaller. FLAC won't use it.

Does it mean using -Ofast globally can affect something completely irrelevant like progress indicator and such?
Potentially yes, but in practice most floating-point code doesn't rely on the compiler precisely following the floating-point standards. If the progress indicator is affected, you probably wouldn't be able to see what's different.

The real problem with -Ofast is that it can insert code that switches the CPU into a faster but not standards-compliant mode, and this can affect any program that loads a library compiled with -Ofast.
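On x86 with GCC, that "faster but not standards-compliant mode" typically means flush-to-zero / denormals-are-zero. -ffast-math/-Ofast can switch it on at program startup (I believe via crtfastmath.o, but treat that detail as an assumption). Tiny "subnormal" results then become 0.0. Python cannot flip that CPU flag, so the sketch below only simulates the effect with a hypothetical flush_to_zero() helper:

```python
import sys

SMALLEST_NORMAL = sys.float_info.min   # smallest normal double, about 2.2e-308

def flush_to_zero(x):
    """Simulated FTZ: treat any subnormal result as 0.0, like the CPU mode would."""
    return 0.0 if abs(x) < SMALLEST_NORMAL else x

a = 3e-308
b = 2.5e-308
diff = a - b                    # about 5e-309: subnormal, but still non-zero

print(a != b, diff, flush_to_zero(diff))
```

With standard IEEE behaviour, a != b guarantees a - b != 0. Under flush-to-zero that guarantee is lost, which is the kind of surprise a host program might not expect from a library it loads.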


Re: FLAC v1.4.x Performance Tests

Reply #114
How about this setting on your corpus?
-8 -A "tukey(75e-2);subdivide_tukey(3/25e-2)"
Of course I asked this because it works better on my corpus (about one day of duration),

Improves, because of the 25e-2. The difference between 75e-2 and 666e-3 in the single tukey varies by genre (the latter is better in the classical section, the former in the "other"), but the overall impact is less than a part per million.
Tested the same with "-p" added.

But lowering the subdivide_tukey tapering parameter helps, and I think it should be even lower. I tested and found -A subdivide_tukey(24e-2) to be a good one without the additional -A tukey, but preliminary testing indicates that 25e-2 is "too high" when the additional tukey is present.
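For reference, the tapering parameter is the fraction of the window that gets the cosine taper. Below is a minimal Python sketch of the textbook Tukey window, assuming the standard definition. flac's own window code may round the taper length differently, subdivide_tukey's partial windows are not modelled, and the function names are hypothetical:

```python
import math

def tukey(n_samples, p):
    """Textbook Tukey (tapered cosine) window.

    p is the fraction of the window that is tapered, split evenly between
    both ends: p <= 0 gives a rectangle, p >= 1 gives a Hann window.
    """
    if p <= 0:
        return [1.0] * n_samples
    p = min(p, 1.0)
    edge = p * (n_samples - 1) / 2          # length of each cosine taper
    window = []
    for n in range(n_samples):
        m = min(n, n_samples - 1 - n)        # distance to the nearer end
        if m < edge:
            window.append(0.5 * (1 - math.cos(math.pi * m / edge)))
        else:
            window.append(1.0)
    return window

def hann(n_samples):
    """Hann window, i.e. the p = 1 limit of the Tukey window."""
    return [0.5 * (1 - math.cos(2 * math.pi * n / (n_samples - 1))) for n in range(n_samples)]
```

A small taper such as 1/8 leaves most of the block at full weight, while a bigger one like 0.5 or more approaches Hann. That is one way to see why the two tukeys in the combinations discussed here benefit from being different.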

The 666 & 333 were not "optimal" choices - they were picked more out of the idea that if I wanted to deviate from the 1/2 and 1/2 parameters, then "2/3 and 1/3" would be the next idea. I surely tested both 666&333 and 333&666, but I didn't do any exhaustive testing. So why state them with three-decimal "accuracy"? Hey, 3/333e-3 is easy to remember. (And then the metal swine selected 666 over 667 for kinda the same reason.)


You can also try other values which don't require rounding, for example 666e-3 may mean something like 0.66600000858306884765625 in single float.
The predictor is quantized to integer coefficients, so decimals beyond some k-th place will at the very least not matter very often. Quick testing on 11 CD images, starting from your 0.75 & 0.25: I got bit-identical files if I tweaked the fifth decimal, but the fourth would matter. I mean, not "matter" much, but yield different files.

Re: FLAC v1.4.x Performance Tests

Reply #115
Tested: at "-8" and above, adding a tukey to a subdivide_tukey, with various taperings. Do these choices have a (much) different impact across genres? (No!)
-8p -A "tukey(P);subdivide_tukey(N/Q)" for N=3, 4, 5 and various P and Q.
Also without "-p".

Of course it doesn't matter much! On one hand, you can shrug it off as nothing by saying that for N=3, the extra tukey - with "optimal" parameters - saves 0.01 percent over standard -8p, and good/bad parameters make for only half of this. Nothing to care about? On the other hand, it is only slightly less than going up to N=4, and slightly more than going from N=4 to N=5. Each of those costs much more time.
So if standard -8p is not enough for you - well, for the sport of it, I guess - and you are ready to type in some -A manually, you might as well consider this. Same if you want to go up from -8 but not all the way to -8p: just remove the "p" from the commands below; your material is likely to make more difference than that.

tl;dr: if you add an extra tukey to get ~half the benefit of a higher subdivide_tukey at a fraction of the extra time, make its tapering parameter bigger than the default (well, maybe the default if you are at very high compression) and the subdivide_tukey taper parameter very small.
If you like to think in 1/16ths: after a bit of tweaking, you could try something like 11, 10, 9 or 10, 9, 8 combined with a 1/8 as follows:
N=3: -8p -A "tukey(6875e-4);subdivide_tukey(3/125e-3)" <---- 11/16ths & 1/8th, or reduce the first to 10/16ths for classical music
N=4: -8p -A "tukey(6250e-4);subdivide_tukey(4/125e-3)" <---- 10/16ths & 1/8th, or reduce the first to 9/16ths for classical music. Yes keep the 1/8th.
N=5: -8p -A "tukey(5625e-4);subdivide_tukey(5/125e-3)" <---- 9/16ths & 1/8th, or reduce the first to 5e-1 for classical music. Again keep the 1/8th.
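Since these suggestions are in 1/16ths, a tiny hypothetical helper can produce the -A strings in the same e-notation. Fraction makes sure nothing is silently rounded. Note that 1/8 comes out as 1250e-4, which is the same value as 125e-3:

```python
from fractions import Fraction

def e4(value):
    """Write an exact fraction as '<integer>e-4', refusing anything that would need rounding."""
    scaled = Fraction(value) * 10**4
    if scaled.denominator != 1:
        raise ValueError(f"{value} needs more than 4 decimals")
    return f"{scaled.numerator}e-4"

def apod(n, tukey_16ths, sub_taper=Fraction(1, 8), preset="-8p"):
    """Hypothetical helper: build the -A option for tukey(k/16);subdivide_tukey(n/taper)."""
    return (f'{preset} -A "tukey({e4(Fraction(tukey_16ths, 16))});'
            f'subdivide_tukey({n}/{e4(sub_taper)})"')

for n, k in ((3, 11), (4, 10), (5, 9)):
    print(apod(n, k))
```

Passing preset="-8" gives the variant without -p.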

But the genre differences between classical, heavier/metal and "other" didn't cause much drama - not even "relatively" to the very small impact of it all. That is kinda reassuring; even if classical music could use N/2e-1, going down to N/<one eighth> made virtually no difference for it.


So just to explain what I did here:
Also checked (preliminarily): as in Reply 48, combining with a bigger single tukey.
Hypothesis: because the single tukey has always had the default parameter 0.5 - this after quite a bit of testing back in the day - there is no good reason why this small tapering should be good for a single tukey run ==> the reason it works is the subdivisions ==> if you want to improve, try one with a bigger taper parameter, like -5 uses.
This to be tested with -p
I first made the "arbitrary" selection (files with "j" in the name) and then ran the test on the remainder, distinguishing between the classical music, the heavy rock/metal and the "other".
The P and Q are "7e-2", "14e-2" etc., i.e. 0.07 apart, though only the "most reasonable" ones were tested on the big corpus. Then I tweaked the parameters slightly from the "best", if only to see if small tweaks led to unexpectedly big changes. (They did not.)

Results: well, not unexpected given Reply 48: make the P and Q tapering parameters quite far from each other, as in tukey(<big P>);subdivide_tukey(N/<small Q>). The "big" does not mean close to 1, though.
Genre differences: Nothing dramatic - nothing "relatively dramatic" relative to the .01 percent impact either. Sure there is a clear pattern in that the heavier music wants smaller Q, down below 0.1, and also slightly bigger P, but not much - and the classical music calls for slightly lower P. But the "overall" minimum is not far (in kilobytes) from each genre's minimum.

So the first runs ended up with
N=3: -8p -A "tukey(70e-2);subdivide_tukey(3/14e-2)"
N=4: -8p -A "tukey(56e-2);subdivide_tukey(4/14e-2)"
N=5: -8p -A "tukey(49e-2);subdivide_tukey(5/14e-2)"
Tweaking it and looking at genre differences, I ended up with something like the tl;dr above. It was the classical music section that made the "56" and "49" win, and it is the heavier section that pulls in the other direction. The 14 was a bit too high, except for classical music, where it mattered very, very little, like a few kB on 4 GB.

Re: FLAC v1.4.x Performance Tests

Reply #116
With only -8 subdivide_tukey(3/x) I got these figures, from best to worst compression:

Difficult content (~70.64% compression ratio)

3/1875e-4
2531723115 bytes

3/2e-1
2531723128 bytes

3/22e-2
2531723205 bytes

3/25e-2
2531724460 bytes

3/125e-3
2531724619 bytes

-8
2531763292 bytes

Difficult contents are the usual electronic music in my collection, and some loudness war songs. Simple contents include speech, classical, ethnic and songs with simple accompaniment.

Simple content (~43.13% compression ratio)

3/25e-2
1799030423 bytes

3/2e-1
1799035729 bytes

3/22e-2
1799046777 bytes

3/1875e-4
1799054794 bytes

3/125e-3
1799080973 bytes

-8
1799116764 bytes

Re: FLAC v1.4.x Performance Tests

Reply #117
With only -8 subdivide_tukey(3/x) I got these figures, from best to worst compression:

Here is where I actually got an oddity: .21 was worse than both .20 and .22, and the effect was not due to one genre. I tested a few more because in Reply #93 I found .32 to be better than .16 across all three genres, so the results below point at a parameter slightly lower than expected.

Anyway, disregarding .21 and doing (nearly) only your parameters, results are not outrageously far from yours, but slightly different - I suspect your speech content has some impact?
2e-1 was the best for both classical music and the "other" section. .22 was better than .1875 in these two genre sections. .2 also won overall.

For my heavier material, go lower: 3/125e-3 beats .1875, which beats .2, which beats .22, which beats .25.
Also checked 1e-1, which narrowly lost to 125e-3.

Impact of choosing "wrong": With your "simple" content, even the difference between the two best was like 3 parts per million. For my classical music, everything from .1875 and up would be within that interval, and same for the "other" genre.
But for your "difficult" material, everything from 125e-3 and up fell within one ppm, and my material needed 4ppm.
Still not much.
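The ppm figures in this comparison can be reproduced from the sizes posted in Reply #116. The Python sketch below (helper name hypothetical) expresses each setting's output size relative to the best one in that content group:

```python
# Reply #116: file sizes in bytes for "-8 -A subdivide_tukey(3/x)" and plain -8
DIFFICULT = {
    "3/1875e-4": 2531723115, "3/2e-1": 2531723128, "3/22e-2": 2531723205,
    "3/25e-2": 2531724460, "3/125e-3": 2531724619, "-8": 2531763292,
}
SIMPLE = {
    "3/25e-2": 1799030423, "3/2e-1": 1799035729, "3/22e-2": 1799046777,
    "3/1875e-4": 1799054794, "3/125e-3": 1799080973, "-8": 1799116764,
}

def ppm_vs_best(results):
    """How much bigger each setting's output is than the best one, in parts per million."""
    best = min(results.values())
    return {name: (size - best) / best * 1e6 for name, size in results.items()}

for label, results in (("difficult", DIFFICULT), ("simple", SIMPLE)):
    for name, ppm in sorted(ppm_vs_best(results).items(), key=lambda kv: kv[1]):
        print(f"{label:>9}  {name:>10}  {ppm:6.2f} ppm")
```

For the "simple" set, the runner-up lands at about 3 ppm. For the "difficult" set, everything down to 125e-3 stays under 1 ppm, while plain -8 is around 16 ppm behind.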


The low tapering parameter I found in Reply 115 just underlines that with an additional tukey, you want the two tukeys to be different.

Re: FLAC v1.4.x Performance Tests

Reply #118
Guess this is my last try.
I am now also using the ./configure options from Case's suggestion, plus -Ofast and -fipa-pta, which were suggested elsewhere. -fipa-pta optimizes a tiny bit and saves some kB in the binaries, at the cost of only compile time.


Re: FLAC v1.4.x Performance Tests

Reply #119
On my set of test files, your latest (really hopefully not last) build is right between Case's gcc v12.2 and gcc v7.3 builds:
Code:
FLAC Binary: flac141-case-haswell.exe (860160 bytes) = gcc v7.3
FLAC Option: -7
 Average time =  25.268 seconds (3 rounds), Encoding speed = 427.89x
 FLAC size = 1.167.014.383 bytes (= 61,188% of WAV size, ~863 kbps)

FLAC Binary: flac141-wombat2.exe (784384 bytes)
FLAC Option: -7
 Average time =  25.710 seconds (3 rounds), Encoding speed = 420.54x
 FLAC size = 1.167.014.381 bytes (= 61,188% of WAV size, ~863 kbps)

FLAC Binary: flac141-case-gcc12.exe (781312 bytes)
FLAC Option: -7
 Average time =  26.100 seconds (3 rounds), Encoding speed = 414.26x
 FLAC size = 1.167.014.383 bytes (= 61,188% of WAV size, ~863 kbps)

And, fwiw, I was able to get some speed gain compared to plain -7 (on my test set [classic rock music]) at almost no cost with a smaller block size:
Code:
FLAC Binary: flac141-case-haswell.exe (860160 bytes)
FLAC Option: -7 -b3584
 Average time =  23.949 seconds (3 rounds), Encoding speed = 451.46x <= faster encoding (428x -> 451x) [compared to -7]
 FLAC size = 1.167.032.442 bytes (= 61,189% of WAV size, ~863 kbps) <= minimally worse compression: 0.001 percentage points

Re: FLAC v1.4.x Performance Tests

Reply #120
Added more electronic and loudness-war content, hand-picked to include only the highest-bitrate files, but no noise music. Around 74.5% compression ratio.

1.3.1 (Xiph)

-8 -b2304
3200412387 bytes

-8
3202131236 bytes

1.3.2 (Xiph)

-8 -b2304
3200203911 bytes

-8
3201989505 bytes

1.4.1 (Case GCC 12.2.0)

-8 -b2304
3199429338 bytes

-8
3201122995 bytes

-8 -A "tukey(5e-1);partial_tukey(2);punchout_tukey(3)"
3201407279 bytes

Re: FLAC v1.4.x Performance Tests

Reply #121
Yikes, I suck at PowerShell ...

Can anyone hack together for me a script that does the following:

FOR every *.flac IN (D:\given path pattern...\*.flac) DO flac <parameters> with output <same filename except that in E: rather than D>
and measures total CPU time and total time including I/O?

Point being: how much "compression effort" is "free in time" because it compresses while busy writing?
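This is not PowerShell, but here is a minimal Python sketch of the idea (the function name and paths are hypothetical). It walks a source folder for *.flac and re-encodes each file with flac's -f and -o options into the same relative path under another root. It then reports wall-clock time, which includes all I/O, and the CPU time used by the flac child processes. Caveat: os.times() only reports child CPU time on Unix-like systems. On Windows those fields are always 0, so there you would need another way (for example a third-party module) to get the CPU figure:

```python
import os
import subprocess
import time
from pathlib import Path

def encode_tree(src_root, dst_root, flac_args, flac_exe="flac"):
    """Re-encode every *.flac under src_root into the same relative path under dst_root.

    Returns (wall_seconds, child_cpu_seconds). Wall time includes all I/O;
    child CPU time comes from os.times(), which is only filled in on Unix-like
    systems (on Windows the children_* fields are always 0).
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    cpu_start = os.times()
    wall_start = time.perf_counter()
    for src in sorted(src_root.rglob("*.flac")):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # -f overwrites an existing output file, -o names the output file
        subprocess.run([flac_exe, *flac_args, "-f", "-o", str(dst), str(src)], check=True)
    wall = time.perf_counter() - wall_start
    cpu_end = os.times()
    cpu = (cpu_end.children_user - cpu_start.children_user) + (
        cpu_end.children_system - cpu_start.children_system)
    return wall, cpu
```

Usage would be something like `wall, cpu = encode_tree(r"D:\music", r"E:\music", ["-8p"])`, with the placeholder paths replaced by the real ones. Comparing the two numbers across presets (one flac at a time, as here) should show how much extra CPU effort hides behind the I/O.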

Re: FLAC v1.4.x Performance Tests

Reply #122
lol. Why look so far? :) When searching the web for compiler options, guess where it leads?
far far away

Re: FLAC v1.4.x Performance Tests

Reply #123
lol. Why look so far. :) When searching the web for compiler options guess where it leads to?
far far away
"This is likely to be my last build"

Cue 2022:
Guess this is my last try.

Porcus quoting self:
Quote from: Porcus
rehab is for quitters

 O:)

Re: FLAC v1.4.x Performance Tests

Reply #124
You got me  :-[  but somehow it's just too much fun  :)
Guess I have to try some more, and maybe make a 'skylake' version for Sundance to test when I am at my PC later.