Topic: FLAC v1.4.x Performance Tests

Re: FLAC v1.4.x Performance Tests

Reply #500
-8p down from 6'13" to 3'49" and then a different compile shaves off nearly another minute - not complaining, no.
-8p is now not even 13 times as slow as -5  ;)

Re: FLAC v1.4.x Performance Tests

Reply #501
Thanks. It turns out that some changes I made when working out the 32-bit encoder/decoder affected the 24-bit part more than I thought. I was under the impression the code paths meant for 32-bit audio were only seldom used for 24-bit audio, but it turns out certain kinds of 24-bit audio (especially those with a completely empty upper half of the spectrum) use these code paths a lot, and they are much slower.

So, these changes make the choice between these code paths stricter: that choice was previously made rather roughly (on the safe side, of course), but now the encoder goes through a little more math so it only takes the slow code path when absolutely necessary.

The speed-up is highly dependent on source material. Audio with a high samplerate in which the upper frequencies are fully 'utilised' does not see any change at all, most audio I've tested sees quite some improvement at preset 8, and material where really no audio exists above 20kHz sees the most improvement at preset 5, but is still slow at preset 8.
Music: sounds arranged such that they construct feelings.

Re: FLAC v1.4.x Performance Tests

Reply #502
(especially those with a completely empty upper half of the spectrum)
Oh, and that maybe makes it even more interesting - those signals appear to be the ones where -e matters, i.e. where the model choice algorithm is less reliable. If such signals can be "identified" (not with certainty, but with enough statistical association), then we might be in for some fun ... uh, assuming that developer and testers have an infinite amount of spare time, of course.

Re: FLAC v1.4.x Performance Tests

Reply #503
This is a spectrogram of the 24-bit/48kHz wave from my previous post.

[spectrogram attachment]




The wave file for this run is 24-bit/96kHz, 1h52m, 3.63 GiB.

flac git-04532802 (2024-05-02)
Code:
     1 thread     8 threads
-5   0m26.6009s   0m7.253s
-5p  1m44.983s    0m20.706s
-8   2m10.424s    0m25.829s
-8p  17m37.321s   3m30.797s

flac git-cfe3afca (2024-05-16)
Code:
     1 thread     8 threads
-5   0m20.809s    0m6.847s
-5p  0m35.891s    0m8.379s
-8   1m46.554s    0m21.323s
-8p  11m30.919s   2m20.278s



Re: FLAC v1.4.x Performance Tests

Reply #504
-5p from 105 to 36 seconds ...  ;D
I didn't quite get the change, but is it so that the code checks that the residual fits a signed short, and when it does ... much faster?
Has it made changes as to when it can select the 4-bit method?

How does -5e work? And what about sizes for -5, -5e, -5p? Never mind whether there is any change in -5e, I am curious about the bang for the buck, and the comparison between -5e and -5p must be quite tilted by now.

Re: FLAC v1.4.x Performance Tests

Reply #505
I didn't quite get the change, but is it so that the code checks that the residual fits a signed short, and when it does ... much faster?

When adding the 32-bit PCM part of the encoder, I amended the FLAC spec to require that all residuals fit a 32-bit signed int. This is to keep decoding simple; the encoder must make sure this requirement is met.

For certain predictors it is possible to calculate up front that checking each residual sample separately is not necessary. When that cannot be established, each residual sample must be checked, which is of course slower. This calculation was improved, so the slow process of checking all residual samples is needed less often.

I didn't realise at the time that 24-bit encoding would be affected, but it turns out that for signals with very little noise in the upper frequencies (= a smooth signal) the predictor can be of very high quality, which can lead to the residual spiking in parts of the signal where the predictor doesn't fit. That doesn't normally happen, but it needs to be checked anyway.
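
A minimal sketch of the kind of check being described (my own illustration under assumptions, not the actual libFLAC code): if the worst-case residual a predictor can produce from bps-bit input already fits a signed 32-bit int, the per-sample check can be skipped entirely.

Code:
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not taken from libFLAC: decide whether this predictor
 * can ever produce a residual outside the int32 range for bps-bit samples. */
static int predictor_cannot_overflow(const int32_t qlp_coeff[], unsigned order,
                                     unsigned shift, unsigned bps)
{
    int64_t max_sample = (int64_t)1 << (bps - 1);  /* largest possible |input sample| */

    int64_t coeff_sum = 0;                         /* worst-case |prediction| scaling */
    for (unsigned i = 0; i < order; i++)
        coeff_sum += llabs((int64_t)qlp_coeff[i]);

    /* residual = sample - prediction, so worst case is |sample| + |prediction| */
    int64_t max_residual = max_sample + ((coeff_sum * max_sample) >> shift);

    return max_residual <= INT32_MAX;  /* true: per-sample overflow checks can be skipped */
}

If this conservative bound fails, it does not mean an overflow would actually occur, only that the encoder has to fall back to checking every residual sample (the slow path described above).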
Music: sounds arranged such that they construct feelings.

Re: FLAC v1.4.x Performance Tests

Reply #506
Ah. So you have just improved a criterion of the following kind:
"This predictor vector cannot possibly create any too big residual from a history of N-bit samples, so we can save time by bypassing the size checks that we are mathematically sure it would anyway pass"?

Re: FLAC v1.4.x Performance Tests

Reply #507
Assuming no performance differences between binaries prepared by different people (same FLAC git version, same compiler and version), these latest optimizations are indeed significant even on my hardware (mobile Intel i5-8250U), but undoubtedly content-dependent.

Audio file 24/44, 1:20 hour, Flac compression "-4":
- Flac commit: 28e4f05, GCC14.1 (NetRanger) - ca. 440-450x
- Flac git-cfe3afca, GCC14.1 (Wombat) - ca. 495-505x
- Flac git-cfe3afca, GCC12.2 (Replica9K) - ca. 480-500x, fluctuates greatly during encoding

..but NetRanger's Clang compile is consistently a very slight winner:
- Flac commit: 28e4f05, Clang 18.1.4 (NetRanger) - ca. 505-515x

The result of Clang for git-cfe3afca could therefore be quite interesting. However, if I understand correctly, under standard conditions with no external influences (software etc.) the difference between GCC13 and Clang18 should ideally be none. GCC14 is a bit below v.13 in performance for me as well.

Re: FLAC v1.4.x Performance Tests

Reply #508
Compression -4 may be pretty good with a clang compile but high compression and multithreaded stress should be clearly faster with GCC builds.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: FLAC v1.4.x Performance Tests

Reply #509
Ah. So you have just improved a criterion of the following kind:
"This predictor vector cannot possibly create any too big residual from a history of N-bit samples, so we can save time by bypassing the size checks that we are mathematically sure it would anyway pass"?
Yes, that is correct.
Music: sounds arranged such that they construct feelings.

Re: FLAC v1.4.x Performance Tests

Reply #510
Compression -4 may be pretty good with a clang compile but high compression and multithreaded stress should be clearly faster with GCC builds.


Ok, I didn't try that.. but luckily, everyone else here does. I chose compression "-4" based on the charts and my own tests yeeeeeears ago; all the other compression levels ceased to exist for me, I forgot them completely.. and that continues to this day.
But I understand that the code here should primarily be tested under stress conditions of higher compression, where the changes are more pronounced.

Re: FLAC v1.4.x Performance Tests

Reply #511
Ok, I didn't try that.. but luckily, everyone else here does. I chose compression "-4" based on the charts and my own tests yeeeeeears ago; all the other compression levels ceased to exist for me, I forgot them completely.. and that continues to this day.
But I understand that the code here should primarily be tested under stress conditions of higher compression, where the changes are more pronounced.
SORRRY! I'm a bit outdated, it seems. I just tried the latest flac git together with the latest Clang, and indeed it is faster than everything else so far, even with high compression! Guess I must change some things here.
Attached is a generic Clang -O3 compile without many additional optimizations, for testing the latest git.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: FLAC v1.4.x Performance Tests

Reply #512
First of all thanks for the binary file, Wombat!
I know this is a discussion primarily about encoder code optimization and not about the features and impact of different compilers on the result, so I don't want to clutter it up with a proxy issue here, sorry for that.
So just in short.. somehow things are going crazy on my side, as I'm unable to reproduce the previous results with the same setup anymore.. everything is simply slower today.. Anyway, the speed-up from "Release 1.4.3" to the latest git with GCC 14.1 does not seem to happen in the case of Clang. Rather, your binary has at best the same encoding performance as NetRanger's Clang binary of "Release 1.4.3".
I'll have to look into what's wrong here and repeat it later.. There doesn't seem to be any background-task issue, nor do the fb2k converter settings impact this.

Re: FLAC v1.4.x Performance Tests

Reply #513
You're absolutely right. Only when there really is a final version one day will I try different ways of compiling.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: FLAC v1.4.x Performance Tests

Reply #514
I tried some AVX2 versions on my 5900X: metaflac's ReplayGain scanning is clearly faster with GCC, 16-bit with the disabled-asm option is clearly faster with GCC, and 16-bit/24-bit combined is faster with the default AVX2 Clang build.
Everything I was able to produce with LTO and Clang resulted in slower binaries.
I guess some experienced users can do better with LTO or PGO, or even a combination of both.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: FLAC v1.4.x Performance Tests

Reply #515
These are the flags I used lately for AVX2 with GCC. Unrolling loops does well. I mostly tested all this on my AMD 5900X using high compression, -8p and -8ep.
 
disabled asm optimizations:
-O3 -march=x86-64-v3 -ffast-math -falign-functions=32 -fipa-pta -funroll-loops --param max-unroll-times=10 -fno-stack-protector -fgraphite-identity -floop-nest-optimize -flto -ffat-lto-objects

default:
-O3 -march=x86-64-v3 -ffast-math -falign-functions=32 -fipa-pta -funroll-loops --param max-unroll-times=4 -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 -fno-stack-protector -fgraphite-identity -floop-nest-optimize -flto -ffat-lto-objects -ftracer

-falign-functions=16, or omitting it, seems to run better on older CPUs. I hope this helps somebody make faster binaries.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: FLAC v1.4.x Performance Tests

Reply #516
I assumed that the resulting sizes would be distributed like this:

[attached chart]

I have tested my whole audio library (about 31K files) and got the following results for file sizes.

Comparing sizes between "-8" and "-8e":
Parameters   Files   Percent
-8 < -8e       223    0,710%
-8 > -8e     31040   98,822%
-8 = -8e       147    0,468%

Comparing sizes between "-8" and "-8r8":
Parameters   Files   Percent
-8 < -8r8     3229   10,280%
-8 > -8r8    20618   65,642%
-8 = -8r8     7563   24,078%

Comparing sizes between "-8r8" and "-8e":
Parameters    Files   Percent
-8r8 < -8e      676    2,152%
-8r8 > -8e    30587   97,380%
-8r8 = -8e      147    0,468%


Is there any chance we can find out the reason for these cases and improve the encoding performance of FLAC?

Re: FLAC v1.4.x Performance Tests

Reply #517
The reason is explained here: https://github.com/xiph/flac/issues/728

TL;DR: FLAC approximates some stuff, which sometimes leads it to not pick the smallest possible representation. "Fixing" this would cause a major slowdown.
Music: sounds arranged such that they construct feelings.

Re: FLAC v1.4.x Performance Tests

Reply #518
I didn't know that FLAC -8 had so many different varieties. However, the slowdown in encoding speed has barely any effect on the compression ratio. I couldn't wait for "FLAC -8epr8" because it took too long.

Code:
i7 3770k, 16gb, 240gb
Busta Rhymes - 829,962,880 bytes

FLAC -8      33.272s  558,662,099 (67.31%)
FLAC -8r8    38.801s  558,185,865 (67.25%)
FLAC -8p    116.991s  558,016,569 (67.23%)
FLAC -8e    117.558s  558,392,936 (67.27%)
FLAC -8pr8  163.238s  557,504,740 (67.17%)
FLAC -8er8  183.193s  557,894,225 (67.21%)
FLAC -8ep   877.759s  557,731,511 (67.19%)
FLAC -8epr8    -          -

Re: FLAC v1.4.x Performance Tests

Reply #519
Not sure the slowdown for exact-rice is big compared to the patience you need to go -e, but there is cost and benefit, and I'd be surprised if the size improvement is much even compared to the modest improvements from -e.
@hat3k : If you take just the "exceptional" 223, 3229 and 676 files and check the improvement on those only - let me throw a guess in advance: they improve something in the ballpark of 0.01 percent? Anyway, with the link you got from ktf, you got a build that can do it with the exact calculation ... if you live in the Northern hemisphere it is winter now, and your CPU's excess heat can warm your feet :-)

@Hakan Abbas : Quite a bit of reference flac's encoding efficiency is in first estimating what model and precision to use, and then encoding only that. But there are brute-force switches. If you want a "lighter -8", try -8M or even -7M. The "M" switch will override the brute-force choice of stereo decorrelation (the "-m" that is usually part of -5-and-up) and use a smart heuristic there as well. You can grind down the encoder pretty much to a halt, trying tons of new different predictor vectors and hoping one of them improves. Settings above -8: https://hydrogenaud.io/index.php/topic,123025.msg1016625.html#msg1016625 . Note the last couple of lines relating it to WavPack's -x4

What those -e -p and -r8 do: 
-e: instead of estimating to choose the prediction order (history length) and, if applicable, the windowing function: try them all. Since flac 1.4.0 it rarely has much impact on CDDA material, but the estimation procedure seems to be fooled by low-passed material (that includes "bogus upsampling", so the impact is visible much more often at higher sampling rates).
The estimation procedure has been honed and improved over flac's lifetime, and back in the age of 1.1.x it was less trustworthy - and so the "-8" preset did invoke -e back then.
-p: There is a trade-off between coefficient precision and the space needed to store them. Brute-forcing that with -p should not give too much improvement, but sometimes it does better simply because the round-off is sometimes beneficial, so in effect there is an element of "trying more predictors", not just "coarser versions of the same".
-r: the partitioning. The presets have -r3 (max eight partitions per subframe) to -r6 (max sixty-four). The actual choice is brute-forced I think, starting from the finest choice allowed and then merging.
But what is not brute-forced is the choice of Rice exponent for each partition. Brute-forcing that is usually not needed: IIRC it is fast to calculate a choice that "is rarely worse than one away from the optimal", and being one away doesn't incur much of a size penalty.
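
To make that concrete, here is a rough sketch (my own illustration under assumptions, not libFLAC's actual code) of the difference between estimating a Rice parameter for a partition and brute-forcing the exact optimum, which is what the "exact rice" experiments elsewhere in this thread are about:

Code:
#include <stdint.h>

/* Fast estimate (hypothetical): derive k from the mean |residual| of the partition. */
static unsigned rice_param_estimate(uint64_t abs_sum, uint64_t n)
{
    unsigned k = 0;
    while (k < 30 && (n << (k + 1)) < abs_sum)  /* roughly: while mean |residual| > 2^k */
        k++;
    return k;
}

/* Exact search: count the bits every candidate k would actually produce
 * and keep the cheapest. Optimal, but much more work per partition. */
static unsigned rice_param_exact(const int32_t *res, unsigned n, unsigned max_k)
{
    unsigned best_k = 0;
    uint64_t best_bits = UINT64_MAX;
    for (unsigned k = 0; k <= max_k; k++) {
        uint64_t bits = 0;
        for (unsigned i = 0; i < n; i++) {
            uint32_t u = ((uint32_t)res[i] << 1) ^ (uint32_t)(res[i] >> 31); /* zigzag to unsigned */
            bits += (u >> k) + 1 + k;  /* unary quotient + stop bit + k remainder bits */
        }
        if (bits < best_bits) { best_bits = bits; best_k = k; }
    }
    return best_k;
}

The estimate is typically at most one step away from the optimum, and being one step off costs little, which matches the modest size gains reported for the exact calculation in this thread.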

Edit: By the way, I think you use AVX2 in all HALAC builds? FLAC speeds do depend on both compile and CPU architecture. Here are some results comparing builds: https://hydrogenaud.io/index.php/topic,123025.msg1029768.html#msg1029768


Re: FLAC v1.4.x Performance Tests

Reply #520
-r8, -e and -p, mildly relevant: I tried to correlate the percent-wise improvements, in a quick test over one track from each of my 38 CDs. Not going to post the correlation coefficients, as they were so dominated by two outliers or at least near-outliers. In the following, the left column is Kraftwerk and the right is Dream Theater keyboardist Jordan Rudess (electronic keyboard, solo), and the rows are improvements for -r8, -e and -p respectively
         Kraftwerk   Rudess
-r8        0.15%      0.00%
-e         0.74%      1.23%
-p         0.08%      0.55%
1.23 for -e is quite a lot, and so is 0.55 for -p.

For the Jordan Rudess track, the exact-rice build made for reductions of 0.087% and 0.097% when encoded with -8 and -8pe respectively.

Re: FLAC v1.4.x Performance Tests

Reply #521
@Hakan Abbas : Quite a bit of reference flac's encoding efficiency is in first estimating what model and precision to use, and then encoding only that. [...]
Thank you Porcus for your deep knowledge as always.

HALAC is compiled with the “-mavx” flag for the encoder and “-msse2” for the decoder. I don't normally use “-mavx2” as it won't work on slightly older processors, so all automatic SIMD optimization is left to the compiler. My test machine for the latest compilations is an i7 3770K, which does not support AVX2. “-mavx2” gives a negligible speedup to the encoder and does not contribute anything to the decoder at this point.

I use GCC, Clang and ICC as compilers; each of them is faster in different parts of the code. However, the ideal cannot be achieved because the binary is compiled with only one of them, and the overall results end up close to each other. For example, Rice encoding and decoding are better with GCC, but GCC is a bit behind in the other processes.

Neither HALAC nor HALIC uses manual SIMD. If manual SIMD is used efficiently, an extra significant speedup can be seen. But I don't have enough experience and time to deal with this at the moment.

Re: FLAC v1.4.x Performance Tests

Reply #522
Ah, thanks for the correction.
Yes, GCC compiles appear to do the Rice writing faster. In Reply 347 in that thread I only did fixed predictors, so least squares, no Yule-Walker - and by mistake I picked one build "wrong" (i.e. a different one than I thought I used): https://hydrogenaud.io/index.php/topic,123025.msg1030749.html#msg1030749
A GCC build was also best at some of the heavier-than -8 settings, but a Clang compile won at plain -8.

All those differences over compiler and CPU - and the number of builds times number of different-behaving reasonably modern CPUs ...

Re: FLAC v1.4.x Performance Tests

Reply #523
@ktf and @Porcus thank you very much for the answers.

I've tested the whole library again with "-e". Every file coded with "exact-rice" came out smaller than with "inexact-rice", except for 149 equal-sized files, which are mostly silence-like tracks.
Parameters                 Files   Percent
-8e inexact < -8e exact        0    0,000%
-8e inexact > -8e exact    31261   99,526%
-8e inexact = -8e exact      149    0,474%


What benefits/drawbacks I get (sizes in bytes):
Binary     AVG size      AVG speed   AVG compression
inexact    26 189 321      995,510   69,760%
exact      26 183 707      435,500   69,745%
diff           -5 614     -56,254%   -0,015%


So the benefits are not that impressive, and the average speed drops almost in half. But now we have multithreading, and if we compare "release with inexact-rice -ep" against "MT version with exact-rice -epr8", the benefits are there even with 8 threads - even with no AVX (on my CPU).

Out. size     Compr.      Speed    Parameters      Binary
24 407 479    67,102%    15,858    -8 -ep          flac inexact-rice.exe
24 407 488    67,102%     8,845    -8 -epr8        flac inexact-rice.exe
24 407 488    67,102%    63,970    -8 -epr8 -j8    flac inexact-rice.exe
24 407 488    67,102%   109,784    -8 -epr8 -j16   flac inexact-rice.exe
24 407 488    67,102%   131,809    -8 -epr8 -j32   flac inexact-rice.exe
24 402 505    67,088%     4,169    -8 -ep          flac exact-rice.exe
24 402 503    67,088%     2,935    -8 -epr8        flac exact-rice.exe
24 402 503    67,088%    22,026    -8 -epr8 -j8    flac exact-rice.exe
24 402 503    67,088%    37,315    -8 -epr8 -j16   flac exact-rice.exe
24 402 503    67,088%    43,999    -8 -epr8 -j32   flac exact-rice.exe


So, in my opinion exact Rice gives a consistent gain in size, and multithreading helps maintain a high encoding speed (compared to the release). I think it would be great if we had a CLI switch to turn on exact Rice. Is that possible?

 

Re: FLAC v1.4.x Performance Tests

Reply #524
Let's get real here on the size ballpark: you can save about 1/7000 (0.015 percent of 2 TB is roughly 300 MB), which makes room for about one more CD on a 2 TB drive.

If that is on your agenda - or like a few of us, if you find it interesting to see how much it improves even if it is not practically significant - then you should maybe look for better bang for the buck?
 
Suggestion, if you don't mind running another round for the sake of the sport: Measure speed and size on the same corpus, for each of these runs on the "inexact" build, and post it here? 
flac -8per7 -A "subdivide_tukey(4);flattop"
flac -8er7 --lax -l15 -A "subdivide_tukey(4);flattop"
flac -8pr7 --lax -l15 -A "subdivide_tukey(4);flattop"
flac -8pr7 -A "subdivide_tukey(5);flattop"

To explain, take the following starting point: -8r7 (because r8 made so little impact, try to get it slightly cheaper) and some more different weighting schemes for the data (it will try all and pick the best). That would amount to "the first without the -pe". Then try different variations, in order:
* First one: Adding both the "-e" to brute-force model selection, and the "-p" to save space on predictor coefficients if possible. The combination is very expensive.
* Second one: Keep the "-e", but instead of the "-p" switch: go up in prediction order from max 12 to max 15. That is outside the streamable subset, but likely no problem - and if your player chokes on it, you can convert back later.
* Third one: Like second, but replace "-e" by "-p". Usually better on CDDA, but might depend on your music.
* Fourth one: Scrap the order-15, but instead try more weighting schemes. Thus the difference between the first (very slow) and the last (slow, but not that much!) is that the latter tries more weighting schemes, but the first one instead brute-forces the selection between those slightly fewer.