More multithreading

Hi all,

After @cid42 experimented with multithreading in FLAC, WavPack introduced multithreading, and I found out that TAK can already multithread over a single file, it seemed time to get this working in FLAC too.

I experimented with OpenMP a few months ago, but that didn't really work. I've now implemented multithreading with pthreads, which means it works on Windows, Mac and Linux, but only with a compiler that has a pthreads implementation (MinGW, for example, has winpthreads). See https://github.com/xiph/flac/pull/634

Anyway, there are a few bugs in there still, but these will probably only crop up when using libFLAC directly, not through the command line tool. Still, please be cautious when using the attached binary. Probably best to only use it for testing. Consider it experimental at this stage.

I've also added two graphs, one with wall time and one with CPU time. The wall time graph shows how fast the encoding process goes (which is of course the most interesting bit of data). The graph has 5 lines: one for FLAC 1.4.3, one for the new code with multithreading not enabled (option -j1, that is, use 1 thread), one for -j2 (use 2 threads), one for -j3 and one for -j4.

My test PC has a CPU with 2 cores and hyperthreading, so 4 threads in theory. As you can see, 4 threads doesn't really add much over 3 threads in my case. Using 2 threads helps a lot for fast presets but little for slow presets (it can even get slower), because 1 thread does the housekeeping and all other threads do the number crunching. For fast presets this is reasonably balanced (as much housekeeping to do as number crunching), but for the higher presets the housekeeping thread is mostly idling.
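
To make the housekeeping argument concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not how libFLAC schedules work: the function name `modelled_wall_time` and the workload numbers are made up for illustration, and the model ignores synchronisation overhead entirely.

```python
def modelled_wall_time(housekeeping, crunching, threads):
    """Toy model: with 1 thread everything runs serially; with more threads,
    one thread does the housekeeping while the others share the crunching."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    if threads == 1:
        return housekeeping + crunching
    workers = threads - 1
    # The slower of the two sides determines the wall time.
    return max(housekeeping, crunching / workers)


# Hypothetical workloads in arbitrary time units (not measured values).
presets = {
    "fast": dict(housekeeping=1.0, crunching=1.0),   # balanced
    "slow": dict(housekeeping=1.0, crunching=40.0),  # crunching dominates
}

for name, load in presets.items():
    base = modelled_wall_time(threads=1, **load)
    for threads in (2, 3, 4):
        speedup = base / modelled_wall_time(threads=threads, **load)
        print(f"{name} preset, -j{threads}: {speedup:.2f}x")
```

In this model the balanced workload roughly doubles in speed at -j2, while the crunching-heavy one barely moves until a second worker is added at -j3.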

The CPU time graph shows 'efficiency' of some sort: it shows total CPU usage over all cores expressed as a percentage. This more or less shows how much overhead multithreading gives.

I hope there are a few people here that would like to give this a go. Results from systems with more cores are highly appreciated :)

P.S.: compression presets -1 and -3 (edit: that is -1 and -4 of course) use "loose mid-side" which doesn't work well with multithreading. For these presets, the number of threads is limited to 2.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #1
Interesting wall time figures; -5 faster than -4, is that because of the -m vs the -M?
(Offloading a subframe appears a sensible idea to someone who doesn't know squat about compilers ...)

Re: More multithreading

Reply #2
Interesting wall time figures; -5 faster than -4, is that because of the -m vs the -M?
Yes, sorry. I said -1 and -3 but that is -1 and -4 of course
Quote
(Offloading a subframe appears a sensible idea to someone who doesn't know squat about compilers ...)
I don't know what you mean?
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #3
Dang, it wasn't the dual mono ones ... I did the wrong mental repair.

Quote
(Offloading a subframe appears a sensible idea to someone who doesn't know squat about compilers ...)
I don't know what you mean?
Because a "naive" way to allocate tasks between threads would be to let one do Left, one do Right, one do Mid and one do Side. Or, if you've got fewer threads, 2 here and 2 there?

Re: More multithreading

Reply #4
Because a "naive" way to allocate tasks between threads would be to let one do Left, one do Right, one do Mid and one do Side. Or, if you've got fewer threads, 2 here and 2 there?
My first try was indeed splitting over subframes. The main advantage is that it is completely invisible to the API user. The main disadvantage is the overhead: FLAC is simply too fast. As can be seen from the graphs, the current approach (multithreading over frames) shows a large impact of blocksize. When multithreading over subframes, the amount of number crunching per 'thread-task' is divided by four (assuming stereo input with full stereo decorrelation, which means 4 subframes are tried), so the relative overhead increases by more than a factor of 4.

It seems there is a certain minimum amount of work that needs to go in a thread-task, otherwise the overhead completely swamps any possible gain. So, if you set a small blocksize, for example 32, multithreading shows massive negative gains.
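
A rough way to picture this is a fixed cost per thread-task. The Python sketch below is only an illustration: `OVERHEAD` and `WORK_PER_SAMPLE` are hypothetical numbers in arbitrary units, not measurements, and the model only captures the fixed cost per task (so it gives exactly a factor 4 in relative overhead for subframes, not more).

```python
def task_efficiency(work_per_task, overhead_per_task):
    """Fraction of a thread-task's time spent on useful number crunching."""
    return work_per_task / (work_per_task + overhead_per_task)


OVERHEAD = 1.0          # hypothetical fixed cost of handing out one task
WORK_PER_SAMPLE = 0.01  # hypothetical crunching cost per sample

for blocksize in (32, 4096):
    frame_work = blocksize * WORK_PER_SAMPLE
    per_frame = task_efficiency(frame_work, OVERHEAD)
    # Splitting over 4 subframes leaves a quarter of the work per task.
    per_subframe = task_efficiency(frame_work / 4, OVERHEAD)
    print(f"blocksize {blocksize}: per-frame tasks {per_frame:.0%} useful, "
          f"per-subframe tasks {per_subframe:.0%} useful")
```

With these made-up numbers, a blocksize of 32 spends most of each task on overhead, while 4096 spends almost all of it on actual encoding.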
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #5
@ktf

Can I get a Linux binary (or source) to try out?

Re: More multithreading

Reply #6
Sure. Source is at https://github.com/xiph/flac/pull/634 (edit: https://github.com/ktmf01/flac/tree/pthread2 more specifically). Binary is attached, but static binaries on Linux are always less portable than on Windows, so I hope it works.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #7
Binary did work for me, but built a copy from source as well.

Just a quick test with my usual NIN The Fragile album (16/44.1 - 1h 43m) on my Ryzen 5850U

Code: [Select]
./flac -j1 -8p in.wav - 44.137s
./flac -j2 -8p in.wav - 43.312s
./flac -j3 -8p in.wav - 23.812s
./flac -j4 -8p in.wav - 17.291s
./flac -j5 -8p in.wav - 13.835s
./flac -j6 -8p in.wav - 11.868s
./flac -j7 -8p in.wav - 10.579s
./flac -j8 -8p in.wav - 9.676s
./flac -j9 -8p in.wav - 10.357s
./flac -j10 -8p in.wav - 11.655s
./flac -j11 -8p in.wav - 10.751s
./flac -j12 -8p in.wav - 10.061s
./flac -j13 -8p in.wav - 9.499s
./flac -j14 -8p in.wav - 9.007s
./flac -j15 -8p in.wav - 8.620s
./flac -j16 -8p in.wav - 8.227s
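
For anyone who wants to read these timings as speedups, a small Python sketch follows. The timings are copied from the listing above; the helper name `speedups` and the "share of ideal" column (speedup divided by thread count) are just my way of summarising them.

```python
# Wall times in seconds for ./flac -jN -8p in.wav, copied from the listing above.
timings = {
    1: 44.137, 2: 43.312, 3: 23.812, 4: 17.291,
    5: 13.835, 6: 11.868, 7: 10.579, 8: 9.676,
    9: 10.357, 10: 11.655, 11: 10.751, 12: 10.061,
    13: 9.499, 14: 9.007, 15: 8.620, 16: 8.227,
}


def speedups(times):
    """Speedup of each thread count relative to the -j1 run."""
    base = times[1]
    return {n: base / t for n, t in sorted(times.items())}


for n, s in speedups(timings).items():
    # s / n is the share of a perfect n-fold speedup that was reached.
    print(f"-j{n}: {s:.2f}x speedup, {s / n:.0%} of ideal")
```

This gives about 4.6x at -j8 and about 5.4x at -j16, with the dip at -j9 and -j10 clearly visible.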

Re: More multithreading

Reply #8
15:42 of CDDA on i7-4790K:

-j1 -8ep - 101s
-j2 -8ep - 99s
-j4 -8ep - 34s
-j8 -8ep - 25s

Re: More multithreading

Reply #9
This thing really works! CDDA -8p -V, 5900x, 12 cores, 24 threads, no accurate science - only watching the numbers :)
Code: [Select]
j1 103x
j2 106x
j3 203x
j4 298x
j5 381x
j6 460x
j7 543x
j8 620x
j9 685x
j10 705x
j11 725x
j12 740x
j13 750x
j14 752x
j15 750x
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #10
I'm seeing that 2 threads doesn't seem like much improvement in encoding speed over only 1 thread.  3 to (insert FPU core count here) is the best improvement.

Re: More multithreading

Reply #11
Compiles fine here also. My own AVX2 build (GCC 13.1.0) reaches 880x on my test above using -j13.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #12
Thank you all for confirming this works reasonably well on systems with a higher CPU core count.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #13
I'm seeing that 2 threads doesn't seem like much improvement in encoding speed over only 1 thread.
Care to test that with -0? For -8p it makes complete sense with the following:

Using 2 threads helps a lot for fast presets but little for slow presets (it can even get slower), because 1 thread does the housekeeping and all other threads do the number crunching. For fast presets this is reasonably balanced (as much housekeeping to do as number crunching), but for the higher presets the housekeeping thread is mostly idling.

Re: More multithreading

Reply #14
maybe just multithread the queue?
Quis custodiet ipsos custodes?

Re: More multithreading

Reply #15
What queue?
Music: sounds arranged such that they construct feelings.


Re: More multithreading

Reply #17
@ktf: Excellent job!

Tested my set of CDDA-WAVs with "-7 -j[1..12]" on my CPU Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (6 cores, 12 threads).
My first runs ended up inconsistent (the CPU went hot, the fan ran wild and clock speed was throttled). So I decided to add a 10 second delay between the runs to start each of them with a 45°C-ish CPU (room temp atm is 28°C and rising...).
Code: [Select]
-j1: Average time =  22.941 seconds (3 rounds), Encoding speed = 471.29x
-j2: Average time =  20.543 seconds (3 rounds), Encoding speed = 526.32x
-j3: Average time =  10.931 seconds (3 rounds), Encoding speed = 989.08x
-j4: Average time =   8.504 seconds (3 rounds), Encoding speed = 1271.40x
-j5: Average time =   7.401 seconds (3 rounds), Encoding speed = 1460.88x
-j6: Average time =   6.924 seconds (3 rounds), Encoding speed = 1561.60x
-j7: Average time =   6.315 seconds (3 rounds), Encoding speed = 1712.02x
-j8: Average time =   6.540 seconds (3 rounds), Encoding speed = 1653.21x
-j9: Average time =   7.226 seconds (3 rounds), Encoding speed = 1496.26x
-j10: Average time =   7.258 seconds (3 rounds), Encoding speed = 1489.73x
-j11: Average time =   6.862 seconds (3 rounds), Encoding speed = 1575.56x
-j12: Average time =   6.544 seconds (3 rounds), Encoding speed = 1652.20x
No advantage going beyond -j7 in my case (which is 1 housekeeping and 6 number crunching threads, if I understood ktf correctly). Which kind of makes sense if you have 6 physical cores...
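
For anyone who wants to repeat this kind of run, here is a minimal Python sketch of the procedure (cooldown before each run, 3 rounds, average). It is not the script used for the numbers above. The names `benchmark`, `time_command` and `flac_cmd` are made up, the file names are placeholders, it encodes a single file rather than a set of WAVs, and it assumes a flac binary that accepts the -j option from this thread.

```python
import statistics
import subprocess
import time


def time_command(cmd):
    """Run a command once and return its wall time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def benchmark(thread_counts, make_cmd, rounds=3, cooldown=10.0,
              audio_seconds=None, timer=time_command, sleep=time.sleep):
    """Return {threads: (average seconds, encoding speed or None)}."""
    results = {}
    for threads in thread_counts:
        times = []
        for _ in range(rounds):
            sleep(cooldown)  # let the CPU cool down before each run
            times.append(timer(make_cmd(threads)))
        average = statistics.mean(times)
        # Encoding speed = length of the audio divided by encoding time.
        speed = audio_seconds / average if audio_seconds else None
        results[threads] = (average, speed)
    return results


def flac_cmd(threads):
    # Placeholder paths; -f overwrites the output file on every round.
    return ["./flac", f"-j{threads}", "-7", "-f", "-o", "out.flac", "in.wav"]


# Example call (needs the experimental binary and an input file):
# for n, (avg, speed) in benchmark(range(1, 13), flac_cmd).items():
#     print(f"-j{n}: Average time = {avg:7.3f} seconds (3 rounds)")
```

Passing `timer` and `sleep` as parameters is only there so the averaging can be checked without actually running an encoder.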

Re: More multithreading

Reply #18
Quote from: ktf
compression presets -1 and -3 (edit: that is -1 and -4 of course) use "loose mid-side" which doesn't work well with multithreading.
Could you elaborate? Do you know why that's the case?

Nice work indeed!

Chris
If I don't reply to your reply, it means I agree with you.

Re: More multithreading

Reply #19
Hi all,

I've done some more tweaking, hopefully decreasing the time various threads are waiting. Would be great if some people with a CPU with a high core count could benchmark this one vs the previous one.


Quote from: ktf
compression presets -1 and -3 (edit: that is -1 and -4 of course) use "loose mid-side" which doesn't work well with multithreading.
Could you elaborate? Do you know why that's the case?
Loose mid-side does the full calculation once every few frames (once every 0.4 s or something) and then uses the result for the next few frames. That is a dependency between frames and thus between threads. Maybe I'll fix that by implementing a different 'loose mid-side algorithm', perhaps the algorithm that ffmpeg uses.
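
As a rough illustration of that dependency (not libFLAC's actual code), the Python sketch below marks which frame's full calculation each frame would reuse. The names `decision_source` and `independent_frames` and the value of `FRAMES_PER_DECISION` are hypothetical.

```python
def decision_source(frame_index, frames_per_decision):
    """Index of the frame whose full stereo calculation this frame reuses."""
    return frame_index - frame_index % frames_per_decision


def independent_frames(num_frames, frames_per_decision):
    """Frames that do their own full calculation and wait on nobody."""
    return [i for i in range(num_frames)
            if decision_source(i, frames_per_decision) == i]


FRAMES_PER_DECISION = 4  # hypothetical; the real interval is time-based

for frame in range(8):
    source = decision_source(frame, FRAMES_PER_DECISION)
    if source == frame:
        print(f"frame {frame}: does the full mid-side calculation")
    else:
        print(f"frame {frame}: must wait for the result of frame {source}")
```

Every frame that reuses a result cannot be started freely on another thread, which is the kind of coupling that frame-level multithreading wants to avoid.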
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #20
Here are my results for the v2 binary.
I ran the v1 binary again, since the ambient temp now is 18 °C (21:20 local time) and the CPU fan didn't run so fast.

ktf_v1 (MD5: 7b2e91271a02ad9ed00666e8a69710fb):
Code: [Select]
-j1:    Average time =  23.591 seconds (3 rounds), Encoding speed = 458.32x
-j2:    Average time =  20.620 seconds (3 rounds), Encoding speed = 524.35x
-j3:    Average time =  10.757 seconds (3 rounds), Encoding speed = 1005.08x
-j4:    Average time =   7.783 seconds (3 rounds), Encoding speed = 1389.18x
-j5:    Average time =   7.038 seconds (3 rounds), Encoding speed = 1536.23x
-j6:    Average time =   6.827 seconds (3 rounds), Encoding speed = 1583.63x
-j7:    Average time =   6.372 seconds (3 rounds), Encoding speed = 1696.89x
-j8:    Average time =   6.763 seconds (3 rounds), Encoding speed = 1598.70x
-j9:    Average time =   7.168 seconds (3 rounds), Encoding speed = 1508.30x
-j10:   Average time =   7.333 seconds (3 rounds), Encoding speed = 1474.36x
-j11:   Average time =   6.644 seconds (3 rounds), Encoding speed = 1627.25x
-j12:   Average time =   6.461 seconds (3 rounds), Encoding speed = 1673.51x

ktf_v2 (MD5: 08125e8c74864eb66cf810da273c7c73):
Code: [Select]
-j1:    Average time =  22.855 seconds (3 rounds), Encoding speed = 473.06x
-j2:    Average time =  20.627 seconds (3 rounds), Encoding speed = 524.18x
-j3:    Average time =  10.813 seconds (3 rounds), Encoding speed = 999.91x
-j4:    Average time =   8.088 seconds (3 rounds), Encoding speed = 1336.85x
-j5:    Average time =   7.196 seconds (3 rounds), Encoding speed = 1502.43x
-j6:    Average time =   7.021 seconds (3 rounds), Encoding speed = 1539.88x
-j7:    Average time =   6.643 seconds (3 rounds), Encoding speed = 1627.58x
-j8:    Average time =   6.673 seconds (3 rounds), Encoding speed = 1620.18x
-j9:    Average time =   7.135 seconds (3 rounds), Encoding speed = 1515.42x
-j10:   Average time =   7.220 seconds (3 rounds), Encoding speed = 1497.44x
-j11:   Average time =   7.102 seconds (3 rounds), Encoding speed = 1522.46x
-j12:   Average time =   6.323 seconds (3 rounds), Encoding speed = 1709.95x

I would not dare to draw a conclusion here; I can't see any significant differences. But maybe your mods don't show at -7.
But my performance peak at -j7 is gone...

Re: More multithreading

Reply #21
flac git-3e2d9a43 20230712
Same test as yesterday.
Code: [Select]
./flac -j1 -8p in.wav - 44.360s
./flac -j2 -8p in.wav - 41.762s
./flac -j3 -8p in.wav - 23.301s
./flac -j4 -8p in.wav - 16.781s
./flac -j5 -8p in.wav - 13.602s
./flac -j6 -8p in.wav - 11.526s
./flac -j7 -8p in.wav - 10.147s
./flac -j8 -8p in.wav - 9.192s
./flac -j9 -8p in.wav - 10.577s
./flac -j10 -8p in.wav - 11.440s
./flac -j11 -8p in.wav - 10.671s
./flac -j12 -8p in.wav - 10.012s
./flac -j13 -8p in.wav - 9.375s
./flac -j14 -8p in.wav - 8.889s
./flac -j15 -8p in.wav - 8.480s
./flac -j16 -8p in.wav - 8.139s

Re: More multithreading

Reply #22
There are only 6 bytes of difference in the header between the two binaries?
Speed seems the same.
Might test tomorrow on ancient NUMA system with 2 CPUs x 4 cores x 2 threads = 16 threads

Re: More multithreading

Reply #23
Indeed, both versions claim git-ea9a6c00 in Windows details. I only benched -j13 for both and the speed is the same.
The one I can compile at the moment, sourced from pthread2.zip, claims to be 1.4.3.
By the way, Clang does clearly worse here for me.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #24
AMD Ryzen 9 5950X (16 Cores 32 Threads)
Code: [Select]
Codec      :     PCM (WAV)
Duration   :     1:41:59.985
Sample rate:     41000 Hz
Channels   :     2
Bits per sample: 16

flac-multithreading-win
Code: [Select]
timer64.exe v1 -j1 -8p -f in.wav
Global Time  =    60.385

timer64.exe v1 -j2 -8p -f in.wav
Global Time  =    59.729

timer64.exe v1 -j3 -8p -f in.wav
Global Time  =    35.290

timer64.exe v1 -j4 -8p -f in.wav
Global Time  =    29.652

timer64.exe v1 -j5 -8p -f in.wav
Global Time  =    22.567

timer64.exe v1 -j6 -8p -f in.wav
Global Time  =    19.046

timer64.exe v1 -j7 -8p -f in.wav
Global Time  =    19.110

timer64.exe v1 -j8 -8p -f in.wav
Global Time  =    14.793

timer64.exe v1 -j9 -8p -f in.wav
Global Time  =    12.196

timer64.exe v1 -j10 -8p -f in.wav
Global Time  =    10.990

timer64.exe v1 -j11 -8p -f in.wav
Global Time  =     9.952

timer64.exe v1 -j12 -8p -f in.wav
Global Time  =     9.068

timer64.exe v1 -j13 -8p -f in.wav
Global Time  =     8.388

timer64.exe v1 -j14 -8p -f in.wav
Global Time  =     7.899

timer64.exe v1 -j15 -8p -f in.wav
Global Time  =     7.362

timer64.exe v1 -j16 -8p -f in.wav
Global Time  =     7.079

flac-multithreading-v2-win
Code: [Select]
timer64.exe v2 -j1 -8p -f in.wav
Global Time  =    60.608

timer64.exe v2 -j2 -8p -f in.wav
Global Time  =    55.049

timer64.exe v2 -j3 -8p -f in.wav
Global Time  =    35.487

timer64.exe v2 -j4 -8p -f in.wav
Global Time  =    27.484

timer64.exe v2 -j5 -8p -f in.wav
Global Time  =    22.866

timer64.exe v2 -j6 -8p -f in.wav
Global Time  =    18.163

timer64.exe v2 -j7 -8p -f in.wav
Global Time  =    14.201

timer64.exe v2 -j8 -8p -f in.wav
Global Time  =    13.424

timer64.exe v2 -j9 -8p -f in.wav
Global Time  =    12.231

timer64.exe v2 -j10 -8p -f in.wav
Global Time  =    10.996

timer64.exe v2 -j11 -8p -f in.wav
Global Time  =     9.890

timer64.exe v2 -j12 -8p -f in.wav
Global Time  =     9.027

timer64.exe v2 -j13 -8p -f in.wav
Global Time  =     8.594

timer64.exe v2 -j14 -8p -f in.wav
Global Time  =     7.857

timer64.exe v2 -j15 -8p -f in.wav
Global Time  =     7.405

timer64.exe v2 -j16 -8p -f in.wav
Global Time  =     6.589