HydrogenAudio

Lossless Audio Compression => FLAC => Topic started by: ktf on 2023-07-11 16:01:59

Title: More multithreading
Post by: ktf on 2023-07-11 16:01:59
Hi all,

After @cid42 experimented with multithreading in FLAC (https://hydrogenaud.io/index.php/topic,123248.0.html), WavPack introduced multithreading (https://hydrogenaud.io/index.php/topic,124188.0.html), and I found out TAK can already multithread over a single file (https://hydrogenaud.io/index.php/topic,124188.msg1028122.html#msg1028122), it seemed time to get this working in FLAC too.

I experimented with OpenMP a few months ago, but that didn't really work out. I've now implemented multithreading with pthreads, which means it works on Windows, Mac and Linux, but only with a compiler that has a pthreads implementation, like MinGW with winpthreads. See https://github.com/xiph/flac/pull/634

Anyway, there are still a few bugs in there, but these will probably only crop up when using libFLAC directly, not through the command-line tool. Still, please be cautious when using the attached binary; probably best to only use it for testing. Consider it experimental at this stage.

I've also added two graphs, one with wall time and one with CPU time. The wall time one shows how fast the encoding process goes (which is of course the most interesting bit of data). The graph has 5 lines: one with FLAC 1.4.3, one with the new code but multithreading not enabled (with the option -j1, i.e. use 1 thread), one with -j2 (use 2 threads), one with -j3 and one with -j4.

My test PC has a CPU with 2 cores and hyperthreading, so 4 threads in theory. As you can see, 4 threads doesn't really add much over 3 threads in my case. The reason using 2 threads improves speed a lot for fast presets but little (or even slows things down) for slow presets is that 1 thread does the housekeeping and all other threads do the number crunching. For fast presets this is reasonably balanced (as much housekeeping to do as number crunching), but for the higher presets the housekeeping thread is mostly idling.

The CPU time graph shows 'efficiency' of some sort: it shows total CPU usage over all cores expressed as a percentage. This more or less shows how much overhead multithreading gives.

I hope there are a few people here that would like to give this a go. Results from systems with more cores are highly appreciated :)

P.S.: compression presets -1 and -3 (edit: that is -1 and -4 of course) use "loose mid-side" which doesn't work well with multithreading. For these presets, the number of threads is limited to 2.
Title: Re: More multithreading
Post by: Porcus on 2023-07-11 16:19:10
Interesting wall time figures; -5 faster than -4, is that because of the -m vs the -M?
(Offloading a subframe appears a sensible idea to someone who doesn't know squat about compilers ...)
Title: Re: More multithreading
Post by: ktf on 2023-07-11 16:22:11
Interesting wall time figures; -5 faster than -4, is that because of the -m vs the -M?
Yes, sorry. I said -1 and -3 but that is -1 and -4 of course
Quote
(Offloading a subframe appears a sensible idea to someone who doesn't know squat about compilers ...)
I don't know what you mean?
Title: Re: More multithreading
Post by: Porcus on 2023-07-11 16:38:09
Dang, it wasn't the dual mono ones ... I did the wrong mental repair.

Quote
(Offloading a subframe appears a sensible idea to someone who doesn't know squat about compilers ...)
I don't know what you mean?
Because a "naive" way to allocate tasks between threads would be to let one do Left, one do Right, one do Mid and one do Side - or, if you've got fewer, 2 here and 2 there?
Title: Re: More multithreading
Post by: ktf on 2023-07-11 17:30:15
Because a "naive" way to allocate tasks between threads would be to let one do Left, one do Right, one do Mid and one do Side - or, if you've got fewer, 2 here and 2 there?
My first try was indeed splitting over subframes. The main advantage of that is that it is completely invisible to the API user. The main disadvantage is the overhead: FLAC is simply too fast. As can be seen from the graphs, the current approach (multithreading over frames) shows a large impact of blocksize. When multithreading over subframes, the amount of number crunching per 'thread-task' is divided by four (assuming stereo input with full stereo decorrelation, which means 4 subframes are tried), which means the overhead increases by more than a factor of 4.

It seems there is a certain minimum amount of work that needs to go in a thread-task, otherwise the overhead completely swamps any possible gain. So, if you set a small blocksize, for example 32, multithreading shows massive negative gains.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-11 18:21:28
@ktf

Can I get a Linux binary (or source) to try out?
Title: Re: More multithreading
Post by: ktf on 2023-07-11 18:33:30
Sure. Source is at https://github.com/xiph/flac/pull/634 (edit: https://github.com/ktmf01/flac/tree/pthread2 more specifically). Binary is attached, but static binaries on Linux are always less portable than on Windows, so I hope it works.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-11 21:18:05
Binary did work for me, but built a copy from source as well.

Just a quick test with my usual NIN The Fragile album (16/44.1 - 1h 43m) on my Ryzen 5850U

Code: [Select]
./flac -j1 -8p in.wav - 44.137s
./flac -j2 -8p in.wav - 43.312s
./flac -j3 -8p in.wav - 23.812s
./flac -j4 -8p in.wav - 17.291s
./flac -j5 -8p in.wav - 13.835s
./flac -j6 -8p in.wav - 11.868s
./flac -j7 -8p in.wav - 10.579s
./flac -j8 -8p in.wav - 9.676s
./flac -j9 -8p in.wav - 10.357s
./flac -j10 -8p in.wav - 11.655s
./flac -j11 -8p in.wav - 10.751s
./flac -j12 -8p in.wav - 10.061s
./flac -j13 -8p in.wav - 9.499s
./flac -j14 -8p in.wav - 9.007s
./flac -j15 -8p in.wav - 8.620s
./flac -j16 -8p in.wav - 8.227s
Title: Re: More multithreading
Post by: rutra80 on 2023-07-12 00:47:22
15:42 of CDDA on i7-4790K:

-j1 -8ep - 101s
-j2 -8ep - 99s
-j4 -8ep - 34s
-j8 -8ep - 25s
Title: Re: More multithreading
Post by: Wombat on 2023-07-12 01:40:09
This thing really works! CDDA -8p -V, 5900x, 12 cores, 24 threads, no accurate science - only watching the numbers :)
Code: [Select]
j1 103x
j2 106x
j3 203x
j4 298x
j5 381x
j6 460x
j7 543x
j8 620x
j9 685x
j10 705x
j11 725x
j12 740x
j13 750x
j14 752x
j15 750x
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-12 01:59:09
I'm seeing that 2 threads doesn't give much improvement in encoding speed over just 1 thread.  3 up to (insert FPU core count here) gives the best improvement.
Title: Re: More multithreading
Post by: Wombat on 2023-07-12 04:37:03
Compiles fine here also. My own AVX2 build with GCC 13.1.0 gets 880x for my test above using j13.
Title: Re: More multithreading
Post by: ktf on 2023-07-12 07:50:40
Thank you all for confirming this works reasonably well on systems with a higher CPU core count.
Title: Re: More multithreading
Post by: Porcus on 2023-07-12 08:56:03
I'm seeing that 2 threads doesn't give much improvement in encoding speed over just 1 thread.
Care to test that with -0? For -8p it makes complete sense with the following:

The reason using 2 threads improves speed a lot for fast presets but little (or even slows things down) for slow presets is that 1 thread does the housekeeping and all other threads do the number crunching. For fast presets this is reasonably balanced (as much housekeeping to do as number crunching), but for the higher presets the housekeeping thread is mostly idling.
Title: Re: More multithreading
Post by: 2tec on 2023-07-12 09:35:36
maybe just multithread the queue?
Title: Re: More multithreading
Post by: ktf on 2023-07-12 09:47:02
What queue?
Title: Re: More multithreading
Post by: danadam on 2023-07-12 10:18:26
My test PC has a CPU with 2 cores and hyperthreading, so 4 threads in theory. As you can see, 4 threads doesn't really add much over 3 threads in my case.
Not sure if related but maybe you'll find the comments there interesting, a github issue for zstd: Detect count of logical cores instead of physical (https://github.com/facebook/zstd/issues/2071).
Title: Re: More multithreading
Post by: sundance on 2023-07-12 10:50:47
@ktf: Excellent job!

Tested my set of CDDA-WAVs with "-7 -j[1..12]" on my CPU Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (6 cores, 12 threads).
My first runs ended up inconsistent (the CPU went hot, the fan ran wild and clock speed was throttled). So I decided to add a 10 second delay between the runs to start each of them with a 45°C-ish CPU (room temp atm is 28°C and rising...).
Code: [Select]
-j1: Average time =  22.941 seconds (3 rounds), Encoding speed = 471.29x
-j2: Average time =  20.543 seconds (3 rounds), Encoding speed = 526.32x
-j3: Average time =  10.931 seconds (3 rounds), Encoding speed = 989.08x
-j4: Average time =   8.504 seconds (3 rounds), Encoding speed = 1271.40x
-j5: Average time =   7.401 seconds (3 rounds), Encoding speed = 1460.88x
-j6: Average time =   6.924 seconds (3 rounds), Encoding speed = 1561.60x
-j7: Average time =   6.315 seconds (3 rounds), Encoding speed = 1712.02x
-j8: Average time =   6.540 seconds (3 rounds), Encoding speed = 1653.21x
-j9: Average time =   7.226 seconds (3 rounds), Encoding speed = 1496.26x
-j10: Average time =   7.258 seconds (3 rounds), Encoding speed = 1489.73x
-j11: Average time =   6.862 seconds (3 rounds), Encoding speed = 1575.56x
-j12: Average time =   6.544 seconds (3 rounds), Encoding speed = 1652.20x
No advantage going beyond -j7 in my case (which is 1 housekeeping and 6 number-crunching threads, if I understood ktf correctly). Which kinda makes sense if you have 6 physical cores...
Title: Re: More multithreading
Post by: C.R.Helmrich on 2023-07-12 18:46:04
Quote from: ktf
compression presets -1 and -3 (edit: that is -1 and -4 of course) use "loose mid-side" which doesn't work well with multithreading.
Could you elaborate? Do you know why that's the case?

Nice work indeed!

Chris
Title: Re: More multithreading
Post by: ktf on 2023-07-12 18:57:35
Hi all,

I've done some more tweaking, hopefully decreasing the time various threads are waiting. Would be great if some people with a CPU with a high core count could benchmark this one vs the previous one.


Quote from: ktf
compression presets -1 and -3 (edit: that is -1 and -4 of course) use "loose mid-side" which doesn't work well with multithreading.
Could you elaborate? Do you know why that's the case?
Loose mid-side does the full calculation once every few frames (once every 0.4 s or something) and then uses the result for the next few frames. That creates a dependency between frames, and thus between threads. Maybe I'll fix that by implementing a different 'loose mid-side' algorithm, perhaps the one that ffmpeg uses.
Title: Re: More multithreading
Post by: sundance on 2023-07-12 20:23:55
Here are my results for the v2 binary.
I ran the v1 binary again, since the ambient temp now is 18 °C (21:20 local time) and the CPU fan didn't run so fast.

ktf_v1 (MD5: 7b2e91271a02ad9ed00666e8a69710fb):
Code: [Select]
-j1:    Average time =  23.591 seconds (3 rounds), Encoding speed = 458.32x
-j2:    Average time =  20.620 seconds (3 rounds), Encoding speed = 524.35x
-j3:    Average time =  10.757 seconds (3 rounds), Encoding speed = 1005.08x
-j4:    Average time =   7.783 seconds (3 rounds), Encoding speed = 1389.18x
-j5:    Average time =   7.038 seconds (3 rounds), Encoding speed = 1536.23x
-j6:    Average time =   6.827 seconds (3 rounds), Encoding speed = 1583.63x
-j7:    Average time =   6.372 seconds (3 rounds), Encoding speed = 1696.89x
-j8:    Average time =   6.763 seconds (3 rounds), Encoding speed = 1598.70x
-j9:    Average time =   7.168 seconds (3 rounds), Encoding speed = 1508.30x
-j10:   Average time =   7.333 seconds (3 rounds), Encoding speed = 1474.36x
-j11:   Average time =   6.644 seconds (3 rounds), Encoding speed = 1627.25x
-j12:   Average time =   6.461 seconds (3 rounds), Encoding speed = 1673.51x

ktf_v2 (MD5: 08125e8c74864eb66cf810da273c7c73):
Code: [Select]
-j1:    Average time =  22.855 seconds (3 rounds), Encoding speed = 473.06x
-j2:    Average time =  20.627 seconds (3 rounds), Encoding speed = 524.18x
-j3:    Average time =  10.813 seconds (3 rounds), Encoding speed = 999.91x
-j4:    Average time =   8.088 seconds (3 rounds), Encoding speed = 1336.85x
-j5:    Average time =   7.196 seconds (3 rounds), Encoding speed = 1502.43x
-j6:    Average time =   7.021 seconds (3 rounds), Encoding speed = 1539.88x
-j7:    Average time =   6.643 seconds (3 rounds), Encoding speed = 1627.58x
-j8:    Average time =   6.673 seconds (3 rounds), Encoding speed = 1620.18x
-j9:    Average time =   7.135 seconds (3 rounds), Encoding speed = 1515.42x
-j10:   Average time =   7.220 seconds (3 rounds), Encoding speed = 1497.44x
-j11:   Average time =   7.102 seconds (3 rounds), Encoding speed = 1522.46x
-j12:   Average time =   6.323 seconds (3 rounds), Encoding speed = 1709.95x

I would not dare to draw a conclusion here; I can't see any significant differences. But maybe your mods don't show at -7.
And my performance peak at -j7 is gone...
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-12 20:41:58
flac git-3e2d9a43 20230712
Same test as yesterday.
Code: [Select]
./flac -j1 -8p in.wav - 44.360s
./flac -j2 -8p in.wav - 41.762s
./flac -j3 -8p in.wav - 23.301s
./flac -j4 -8p in.wav - 16.781s
./flac -j5 -8p in.wav - 13.602s
./flac -j6 -8p in.wav - 11.526s
./flac -j7 -8p in.wav - 10.147s
./flac -j8 -8p in.wav - 9.192s
./flac -j9 -8p in.wav - 10.577s
./flac -j10 -8p in.wav - 11.440s
./flac -j11 -8p in.wav - 10.671s
./flac -j12 -8p in.wav - 10.012s
./flac -j13 -8p in.wav - 9.375s
./flac -j14 -8p in.wav - 8.889s
./flac -j15 -8p in.wav - 8.480s
./flac -j16 -8p in.wav - 8.139s
Title: Re: More multithreading
Post by: rutra80 on 2023-07-12 22:07:09
There are only 6 bytes of difference in the header between the two binaries?
Speed seems the same.
Might test tomorrow on an ancient NUMA system with 2 CPUs x 4 cores x 2 threads = 16 threads
Title: Re: More multithreading
Post by: Wombat on 2023-07-13 02:00:09
Indeed, both versions claim git-ea9a6c00 in the Windows file details. I only benched -j13 for both and the speed is the same.
The one I can compile atm, sourced as pthread2.zip, claims to be 1.4.3.
btw. Clang does clearly worse here for me.
Title: Re: More multithreading
Post by: music_1 on 2023-07-13 05:23:40
AMD Ryzen 9 5950X (16 Cores 32 Threads)
Code: [Select]
Codec      :     PCM (WAV)
Duration   :     1:41:59.985
Sample rate:     41000 Hz
Channels   :     2
Bits per sample: 16

flac-multithreading-win
Code: [Select]
timer64.exe v1 -j1 -8p -f in.wav
Global Time  =    60.385

timer64.exe v1 -j2 -8p -f in.wav
Global Time  =    59.729

timer64.exe v1 -j3 -8p -f in.wav
Global Time  =    35.290

timer64.exe v1 -j4 -8p -f in.wav
Global Time  =    29.652

timer64.exe v1 -j5 -8p -f in.wav
Global Time  =    22.567

timer64.exe v1 -j6 -8p -f in.wav
Global Time  =    19.046

timer64.exe v1 -j7 -8p -f in.wav
Global Time  =    19.110

timer64.exe v1 -j8 -8p -f in.wav
Global Time  =    14.793

timer64.exe v1 -j9 -8p -f in.wav
Global Time  =    12.196

timer64.exe v1 -j10 -8p -f in.wav
Global Time  =    10.990

timer64.exe v1 -j11 -8p -f in.wav
Global Time  =     9.952

timer64.exe v1 -j12 -8p -f in.wav
Global Time  =     9.068

timer64.exe v1 -j13 -8p -f in.wav
Global Time  =     8.388

timer64.exe v1 -j14 -8p -f in.wav
Global Time  =     7.899

timer64.exe v1 -j15 -8p -f in.wav
Global Time  =     7.362

timer64.exe v1 -j16 -8p -f in.wav
Global Time  =     7.079

flac-multithreading-v2-win
Code: [Select]
timer64.exe v2 -j1 -8p -f in.wav
Global Time  =    60.608

timer64.exe v2 -j2 -8p -f in.wav
Global Time  =    55.049

timer64.exe v2 -j3 -8p -f in.wav
Global Time  =    35.487

timer64.exe v2 -j4 -8p -f in.wav
Global Time  =    27.484

timer64.exe v2 -j5 -8p -f in.wav
Global Time  =    22.866

timer64.exe v2 -j6 -8p -f in.wav
Global Time  =    18.163

timer64.exe v2 -j7 -8p -f in.wav
Global Time  =    14.201

timer64.exe v2 -j8 -8p -f in.wav
Global Time  =    13.424

timer64.exe v2 -j9 -8p -f in.wav
Global Time  =    12.231

timer64.exe v2 -j10 -8p -f in.wav
Global Time  =    10.996

timer64.exe v2 -j11 -8p -f in.wav
Global Time  =     9.890

timer64.exe v2 -j12 -8p -f in.wav
Global Time  =     9.027

timer64.exe v2 -j13 -8p -f in.wav
Global Time  =     8.594

timer64.exe v2 -j14 -8p -f in.wav
Global Time  =     7.857

timer64.exe v2 -j15 -8p -f in.wav
Global Time  =     7.405

timer64.exe v2 -j16 -8p -f in.wav
Global Time  =     6.589
Title: Re: More multithreading
Post by: ktf on 2023-07-13 07:51:42
Sorry, I indeed made a mistake. The v2 binary is functionally identical to the first one; I think I copy-pasted the wrong file. I'll get back with a new binary.

edit: Here's a new one. Sorry for wasting your time with the previous one.
Title: Re: More multithreading
Post by: Case on 2023-07-13 09:04:09
I hope you don't let this experiment slow down regular single-threaded encoding. At least my use case is encoding several files at a time, and doing multiple files in separate threads is faster than spreading single-file encoding over multiple threads.
Title: Re: More multithreading
Post by: ktf on 2023-07-13 10:14:17
The goal of course is to not let this affect single-threading. As you can see from the graphs attached to the first post, that goal has been achieved: FLAC 1.4.3 and this new binary with -j1 perform exactly the same. Of course, there are plenty of environments without POSIX threads (pthreads) so single threading performance remains very important.

Indeed, doing multiple files at once is faster than multithreading within a single file, but the latter is more transparent to the user and to me it seemed easier to properly implement in the flac command line tool. Also, multithreading over files is possible with tools like GNU parallel, so this approach is complementary to that.
Title: Re: More multithreading
Post by: sundance on 2023-07-13 10:44:58
Results for v3 binary:
Code: [Select]
-j1:    Average time =  22.844 seconds (3 rounds), Encoding speed = 473.30x
-j2:    Average time =  18.255 seconds (3 rounds), Encoding speed = 592.27x
-j3:    Average time =   9.570 seconds (3 rounds), Encoding speed = 1129.82x
-j4:    Average time =   6.603 seconds (3 rounds), Encoding speed = 1637.35x
-j5:    Average time =   6.646 seconds (3 rounds), Encoding speed = 1626.76x
-j6:    Average time =   7.094 seconds (3 rounds), Encoding speed = 1524.18x
-j7:    Average time =   6.446 seconds (3 rounds), Encoding speed = 1677.41x
-j8:    Average time =   6.539 seconds (3 rounds), Encoding speed = 1653.46x
-j9:    Average time =   7.046 seconds (3 rounds), Encoding speed = 1534.42x
-j10:   Average time =   7.123 seconds (3 rounds), Encoding speed = 1517.90x
-j11:   Average time =   6.800 seconds (3 rounds), Encoding speed = 1589.92x
-j12:   Average time =   6.286 seconds (3 rounds), Encoding speed = 1719.92x
Scales almost perfectly at the beginning (1 number-crunching thread = 18.3 sec, 2 nc threads = 9.6 sec, 3 nc threads = 6.6 sec), but after that little or nothing is gained. Does the thread management eat all the extra time the additional cores could provide?
Title: Re: More multithreading
Post by: ktf on 2023-07-13 11:54:52
Does the thread management take all the extra time the additional cores could provide?
There's no thread management really, it just dispatches as much work as it can. Maybe the problem is the housekeeping thread can't keep up. Could you try what happens if you run with the undocumented option --no-md5-sum to see if scaling continues for thread 4, 5, 6 etc.? If that is the case, then maybe MD5 needs to run in its own thread.
Title: Re: More multithreading
Post by: C.R.Helmrich on 2023-07-13 12:09:37
Very promising results on v3 by sundance, I'd say. But Case raises an important point. I'm using single-threaded flac.exes to convert multiple files in parallel in foobar2000. If multithreaded encoding is to become the default in the FLAC executable one day, we should inform Peter et al. to change the predefined FLAC conversion dialog in foobar2000 to disable multithreading when two or more files are being converted to FLAC simultaneously. And since, IIRC, the -j switch didn't exist in previous versions, I fear that the multithreaded-by-default flac.exe would break compatibility with older versions in foobar?

Quote from: ktf
Loose mid-side does the full calculation once every few frames (once every 0.4 s or something) and then uses the result for the next few frames. That creates a dependency between frames, and thus between threads. Maybe I'll fix that by implementing a different 'loose mid-side' algorithm, perhaps the one that ffmpeg uses.

Sounds like a great idea. I've got some time this week, let me know if you can use some assistance in trying out an alternative approach.

By the way, I noticed that FLAC preset -6 deviates a bit from the convex speed-performance hull in your plots. It seems that, by using -r 5 instead of -r 6 in preset 6, one can shift that operating point leftward (i.e., towards faster) along the speed axis, with almost zero degradation of the compression ratio (at least in my experiments). Attached a painted-in estimate of how that would change your plot. Comments (by anyone, that is) appreciated.

Chris
Title: Re: More multithreading
Post by: ktf on 2023-07-13 12:36:26
If multithreaded encoding is to become the default in the FLAC executable one day
It isn't. FLAC/libFLAC is not only being used on desktops. There is a wide range of hardware this runs on (embedded devices and microcontrollers for example), and the intention is to keep it that way.

Sounds like a great idea. I've got some time this week, let me know if you can use some assistance in trying out an alternative approach.
Anyone who wants to contribute code is welcome to do so. A patch through the mailing list or a PR at Github are preferred.

Quote
By the way, I noticed that FLAC preset -6 deviates a bit from the convex speed-performance hull in your plots.
While I did propose a retune at this forum at some point, I'm not sure anymore whether striving for a convex hull is worth changing settings. As you said, people rely on defaults and on certain settings giving certain results. Changing -r 6 to -r 5 probably won't hurt much, but the result is pretty much 'cosmetic'. Also, it could very well be that this graph looks different on a different CPU, or even a different architecture. Maybe changing settings so the graph approaches the ideal on my CPU makes it less ideal on another CPU. The heavy hitters in x86 code are subtly different from those on ARM64.
Title: Re: More multithreading
Post by: Wombat on 2023-07-13 12:45:06
Again CDDA -8p -V, 5900x, 12 cores, 24 threads
Code: [Select]
v1 vs v3
j1 103x  104x
j2 106x  115x
j3 203x  225x
j4 298x  326x
j5 381x  426x
j6 460x  521x
j7 543x  615x
j8 620x  670x
j9 685x  710x
j10 705x  625x
j11 725x  670x
j12 740x  680x
j13 750x  675x
j14 752x  670x
j15 750x  650x
I triple checked the dip at j10. The first version scaled more even here.
It is very nice to see ~150x speed for -8ep in CUETools :)
Title: Re: More multithreading
Post by: sundance on 2023-07-13 14:09:46
@ktf: v3 binary with --no-md5-sum:
Code: [Select]
-j1:    Average time =  20.276 seconds (3 rounds), Encoding speed = 533.25x
-j2:    Average time =  18.350 seconds (3 rounds), Encoding speed = 589.20x
-j3:    Average time =   9.644 seconds (3 rounds), Encoding speed = 1121.11x
-j4:    Average time =   6.803 seconds (3 rounds), Encoding speed = 1589.22x
-j5:    Average time =   5.412 seconds (3 rounds), Encoding speed = 1997.66x
-j6:    Average time =   4.863 seconds (3 rounds), Encoding speed = 2223.47x
-j7:    Average time =   6.105 seconds (3 rounds), Encoding speed = 1771.10x
-j8:    Average time =   4.902 seconds (3 rounds), Encoding speed = 2205.78x
-j9:    Average time =   4.737 seconds (3 rounds), Encoding speed = 2282.62x
-j10:   Average time =   4.898 seconds (3 rounds), Encoding speed = 2207.28x
-j11:   Average time =   4.925 seconds (3 rounds), Encoding speed = 2195.18x
-j12:   Average time =   4.860 seconds (3 rounds), Encoding speed = 2224.69x
Another thing that came to mind to explain the performance plateau here: since I am reading ~2 GB of WAV and writing 1.1 GB of FLAC to an SSD (Samsung Evo 860 @ SATA III) in each encoding session, a considerable amount of time might be needed for that. I don't think this SSD setup is faster than some 600 MB/s.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-13 15:18:21
Does the thread management take all the extra time the additional cores could provide?
There's no thread management really, it just dispatches as much work as it can. Maybe the problem is the housekeeping thread can't keep up. Could you try what happens if you run with the undocumented option --no-md5-sum to see if scaling continues for thread 4, 5, 6 etc.? If that is the case, then maybe MD5 needs to run in its own thread.

So if I run -j2 for 2 threads, there's one thread encoding and one thread for housekeeping?  When I use -j2, one thread is using 100%, while the other is only at 8% on my CPU.  If I use -j8, I have 7 threads at 100%, and one thread at 35%.
Title: Re: More multithreading
Post by: ktf on 2023-07-13 15:40:55
So if I run -j2 for 2 threads, there's one thread encoding and one thread for housekeeping?  When I use -j2, one thread is using 100%, while the other is only at 8% on my CPU.  If I use -j8, I have 7 threads at 100%, and one thread at 35%.

Yes, that is correct. The thing is, you are running with setting -8p, which means each thread has lots to crunch and there is relatively little to do for the first thread (MD5 checksumming, preparing data etc.). sundance is running setting -7, which is much faster, so the 'housekeeping thread' has relatively more to do and scaling stops earlier. When running preset -0, I guess scaling already stops at 2 threads.

To fix this, MD5 calculation would need its own thread, but when to 'add' that thread depends on how much number crunching needs to be done. For fast presets like -0 through -5, the 3rd thread should probably already be dedicated to MD5. For presets -6 and -7 that would be the 4th thread, for -8 the 5th thread, and for settings like -8p or -8e something like the 16th thread.

I'm not sure whether there is a better way to fix this imbalance really.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-13 16:20:17
So if I run -j2 for 2 threads, there's one thread encoding and one thread for housekeeping?  When I use -j2, one thread is using 100%, while the other is only at 8% on my CPU.  If I use -j8, I have 7 threads at 100%, and one thread at 35%.

Yes, that is correct. The thing is, you are running with setting -8p, which means each thread has lots to crunch and there is relatively little to do for the first thread (MD5 checksumming, preparing data etc.). sundance is running setting -7, which is much faster, so the 'housekeeping thread' has relatively more to do and scaling stops earlier. When running preset -0, I guess scaling already stops at 2 threads.

To fix this, MD5 calculation would need its own thread, but when to 'add' that thread depends on how much number crunching needs to be done. For fast presets like -0 through -5, the 3rd thread should probably already be dedicated to MD5. For presets -6 and -7 that would be the 4th thread, for -8 the 5th thread, and for settings like -8p or -8e something like the 16th thread.

I'm not sure whether there is a better way to fix this imbalance really.

Is the MD5 sum calculated every X amount of data encoded, or is it calculated once the whole stream is encoded?
Title: Re: More multithreading
Post by: music_1 on 2023-07-13 17:49:17
flac-multithreading-v3-win
Code: [Select]
timer64.exe v3 -j1 -8p -f in.wav
Global Time  =    55.756

timer64.exe v3 -j2 -8p -f in.wav
Global Time  =    53.016

timer64.exe v3 -j3 -8p -f in.wav
Global Time  =    34.281

timer64.exe v3 -j4 -8p -f in.wav
Global Time  =    31.115

timer64.exe v3 -j5 -8p -f in.wav
Global Time  =    23.207

timer64.exe v3 -j6 -8p -f in.wav
Global Time  =    18.717

timer64.exe v3 -j7 -8p -f in.wav
Global Time  =    15.722

timer64.exe v3 -j8 -8p -f in.wav
Global Time  =    13.413

timer64.exe v3 -j9 -8p -f in.wav
Global Time  =    12.010

timer64.exe v3 -j10 -8p -f in.wav
Global Time  =    10.612

timer64.exe v3 -j11 -8p -f in.wav
Global Time  =     9.801

timer64.exe v3 -j12 -8p -f in.wav
Global Time  =     8.832

timer64.exe v3 -j13 -8p -f in.wav
Global Time  =     8.255

timer64.exe v3 -j14 -8p -f in.wav
Global Time  =     7.622

timer64.exe v3 -j15 -8p -f in.wav
Global Time  =     7.135

timer64.exe v3 -j16 -8p -f in.wav
Global Time  =     6.927
Title: Re: More multithreading
Post by: C.R.Helmrich on 2023-07-13 21:34:56
Quote from: Replica9000
Is the MD5 sum calculated every X amount of data encoded, or is it calculated once the whole stream is encoded?
It must be calculated frame by frame, otherwise one would have to store the entire audio input in memory (since no disk might be accessible during encoding), which would make FLAC's RAM consumption unbounded.

I'm not an expert in multithreading implementations, but couldn't the MD5 calculation (and bitstream writing, if that isn't already the case) be moved into the housekeeping/management thread, at least for presets where that thread is mostly idle?

... the intention is to keep it that way (single-threaded)

... Changing -r 6 to -r 5 probably won't hurt much, but the result is pretty much 'cosmetic'. Also, it could very well be this graph looks different on a different CPU, or even a different architecture.
The single-threaded and cosmetic aspects make sense, but I doubt the overall shape of the curve will look much different above preset 2 on different platforms/CPUs. The numbers make perfect sense and are well described by an O(n) complexity estimate. The main contributors to encoding runtime are the maximum LPC order and the number of apodizations tried, on any platform, and I didn't change that part of the configuration.

Chris
Title: Re: More multithreading
Post by: Porcus on 2023-07-13 22:36:41
Is the MD5 sum calculated every X amount of data encoded, or is it calculated once the whole stream is encoded?
MD5 works in chunks.
Precisely when in the process reference FLAC does that calculation I don't know, but as MD5 is calculated from the uncompressed PCM input, it could in principle be done "at any time". Most likely when the chunk is loaded into memory.
The verify option will decode the FLAC bitstream to PCM, which is then MD5'ed.


Edited.
As for some other topics that came up here:

Default: I agree that multithreading should not be a default. But it will be harder for a novice user to have to give the appropriate options; indeed, I guess that those who are barely used to .exe files, and not so much to command lines, would want to drag and drop.
In that case, one should maybe just make a flac-multithread.exe that defaults to a multithreading option...?
WavPack has a way to rename the executable to invoke options: https://hydrogenaud.io/index.php/topic,122626 . Yeah, David credits me for the idea, but there had long been such a way to invoke debugging already. FLAC and WavPack don't have the same history ...

-6 and the convex hull:
I have fallen prey to "eyeballing" the chart myself, not considering that time is on a log scale. As far as convexity is a concern, it should be judged on an un-logged time scale: if I am willing to double the running time from 1 minute to 2 minutes to save B bytes, then nothing says I am willing to wait 14 more minutes (an octupling of the 2) to save another 3*B bytes.

Some considerations I made on -6: https://hydrogenaud.io/index.php/topic,123025.msg1016398.html#msg1016398
Point is, it is "as heavy as predictor order 8 goes".
I did test -6r5 vs -6r6 though, and the -r made very little size impact.

Title: Re: More multithreading
Post by: Porcus on 2023-07-14 00:00:22
-j2 was often bad on this CPU (i5-1135G7, four cores and eight threads) with the first build. Others have posted results where it doesn't make much of a difference, but here it often outright slows things down. The limited results I have with version 1 vs version 3 indicate that the latter is an improvement.

I did a few runs, also let it cool off to "ensure" that -j2 isn't too much affected by some throttling induced by running -j1 right before. Will do more, but reporting -0 figures here.
Table is a bit cryptic: For each -0 -j<N> I did
* pause for 2 minutes to allow the CPU to cool down
* ran three consecutive encodes of the 38 CDs in my signature with the first build.
* new pause for 2 minutes
* three consecutive encodes with version 3 of the exe
Then advance the "j".

Numbers quoted are the number of seconds "from cool", and then under "next": how much longer the next two runs took, on a presumably hotter CPU. "Longer" ... with one exception.

Code: [Select]
-0    -j1  next    -j2  next    -j3  next     -j4  next     -j5  next     -j6  next     -j7  next    -j8  next    -j9  next
v1    124  +3,+6   141  +1,+9   105  +7,+7    102  +9,+9    100  +15,+10   96  +16,+19  113  −2,+5   104  +5,+18  104  +9,+16
v3    120  +5,+7   109  +1,+1   105  +10,+9   100  +12,+9   101  +6,+10   103  +6,+10   103  +8,+11  108  +18,+9  102  +9,+5
So a high thread count was kinda useless with -0. -j3 ... hard to tell from this alone whether the "success" of -j3 is merely down to whatever happened to -j2.

I also tried -0b4096 --no-md5-sum, and here the "next" figures on j1 were negative, meaning those runs took less time than the one with the two-minute cooldown first - it might be that the CPU hadn't finished "idling" whatever it was doing when I started the .bat and left the computer:
Code: [Select]
      -j1  next    -j2  next   -j3  next    -j4  next     -j5  next     -j6  next    -j7  next     -j8  next    -j9  next
v1     91  −8,−4    90  +8,+2   54  +7,+12   63  +9,+1     58  +12,+17   60  +7,+11   59  +10,+14   59  +8,+14   59  +8,+16
v3     93  −9,−6    68  +7,+7   51  +9,+13   54  +15,+14   58  +9,+18    58  +12,+9   59  +10,+17   61  +13,+13  60  +8,+12
Not as strikingly bad a -j2, but whatever happened to it, it is much better in version 3. Now the evidence that -j3 is the sweet spot (for this fast fixed-predictor setting!) is slightly clearer.


I'm putting on a -0b4096 (with MD5) as well as more common settings for an overnight or over-weekend job.
Title: Re: More multithreading
Post by: rutra80 on 2023-07-14 00:12:45
15:42 of CDDA on i7-4790K:

-j1 -8ep - 101s
-j2 -8ep - 99s
-j4 -8ep - 34s
-j8 -8ep - 25s

V3:
-j1 -8ep - 103s
-j2 -8ep - 118s
-j4 -8ep - 37s
-j8 -8ep - 26s
:(
Title: Re: More multithreading
Post by: Porcus on 2023-07-14 00:20:35
Oh my, the jury is sent out again on what makes -j2 worse.
Title: Re: More multithreading
Post by: Wombat on 2023-07-14 01:17:01
Oh my, the jury is sent out again on what makes -j2 worse.
-j2 is not worse with Replica9000, music_1 and my Ryzens.
It may even come down to some choice of a modern compiler that older Intels behave a bit unevenly.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-14 03:00:08
1h 43m 16/44.1 file, Ryzen 5850U.
flac git-3e2d9a43 20230712
Code: [Select]
      -0      -1      -2      -3      -4      -5      -6      -7       -8
 j1:  3.717s  3.936s  4.175s  4.318s  4.947s  5.872s  8.217s  10.183s  15.206s
 j2:  2.404s  2.395s  2.351s  2.262s  2.879s  3.822s  6.057s   8.070s  13.112s
 j3:  2.525s  2.415s  2.511s  2.270s  2.884s  2.397s  3.407s   4.500s   7.349s
 j4:  2.529s  2.443s  2.564s  2.318s  2.904s  2.529s  2.754s   3.370s   5.439s
 j5:  2.558s  2.385s  2.588s  2.420s  2.944s  2.560s  2.795s   2.933s   4.440s
 j6:  2.604s  2.407s  2.660s  2.393s  3.000s  2.579s  2.797s   2.960s   3.853s
 j7:  2.631s  2.416s  2.640s  2.380s  2.991s  2.558s  2.823s   2.971s   3.438s
 j8:  2.612s  2.433s  2.659s  2.444s  3.043s  2.602s  2.838s   2.967s   3.603s
 j9:  2.684s  2.441s  2.637s  2.385s  3.026s  2.540s  2.874s   3.003s   3.850s
j10:  2.613s  2.425s  2.678s  2.425s  3.019s  2.551s  2.864s   2.977s   3.633s
j11:  2.681s  2.439s  2.753s  2.490s  2.993s  2.537s  2.824s   2.976s   3.566s
j12:  2.691s  2.401s  2.692s  2.420s  3.011s  2.571s  2.805s   3.011s   3.492s
j13:  2.631s  2.440s  2.627s  2.462s  3.009s  2.565s  2.817s   3.000s   3.556s
j14:  2.646s  2.448s  2.648s  2.429s  3.043s  2.595s  2.818s   2.957s   3.576s
j15:  2.672s  2.457s  2.762s  2.419s  3.003s  2.577s  2.848s   2.954s   3.518s
j16:  2.623s  2.473s  2.657s  2.475s  3.024s  2.573s  2.917s   2.953s   3.692s

Code: [Select]
      -0p     -1p     -2p     -3p     -4p     -5p     -6p      -7p      -8p
 j1:  3.806s  3.987s  4.224s  5.433s  6.403s  8.354s  16.345s  19.972s  44.046s
 j2:  2.516s  2.466s  2.434s  3.317s  4.293s  6.288s  14.358s  17.957s  42.445s
 j3:  2.586s  2.474s  2.645s  2.498s  4.430s  3.694s   8.293s  10.356s  23.904s
 j4:  2.615s  2.462s  2.732s  2.732s  4.492s  2.954s   6.167s   7.687s  17.831s
 j5:  2.705s  2.491s  2.812s  2.633s  4.470s  2.987s   5.050s   6.239s  14.655s
 j6:  2.712s  2.676s  2.765s  2.674s  4.478s  2.989s   4.385s   5.452s  12.682s
 j7:  2.721s  2.679s  2.771s  2.625s  4.444s  2.974s   3.967s   4.989s  11.348s
 j8:  2.745s  2.563s  2.816s  2.623s  4.404s  3.007s   3.859s   4.485s  10.559s
 j9:  2.736s  2.622s  2.754s  2.633s  4.431s  3.007s   4.558s   5.438s  11.801s
j10:  2.721s  2.581s  2.755s  2.638s  4.415s  2.991s   4.307s   5.215s  12.191s
j11:  2.756s  2.641s  2.824s  2.623s  4.415s  3.025s   4.043s   4.908s  11.589s
j12:  2.769s  2.818s  2.802s  2.628s  4.454s  3.027s   3.968s   4.663s  10.990s
j13:  2.797s  2.669s  2.841s  2.645s  4.450s  2.990s   4.084s   4.509s  10.508s
j14:  2.776s  2.575s  2.781s  2.601s  4.465s  3.018s   4.084s   4.441s  10.017s
j15:  2.738s  2.566s  2.889s  2.598s  4.482s  3.003s   4.148s   4.507s   9.646s
j16:  2.800s  2.569s  2.822s  2.623s  4.443s  3.046s   4.138s   4.515s   9.299s
Title: Re: More multithreading
Post by: ktf on 2023-07-14 07:31:12
Thank you all for the results. I do have a few ideas on what can be changed to improve performance further. Might take a while though.

As many are asking for specifics, I'll try to outline the process. The flac command line tool isn't changed much. It accepts the new option and parses it, then passes it to libFLAC. Nothing else is changed. The real magic happens in libFLAC.

libFLAC accepts chunks of PCM data through the FLAC__stream_encoder_process function call. When single threading, this function directly processes the data. As soon as it has got enough samples to fill a single frame, it will process those samples into a frame and write that frame. This involves adding data to the verify queue (if applicable), calculating the MD5 sum, creating a FLAC frame and writing it.

When multithreading, the FLAC__stream_encoder_process call still adds the data to the verify queue and calculates the MD5 sum, but then copies the data to a separate data structure and signals a thread to pick it up. It also checks whether the 'oldest' chunk of data has finished processing, so it can be written. Sometimes one thread runs faster than another (because one is interrupted by the OS, for example), but we must make sure the thread holding the oldest data writes its data first, otherwise the audio data is no longer in the right order. When there is nothing left to be done, FLAC__stream_encoder_process returns to the client process.
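The "oldest data is written first" rule can be sketched with a condition variable (a hypothetical Python sketch, not the actual libFLAC code; `OrderedWriter` and the frame strings are made up for illustration): workers may finish frames in any order, but each blocks until every older frame has been written.

Code: [Select]
```python
import threading

class OrderedWriter:
    def __init__(self):
        self.next_to_write = 0          # index of the oldest unwritten frame
        self.cond = threading.Condition()
        self.written = []               # stands in for the output stream

    def write_frame(self, index, frame):
        with self.cond:
            # Block until every older frame has been written.
            while index != self.next_to_write:
                self.cond.wait()
            self.written.append(frame)
            self.next_to_write += 1
            self.cond.notify_all()      # wake threads waiting on later frames

writer = OrderedWriter()

def worker(index):
    frame = f"frame{index}"             # stand-in for actual encoding work
    writer.write_frame(index, frame)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in reversed(threads):             # start out of order on purpose
    t.start()
for t in threads:
    t.join()

# Output order matches input order even though threads started reversed.
assert writer.written == [f"frame{i}" for i in range(8)]
```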

So, there is one thread (the main thread, which I've called the housekeeping thread before) that does the dispatching, the MD5 calculation and the writing of finished frames, and a bunch of threads that do the converting of PCM samples to FLAC frames. The main problem is that these are almost never balanced: for very fast presets like -0, the MD5 sum calculation takes as much time as converting PCM samples to FLAC frames, so there is no use for more than 1 extra thread. However, for presets like -8p, the main thread has pretty much nothing to do, so when invoked with a low thread count, one thread is idling all the time.

The only way to fix this problem is to no longer specialise threads too much. I don't want to "cheat" by adding an extra thread when the first one has nothing to do, nor add an extra thread for MD5 which may or may not be necessary: the number of threads the user asks for must be the number of threads that are actually spawned.

So my idea is to create two work queues: the main thread adds work to an MD5 queue (which must be picked up by one particular thread, because MD5 calculation cannot happen in parallel) and to a frame queue (which can be picked up by any thread in parallel). That means the main thread has even less to do than it has now, so as soon as the queue is full, it can start working on a frame by itself. As soon as it is finished with that frame, it will go back to managing the other threads. I'll also make sure one thread can leapfrog another, because that is currently not possible. That might improve performance running on CPUs with both performance and efficiency cores like the newest Intel CPUs and a lot of ARM CPUs.
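A rough sketch of that two-queue idea (hypothetical Python, heavily simplified: the queue names and the "encoding" are made up, and the main-thread-joins-in part is left out): one consumer drains the MD5 queue serially in input order, while any number of workers drain the frame queue in parallel.

Code: [Select]
```python
import hashlib
import queue
import threading

md5_queue = queue.Queue()    # consumed serially, in order, by one thread
frame_queue = queue.Queue()  # consumed by any worker, in parallel

md5 = hashlib.md5()
encoded = {}
encoded_lock = threading.Lock()

def md5_worker():
    while True:
        chunk = md5_queue.get()
        if chunk is None:
            return
        md5.update(chunk)  # serial: only this thread touches the hash state

def frame_worker():
    while True:
        item = frame_queue.get()
        if item is None:
            return
        index, chunk = item
        with encoded_lock:
            encoded[index] = chunk[::-1]  # stand-in for encoding a frame

chunks = [bytes([i]) * 4096 for i in range(16)]
hasher = threading.Thread(target=md5_worker)
workers = [threading.Thread(target=frame_worker) for _ in range(3)]
hasher.start()
for w in workers:
    w.start()

for i, c in enumerate(chunks):
    md5_queue.put(c)           # MD5 sees chunks in input order
    frame_queue.put((i, c))    # frames can be picked up by any worker

md5_queue.put(None)            # sentinels shut the threads down
for _ in workers:
    frame_queue.put(None)
hasher.join()
for w in workers:
    w.join()

assert md5.hexdigest() == hashlib.md5(b"".join(chunks)).hexdigest()
assert sorted(encoded) == list(range(16))
```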
Title: Re: More multithreading
Post by: C.R.Helmrich on 2023-07-14 09:57:24
Quote from: Porcus
Some considerations I made on -6: https://hydrogenaud.io/index.php/topic,123025.msg1016398.html#msg1016398
Point is, it is "as heavy as predictor order 8 goes".
I did test -6r5 vs -6r6 though, and the -r made very little size impact.
Thanks, Porcus, for pointing me to that study of yours. Quoting you: "Why is the difference to -5 small and the difference to -7 large?  It is not the -r5 to -r6. In the 38 CD corpus in my signature, -5 -r6 improved 0.0044 percent." That is much too little improvement for quite a few percent encoder slowdown, if you ask me.

Thanks for the explanation, ktf. Your plan sounds worth trying.

Chris
Title: Re: More multithreading
Post by: Porcus on 2023-07-14 13:30:51
Some more assorted comments:


@ktf on the plan forward:
* Although you want a "-j that works no matter settings", is that really an imperative? If there is very little to gain from multi-threading, then say "-j 4 will consider using 4 threads; it may use less if it doesn't think it is worth it"?
I'd say that if there is nothing gained in splitting the housekeeping task, then don't do it.
Maybe - like how "-M" tries to be smart and "-m" does it brute-force - there could be a -J4 for "allow up to 4 threads, the encoder decides if it is worth it", while -j4 uses 4 threads (likely not useful for settings below xx, but if user really wants ...)
* Also, if the user goes outside the presets, then they cannot expect options to be doing a good job. If I try -8 -b4095 and am surprised that the result is so much worse than plain -8 - I have done precisely that - then it is up to me to learn why it wasn't very efficient.
But this makes for a case to tune presets so that they are more multi-threading friendly, at least if that doesn't hurt single-threading much. Say, block size for -0, -1, -2. And maybe also retune the scope of -M, so that it fits multi-threading.


@Wombat on "old" intels: The CPU in question was launched 2020 Q3, that isn't ... old. Maybe there is something weird about it, but it isn't that it is lacking the last three generations of instruction sets.
And that is why the bad result on -j2 surprises me, as nobody else has posted anything that bad.


That is much too little improvement for quite a few percent encoder slowdown, if you ask me.
But -r6 doesn't make for the slowdown - at least on my computers. It is the subdivide_tukey(2) that takes more time.
YMMV on material and CPU, but I tried one 1.3 GB compilation (same as used here, same computer too) (https://hydrogenaud.io/index.php/topic,124188.msg1029916.html#msg1029916)
-5 is a 30 second job. The difference between -5r5 and -5r6 was half a second. The difference between -5r1 and -5r5 was half a second too. But going -5r7 cost a few seconds.
-6 is a 40 second job. So it isn't the -r (up to 6).
Title: Re: More multithreading
Post by: Wombat on 2023-07-14 14:51:32
@Wombat on "old" intels: The CPU in question was launched 2020 Q3, that isn't ... old. Maybe there is something weird about it, but it isn't that it is lacking the last three generations of instruction sets.
And that is why the bad result on -j2 surprises me, as nobody else has posted anything that bad.
Somehow I was thinking about the i5-7500T you used in a different test. Your newer one even has AVX-512 support.
Sundance with his older 8700 also gets a faster j2.
Title: Re: More multithreading
Post by: Porcus on 2023-07-14 15:19:58
The difference between -5r5 and -5r6 was half a second. The difference between -5r1 and -5r5 was half a second too. But going -5r7 cost a few seconds.
Hm well, not sure about the latter, after a couple of re-runs. Maybe r7 is cheap too.

@ktf : Is it so that if fine partitioning is not needed (so that the size impact of -r<high> is small), then the time impact is by and large small as well? I think you once explained that -r 8 does indeed partition into 2^8 whether or not that helps, indicating that the "time cost" is sunk before one knows whether it was of any use.

Anyway if someone feels like testing it: https://hydrogenaud.io/index.php/topic,123025.msg1030124.html#msg1030124


@Wombat : Yeah, and I also use an i5-6300U, launched 2015. Used in a WavPack multithreading test. (https://hydrogenaud.io/index.php/topic,124188.msg1029916.html#msg1029916)
Title: Re: More multithreading
Post by: Porcus on 2023-07-14 15:36:51
Tried -0b4096, -5, -7 and -8. Confirming that v3 improves -j2. And the improvement is so clear that I won't bother to do any more comparisons between the two.

Again, cooldown and three consecutive runs.

Code: [Select]
             -j1  next        -j2  next     -j3  next     -j4  next     -j5  next     -j6  next     -j7  next     -j8  next     -j9  next
-0b4096, v1  118  +0,+2       101  +7,+4     89  +4,+7     88  +1,+10    87  +1,+7     86  +4,+14    86  +2,+11    87  +3,+6     87  +5,+6
-0b4096, v3  118  +1,+3        90  +5,+8     88  +5,+8     87  +8,+10    86  +9,+8     85  +7,+8     88  +2,+4     87  +2,+7     86  +5,+6
-5, v1       169  +4,+3       164  +0,−2    104  +3,+7     93  +5,+8     96  +2,+6     93  +5,+11    98  +2,+5     93  +8,+11    94  +6,+6
-5, v3       172  +3,−1       143  +4,+0    101  −0,+3     92  +5,+9     93  +4,+10    93  +7,+12    92  +8,+10    92  +7,+10    92  +8,+10
-7, v1       272  +1,+3       275  +1,+0    166  +0,+1    136  +7,+13   126  +7,+8    117  +1,+6    111  +9,+13   111  +1,+8    109  +8,+29
-7, v3       272  +8,+2       245  −0,−2    156  +7,+9    135  +9,+12   119  +7,+9    106  +1,+12   108  +1,+11   110  +8,+12   108  +7,+12
-8, v1    (*)404  +something  486  +2,+9    250  +7,+6    213  +9,+12   197  +6,+8    174  +2,+19   164  +6,+10   141  +1,+15   143  +6,+8
-8, v3       423  +1,+1       391  +2,+6    255  +4,+3    222  +1,+5    188  +2,+37   184  −0,−5    149  +1,+18   139  +9,+12   140  +1,+13
(*) Unreliable "404": it was a re-run on maybe an even colder CPU, because at first I got a nonsense result where it took like 437 seconds and then less on the two immediately following runs (on a heated CPU). Something must have kept the CPU busy during those 437.
Since it had more time to cool down when I redid it, the 404 might be reading a bit low. Since the suspiciously high -8j1 still was ten percent faster than the fastest -8j2, it does anyway confirm that j2 was much slower in the version 1 exe.
Title: Re: More multithreading
Post by: ktf on 2023-07-14 15:38:36
* Although you want a "-j that works no matter settings", is that really an imperative?
That's not really what I said. I'd like to improve multi-threading by not having threads that idle much, and I also don't want to spawn more threads than asked for by -j. I could say: the user asked for four threads and one is mostly idling, so I'll spawn a fifth to compensate for the idling, but I don't want that.

Quote
If there is very little to gain from multi-threading, then say "-j 4 will consider using 4 threads; it may use less if it doesn't think it is worth it"?
The goal is to try to make this scale as well as it possibly can, without touching single-threaded behaviour.

Quote
I'd say that if there is nothing gained in splitting the housekeeping task, then don't do it.
There is potentially a lot of gain possible.

But -r6 doesn't make for the slowdown - at least on my computers. It is the subdivide_tukey(2) that takes more time.
I agree, it is probably the two extra apodizations that make this so much slower.

@ktf : Is it so that if fine partitioning is not needed (so that size impact of -r<high> is small) then time impact is by and large small as well? I think you once explained that -r 8 does indeed partition in 2^8 whether or not that helps, indicating that the "time cost" is sunk before one knows whether it was any use of it.
Depends. If an incompatible blocksize is chosen, the maximum -r is capped anyway. With a compatible blocksize, the largest part of the time impact is the search for the optimal partition order, so yes, that time is spent anyway, independent of whether the result is used. However, the bitwriter is a little slower with more partitions. I can't remember whether that is at all measurable without instrumentation.
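The "time is spent anyway" point can be illustrated with a simplified sketch (hypothetical Python, not libFLAC's actual implementation): the residual is summed once at the finest partition level, and every coarser level falls out of cheap pairwise merges, so the expensive pass happens before anyone knows which partition order wins.

Code: [Select]
```python
# Simplified sketch: sum |residual| at the finest partition level once,
# then derive all coarser levels by merging pairs of sums.
def partition_sums(residual, max_order):
    parts = 1 << max_order
    size = len(residual) // parts             # assume it divides evenly
    finest = [sum(abs(x) for x in residual[i * size:(i + 1) * size])
              for i in range(parts)]          # the one expensive pass
    levels = [finest]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # cheap by comparison: each coarser level halves the partition count
        levels.append([prev[2 * i] + prev[2 * i + 1]
                       for i in range(len(prev) // 2)])
    return levels  # levels[0] = order max_order, ..., levels[-1] = order 0

levels = partition_sums(list(range(64)), 3)
assert len(levels) == 4                       # orders 3, 2, 1, 0
assert levels[-1] == [sum(range(64))]         # order 0 covers the whole block
```

Whichever order an estimator then picks, the cost of the full pass over the residual is already sunk; only the bitwriter pays a little extra for more partitions.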
Title: Re: More multithreading
Post by: bennetng on 2023-07-14 16:11:50
i3-12100, 16GB RAM, NVMe SSD (~2.7GB/s write, ~3.3GB/s read), recompress a CDDA flac image to a new file, using PowerShell measure-command totalseconds.

v3 -8
wrote 460350140 bytes
j1 13.889325
j2 10.5771965
j3 5.3922851
j4 3.9220238
j5 4.0600122
j6 4.0264503
j7 4.1554986
j8 4.0284002

v3 -8p
wrote 460143727 bytes
j1 41.0064016
j2 37.8299355
j3 18.7853751
j4 13.4384533
j5 13.5461772
I think there is no need to test up to j8.

v2 -8
wrote 460350140 bytes
j1 14.3544608
j2 10.7919867
j3 5.6689622
j4 3.9546462
j5 4.0161195

v2 -8p
wrote 460143727 bytes
j1 41.0112598
j2 37.7006967
j3 18.9425906
j4 12.8517866
j5 10.4393538
j6 10.5004978
Oh, v2 is better for me with -8p.
Title: Re: More multithreading
Post by: bennetng on 2023-07-14 19:03:02
Sure. Source is at https://github.com/xiph/flac/pull/634 (edit: https://github.com/ktmf01/flac/tree/pthread2 more specifically). Binary is attached, but static binaries on Linux are always less portable than on Windows, so I hope it works.
Code: [Select]
$ ./flacv1
./flacv1: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./flacv1)
./flacv1: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./flacv1)
Code: [Select]
$ sudo apt-get install libc6
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libc6 is already the newest version (2.31-13+deb11u6).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
What should I do? Thanks.

Title: Re: More multithreading
Post by: Replica9000 on 2023-07-14 19:18:37
Sure. Source is at https://github.com/xiph/flac/pull/634 (edit: https://github.com/ktmf01/flac/tree/pthread2 more specifically). Binary is attached, but static binaries on Linux are always less portable than on Windows, so I hope it works.
Code: [Select]
$ ./flacv1
./flacv1: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./flacv1)
./flacv1: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./flacv1)
Code: [Select]
$ sudo apt-get install libc6
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libc6 is already the newest version (2.31-13+deb11u6).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
What should I do? Thanks.



Update your OS.  Are you running Debian old stable?
Title: Re: More multithreading
Post by: bennetng on 2023-07-14 19:34:26
Update your OS.  Are you running Debian old stable?
It is already the latest one I can download and use, does it mean I must use a different distribution?
https://mxlinux.org/download-links/
MX-21.3_x64 “ahs”, an “Advanced Hardware Support” release for very recent hardware, with 6.0 kernel and newer graphics drivers and firmware. 64 bit only. Works for all users, but especially if you use AMD Ryzen, AMD Radeon RX graphics, or 9th/10th/11th generation Intel hardware.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-14 20:29:45
Update your OS.  Are you running Debian old stable?
It is already the latest one I can download and use, does it mean I must use a different distribution?
https://mxlinux.org/download-links/
MX-21.3_x64 “ahs”, an “Advanced Hardware Support” release for very recent hardware, with 6.0 kernel and newer graphics drivers and firmware. 64 bit only. Works for all users, but especially if you use AMD Ryzen, AMD Radeon RX graphics, or 9th/10th/11th generation Intel hardware.
[attach type=image]26523[/attach]

I'm not familiar with MX, but it appears to be based on Debian stable.  Debian 11 is now old stable.  You might need to update your repositories.  Current Debian stable has libc 2.36.
Title: Re: More multithreading
Post by: bennetng on 2023-07-14 20:51:05
OK, thanks, and I just found the page below, so looks like there is no need to do everything from scratch.
https://mxlinux.org/wiki/system/upgrading-from-mx-21-to-mx-23-without-reinstalling/
Title: Re: More multithreading
Post by: bennetng on 2023-07-15 08:04:32
i3-12100, 16GB RAM, NVMe SSD (~2.7GB/s write, ~3.3GB/s read), recompress a CDDA flac image to a new file, using PowerShell measure-command totalseconds.
Windows v1
-8
wrote 460350140 bytes
j1 13.7896849
j2 10.8577313
j3 5.6010285
j4 4.0480883
j5 4.0507501

-8p
wrote 460143727 bytes
j1 40.9393991
j2 37.7043669
j3 19.8624566
j4 12.3921615
j5 10.8594597
j6 10.7756123

Linux v1, using "time" command showing "real"
-8
wrote 460350151 bytes
j1  12.924s
j2  10.143s
j3  5.227s
j4  4.136s
j5  4.113s

-8p
wrote 460143729 bytes
j1 36.962s
j2 35.696s
j3 18.382s
j4 13.629s
j5 11.572s
j6 12.001s

So yes, v3 is worse than v2 and v1 in -8p.
Title: Re: More multithreading
Post by: C.R.Helmrich on 2023-07-15 11:56:41
But -r6 doesn't make for the slowdown - at least on my computers. It is the subdivide_tukey(2) that takes more time.
I agree, it is probably the two extra apodizations that makes this so much slower.
Hmm, on my (ancient, agreed) mobile Intel i7 M620 with two cores and HyperThreading and 3 WAVs encoded in parallel in foobar, I saw about 5% speedup, in several tries, when going from -r 6 to -r 5 with FLAC 1.4.3. But alright then, apparently less than that on other systems, and like I wrote earlier, the main contributors to encoding runtime are the max. LPC order and number of apodizations tried, on any platform. For the record, using LPC order 10 instead of 8 (indeed the obvious approach) made preset 6 exactly as slow as preset 7 with order 12 on my laptop. I assume that's because max. order 12 is heavily code-optimized and max. order 10 isn't? Does anyone know?

Btw, back to the topic subject: when in doubt, I very much prefer aiming for maximum possible multithreading efficiency with few (like 4 or so) threads rather than with more than 8 threads. With a dozen CPU cores or more available, one should probably use file-parallel encoding anyway. In the video coding project that I'm currently contributing to (VVenC (https://github.com/fraunhoferhhi/vvenc), in case anyone's interested), we noticed that the multithreading performance doesn't scale too well at high thread and core counts on some CPUs. Since it does on other CPUs, the CPU architecture itself might be the reason.

Chris
Title: Re: More multithreading
Post by: sundance on 2023-07-15 17:04:15
Quote
I hope you don't let this experiment slow down regular single threaded encoding. At least my use case is encoding several files at a time and doing multiple files in separate threads is faster than spreading single file encoding over multiple threads.
Just my 2 cents: Multiple-file threading (e.g. in foobar) is faster here too, but I would definitely vote "YEA" for ktf's efforts in single-file multithreading, especially if single-threaded performance is not affected. I guess I'm not the only one who uses "flac.exe" on a simple command line, or in a script or simple tool that calls the flac binary. All those scenarios benefit a lot from a flac.exe that runs at 4x its single-thread speed...
Title: Re: More multithreading
Post by: cid42 on 2023-07-16 13:02:18
Well done building multithreading into libflac, bet that was a pain :P

One of input, output, MD5 and encode is the bottleneck at any given time (probably the same thing throughout), ideally the threading model would automatically prioritise the bottleneck to minimise idle time.

My first thought is no specialised threads (aside from housekeeping at the start of FLAC__stream_encoder_process to make sure that the first thread to read is the one, if any, that has partial unprocessed data from the previous call). If a thread handles a frame, it does everything required for that frame, to keep things in the fastest cache possible, preferably L1/L2. Mutex for input, output and MD5 to keep them serial (they're the only serial things). Keep track of a frame's start location and the global I/O/MD5 locations to determine what should be prioritised as tasks get completed. MD5 should probably take priority over encode as it's serial, but if one thread is hashing another can encode first (and even write before hashing, if it's still not its turn to hash).

I think that's as good as it gets when each state is handled in frame-sized chunks. MD5 and output both being serial but arbitrary order started me down a line of thinking about requiring a heuristic for optimal priority, however if either is the bottleneck the opportunity to pick quickly disappears as the wavefront for each will be on different threads.

There may be a benefit to two working frames per thread so that a thread can work on something while the other frame is stalled for whatever reason. It might be beneficial when the bottleneck changes over time, but mostly it papers over the idle time that would otherwise be present when extra threads cannot be spawned and we're dealing with frame-sized chunks always. Alternatively ignore this complexity and the user can DIY this behaviour by setting a higher thread count than they have hardware threads.

Haven't considered verify step, but that should be easy enough to add to the above model.

It seems there is a certain minimum amount of work that needs to go in a thread-task, otherwise the overhead completely swamps any possible gain. So, if you set a small blocksize, for example 32, multithreading shows massive negative gains.
Maybe that can be countered by choosing some arbitrary minimum number of samples a thread handles per iteration, for example min_samples=4096 blocksize=32 would mean a single thread handles the next chunk of 128 frames.
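The chunking arithmetic is simple enough to sketch (hypothetical helper; min_samples and the numbers are the example values from the paragraph above):

Code: [Select]
```python
# With a minimum number of samples per thread-task, a tiny blocksize
# gets batched into many frames per task so overhead stays bounded.
def frames_per_task(blocksize, min_samples=4096):
    return max(1, -(-min_samples // blocksize))  # ceiling division

assert frames_per_task(32) == 128    # the min_samples=4096, blocksize=32 case
assert frames_per_task(4096) == 1    # a normal blocksize: one frame per task
assert frames_per_task(8192) == 1    # never less than one frame per task
```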
Title: Re: More multithreading
Post by: Porcus on 2023-07-16 22:03:41
* Although you want a "-j that works no matter settings", is that really an imperative?
That's not really what I said. I'd like to improve multi-threading by not having threads that idle much, and I also don't want to spawn more threads than asked for by -j.
But you could spawn less threads?
Say, choose to implement -j7 to mean "up to 7", where the selection algorithm could be subject to change. And then maybe let -j7,7 mean "I ordered seven!", like -r7,7 works. This of course depends on whether you are comfortable about releasing a 1.5.0 with a "crude" selection algorithm.

You have quite some choice here, because reference FLAC is not at all consistent in applying numerical arguments. -l7 and -r7 mean "at most 7", but there is no "-l7,7"; on the other hand, -q7 means "exactly 7" and there is no -q6,7 to force a range. There is however a -q0 for "let encoder decide".
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-17 00:08:17
* Although you want a "-j that works no matter settings", is that really an imperative?
That's not really what I said. I'd like to improve multi-threading by not having threads that idle much, and I also don't want to spawn more threads than asked for by -j.
But you could spawn less threads?
Say, choose to implement -j7 to mean "up to 7", where the selection algorithm could be subject to change. And then maybe let -j7,7 mean "I ordered seven!", like -r7,7 works. This of course depends on whether you are comfortable about releasing a 1.5.0 with a "crude" selection algorithm.

You have quite some choice here, because reference FLAC is not at all consistent in applying numerical arguments. -l7 and -r7 mean "at most 7", but there is no "-l7,7"; on the other hand, -q7 means "exactly 7" and there is no -q6,7 to force a range. There is however a -q0 for "let encoder decide".

The selection algorithm could only choose what is optimal before the task starts.  If I'm not mistaken, once a task is using x threads, that can't be changed until the task ends.  Maybe have -j7 use 7 threads, and have -j0 be automatic/optimal (probably 1 thread per physical core/FPU).
Title: Re: More multithreading
Post by: ktf on 2023-07-17 08:16:46
Well done building multithreading into libflac, bet that was a pain :P
Not really. I'm dreading implementation of Ogg FLAC metadata editing way more actually.

Quote
If a thread handles a frame it does everything required for that frame to keep things in the fastest cache possible, preferably L1/L2.
I think threading overhead is way more important than having stuff in L1/L2. The occasional stalls (and associated context switches) are way more expensive than having to load stuff from main memory more often.

Quote
Mutex for input output and MD5 to keep them serial (they're the only serial things).
As far as I know mutexes are not meant to keep things serial, they are to lock things.

Quote
There may be a benefit to two working frames per thread
That is the exact difference between v1 and v3.

But you could spawn less threads?
Yes, but I'd rather first try to get this to scale properly.
Title: Re: More multithreading
Post by: Porcus on 2023-07-17 11:37:34
The selection algorithm could only choose what is optimal before the task starts.  If I'm not mistaken, once a task is using x amount of threads, that can't be changed until the task ends.
I'd guess that there is room for a pretty good solution even if you constrain yourself to making the choice once, when the executable is started, from the other options passed (like "-0").
(Except, the application has so many options that taking them all as input to the threads selection in a smart way, would be quite a job. But I'd say it would be good enough, if you got something that handles the numerical presets -0 to -8 (and above that, just go full steam I guess?) and with/without --verify?)
Title: Re: More multithreading
Post by: cid42 on 2023-07-17 11:52:38
As far as I know mutexes are not meant to keep things serial, they are to lock things.
When multiple threads want to use a resource simultaneously, locking is the easiest way to ensure that they form a queue instead of a free-for-all. If a global variable input_loc kept track of where the next read is in samples, a lock ensures that 4 threads trying to simultaneously read blocksize=1000 see the correct one of input_loc=0,1000,2000,3000 and fread in the right order, instead of them all probably seeing input_loc=0 and freading in arbitrary order.

Keep track of the location of I/O/MD5, when it's time for a worker to interact with one of them lock it first. Mutex required for input, technically output and md5 don't require mutexes as we're keeping track of unique frame locations and only one thread should interact at a time, but I believe explicitly using a mutex updates the thread-local view of a global variable which may otherwise be an old cached value and may result in a stall or at least a delay (could be wrong on that point).

Rough pseudocode ignoring verify step and not including wake/sleep mechanism:
Code: [Select]
enum{IDLE, INPUT_READ, ENCODED, WRITTEN};

struct{
    mutex input_m, output_m, md5_m;
    uint64_t input_loc, output_loc, md5_loc;
} globals;

struct{
    int status;
    uint64_t frame_loc;
} worker;

while(1){ //worker loop
    if(status==IDLE){
        lock input
        frame_loc=input_loc
        read the next frame's worth of input
        input_loc+=blocksize
        unlock input
        status=INPUT_READ
    }
    else if(frame_loc==md5_loc){
        lock md5
        update hash
        md5_loc+=blocksize
        unlock md5
    }
    else if(status==INPUT_READ){
        encode
        status=ENCODED
    }
    else if(status==ENCODED && output_loc==frame_loc){
        lock output
        write
        output_loc+=blocksize
        unlock output
        status=WRITTEN
    }
    else if(status==WRITTEN && frame_loc<md5_loc){
        status=IDLE
    }
    else{
        ; //waiting for its turn to md5 or write
    }
}
This ensures the input->encode->output order for a frame and ensures the seriality of I/O/MD5 but keeps the md5 stage of a frame floating (could be done after any of input/encode/output) to try and minimise idle time.

In an actual implementation along these lines, there would also be a worker with a PARTIAL status, holding unencoded samples left over from the previous FLAC__stream_encoder_process call. The next FLAC__stream_encoder_process call would have to make sure that the PARTIAL worker, if present, is the first to read input.
Title: Re: More multithreading
Post by: rutra80 on 2023-07-18 14:52:05
15:42 of CDDA on 2x Xeon E5620 NUMA system:

V1:
-j1 -8ep - 342s
-j2 -8ep - 337s
-j4 -8ep - 116s
-j8 -8ep - 87s
-j9 -8ep - 74s
-j10 -8ep - 65s
-j11 -8ep - 59s
-j12 -8ep - 55s
-j13 -8ep - 49s
-j14 -8ep - 50s
-j15 -8ep - 43s
-j16 -8ep - 40s

V3:
-j1 -8ep - 353s
-j2 -8ep - 328s
-j4 -8ep - 122s
-j8 -8ep - 79s
-j9 -8ep - 70s
-j10 -8ep - 62s
-j11 -8ep - 56s
-j12 -8ep - 51s
-j13 -8ep - 47s
-j14 -8ep - 44s
-j15 -8ep - 41s
-j16 -8ep - 39s

On the 8-thread i7-4790K I was able to shed another 2s by running 16 threads (HyperThreading inefficiency?), but 16 seems to be the limit - how about removing it?
Title: Re: More multithreading
Post by: ktf on 2023-07-20 16:03:15
After spending quite a bit of time trying some other approaches, here is a new binary.

A word of warning first: the multithreading code got quite a bit more complicated here, and I haven't tested thoroughly yet, so it might hang or create corrupt files every now and then. Please use with caution and only for benchmarking/testing.

The changes mainly focus on making the threading more flexible, making better use of CPU resources. Whereas previous binaries saw (almost) no speed boost with settings like -8p -j2 because 1 thread was mostly idle, it should now pretty much fully utilize 2 cores. Also, using more than 2 cores for fast presets like -0 should now help, because MD5 calculation is split off from the main thread into a worker thread.

As you can see in the PDF, v4 improves -j2 for all settings, except settings -1 and -4. Improving those requires a separate solution that will be rolled out later.
Title: Re: More multithreading
Post by: music_1 on 2023-07-20 17:44:00
flac-multithreading-v4-win
AMD Ryzen 9 5950X (16 Cores 32 Threads)
Code: [Select]
timer64.exe v4 -j1 -8p -f in.wav
Global Time  =    60.407

timer64.exe v4 -j2 -8p -f in.wav
Global Time  =    37.956

timer64.exe v4 -j3 -8p -f in.wav
Global Time  =    25.652

timer64.exe v4 -j4 -8p -f in.wav
Global Time  =    19.207

timer64.exe v4 -j5 -8p -f in.wav
Global Time  =    16.313

timer64.exe v4 -j6 -8p -f in.wav
Global Time  =    14.022

timer64.exe v4 -j7 -8p -f in.wav
Global Time  =    12.405

timer64.exe v4 -j8 -8p -f in.wav
Global Time  =    10.840

timer64.exe v4 -j9 -8p -f in.wav
Global Time  =    10.399

timer64.exe v4 -j10 -8p -f in.wav
Global Time  =     8.999

timer64.exe v4 -j11 -8p -f in.wav
Global Time  =     8.374

timer64.exe v4 -j12 -8p -f in.wav
Global Time  =     7.724

timer64.exe v4 -j13 -8p -f in.wav
Global Time  =     7.558

timer64.exe v4 -j14 -8p -f in.wav
Global Time  =     6.814

timer64.exe v4 -j15 -8p -f in.wav
Global Time  =     7.532

timer64.exe v4 -j16 -8p -f in.wav
Global Time  =     6.840

Title: Re: More multithreading
Post by: Replica9000 on 2023-07-20 18:28:59
flac git-1357f844 20230720

run with -8p
Code: [Select]
 -j1: 0m43.870s
 -j2: 0m24.211s
 -j3: 0m17.975s
 -j4: 0m14.690s
 -j5: 0m12.686s
 -j6: 0m11.325s
 -j7: 0m10.291s
 -j8: 0m9.530s
 -j9: 0m9.571s
-j10: 0m9.460s
-j11: 0m9.365s
-j12: 0m9.213s
-j13: 0m9.131s
-j14: 0m9.076s
-j15: 0m9.003s
-j16: 0m8.999s
Title: Re: More multithreading
Post by: ktf on 2023-07-20 19:28:42
flac-multithreading-v4-win
AMD Ryzen 9 5950X (16 Cores 32 Threads)
[...]

I don't think the numbers are reliable enough to draw conclusions just by themselves, but comparing with v3, it seems v4 doesn't scale further, but gets there with significantly fewer threads. -j9 with v3 takes about the same time as -j7 with v4, and all higher thread counts seem to follow the same pattern: v4 does things in the same time as v3 with 2 fewer threads. This is better than I'd hoped for.

flac git-1357f844 20230720

run with -8p
[...]

This seems to get there with 1 thread fewer up to 8 threads, which is what I expected. It looks like the behaviour observed previously, where using more threads than the core count increased the time used, is no longer there.

All in all, not bad I'd say.
Title: Re: More multithreading
Post by: sundance on 2023-07-20 20:28:36
My results with the v4 binary:
Code: [Select]
-j1:	Average time =  22.865 seconds (3 rounds), Encoding speed = 472.86x
-j2: Average time =  12.113 seconds (3 rounds), Encoding speed = 892.62x
-j3: Average time =   8.367 seconds (3 rounds), Encoding speed = 1292.17x
-j4: Average time =   6.518 seconds (3 rounds), Encoding speed = 1658.88x
-j5: Average time =   5.357 seconds (3 rounds), Encoding speed = 2018.29x
-j6: Average time =   4.886 seconds (3 rounds), Encoding speed = 2213.00x
-j7: Average time =   4.840 seconds (3 rounds), Encoding speed = 2233.73x
-j8: Average time =   4.724 seconds (3 rounds), Encoding speed = 2288.90x
Excellent scaling here up to -j6 (having 6 cores here...)
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-20 21:03:12
A test with 25 random files, decoded from FLAC with a stable version of FLAC, re-encoded with git-1357f844 using -j8, and decoded again with the stable version.

Code: [Select]
a2e5ffbacccec5eeb055a9d8b86aa407  Alice In Chains - Junkhead.orig.wav
a2e5ffbacccec5eeb055a9d8b86aa407  Alice In Chains - Junkhead.wav
ea2b96ccb1700203cff1febcf3583b4f  Assemblage 23 - Pages.orig.wav
ea2b96ccb1700203cff1febcf3583b4f  Assemblage 23 - Pages.wav
e68c9d86855cdeba088fdd90c9edb386  Blutengel - Ich Bin Das Feuer.orig.wav
e68c9d86855cdeba088fdd90c9edb386  Blutengel - Ich Bin Das Feuer.wav
c154800d3fd2ede0bed7e0a25b582bdb  Chimaira - Left For Dead.orig.wav
c154800d3fd2ede0bed7e0a25b582bdb  Chimaira - Left For Dead.wav
d237c825964d7b2718b287a36a5911be  Chimaira - The Flame.orig.wav
d237c825964d7b2718b287a36a5911be  Chimaira - The Flame.wav
34a76a7eba7069fb8ebcdfe2995f5cfa  Eisbrecher - Nein Danke.orig.wav
34a76a7eba7069fb8ebcdfe2995f5cfa  Eisbrecher - Nein Danke.wav
7d0e21fc23630e5c559b2ebb2ab300b5  Eisbrecher - Unschuldsengel.orig.wav
7d0e21fc23630e5c559b2ebb2ab300b5  Eisbrecher - Unschuldsengel.wav
0b601ad4ea94c43e58182736f94d9e07  Five Finger Death Punch - The Agony Of Regret.orig.wav
0b601ad4ea94c43e58182736f94d9e07  Five Finger Death Punch - The Agony Of Regret.wav
bca9d450c42bfef2c9620b5b1c68e81a  Five Finger Death Punch - You.orig.wav
bca9d450c42bfef2c9620b5b1c68e81a  Five Finger Death Punch - You.wav
e84ea2925ec4eb1d55f3f878813bba3b  KMFDM - Last Things.orig.wav
e84ea2925ec4eb1d55f3f878813bba3b  KMFDM - Last Things.wav
5261e8a315ac5f6f3862376e343dacb3  Linkin Park - Shadow Of The Day.orig.wav
5261e8a315ac5f6f3862376e343dacb3  Linkin Park - Shadow Of The Day.wav
7fb6a83e650f642e4061984e63797929  Megadeth - I Know Jack.orig.wav
7fb6a83e650f642e4061984e63797929  Megadeth - I Know Jack.wav
bd0e25c7b0f69667d907f24d42258475  Megadeth - The Right To Go Insane.orig.wav
bd0e25c7b0f69667d907f24d42258475  Megadeth - The Right To Go Insane.wav
ed2905920e3d3d9d3c2071cf14255332  Metallica - Holier Than Thou.orig.wav
ed2905920e3d3d9d3c2071cf14255332  Metallica - Holier Than Thou.wav
eb4072314e7641ab14907cbb5a183976  Nine Inch Nails - All The Pigs, All Lined Up.orig.wav
eb4072314e7641ab14907cbb5a183976  Nine Inch Nails - All The Pigs, All Lined Up.wav
06ed26ea67393804675ef2e6305f77bd  Nine Inch Nails - Head Like A Hole (Clay).orig.wav
06ed26ea67393804675ef2e6305f77bd  Nine Inch Nails - Head Like A Hole (Clay).wav
e90672c5b36aed09b38647d907d3586f  Project Pitchfork - Schalt Und Rauch.orig.wav
e90672c5b36aed09b38647d907d3586f  Project Pitchfork - Schalt Und Rauch.wav
daaa63180a26a2716c886f8eee07d7e9  Sepultura - Slaves Of Pain.orig.wav
daaa63180a26a2716c886f8eee07d7e9  Sepultura - Slaves Of Pain.wav
ac19a7d52d6e00545da33668b6ea26c8  Sepultura - We Who Are Not As Others.orig.wav
ac19a7d52d6e00545da33668b6ea26c8  Sepultura - We Who Are Not As Others.wav
0aa52479725fe6e755e00ffda36d1191  Spineshank - 40 Below.orig.wav
0aa52479725fe6e755e00ffda36d1191  Spineshank - 40 Below.wav
a186a0036687c2e96c17e20a71d9116a  Stone Temple Pilots - Sin.orig.wav
a186a0036687c2e96c17e20a71d9116a  Stone Temple Pilots - Sin.wav
c969721b69ca499c50e4643ef6cdddda  Tantric - I'll Stay Here.orig.wav
c969721b69ca499c50e4643ef6cdddda  Tantric - I'll Stay Here.wav
e24100e76d3af18b19a23d9156986331  Taproot - Myself.orig.wav
e24100e76d3af18b19a23d9156986331  Taproot - Myself.wav
0b6a1765390f6451588e0984595ddec9  The Crystal Method - Jaded.orig.wav
0b6a1765390f6451588e0984595ddec9  The Crystal Method - Jaded.wav
ec2cf54651c01c1ecec17a0f5fb18225  Van Halen - One Foot Out The Door.orig.wav
ec2cf54651c01c1ecec17a0f5fb18225  Van Halen - One Foot Out The Door.wav
Title: Re: More multithreading
Post by: Porcus on 2023-07-20 21:39:59
TL;DR of Reply 73: All match.
Title: Re: More multithreading
Post by: rutra80 on 2023-07-20 22:08:12
V3:
-j1 -8ep - 103s
-j2 -8ep - 118s
-j4 -8ep - 37s
-j8 -8ep - 26s
:(
V4:
-j1 -8ep - 100s
-j2 -8ep - 51s
-j4 -8ep - 29s
-j8 -8ep - 22s
 :-*

Abusing 16 threads gives the same time as real 8.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-20 22:44:08
flac git-f8cb7f08.  The times are very similar to flac git-1357f844. 

Code: [Select]
 -j1: 0m43.917s
 -j2: 0m24.247s
 -j3: 0m18.118s
 -j4: 0m14.752s
 -j5: 0m12.700s
 -j6: 0m11.321s
 -j7: 0m10.283s
 -j8: 0m9.496s
 -j9: 0m9.571s
-j10: 0m9.408s
-j11: 0m9.359s
-j12: 0m9.231s
-j13: 0m9.094s
-j14: 0m9.077s
-j15: 0m8.990s
-j16: 0m8.954s

Title: Re: More multithreading
Post by: Wombat on 2023-07-21 01:45:07
Again my simple numbers. Looks fast and scaling is fine.
Code: [Select]
v1 vs v3 vs v4
j1 103x  104x  106x
j2 106x  115x  207x
j3 203x  225x  306x
j4 298x  326x  402x
j5 381x  426x  492x
j6 460x  521x  566x
j7 543x  615x  645x
j8 620x  670x  705x
j9 685x  710x  768x
j10 705x  625x  765x
j11 725x  670x  699x
j12 740x  680x  688x
j13 750x  675x  683x
j14 752x  670x  676x
j15 750x  650x  675x
Title: Re: More multithreading
Post by: ktf on 2023-07-21 06:23:38
My results with the v4 binary:
[...]
Excellent scaling here up to -j6 (having 6 cores here...)
Good to see. As you can imagine, getting this right for faster presets is more difficult than for slower presets. Of course, -7 isn't particularly fast, but it is quite a bit faster than -8p. Also, I think this scales better on Linux (where pthreads is native) than on Windows (where pthreads is 'emulated') so these numbers on Windows are very nice I'd say.

For example, scaling for the really fast presets like -0 and -3 stopped after 2 threads already because MD5 was 'blocking'. With these changes, using 3 threads is almost 3x as fast as 1 thread, which I think is a big win. Mainly theoretically of course, because I don't think many people will use such a fast preset with multithreading, but still, it is nice that it works.

flac git-f8cb7f08.  The times are very similar to flac git-1357f844. 
Yes, the only change was a small fix for building with multithreading disabled.

[...]
Abusing 16 threads gives the same time as real 8.
I still don't know why -j2 was so much slower than -j1 on your system with v3, but good to see this has been fixed.

Again my simple numbers. Looks fast and scaling is fine.
[...]
This seems to contradict the results of music_1 though. Your system and music_1's have the highest physical core counts. I don't know what causes this difference.

Title: Re: More multithreading
Post by: Wombat on 2023-07-22 02:15:32
When going above -j12 with my 12-core/24-thread CPU, -8ep still sees small benefits up to -j16, but from -j17 on it seems to become extremely slow, down to single-thread speed. Was there a mention of this limit that I overlooked?
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-22 03:17:55
When going above -j12 with my 12-core/24-thread CPU, -8ep still sees small benefits up to -j16, but from -j17 on it seems to become extremely slow, down to single-thread speed. Was there a mention of this limit that I overlooked?

My CPU only has 16 threads.  If I try to use more, I get: "WARNING, cannot set number of threads: too many"

I thought maybe FLAC refuses to use more than the available threads, but I see this in the code:
Code: [Select]
#define FLAC__STREAM_ENCODER_MAX_THREADS 16
#define FLAC__STREAM_ENCODER_MAX_THREADTASKS 34


I changed it to 32 and 68 respectively, and I can use up to 32 threads now.
Title: Re: More multithreading
Post by: Wombat on 2023-07-22 03:21:25
Nice find, thanks. Let's wait for ktf to explain the reason for this.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-22 03:45:59
I built this from my phone, so can't test. 

FLAC 64-bit Windows.  Static binary, 32 threads enabled. 
Edit: built with no ASM optimizations. (Faster 16-bit encoding)
Title: Re: More multithreading
Post by: Wombat on 2023-07-22 04:49:17
Cool, thanks!
It still scales a little up to j24. -8ep -V
j12 182x
j16 196x
j24 206x
Title: Re: More multithreading
Post by: ktf on 2023-07-22 06:33:49
Nice find, thanks. Let's wait for ktf to explain the reason for this.
I have to put a limit somewhere, because some memory allocation happens statically. It seemed reasonable to put it at 16. Looking at your data, twice the number of threads for a 10% gain doesn't really seem worthwhile, so 16 still seems pretty reasonable.
Title: Re: More multithreading
Post by: cid42 on 2023-07-22 09:46:46
It might make sense for the thread count to track consumer x86_64 physical cores, which currently tops out at 24 with the 13900k, or track consumer threads which currently tops at 32. The biggest x86 server chip is bergamo zen4c with 128 cores, but if any of us interact with it it's likely only a few cores at a time in the cloud. Is there any scope to reduce the per-core memory footprint?
Title: Re: More multithreading
Post by: rutra80 on 2023-07-22 11:23:02
@Replica9000 for CDDA with -8ep your compile seems 13-28% slower on my i7-4790K.
Title: Re: More multithreading
Post by: ktf on 2023-07-22 11:57:26
It might make sense for the thread count to track consumer x86_64 physical cores, which currently tops out at 24 with the 13900k, or track consumer threads which currently tops at 32.
At such high thread counts, there is a tremendous amount of overhead. As Wombat's results showed, there is very little gain. Sure, I could increase the max number of threads, but would it really make sense? I said static memory allocation was a problem, but now that I've checked, it isn't really: increasing the max thread count by 1 results in static allocation of 3 extra pointers (which are 8 bytes each).
Quote
Is there any scope to reduce the per-core memory footprint?
I don't think that is really necessary; FLAC already uses memory very efficiently. Memory measurements are rather erratic, but I've tried anyway.

Results with -8
Code: [Select]
~$ sleep 1; for I in {1..3}; do for J in {1..16}; do echo -n "$J "; /usr/bin/time -v ./flac-v4 -fsj$J -8 /media/test.wav /media/test.wav /media/test.wav /media/test.wav /media/test.wav 2>&1 | grep "Maximum resident"; done; done
1 Maximum resident set size (kbytes): 3584
2 Maximum resident set size (kbytes): 7264
3 Maximum resident set size (kbytes): 7816
4 Maximum resident set size (kbytes): 8372
5 Maximum resident set size (kbytes): 8164
6 Maximum resident set size (kbytes): 11196
7 Maximum resident set size (kbytes): 12600
8 Maximum resident set size (kbytes): 13516
9 Maximum resident set size (kbytes): 14052
10 Maximum resident set size (kbytes): 16736
11 Maximum resident set size (kbytes): 19920
12 Maximum resident set size (kbytes): 18620
13 Maximum resident set size (kbytes): 19632
14 Maximum resident set size (kbytes): 23604
15 Maximum resident set size (kbytes): 24572
16 Maximum resident set size (kbytes): 25028

With larger blocksizes this increases quite a bit. With a blocksize of 32768:
Code: [Select]
1 	Maximum resident set size (kbytes): 5456
4 Maximum resident set size (kbytes): 26384
8 Maximum resident set size (kbytes): 63704
12 Maximum resident set size (kbytes): 78156
16 Maximum resident set size (kbytes): 108120

With a blocksize of 32768 and -r 15 this increases even more
Code: [Select]
1 	Maximum resident set size (kbytes): 7552
4 Maximum resident set size (kbytes): 45320
8 Maximum resident set size (kbytes): 84968
12 Maximum resident set size (kbytes): 126768
16 Maximum resident set size (kbytes): 159012

So, memory usage is already highly dynamic. I wouldn't know where I could cut down. Also, 25MB for 16 cores isn't much really.
Title: Re: More multithreading
Post by: cid42 on 2023-07-22 12:22:29
Wombat's CPU is 12c24t. The 13900K is 8P+16E, aka 24c32t. The P/E-core split muddies the waters since the E cores are clocked lower, but there's a good chance there's decent scaling up to 24 flac threads.

A rule of thumb for hyperthreading is that it normally provides a -5% to +30% benefit relative to no SMT depending on the workload, with outliers in both directions. It's no surprise that Wombat shows a +13% benefit from -j12 to -j24.
Title: Re: More multithreading
Post by: ktf on 2023-07-22 13:57:17
Wombat's CPU is 12c24t.
Forgot about that bit.

Quote
The 13900k is 8p+16e aka 24c32t.
I'm curious as to whether this code properly scales on such heterogeneous architectures in general. In v1 and v3, threads couldn't 'leapfrog' each other, so threads would need to be rotated over P and E cores to stay in sync, or else threads would have to idle. With v4, threads can in fact leapfrog each other (one thread can do three frames while another does two for example), so this should scale reasonably well.

On my 4 core Linux PC (i7-4710MQ), it does scale very well. For setting -8, using 4 threads gives a 3.9x speedup and on -5 it gives a 3.75x speedup.
Title: Re: More multithreading
Post by: Wombat on 2023-07-22 15:04:14
A rule of thumb for hyperthreading is that it normally provides a -5% to +30% benefit relative to no SMT depending on the workload, with outliers in both directions. It's no surprise that Wombat shows a +13% benefit from -j12 to -j24.
-j16 and -j24 trigger the same 142 W power limit here, so any gain raises the efficiency imho.
Title: Re: More multithreading
Post by: Wombat on 2023-07-22 15:37:24
@Replica9000 for CDDA with -8ep your compile seems 13-28% slower on my i7-4790K.
Replica9000 compiled without ASM optimizations, as he mentions. That gives a good performance boost with 16-bit audio on some modern CPUs like my Ryzen 5900X; the compiler does well there. Unfortunately some older CPUs can't benefit, and it makes them slower.
Our member sundance already experienced and benchmarked that with an Intel 8700.
The same thing happens to a smaller degree when using the additional GCC compiler flag -falign-functions=32 (default 16).
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-22 15:58:20
Same build as above, with ASM optimizations.
Static Win64 binary.
Title: Re: More multithreading
Post by: Porcus on 2023-07-22 18:10:12
First, this question for development:
If you call the encoder to process multiple files, isn't that where you can multi-thread with very little overhead? Sure, the audio will have different lengths, but still: if I call (possibly with options) flac -8p *.wav, or possibly for that matter flac -2ef  flacfileencodedwith_-0b56789.flac flacfileenccodedwith_-Mb32_-l23.flac longfile.rf64 outrageouslylongfile.w64 veryshortaudiofilewith2GBheaders.wav, and the executable can spawn multiple threads, then what?

The reason to ask this first is the question of what we should measure - and what utilities to use to read off the numbers. That depends on the purpose, I guess:
timer64: "Global time" surely, but also "Process time" - or some other utility?
PowerShell's Measure-Command returns the execution time, but nothing else?
Title: Re: More multithreading
Post by: bennetng on 2023-07-22 19:01:49
timer64: "Global time" surely, but also "Process time" - or some other utility?
PowerShell's Measure-Command returns the execution time, but nothing else?
WavPack for example has built-in benchmark so flac can try this too, at least for test builds.

Another thing is that my Linux vs Windows benchmarks indicate that Linux seems to perform better with lower thread counts while Windows does the opposite. I don't know if this is expected or due to differences in measurement methods. With a built-in benchmark I wouldn't need to worry about this.
https://hydrogenaud.io/index.php/topic,124437.msg1030148.html#msg1030148
Title: Re: More multithreading
Post by: ktf on 2023-07-22 21:13:27
If you call the encoder to process multiple files, isn't that where you can multi-thread with very little overhead?
Yes, of course. But that would only benefit the flac command line tool, and I was worried about how the console output could be made easy to understand. Also, multithreading over files is already possible with various utilities like GNU parallel. Multithreading over a single file can benefit all libFLAC users, and is not achievable with other tools.

Quote
The reason to ask this first is the question of what we should measure - and what utilities to use to read off the numbers.
Most importantly wall time. Second, wall time with 1 thread divided by wall time with X number of threads. On my machine, this gives me:
For setting -8, using 4 threads gives a 3.9x speedup and on -5 it gives a 3.75x speedup.
I'd say these are the only numbers that are interesting to the end user: how much do we gain, and how efficient is it.

edit:
Another thing is that my Linux vs Windows benchmarks indicate that Linux seems to perform better with lower thread counts while Windows does the opposite. I don't know if this is expected or due to differences in measurement methods. With a built-in benchmark I wouldn't need to worry about this.
I don't think measuring wall time is particularly complicated, so I don't think there is much difference in such a measurement. CPU time is difficult, of course. However, threading is heavily dependent on the kernel, and I've seen quite different behaviour, with some bugs only showing up on Linux and others only on Windows. I can't differentiate between what is the kernel and what is libFLAC, but I don't think the timer utility is to blame here.
Title: Re: More multithreading
Post by: sundance on 2023-07-23 09:40:53
Out of curiosity I ran my test files with ktf's v4 binary with lower settings:
-5:
Code: [Select]
-j1:    Average time =  14.054 seconds (3 rounds), Encoding speed = 769.30x
-j2:    Average time =   7.637 seconds (3 rounds), Encoding speed = 1415.74x
-j3:    Average time =   5.364 seconds (3 rounds), Encoding speed = 2015.79x
-j4:    Average time =   4.172 seconds (3 rounds), Encoding speed = 2591.36x
-j5:    Average time =   4.166 seconds (3 rounds), Encoding speed = 2595.30x
-j6:    Average time =   4.817 seconds (3 rounds), Encoding speed = 2244.71x
-j7:    Average time =   5.061 seconds (3 rounds), Encoding speed = 2136.34x
-j8:    Average time =   5.175 seconds (3 rounds), Encoding speed = 2089.41x
-0
Code: [Select]
-j1:    Average time =   9.710 seconds (3 rounds), Encoding speed = 1113.53x
-j2:    Average time =   5.570 seconds (3 rounds), Encoding speed = 1941.00x
-j3:    Average time =   4.194 seconds (3 rounds), Encoding speed = 2578.17x
-j4:    Average time =   5.593 seconds (3 rounds), Encoding speed = 1933.02x
-j5:    Average time =   6.210 seconds (3 rounds), Encoding speed = 1740.97x
-j6:    Average time =   6.525 seconds (3 rounds), Encoding speed = 1657.01x
-j7:    Average time =   6.838 seconds (3 rounds), Encoding speed = 1581.09x
-j8:    Average time =   6.995 seconds (3 rounds), Encoding speed = 1545.68x
No matter what compression level I used, I couldn't get it faster than some 4.2 seconds; only the point where the scaling flattens shifts later or earlier.
Btw. the mere time to copy the 40 WAVs (2 GB) to a different folder on the same SSD is ~0.3-0.4 secs (copy *.wav wav2 /q); calculating the MD5s is in the 3-second ballpark.

P.S.: with "-5 --no-md5-sum" the speed limit here is 3.492 seconds @ -j4.
Title: Re: More multithreading
Post by: rutra80 on 2023-07-23 11:35:06
Let's be careful not to wander into FLACCL territory, where it encodes at a 999999x rate but initializes for several seconds on every file, ending up slower than FLAC.
Title: Re: More multithreading
Post by: rutra80 on 2023-07-23 12:45:57
Let's be careful not to wander into FLACCL territory, where it encodes at a 999999x rate but initializes for several seconds on every file, ending up slower than FLAC.
1:02:56 of CDDA on i7-4790K with NVMe:

-j8:
-8 - 3,67s
-7 - 2,86s
-6 - 2,80s
-5 - 2,60s
-4 - 3,96s
-3 - 2,20s
-2 - 3,70s
-1 - 2,96s

-j4:
-8 - 4,91s
-7 - 3,01s
-6 - 2,63s
-5 - 2,04s
-4 - 3,87s
-3 - 2,18s
-2 - 2,84s
-1 - 3,28s

-j2:
-8 - 8,10s
-7 - 5,36s
-6 - 4,81s
-5 - 3,45s
-4 - 3,90s
-3 - 2,61s
-2 - 2,71s
-1 - 2,97s

Yep, something's funky with the scaling already; with -j1 it's fine.
Title: Re: More multithreading
Post by: Porcus on 2023-07-23 13:01:29
If you call the encoder to process multiple files, isn't that where you can multi-thread with very little overhead?
Yes, of course. But that would only benefit the flac command line tool, and I was worried about how the console output could be made easy to understand.
Suggestion for that case with four concurrent files:

file1 started encoding
file2 started encoding
file3 started encoding
file4 <uses the last thread, output as usual counting up>
file2: wrote 12345678 bytes, ratio=0,543
file1: 33% complete, ratio=0,628 <this is a single status report>
file3: 11% complete, ratio=1,000 <this is a single status report>
file5 <uses the last thread, output as usual counting up>
file4: wrote 23456789 bytes, ratio=0,555
file6 <uses the last thread, output as usual counting up>
file1: wrote 33333333 bytes, ratio=0,666
file5: wrote 11111111 bytes, ratio=0,567
file3: wrote 98765432 bytes, ratio=1,000-  < <-- I propose a "-" to signify that it is smaller than the original even if that is beyond the third decimal. And a "+" for say 1,00001. But I don't miss the old failure report.>

Also, multithreading over files is already possible with various utilities like GNU parallel. The approach with multithreading over a single file can benefit all libFLAC users, and is not achievable with other tools.
Yes - multithreading over multiple files was not at all meant as a substitute for multithreading over a single file. But if certain single files are hard to improve upon, consider whether that will make a difference to the user.
Say we have taken note that it is hard to make good use of multithreading for a short file encoded with a low preset. Possibly you could consider the following line of argument (subject to it being anywhere remotely close to the facts, I am quiiiite ignorant here):
Maybe this could eliminate the work of trying to improve scenarios where the impact won't matter to the users?


Quote
The reason to ask this first is the question of what we should measure - and what utilities to use to read off the numbers.
Most importantly wall time.
Obviously for the end result. But for testing, you don't get much useful extra information from including anything else?
Title: Re: More multithreading
Post by: ktf on 2023-07-23 13:53:18
Yep, something's funky with the scaling already; with -j1 it's fine.
What do you mean? That -1 and -4 are different? This has been mentioned in the thread start, reply #2, #18, #19, #30 and #68. Otherwise, I don't know what you mean.

Suggestion for that case with four concurrent files:
[...]
Very messy. I would get lost in that.

Quote
Say, we have taken note that it is hard to make good use of multi-threading a short file to be encoded with low preset.
Have you tested with short files? The impact doesn't seem to be too severe. If I take CDDA input files of 1 second (so 44100 samples), I'm still seeing net gains, not losses, when multithreading. For example, using -8 -j4 on my 4-core machine gives a 1.9x speedup. With preset -0 I still get a 1.4x speedup with 4 threads. So the overhead of setting up and destroying threads isn't too much.

Quote
Possibly you could consider the following line of arguments - subject to being anywhere remotely close to the fact, I am quiiiite ignorant here:
  • If it is just one single file, it will be done in one second anyway, you can get it down to half a second but who cares if you cannot get it down to a third of a second - end-users get impatient over seconds to wait, not over percentages;
We're not talking about seconds here, but milliseconds with current CPUs. Seriously, encoding 1600 such 1-second files takes 10 seconds in total when single-threading with preset -8, and 5 seconds with -j4.

Also, a program should act somewhat predictably to an end user. If the command line tool processes files in a different order from the one the user supplied, to improve throughput, that is going to be confusing.

Quote
Maybe this could eliminate the work of trying to improve scenarios where the impact won't matter to the users?
The problem is that I cannot determine, for all the systems that flac can run on, which scenarios have impact and which do not.

Quote
Quote
The reason to ask this first is the question of what we should measure - and what utilities to use to read off the numbers.
Most importantly wall time.
Obviously for the end result. But for testing, you don't get much useful extra information from including anything else?
I don't know any.
Title: Re: More multithreading
Post by: Porcus on 2023-07-23 15:03:49
Quote
Say, we have taken note that it is hard to make good use of multi-threading a short file to be encoded with low preset.
Have you tested with short files? The impact doesn't seem to be too severe. If I take CDDA input files of 1 second (so 44100 samples), I'm still seeing net gains, not losses, when multithreading. For example, using -8 -j4 on my 4-core machine gives a 1.9x speedup.
I was thinking about the -0 end and not the -8 end ... but anyway:
Testing a compilation album with short songs, not atypical for the genre: https://nocleansinging.bandcamp.com/album/hold-fast-grindviolence-compilation - free download for anyone to replicate the experiment on their computers
30:26 long, 20 tracks. 23:49 is CDDA, 997 kbit/s at -5 (yes noisy), 4:07 is 44.1/24 at 1698, and 2:31 is 96/24 at 3409.
for %j IN (1,2,4,8,16) DO (timeout /t 8 & \bin\timer64.exe flac-multithreading-v4.exe -ss -j%j -f <setting> *.flac )
8 seconds is maybe not much cooldown time, but it's quite a lot compared to the busy times. And -j16 is supposed to be useless on a 4-core/8-thread i5-1135G7 (throwing it in just to verify it doesn't make a mess of anything):

--lax -0b16384 --no-md5-sum where -j4 takes more time than -j2 even if occupying more threads. Times -j1 -j2 -j4 -j8 -j16 are:
     2.955    2.166    2.193    2.316    2.337   
-0b4096  where again -j4 takes more time than -j2
     3.414    2.144    2.311    2.310    2.444   
-2e and at this stage I wonder if I should have run -j3 and -j5 and the whole thing
     4.118    2.640    2.895    2.990    3.091   
-5 and finally -j4 catches -j2, but -j8 doesn't improve over -j4
     4.554    3.045    2.859    2.953    3.068   
-8 and here -j4 does save considerable time.
     9.578    5.624    3.895    3.844    3.826   

So up to -5-ish, running -j4 / -j8 (/-j16) just means that it fires up 4 / all 8 (/ditto) threads to do the same work as -j2 does.
Do I interpret that correctly: it fires up 2 or 6 extra threads only to do the extra work created by the overhead? That is a waste. If I want to put my CPU to work for two seconds and can get the job done in four thread-seconds, then spending sixteen thread-seconds probably makes several times as much heat - which would translate to a huge increase in duration if I were running a week-long job where the CPU is pretty much throttled by the heat?


Quote
Maybe this could eliminate the work of trying to improve scenarios where the impact won't matter to the users?
The problem is that I cannot determine, for all systems that flac can run on, which scenarios have impact and which do not.
And that just makes my argument even better (for you): If multi-threading multi-files means that users are not going to invoke <particular single-file scenario> so often, you don't need to worry so much about it, as you would have if you just presume that all multi-threading is run on single files.
Title: Re: More multithreading
Post by: bennetng on 2023-07-23 16:00:02
I think the main drawback of wall time is that it includes everything: antivirus updating in the background, Windows telemetry, and things like that.

Honestly, at this moment I hope the main focus is still single-file multithreading, with a secret (now that I mentioned it, no longer secret) wish of variable-blocksize development that may utilize some threads, which is much more rewarding than the pathetic -pe combination.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-23 17:53:31
I don't know how Windows calculates the time a process uses.  On Linux, the time command gives 3 results: real, user and sys. Real is the wall time, user is how much time the process spends outside the kernel, and sys is how much time it spends within the kernel.

Running FLAC with one thread to a ramdisk (tmpfs) on my input gives me this:
Code: [Select]
real    0m43.619s
user    0m43.106s
sys     0m0.512s
user + sys = 0m43.618s.  I don't really have anything else on my system using resources other than the browser.


Running with two threads to ramdisk:
Code: [Select]
real    0m23.948s
user    0m47.475s
sys     0m0.376s
(user + sys) / jobs = 0m23.925s


Running with 8 threads to ramdisk:
Code: [Select]
real    0m8.575s
user    1m7.709s
sys     0m0.568s
(user + sys) / jobs = 0m8.535s


Running with 8 threads to disk (zfs):
Code: [Select]
real    0m40.068s
user    1m14.153s
sys     0m2.573s
(user + sys) / jobs = 0m9.590s
So in this case, FLAC only needed 9.59s to do its thing, but writing to disk slowed down the process by an additional 30s (I'm running ZFS on a single disk and random I/O suffers).
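Replica9000's arithmetic above generalizes to a one-liner. As a hedged illustration (the function name is mine; the figures are the ones quoted in this post), the (user+sys)/jobs estimate and the implied I/O stall fall out directly:

```python
def ideal_wall(user: float, sys: float, jobs: int) -> float:
    """Idealized wall time if total CPU time were split perfectly over `jobs` threads."""
    return (user + sys) / jobs

# 8 threads to ramdisk: real 8.575s, user 67.709s, sys 0.568s
ramdisk = ideal_wall(67.709, 0.568, 8)   # ~8.535s, matching the measured 8.575s
# 8 threads to zfs: real 40.068s, user 74.153s, sys 2.573s
zfs = ideal_wall(74.153, 2.573, 8)       # ~9.59s
io_stall = 40.068 - zfs                  # the rest of the measured 40.068s is I/O stall
```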
Title: Re: More multithreading
Post by: Porcus on 2023-07-23 23:49:35
Warning for information overload here.

I ran a variety of settings through version 3 and version 4 (note: only -j1, -j2, -j4 and -j8). Every figure is after a 120-second pause for cooldown. I suspect that wasn't always enough.
Times were recorded with the timer64 utility. I don't know what process time is worth, but those figures are surprising: there are big differences from version 3 to version 4, where the latter frequently measures much higher; on two computers with fans that happens at the -5 and -2 settings, but on my fanless (hence throttling) home desktop it happens at the heavier -8xx settings.
But when process time gets so high, is that because it wastes processing power on overhead, or is it something else?

Ran on three computers, all with Intels 4cores8threads CPUs.
Common observation: for the -0 settings, one can stick to -j2.

Results from a HP Prodesk with i5-7500T (same as here) (https://hydrogenaud.io/index.php/topic,123025.msg1030498.html#msg1030498). In version 4, -j8 slows down Global time compared to -j4 (and sometimes made -j8 slower than version 3).
Code: [Select]
setting       j1 proc j1 glob  j2 proc j2 glob  j4 proc j4 glob  j8 proc j8 glob
-8pr7  v3         638     639      667     633      683     224      688     189
       v4         617     618      648     325      665     168      690     176
-8er7  v3         692     692      718     683      743     243      747     207
       v4         670     671      694     348      728     184      750     191
-8r7   v3         176     177      190     161      191      57      191      51
       v4         172     172      180      91      186      48      213      56
-8r0   v3         156     157      171     144      177      63      169      46
       v4         156     157      161      82      166      43      198      52
-5q14  v3          67      67       73      44       75      30       75      31
       v4          66      67       70      36       73      21      108      30
-5q6   v3          67      68       74      44       75      30       75      31
       v4          67      68       70      36       73      21      102      30
-2er7  v3         105     106      122      83      120      34      125      39
       v4         101     102      106      54      111      30      121      33
-0mr0  v3          51      52       57      34       65      35       65      35
       v4          51      52       55      29       81      28      117      42
-0Mr0  v3          47      48       55      33       55      33       56      34
       v4          48      48       59      36       60      36       60      36
-0r0   v3          46      47       54      33       60      36       59      36
       v4          46      46       50      26       78      28      120      44
(times in seconds: "proc" = Process time, "glob" = Global/wall time)
Notice that there are some -j8 settings where version 4 boosts "Process time" quite a lot: the "-5" settings and the "-0" settings, except "-0Mr0" (the "soft" mid/side).


Same test run on a Dell business laptop, i7-1185G7. Here -j8 is a good thing for the -8-based settings; but compare to version 3 at the -8 -j8 settings.
Code: [Select]
setting       j1 proc j1 glob  j2 proc j2 glob  j4 proc j4 glob  j8 proc j8 glob
-8pr7  v3         581     592      713     683      799     300      964     153
       v4         688     696      849     435      860     229     1008     150
-8er7  v3         716     723      751     719      852     319     1018     162
       v4         692     698      902     458      903     237     1197     173
-8r7   v3         170     178      203     185      258      97      231      50
       v4         198     210      218     126      209      71      239      47
-8r0   v3         154     174      167     138      199      81      214      45
       v4         145     161      194     108      198      66      224      49
-5q14  v3          62      77       71      60       73      35       73      42
       v4          61      77       77      53       79      35       98      31
-5q6   v3          58      65       72      62       74      41       75      38
       v4          59      72       78      55       78      34      102      32
-2er7  v3          95     108      118     101      134      57      130      45
       v4          94     106      122      75      128      46      172      44
-0mr0  v3          44      60       57      44       58      45       58      44
       v4          45      60       54      38       62      37      113      44
-0Mr0  v3          40      50       52      47       53      43       51      38
       v4          40      57       54      54       54      52       54      53
-0r0   v3          39      55       50      36       50      39       51      43
       v4          45      63       49      43       64      37      121      58
Process time numbers jump on the same spots in the table, but also -2er7.
The top-left result (-8pr7 -j1 on version 3) was the first to be run, and if 2 minutes of cooldown was too little (which I suspect), it might read too low, since it started from a longer cooldown while I fiddled a little back and forth.

Now on my usual fanless desktop which throttles at will and produces unreliable numbers (CPU: i5-1135G7), the bottom of the table deviates slightly:
Code: [Select]
setting       j1 proc j1 glob  j2 proc j2 glob  j4 proc j4 glob  j8 proc j8 glob
-8pr7  v3         449     451      442     452      378     194      603     143
       v4         441     450      475     248      546     160      978     135
-8er7  v3         475     485      461     470      398     200      698     156
       v4         473     481      509     260      645     178     1037     137
-8r7   v3         123     123      105     110       63      42       89      31
       v4         123     123      117      63       84      37      162      31
-8r0   v3         107     107       93      97       48      37       81      28
       v4         107     107      106      59       65      33      136      28
-5q14  v3          41      47       25      34       23      24       24      24
       v4          41      51       34      28       11      18       13      20
-5q6   v3          42      46       24      34       24      23       23      23
       v4          41      46       34      28       10      18       14      20
-2er7  v3          64      68       33      58       13      26       28      28
       v4          64      67       55      40       45      25       25      22
-0mr0  v3          30      34       12      23       23      25       22      25
       v4          30      36        9      23        8      17       13      31
-0Mr0  v3          27      32       14      24       13      24       13      24
       v4          27      32       16      30       16      30       14      30
no-md5 v3          18      23        6      19       11      15       11      15
       v4          18      23       14      15        8      16        8      16
TAK -p0:
 -md5              55               40               41
 no MD5            47               32               28           n/a for TAK
Here it says "no-md5", that is -0r0 --no-md5-sum, instead of the ordinary -0r0 I ran above.
But anyway, here the high "Process" times are on the -8 settings.
Also included, for comparison: TAK at its fastest setting, -p0. MD5 summing is optional in TAK, and seems to remove some of the benefits from the multithreading, which for TAK is capped at 4 threads. Times here were recorded differently, with echo:|time .
Title: Re: More multithreading
Post by: Porcus on 2023-07-24 09:18:21
Two remarks on apparently "slow" speeds: TAK and the Dell laptop.

TAK. I had expected it to run faster, but it boils down to how fast (single-threaded) flac has become. Bragging rights to @ktf here.
In ktf's comparison studies, nothing encodes as fast as TAK -p0 - also verified on a couple of Intel CPUs in addition to the main study (https://hydrogenaud.io/index.php/topic,122508.msg1024512.html#msg1024512). Here it didn't run any faster than flac -5. (Curiously too, on these eleven CDs - the *j*.wav part of my signature - it didn't even compress better. But that doesn't generalize ...)
So I casually ran 1.3.4 at -5. Process/global times 52 and 58 seconds, indicating that the new builds are 1/6th faster. And -0Mr0: 35 and 47. Ran again and got exactly the same.
So on this computer, TAK -p0 was tied to old flac -0Mr0. But the fixed-predictor speedups since 1.3.4 are quite formidable, so finally TAK -p0 is getting dethroned at plain speed ... at least on a modern CPU.

Then the Dell laptop in the middle table is surprisingly slow given that the CPU is supposed to be better at every parameter: https://www.cpubenchmark.net/compare/2917vs3793vs3830 . I see it is set up with a pagefile, but if I/O were a concern it should be much more visible at the -0 settings. RAM is 16 GB on all.
There must be some more aggressive BIOS-controlled throttling going on, to save the user's lap from getting burned, I guess. Whereas the fanless computer, which has a heatsink body around a NUC board, runs too hot to touch ... maybe that actually dissipates more heat than an awfully noisy laptop fan would, but I am surprised at the impact. Maybe I should check if I can downclock it slightly.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-24 19:50:08
Quote
I don't know how Windows calculates the time a process uses.  On Linux, the time command gives 3 results: real, user and sys. Real is the wall time, user is how much time the process spends outside the kernel, and sys is how much time it spends within the kernel.

Running FLAC with one thread to a ramdisk (tmpfs) on my input gives me this:
Code: [Select]
real    0m43.619s
user    0m43.106s
sys     0m0.512s
user + sys = 0m43.618s.  I don't really have anything else on my system using resources other than the browser.


Running with two threads to ramdisk:
Code: [Select]
real    0m23.948s
user    0m47.475s
sys     0m0.376s
(user + sys) / jobs = 0m23.925s


Running with 8 threads to ramdisk:
Code: [Select]
real    0m8.575s
user    1m7.709s
sys     0m0.568s
(user + sys) / jobs = 0m8.535s


Running with 8 threads to disk (zfs):
Code: [Select]
real    0m40.068s
user    1m14.153s
sys     0m2.573s
(user + sys) / jobs = 0m9.590s
So in this case, FLAC only needed 9.59s to do its thing, but writing to disk slowed down the process by an additional 30s (I'm running ZFS on a single disk and random I/O suffers).

So after some testing, it seems that instead of dividing user+sys by the number of jobs run, I should have divided by the percentage of CPU actually used by the jobs.  When writing to ramdisk, there's no I/O bottleneck, so running FLAC with higher settings will get each thread to (nearly) 100%.  When writing to disk, the process waits on I/O to catch up (might not happen so much with smaller files), so each thread might only be using 50% or 25%, etc.  Lower presets won't drive each thread to 100% either.  So for a process that actively uses the CPU for the duration of its task, the real (wall) time and user time will be the same (within a few milliseconds).  Only if a process sits idle during its task will the real time and user time differ.  I always test on ramdisk and use the real time to show performance.  Looks like that is still the best way, without any extra math involved.  Hope that makes sense, I'm awful at explaining things.
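That "divide by the CPU percentage actually used" idea can be written down as a utilization figure. A small sketch (naming is mine; the numbers are from the quoted post above):

```python
def cpu_utilization(real: float, user: float, sys: float, jobs: int) -> float:
    """Fraction of the theoretical jobs x 100% CPU capacity the process actually used."""
    return (user + sys) / (real * jobs)

# ramdisk, 8 threads: close to 1.0, so each thread was (nearly) fully busy
ramdisk = cpu_utilization(8.575, 67.709, 0.568, 8)
# zfs, 8 threads: far below 1.0, so threads mostly sat waiting on I/O
zfs = cpu_utilization(40.068, 74.153, 2.573, 8)
```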
Title: Re: More multithreading
Post by: ktf on 2023-07-26 13:05:41
But when process time gets so high, is that because it wastes processing power on overhead, or is it something else?
It was supposed to wait for work, but by mistake it did 'busy waiting'.

Anyway, attached is a new win64 binary. It should be much more efficient when the user asks for (way) too many threads. It lets threads properly wait when out of work, and also pauses threads for a long time when they have to wait often. That dramatically reduces the overhead. Also, it raises the maximum number of threads to 64.

In my own tests, asking for 16 threads on a 4-core, 8-thread machine with preset -0 results in a time 10% slower than the sweet spot at 4 threads, whereas the previous binary could get **much** slower, sometimes even slower than single-threaded.

This new version should not change much for slow presets like -8 with a sane number of threads, but makes a huge difference when selecting a number of threads that is way too high and with fast presets. I think it will also make quite a difference when run on a CPU that is already intermittently busy, because it scales up and down the number of active threads based on how well they run. This is difficult to measure however.
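The actual fix lives in C with pthreads. Purely to illustrate the pattern ktf describes - threads that sleep when out of work and are woken when work arrives, instead of busy-waiting - here is a minimal Python sketch using threading.Condition; all names are invented for the example:

```python
import threading
from collections import deque

class WorkQueue:
    """Workers block on a condition variable instead of busy-waiting."""
    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()
        self._closed = False

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()          # wake one sleeping worker

    def close(self):
        with self._cond:
            self._closed = True
            self._cond.notify_all()      # release everyone still waiting

    def get(self):
        with self._cond:
            # wait() releases the lock and sleeps; no CPU is burned while idle
            while not self._items and not self._closed:
                self._cond.wait()
            return self._items.popleft() if self._items else None

# toy run: 4 workers "encode" frames by summing them
results = []
q = WorkQueue()
results_lock = threading.Lock()

def worker():
    while (frame := q.get()) is not None:
        with results_lock:
            results.append(sum(frame))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for frame in ([1, 2], [3, 4], [5, 6]):
    q.put(frame)
q.close()
for t in threads:
    t.join()
```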
Title: Re: More multithreading
Post by: Porcus on 2023-07-26 14:31:01
Questions before firing up the next FOR loops - in case there is anything that could be omitted / should be included:

* Is -M still at this stage limited to two threads? No matter what other settings? Anything else particular about -m vs -M vs --no-mid-side?
(Above I just didn't bother to make an exception for -M in the FOR loop, but -0 would anyway max out speed at low threads count.)

* Anything special about re-encoding? (Decoding is fast, but is it fast enough not to matter much for the housekeeping thread under any reasonable circumstances? Should that be tested?)

* In particular about MD5 computation and recompressing: Does flac (these builds, at least) compute the MD5sum "in the same workflow" for recompressing .flac as for compressing PCM? (AFAIUnderstand, flac --verify wavefile.wav will verify by creating a second MD5 sum and compare to the one for the source - but in principle, flac -f --verify flacfile.flac doesn't need to compute MD5 from source if that is stored in the source file ... not saying it is worth it, if users ask for -8pel32 they might want to test source first rather than waiting eons just to be told that nah source was corrupted.)

* Also, I just discovered that there is not only an undocumented --no-md5-sum, but also a --no-md5 - do those work the same? (Also, in case these builds have some exceptional behaviour implemented for only one of them.)
Title: Re: More multithreading
Post by: Wombat on 2023-07-26 15:20:32
Own standard compile of v4 without limit vs own v5, again on a 12-core/24-thread 5900X, -8ep -V
v4  vs v5
j12 173x 173x
j16 183x 183x
j24 193x 194x

For this scenario it works well, thanks!
Title: Re: More multithreading
Post by: ktf on 2023-07-26 16:05:22
* Is -M still at this stage limited to two threads? No matter what other settings?
Yes and yes.

Quote
but -0 would anyway max out speed at low threads count.
It did max out at 3 threads with v4 in my tests, now it does at 4. But that CPU only has 4 cores anyway.

Quote
* Anything special about re-encoding?
Yes, decoding does hold up encoding on (very) fast presets.

Quote
* In particular about MD5 computation and recompressing: Does flac (these builds, at least) compute the MD5sum "in the same workflow" for recompressing .flac as for compressing PCM?
The first thread crosses the API boundary, and is for (1) the internals of the flac command line program, (2) the WAV reading or FLAC decoding, (3) verify decoding and (4) some internal copying and moving of data. If this thread is idle, it will start working on a frame. One of the other threads does MD5 calculation on the data that is to be encoded, and the others create frames.
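A conceptual sketch of that division of labour - one housekeeping/reader thread feeding a dedicated MD5 thread and a pool of frame workers - again in Python and with invented names, not libFLAC's actual structure:

```python
import hashlib
import queue
import threading

md5 = hashlib.md5()
md5_q = queue.Queue()    # blocks headed for the MD5 thread
frame_q = queue.Queue()  # blocks headed for the frame workers
encoded = []
enc_lock = threading.Lock()

def md5_worker():
    # one dedicated thread hashes the input data, in input order
    while (block := md5_q.get()) is not None:
        md5.update(block)

def frame_worker():
    # the remaining threads "encode" blocks (stand-in: just record their size)
    while (block := frame_q.get()) is not None:
        with enc_lock:
            encoded.append(len(block))

workers = [threading.Thread(target=md5_worker)] + \
          [threading.Thread(target=frame_worker) for _ in range(2)]
for t in workers:
    t.start()

# the "first" (housekeeping) thread reads input and feeds both queues
for block in (b"abc", b"defg"):
    md5_q.put(block)
    frame_q.put(block)
md5_q.put(None)                  # sentinel for the MD5 thread
for _ in range(2):
    frame_q.put(None)            # one sentinel per frame worker
for t in workers:
    t.join()
```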

I just found out that the flac command line program does NOT calculate and/or check MD5 of the original file on reencoding. It only calculates a new MD5. It also doesn't check whether the original MD5 and the new one are the same. Probably something that should be fixed at some point.

Quote
(AFAIUnderstand, flac --verify wavefile.wav will verify by creating a second MD5 sum
No, it does not. It decodes and checks whether each and every decoded sample is the same as every input sample. It does not verify the stored MD5.

Quote
Also, I just discovered that there are not only one undocumented --no-md5-sum, but also a --no-md5 - do those work the same? (Also, in case these builds have some exceptional behaviour implemented for only one of them.)
I think that is a feature of the getopt functions. If an 'abbreviation' is unique, it will be accepted. So --no-md will also work. --no-m does not work because it is ambiguous: it could also mean --no-mid-side
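That unique-abbreviation rule is standard getopt_long behaviour; Python's argparse happens to implement the same rule, so it makes for a quick demonstration (the parser below is a toy, not flac's real option table):

```python
import argparse

parser = argparse.ArgumentParser(prog="flac-ish", allow_abbrev=True)  # abbrev is on by default
parser.add_argument("--no-md5-sum", action="store_true")
parser.add_argument("--no-mid-side", action="store_true")

# unique prefix: "--no-md" can only mean --no-md5-sum, so it is accepted
ok = parser.parse_args(["--no-md"])

# ambiguous prefix: "--no-m" could be either option, so parsing fails
ambiguous_rejected = False
try:
    parser.parse_args(["--no-m"])
except SystemExit:
    ambiguous_rejected = True
```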
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-26 20:48:14
flac git-5500690f 20230726

Code: [Select]
        -0      -1      -2     -3      -4      -5      -6      -7     -8 
 -j1    3.82s   3.99s   4.25s   4.37s   5.02s   6.02s   8.18s   10.27s 15.35s
 -j2    2.13s   3.16s   2.35s   2.37s   4.20s   3.28s   4.55s   5.59s  8.16s
 -j3    1.61s   3.19s   1.74s   1.77s   4.21s   2.41s   3.32s   4.09s  6.13s
 -j4    1.64s   3.18s   1.66s   1.60s   4.22s   2.00s   2.72s   3.32s  4.99s
 -j5    1.71s   3.18s   1.70s   1.64s   4.20s   1.83s   2.37s   2.85s  4.27s
 -j6    1.66s   3.18s   1.71s   1.66s   4.21s   1.82s   2.07s   2.55s  3.84s
 -j7    1.72s   3.21s   1.72s   1.64s   4.22s   1.83s   2.05s   2.31s  3.47s
 -j8    1.74s   3.17s   1.78s   1.64s   4.22s   1.84s   2.06s   2.21s  3.23s
 -j9    1.71s   3.31s   1.77s   1.64s   4.21s   1.87s   2.08s   2.22s  3.16s
 -j10   1.72s   3.17s   1.75s   1.62s   4.21s   1.85s   2.09s   2.24s  3.10s
 -j11   1.73s   3.18s   1.82s   1.69s   4.21s   1.87s   2.10s   2.33s  3.04s
 -j12   1.78s   3.16s   1.87s   1.63s   4.27s   1.93s   2.11s   2.24s  2.97s
 -j13   1.76s   3.21s   1.80s   1.69s   4.20s   1.89s   2.11s   2.29s  2.93s
 -j14   1.70s   3.17s   1.79s   1.66s   4.22s   1.88s   2.13s   2.32s  2.91s
 -j15   1.82s   3.18s   1.85s   1.67s   4.21s   1.92s   2.11s   2.30s  2.85s
 -j16   1.76s   3.20s   1.91s   1.65s   4.23s   1.89s   2.12s   2.27s  2.84s

Code: [Select]
        -0p     -1p     -2p     -3p     -4p     -5p     -6p     -7p    -8p
 -j1    3.82s   3.99s   4.24s   5.43s   6.37s   8.30s   16.39s  20.10s 44.02s
 -j2    2.16s   3.17s   2.34s   3.02s   5.57s   4.61s   9.19s   11.06s 24.25s
 -j3    1.61s   3.17s   1.74s   2.17s   5.57s   3.40s   6.76s   8.20s  18.19s
 -j4    1.64s   3.21s   1.65s   1.79s   5.60s   2.77s   5.48s   6.69s  14.85s
 -j5    1.65s   3.19s   1.68s   1.76s   5.59s   2.39s   4.75s   5.76s  12.82s
 -j6    1.63s   3.18s   1.69s   1.75s   5.57s   2.13s   4.21s   5.12s  11.44s
 -j7    1.65s   3.16s   1.73s   1.78s   5.58s   2.09s   3.86s   4.68s  10.38s
 -j8    1.69s   3.17s   1.78s   1.80s   5.58s   2.11s   3.54s   4.33s  9.59s
 -j9    1.75s   3.19s   1.78s   1.78s   5.60s   2.15s   3.55s   4.27s  9.58s
 -j10   1.78s   3.16s   1.79s   1.75s   5.58s   2.12s   3.48s   4.24s  9.72s
 -j11   1.77s   3.18s   1.77s   1.73s   5.57s   2.17s   3.44s   4.17s  9.51s
 -j12   1.75s   3.17s   1.84s   1.79s   5.60s   2.18s   3.39s   4.17s  9.45s
 -j13   1.72s   3.19s   1.87s   1.84s   5.57s   2.16s   3.35s   4.06s  9.22s
 -j14   1.78s   3.17s   1.87s   1.79s   5.57s   2.15s   3.32s   3.99s  9.16s
 -j15   1.76s   3.18s   1.82s   1.82s   5.59s   2.19s   3.28s   3.95s  9.09s
 -j16   1.76s   3.25s   1.81s   1.82s   5.59s   2.22s   3.30s   3.93s  9.03s

Didn't notice this before, but it seems presets 1 and 4 don't benefit from more than 2 threads.
Title: Re: More multithreading
Post by: Porcus on 2023-07-26 20:57:15
Didn't notice this before, but it seems presets 1 and 4 don't benefit from more than 2 threads.
-M --> it doesn't utilize more. Reply 110 above and last sentence original post.
Title: Re: More multithreading
Post by: Replica9000 on 2023-07-26 21:01:38
Didn't notice this before, but it seems presets 1 and 4 don't benefit from more than 2 threads.
-M --> it doesn't utilize more. Reply 110 above and last sentence original post.

I read that and didn't think to check which presets were using it.  Doh! Seems my brain is on vacation this week.
Title: Re: More multithreading
Post by: Porcus on 2023-07-29 13:34:18
@Wombat made a couple of builds from the same source as the above version 5, and here follow some measurements of the one with "v3" flags, requiring AVX2 but not AVX512 (did I get that right?). This on a HP Prodesk which cannot run the AVX512 build.

Compiles compare kind of how they should? At least, no nasty surprises and no miracles, just a mild improvement from the instruction set of 3 to 5 percent on most settings - although there are some exceptions in either direction, and far down-right in the table there are a few positive numbers where the Wombat v3 build takes slightly more time.

That explains the (Wv3) lines in the table: time difference in percent (negative means faster) against ktf's latest build, which appears in the main line.
That line first has compression time in seconds. Then I thought, why not represent the others as penalty relative to the benchmark where speed is proportional to number of cores. Say, if times are not 40/20/10 but 40/21/12, the penalties of 1 and 2 seconds show up as 5% (of the 20) and 20% (of the 10).
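In formula form, matching the 40/21/12 example above (the helper name is mine):

```python
def overhead_penalty(j1_time: float, jn_time: float, threads: int) -> float:
    """Percent extra wall time over the idealized j1_time / threads."""
    ideal = j1_time / threads
    return 100.0 * (jn_time - ideal) / ideal

# the example from the text: 40/21/12 instead of an ideal 40/20/10
p2 = overhead_penalty(40, 21, 2)  # the 1s penalty over the ideal 20s
p4 = overhead_penalty(40, 12, 4)  # the 2s penalty over the ideal 10s
```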

Although those %s might be misleading when the numbers become small (I mean, it is the seconds that make us impatient!), that is anyway where -j4 doesn't unleash much. E.g. -0r1 -j2 was done in 27 seconds and -j4 saves less than four more.
Also, I manually deleted the two -1j4 entries because -M caps -j at 2 anyway. That in turn is because the multithreading is not (yet) optimized for -M, which is also pretty clear from the overhead on the two -1j2 entries.


... why this choice of settings? Because it seems the "-r" makes a difference between Clang and GCC compiles (https://hydrogenaud.io/index.php/topic,123025.msg1030706.html#msg1030706) (#349 explains a mistake), so why not try a very fine partitioning and a very coarse. Not so much for the number of seconds, more to verify that it doesn't behave unexpectedly stupid under -r variations.

Code: [Select]
setting     j1 (s)  j2 ovrh  j4 ovrh
8pl32         3552      3 %      7 %
  (Wv3)       -5 %     -5 %     -5 %
8per7         5317      3 %      8 %
  (Wv3)       -5 %     -5 %     -5 %
8pr7           646      3 %      7 %
  (Wv3)       -7 %     -5 %     -5 %
8er7           688      4 %     10 %
  (Wv3)       -5 %     -5 %     -5 %
8r7            176      5 %     11 %
  (Wv3)       -5 %     -4 %     -4 %
8r1            153      4 %     11 %
  (Wv3)       -5 %     -3 %     -3 %
5r7             71      7 %     22 %
  (Wv3)       -3 %     -2 %     -2 %
5r1             66      8 %     25 %
  (Wv3)       -3 %     -2 %     -4 %
2er7           105      7 %     16 %
  (Wv3)       -4 %     -4 %     -4 %
2er1            61     11 %     52 %
  (Wv3)       -7 %     -6 %      1 %
2r7             61     11 %     48 %
  (Wv3)       -3 %     -3 %      2 %
2r1             52     12 %     74 %
  (Wv3)       -4 %     -3 %     -3 %
1r7             53     58 %        -
  (Wv3)       -3 %     -5 %     -5 %
1r1             48     51 %        -
  (Wv3)       -3 %     -6 %     -7 %
0r7             51     13 %     76 %
  (Wv3)       -3 %     -2 %      1 %
0r1             47     15 %     98 %
  (Wv3)       -3 %     -3 %      2 %
Times are median of 3. No CPU cooldown - rather the opposite: three -j1 runs discarded, then -j1, -j2, -j4 and then Wombat build -j1, -j2, -j4. If anything, that would mean that Wombat -j1 got different conditions, not the Wombat -j4.
Title: Re: More multithreading
Post by: sundance on 2023-07-29 18:51:38
To whom it may concern...
Just for the fun of it, I ran my -7 test on an up-to-date computer, which I happen to have for a couple of days to set it up.
It has a Raptor Lake Intel Core i9 13900F, 5.6 GHz max., 8/16 performance/efficient cores, 32 threads.
Code: [Select]
-j1:    Average time =  15.172 seconds (3 rounds), Encoding speed =  712.63x
-j2:    Average time =   8.047 seconds (3 rounds), Encoding speed = 1343.55x
-j3:    Average time =   5.704 seconds (3 rounds), Encoding speed = 1895.62x
-j4:    Average time =   4.434 seconds (3 rounds), Encoding speed = 2438.43x
-j5:    Average time =   3.658 seconds (3 rounds), Encoding speed = 2955.71x
-j6:    Average time =   3.182 seconds (3 rounds), Encoding speed = 3397.51x
-j7:    Average time =   2.885 seconds (3 rounds), Encoding speed = 3747.23x
-j8:    Average time =   2.808 seconds (3 rounds), Encoding speed = 3850.43x
-j10:   Average time =   2.807 seconds (3 rounds), Encoding speed = 3851.80x
-j12:   Average time =   2.841 seconds (3 rounds), Encoding speed = 3806.15x
-j14:   Average time =   2.868 seconds (3 rounds), Encoding speed = 3769.87x
-j16:   Average time =   2.935 seconds (3 rounds), Encoding speed = 3683.40x
Test were done with ktf's v5 binary (flac git-5500690f 20230726)
Title: Re: More multithreading
Post by: sundance on 2023-07-29 22:03:51
And since the performance plateau was reached @ -j7 in the test above, I tried with some heavier loads, too:
-8:
Code: [Select]
-j1:    Average time =  23.259 seconds (3 rounds), Encoding speed = 464.85x
-j2:    Average time =  12.256 seconds (3 rounds), Encoding speed = 882.18x
-j3:    Average time =   8.570 seconds (3 rounds), Encoding speed = 1261.61x
-j4:    Average time =   6.566 seconds (3 rounds), Encoding speed = 1646.66x
-j5:    Average time =   5.374 seconds (3 rounds), Encoding speed = 2011.78x
-j6:    Average time =   4.679 seconds (3 rounds), Encoding speed = 2310.59x
-j7:    Average time =   4.207 seconds (3 rounds), Encoding speed = 2569.80x
-j8:    Average time =   3.908 seconds (3 rounds), Encoding speed = 2766.87x
-j10:   Average time =   3.732 seconds (3 rounds), Encoding speed = 2896.85x
-j12:   Average time =   3.672 seconds (3 rounds), Encoding speed = 2944.44x
-j14:   Average time =   3.657 seconds (3 rounds), Encoding speed = 2956.79x
-j16:   Average time =   3.706 seconds (3 rounds), Encoding speed = 2917.17x
Performance peak here is -j10 .. -j12, so even when the CPU ran out of performance cores there is some benefit.

-8p:
Code: [Select]
-j1:    Average time =  74.252 seconds (3 rounds), Encoding speed = 145.61x
-j2:    Average time =  37.895 seconds (3 rounds), Encoding speed = 285.32x
-j3:    Average time =  26.564 seconds (3 rounds), Encoding speed = 407.02x
-j4:    Average time =  20.738 seconds (3 rounds), Encoding speed = 521.36x
-j5:    Average time =  17.305 seconds (3 rounds), Encoding speed = 624.79x
-j6:    Average time =  15.060 seconds (3 rounds), Encoding speed = 717.94x
-j7:    Average time =  13.406 seconds (3 rounds), Encoding speed = 806.52x
-j8:    Average time =  12.321 seconds (3 rounds), Encoding speed = 877.50x
-j10:   Average time =  12.201 seconds (3 rounds), Encoding speed = 886.13x
-j12:   Average time =  11.442 seconds (3 rounds), Encoding speed = 944.94x
-j14:   Average time =  10.573 seconds (3 rounds), Encoding speed = 1022.64x
-j16:   Average time =   9.777 seconds (3 rounds), Encoding speed = 1105.86x
-j18:   Average time =   9.352 seconds (3 rounds), Encoding speed = 1156.12x
-j20:   Average time =   8.942 seconds (3 rounds), Encoding speed = 1209.17x
-j22:   Average time =   8.547 seconds (3 rounds), Encoding speed = 1264.96x
-j24:   Average time =   8.219 seconds (3 rounds), Encoding speed = 1315.49x
-j26:   Average time =   7.949 seconds (3 rounds), Encoding speed = 1360.17x
-j28:   Average time =   7.850 seconds (3 rounds), Encoding speed = 1377.32x
-j30:   Average time =   7.836 seconds (3 rounds), Encoding speed = 1379.73x
-j32:   Average time =   7.791 seconds (3 rounds), Encoding speed = 1387.76x
-j34:   Average time =   7.746 seconds (3 rounds), Encoding speed = 1395.82x
-j38:   Average time =   7.819 seconds (3 rounds), Encoding speed = 1382.73x
-j42:   Average time =   7.928 seconds (3 rounds), Encoding speed = 1363.72x
-j46:   Average time =   7.963 seconds (3 rounds), Encoding speed = 1357.78x
Performance peak at -j32..-j34 here.
Title: Re: More multithreading
Post by: music_1 on 2023-07-30 07:26:53
flac-multithreading-v5-win
Code: [Select]
timer64.exe v5 -j1 -8p -f in.wav
Global Time  =    55.150

timer64.exe v5 -j2 -8p -f in.wav
Global Time  =    30.851

timer64.exe v5 -j3 -8p -f in.wav
Global Time  =    25.706

timer64.exe v5 -j4 -8p -f in.wav
Global Time  =    19.132

timer64.exe v5 -j5 -8p -f in.wav
Global Time  =    16.910

timer64.exe v5 -j6 -8p -f in.wav
Global Time  =    13.622

timer64.exe v5 -j7 -8p -f in.wav
Global Time  =    12.661

timer64.exe v5 -j8 -8p -f in.wav
Global Time  =    10.662

timer64.exe v5 -j9 -8p -f in.wav
Global Time  =    10.145

timer64.exe v5 -j10 -8p -f in.wav
Global Time  =     8.773

timer64.exe v5 -j11 -8p -f in.wav
Global Time  =     8.469

timer64.exe v5 -j12 -8p -f in.wav
Global Time  =     7.719

timer64.exe v5 -j13 -8p -f in.wav
Global Time  =     7.503

timer64.exe v5 -j14 -8p -f in.wav
Global Time  =     6.735

timer64.exe v5 -j15 -8p -f in.wav
Global Time  =     6.678

timer64.exe v5 -j16 -8p -f in.wav
Global Time  =     6.413
Title: Re: More multithreading
Post by: C.R.Helmrich on 2023-07-30 18:21:41
Attached an analysis of sundance's preset-7/8 measurements (with the -j9 results coarsely interpolated here), revealing a somewhat (to me, at least) unexpected local efficiency optimum at 4-5 threads. That local optimum doesn't show in e.g. Replica9000's statistics for preset 7/8, if I'm not mistaken. Anyway, good multithreading performance at and below 8 threads with v5!
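For anyone wanting to redo this kind of efficiency analysis, the computation is just speedup over thread count; e.g. with sundance's -8 figures quoted above (helper names are mine):

```python
def speedup(t1: float, tn: float) -> float:
    """How many times faster n threads are than 1 thread."""
    return t1 / tn

def efficiency(t1: float, tn: float, threads: int) -> float:
    """Speedup per thread: 1.0 would be perfect scaling."""
    return speedup(t1, tn) / threads

# -8 preset: 23.259s at -j1, 3.908s at -j8
s = speedup(23.259, 3.908)        # ~5.95x
e = efficiency(23.259, 3.908, 8)  # ~0.74, i.e. ~74% per-thread efficiency
```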

Chris
Title: Re: More multithreading
Post by: ktf on 2023-07-30 21:07:33
Performance peak at -j32..-j34 here.
Thanks! I think I can conclude from that that the 'leapfrogging' works correctly. One would assume a P core can do more work in the same amount of time than an E core, so if both were bound to the same number of frames before having to wait (which is the case with v1 and v3), that should have been visible in the results.

What is a bit vague though is how the scheduler works. Does it first saturate the P cores, then the E cores, then the 'hyperthreading system' of the P cores? Or does it first do the P cores, then hyperthreading, then E cores? Anyway, seeing no regressions, I'd say it works pretty well, even if it isn't very efficient anymore at those high thread counts.

Attached an analysis of sundance's preset-7/8 measurements (with the -j9 results coarsely interpolated here), revealing a somewhat (to me, at least) unexpected local efficiency optimum at 4-5 threads.
Thanks! I think that local minimum is because at that point the first and second thread don't have to switch context too often. Those first two threads are less 'specialized' than the other threads and this brings some inefficiency.

Anyway, good multithreading performance at and below 8 threads with v5!
I agree! I think I did pretty well, not having done multithreaded programming before.  :D

Of course, thank you all for benchmarking. This has helped tremendously!
Title: Re: More multithreading
Post by: Porcus on 2023-07-31 00:05:20
If anyone cares about my table overloads, then first there is a mistake of mine: the i5-7500T in the top table here is 4 cores, 4 threads. So for that one, consider the -j8 column a "sanity check".

I let some builds loose on the i5-1135G7 equipped fanless desktop again. (A table comparing with Wombat's builds posted here (https://hydrogenaud.io/index.php/topic,123025.msg1030848.html#msg1030848).)
Lower figures better. Negative numbers better for the new build. Only one run per setting per build, so take times with a grain of salt.

I have two different things in the table here as well.
* The "timediff" rows compare the running time of version 5 vs version 4: negative numbers are speedups, positive are slowdowns.
* In the other rows, the "ovrhd/diff" columns show the "overhead penalty": the idealized time would be j1 time / number of threads, and I compare the actual time taken and quote the percent extra. Say, the 53% in the top j5 cell: the idealized time would be 25-ish seconds, and it took 53 percent more, i.e. 38-ish. The worst number, "452%", means it took 5.5 times the ideal.
The "j9" column is adjusted to match j1 time / 8, since there are only 8 threads on this CPU; -j9 was run as a sanity check.
Percent penalty is quite useless on the -2r0 settings, where j4 was fastest in wall time.

The percent penalty is also quite useless in the rightmost column. The two columns at the end are run with -M, where -j is capped at 2 because -M isn't good for multithreading. No shit, Sherlock: -2Mer7 -j2 took more time than -2Mer7.

Code: [Select]
               j1 (s)    j2    j3    j4    j5    j8    j9  -M j1 (s)  -M j2
-8:
v4                124   19%   20%   35%   53%  131%  120%        96    67%
v5                121    8%   23%   40%   55%  124%  138%        83    97%
time v5 vs v4     -2%  -12%    0%    1%   -1%   -6%    6%      -14%     2%
-5:
v4                 48   23%   49%   86%  104%  271%  287%        42    84%
v5                 49   24%   56%   80%  105%  239%  246%        40    88%
time v5 vs v4      1%    1%    6%   -2%    1%   -8%  -10%       -4%    -1%
-3:
v4                 36   27%   57%  112%  166%  366%  386%        39    79%
v5                 37   26%   68%  100%  148%  312%  307%        38    92%
time v5 vs v4      1%    1%    9%   -4%   -6%  -10%  -15%       -2%     5%
-2er7:
v4                 69   19%   34%   60%   76%  175%  238%        51   100%
v5                 70   16%   38%   52%   75%  180%  212%        50   102%
time v5 vs v4      1%   -1%    5%   -4%    1%    3%   -6%       -1%     0%
-2r0:
v4                 54   21%   25%   93%  127%  452%  393%        36    80%
v5                 44   24%   57%   85%  153%  377%  320%        38    91%
time v5 vs v4    -18%  -16%    3%  -22%   -9%  -29%  -30%        6%    13%
(j1 columns: wall time in seconds; j2-j9 columns: overhead penalty vs idealized j1/threads; "time v5 vs v4" rows: time difference in percent)
Title: Re: More multithreading
Post by: ktf on 2023-07-31 08:10:27
The i5-7500T in the top table here
I am confused, which table do you mean?
Title: Re: More multithreading
Post by: Porcus on 2023-07-31 11:02:21
which table
Forgot to linkify it. Here, reply 104. (https://hydrogenaud.io/index.php/topic,124437.msg1030551.html#msg1030551)

Top table: -j8 takes about the same time as -j4. To be expected: there are only four threads.
Next tables: -j8 takes less time than -j4 for the -8-based presets, but not for -5 nor the fixed-predictor ones - and that is on a 4-core/8-thread CPU.

For version 5 "-j8" performance on the i5-1135G7 (Intel data here) (https://ark.intel.com/content/www/us/en/ark/products/208658/intel-core-i51135g7-processor-8m-cache-up-to-4-20-ghz.html), these are times in seconds from the same data as the table in the previous reply (reply 120).
Only -j1, -j4, -j8
Code: [Select]
seconds -j1 -j4 -j8
-8:     121  42  34
-5:      49  22  21
-3:      37  18  19
-2er7:   70  27  24
-2r0:    44  20  26
Although timings are to be taken with a grain of salt, I confirmed the relations between the -j's with two Wombat builds (https://hydrogenaud.io/index.php/topic,123025.msg1030848.html#msg1030848) too, so I believe that -8 benefits from going from -j4 to -j8, while the others gain only marginally or get slower. Reservation: this was on an already-hot, passively cooled computer. That could slightly favour -j1 over -j4 over -j8, since immediately before each -j1 / -j4 / -j8 run there was a -Mj2 / -j3 / -j5 run respectively (the latter heavier). But there was hardly any particular benefit for the top-left element: I ran v4 first, so by then the CPU had been running -8jx and then -8Mjx encoding for fifteen minutes of process time - it would have been hot.

Larger benefits at heavier jobs might be expected from a well-cooled computer, but not on one that is passively cooled and runs hot to the touch: there I would rather expect that trying to increase the workload by employing more threads causes throttling and diminishes the speedup, even more so on heavier jobs where a thread isn't idling as much.

To point out how even this fanless computer - when running hot for a long time - still utilizes multi-threading:
Here are results from multithreading -8pel32. A two-day job on the now-obsolete version 4.
Code: [Select]
versionv4-8epl32-j1 

Commit   =     14224 KB  =     14 MB
Work Set =     15328 KB  =     15 MB

Kernel Time  =    19.031 =    0%
User Time    = 60512.734 =   98%
Process Time = 60531.765 =   98%
Global Time  = 61573.994 =  100%
 
versionv4-8epl32-j2

Commit   =     16772 KB  =     17 MB
Work Set =     19516 KB  =     20 MB

Kernel Time  =    16.609 =    0%
User Time    = 93122.156 =  196%
Process Time = 93138.765 =  196%
Global Time  = 47361.310 =  100%
 
versionv4-8epl32-j4

Commit   =     18588 KB  =     19 MB
Work Set =     20804 KB  =     21 MB

Kernel Time  =    25.937 =    0%
User Time    = 98499.734 =  385%
Process Time = 98525.671 =  385%
Global Time  = 25562.915 =  100%
 
versionv4-8epl32-j8

Commit   =     22164 KB  =     22 MB
Work Set =     23360 KB  =     23 MB

Kernel Time  =   172.843 =    0%
User Time    = 138908.546 =  771%
Process Time = 139081.390 =  772%
Global Time  = 17994.765 =  100%
Sure, process time was more than twice as large on -j8 as on -j1, but this was version 4.

So I'd say this is better than expected: it was run on a computer where multithreading benefits would be expected smaller and still it speeds up quite a lot.
Title: Re: More multithreading
Post by: ktf on 2023-07-31 13:13:11
Yes, looks good. For now, there seems no obvious way to improve efficiency or scaling further, so I think it is time to write some documentation and get this merged.
Title: Re: More multithreading
Post by: Porcus on 2023-07-31 16:27:59
For now, there seems no obvious way to improve efficiency or scaling further
Except -M? Which probably will not be prioritized, maybe carry a "WARNING: -M limits multithreading to -j2"? 
Anyway, on this computer it didn't seem that -Mj2 would even improve over -Mj1, but to get more data I have now tried a large number of -1fj1 vs -1fj2 runs on this and on two Dell laptops, and it looks slightly more optimistic. (Global) time saved by going -j2 was measured at 4% on this computer over twenty-something unattended runs, and then 10% and 20% on those two laptops - and although I didn't fire up more runs on the desktop in the above table, (https://hydrogenaud.io/index.php/topic,124437.msg1030783.html#msg1030783) it could even be slightly more. At least it got the right sign, although you would have had to expect GitHub issues with "multithreading doesn't work!" had you implemented a previous suggestion of putting a -M in the -0 preset ;)

(Only Intel CPUs tested here, I should add.)

so I think it is time to write some documentation and get this merged.
In the course of that, there is a decision coming up - or one may postpone it: "-j0" should signify what? Suggestion:
Implement (or at least, make no decision that precludes a future case for it) -j0 as "allow multithreading, let the encoder decide". It could for now invoke -j1, but - thinking aloud and proposing something that actually multithreads, though not too aggressively:
 -j0 invokes -j2, except if there is -M, in which case it single-threads.
My loose idea behind that was to flag to users that -j0 is not supposed to be synonymous with any of the others - so stop whining when you find out it is neither -j1 nor -j2 - nor when it changes! Users cannot expect it to stay constant when it is tuned to be something smarter than a fixed number, so make it "smarter than -j2" from day one.

I take it that for the sake of applications that pass one file to one thread (like fb2k), the default will be -j1.
Title: Re: More multithreading
Post by: Wombat on 2023-08-01 11:51:12
"Currently, passing a value of 0 is synonymous with a value of 1, but this might change in the future"
https://github.com/ktmf01/flac/commit/bd389908f8d698fe17eae8d25f1e94e88573f258
This is a good idea imho.
Title: Re: More multithreading
Post by: Porcus on 2023-08-01 19:58:18
I agree that it is a good idea to warn users it will change in the future. My thinking-aloud was ... why not make that warning clearer - because it isn't synonymous with any other setting, it cannot be expected to remain synonymous with any other setting. Just to warn users about that from day one. But I'm not crying over it currently being the same as -j1.


Anyway, I see the tests here are CDDA material. Maybe high resolution and multichannel? More for sanity checking - see if there is anything completely bonkers going on - than for sheer numbers.
But the numbers ... might be unreliable, so maybe I should ask, were there any "general" changes made from v1 to v5?
Asking because when I tried the below hi-res corpus on -8Mper7 -j1/-j2 (yes, "M" - stupid setting but the point was to test "-M") then I got the following times:
v1   2574 vs 2550
v5   2538 vs 2534
(Wombat's build: slightly faster. I didn't run more 8pe, not included below.)


So, on to the tests; v1 (the first one posted in this thread) vs v5 vs Wombat's build from the same source as v5 but with an AVX-512 compiler flag (I had labelled it "v4", but left that name out so as not to confuse it with ktf's builds):

High resolution. 62 minutes, 2848 MiB compressed (-5): size contributions are 1008 MiB of 192/24 from the 2L testbench and Linn Records, 619 MiB of that infamous 768kHz Carmen Gomes PR stunt, 350 MiB of DXD (that's only 5 minutes!) from the 2L testbench - and then the rest is various-rate 32-bit integer from that sample rate converter site, plus one track in 192/16 from some French stoner band.
 
The first two cells were redone later; the computer was apparently still busy when I started. I also did -j9 for a sanity check (there are only 8 threads) - virtually the same times as -j8, so it is omitted from the table.
One surprise: v1 doing -0b4096 -j2 so well. Confirmed by a few re-runs.
But generally, v5 is superior on this material too. Take individual times with a grain of salt.

This time I used seconds. You can see where the benefits flatten out:
Code: [Select]
             j1    j2    j3    j4    j5    j8   -Mj1  -Mj2
v1@-0r0      22    23    18    17    17    17    22    22
v5           21    12    12    11    13    11    22    19
Wombat       21    13    12    12    12    13    23    19
v1@-0        21    20    18    17    17    17    23    21
v5           22    13    11    11    12    11    23    20
Wombat       22    13    12    12    12    12    23    20
v1@-0b4096   21    15    16    16    16    16    22    15
v5           21    12    10    11    10    10    22    19
Wombat       21    12    11    11    12    11    22    18
v1@-0er7     31    30    18    17    17    17    32    32
v5           31    17    14    12    11    12    32    29
Wombat       30    17    15    12    12    12    31    29
v1@-5        33    24    16    17    16    16
v5           33    18    14    11    12    11
Wombat       32    18    14    12    12    12
v1@-8       100   110    54    38    38    31
v5          100    53    39    32    32    30
Wombat       97    52    38    30    31    28
v1@-8e      477   484   249   183   178   153
v5          481   253   183   150   151   145
Wombat      476   247   181   149   148   141
v1@-8pr7    638   644   335   244   255   206
v5          642   332   241   201   201   194
Wombat      631   327   239   197   197   189
(Wombat builds produce slightly different files, size differences +/- 0.01%. )


5.1 multichannel. About an hour, DVD-sourced at 48kHz (avoiding high resolution here - one test at a time).
Since -M was off the table, I cut down to fewer -j options too. Again I have omitted a -j9 run done for sanity checking; it produced numbers consistent with -j8.

There are some weirdnesses for -0; I cannot rule out that the computer might not have been completely done with some other job or whatever. Also I cannot access that computer at the moment to re-run it (I am on the road; I had it output numbers to a text file in the cloud).
Code: [Select]
            j1    j2    j4    j8
v1@-0       13     9     9    14
v5          13    21    15     6
Wombat      14     7     7    10
v1@-2er7    17    18     9    11
v5          17    10     8     7
Wombat      20     9     7     7
v1@-5       14     9     9     9
v5          14     8     6     7
Wombat      14     9     6     7
v1@-5       34    32    24    10
v5          33    20    11     9
Wombat      31    19    11     8
v1@-8e     118   117    59    35
v5         121    92    48    34
Wombat     115    91    47    32
v1@-8pr7   162   168    79    48
v5         161   129    67    46
Wombat     153    83    47    44
So, since I was looking for anomalies, and have been away from that computer since firing up that multichannel test ... well, I would have hoped I didn't have to re-run anything due to results like those for -0. But with that minor reservation, I think the picture (on these Intel 4-core/8-thread computers) is getting quite clear. v5 behaves sanely, but I should count myself lucky if I save much time going beyond -j4.
Title: Re: More multithreading
Post by: Wombat on 2023-08-02 01:13:20
Current git of the multithreading version c1fc2c91, CPU generic.
Title: Re: More multithreading
Post by: Wombat on 2023-08-24 14:36:10
I used the version above on my little J5005 machine and recompressed more than 300GB over time using --threads=3, since it runs anyway. These are mixed bitrates.
All files correctly bit-compare.
@ktf Do you already have a timeline for the merge with xiph master or even a final and multithreaded 1.4.4?
Title: Re: More multithreading
Post by: Porcus on 2023-08-24 15:48:58
Like 1.4.0, release it 09-09 in order not to confuse ISO-8601-illiterate Americans?  ;)
a final and multithreaded 1.4.4
Question arises, are the changes so minor that it will stay "1.4"?
Possible relevance: a "1.5.0" might justify some more discussion on what is to be included.

Speaking of the second digit:
1.3.4 has this error in Rice partitions with escape code zero (testbench file 64). And 1.3.4 is the last of the 1.3 series.
Is there a risk that 1.3.4 will be kept in production because of the breaking changes in 1.4.0? If so, should there be a maintenance 1.3.5 with this bugfix?
Title: Re: More multithreading
Post by: ktf on 2023-08-24 17:47:30
@ktf Do you already have a timeline for the merge with xiph master or even a final and multithreaded 1.4.4?
Not really, no. Merge with master probably in a few weeks, release might be next year.

Question arises, are the changes so minor that it will stay "1.4"?
The reason to bump the 4 to 5 would be because of a breaking API change. That isn't the case here.

1.3.4 has this error in Rice partitions with escape code zero (testbench file 64). And 1.3.4 is the last of the 1.3 series.
Is there a risk that 1.3.4 will be kept in production because the breaking changes to 1.4.0? If so, should there be a maintenance 1.3.5 with this bugfix?
No, that won't happen. I don't have time to backport all fixes that have happened in the meantime.
Title: Re: More multithreading
Post by: Porcus on 2023-08-24 19:12:46
No, that won't happen. I don't have time to backport all fixes that have happened in the meantime.
Fair enough - it is not that it creates invalid files (I think?)
But the changelog could maybe have been clearer in recommending an up- or downgrade if flac.exe version 1.3.4 errs out on a file.
Title: Re: More multithreading
Post by: Wombat on 2023-08-24 19:52:00
@ktf Thanks for the info.
Title: Re: More multithreading
Post by: VEG on 2023-08-25 13:54:14
"Currently, passing a value of 0 is synonymous with a value of 1, but this might change in the future"
Maybe it would be better to change its meaning to "sets to amount of available cores"?
Title: Re: More multithreading
Post by: ktf on 2023-08-26 12:29:24
There is no platform-independent way to determine the 'amount of available cores': this is different for Windows, MacOS, *nixes, microcontrollers etc. Might also differ between CPU architectures. Also, with the advent of performance and efficiency cores, using all cores might not be beneficial. Same goes for hyperthreading and similar technologies.

So auto-selecting isn't as simple as it might seem.
Title: Re: More multithreading
Post by: cid42 on 2023-08-26 12:58:19
It's a de-facto standard that 0 means "as many cores as available"; if you don't want to do that, I suggest removing 0 as an option entirely. Either way the default should be 1. IMO if a user requests "as many cores as available", it's on them if that's not the most effective option.
Title: Re: More multithreading
Post by: Porcus on 2023-08-26 13:41:14
It's a de-facto standard that 0 means "as many cores as available",
Wasn't it so that "-threads 0" in ffmpeg means "let application decide"?

Either way the default should be 1.
Obviously. Say fb2k will spawn one instance per available thread.
Title: Re: More multithreading
Post by: Wombat on 2023-08-26 15:36:37
You may also relate 0=default and this is still 1 thread.
Title: Re: More multithreading
Post by: hat3k on 2023-09-30 13:38:20
First thanks for starting to develop multithreading.

I performed some tests and compared it to the multithreading behavior of https://www.rarewares.org/files/mp3/fpMP3Enc.zip
fpFLAC2FLAC used all cores @ 100% by default (with no options added), but flac.exe with the maximum number of threads specified uses the CPU in this way: [attached screenshot]

And I would like to ask: is there any chance of adding some timers to the CLI output for testing purposes?
Title: Re: More multithreading
Post by: Wombat on 2023-09-30 14:56:44
AMD Ryzen 5900x, 24 threads (--threads=24)
Title: Re: More multithreading
Post by: ktf on 2023-09-30 19:36:00
but flac.exe with maximum opted threads uses CPU in this way:
Please explain what options you used.

FLAC encodes very fast, the system calls used to enable multithreading take some time to execute, and some things cannot be multithreaded. This means that when multithreading with a large number of threads, full CPU usage can only be reached when using a slow FLAC preset, like -8p. If you used the default compression level of 5, then what you are seeing is probably due to parsing or decoding (which cannot be multithreaded) being a bottleneck.
Title: Re: More multithreading
Post by: hat3k on 2023-09-30 20:36:08
Please explain what options you used.
5950X automatically runs @ ~4.40-4.55 GHz; settings: -8 -V -j32 [attached screenshot]
5950X auto-throttles down to ~3.25-3.35 GHz; settings: -8 -V -e -p -j32 [attached screenshot]
Title: Re: More multithreading
Post by: hat3k on 2023-09-30 22:09:38
5950X auto @ ~4.45 GHz; settings: -8 -V -p -j32 [attached screenshot]
Title: Re: More multithreading
Post by: itisljar on 2024-01-18 13:42:37
@ktf Do you already have a timeline for the merge with xiph master or even a final and multithreaded 1.4.4?
Not really, no. Merge with master probably in a few weeks, release might be next year.

Any news on this? I really liked the idea of MT FLAC, I've made profile for CUETools for speedier encoding of music... :)
Title: Re: More multithreading
Post by: ktf on 2024-01-18 13:48:31
Fuzzing found some exotic/rare bugs in this code that probably nobody will ever encounter, which I will need to fix. I don't have time to do that soon though, especially since multithreaded code is much harder to debug.
Title: Re: More multithreading
Post by: ktf on 2024-02-27 18:19:09
I have pushed some changes to the multithreading code. It is too soon to say for sure, but it seems to fix some of the problems.

While the changes could impact performance, in my own tests it doesn't seem measurable. If anyone wants to double-check, the last two compiles here (https://hydrogenaud.io/index.php/topic,123176.msg1040227.html) are probably usable for that.
Title: Re: More multithreading
Post by: Hakan Abbas on 2024-02-27 20:03:54
Thanks for your hard work. I quickly did a small encoding test. I guess FLAC decoding is not multithreaded yet; I tried it, but I didn't see any difference.

Intel i7-3770K (4 cores, 8 threads), 16 GB RAM, 256 GB SSD
FLAC git-7f7da558 20240226 - "flac.exe -o output -x --no-md5 --totally-silent -jx -f input"
HALAC 0.2.6 Normal - "halac_encode input output -y -mt=x"

Code: [Select]
WAV : 1,857,654,566 bytes (Merged 3 Music album)
-------------------
HALAC Normal mt=1 : 10.359
HALAC Normal mt=2 :  6.578
HALAC Normal mt=4 :  4.328
HALAC Normal mt=8 :  3.672
HALAC Normal mt=16 : 3.609
1,245,704,379 bytes
-------------------
FLAC -0 j1 : 10.390
FLAC -0 j2 :  5.937
FLAC -0 j4 :  6.172
FLAC -0 j8 :  5.687
FLAC -0 j16 : 6.109
1,318,502,972 bytes
-------------------
FLAC -1 j1 : 11.015
FLAC -1 j2 :  6.484
FLAC -1 j4 :  6.469
FLAC -1 j8 :  7.125
FLAC -1 j16 : 6.765
1,293,667,655 bytes
-------------------
FLAC -2 j1 : 12.297
FLAC -2 j2 :  6.687
FLAC -2 j4 :  6.062
FLAC -2 j8 :  6.406
FLAC -2 j16 : 6.515
1,288,861,797 bytes
-------------------
FLAC -3 j1 : 16.453
FLAC -3 j2 :  8.750
FLAC -3 j4 :  6.000
FLAC -3 j8 :  5.562
FLAC -3 j16 : 5.219
1,254,819,663 bytes
-------------------
FLAC -4 j1 : 19.843
FLAC -4 j2 : 16.203
FLAC -4 j4 : 16.312
FLAC -4 j8 : 16.218
FLAC -4 j16 :16.406
1,221,587,898 bytes
-------------------
FLAC -5 j1 : 27.124
FLAC -5 j2 : 14.109
FLAC -5 j4 :  8.140
FLAC -5 j8 :  7.015
FLAC -5 j16 : 7.328
1,218,712,751 bytes
-------------------
Title: Re: More multithreading
Post by: ktf on 2024-02-27 21:13:05
I guess FLAC decoding is not multithread yet.
The FLAC format is unfit for multithreaded decoding. That is because reliably finding the next frame involves parsing the current one. One could offload MD5 calculation to a separate thread, and maybe parsing and decoding could be done in separate threads, but I currently don't really see a way to make more than 4 threads have any benefit at all, and even then, the workload would be very uneven.
Title: Re: More multithreading
Post by: Wombat on 2024-02-28 01:17:37
I have pushed some changes to the multithreading code. It is too soon to say for sure, but it seems to fix some of the problems.

While the changes could impact performance, in my own tests it doesn't seem measurable. If anyone wants to double-check, the last two compiles here (https://hydrogenaud.io/index.php/topic,123176.msg1040227.html) are probably usable for that.

Not sure about the problems, but for many 16/44.1 and several 24/96 files I didn't find any.
The recent git, when I compile it like the one before, gives ~191x speed for -j20 -8ep with 16/44.1, against ~192x for the former build on my 5900x.
No speed concerns for sure.
Title: Re: More multithreading
Post by: ktf on 2024-02-28 07:06:07
Not sure about the problems but for many 16/44.1 and several 24/96 files i didn't find any.
It would show up randomly about once in every 3,000,000 executions. This is the kind of bug we have to thank Google's oss-fuzz project for finding. You can probably imagine it took me quite a while to find a proper fix for it.

No speed concerns for sure.
Great, thanks for checking.
Title: Re: More multithreading
Post by: Porcus on 2024-02-28 09:50:20
The FLAC format is unfit for multithreaded decoding. That is, because reliably finding the next frame involves parsing the current one.
My usual uneducated question: Why can't one parse frame 1, send to thread 1, parse frame 2, send to thread 2 etc?
Title: Re: More multithreading
Post by: ktf on 2024-02-28 11:26:19
That is possible of course, but of little use. Decoding only takes a tiny amount of time compared to parsing, which means the decoding thread will idle a lot, and sleeping/waking a thread comes with a lot of overhead. Most time is spent parsing.
Title: Re: More multithreading
Post by: Hakan Abbas on 2024-02-28 12:25:49
Decoding speed is more important than encoding speed in most cases. And heavy workloads are where multithreading pays off most; however, even at a very low processing load, we can benefit from multithreading.

Intel Core i7-3770K(4 core, 8 thread), 16 gb ram, 240 gb ssd (https://www.disctech.com/Intel-520-Series-SSDSC2CW240A310-240GB-2-5-MLC-SSD-SATA3-Hard-Drive)
The SSD I use in this test system is really slow (2012). Therefore, it is a large bottleneck in terms of speed.
Code: [Select]
HALAC Normal Decoding
1,245,704,379 bytes -> 1,857,654,566 bytes
HALAC Normal mt=1 : 10.074
HALAC Normal mt=2 :  6.590
HALAC Normal mt=4 :  4.914
HALAC Normal mt=8 :  4.303
-------------------
HALAC Fast Decoding
1,305,406,815 bytes -> 1,857,654,566 bytes
HALAC Fast mt=1 : 9.096
HALAC Fast mt=2 : 6.323
HALAC Fast mt=4 : 4.558
HALAC Fast mt=8 : 3.674
----------------------
FLAC Decoding
1,318,502,972 bytes -> 1,857,654,566 bytes
FLAC -0 : 14.307
Title: Re: More multithreading
Post by: ktf on 2024-02-28 13:07:08
Decoding speed is more important than Encoding speed in most cases.
Can you name a few of those cases?

As I see it, encoded audio is often decoded real-time for playback. On any modern desktop/laptop CPU, FLAC already reaches 1000x playback speed, including MD5 calculation. FLAC playback has happened on battery-operated devices for 20 years already, and it is often faster than decoding of lossy audio, see https://www.rockbox.org/wiki/CodecPerformanceComparison

The only audio-related use case I can think of is verification of stored files, and in that case MD5 calculation is the bottleneck already. There are some non-audio use cases as described here (https://hydrogenaud.io/index.php/topic,123712); those would indeed benefit from faster decoding when MD5 is disabled.

So sure, FLAC decoding can be made faster and some people would benefit. But I wouldn't say it is more important than encoding speed in most cases, as it is already very, very fast.
Title: Re: More multithreading
Post by: Porcus on 2024-02-28 14:22:46
For perspective: Decoding CDDA at 1000x-ish speed means a maximal 4 GB WAVE file would run in half a minute-ish time. (And it seems to me that 4 GB full of 24-bit content decodes faster than 16-bit content - a big file is not unlikely to be higher resolution.)
Why bother about one file? Because if you are decoding (or verifying) several files, then use an application that will spawn one thread per file and respawn when that file is done.

The only audio-related use-case I can think of is verification of stored files, and in that case MD5 calculation is the bottleneck already.
First, if faster verification is an objective, one could do what WavPack/Monkey's/OptimFROG are able to do: implement some flac -t --no-md5 that merely runs through the frame checksums without decoding.
Whether it is worth it given FLAC's decoding speeds is a different question. Same for offloading MD5 to different thread ...

But there are other audio-related uses. One is simple processing like ReplayGain scanning.
(Disregarding the fact that true-peak scan could be more intensive.)

Rewind time to 2007-ish, when HDD costs would make some users choose Monkey's for their CD rips. If you wanted to configure EAC to rip to Monkey's (hacking around with Case's wapet for APEv2 tagging) and then run an RG scan that would decode the .ape file, that would take quite some time. Of course it isn't much of an issue if you can run it overnight.
But the following more cumbersome procedure would cost less CPU: Configure EAC to rip to FLAC -0 with metaflac computing ReplayGain and tagging it (recall, WAVE tagging was a big meh in that age!); then convert to Monkey's with tags transfer.
Because a FLAC encoding and two decodings (one for RG and one for conversion) would be cheaper on the CPU than a Monkey's decoding. According to your tests back in 2009, (http://www.audiograaf.nl/losslesstest/old_index.html) encoding + 2x decoding FLAC -0 would go at 70x realtime on your hardware back then - and that was faster than a single Monkey's Fast decode, which could take a minute for a CD. (Yet another disclaimer: ot saying that is an awful wait either, when you are in a ripping process that takes more when you are changing CDs.)
Title: Re: More multithreading
Post by: Hakan Abbas on 2024-02-28 16:25:41
That is possible of course, but of little use. Decoding only takes a tiny amount of time compared to parsing, which means the decoding thread will idle a lot, and sleeping/waking a thread comes with a lot of overhead. Most time is spent parsing.
"Encode once, decode all the time" is an important principle in data compression. "Is the number of people who produce music (encode) larger, or the number of people who listen (decode)?" The answer to this question supports it. Of course, in general data compression there are places where encode speed is important and places where decode speed is important; we can find them with a short search. However, the general opinion is that decode speed is more important. Besides, it's not my idea - it is the general opinion of some authorities and the sector.
F_Score (universal score) (https://gdcc.tech/rules/) = C + 2 * D + (S + F) / 10⁶

In addition, hashing operations can work independently at both the encode and decode stages. But of course, if a codec has various dependencies due to its nature, multithreading will not be effective. In the case of audio compression/decompression, we know that FLAC is fast enough - this is one of the biggest reasons for its wide usage. What I mentioned and showed here is that, contrary to what you said, the decode phase of audio data can also be performed quite efficiently with multithreading. Just because something is fast doesn't mean it can't get faster, and faster will never hurt.
Title: Re: More multithreading
Post by: Porcus on 2024-02-28 17:17:58
"Encode once, decode all the time".
Sure. I've used that argument myself. So I am kinda surprised to see a score formula that puts only twice as much weight on decoding as encoding - that doesn't translate well to "all the time".

Also, wall time and CPU time are not the same. If you are in for efficiency, then don't produce overhead.
Title: Re: More multithreading
Post by: ktf on 2024-02-29 09:03:34
"Encode once, decode all the time".
But this doesn't explain why multithreaded decoding would be beneficial. When working with documents, images, etc., the file is decoded at once in its entirety, and then multithreading makes sense, because faster is better. But audio and video are usually decoded at playback speed, and being any faster only has lower CPU usage as a benefit. In that case, multithreading doesn't lower total CPU usage, but increases it.

Additionally, decoding is far more vulnerable to security bugs than encoding is, so I'd like to keep that code simple where possible. Besides, if this were a competition, there is no way FLAC would win: it is a 25-year-old format based on patent-free techniques. So in effect, FLAC is working with 50-year-old techniques like Rice coding. Of course it doesn't stand a chance against techniques like ANS when only looking at speed and compression ratio.

So I'd prefer FLAC to be good at compressing, fast and stable. There are many codecs that compress better and there are faster codecs, but FLAC is an open, patent-free, very well documented standard; it has a lot of independent decoder implementations, has several independent encoder implementations, compresses reasonably well (within 10% of state-of-the-art), has a stable, fast and open-source reference implementation and is supported by a lot of hardware.
Title: Re: More multithreading
Post by: Hakan Abbas on 2024-02-29 11:54:37
When working with documents, images, etc. the file is decoded at once in its entirety, and then multithreading makes sense, because faster is better. But audio and video are usually decoded at playback speed, and being any faster only has a lower CPU usage as a benefit.
You're right about what you said. I'm looking at the case more generally in terms of data compression. This is also a little due to my previous different studies.

Besides, if this were a competition, there is no way FLAC would win: it is a 25 year old format based on patent-free techniques. So in effect, FLAC is working with 50 year old techniques like Rice coding. Of course it doesn't stand a chance to techniques like ANS when only looking at speed and compression ratio.
Even if Rice coding is old (remembering Solomon W. Golomb with respect), I think it is very well suited to audio compression. When used correctly, it can work as efficiently as Huffman, ANS and even AC (arithmetic coding). I have already said that according to my tests, the compression ratio of Rice coding on audio data is better than ANS - in certain cases, even in image compression. So if I can also solve the speed issue, I can use a custom Rice-derived coding system in the coming period.

So I'd prefer FLAC to be good at compressing, fast and stable. There are many codecs that compress better and there are faster codecs, but FLAC is an open, patent-free, very well documented standard; it has a lot of independent decoder implementations, has several independent encoder implementations, compresses reasonably well (within 10% of state-of-the-art), has a stable, fast and open-source reference implementation and is supported by a lot of hardware.
I have always liked and used FLAC. But this interpretation leads to the conclusion that there is no need to develop anything new and it is not worth spending time on it. Maybe that's the truth.
Title: Re: More multithreading
Post by: ktf on 2024-02-29 12:05:55
So I'd prefer FLAC to be good at compressing, fast and stable. There are many codecs that compress better and there are faster codecs, but FLAC is an open, patent-free, very well documented standard; it has a lot of independent decoder implementations, has several independent encoder implementations, compresses reasonably well (within 10% of state-of-the-art), has a stable, fast and open-source reference implementation and is supported by a lot of hardware.
I have always liked and used FLAC. But this interpretation leads to the conclusion that there is no need to develop anything new and it is not worth spending time on it. Maybe that's the truth.
That is stretching it too far. I just stated a list of goals for FLAC and its strong points, nothing else. I listed them because I think developing multithreaded decoding provides little gain and might conflict with the goal of being stable. I'm not saying FLAC should have no more additions, and I am certainly not saying new codecs have no use.
Title: Re: More multithreading
Post by: cid42 on 2024-03-02 12:11:24
I guess FLAC decoding is not multithread yet.
The FLAC format is unfit for multithreaded decoding. That is, because reliably finding the next frame involves parsing the current one. One could offload MD5 calculation to a separate thread, and maybe parsing and decoding could be done in separate threads, but I currently don't really see a way to  make more than 4 threads have any benefit at all, and even then, the workload would be very uneven.
I agree that the juice probably isn't worth the squeeze w.r.t. an MT decoder, but it isn't that dire. A seektable does provide a reliable way to chunk up the input, and even without a seektable a decoder could use things like the maximum frame size, if available, to give threads a reasonable chunk of sequential frames to work on and minimise overhead (or just do 1MiB chunks, or even split the file into n parts for n threads). An MT decoder can be proof-of-concepted similarly to the hack job I did with flaccid, which I might do at some point just to answer some of these questions.

I doubt there's much benefit to be had unless the file is already in ram, and md5 would have to be ignored as that just kills the whole idea unless capping out at some double digit speedup is the goal. It does rather mess up memory accesses.
Title: Re: More multithreading
Post by: Porcus on 2024-04-11 10:09:08
I guess FLAC decoding is not multithread yet.
The FLAC format is unfit for multithreaded decoding. That is, because reliably finding the next frame involves parsing the current one. One could offload MD5 calculation to a separate thread, and maybe parsing and decoding could be done in separate threads, but I currently don't really see a way to  make more than 4 threads have any benefit at all, and even then, the workload would be very uneven.

As pointed out here, (https://hydrogenaud.io/index.php/topic,125694.msg1042566.html#msg1042566) ffmpeg does some degree of multithreading on decoding - and that goes for FLAC too. I am running a longer test on a different computer with a newer ffmpeg, but on this laptop with ffmpeg 5.1.1, the fastest decoders of "one CD image" (73 minutes, tagless, decoded to NUL) are, in order:
* ffmpeg decoding FLAC
* ffmpeg decoding TAK
* ffmpeg decoding TTA
* ffmpeg decoding WavPack -f / -g and ALAC,
and: flac.exe 1.4.2 on a dual mono file which has no MD5
* ffmpeg on WavPack -h
* flac.exe and wvunpack.exe --threads (5.7.0) on -f, -g, -h and ffmpeg on WavPack -hh
* takc.exe (doesn't multithread decoding)
* wvunpack.exe --threads on -hh
* and from then on the usual order of single-threaded wvunpack.exe, with refalac and tta meddling in before the heaviest ones.