More multithreading

Re: More multithreading

Reply #75 – 2023-07-20 22:08:12

Quote from: rutra80 on 2023-07-14 00:12:45

V3:
-j1 -8ep - 103s
-j2 -8ep - 118s
-j4 -8ep - 37s
-j8 -8ep - 26s

V4:
-j1 -8ep - 100s
-j2 -8ep - 51s
-j4 -8ep - 29s
-j8 -8ep - 22s

Abusing 16 threads gives the same time as real 8.

Re: More multithreading

Reply #76 – 2023-07-20 22:44:08

flac git-f8cb7f08. The times are very similar to flac git-1357f844.

Code:

 -j1: 0m43.917s
 -j2: 0m24.247s
 -j3: 0m18.118s
 -j4: 0m14.752s
 -j5: 0m12.700s
 -j6: 0m11.321s
 -j7: 0m10.283s
 -j8: 0m9.496s
 -j9: 0m9.571s
-j10: 0m9.408s
-j11: 0m9.359s
-j12: 0m9.231s
-j13: 0m9.094s
-j14: 0m9.077s
-j15: 0m8.990s
-j16: 0m8.954s

Re: More multithreading

Reply #77 – 2023-07-21 01:45:07

Again my simple numbers. Looks fast and scaling is fine.

Code:

     v1    v3    v4
j1   103x  104x  106x
j2   106x  115x  207x
j3   203x  225x  306x
j4   298x  326x  402x
j5   381x  426x  492x
j6   460x  521x  566x
j7   543x  615x  645x
j8   620x  670x  705x
j9   685x  710x  768x
j10  705x  625x  765x
j11  725x  670x  699x
j12  740x  680x  688x
j13  750x  675x  683x
j14  752x  670x  676x
j15  750x  650x  675x

Re: More multithreading

Reply #78 – 2023-07-21 06:23:38

Quote from: sundance on 2023-07-20 20:28:36

My results with the v4 binary:
[...]
Excellent scaling here up to -j6 (having 6 cores here...)

Good to see. As you can imagine, getting this right for faster presets is more difficult than for slower presets. Of course, -7 isn't particularly fast, but it is quite a bit faster than -8p. Also, I think this scales better on Linux (where pthreads is native) than on Windows (where pthreads is 'emulated') so these numbers on Windows are very nice I'd say.

For example, scaling for the really fast presets like -0 and -3 stopped after 2 threads already because MD5 was 'blocking'. With these changes, using 3 threads is almost 3x as fast as 1 thread, which I think is a big win. Mainly theoretically of course, because I don't think many people will use such a fast preset with multithreading, but still, it is nice that it works.

Quote from: Replica9000 on 2023-07-20 22:44:08

flac git-f8cb7f08. The times are very similar to flac git-1357f844.

Yes, the only change was a small fix for building with multithreading disabled.

Quote from: rutra80 on 2023-07-20 22:08:12

[...]
Abusing 16 threads gives the same time as real 8.

I still don't know why -j2 was so much slower than -j1 on your system with v3, but good to see this has been fixed.

Quote from: Wombat on 2023-07-21 01:45:07

Again my simple numbers. Looks fast and scaling is fine.
[...]

This seems to contradict music_1's results, though. Your system and music_1's have the highest physical core counts. I don't know what causes this difference.

Re: More multithreading

Reply #79 – 2023-07-22 02:15:32

Going above -j12 on my 12-core/24-thread CPU, -8ep still sees small benefits up to -j16, but from -j17 on it becomes extremely slow, apparently falling back to 1 thread. Was this limit mentioned somewhere and I overlooked it?

Re: More multithreading

Reply #80 – 2023-07-22 03:17:55

Quote from: Wombat on 2023-07-22 02:15:32

Going above -j12 on my 12-core/24-thread CPU, -8ep still sees small benefits up to -j16, but from -j17 on it becomes extremely slow, apparently falling back to 1 thread. Was this limit mentioned somewhere and I overlooked it?

My CPU only has 16 threads. If I try to use more, I get: "WARNING, cannot set number of threads: too many"

I thought maybe FLAC refuses to use more than the available threads, but I see this in the code:

Code:

#define FLAC__STREAM_ENCODER_MAX_THREADS 16
#define FLAC__STREAM_ENCODER_MAX_THREADTASKS 34

I changed it to 32 and 68 respectively, and I can use up to 32 threads now.

Re: More multithreading

Reply #81 – 2023-07-22 03:21:25

Nice find, thanks. Let's wait for ktf to explain the reason for this.

Re: More multithreading

Reply #82 – 2023-07-22 03:45:59

I built this from my phone, so can't test.

FLAC 64-bit Windows. Static binary, 32 threads enabled.
Edit: built with no ASM optimizations. (Faster 16-bit encoding)

Re: More multithreading

Reply #83 – 2023-07-22 04:49:17

Cool, thanks!
It still scales a little up to j24. -8ep -V
j12 182x
j16 196x
j24 206x

Re: More multithreading

Reply #84 – 2023-07-22 06:33:49

Quote from: Wombat on 2023-07-22 03:21:25

Nice find, thanks. Let's wait for ktf to explain the reason for this.

I have to put a limit somewhere, because some memory allocation happens statically. Seemed reasonable to put it at 16. Looking at your data, twice the number of threads for 10% gain doesn't seem worthwhile really, so it still seems pretty reasonable.

Re: More multithreading

Reply #85 – 2023-07-22 09:46:46

It might make sense for the thread count to track consumer x86_64 physical cores, which currently top out at 24 with the 13900K, or consumer threads, which currently top out at 32. The biggest x86 server chip is Bergamo (Zen 4c) with 128 cores, but if any of us interact with it, it's likely only a few cores at a time in the cloud. Is there any scope to reduce the per-core memory footprint?

Re: More multithreading

Reply #86 – 2023-07-22 11:23:02

@Replica9000 for CDDA with -8ep your compile seems 13-28% slower on my i7-4790K.

Re: More multithreading

Reply #87 – 2023-07-22 11:57:26

Quote from: cid42 on 2023-07-22 09:46:46

It might make sense for the thread count to track consumer x86_64 physical cores, which currently top out at 24 with the 13900K, or consumer threads, which currently top out at 32.

At such high thread counts, there is a tremendous amount of overhead. As Wombat's results showed, there is very little gain. Sure, I could increase the maximum number of threads, but would it really make sense? I said static memory allocation is a problem, but now that I've checked, it isn't really. Increasing the maximum thread count by 1 results in static allocation of 3 extra pointers (which are 8 bytes each).
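
A quick check of that arithmetic, as a Python sketch. The two constants are the figures from the post above; the function name is made up for illustration.

```python
POINTERS_PER_THREAD = 3  # extra static pointers per thread, per the post above
BYTES_PER_POINTER = 8    # pointer size quoted in the post (64-bit build)


def extra_static_bytes(old_max_threads, new_max_threads):
    """Extra static allocation when raising the maximum thread count."""
    added_threads = new_max_threads - old_max_threads
    return added_threads * POINTERS_PER_THREAD * BYTES_PER_POINTER


# Raising the cap from 16 to 32 threads, as in Replica9000's build.
print(extra_static_bytes(16, 32), "bytes")
```

By that count, doubling the cap from 16 to 32 costs 384 bytes of static allocation.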

Quote

Is there any scope to reduce the per-core memory footprint?

I don't think that is really necessary. FLAC already uses memory very efficiently. Memory measurements are rather erratic, but I've tried anyway.

Results with -8

Code:

~$ sleep 1; for I in {1..3}; do for J in {1..16}; do echo -n "$J "; /usr/bin/time -v ./flac-v4 -fsj$J -8 /media/test.wav /media/test.wav /media/test.wav /media/test.wav /media/test.wav 2>&1 | grep "Maximum resident"; done; done
1 	Maximum resident set size (kbytes): 3584
2 	Maximum resident set size (kbytes): 7264
3 	Maximum resident set size (kbytes): 7816
4 	Maximum resident set size (kbytes): 8372
5 	Maximum resident set size (kbytes): 8164
6 	Maximum resident set size (kbytes): 11196
7 	Maximum resident set size (kbytes): 12600
8 	Maximum resident set size (kbytes): 13516
9 	Maximum resident set size (kbytes): 14052
10 	Maximum resident set size (kbytes): 16736
11 	Maximum resident set size (kbytes): 19920
12 	Maximum resident set size (kbytes): 18620
13 	Maximum resident set size (kbytes): 19632
14 	Maximum resident set size (kbytes): 23604
15 	Maximum resident set size (kbytes): 24572
16 	Maximum resident set size (kbytes): 25028

With larger blocksizes this increases quite a bit. With a blocksize of 32768:

Code:

1 	Maximum resident set size (kbytes): 5456
4 	Maximum resident set size (kbytes): 26384
8 	Maximum resident set size (kbytes): 63704
12 	Maximum resident set size (kbytes): 78156
16 	Maximum resident set size (kbytes): 108120

With a blocksize of 32768 and -r 15 this increases even more

Code:

1 	Maximum resident set size (kbytes): 7552
4 	Maximum resident set size (kbytes): 45320
8 	Maximum resident set size (kbytes): 84968
12 	Maximum resident set size (kbytes): 126768
16 	Maximum resident set size (kbytes): 159012

So, memory usage is already highly dynamic. I wouldn't know where I could cut down. Also, 25MB for 16 cores isn't much really.
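
To put a rough number on the per-thread cost, the measurements above can be reduced to an average increase per additional thread. This is only a sketch: it takes the -j1 and -j16 readings at face value, although the post itself notes that such measurements are erratic, and the helper name is hypothetical.

```python
# Maximum resident set size in kbytes at -j1 and -j16, copied from the runs above.
measurements = {
    "-8": (3584, 25028),
    "blocksize 32768": (5456, 108120),
    "blocksize 32768 and -r 15": (7552, 159012),
}


def kbytes_per_extra_thread(rss_j1, rss_j16):
    """Average growth in peak RSS for each thread added between -j1 and -j16."""
    return (rss_j16 - rss_j1) / 15


for label, (rss_j1, rss_j16) in measurements.items():
    print(f"{label}: about {kbytes_per_extra_thread(rss_j1, rss_j16):.0f} kbytes per extra thread")
```

For the -8 run this comes to roughly 1.4 MB per extra thread.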

Re: More multithreading

Reply #88 – 2023-07-22 12:22:29

Wombat's CPU is 12c24t. The 13900K is 8P+16E, i.e. 24c32t. The P/E core split muddies the waters, as the E cores are clocked lower, but there's a good chance of decent scaling up to 24 flac threads.

A rule of thumb for hyperthreading is that it normally provides a -5% to +30% benefit relative to no SMT depending on the workload, with outliers in both directions. It's no surprise that Wombat shows a +13% benefit from -j12 to -j24.

Re: More multithreading

Reply #89 – 2023-07-22 13:57:17

Quote from: cid42 on 2023-07-22 12:22:29

Wombat's CPU is 12c24t.

Forgot about that bit.

Quote

The 13900K is 8P+16E, i.e. 24c32t.

I'm curious as to whether this code properly scales on such heterogeneous architectures in general. In v1 and v3, threads couldn't 'leapfrog' each other, so threads would need to be rotated over P and E cores to stay in sync, or else threads would have to idle. With v4, threads can in fact leapfrog each other (one thread can do three frames while another does two for example), so this should scale reasonably well.
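
The difference between the two schemes can be illustrated with a toy model. This is not libFLAC's scheduler, just a sketch under stated assumptions: each core needs a fixed time per frame, "lockstep" means no core starts its next frame until the slowest core in the current round is done, and "leapfrog" means each core takes the next frame as soon as it is free. The function names and the 2:1 speed ratio are made up.

```python
import heapq


def lockstep_time(frame_costs, n_frames):
    """Frames are dealt out in rounds; each round lasts as long as its slowest core."""
    workers = len(frame_costs)
    total = 0.0
    for start in range(0, n_frames, workers):
        # The last round may be partial and then uses only the first few cores.
        active = frame_costs[:min(workers, n_frames - start)]
        total += max(active)
    return total


def leapfrog_time(frame_costs, n_frames):
    """Each core grabs the next unprocessed frame as soon as it becomes free."""
    # Heap of (time at which the core is free, core index).
    free_at = [(0.0, i) for i in range(len(frame_costs))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_frames):
        t, i = heapq.heappop(free_at)
        done = t + frame_costs[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


# Two fast cores and two cores that take twice as long per frame (assumed ratio).
costs = [1.0, 1.0, 2.0, 2.0]
print("lockstep:", lockstep_time(costs, 12))
print("leapfrog:", leapfrog_time(costs, 12))
```

In this model the lockstep variant is paced by the slow cores, while the leapfrog variant lets the fast cores take on more frames, which is the behaviour described for v4.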

On my 4 core Linux PC (i7-4710MQ), it does scale very well. For setting -8, using 4 threads gives a 3.9x speedup and on -5 it gives a 3.75x speedup.

Re: More multithreading

Reply #90 – 2023-07-22 15:04:14

Quote from: cid42 on 2023-07-22 12:22:29

A rule of thumb for hyperthreading is that it normally provides a -5 to +30% benefit relative to no SMT depending on the workload, with outliers in both directions. It's no surprise that wombat shows a +13% benefit from -j12 to -j24.

-j16 and -j24 hit the same 142 W power limit here, so any benefit improves the efficiency, IMHO.

Re: More multithreading

Reply #91 – 2023-07-22 15:37:24

Quote from: rutra80 on 2023-07-22 11:23:02

@Replica9000 for CDDA with -8ep your compile seems 13-28% slower on my i7-4790K.

As he mentions, Replica9000 compiled without ASM optimizations. That gives a good performance boost with 16-bit audio on some modern CPUs like my Ryzen 5900X.
The compiler does well there. Unfortunately, some older CPUs don't benefit, and the build is slower on them.
Our member sundance already experienced and benchmarked that with an Intel 8700.
The same thing happens to a smaller degree when using the additional compiler flag -falign-functions=32 (default 16) with GCC.

Re: More multithreading

Reply #92 – 2023-07-22 15:58:20

Same build as above, with ASM optimizations.
Static Win64 binary.

Re: More multithreading

Reply #93 – 2023-07-22 18:10:12

First, this question for development:
If you call the encoder to process multiple files, isn't that where you can multi-thread with very little overhead? Sure, the audio will have different lengths, but still: if I call (possibly with options) flac -8p *.wav or, for that matter, flac -2ef flacfileencodedwith_-0b56789.flac flacfileenccodedwith_-Mb32_-l23.flac longfile.rf64 outrageouslylongfile.w64 veryshortaudiofilewith2GBheaders.wav, and the executable can spawn multiple threads, then what?

Reason to ask this first is this question about what we should measure - and what utilities to use and read off the numbers. That depends on purpose I guess:

If I download an album as .wav (there are still sources who only offer that as lossless format), I might want to run flac -8pr7 *.wav and be done ASAP - in execution time ("wall time"). Even if the CPU might throttle at the end of the process, I might save time if the encoder starts firing all guns at once.
Longer job: overhead matters more. Likely you want as many cores to run as keeps a reasonable thermal equilibrium.
Testing right now. A more complicated task, where we want both total time measured and threadseconds (to scrutinize idle time vs overhead).
The answer to the first question on top might - for all that I know - suggest that <certain consideration> is not particularly interesting, it will be gone once the user runs multiple input files, so ... ?

timer64, "Global time" surely but also "Process time" - or some other utility?
Powershell measure-command returns execution time, but nothing else?

Re: More multithreading

Reply #94 – 2023-07-22 19:01:49

Quote from: Porcus on 2023-07-22 18:10:12

timer64, "Global time" surely but also "Process time" - or some other utility?
Powershell measure-command returns execution time, but nothing else?

WavPack, for example, has a built-in benchmark, so flac could try this too, at least for test builds.

Another thing is that my Linux vs. Windows benchmarks indicate that Linux seems to perform better with lower thread counts while Windows does the opposite. I don't know whether that is expected or due to differences in measurement methods. With a built-in benchmark I wouldn't need to worry about this.
https://hydrogenaud.io/index.php/topic,124437.msg1030148.html#msg1030148

Re: More multithreading

Reply #95 – 2023-07-22 21:13:27

Quote from: Porcus on 2023-07-22 18:10:12

If you call the encoder to process multiple files, isn't that where you can multi-thread with very little overhead?

Yes, of course. But that would only benefit the flac command line tool, and I was worried how console output should be made easy to understand. Also, multithreading over files is already possible with various utilities like GNU parallel. The approach with multithreading over a single file can benefit all libFLAC users, and is not achievable with other tools.
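
For readers without GNU parallel, the same per-file approach can be sketched in Python. This is an illustration under assumptions, not part of flac: it starts one single-threaded flac process per input file, uses only the -f, -s and preset options that appear elsewhere in this thread, and the function names are made up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def flac_command(wav_path, preset="-8"):
    """Command line for one encode; -f overwrites existing output, -s is silent."""
    return ["flac", "-f", "-s", preset, str(wav_path)]


def encode_all(wav_paths, jobs=4, run=subprocess.run):
    """Encode files concurrently, one flac process per file.

    `run` is injectable so the scheduling can be tried without flac installed.
    Returns the process return codes in input order.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        results = pool.map(lambda path: run(flac_command(path)), wav_paths)
        return [result.returncode for result in results]
```

A call such as encode_all(sorted(pathlib.Path(".").glob("*.wav")), jobs=4) would then keep up to four encodes running at once. Because each process prints nothing with -s, the console output problem mentioned above is sidestepped rather than solved.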

Quote

Reason to ask this first is this question about what we should measure - and what utilities to use and read off the numbers.

Most importantly wall time. Second, wall time with 1 thread divided by wall time with X number of threads. On my machine, this gives me:

Quote from: ktf on 2023-07-22 13:57:17

For setting -8, using 4 threads gives a 3.9x speedup and on -5 it gives a 3.75x speedup.

I'd say these are the only numbers that are interesting to the end user: how much do we gain, and how efficient is it.
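
Both figures follow directly from wall times. A minimal Python sketch, using a few of Replica9000's timings from Reply #76 as input; the function names are made up, and "efficiency" is taken here to mean speedup divided by thread count.

```python
# Wall times in seconds for selected -j values, copied from Reply #76.
wall_times = {1: 43.917, 2: 24.247, 4: 14.752, 8: 9.496, 16: 8.954}


def speedup(times, threads):
    """Wall time with 1 thread divided by wall time with `threads` threads."""
    return times[1] / times[threads]


def efficiency(times, threads):
    """Speedup per thread; 1.0 would be perfect scaling."""
    return speedup(times, threads) / threads


for j in sorted(wall_times):
    print(f"-j{j}: speedup {speedup(wall_times, j):.2f}x, "
          f"efficiency {efficiency(wall_times, j):.0%}")
```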

edit:

Quote from: bennetng on 2023-07-22 19:01:49

Another thing is that my Linux vs. Windows benchmarks indicate that Linux seems to perform better with lower thread counts while Windows does the opposite. I don't know whether that is expected or due to differences in measurement methods. With a built-in benchmark I wouldn't need to worry about this.

I don't think measuring wall time is particularly complicated, so I don't think there is much difference in such a measurement. CPU time is difficult, of course. However, threading is heavily dependent on the kernel, and I've seen quite different behaviour, with some bugs only showing up on Linux and others only showing up on Windows. I can't differentiate between what is kernel and what is libFLAC, but I don't think the timer utility is to blame here.
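
For anyone who wants both numbers from one run, here is a small Python sketch that times a command and reads the CPU time of its child process. It is an assumption-laden illustration: the function name is made up, and os.times() only reports child CPU time on Unix-like systems (on Windows those fields are zero), so it does not replace timer64 there.

```python
import os
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd and return (wall_seconds, child_cpu_seconds)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # User plus system CPU time accumulated by waited-for child processes.
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return wall, cpu


if __name__ == "__main__":
    # Stand-in command; replace with a flac invocation to test an encode.
    wall, cpu = time_command([sys.executable, "-c", "pass"])
    print(f"wall {wall:.3f}s, child CPU {cpu:.3f}s")
```

Dividing the CPU time by the wall time gives a rough count of how many threads were busy on average, which speaks to the idle time versus overhead question.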

Re: More multithreading

Reply #96 – 2023-07-23 09:40:53

Out of curiosity I ran my test files with ktf's v4 binary with lower settings:
-5:

Code:

-j1:    Average time =  14.054 seconds (3 rounds), Encoding speed = 769.30x
-j2:    Average time =   7.637 seconds (3 rounds), Encoding speed = 1415.74x
-j3:    Average time =   5.364 seconds (3 rounds), Encoding speed = 2015.79x
-j4:    Average time =   4.172 seconds (3 rounds), Encoding speed = 2591.36x
-j5:    Average time =   4.166 seconds (3 rounds), Encoding speed = 2595.30x
-j6:    Average time =   4.817 seconds (3 rounds), Encoding speed = 2244.71x
-j7:    Average time =   5.061 seconds (3 rounds), Encoding speed = 2136.34x
-j8:    Average time =   5.175 seconds (3 rounds), Encoding speed = 2089.41x

-0

Code:

-j1:    Average time =   9.710 seconds (3 rounds), Encoding speed = 1113.53x
-j2:    Average time =   5.570 seconds (3 rounds), Encoding speed = 1941.00x
-j3:    Average time =   4.194 seconds (3 rounds), Encoding speed = 2578.17x
-j4:    Average time =   5.593 seconds (3 rounds), Encoding speed = 1933.02x
-j5:    Average time =   6.210 seconds (3 rounds), Encoding speed = 1740.97x
-j6:    Average time =   6.525 seconds (3 rounds), Encoding speed = 1657.01x
-j7:    Average time =   6.838 seconds (3 rounds), Encoding speed = 1581.09x
-j8:    Average time =   6.995 seconds (3 rounds), Encoding speed = 1545.68x

No matter what compression level I used, I couldn't get it faster than some 4.2 seconds. Only the point where the scaling flattens comes later or earlier.
Btw. the mere time to copy the 40 WAVs (2 GB) to a different folder on the same SSD is ~ 0.3-0.4 secs (copy *.wav wav2 /q), calculation of MD5s is in the 3 seconds ballpark.

P.S.: with "-5 --no-md5-sum" the speed limit here is 3.492 seconds @ -j4.
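
To get a feel for how long a single-threaded MD5 pass over a given amount of audio takes on a particular machine, a rough throughput check can be sketched in Python. Note the assumptions: this uses Python's hashlib, not libFLAC's own MD5 code, it hashes zero-filled buffers from memory, and the function name and buffer sizes are arbitrary.

```python
import hashlib
import time


def md5_throughput_mb_per_s(total_mb=64, chunk_mb=1):
    """Hash `total_mb` MiB of zero bytes in chunks and return MiB per second."""
    chunk = bytes(chunk_mb * 1024 * 1024)
    md5 = hashlib.md5()
    start = time.perf_counter()
    for _ in range(total_mb // chunk_mb):
        md5.update(chunk)
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


rate = md5_throughput_mb_per_s()
print(f"MD5: about {rate:.0f} MiB/s, so 2 GiB would take about {2048 / rate:.1f} s")
```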

Re: More multithreading

Reply #97 – 2023-07-23 11:35:06

Let's be careful not to wander into FLACCL territory, where it encodes at a 999999x rate but initializes for several seconds on every file, ending up slower than FLAC.

Re: More multithreading

Reply #98 – 2023-07-23 12:45:57

Quote from: rutra80 on 2023-07-23 11:35:06

Let's be careful not to wander into FLACCL territory, where it encodes at a 999999x rate but initializes for several seconds on every file, ending up slower than FLAC.

1:02:56 of CDDA on i7-4790K with NVMe:

-j8:
-8 - 3,67s
-7 - 2,86s
-6 - 2,80s
-5 - 2,60s
-4 - 3,96s
-3 - 2,20s
-2 - 3,70s
-1 - 2,96s

-j4:
-8 - 4,91s
-7 - 3,01s
-6 - 2,63s
-5 - 2,04s
-4 - 3,87s
-3 - 2,18s
-2 - 2,84s
-1 - 3,28s

-j2:
-8 - 8,10s
-7 - 5,36s
-6 - 4,81s
-5 - 3,45s
-4 - 3,90s
-3 - 2,61s
-2 - 2,71s
-1 - 2,97s

Yep, something's funky with the scaling already; with -j1 it's fine.

Re: More multithreading

Reply #99 – 2023-07-23 13:01:29

Quote from: ktf on 2023-07-22 21:13:27

Quote from: Porcus on 2023-07-22 18:10:12
If you call the encoder to process multiple files, isn't that where you can multi-thread with very little overhead?
Yes, of course. But that would only benefit the flac command line tool, and I was worried how console output should be made easy to understand.

Suggestion for that case with four concurrent files:

file1 started encoding
file2 started encoding
file3 started encoding
file4 <uses the last thread, output as usual counting up>
file2: wrote 12345678 bytes, ratio=0,543
file1: 33% complete, ratio=0,628 <this is a single status report>
file3: 11% complete, ratio=1,000 <this is a single status report>
file5 <uses the last thread, output as usual counting up>
file4: wrote 23456789 bytes, ratio=0,555
file6 <uses the last thread, output as usual counting up>
file1: wrote 33333333 bytes, ratio=0,666
file5: wrote 11111111 bytes, ratio=0,567
file3: wrote 98765432 bytes, ratio=1,000- <I propose a "-" to signify that the file is smaller than the original even if the difference is beyond the third decimal, and a "+" for, say, 1,00001. But I don't miss the old failure report.>

Quote from: ktf on 2023-07-22 21:13:27

Also, multithreading over files is already possible with various utilities like GNU parallel. The approach with multithreading over a single file can benefit all libFLAC users, and is not achievable with other tools.

Yes - multithreading over multiple files was not at all meant as a substitute for multithreading over a single file. But, if certain single files are hard to improve upon, consider if that will make a difference to the user.
Say, we have taken note that it is hard to make good use of multi-threading a short file to be encoded with low preset. Possibly you could consider the following line of arguments - subject to being anywhere remotely close to the fact, I am quiiiite ignorant here:

If it is just one single file, it will be done in one second anyway, you can get it down to half a second but who cares if you cannot get it down to a third of a second - end-users get impatient over seconds to wait, not over percentages;
Yes it would matter if user has 300 such small files and start encoding them all by invoking flac -0 *.wav, but then speed-up is better achieved by passing one file to one thread
So for -0 encoding, maybe not spawn too many threads per file? Maybe even just one?
... well maybe if the input is big, there is a gain? Which you might not even know in advance, it could be piped. Is there some read buffering going on? What about: for light enough settings (fixed-predictor encoding or so, taking into account whether stereo decorrelation is invoked), do not assign a new thread until you have read B bytes, B being quite a sizable chunk.
So: a file that would be done in the blink of an eye (or where the potential time saved is just the blink of an eye) won't get more than a thread or two anyway.

Maybe this could eliminate the work of trying to improve scenarios where the impact won't matter to the users?
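
The "do not assign a new thread until you have read B bytes" idea can be written down as a tiny rule. This is only a sketch of the proposal above, not anything flac does: the function name, the one-thread starting point and the 32 MiB value for B are all assumptions chosen for illustration.

```python
B = 32 * 1024 * 1024  # hypothetical threshold: bytes of input per additional thread


def threads_allowed(bytes_read, max_threads, bytes_per_thread=B):
    """One thread to begin with, plus one more for every `bytes_per_thread` bytes read."""
    wanted = 1 + bytes_read // bytes_per_thread
    return max(1, min(max_threads, wanted))


for megabytes in (1, 40, 100, 1000):
    n = threads_allowed(megabytes * 1024 * 1024, max_threads=8)
    print(f"{megabytes} MiB read so far: up to {n} thread(s)")
```

Under such a rule a short file never gets past one or two threads, while a long or piped input ramps up to the requested maximum as more of it is read.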

Quote from: ktf on 2023-07-22 21:13:27

Quote
Reason to ask this first is this question about what we should measure - and what utilities to use and read off the numbers.
Most importantly wall time.

Obviously for the end result. But for testing, you don't get much useful extra information from including anything else?
