More multithreading



Re: More multithreading

Reply #25 – 2023-07-13 07:51:42

Sorry, I indeed made a mistake. The v2 binary is functionally identical to the first one; I think I did a copy-paste the wrong way. I'll get back with a new binary.

edit: Here's a new one. Sorry for wasting your time with the previous one.

Re: More multithreading

Reply #26 – 2023-07-13 09:04:09

I hope you don't let this experiment slow down regular single threaded encoding. At least my use case is encoding several files at a time and doing multiple files in separate threads is faster than spreading single file encoding over multiple threads.

Re: More multithreading

Reply #27 – 2023-07-13 10:14:17

The goal of course is to not let this affect single-threading. As you can see from the graphs attached to the first post, that goal has been achieved: FLAC 1.4.3 and this new binary with -j1 perform exactly the same. Of course, there are plenty of environments without POSIX threads (pthreads) so single threading performance remains very important.

Indeed, doing multiple files at once is faster than multithreading within a single file, but the latter is more transparent to the user and to me it seemed easier to properly implement in the flac command line tool. Also, multithreading over files is possible with tools like GNU parallel, so this approach is complementary to that.
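
A minimal sketch of the file-level approach, for comparison: one single-threaded flac process per file, several at a time. The helper names `flac_command` and `encode_all` are made up for illustration, and it assumes a `flac` executable on the PATH; only the `-8` and `-f` options seen elsewhere in this thread are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def flac_command(wav_path, preset=8):
    """Build the command line for one single-threaded encode (hypothetical helper)."""
    return ["flac", f"-{preset}", "-f", wav_path]

def encode_all(wav_paths, jobs=4, run=subprocess.run):
    """Encode several files at once, one flac process per file.

    The threads here only wait on child processes; the actual
    number crunching happens in the separate flac processes.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda path: run(flac_command(path)), wav_paths))
```

Something like `encode_all(["a.wav", "b.wav"], jobs=2)` would then keep two encodes running at once. Tools like GNU parallel do the same job from the shell.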

Re: More multithreading

Reply #28 – 2023-07-13 10:44:58

Results for v3 binary:

Code:

-j1:    Average time =  22.844 seconds (3 rounds), Encoding speed = 473.30x
-j2:    Average time =  18.255 seconds (3 rounds), Encoding speed = 592.27x
-j3:    Average time =   9.570 seconds (3 rounds), Encoding speed = 1129.82x
-j4:    Average time =   6.603 seconds (3 rounds), Encoding speed = 1637.35x
-j5:    Average time =   6.646 seconds (3 rounds), Encoding speed = 1626.76x
-j6:    Average time =   7.094 seconds (3 rounds), Encoding speed = 1524.18x
-j7:    Average time =   6.446 seconds (3 rounds), Encoding speed = 1677.41x
-j8:    Average time =   6.539 seconds (3 rounds), Encoding speed = 1653.46x
-j9:    Average time =   7.046 seconds (3 rounds), Encoding speed = 1534.42x
-j10:   Average time =   7.123 seconds (3 rounds), Encoding speed = 1517.90x
-j11:   Average time =   6.800 seconds (3 rounds), Encoding speed = 1589.92x
-j12:   Average time =   6.286 seconds (3 rounds), Encoding speed = 1719.92x

Scales almost perfectly at the beginning (1 nc thread = 18.3 sec, 2 nc threads = 9.6 sec, 3 nc threads = 6.6 sec), but after that nothing/little is gained. Does the thread management take all the extra time the additional cores could provide?
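
To put numbers on "almost perfectly", here is a small sketch that takes the -j2 time as the baseline for one number-crunching thread (reading "nc" that way is an assumption) and computes speedup and per-worker efficiency from the averages in the table above.

```python
# Average times in seconds from the v3 results above (-j2 to -j5).
times = {2: 18.255, 3: 9.570, 4: 6.603, 5: 6.646}

def worker_scaling(times, base_j=2):
    """Speedup relative to -j2, i.e. relative to one number-crunching thread."""
    base = times[base_j]
    rows = []
    for j, seconds in sorted(times.items()):
        workers = j - 1                      # assumption: one thread is not crunching
        speedup = base / seconds
        rows.append((j, workers, round(speedup, 2), round(speedup / workers, 2)))
    return rows

for row in worker_scaling(times):
    print("-j%d: %d worker(s), speedup %.2f, efficiency %.2f" % row)
```

With these figures the efficiency stays above 0.9 up to -j4 and drops to roughly 0.7 at -j5, which is where the plateau starts.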

Re: More multithreading

Reply #29 – 2023-07-13 11:54:52

Quote from: sundance on 2023-07-13 10:44:58

Does the thread management take all the extra time the additional cores could provide?

There's no thread management really, it just dispatches as much work as it can. Maybe the problem is the housekeeping thread can't keep up. Could you try what happens if you run with the undocumented option --no-md5-sum to see if scaling continues for thread 4, 5, 6 etc.? If that is the case, then maybe MD5 needs to run in its own thread.

Re: More multithreading

Reply #30 – 2023-07-13 12:09:37

Very promising results on v3 by sundance, I'd say. But Case raises an important point. I'm using single-threaded flac.exes to convert multiple files in parallel in foobar2000. If multithreaded encoding is to become the default in the FLAC executable one day, we should inform Peter et al. to change the predefined FLAC conversion dialog in foobar2000 to disable multithreading when two or more files are being converted to FLAC simultaneously. And since, IIRC, the -j switch didn't exist in previous versions, I fear that the multithreaded-by-default flac.exe would break compatibility with older versions in foobar?

Quote from: ktf

Loose mid side does the full calculation once every few frames (once every 0.4 s or something) and then uses the result for the next few frames. That's a dependency between frames and thus threads. Maybe I'll fix that by implementing a different 'loose mid-side algorithm', perhaps the algorithm that ffmpeg uses.

Sounds like a great idea. I've got some time this week, let me know if you can use some assistance in trying out an alternative approach.

By the way, I noticed that FLAC preset -6 deviates a bit from the convex speed-performance hull in your plots. It seems that, by using -r 5 instead of -r 6 in preset 6, one can shift that operating point leftward (i.e., towards faster) along the speed axis, with almost zero degradation of the compression ratio (at least in my experiments). Attached a painted-in estimate of how that would change your plot. Comments (by anyone, that is) appreciated.

Chris

Re: More multithreading

Reply #31 – 2023-07-13 12:36:26

Quote from: C.R.Helmrich on 2023-07-13 12:09:37

If multithreaded encoding is to become the default in the FLAC executable one day

It isn't. FLAC/libFLAC is not only being used on desktops. There is a wide range of hardware this runs on (embedded devices and microcontrollers for example), and the intention is to keep it that way.

Quote from: C.R.Helmrich on 2023-07-13 12:09:37

Sounds like a great idea. I've got some time this week, let me know if you can use some assistance in trying out an alternative approach.

Anyone who wants to contribute code is welcome to do so. A patch through the mailing list or a PR at Github are preferred.

Quote

By the way, I noticed that FLAC preset -6 deviates a bit from the convex speed-performance hull in your plots.

While I did propose a retune at this forum at some point, I'm not sure anymore whether striving for a convex hull is worth changing settings. As you said, people rely on defaults and on certain settings giving certain results. Changing -r 6 to -r 5 probably won't hurt much, but the result is pretty much 'cosmetic'. Also, it could very well be this graph looks different on a different CPU, or even a different architecture. Maybe changing settings so the graph approaches the ideal on my CPU makes it less ideal on another CPU. The heavy-hitters in x86 code are subtly different from ARM64.

Re: More multithreading

Reply #32 – 2023-07-13 12:45:06

Again CDDA -8p -V, 5900x, 12 cores, 24 threads

Code:

v1 vs v3
j1 103x  104x
j2 106x  115x
j3 203x  225x
j4 298x  326x
j5 381x  426x
j6 460x  521x
j7 543x  615x
j8 620x  670x
j9 685x  710x
j10 705x  625x
j11 725x  670x
j12 740x  680x
j13 750x  675x
j14 752x  670x
j15 750x  650x

I triple checked the dip at j10. The first version scaled more evenly here.
It is very nice to see ~150x speed for -8ep in CUETools

Re: More multithreading

Reply #33 – 2023-07-13 14:09:46

@ktf: v3 binary with --no-md5-sum:

Code:

-j1:    Average time =  20.276 seconds (3 rounds), Encoding speed = 533.25x
-j2:    Average time =  18.350 seconds (3 rounds), Encoding speed = 589.20x
-j3:    Average time =   9.644 seconds (3 rounds), Encoding speed = 1121.11x
-j4:    Average time =   6.803 seconds (3 rounds), Encoding speed = 1589.22x
-j5:    Average time =   5.412 seconds (3 rounds), Encoding speed = 1997.66x
-j6:    Average time =   4.863 seconds (3 rounds), Encoding speed = 2223.47x
-j7:    Average time =   6.105 seconds (3 rounds), Encoding speed = 1771.10x
-j8:    Average time =   4.902 seconds (3 rounds), Encoding speed = 2205.78x
-j9:    Average time =   4.737 seconds (3 rounds), Encoding speed = 2282.62x
-j10:   Average time =   4.898 seconds (3 rounds), Encoding speed = 2207.28x
-j11:   Average time =   4.925 seconds (3 rounds), Encoding speed = 2195.18x
-j12:   Average time =   4.860 seconds (3 rounds), Encoding speed = 2224.69x

Another thing that came to mind to explain the performance plateau here: Since I am reading ~2GB of WAV and write 1.1GB of FLAC to an SSD drive (Samsung Evo 860 @ SATA III) in each encoding session, a considerable amount of time might be needed for that. I don't think that this SSD setup is faster than some 600 MB/sec.
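
A back-of-the-envelope check of that idea, using only the rough figures quoted above (about 2 GB read, 1.1 GB written, some 600 MB/s). It assumes reads and writes share one throughput budget and ignores OS caching, so treat it as a sketch rather than a measurement.

```python
def io_floor_seconds(read_mb, write_mb, throughput_mb_s):
    """Lower bound on wall time if reads and writes share one throughput budget."""
    return (read_mb + write_mb) / throughput_mb_s

floor = io_floor_seconds(read_mb=2000, write_mb=1100, throughput_mb_s=600)
print(f"I/O floor: {floor:.1f} s")   # prints "I/O floor: 5.2 s"
```

That lands in the same range as the 4.7 to 4.9 s plateau, so I/O is at least a plausible suspect. The measured times being slightly below the estimate suggests the inputs are on the pessimistic side, or that some of the data comes from cache across the 3 rounds.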

Re: More multithreading

Reply #34 – 2023-07-13 15:18:21

Quote from: ktf on 2023-07-13 11:54:52

Quote from: sundance on 2023-07-13 10:44:58
Does the thread management take all the extra time the additional cores could provide?
There's no thread management really, it just dispatches as much work as it can. Maybe the problem is the housekeeping thread can't keep up. Could you try what happens if you run with the undocumented option --no-md5-sum to see if scaling continues for thread 4, 5, 6 etc.? If that is the case, then maybe MD5 needs to run in its own thread.

So if I run -j2 for 2 threads, there's one thread encoding and one thread for housekeeping? When I use -j2, one thread is using 100%, while the other is only at 8% on my CPU. If I use -j8, I have 7 threads at 100%, and one thread at 35%.

Re: More multithreading

Reply #35 – 2023-07-13 15:40:55

Quote from: Replica9000 on 2023-07-13 15:18:21

So if I run -j2 for 2 threads, there's one thread encoding and one thread for housekeeping? When I use -j2, one thread is using 100%, while the other is only at 8% on my CPU. If I use -j8, I have 7 threads at 100%, and one thread at 35%.

Yes, that is correct. The thing is, you are running with setting -8p, which means each thread has lots to crunch and there is relatively little to do for the first thread (MD5 checksumming, preparing data etc.). sundance is running setting -7, which is much faster, which means the 'housekeeping thread' has much more to do, and scaling stops earlier. When running preset -0, I guess scaling already stops at 2 threads.

To fix this, MD5 calculation would need its own thread, but when to 'add' that thread depends on how much number crunching needs to be done. For a fast preset like -0 through -5, the 3rd thread should probably already be dedicated to MD5. For presets -6 and -7 that would be the 4th thread, for -8 the 5th thread and for settings like -8p or -8e that would be something like the 16th thread.

I'm not sure whether there is a better way to fix this imbalance really.
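
A rough way to estimate where scaling would stop, using the CPU loads Replica9000 reported for -8p (housekeeping thread at 8% with one worker, 35% with seven). The linear model and the function name are assumptions made for illustration, not anything taken from libFLAC.

```python
def saturation_threads(housekeeping_load, workers):
    """Total thread count at which the housekeeping thread would reach 100 %,
    assuming its load grows linearly with the number of busy workers."""
    per_worker = housekeeping_load / workers
    return 1 + 1.0 / per_worker

print(saturation_threads(0.08, 1))   # estimate from the -j2 observation
print(saturation_threads(0.35, 7))   # estimate from the -j8 observation
```

The two observations give about 13.5 and about 21 threads, which brackets the "something like the 16th thread" guess for -8p. For a fast preset the per-worker housekeeping share is much larger, so the same model saturates after only a few threads.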

Re: More multithreading

Reply #36 – 2023-07-13 16:20:17

Quote from: ktf on 2023-07-13 15:40:55

Quote from: Replica9000 on 2023-07-13 15:18:21
So if I run -j2 for 2 threads, there's one thread encoding and one thread for housekeeping? When I use -j2, one thread is using 100%, while the other is only at 8% on my CPU. If I use -j8, I have 7 threads at 100%, and one thread at 35%.

Yes, that is correct. The thing is, you are running with setting -8p, which means each thread has lots to crunch and there is relatively little to do for the first thread (MD5 checksumming, preparing data etc.). sundance is running setting -7, which is much faster, which means the 'housekeeping thread' has much more to do, and scaling stops earlier. When running preset -0, I guess scaling already stops at 2 threads.

To fix this, MD5 calculation would need its own thread, but when to 'add' that thread depends on how much number crunching needs to be done. For a fast preset like -0 through -5, the 3rd thread should probably already be dedicated to MD5. For presets -6 and -7 that would be the 4th thread, for -8 the 5th thread and for settings like -8p or -8e that would be something like the 16th thread.

I'm not sure whether there is a better way to fix this imbalance really.

Is the md5sum calculated every x amount of data encoded, or does it calculate once the whole stream is encoded?

Re: More multithreading

Reply #37 – 2023-07-13 17:49:17

flac-multithreading-v3-win

Code:

timer64.exe v3 -j1 -8p -f in.wav
Global Time  =    55.756

timer64.exe v3 -j2 -8p -f in.wav
Global Time  =    53.016

timer64.exe v3 -j3 -8p -f in.wav
Global Time  =    34.281

timer64.exe v3 -j4 -8p -f in.wav
Global Time  =    31.115

timer64.exe v3 -j5 -8p -f in.wav
Global Time  =    23.207

timer64.exe v3 -j6 -8p -f in.wav
Global Time  =    18.717

timer64.exe v3 -j7 -8p -f in.wav
Global Time  =    15.722

timer64.exe v3 -j8 -8p -f in.wav
Global Time  =    13.413

timer64.exe v3 -j9 -8p -f in.wav
Global Time  =    12.010

timer64.exe v3 -j10 -8p -f in.wav
Global Time  =    10.612

timer64.exe v3 -j11 -8p -f in.wav
Global Time  =     9.801

timer64.exe v3 -j12 -8p -f in.wav
Global Time  =     8.832

timer64.exe v3 -j13 -8p -f in.wav
Global Time  =     8.255

timer64.exe v3 -j14 -8p -f in.wav
Global Time  =     7.622

timer64.exe v3 -j15 -8p -f in.wav
Global Time  =     7.135

timer64.exe v3 -j16 -8p -f in.wav
Global Time  =     6.927

Re: More multithreading

Reply #38 – 2023-07-13 21:34:56

Quote from: Replica9000

Is the md5sum calculated every x amount of data encoded, or does it calculate once the whole stream is encoded?

It must be calculated frame-by-frame, otherwise one would have to store the entire audio input in memory (since no disk might be accessible during encoding), which would make FLAC's RAM consumption unbounded.

I'm not an expert in multithreading implementations, but couldn't the MD5 calculation (and bitstream writing, if not already done so) be moved into the housekeeping/management thread, at least for presets where that thread is mostly idle?

Quote from: ktf on 2023-07-13 12:36:26

... the intention is to keep it that way (single-threaded)

... Changing -r 6 to -r 5 probably won't hurt much, but the result is pretty much 'cosmetic'. Also, it could very well be this graph looks different on a different CPU, or even a different architecture.

The single-threaded and cosmetic aspects make sense, but I doubt the overall shape of the curve will look much different above preset 2 on different platforms/CPUs. The numbers make perfect sense and are well described by O(n) complexity estimation. The main contributors to encoding runtime are the max. LPC order and number of apodizations tried, on any platform, and I didn't change that part of the configuration.

Chris

Re: More multithreading

Reply #39 – 2023-07-13 22:36:41

Quote from: Replica9000 on 2023-07-13 16:20:17

Is the md5sum calculated every x amount of data encoded, or does it calculate once the whole stream is encoded?

MD5 works in chunks.
Precisely when in the process reference FLAC does that calculation I don't know, but as MD5 is calculated from the uncompressed PCM input, it could in principle be "at any time". Most likely when the chunk is loaded into memory.
The verify option will decode the FLAC bitstream to PCM, which is then MD5'ed.
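
"Works in chunks" can be illustrated with Python's `hashlib`: feeding the data piece by piece gives the same digest as hashing it in one go, so nothing but the running state has to stay in memory. The byte string below is only a stand-in for raw PCM; how reference FLAC lays out the samples before hashing is not shown here.

```python
import hashlib

def md5_in_chunks(pcm_bytes, chunk_size=4096):
    """Feed the data to MD5 piece by piece, as a streaming encoder could."""
    md5 = hashlib.md5()
    for start in range(0, len(pcm_bytes), chunk_size):
        md5.update(pcm_bytes[start:start + chunk_size])
    return md5.hexdigest()

pcm = bytes(range(256)) * 1000            # stand-in for raw PCM data
assert md5_in_chunks(pcm) == hashlib.md5(pcm).hexdigest()
```

The chunks have to be fed in stream order, since each update depends on the state left by the previous one.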

Edited.
As for some other topics that came up here:

Default: I agree that multithreading should not be a default. But, it will be harder for a novice user to have to give the appropriate options - indeed, I guess that those who are barely used to .exe files and not so much to command-lines, would want to drag and drop.
In that case, one should maybe just make a flac-multithread.exe that defaults to a multi-threading option ...?
WavPack has a way to rename the executable to invoke options: https://hydrogenaud.io/index.php/topic,122626 . Yeah David credits me for the idea, but it was because there was already since long such a way to invoke debugging. FLAC and WavPack don't have the same history ...

-6 and the convex hull:
I have fallen prey to "eyeballing" the chart myself, not thinking over that time is on a log scale. As far as convexity is a concern, it should be on an un-logged time scale: If I am willing to double the running time from 1 minute to 2 minutes to save B bytes, then nothing says I am willing to wait for 14 more minutes (an octupling of the 2) to save another 3*B bytes.
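
The same example in numbers, as a small sketch: savings per extra minute versus savings per doubling of the running time, with the saved bytes counted in units of B.

```python
import math

# (minutes, bytes saved in units of B) from the example above
points = [(1, 0), (2, 1), (16, 4)]

def gains(points):
    """Savings per extra minute and per doubling of time, segment by segment."""
    out = []
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        per_minute = (s1 - s0) / (t1 - t0)
        per_doubling = (s1 - s0) / math.log2(t1 / t0)
        out.append((round(per_minute, 3), round(per_doubling, 3)))
    return out

print(gains(points))   # [(1.0, 1.0), (0.214, 1.0)]
```

Per doubling, both steps look equally good (1 B each), which is what a log-scaled time axis shows. Per minute, the second step pays less than a quarter of the first.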

Some considerations I made on -6: https://hydrogenaud.io/index.php/topic,123025.msg1016398.html#msg1016398
Point is, it is "as heavy as predictor order 8 goes".
I did test -6r5 vs -6r6 though, and the -r made very little size impact.

Re: More multithreading

Reply #40 – 2023-07-14 00:00:22

-j2 was often bad on this CPU (i5-1135G7, four cores and eight threads) with the first build. Others have posted results where it doesn't make much of a difference, but here it often outright slows it down. The limited results I have with version 1 vs version 3 indicate that the latter is an improvement.

I did a few runs, also let it cool off to "ensure" that -j2 isn't too much affected by some throttling induced by running -j1 right before. Will do more, but reporting -0 figures here.
Table is a bit cryptic: For each -0 -j<N> I did
* pause for 2 minutes to allow the CPU to cool down
* ran three consecutive encodes of the 38 CDs in my signature with the first build
* new pause for 2 minutes
* three consecutive encodes with version 3 of the exe
Then advance the "j".

Numbers quoted are the number of seconds on the "from cool", and then under the "next": how much more the next two runs took, on a presumably hotter CPU. "more" ... with one exception.

-0 (seconds)   first build            version 3
               from cool  next        from cool  next
-j1            124        +3,+6       120        +5,+7
-j2            141        +1,+9       109        +1,+1
-j3            105        +7,+7       105        +10,+9
-j4            102        +9,+9       100        +12,+9
-j5            100        +15,+10     101        +6,+10
-j6            ?          +16,+19     103        +6,+10
-j7            113        −2,+5       103        +8,+11
-j8            104        +5,+18      108        +18,+9
-j9            104        +9,+16      102        +9,+5

So a high thread count was kinda useless with -0. -j3 ... hard to tell from this alone whether the "success" of -j3 is merely down to what happened to -j2.

I also tried -0b4096 --no-md5-sum, and here the "next" on j1 were negative, meaning they took shorter time than the one that had two minutes cooldown first - it might have been that it hadn't "idled" whatever it was doing when I started the .bat and left the computer:

-0b4096 --no-md5-sum   next (first build)   next (version 3)
-j1                    −8,−4                −9,−6
-j2                    +8,+2                +7,+7
-j3                    +7,+12               +9,+13
-j4                    +9,+1                +15,+14
-j5                    +12,+17              +9,+18
-j6                    +7,+11               +12,+9
-j7                    +10,+14              +10,+17
-j8                    +8,+14               +13,+13
-j9                    +8,+16               +8,+12

Not as strikingly bad -j2, but whatever happened to it, it is much better in version 3. Now the evidence that -j3 is the sweet spot (for the fast fixed-predictor setting!) is slightly clearer.

I'm putting on a -0b4096 (with MD5) as well as more common settings for an overnight or over-week-end job.

Re: More multithreading

Reply #41 – 2023-07-14 00:12:45

Quote from: rutra80 on 2023-07-12 00:47:22

15:42 of CDDA on i7-4790K:

-j1 -8ep - 101s
-j2 -8ep - 99s
-j4 -8ep - 34s
-j8 -8ep - 25s

V3:
-j1 -8ep - 103s
-j2 -8ep - 118s
-j4 -8ep - 37s
-j8 -8ep - 26s

Re: More multithreading

Reply #42 – 2023-07-14 00:20:35

Oh my, the jury is sent out again on what makes -j2 worse.

Re: More multithreading

Reply #43 – 2023-07-14 01:17:01

Quote from: Porcus on 2023-07-14 00:20:35

Oh my, the jury is sent out again on what makes -j2 worse.

-j2 is not worse on Replica9000's, music_1's and my Ryzens.
It may even be down to some choice of a modern compiler why older Intels perform a bit unevenly.

Re: More multithreading

Reply #44 – 2023-07-14 03:00:08

1h 43m 16/44.1 file, Ryzen 5850U.
flac git-3e2d9a43 20230712

Code:

      -0      -1      -2      -3      -4      -5      -6      -7       -8
 j1:  3.717s  3.936s  4.175s  4.318s  4.947s  5.872s  8.217s  10.183s  15.206s
 j2:  2.404s  2.395s  2.351s  2.262s  2.879s  3.822s  6.057s   8.070s  13.112s
 j3:  2.525s  2.415s  2.511s  2.270s  2.884s  2.397s  3.407s   4.500s   7.349s
 j4:  2.529s  2.443s  2.564s  2.318s  2.904s  2.529s  2.754s   3.370s   5.439s
 j5:  2.558s  2.385s  2.588s  2.420s  2.944s  2.560s  2.795s   2.933s   4.440s
 j6:  2.604s  2.407s  2.660s  2.393s  3.000s  2.579s  2.797s   2.960s   3.853s
 j7:  2.631s  2.416s  2.640s  2.380s  2.991s  2.558s  2.823s   2.971s   3.438s
 j8:  2.612s  2.433s  2.659s  2.444s  3.043s  2.602s  2.838s   2.967s   3.603s
 j9:  2.684s  2.441s  2.637s  2.385s  3.026s  2.540s  2.874s   3.003s   3.850s
j10:  2.613s  2.425s  2.678s  2.425s  3.019s  2.551s  2.864s   2.977s   3.633s
j11:  2.681s  2.439s  2.753s  2.490s  2.993s  2.537s  2.824s   2.976s   3.566s
j12:  2.691s  2.401s  2.692s  2.420s  3.011s  2.571s  2.805s   3.011s   3.492s
j13:  2.631s  2.440s  2.627s  2.462s  3.009s  2.565s  2.817s   3.000s   3.556s
j14:  2.646s  2.448s  2.648s  2.429s  3.043s  2.595s  2.818s   2.957s   3.576s
j15:  2.672s  2.457s  2.762s  2.419s  3.003s  2.577s  2.848s   2.954s   3.518s
j16:  2.623s  2.473s  2.657s  2.475s  3.024s  2.573s  2.917s   2.953s   3.692s

Code:

      -0p     -1p     -2p     -3p     -4p     -5p     -6p      -7p      -8p
 j1:  3.806s  3.987s  4.224s  5.433s  6.403s  8.354s  16.345s  19.972s  44.046s
 j2:  2.516s  2.466s  2.434s  3.317s  4.293s  6.288s  14.358s  17.957s  42.445s
 j3:  2.586s  2.474s  2.645s  2.498s  4.430s  3.694s   8.293s  10.356s  23.904s
 j4:  2.615s  2.462s  2.732s  2.732s  4.492s  2.954s   6.167s   7.687s  17.831s
 j5:  2.705s  2.491s  2.812s  2.633s  4.470s  2.987s   5.050s   6.239s  14.655s
 j6:  2.712s  2.676s  2.765s  2.674s  4.478s  2.989s   4.385s   5.452s  12.682s
 j7:  2.721s  2.679s  2.771s  2.625s  4.444s  2.974s   3.967s   4.989s  11.348s
 j8:  2.745s  2.563s  2.816s  2.623s  4.404s  3.007s   3.859s   4.485s  10.559s
 j9:  2.736s  2.622s  2.754s  2.633s  4.431s  3.007s   4.558s   5.438s  11.801s
j10:  2.721s  2.581s  2.755s  2.638s  4.415s  2.991s   4.307s   5.215s  12.191s
j11:  2.756s  2.641s  2.824s  2.623s  4.415s  3.025s   4.043s   4.908s  11.589s
j12:  2.769s  2.818s  2.802s  2.628s  4.454s  3.027s   3.968s   4.663s  10.990s
j13:  2.797s  2.669s  2.841s  2.645s  4.450s  2.990s   4.084s   4.509s  10.508s
j14:  2.776s  2.575s  2.781s  2.601s  4.465s  3.018s   4.084s   4.441s  10.017s
j15:  2.738s  2.566s  2.889s  2.598s  4.482s  3.003s   4.148s   4.507s   9.646s
j16:  2.800s  2.569s  2.822s  2.623s  4.443s  3.046s   4.138s   4.515s   9.299s

Re: More multithreading

Reply #45 – 2023-07-14 07:31:12

Thank you all for the results. I do have a few ideas on what can be changed to improve performance further. Might take a while though.

As many are asking for specifics, I'll try to outline the process. The flac command line tool isn't changed much. It accepts the new option and parses it, then passes it to libFLAC. Nothing else is changed. The real magic happens in libFLAC.

libFLAC accepts chunks of PCM data through the FLAC__stream_encoder_process function call. When single threading, this function directly processes the data. As soon as it has got enough samples to fill a single frame, it will process those samples into a frame and write that frame. This involves adding data to the verify queue (if applicable), calculating the MD5 sum, creating a FLAC frame and writing it.

When multithreading, the FLAC__stream_encoder_process call does the adding to the verify queue and the MD5 sum calculating, but then copies the data to a separate data structure and sends a signal to a thread to pick it up. It also checks whether the 'oldest' bit of data has finished processing so it can be written. Sometimes one thread runs faster than the other (because one is interrupted by the OS for example) but we must make sure the oldest thread writes its data first, otherwise the audio data is no longer in the right order. When there is nothing left to be done, the FLAC__stream_encoder_process call returns to the client process.

So, there is one thread (the main thread, which I've called housekeeping thread before) that does the dispatching, MD5 calculation and writing the finished frames, and a bunch of threads that do the converting of PCM samples to FLAC frames. The main problem is that these are almost never balanced: for very fast presets like -0, the MD5 sum calculation takes as much time as converting PCM samples to FLAC frames, so there is no use for more than 1 extra thread. However, for presets like -8p, the main thread has pretty much nothing to do, so when invoked with a low thread number, one thread is idling all the time.
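
The split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not libFLAC code: `zlib.compress` stands in for frame encoding, and the names `encode_frame`, `encode_stream` and `max_pending` are invented for the example.

```python
import hashlib
import zlib
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def encode_frame(pcm):
    # Stand-in for turning PCM samples into a FLAC frame.
    return zlib.compress(pcm)

def encode_stream(frames, threads=4, max_pending=8):
    """Main thread does MD5, dispatching and in-order writing;
    the other threads - 1 workers only encode (threads >= 2 assumed)."""
    md5 = hashlib.md5()
    written = []                             # stands in for the output file
    pending = deque()                        # futures, oldest first
    with ThreadPoolExecutor(max_workers=max(1, threads - 1)) as pool:
        for pcm in frames:
            md5.update(pcm)                  # serial work on the main thread
            pending.append(pool.submit(encode_frame, pcm))
            # Write finished frames, but only ever from the oldest end.
            while pending and (pending[0].done() or len(pending) > max_pending):
                written.append(pending.popleft().result())
        while pending:                       # drain at end of stream
            written.append(pending.popleft().result())
    return written, md5.hexdigest()
```

Because frames are only ever taken from the oldest end of `pending`, the output order matches the input order even when a later frame finishes first. The imbalance is visible here too: with a cheap `encode_frame`, the `md5.update` call on the main thread becomes the limit.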

The only way to fix this problem is to no longer specialise threads too much. I don't want to "cheat" by adding an extra thread when the first one has nothing to do nor add an extra thread for MD5 which may or may not be necessary: the number of threads the user asks for must be the number of threads that is actually spawned.

So my idea is to create two work queues: the main thread adds work to an MD5 queue (which must be picked up by one particular thread, because MD5 calculation cannot happen in parallel) and to a frame queue (which can be picked up by any thread in parallel). That means the main thread has even less to do than it has now, so as soon as the queue is full, it can start working on a frame by itself. As soon as it is finished with that frame, it will go back to managing the other threads. I'll also make sure one thread can leapfrog another, because that is currently not possible. That might improve performance running on CPUs with both performance and efficiency cores like the newest Intel CPUs and a lot of ARM CPUs.
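
A heavily simplified sketch of the two-queue idea, again in Python and again with `zlib.compress` as a stand-in for frame encoding. Several things here are assumptions made to keep it short: the MD5 thread does nothing but MD5, results are collected in a dict and ordered at the end instead of being written as a stream, and all names are invented.

```python
import hashlib
import queue
import threading
import zlib

def encode_two_queues(frames, threads=4):
    """threads counts the main thread too; threads >= 2 is assumed."""
    md5 = hashlib.md5()
    md5_q = queue.Queue()                    # strictly ordered, one consumer
    frame_q = queue.Queue(maxsize=threads)   # any worker may take from this
    done = {}                                # frame index -> encoded bytes

    def encode(index, pcm):
        done[index] = zlib.compress(pcm)     # each index is written exactly once

    def md5_loop():
        for pcm in iter(md5_q.get, None):    # None is the stop marker
            md5.update(pcm)

    def frame_loop():
        for index, pcm in iter(frame_q.get, None):
            encode(index, pcm)

    others = [threading.Thread(target=md5_loop)]
    others += [threading.Thread(target=frame_loop) for _ in range(threads - 2)]
    for t in others:
        t.start()

    for index, pcm in enumerate(frames):
        md5_q.put(pcm)
        try:
            frame_q.put_nowait((index, pcm))
        except queue.Full:
            encode(index, pcm)               # queue full: main thread helps out

    while True:                              # main thread drains what is left
        try:
            encode(*frame_q.get_nowait())
        except queue.Empty:
            break
    md5_q.put(None)
    for _ in range(threads - 2):
        frame_q.put(None)
    for t in others:
        t.join()
    return [done[i] for i in range(len(done))], md5.hexdigest()
```

The main thread only encodes when the frame queue is full, so it goes back to dispatching as soon as there is room again. Frames can finish in any order here, which is the "leapfrogging" part; the ordering is restored by the index.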

Re: More multithreading

Reply #46 – 2023-07-14 09:57:24

Quote from: Porcus

Some considerations I made on -6: https://hydrogenaud.io/index.php/topic,123025.msg1016398.html#msg1016398
Point is, it is "as heavy as predictor order 8 goes".
I did test -6r5 vs -6r6 though, and the -r made very little size impact.

Thanks, Porcus, for pointing me to that study of yours. Quoting you: "Why is the difference to -5 small and the difference to -7 large? It is not the -r5 to -r6. In the 38 CD corpus in my signature, -5 -r6 improved 0.0044 percent." That is much too little improvement for quite a few percent encoder slowdown, if you ask me.

Thanks for the explanation, ktf. Your plan sounds worth trying.

Chris

Re: More multithreading

Reply #47 – 2023-07-14 13:30:51

Some more assorted comments:

@ktf on the plan forward:
* Although you want a "-j that works no matter settings", is that really an imperative? If there is very little to gain from multi-threading, then say "-j 4 will consider using 4 threads; it may use less if it doesn't think it is worth it"?
I'd say that if there is nothing gained in splitting the housekeeping task, then don't do it.
Maybe - like how "-M" tries to be smart and "-m" does it brute-force - there could be a -J4 for "allow up to 4 threads, the encoder decides if it is worth it", while -j4 uses 4 threads (likely not useful for settings below xx, but if user really wants ...)
* Also, if user goes outside presets, then they cannot expect options to be doing a good job. If I try -8 -b4095 and am surprised that the result is so much worse than plain -8 - I have done precisely that - then that is up to me to learn why it wasn't very efficient.
But this makes for a case to tune presets so that they are more multi-threading friendly, at least if that doesn't hurt single-threading much. Say, block size for -0, -1, -2. And maybe also retune the scope of -M, so that it fits multi-threading.

@Wombat on "old" intels: The CPU in question was launched 2020 Q3, that isn't ... old. Maybe there is something weird about it, but it isn't that it is lacking the last three generations of instruction sets.
And that is why the bad result on -j2 surprises me, as nobody else has posted anything that bad.

Quote from: C.R.Helmrich on 2023-07-14 09:57:24

That is much too little improvement for quite a few percent encoder slowdown, if you ask me.

But -r6 doesn't make for the slowdown - at least on my computers. It is the subdivide_tukey(2) that takes more time.
YMMV on material and CPU, but I tried one 1.3 GB compilation (same as used here, same computer too)
-5 is a 30 second job. The difference between -5r5 and -5r6 was half a second. The difference between -5r1 and -5r5 was half a second too. But going -5r7 cost a few seconds.
-6 is a 40 second job. So it isn't the -r (up to 6).

Re: More multithreading

Reply #48 – 2023-07-14 14:51:32

Quote from: Porcus on 2023-07-14 13:30:51

@Wombat on "old" intels: The CPU in question was launched 2020 Q3, that isn't ... old. Maybe there is something weird about it, but it isn't that it is lacking the last three generations of instruction sets.
And that is why the bad result on -j2 surprises me, as nobody else has posted anything that bad.

Somehow I was thinking about an i5-7500T you used in a different test. Your newer one even has AVX-512 support.
Sundance, with his older 8700, also has a faster -j2.

Re: More multithreading

Reply #49 – 2023-07-14 15:19:58

Quote from: Porcus on 2023-07-14 13:30:51

The difference between -5r5 and -5r6 was half a second. The difference between -5r1 and -5r5 was half a second too. But going -5r7 cost a few seconds.

Hm well, not sure about the latter, after a couple of re-runs. Maybe r7 is cheap too.

@ktf : Is it so that if fine partitioning is not needed (so that size impact of -r<high> is small) then time impact is by and large small as well? I think you once explained that -r 8 does indeed partition in 2^8 whether or not that helps, indicating that the "time cost" is sunk before one knows whether it was any use of it.
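
For scale, taking "-r 8 partitions in 2^8" at face value: a small sketch of how many partitions that is and how many samples each one holds. The block size of 4096 is an assumption for the example; other block sizes just change the second number.

```python
def rice_partitions(block_size, partition_order):
    """Number of partitions and samples per partition for one partition order."""
    count = 2 ** partition_order
    return count, block_size // count

for r in (5, 6, 8):
    count, size = rice_partitions(4096, r)
    print(f"-r {r}: {count} partitions of {size} samples")
# -r 5: 32 partitions of 128 samples
# -r 6: 64 partitions of 64 samples
# -r 8: 256 partitions of 16 samples
```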

Anyway if someone feels like testing it: https://hydrogenaud.io/index.php/topic,123025.msg1030124.html#msg1030124

@Wombat : Yeah, and I also use an i5-6300U, launched 2015. Used in a WavPack multithreading test.
