More multithreading

Re: More multithreading

Reply #50 – 2023-07-14 15:36:51

Tried -0b4096, -5, -7 and -8. Confirming that v3 improves -j2. The differences are large enough that I won't bother with further comparisons between the two.

Again, cooldown and three consecutive runs.

j:	-j1	next	-j2	next	-j3	next	-j4	next	-j5	next	-j6	next	-j7	next	-j8	next	-j9	next

0b4096, v1	118	+0,+2	101	+7,+4	89	+4,+7	88	+1,+10	87	+1,+7	86	+4,+14	86	+2,+11	87	+3,+6	87	+5,+6
0b4096, v3	118	+1,+3	90	+5,+8	88	+5,+8	87	+8,+10	86	+9,+8	85	+7,+8	88	+2,+4	87	+2,+7	86	+5,+6

-5, v1	169	+4,+3	164	+0,-2	104	+3,+7	93	+5,+8	96	+2,+6	93	+5,+11	98	+2,+5	93	+8,+11	94	+6,+6
-5, v3	172	+3,-1	143	+4,+0	101	-0,+3	92	+5,+9	93	+4,+10	93	+7,+12	92	+8,+10	92	+7,+10	92	+8,+10

-7, v1	272	+1,+3	275	+1,+0	166	+0,+1	136	+7,+13	126	+7,+8	117	+1,+6	111	+9,+13	111	+1,+8	109	+8,+29
-7, v3	272	+8,+2	245	-0,-2	156	+7,+9	135	+9,+12	119	+7,+9	106	+1,+12	108	+1,+11	110	+8,+12	108	+7,+12

-8, v1	(*)404	(*)+something	486	+2,+9	250	+7,+6	213	+9,+12	197	+6,+8	174	+2,+19	164	+6,+10	141	+1,+15	143	+6,+8
-8, v3	423	+1,+1	391	+2,+6	255	+4,+3	222	+1,+5	188	+2,+37	184	-0,-5	149	+1,+18	139	+9,+12	140	+1,+13

(*) The "404" is unreliable: it comes from a re-run, possibly on an even colder CPU. At first I got a nonsense result where the first run took about 437 seconds and the two immediately following runs (on a heated CPU) took less. Something must have kept the CPU busy during those 437 seconds.
Since the CPU had more time to cool down when I redid it, the 404 might be reading a bit low. But since even the suspiciously high -8j1 was still ten percent faster than the fastest -8j2, it does confirm that j2 was much slower in the version 1 exe.

Reply #51 – 2023-07-14 15:38:36

Quote from: Porcus on 2023-07-14 13:30:51

* Although you want a "-j that works no matter settings", is that really an imperative?

That's not really what I said. I'd like to improve multi-threading by not having threads that idle much, and I also don't want to spawn more threads than asked for by -j. I could say: the user asked for four threads and one is mostly idling, so I'll spawn a fifth to compensate for the idling, but I don't want that.

Quote

If there is very little to gain from multi-threading, then say "-j 4 will consider using 4 threads; it may use less if it doesn't think it is worth it"?

The goal is to try to make this scale as well as it possibly can, without touching single-threaded behaviour.

Quote

I'd say that if there is nothing gained in splitting the housekeeping task, then don't do it.

There is potentially a lot of gain possible.

Quote from: Porcus on 2023-07-14 13:30:51

But -r6 doesn't make for the slowdown - at least on my computers. It is the subdivide_tukey(2) that takes more time.

I agree, it is probably the two extra apodizations that make this so much slower.

Quote from: Porcus on 2023-07-14 15:19:58

@ktf : Is it so that if fine partitioning is not needed (so that size impact of -r<high> is small) then time impact is by and large small as well? I think you once explained that -r 8 does indeed partition in 2^8 whether or not that helps, indicating that the "time cost" is sunk before one knows whether it was any use of it.

Depends. If an incompatible blocksize is chosen, max -r is capped anyway. With a compatible blocksize, the largest part of the time impact is looking for the optimal ordering, so yes, that time is spent anyway, independent of whether it is used. However, the bitwriter is a little slower with more partitions. I can't remember whether that is at all measurable without instrumentation.
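
As a rough illustration of the blocksize cap mentioned above: a Rice partition order r splits a block into 2^r equal partitions, so the blocksize has to be evenly divisible by 2^r. The sketch below assumes that this divisibility is what "compatible" means here. It is not libFLAC's code, and the function name is made up for the example.

```c
#include <stdint.h>

/* Hypothetical helper, not part of libFLAC: largest partition order
 * that can be used for a blocksize, given the order requested with -r. */
uint32_t usable_partition_order(uint32_t blocksize, uint32_t requested)
{
	uint32_t order = 0;

	if (blocksize == 0)
		return 0;
	/* Each extra order doubles the number of partitions, so the
	 * remaining blocksize must stay evenly divisible by two. */
	while (order < requested && (blocksize & 1u) == 0) {
		order++;
		blocksize >>= 1;
	}
	return order;
}
```

Under that assumption, a blocksize of 4096 can use the full -r 8, while a blocksize of 1000 (8 x 125) is capped at order 3.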

Reply #52 – 2023-07-14 16:11:50

i3-12100, 16GB RAM, NVMe SSD (~2.7GB/s write, ~3.3GB/s read), recompress a CDDA flac image to a new file, using PowerShell measure-command totalseconds.

v3 -8
wrote 460350140 bytes
j1 13.889325
j2 10.5771965
j3 5.3922851
j4 3.9220238
j5 4.0600122
j6 4.0264503
j7 4.1554986
j8 4.0284002

v3 -8p
wrote 460143727 bytes
j1 41.0064016
j2 37.8299355
j3 18.7853751
j4 13.4384533
j5 13.5461772
I think there is no need to test up to j8.

v2 -8
wrote 460350140 bytes
j1 14.3544608
j2 10.7919867
j3 5.6689622
j4 3.9546462
j5 4.0161195

v2 -8p
wrote 460143727 bytes
j1 41.0112598
j2 37.7006967
j3 18.9425906
j4 12.8517866
j5 10.4393538
j6 10.5004978
Oh, v2 is better for me with -8p.

Reply #53 – 2023-07-14 19:03:02

Quote from: ktf on 2023-07-11 18:33:30

Sure. Source is at https://github.com/xiph/flac/pull/634 (edit: https://github.com/ktmf01/flac/tree/pthread2 more specifically). Binary is attached, but static binaries on Linux are always less portable than on Windows, so I hope it works.

Code: [Select]

$ ./flacv1
./flacv1: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by ./flacv1)
./flacv1: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by ./flacv1)

Code: [Select]

$ sudo apt-get install libc6
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libc6 is already the newest version (2.31-13+deb11u6).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.

What should I do? Thanks.

Reply #54 – 2023-07-14 19:18:37

Quote from: bennetng on 2023-07-14 19:03:02

[...]
What should I do? Thanks.

Update your OS. Are you running Debian old stable?

Reply #55 – 2023-07-14 19:34:26

Quote from: Replica9000 on 2023-07-14 19:18:37

Update your OS. Are you running Debian old stable?

It is already the latest one I can download and use, does it mean I must use a different distribution?
https://mxlinux.org/download-links/
MX-21.3_x64 “ahs”, an “Advanced Hardware Support” release for very recent hardware, with 6.0 kernel and newer graphics drivers and firmware. 64 bit only. Works for all users, but especially if you use AMD Ryzen, AMD Radeon RX graphics, or 9th/10th/11th generation Intel hardware.

Reply #56 – 2023-07-14 20:29:45

Quote from: bennetng on 2023-07-14 19:34:26

Quote from: Replica9000 on 2023-07-14 19:18:37
Update your OS. Are you running Debian old stable?
It is already the latest one I can download and use, does it mean I must use a different distribution?
https://mxlinux.org/download-links/
MX-21.3_x64 “ahs”, an “Advanced Hardware Support” release for very recent hardware, with 6.0 kernel and newer graphics drivers and firmware. 64 bit only. Works for all users, but especially if you use AMD Ryzen, AMD Radeon RX graphics, or 9th/10th/11th generation Intel hardware.

I'm not familiar with MX, but it appears to be based on Debian stable. Debian 11 is now old stable. You might need to update your repositories. Current Debian stable has libc 2.36.
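
To make the version mismatch concrete: the loader error above says the binary needs glibc symbol versions 2.33 and 2.34, while the installed libc6 is 2.31. A minimal sketch of that comparison follows; the helper name is invented for the example and it only handles plain "major.minor" prefixes.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if a glibc version string such as
 * "2.31" is at least major.minor, 0 if it is older or unparsable. */
int glibc_at_least(const char *have, int major, int minor)
{
	int have_major = 0, have_minor = 0;

	if (sscanf(have, "%d.%d", &have_major, &have_minor) != 2)
		return 0;
	if (have_major != major)
		return have_major > major;
	return have_minor >= minor;
}
```

On a glibc system the running version string can be obtained from gnu_get_libc_version() in <gnu/libc-version.h> and passed in as `have`. With the numbers from this thread, glibc_at_least("2.31", 2, 34) is false and glibc_at_least("2.36", 2, 34) is true.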

Reply #57 – 2023-07-14 20:51:05

OK, thanks, and I just found the page below, so looks like there is no need to do everything from scratch.
https://mxlinux.org/wiki/system/upgrading-from-mx-21-to-mx-23-without-reinstalling/

Reply #58 – 2023-07-15 08:04:32

Quote from: bennetng on 2023-07-14 16:11:50

i3-12100, 16GB RAM, NVMe SSD (~2.7GB/s write, ~3.3GB/s read), recompress a CDDA flac image to a new file, using PowerShell measure-command totalseconds.

Windows v1
-8
wrote 460350140 bytes
j1 13.7896849
j2 10.8577313
j3 5.6010285
j4 4.0480883
j5 4.0507501

-8p
wrote 460143727 bytes
j1 40.9393991
j2 37.7043669
j3 19.8624566
j4 12.3921615
j5 10.8594597
j6 10.7756123

Linux v1, using "time" command showing "real"
-8
wrote 460350151 bytes
j1 12.924s
j2 10.143s
j3 5.227s
j4 4.136s
j5 4.113s

-8p
wrote 460143729 bytes
j1 36.962s
j2 35.696s
j3 18.382s
j4 13.629s
j5 11.572s
j6 12.001s

So yes, v3 is worse than v2 and v1 in -8p.

Reply #59 – 2023-07-15 11:56:41

Quote from: ktf on 2023-07-14 15:38:36

Quote from: Porcus on 2023-07-14 13:30:51
But -r6 doesn't make for the slowdown - at least on my computers. It is the subdivide_tukey(2) that takes more time.
I agree, it is probably the two extra apodizations that make this so much slower.

Hmm, on my (ancient, agreed) mobile Intel i7 M620 with two cores and HyperThreading and 3 WAVs encoded in parallel in foobar, I saw about 5% speedup, in several tries, when going from -r 6 to -r 5 with FLAC 1.4.3. But alright then, apparently less than that on other systems, and like I wrote earlier, the main contributors to encoding runtime are the max. LPC order and number of apodizations tried, on any platform. For the record, using LPC order 10 instead of 8 (indeed the obvious approach) made preset 6 exactly as slow as preset 7 with order 12 on my laptop. I assume that's because max. order 12 is heavily code-optimized and max. order 10 isn't? Does anyone know?

Btw, back to the topic subject: When in doubt, I'd much rather aim for maximum possible multithreading efficiency with few (like 4 or so) threads than with more than 8 threads. With a dozen CPU cores or more available, one should probably use file-parallel encoding anyway. In the video coding project that I'm currently contributing to (VVenC, in case anyone's interested), we noticed that the multithreading performance doesn't scale too well at high thread and core counts on some CPUs. Since it does on other CPUs, the CPU architecture itself might be the reason.

Chris

Reply #60 – 2023-07-15 17:04:15

Quote

I hope you don't let this experiment slow down regular single threaded encoding. At least my use case is encoding several files at a time and doing multiple files in separate threads is faster than spreading single file encoding over multiple threads.

Just my 2 cents: Multiple file threading (e.g. in foobar) is faster here too, but I would definitely vote "YEA" for ktf's efforts in single file multithreading, especially if single file performance is not affected. I guess I'm not the only one who uses "flac.exe" on a simple command line, in a script or in a simple tool that calls the flac binary. All those scenarios benefit a lot from a flac.exe that runs at 4x its single thread speed...

Reply #61 – 2023-07-16 13:02:18

Well done building multithreading into libflac, bet that was a pain

One of input, output, MD5 and encode is the bottleneck at any given time (probably the same thing throughout), ideally the threading model would automatically prioritise the bottleneck to minimise idle time.

My first thought is no specialised threads (aside from housekeeping at the start of FLAC__stream_encoder_process to make sure that the first thread to read is the one, if any, that has partial unprocessed data from the previous call). If a thread handles a frame, it does everything required for that frame, to keep things in the fastest cache possible, preferably L1/L2. Mutexes for input, output and MD5 keep them serial (they're the only serial things). Keep track of a frame's start location and the global input/output/MD5 locations to determine what should be prioritised as tasks get completed. MD5 should probably take priority over encode as it's serial, but if one thread is hashing, another can encode first (and even write before hashing if it's still not its turn to hash).

I think that's as good as it gets when each state is handled in frame-sized chunks. MD5 and output both being serial but arbitrary order started me down a line of thinking about requiring a heuristic for optimal priority, however if either is the bottleneck the opportunity to pick quickly disappears as the wavefront for each will be on different threads.

There may be a benefit to two working frames per thread so that a thread can work on something while the other frame is stalled for whatever reason. It might be beneficial when the bottleneck changes over time, but mostly it papers over the idle time that would otherwise be present when extra threads cannot be spawned and we're dealing with frame-sized chunks always. Alternatively ignore this complexity and the user can DIY this behaviour by setting a higher thread count than they have hardware threads.

Haven't considered verify step, but that should be easy enough to add to the above model.

Quote from: ktf on 2023-07-11 17:30:15

Quote from: Porcus on 2023-07-11 16:38:09
It seems there is a certain minimum amount of work that needs to go in a thread-task, otherwise the overhead completely swamps any possible gain. So, if you set a small blocksize, for example 32, multithreading shows massive negative gains.
Maybe that can be countered by choosing some arbitrary minimum number of samples a thread handles per iteration, for example min_samples=4096 blocksize=32 would mean a single thread handles the next chunk of 128 frames.
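
The arithmetic behind that example can be sketched as follows. The function and parameter names are made up for illustration and nothing here is taken from libFLAC.

```c
#include <stdint.h>

/* Hypothetical helper: number of consecutive frames one thread takes
 * per iteration so that it handles at least min_samples samples. */
uint32_t frames_per_task(uint32_t min_samples, uint32_t blocksize)
{
	if (blocksize == 0)
		return 0;
	if (blocksize >= min_samples)
		return 1;
	/* Round up so a task never falls below min_samples. */
	return (uint32_t)(((uint64_t)min_samples + blocksize - 1) / blocksize);
}
```

With min_samples=4096 and blocksize=32 this gives the 128 frames per chunk mentioned above, and with the default-sized blocks of 4096 it falls back to one frame per task.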

Reply #62 – 2023-07-16 22:03:41

Quote from: ktf on 2023-07-14 15:38:36

Quote from: Porcus on 2023-07-14 13:30:51
* Although you want a "-j that works no matter settings", is that really an imperative?
That's not really what I said. I'd like to improve multi-threading by not having threads that idle much, and I also don't want to spawn more threads than asked for by -j.

But you could spawn less threads?
Say, choose to implement -j7 to mean "up to 7", where the selection algorithm could be subject to change. And then maybe let -j7,7 mean "I ordered seven!", like -r7,7 works. This of course depends on whether you are comfortable about releasing a 1.5.0 with a "crude" selection algorithm.

You have quite some choice here, because reference FLAC is not at all consistent in applying numerical arguments. -l7 and -r7 mean "at most 7", but there is no "-l7,7"; on the other hand, -q7 means "exactly 7" and there is no -q6,7 to force a range. There is however a -q0 for "let encoder decide".

Reply #63 – 2023-07-17 00:08:17

Quote from: Porcus on 2023-07-16 22:03:41

Quote from: ktf on 2023-07-14 15:38:36
Quote from: Porcus on 2023-07-14 13:30:51
* Although you want a "-j that works no matter settings", is that really an imperative?
That's not really what I said. I'd like to improve multi-threading by not having threads that idle much, and I also don't want to spawn more threads than asked for by -j.
But you could spawn less threads?
Say, choose to implement -j7 to mean "up to 7", where the selection algorithm could be subject to change. And then maybe let -j7,7 mean "I ordered seven!", like -r7,7 works. This of course depends on whether you are comfortable about releasing a 1.5.0 with a "crude" selection algorithm.

You have quite some choice here, because reference FLAC is not at all consistent in applying numerical arguments. -l7 and -r7 mean "at most 7", but there is no "-l7,7"; on the other hand, -q7 means "exactly 7" and there is no -q6,7 to force a range. There is however a -q0 for "let encoder decide".

The selection algorithm could only choose what is optimal before the task starts. If I'm not mistaken, once a task is using x amount of threads, that can't be changed until the task ends. Maybe having -j7 use 7 threads, and have -j0 be automatic/optimal (probably run 1 thread per physical core/fpu).
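
A minimal sketch of that idea, purely hypothetical since no such -j0 behaviour is confirmed anywhere in this thread: the thread count is resolved once, before encoding starts, and 0 stands for "automatic".

```c
/* Hypothetical mapping of a -j argument to a worker count:
 * 0 means "automatic" (one thread per physical core, as suggested
 * above), anything else is taken literally. */
unsigned resolve_thread_count(unsigned requested, unsigned physical_cores)
{
	if (requested == 0)
		return physical_cores > 0 ? physical_cores : 1;
	return requested;
}
```

Finding the number of physical cores is platform-specific and deliberately left out; the caller is assumed to supply it.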

Reply #64 – 2023-07-17 08:16:46

Quote from: cid42 on 2023-07-16 13:02:18

Well done building multithreading into libflac, bet that was a pain

Not really. I'm dreading implementation of Ogg FLAC metadata editing way more actually.

Quote

If a thread handles a frame it does everything required for that frame to keep things in the fastest cache possible, preferably L1/L2.

I think threading overhead is way more important than having stuff in L1/L2. The occasional stalls (and associated context switch) is way more expensive than having to load stuff from main memory more often.

Quote

Mutex for input output and MD5 to keep them serial (they're the only serial things).

As far as I know mutexes are not meant to keep things serial, they are to lock things.

Quote

There may be a benefit to two working frames per thread

That is the exact difference between v1 and v3.

Quote from: Porcus on 2023-07-16 22:03:41

But you could spawn less threads?

Yes, but I'd rather first try to get this to scale properly.

Reply #65 – 2023-07-17 11:37:34

Quote from: Replica9000 on 2023-07-17 00:08:17

The selection algorithm could only choose what is optimal before the task starts. If I'm not mistaken, once a task is using x amount of threads, that can't be changed until the task ends.

I'd guess that there is room for a pretty good solution even if you constrain yourself to making the choice once, when the executable is started, from the other options passed (like "-0").
(Except, the application has so many options that taking them all as input to the threads selection in a smart way, would be quite a job. But I'd say it would be good enough, if you got something that handles the numerical presets -0 to -8 (and above that, just go full steam I guess?) and with/without --verify?)

Reply #66 – 2023-07-17 11:52:38

Quote from: ktf on 2023-07-17 08:16:46

As far as I know mutexes are not meant to keep things serial, they are to lock things.

When multiple threads want to use a resource simultaneously, locking is the easiest way to ensure that they form a queue instead of a free-for-all. If a global variable input_loc kept track of where the next read is in samples, a lock ensures that 4 threads trying to simultaneously read blocksize=1000 see the correct one of input_loc=0,1000,2000,3000 and fread in the right order, instead of them all probably seeing input_loc=0 and freading in arbitrary order.
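
A small sketch of just that input_loc claim, using a POSIX mutex. The names input_m and input_loc follow the description above; claim_next_frame is a made-up helper and the actual read is only indicated by a comment.

```c
#include <pthread.h>
#include <stdint.h>

/* Shared read position in samples, protected by input_m. */
static pthread_mutex_t input_m = PTHREAD_MUTEX_INITIALIZER;
static uint64_t input_loc = 0;

/* Claim the next frame: returns its start position and advances
 * input_loc. In a real encoder the read from the input would also
 * happen while the lock is held, so that reads occur in claim order. */
uint64_t claim_next_frame(uint32_t blocksize)
{
	uint64_t frame_loc;

	pthread_mutex_lock(&input_m);
	frame_loc = input_loc;
	input_loc += blocksize;
	pthread_mutex_unlock(&input_m);
	return frame_loc;
}
```

Four workers calling claim_next_frame(1000) at the same time each get a different one of 0, 1000, 2000 and 3000; without the lock, several of them could read the same input_loc before any of them updates it.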

Keep track of the location of input, output and MD5; when it's time for a worker to interact with one of them, it locks it first. A mutex is required for input. Technically, output and MD5 don't require mutexes, as we're keeping track of unique frame locations and only one thread should interact at a time, but I believe explicitly using a mutex updates the thread-local view of a global variable, which may otherwise be an old cached value and may result in a stall or at least a delay (could be wrong on that point).

Rough pseudocode ignoring verify step and not including wake/sleep mechanism:

Code: [Select]

enum{IDLE, INPUT_READ, ENCODED, WRITTEN};

struct{//shared by all workers
	mutex input_m, output_m, md5_m;
	uint64_t input_loc, output_loc, md5_loc;//position in samples of the next read/write/hash
} globals;

struct{//one per worker
	int status;
	uint64_t frame_loc;//start position in samples of the frame this worker holds
} worker;

while(1){//worker loop
	if(status==IDLE){
		lock input
		frame_loc=input_loc
		read the next frame's worth of input
		input_loc+=blocksize
		unlock input
		status=INPUT_READ
	}
	else if(frame_loc==md5_loc){//our turn to hash, whichever stage the frame is in
		lock md5
		update hash
		md5_loc+=blocksize
		unlock md5
	}
	else if(status==INPUT_READ){
		encode
		status=ENCODED
	}
	else if(status==ENCODED && output_loc==frame_loc){
		lock output
		write
		output_loc+=blocksize
		unlock output
		status=WRITTEN
	}
	else if(status==WRITTEN && frame_loc<md5_loc)//written and hashed: frame is done
		status=IDLE
	else
		;//waiting for its turn to md5 or write
}

This ensures the input->encode->output order for a frame and ensures the seriality of I/O/MD5 but keeps the md5 stage of a frame floating (could be done after any of input/encode/output) to try and minimise idle time.

In an actual implementation along the lines of the above there would also be a worker with PARTIAL status containing unencoded samples from the previous FLAC__stream_encoder_process call. This FLAC__stream_encoder_process call would have to make sure that the PARTIAL worker if present is the first to read the input.

Reply #67 – 2023-07-18 14:52:05

15:42 of CDDA on 2x Xeon E5620 NUMA system:

V1:
-j1 -8ep - 342s
-j2 -8ep - 337s
-j4 -8ep - 116s
-j8 -8ep - 87s
-j9 -8ep - 74s
-j10 -8ep - 65s
-j11 -8ep - 59s
-j12 -8ep - 55s
-j13 -8ep - 49s
-j14 -8ep - 50s
-j15 -8ep - 43s
-j16 -8ep - 40s

V3:
-j1 -8ep - 353s
-j2 -8ep - 328s
-j4 -8ep - 122s
-j8 -8ep - 79s
-j9 -8ep - 70s
-j10 -8ep - 62s
-j11 -8ep - 56s
-j12 -8ep - 51s
-j13 -8ep - 47s
-j14 -8ep - 44s
-j15 -8ep - 41s
-j16 -8ep - 39s

On 8 threaded i7-4790K I was able to shed another 2s by running 16 threads (HyperThreading inefficiency?) but it seems to be the limit - how about removing it?

Reply #68 – 2023-07-20 16:03:15

After spending quite a bit of time trying some other approaches, here is a new binary.

A word of warning first: the multithreading code got quite a bit more complicated here, and I haven't tested thoroughly yet, so it might hang or create corrupt files every now and then. Please use with caution and only for benchmarking/testing.

The changes mainly focus on making threading more flexible and making better use of CPU resources. Whereas previous binaries saw (almost) no speed boost with settings like -8p -j2 because 1 thread was mostly idle, it should now pretty much fully utilize 2 cores. Also, using more than 2 cores for fast presets like -0 should now help, because MD5 is split off from the main thread into a worker thread.

As you can see in the PDF, v4 improves -j2 for all settings, except settings -1 and -4. Improving those requires a separate solution that will be rolled out later.

Reply #69 – 2023-07-20 17:44:00

flac-multithreading-v4-win
AMD Ryzen 9 5950X (16 Cores 32 Threads)

Code: [Select]

timer64.exe v4 -j1 -8p -f in.wav
Global Time  =    60.407

timer64.exe v4 -j2 -8p -f in.wav
Global Time  =    37.956

timer64.exe v4 -j3 -8p -f in.wav
Global Time  =    25.652

timer64.exe v4 -j4 -8p -f in.wav
Global Time  =    19.207

timer64.exe v4 -j5 -8p -f in.wav
Global Time  =    16.313

timer64.exe v4 -j6 -8p -f in.wav
Global Time  =    14.022

timer64.exe v4 -j7 -8p -f in.wav
Global Time  =    12.405

timer64.exe v4 -j8 -8p -f in.wav
Global Time  =    10.840

timer64.exe v4 -j9 -8p -f in.wav
Global Time  =    10.399

timer64.exe v4 -j10 -8p -f in.wav
Global Time  =     8.999

timer64.exe v4 -j11 -8p -f in.wav
Global Time  =     8.374

timer64.exe v4 -j12 -8p -f in.wav
Global Time  =     7.724

timer64.exe v4 -j13 -8p -f in.wav
Global Time  =     7.558

timer64.exe v4 -j14 -8p -f in.wav
Global Time  =     6.814

timer64.exe v4 -j15 -8p -f in.wav
Global Time  =     7.532

timer64.exe v4 -j16 -8p -f in.wav
Global Time  =     6.840

Reply #70 – 2023-07-20 18:28:59

flac git-1357f844 20230720

run with -8p

Code: [Select]

 -j1: 0m43.870s
 -j2: 0m24.211s
 -j3: 0m17.975s
 -j4: 0m14.690s
 -j5: 0m12.686s
 -j6: 0m11.325s
 -j7: 0m10.291s
 -j8: 0m9.530s
 -j9: 0m9.571s
-j10: 0m9.460s
-j11: 0m9.365s
-j12: 0m9.213s
-j13: 0m9.131s
-j14: 0m9.076s
-j15: 0m9.003s
-j16: 0m8.999s

Reply #71 – 2023-07-20 19:28:42

Quote from: music_1 on 2023-07-20 17:44:00

flac-multithreading-v4-win
AMD Ryzen 9 5950X (16 Cores 32 Threads)
[...]

I don't think the numbers are reliable enough to draw conclusions just by themselves, but comparing with v3, it seems v4 doesn't scale deeper, but it gets there with significantly less threads. -j9 with v3 has about the same time as -j7 with v4, and all higher thread counts seem to follow the same pattern: v4 does things in the same time as v3 with 2 threads less. This is better than I'd hoped for.

Quote from: Replica9000 on 2023-07-20 18:28:59

flac git-1357f844 20230720

run with -8p
[...]

This seems to get there with 1 thread less until 8 threads, which is what I expected. It looks like the behaviour observed previously, where using a number of threads higher than the core count increases the time used, is no longer there.

All in all, not bad I'd say.

Reply #72 – 2023-07-20 20:28:36

My results with the v4 binary:

Code: [Select]

-j1:	Average time =  22.865 seconds (3 rounds), Encoding speed = 472.86x
-j2:	Average time =  12.113 seconds (3 rounds), Encoding speed = 892.62x
-j3:	Average time =   8.367 seconds (3 rounds), Encoding speed = 1292.17x
-j4:	Average time =   6.518 seconds (3 rounds), Encoding speed = 1658.88x
-j5:	Average time =   5.357 seconds (3 rounds), Encoding speed = 2018.29x
-j6:	Average time =   4.886 seconds (3 rounds), Encoding speed = 2213.00x
-j7:	Average time =   4.840 seconds (3 rounds), Encoding speed = 2233.73x
-j8:	Average time =   4.724 seconds (3 rounds), Encoding speed = 2288.90x

Excellent scaling here up to -j6 (having 6 cores here...)
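
For anyone who wants to put a number on "scaling", here is a small sketch of the usual two figures, speedup and parallel efficiency. These are the standard definitions, not something the benchmark tool above reports, and the function names are my own.

```c
/* Speedup of an n-thread run relative to the single-threaded time. */
double speedup(double t1_seconds, double tn_seconds)
{
	return t1_seconds / tn_seconds;
}

/* Parallel efficiency: speedup divided by thread count (1.0 = ideal). */
double efficiency(double t1_seconds, double tn_seconds, unsigned threads)
{
	return speedup(t1_seconds, tn_seconds) / threads;
}
```

With the averages listed above, -j6 gives 22.865 / 4.886, a speedup of about 4.68 and an efficiency of about 0.78; -j8 gives a speedup of about 4.84 but an efficiency of only about 0.61, which fits the observation that scaling levels off beyond six threads on six cores.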

Reply #73 – 2023-07-20 21:03:12

A test with 25 random files, decoded from FLAC with a stable version of FLAC, re-encoded with git-1357f844 using -j8, and decoded again with the stable version.

Code: [Select]

a2e5ffbacccec5eeb055a9d8b86aa407  Alice In Chains - Junkhead.orig.wav
a2e5ffbacccec5eeb055a9d8b86aa407  Alice In Chains - Junkhead.wav
ea2b96ccb1700203cff1febcf3583b4f  Assemblage 23 - Pages.orig.wav
ea2b96ccb1700203cff1febcf3583b4f  Assemblage 23 - Pages.wav
e68c9d86855cdeba088fdd90c9edb386  Blutengel - Ich Bin Das Feuer.orig.wav
e68c9d86855cdeba088fdd90c9edb386  Blutengel - Ich Bin Das Feuer.wav
c154800d3fd2ede0bed7e0a25b582bdb  Chimaira - Left For Dead.orig.wav
c154800d3fd2ede0bed7e0a25b582bdb  Chimaira - Left For Dead.wav
d237c825964d7b2718b287a36a5911be  Chimaira - The Flame.orig.wav
d237c825964d7b2718b287a36a5911be  Chimaira - The Flame.wav
34a76a7eba7069fb8ebcdfe2995f5cfa  Eisbrecher - Nein Danke.orig.wav
34a76a7eba7069fb8ebcdfe2995f5cfa  Eisbrecher - Nein Danke.wav
7d0e21fc23630e5c559b2ebb2ab300b5  Eisbrecher - Unschuldsengel.orig.wav
7d0e21fc23630e5c559b2ebb2ab300b5  Eisbrecher - Unschuldsengel.wav
0b601ad4ea94c43e58182736f94d9e07  Five Finger Death Punch - The Agony Of Regret.orig.wav
0b601ad4ea94c43e58182736f94d9e07  Five Finger Death Punch - The Agony Of Regret.wav
bca9d450c42bfef2c9620b5b1c68e81a  Five Finger Death Punch - You.orig.wav
bca9d450c42bfef2c9620b5b1c68e81a  Five Finger Death Punch - You.wav
e84ea2925ec4eb1d55f3f878813bba3b  KMFDM - Last Things.orig.wav
e84ea2925ec4eb1d55f3f878813bba3b  KMFDM - Last Things.wav
5261e8a315ac5f6f3862376e343dacb3  Linkin Park - Shadow Of The Day.orig.wav
5261e8a315ac5f6f3862376e343dacb3  Linkin Park - Shadow Of The Day.wav
7fb6a83e650f642e4061984e63797929  Megadeth - I Know Jack.orig.wav
7fb6a83e650f642e4061984e63797929  Megadeth - I Know Jack.wav
bd0e25c7b0f69667d907f24d42258475  Megadeth - The Right To Go Insane.orig.wav
bd0e25c7b0f69667d907f24d42258475  Megadeth - The Right To Go Insane.wav
ed2905920e3d3d9d3c2071cf14255332  Metallica - Holier Than Thou.orig.wav
ed2905920e3d3d9d3c2071cf14255332  Metallica - Holier Than Thou.wav
eb4072314e7641ab14907cbb5a183976  Nine Inch Nails - All The Pigs, All Lined Up.orig.wav
eb4072314e7641ab14907cbb5a183976  Nine Inch Nails - All The Pigs, All Lined Up.wav
06ed26ea67393804675ef2e6305f77bd  Nine Inch Nails - Head Like A Hole (Clay).orig.wav
06ed26ea67393804675ef2e6305f77bd  Nine Inch Nails - Head Like A Hole (Clay).wav
e90672c5b36aed09b38647d907d3586f  Project Pitchfork - Schalt Und Rauch.orig.wav
e90672c5b36aed09b38647d907d3586f  Project Pitchfork - Schalt Und Rauch.wav
daaa63180a26a2716c886f8eee07d7e9  Sepultura - Slaves Of Pain.orig.wav
daaa63180a26a2716c886f8eee07d7e9  Sepultura - Slaves Of Pain.wav
ac19a7d52d6e00545da33668b6ea26c8  Sepultura - We Who Are Not As Others.orig.wav
ac19a7d52d6e00545da33668b6ea26c8  Sepultura - We Who Are Not As Others.wav
0aa52479725fe6e755e00ffda36d1191  Spineshank - 40 Below.orig.wav
0aa52479725fe6e755e00ffda36d1191  Spineshank - 40 Below.wav
a186a0036687c2e96c17e20a71d9116a  Stone Temple Pilots - Sin.orig.wav
a186a0036687c2e96c17e20a71d9116a  Stone Temple Pilots - Sin.wav
c969721b69ca499c50e4643ef6cdddda  Tantric - I'll Stay Here.orig.wav
c969721b69ca499c50e4643ef6cdddda  Tantric - I'll Stay Here.wav
e24100e76d3af18b19a23d9156986331  Taproot - Myself.orig.wav
e24100e76d3af18b19a23d9156986331  Taproot - Myself.wav
0b6a1765390f6451588e0984595ddec9  The Crystal Method - Jaded.orig.wav
0b6a1765390f6451588e0984595ddec9  The Crystal Method - Jaded.wav
ec2cf54651c01c1ecec17a0f5fb18225  Van Halen - One Foot Out The Door.orig.wav
ec2cf54651c01c1ecec17a0f5fb18225  Van Halen - One Foot Out The Door.wav
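
The list above compares MD5 sums of the decoded WAVs. As an alternative sketch of the same round-trip check, the two decoded files can also be compared byte for byte. This is a different technique from the one used for the list, and the function name is invented for the example.

```c
#include <stdio.h>

/* Returns 1 if both files can be opened and have identical contents,
 * 0 otherwise. */
int files_identical(const char *path_a, const char *path_b)
{
	FILE *a = fopen(path_a, "rb");
	FILE *b = fopen(path_b, "rb");
	int same = (a != NULL && b != NULL);
	int ca, cb;

	while (same) {
		ca = fgetc(a);
		cb = fgetc(b);
		if (ca != cb)
			same = 0;	/* differing byte, or one file ended early */
		else if (ca == EOF)
			break;		/* both files ended together */
	}
	if (a != NULL)
		fclose(a);
	if (b != NULL)
		fclose(b);
	return same;
}
```

Called on a pair such as "x.orig.wav" and "x.wav" (placeholder names), a return value of 1 corresponds to two matching checksums in the list.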

Reply #74 – 2023-07-20 21:39:59

TL;DR of Reply 73: All match.
