
Topic: More multithreading

Re: More multithreading

Reply #100
Quote
Yep, something's funky with the scaling already, with -j1 it's fine.
What do you mean? That -1 and -4 are different? This has been mentioned in the thread start and in replies #2, #18, #19, #30 and #68. Otherwise, I don't know what you mean.

Quote
Suggestion for that case with four concurrent files:
[...]
Very messy. I would get lost in that.

Quote
Say, we have taken note that it is hard to make good use of multi-threading a short file to be encoded with low preset.
Have you tested with short files? The impact doesn't seem to be too severe. If I take CDDA input files of 1 second (so 44100 samples), I'm still seeing net gains, not losses, when multithreading. For example, using -8 -j4 on my 4-core machine gives a 1.9x speedup. With preset -0 I still get a 1.4x speedup with 4 threads. So the overhead of setting up and destroying threads isn't too large.

Quote
Possibly you could consider the following line of arguments - subject to being anywhere remotely close to the fact, I am quiiiite ignorant here:
  • If it is just one single file, it will be done in one second anyway, you can get it down to half a second but who cares if you cannot get it down to a third of a second - end-users get impatient over seconds to wait, not over percentages;
We're not talking about seconds here, but milliseconds, with current CPUs. Seriously, encoding 1600 such 1-second files takes 10 seconds in total when single-threading with preset -8, and 5 seconds with -j4.

Also, a program should act somewhat predictably to an end user. If the command line tool uses a different ordering than what the user supplies as input to improve throughput, that is going to be confusing.

Quote
Maybe this could eliminate the work of trying to improve scenarios where the impact won't matter to the users?
The problem is that I cannot determine, for all systems that flac can run on, which scenarios have impact and which do not.

Quote
Quote
Reason to ask this first is this question about what we should measure - and what utilities to use and read off the numbers.
Most importantly wall time.
Obviously for the end result. But for testing, you don't get much useful extra information from including anything else?
I don't know any.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #101
Quote
Say, we have taken note that it is hard to make good use of multi-threading a short file to be encoded with low preset.
Have you tested with short files? The impact doesn't seem to be too severe. If I take CDDA input files of 1 second (so 44100 samples), I'm still seeing net gains, not losses, when multithreading. For example, using -8 -j4 on my 4-core machine gives a 1.9x speedup.
I was thinking about the -0 end and not the -8 end ... but anyway:
Testing a compilation album with short songs, not atypical for the genre: https://nocleansinging.bandcamp.com/album/hold-fast-grindviolence-compilation - free download for anyone to replicate the experiment on their computers
30:26 long, 20 tracks. 23:49 is CDDA, 997 kbit/s at -5 (yes noisy), 4:07 is 44.1/24 at 1698, and 2:31 is 96/24 at 3409.
Code: [Select]
for %j IN (1,2,4,8,16) DO (timeout /t 8 & \bin\timer64.exe flac-multithreading-v4.exe -ss -j%j -f <setting> *.flac )
8 seconds is maybe not much cooldown time, but quite a lot compared to the busy times. And -j16 is supposed to be useless on a 4-core/8-thread i5-1135G7 (throwing it in just to verify it doesn't make a mess of anything):

--lax -0b16384 --no-md5-sum where -j4 takes more time than -j2 even if occupying more threads. Times -j1 -j2 -j4 -j8 -j16 are:
     2.955    2.166    2.193    2.316    2.337   
-0b4096  where again -j4 takes more time than -j2
     3.414    2.144    2.311    2.310    2.444   
-2e and at this stage I wonder if I should have run -j3 and -j5 and the whole thing
     4.118    2.640    2.895    2.990    3.091   
-5 and finally -j4 catches -j2, but -j8 doesn't improve over -j4
     4.554    3.045    2.859    2.953    3.068   
-8 and here -j4 does save considerable time.
     9.578    5.624    3.895    3.844    3.826   

So up to -5-ish, running -j4 / -j8 (/-j16) just means that it fires 4 / all 8 (/ditto) threads to do the same work as -j2 does.
Do I interpret it correctly as follows? That means that you fire up 2 or 6 extra threads only to do the extra work from the overhead? That is a waste. If I want to put my CPU to work for two seconds, and can get it done in four thread-seconds, then spending sixteen thread-seconds probably makes for several times as much heat - which would translate to a huge increase in duration if I were to run this for a week-long job where the CPU would be pretty much throttled over the heat?


Quote
Maybe this could eliminate the work of trying to improve scenarios where the impact won't matter to the users?
The problem is that I cannot determine, for all systems that flac can run on, which scenarios have impact and which do not.
And that just makes my argument even better (for you): if multi-threading multiple files means that users are not going to invoke <particular single-file scenario> so often, you don't need to worry so much about it, as you would if you just presumed that all multi-threading is run on single files.

Re: More multithreading

Reply #102
I think the main drawback of wall time is that it includes everything: antivirus updating in the background, Windows telemetry, and the like.

Honestly, at this moment I hope the main focus is still single-file multithreading, with a secret (now that I've mentioned it, no longer secret) wish for variable-blocksize development that may utilize some threads, which would be much more rewarding than the pathetic -pe combination.

Re: More multithreading

Reply #103
I don't know how Windows calculates the time a process uses.  On Linux, the time command gives 3 results: real, user and sys. Real is the wall time, user is how much time the process itself takes outside of the kernel, and sys is how much time the process itself takes within the kernel.

Running FLAC with one thread to a ramdisk (tmpfs) on my input gives me this:
Code: [Select]
real    0m43.619s
user    0m43.106s
sys     0m0.512s
user + sys = 0m43.618s.  I don't really have anything else on my system using resources other than the browser.


Running with two threads to ramdisk:
Code: [Select]
real    0m23.948s
user    0m47.475s
sys     0m0.376s
(user + sys) / jobs = 0m23.925s


Running with 8 threads to ramdisk:
Code: [Select]
real    0m8.575s
user    1m7.709s
sys     0m0.568s
(user + sys) / jobs = 0m8.535s


Running with 8 threads to disk (zfs):
Code: [Select]
real    0m40.068s
user    1m14.153s
sys     0m2.573s
(user + sys) / jobs = 0m9.590s
So in this case, FLAC only needed 9.59s to do its thing, but writing to disk slowed down the process by an additional 30s (I'm running ZFS on a single disk and random I/O suffers).

Re: More multithreading

Reply #104
Warning for information overload here.

I ran a variety of settings through version 3 and version 4 (note: only -j1, -j2, -j4 and -j8). Every figure is after a 120-second pause for cooldown. I suspect that wasn't always enough.
Times were recorded with the timer64 utility. I don't know what process time is worth, but those figures are surprising: there are big differences from version 3 to version 4, where the latter frequently measures much higher; on two computers with fans, that happens at the -5 settings and -2 settings, but on my fanless (hence throttling) home desktop it happens at the heavier -8xx settings.
But when process time gets so high, is that because it wastes processing power on overhead, or is it something else?

Ran on three computers, all with Intel 4-core/8-thread CPUs.
Common observation: for the -0 settings, one can stick to -j2.

Results from a HP Prodesk with i5-7500T (same as here). In version 4, -j8 slows down Global time compared to -j4 (and sometimes made -j8 slower than version 3).
Code: [Select]
setting          j1 process  j1 global  j2 process  j2 global  j4 process  j4 global  j8 process  j8 global
-8pr7  version3      638        639        667         633        683         224        688         189
       version4      617        618        648         325        665         168        690         176
-8er7  version3      692        692        718         683        743         243        747         207
       version4      670        671        694         348        728         184        750         191
-8r7   version3      176        177        190         161        191          57        191          51
       version4      172        172        180          91        186          48        213          56
-8r0   version3      156        157        171         144        177          63        169          46
       version4      156        157        161          82        166          43        198          52
-5q14  version3       67         67         73          44         75          30         75          31
       version4       66         67         70          36         73          21        108          30
-5q6   version3       67         68         74          44         75          30         75          31
       version4       67         68         70          36         73          21        102          30
-2er7  version3      105        106        122          83        120          34        125          39
       version4      101        102        106          54        111          30        121          33
-0mr0  version3       51         52         57          34         65          35         65          35
       version4       51         52         55          29         81          28        117          42
-0Mr0  version3       47         48         55          33         55          33         56          34
       version4       48         48         59          36         60          36         60          36
-0r0   version3       46         47         54          33         60          36         59          36
       version4       46         46         50          26         78          28        120          44
Notice that there are some -j8 settings where version 4 boosts "Process time" quite a lot: the "-5" settings and the "-0" settings, except "-0Mr0" (the "soft" mid/side).


Same test run on a Dell business laptop, i7-1185G7. Here -j8 is a good thing for the -8-based settings; but compare to version 3 at the -8 -j8 settings.
Code: [Select]
setting          j1 process  j1 global  j2 process  j2 global  j4 process  j4 global  j8 process  j8 global
-8pr7  version3      581        592        713         683        799         300        964         153
       version4      688        696        849         435        860         229       1008         150
-8er7  version3      716        723        751         719        852         319       1018         162
       version4      692        698        902         458        903         237       1197         173
-8r7   version3      170        178        203         185        258          97        231          50
       version4      198        210        218         126        209          71        239          47
-8r0   version3      154        174        167         138        199          81        214          45
       version4      145        161        194         108        198          66        224          49
-5q14  version3       62         77         71          60         73          35         73          42
       version4       61         77         77          53         79          35         98          31
-5q6   version3       58         65         72          62         74          41         75          38
       version4       59         72         78          55         78          34        102          32
-2er7  version3       95        108        118         101        134          57        130          45
       version4       94        106        122          75        128          46        172          44
-0mr0  version3       44         60         57          44         58          45         58          44
       version4       45         60         54          38         62          37        113          44
-0Mr0  version3       40         50         52          47         53          43         51          38
       version4       40         57         54          54         54          52         54          53
-0r0   version3       39         55         50          36         50          39         51          43
       version4       45         63         49          43         64          37        121          58
Process time numbers jump at the same spots in the table, but also at -2er7.
The top-left result (-8pr7 -j1 on version 3) was the first that was run, and if 2 minutes of cooldown was too little (which I suspect), it might read too low because it started from a longer cooldown while I fiddled a little back and forth.

Now on my usual fanless desktop which throttles at will and produces unreliable numbers (CPU: i5-1135G7), the bottom of the table deviates slightly:
Code: [Select]
setting          j1 process  j1 global  j2 process  j2 global  j4 process  j4 global  j8 process  j8 global
-8pr7  version3      449        451        442         452        378         194        603         143
       version4      441        450        475         248        546         160        978         135
-8er7  version3      475        485        461         470        398         200        698         156
       version4      473        481        509         260        645         178       1037         137
-8r7   version3      123        123        105         110         63          42         89          31
       version4      123        123        117          63         84          37        162          31
-8r0   version3      107        107         93          97         48          37         81          28
       version4      107        107        106          59         65          33        136          28
-5q14  version3       41         47         25          34         23          24         24          24
       version4       41         51         34          28         11          18         13          20
-5q6   version3       42         46         24          34         24          23         23          23
       version4       41         46         34          28         10          18         14          20
-2er7  version3       64         68         33          58         13          26         28          28
       version4       64         67         55          40         45          25         25          22
-0mr0  version3       30         34         12          23         23          25         22          25
       version4       30         36          9          23          8          17         13          31
-0Mr0  version3       27         32         14          24         13          24         13          24
       version4       27         32         16          30         16          30         14          30
no-md5 version3       18         23          6          19         11          15         11          15
       version4       18         23         14          15          8          16          8          16

TAK -p0 (global times only):   j1    j2    j4    j8
        -md5                   55    40    41
        no MD5                 47    32    28    n/a for TAK
Where it says "no-md5", that is -0r0 --no-md5-sum, instead of the ordinary -0r0 I ran above.
But anyway, here the high "Process" times are on the -8 settings.
Also included, for comparison: TAK at its fastest setting, -p0. MD5 summing is optional in TAK, and it seems to remove some of the benefit of the multithreading, which for TAK is capped at 4 threads. Times here were recorded differently, with echo:|time .

Re: More multithreading

Reply #105
Two remarks on apparently "slow" speeds: TAK and the Dell laptop.

TAK. I had expected it to run faster, but it boils down to how fast (single-threaded) flac has become. Bragging rights to @ktf here.
In ktf's comparison studies, nothing encodes as fast as TAK -p0 - also verified on a couple of Intel CPUs in addition to the main study. Here it didn't run any faster than flac -5. (Curiously too, on these eleven CDs - the *j*.wav part of my signature - it didn't even compress better. But that doesn't generalize ...)
So I casually ran 1.3.4 at -5. Process/global times 52 and 58 seconds, indicating that the new builds are 1/6th faster. And -0Mr0: 35 and 47. Ran again and got exactly the same.
So on this computer, TAK -p0 was tied to old flac -0Mr0. But the fixed-predictor speedups since 1.3.4 are quite formidable, so finally TAK -p0 is getting dethroned at plain speed ... at least on a modern CPU.

Then the Dell laptop in the middle table is surprisingly slow given that the CPU is supposed to be better at every parameter: https://www.cpubenchmark.net/compare/2917vs3793vs3830 . I see it is set up with a pagefile, but if I/O were a concern it should be much more visible at the -0 settings. RAM is 16 GB on all.
There must be some BIOS-controlled, more aggressive throttling going on, to save the user's lap from getting burned, I guess. Whereas the fanless computer, which has a heatsink body around a NUC board, runs too hot to touch ... maybe that actually dissipates more heat than an awfully noisy laptop fan would do, but I am surprised at the impact. Maybe I should check if I can downclock it slightly.

Re: More multithreading

Reply #106
Quote
I don't know how Windows calculates the time a process uses. On Linux, the time command gives 3 results: real, user and sys. [...] So in this case, FLAC only needed 9.59s to do its thing, but writing to disk slowed down the process by an additional 30s (I'm running ZFS on a single disk and random I/O suffers).

So after some testing, it seems that instead of dividing user+system by the number of jobs run, I should have divided by the percentage of CPU actually used by the jobs run.  When writing to ramdisk, there's no I/O bottleneck, so running FLAC with higher settings will get each thread to (nearly) 100%.  When writing to disk, the process is waiting on I/O to catch up (might not happen so much with smaller files), so each thread might only be using 50% or 25%, etc.  Also, using lower presets won't cause each thread to run at 100% either.  So it seems that for a process that actively uses the CPU for the duration of the task, the real (wall) time and user time will be the same (within a few milliseconds).  Only if a process sits idle during its task will the real time and user time differ.  I always test on ramdisk and use the real time to show performance.  Looks like that is still the best way, without any extra math involved.  Hope that makes sense, I'm awful at explaining things.
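
As a worked check (my arithmetic on the figures from the quoted post, not part of the original): the average CPU utilisation is (user + sys) / real, and dividing user+sys by that, rather than by the thread count, is what recovers the wall time.
Code: [Select]
ramdisk, 8 threads:  (67.709 + 0.568) / 8.575  = 7.96  ->  ~796% CPU, all 8 threads busy
zfs,     8 threads:  (74.153 + 2.573) / 40.068 = 1.91  ->  ~191% CPU, barely two threads' worth
                     76.726 / 1.91 = 40.1s (the wall time), not 76.726 / 8 = 9.59s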

Re: More multithreading

Reply #107
But when process time gets so high, is that because it wastes processing power on overhead, or is it something else?
It was supposed to wait for work, but by mistake it did 'busy waiting'.

Anyway, attached is a new win64 binary. It should be much more efficient when the user asks for (way) too many threads. It lets threads properly wait when out of work, and also pauses threads for a longer time when they have to wait often. That dramatically reduces the amount of overhead. Also, it raises the maximum number of threads to 64.
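
To illustrate the difference (a minimal pthread sketch of the general technique, not the actual libFLAC code):
Code: [Select]
/* A worker that runs out of frames should block on a condition
 * variable instead of spinning. */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  work_available;
    int             pending_frames;
    bool            done;
} work_queue;

/* The bug: busy waiting. With too many threads, idle workers spin at
 * 100% CPU, which is where the inflated "Process time" figures came from. */
static bool take_frame_busy(work_queue *q)
{
    for (;;) {
        pthread_mutex_lock(&q->lock);
        if (q->pending_frames > 0) {
            q->pending_frames--;
            pthread_mutex_unlock(&q->lock);
            return true;
        }
        bool finished = q->done;
        pthread_mutex_unlock(&q->lock);
        if (finished)
            return false;
        /* ...and immediately loops around again, burning a core */
    }
}

/* The fix: a blocking wait. The thread sleeps in pthread_cond_wait (the
 * mutex is released while waiting) and costs nothing until the producer
 * signals that new work has arrived. */
static bool take_frame_blocking(work_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->pending_frames == 0 && !q->done)
        pthread_cond_wait(&q->work_available, &q->lock);
    bool got = q->pending_frames > 0;
    if (got)
        q->pending_frames--;
    pthread_mutex_unlock(&q->lock);
    return got;
}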

In my own tests, asking for 16 threads on a 4 core, 8 thread machine with preset -0 results in a 10% slower time than the sweet spot at 4 threads, whereas the previous binary could get **much** slower, sometimes even getting slower than single threaded.

This new version should not change much for slow presets like -8 with a sane number of threads, but it makes a huge difference when selecting a number of threads that is way too high, and with fast presets. I think it will also make quite a difference when run on a CPU that is already intermittently busy, because it scales the number of active threads up and down based on how well they run. This is difficult to measure, however.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #108
Questions before firing up the next FOR loops - in case there is anything that could be omitted / should be included:

* Is -M still at this stage limited to two threads? No matter what other settings? Anything else particular about -m vs -M vs --no-mid-side?
(Above I just didn't bother to make an exception for -M in the FOR loop, but -0 would anyway max out its speed at a low thread count.)

* Anything special about re-encoding? (Decoding is fast, but is it fast enough not to matter much for the housekeeping thread under any reasonable circumstances? Should that be tested?)

* In particular about MD5 computation and recompressing: does flac (these builds, at least) compute the MD5 sum "in the same workflow" for recompressing .flac as for compressing PCM? (AFAIUnderstand, flac --verify wavefile.wav will verify by creating a second MD5 sum and comparing it to the one for the source - but in principle, flac -f --verify flacfile.flac doesn't need to compute the MD5 from the source if that is stored in the source file ... not saying it is worth it; if users ask for -8pel32 they might want to test the source first rather than waiting eons just to be told that, nah, the source was corrupted.)

* Also, I just discovered that there is not only an undocumented --no-md5-sum, but also a --no-md5 - do those work the same? (Also, in case these builds have some exceptional behaviour implemented for only one of them.)

Re: More multithreading

Reply #109
Own standard compile of v4 without limit vs own v5, again 12-core/24-thread 5900X, -8ep -V
v4  vs v5
j12 173x 173x
j16 183x 183x
j24 193x 194x

For this scenario it works well, thanks!
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #110
* Is -M still at this stage limited to two threads? No matter what other settings?
Yes and yes.

Quote
but -0 would anyway max out speed at low threads count.
It did max out at 3 threads with v4 in my tests, now it does at 4. But that CPU only has 4 cores anyway.

Quote
* Anything special about re-encoding?
Yes, decoding does hold up encoding on (very) fast presets.

Quote
* In particular about MD5 computation and recompressing: Does flac (these builds, at least) compute the MD5sum "in the same workflow" for recompressing .flac as for compressing PCM?
The first thread crosses the API boundary, and is for (1) the internals of the flac command line program, (2) the WAV reading or FLAC decoding, (3) verify decoding and (4) some internal copying and moving of data. If this thread is idle, it will start working on a frame. One of the other threads does MD5 calculation on the data that is to be encoded, and the others create frames.
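
Schematically (my paraphrase of the above, illustrative only, not the actual libFLAC source):
Code: [Select]
enum thread_role {
    THREAD_MAIN,   /* crosses the API boundary: command-line internals,
                      WAV reading / FLAC decoding, verify decoding, data
                      copying; encodes a frame only when otherwise idle */
    THREAD_MD5,    /* computes the MD5 sum of the samples to be encoded */
    THREAD_WORKER  /* every remaining thread: creates (encodes) frames  */
};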

I just found out that the flac command line program does NOT calculate and/or check MD5 of the original file on reencoding. It only calculates a new MD5. It also doesn't check whether the original MD5 and the new one are the same. Probably something that should be fixed at some point.

Quote
(AFAIUnderstand, flac --verify wavefile.wav will verify by creating a second MD5 sum
No, it does not. It decodes and checks whether each and every decoded sample is the same as every input sample. It does not verify the stored MD5.

Quote
Also, I just discovered that there is not only an undocumented --no-md5-sum, but also a --no-md5 - do those work the same? (Also, in case these builds have some exceptional behaviour implemented for only one of them.)
I think that is a feature of the getopt functions: if an 'abbreviation' is unique, it will accept it. So --no-md will also work. --no-m does not work because it is ambiguous: it could also mean --no-mid-side.
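
A self-contained demonstration of that prefix matching (a sketch using GNU getopt_long with a made-up option table, not flac's own):
Code: [Select]
#include <getopt.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    static const struct option longopts[] = {
        {"no-md5-sum",  no_argument, 0, 'n'},
        {"no-mid-side", no_argument, 0, 's'},
        {0, 0, 0, 0}
    };
    int c;
    while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
        switch (c) {
        case 'n': puts("matched --no-md5-sum");  break;
        case 's': puts("matched --no-mid-side"); break;
        default:  return 1;  /* '?': unknown or ambiguous option */
        }
    }
    return 0;
}
With that table, --no-md (and even --no-md5) still uniquely matches --no-md5-sum, while --no-m is ambiguous and getopt_long reports an error.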
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #111
flac git-5500690f 20230726

Code: [Select]
        -0      -1      -2     -3      -4      -5      -6      -7     -8 
 -j1    3.82s   3.99s   4.25s   4.37s   5.02s   6.02s   8.18s   10.27s 15.35s
 -j2    2.13s   3.16s   2.35s   2.37s   4.20s   3.28s   4.55s   5.59s  8.16s
 -j3    1.61s   3.19s   1.74s   1.77s   4.21s   2.41s   3.32s   4.09s  6.13s
 -j4    1.64s   3.18s   1.66s   1.60s   4.22s   2.00s   2.72s   3.32s  4.99s
 -j5    1.71s   3.18s   1.70s   1.64s   4.20s   1.83s   2.37s   2.85s  4.27s
 -j6    1.66s   3.18s   1.71s   1.66s   4.21s   1.82s   2.07s   2.55s  3.84s
 -j7    1.72s   3.21s   1.72s   1.64s   4.22s   1.83s   2.05s   2.31s  3.47s
 -j8    1.74s   3.17s   1.78s   1.64s   4.22s   1.84s   2.06s   2.21s  3.23s
 -j9    1.71s   3.31s   1.77s   1.64s   4.21s   1.87s   2.08s   2.22s  3.16s
 -j10   1.72s   3.17s   1.75s   1.62s   4.21s   1.85s   2.09s   2.24s  3.10s
 -j11   1.73s   3.18s   1.82s   1.69s   4.21s   1.87s   2.10s   2.33s  3.04s
 -j12   1.78s   3.16s   1.87s   1.63s   4.27s   1.93s   2.11s   2.24s  2.97s
 -j13   1.76s   3.21s   1.80s   1.69s   4.20s   1.89s   2.11s   2.29s  2.93s
 -j14   1.70s   3.17s   1.79s   1.66s   4.22s   1.88s   2.13s   2.32s  2.91s
 -j15   1.82s   3.18s   1.85s   1.67s   4.21s   1.92s   2.11s   2.30s  2.85s
 -j16   1.76s   3.20s   1.91s   1.65s   4.23s   1.89s   2.12s   2.27s  2.84s

Code: [Select]
        -0p     -1p     -2p     -3p     -4p     -5p     -6p     -7p    -8p
 -j1    3.82s   3.99s   4.24s   5.43s   6.37s   8.30s   16.39s  20.10s 44.02s
 -j2    2.16s   3.17s   2.34s   3.02s   5.57s   4.61s   9.19s   11.06s 24.25s
 -j3    1.61s   3.17s   1.74s   2.17s   5.57s   3.40s   6.76s   8.20s  18.19s
 -j4    1.64s   3.21s   1.65s   1.79s   5.60s   2.77s   5.48s   6.69s  14.85s
 -j5    1.65s   3.19s   1.68s   1.76s   5.59s   2.39s   4.75s   5.76s  12.82s
 -j6    1.63s   3.18s   1.69s   1.75s   5.57s   2.13s   4.21s   5.12s  11.44s
 -j7    1.65s   3.16s   1.73s   1.78s   5.58s   2.09s   3.86s   4.68s  10.38s
 -j8    1.69s   3.17s   1.78s   1.80s   5.58s   2.11s   3.54s   4.33s  9.59s
 -j9    1.75s   3.19s   1.78s   1.78s   5.60s   2.15s   3.55s   4.27s  9.58s
 -j10   1.78s   3.16s   1.79s   1.75s   5.58s   2.12s   3.48s   4.24s  9.72s
 -j11   1.77s   3.18s   1.77s   1.73s   5.57s   2.17s   3.44s   4.17s  9.51s
 -j12   1.75s   3.17s   1.84s   1.79s   5.60s   2.18s   3.39s   4.17s  9.45s
 -j13   1.72s   3.19s   1.87s   1.84s   5.57s   2.16s   3.35s   4.06s  9.22s
 -j14   1.78s   3.17s   1.87s   1.79s   5.57s   2.15s   3.32s   3.99s  9.16s
 -j15   1.76s   3.18s   1.82s   1.82s   5.59s   2.19s   3.28s   3.95s  9.09s
 -j16   1.76s   3.25s   1.81s   1.82s   5.59s   2.22s   3.30s   3.93s  9.03s

Didn't notice this before, but it seems presets 1 and 4 don't benefit from more than 2 threads.



Re: More multithreading

Reply #114
@Wombat made a couple of builds from the same source as the above version 5, and here follow some measurements against the one with "v3" flags, requiring AVX2 but not AVX512 (did I get that right?). This on a HP Deskpro which cannot run the AVX512 thing.

Compiles compare kinda how they should ...? At least, no nasty surprises and no miracles, just a mild improvement from the instruction set of 3 to 5 percent on most settings - although yes, some exceptions in either direction, and far down-right in the table there are a few positive numbers where the Wombat v3 build takes slightly more time.

That explains the  (Wv3) line in the table: time difference in percent (negative means faster), against ktf's latest build, which appears in the "main" line.
That line first has compression time in seconds. Then I thought, why not represent the others as a penalty relative to the benchmark where speed is proportional to the number of cores: say, if times are not 40/20/10 but 40/21/12, the penalties of 1 and 2 seconds show up as 5% (of the 20) and 20% (of the 10).
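Or as a formula (my restatement of the same thing):
Code: [Select]
penalty(j) = t_j / (t_1 / j) - 1
e.g. t_1 = 40:  t_2 = 21  ->  21/20 - 1 =  5 %
                t_4 = 12  ->  12/10 - 1 = 20 %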

Although those %s might be misleading when numbers become small (I mean, it is the seconds that make us impatient!), that is anyway where -j4 doesn't unleash much. E.g. -0r1 -j2 was done in 27 seconds, and -j4 saves less than four more.
Also, I manually deleted the two -1j4 ones because -M caps the -j at 2 anyway. That in turn is because the multithreading is not (yet) optimized for -M, which is also pretty clear from the overhead on the two -1j2's.


... why this choice of settings? Because it seems the "-r" makes a difference between Clang and GCC compiles (#349 explains a mistake), so why not try a very fine partitioning and a very coarse one. Not so much for the number of seconds, more to verify that it doesn't behave unexpectedly stupidly under -r variations.

Code: [Select]
setting       j1 (s)   j2 ovrh   j4 ovrh
8pl32           3552       3 %       7 %
  (Wv3)         -5 %      -5 %      -5 %
8per7           5317       3 %       8 %
  (Wv3)         -5 %      -5 %      -5 %
8pr7             646       3 %       7 %
  (Wv3)         -7 %      -5 %      -5 %
8er7             688       4 %      10 %
  (Wv3)         -5 %      -5 %      -5 %
8r7              176       5 %      11 %
  (Wv3)         -5 %      -4 %      -4 %
8r1              153       4 %      11 %
  (Wv3)         -5 %      -3 %      -3 %
5r7               71       7 %      22 %
  (Wv3)         -3 %      -2 %      -2 %
5r1               66       8 %      25 %
  (Wv3)         -3 %      -2 %      -4 %
2er7             105       7 %      16 %
  (Wv3)         -4 %      -4 %      -4 %
2er1              61      11 %      52 %
  (Wv3)         -7 %      -6 %       1 %
2r7               61      11 %      48 %
  (Wv3)         -3 %      -3 %       2 %
2r1               52      12 %      74 %
  (Wv3)         -4 %      -3 %      -3 %
1r7               53      58 %         -
  (Wv3)         -3 %      -5 %      -5 %
1r1               48      51 %         -
  (Wv3)         -3 %      -6 %      -7 %
0r7               51      13 %      76 %
  (Wv3)         -3 %      -2 %       1 %
0r1               47      15 %      98 %
  (Wv3)         -3 %      -3 %       2 %
Times are median of 3. No CPU cooldown, rather the opposite: three -j1 runs discarded, then -j1, -j2, -j4 and Wombat build -j1, -j2, -j4. If anything, that would mean that Wombat -j1 got different conditions, not the Wombat -j4.

Re: More multithreading

Reply #115
To whom it may concern...
Just for the fun of it, I ran my -7 test on an up-to-date computer, which I happen to have for a couple of days to set it up.
It has a Raptor Lake Intel Core i9 13900F, 5.6 GHz max., 8/16 performance/efficient cores, 32 threads.
Code: [Select]
-j1:	Average time =  15.172 seconds (3 rounds), Encoding speed = 712.63x
-j2: Average time =   8.047 seconds (3 rounds), Encoding speed = 1343.55x
-j3: Average time =   5.704 seconds (3 rounds), Encoding speed = 1895.62x
-j4: Average time =   4.434 seconds (3 rounds), Encoding speed = 2438.43x
-j5: Average time =   3.658 seconds (3 rounds), Encoding speed = 2955.71x
-j6: Average time =   3.182 seconds (3 rounds), Encoding speed = 3397.51x
-j7: Average time =   2.885 seconds (3 rounds), Encoding speed = 3747.23x
-j8: Average time =   2.808 seconds (3 rounds), Encoding speed = 3850.43x
-j10: Average time =   2.807 seconds (3 rounds), Encoding speed = 3851.80x
-j12: Average time =   2.841 seconds (3 rounds), Encoding speed = 3806.15x
-j14: Average time =   2.868 seconds (3 rounds), Encoding speed = 3769.87x
-j16: Average time =   2.935 seconds (3 rounds), Encoding speed = 3683.40x
Tests were done with ktf's v5 binary (flac git-5500690f 20230726)

Re: More multithreading

Reply #116
And since the performance plateau was reached @ -j7 in the test above, I tried with some heavier loads, too:
-8:
Code: [Select]
-j1:    Average time =  23.259 seconds (3 rounds), Encoding speed = 464.85x
-j2:    Average time =  12.256 seconds (3 rounds), Encoding speed = 882.18x
-j3:    Average time =   8.570 seconds (3 rounds), Encoding speed = 1261.61x
-j4:    Average time =   6.566 seconds (3 rounds), Encoding speed = 1646.66x
-j5:    Average time =   5.374 seconds (3 rounds), Encoding speed = 2011.78x
-j6:    Average time =   4.679 seconds (3 rounds), Encoding speed = 2310.59x
-j7:    Average time =   4.207 seconds (3 rounds), Encoding speed = 2569.80x
-j8:    Average time =   3.908 seconds (3 rounds), Encoding speed = 2766.87x
-j10:   Average time =   3.732 seconds (3 rounds), Encoding speed = 2896.85x
-j12:   Average time =   3.672 seconds (3 rounds), Encoding speed = 2944.44x
-j14:   Average time =   3.657 seconds (3 rounds), Encoding speed = 2956.79x
-j16:   Average time =   3.706 seconds (3 rounds), Encoding speed = 2917.17x
Performance peak here is at -j10..-j12, so even when the CPU ran out of performance cores there is some benefit.

-8p:
Code: [Select]
-j1:    Average time =  74.252 seconds (3 rounds), Encoding speed = 145.61x
-j2:    Average time =  37.895 seconds (3 rounds), Encoding speed = 285.32x
-j3:    Average time =  26.564 seconds (3 rounds), Encoding speed = 407.02x
-j4:    Average time =  20.738 seconds (3 rounds), Encoding speed = 521.36x
-j5:    Average time =  17.305 seconds (3 rounds), Encoding speed = 624.79x
-j6:    Average time =  15.060 seconds (3 rounds), Encoding speed = 717.94x
-j7:    Average time =  13.406 seconds (3 rounds), Encoding speed = 806.52x
-j8:    Average time =  12.321 seconds (3 rounds), Encoding speed = 877.50x
-j10:   Average time =  12.201 seconds (3 rounds), Encoding speed = 886.13x
-j12:   Average time =  11.442 seconds (3 rounds), Encoding speed = 944.94x
-j14:   Average time =  10.573 seconds (3 rounds), Encoding speed = 1022.64x
-j16:   Average time =   9.777 seconds (3 rounds), Encoding speed = 1105.86x
-j18:   Average time =   9.352 seconds (3 rounds), Encoding speed = 1156.12x
-j20:   Average time =   8.942 seconds (3 rounds), Encoding speed = 1209.17x
-j22:   Average time =   8.547 seconds (3 rounds), Encoding speed = 1264.96x
-j24:   Average time =   8.219 seconds (3 rounds), Encoding speed = 1315.49x
-j26:   Average time =   7.949 seconds (3 rounds), Encoding speed = 1360.17x
-j28:   Average time =   7.850 seconds (3 rounds), Encoding speed = 1377.32x
-j30:   Average time =   7.836 seconds (3 rounds), Encoding speed = 1379.73x
-j32:   Average time =   7.791 seconds (3 rounds), Encoding speed = 1387.76x
-j34:   Average time =   7.746 seconds (3 rounds), Encoding speed = 1395.82x
-j38:   Average time =   7.819 seconds (3 rounds), Encoding speed = 1382.73x
-j42:   Average time =   7.928 seconds (3 rounds), Encoding speed = 1363.72x
-j46:   Average time =   7.963 seconds (3 rounds), Encoding speed = 1357.78x
Performance peak at -j32..-j34 here.

Re: More multithreading

Reply #117
flac-multithreading-v5-win
Code: [Select]
timer64.exe v5 -j1 -8p -f in.wav
Global Time  =    55.150

timer64.exe v5 -j2 -8p -f in.wav
Global Time  =    30.851

timer64.exe v5 -j3 -8p -f in.wav
Global Time  =    25.706

timer64.exe v5 -j4 -8p -f in.wav
Global Time  =    19.132

timer64.exe v5 -j5 -8p -f in.wav
Global Time  =    16.910

timer64.exe v5 -j6 -8p -f in.wav
Global Time  =    13.622

timer64.exe v5 -j7 -8p -f in.wav
Global Time  =    12.661

timer64.exe v5 -j8 -8p -f in.wav
Global Time  =    10.662

timer64.exe v5 -j9 -8p -f in.wav
Global Time  =    10.145

timer64.exe v5 -j10 -8p -f in.wav
Global Time  =     8.773

timer64.exe v5 -j11 -8p -f in.wav
Global Time  =     8.469

timer64.exe v5 -j12 -8p -f in.wav
Global Time  =     7.719

timer64.exe v5 -j13 -8p -f in.wav
Global Time  =     7.503

timer64.exe v5 -j14 -8p -f in.wav
Global Time  =     6.735

timer64.exe v5 -j15 -8p -f in.wav
Global Time  =     6.678

timer64.exe v5 -j16 -8p -f in.wav
Global Time  =     6.413

Re: More multithreading

Reply #118
Attached an analysis of sundance's preset-7/8 measurements (with the -j9 results coarsely interpolated here), revealing a somewhat (to me, at least) unexpected local efficiency optimum at 4-5 threads. That local optimum doesn't show in e.g. Replica9000's statistics for preset 7/8, if I'm not mistaken. Anyway, good multithreading performance at and below 8 threads with v5!

Chris
If I don't reply to your reply, it means I agree with you.

Re: More multithreading

Reply #119
Performance peak at -j32..-j34 here.
Thanks! I think I can conclude from that that the 'leapfrogging' works correctly. One would assume a P core can do more work in the same amount of time than an E core, so if they were bound to the same number of frames before having to wait (which is the case with v1 and v3), that should have been visible in the results.

What is a bit vague though is how the scheduler works. Does it first saturate the P cores, then the E cores, then the 'hyperthreading system of the P cores'? Or does it first go to the P cores, then hyperthreading, then E cores? Anyway, seeing no regressions, I'd say it works pretty well, even if it isn't very efficient anymore at those high thread counts.

Attached an analysis of sundance's preset-7/8 measurements (with the -j9 results coarsely interpolated here), revealing a somewhat (to me, at least) unexpected local efficiency optimum at 4-5 threads.
Thanks! I think that local minimum is because at that point the first and second thread don't have to switch context too often. Those first two threads are less 'specialized' than the other threads and this brings some inefficiency.

Anyway, good multithreading performance at and below 8 threads with v5!
I agree! I think I did pretty well, not having done multithreaded programming before.  :D

Of course, thank you all for benchmarking. This has helped tremendously!
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #120
If anyone cares about my table overloads, then first there is this mistake of mine: the i5-7500T in the top table here is 4 cores, 4 threads. So for that one, consider the -j8 a "sanity check".

I let some builds loose on the i5-1135G7 equipped fanless desktop again. (A table comparing with Wombat's builds posted here.)
Lower figures better. Negative numbers better for the new build. Only one run per setting per build, so take times with a grain of salt.

I have two different things in the table here as well.
* The "timediff" rows are the running time of the version 5 vs the version 4: negative numbers are speedups, positive are slowdowns.
* The other rows, in the "ovrhd/diff" columns, show the "overhead penalty": the idealized time would be "j1 time / # of threads", and I compare the actual time taken and quote the percent extra. Say, the 53% in the top j5 cell: idealized time would be 25-ish seconds, and it took 53 percent more, i.e. 38-ish. The worst number, "452%", means it took 5.5 times the ideal.
The "j9" column is adjusted to match the j1 time / 8, since there are only 8 threads on this CPU. -j9 was ran a sanity check.
Percent penalty is quite useless on the -2r0 settings, where j4 was fastest in wall time.

Also, the percent penalty is quite useless in the rightmost columns. The two columns at the end are run with -M, where -j is capped at 2 because -M isn't good for multithreading. No shit, Sherlock: -2Mer7 -j2 took more time than -2Mer7.

Code: [Select]
                 j1 time  j2 ovrhd  j3 ovrhd  j4 ovrhd  j5 ovrhd  j8 ovrhd  j9 ovrhd  -M j1 time  -M j2 ovrhd
-8:    v4           124      19%       20%       35%       53%      131%      120%        96         67%
       v5           121       8%       23%       40%       55%      124%      138%        83         97%
  time v5 vs v4     -2%     -12%        0%        1%       -1%       -6%        6%       -14%         2%
-5:    v4            48      23%       49%       86%      104%      271%      287%        42         84%
       v5            49      24%       56%       80%      105%      239%      246%        40         88%
  time v5 vs v4      1%       1%        6%       -2%        1%       -8%      -10%        -4%        -1%
-3:    v4            36      27%       57%      112%      166%      366%      386%        39         79%
       v5            37      26%       68%      100%      148%      312%      307%        38         92%
  time v5 vs v4      1%       1%        9%       -4%       -6%      -10%      -15%        -2%         5%
-2er7: v4            69      19%       34%       60%       76%      175%      238%        51        100%
       v5            70      16%       38%       52%       75%      180%      212%        50        102%
  time v5 vs v4      1%      -1%        5%       -4%        1%        3%       -6%        -1%         0%
-2r0:  v4            54      21%       25%       93%      127%      452%      393%        36         80%
       v5            44      24%       57%       85%      153%      377%      320%        38         91%
  time v5 vs v4    -18%     -16%        3%      -22%       -9%      -29%      -30%         6%        13%

Re: More multithreading

Reply #121
The i5-7500T in the top table here
I am confused, which table do you mean?
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #122
which table
Forgot to linkify it. Here, reply 104.

Top table: -j8 takes about the same time as -j4. To be expected, there are only four threads.
Next tables: -j8 takes less time than -j4 for the -8-based presets, but not for -5 nor the fixed-predictor ones, and that is on 4-core/8-thread CPUs.

For version 5 "-j8" performance on the i5-1135G7 (Intel data here), these are times in seconds from the same data as the table in the previous reply 120.
Only -j1, -j4, -j8
Code: [Select]
seconds -j1 -j4 -j8
-8:     121  42  34
-5:      49  22  21
-3:      37  18  19
-2er7:   70  27  24
-2r0:    44  20  26
Although timings are to be taken with a grain of salt, I confirmed the relations between the -j's with two Wombat builds too, so I believe that -8 benefits from going from -j4 to -j8, while the others benefit only marginally or get slower; reservation: this is on an already-hot, passively cooled computer. That could potentially be a small benefit to -j1 more than -j4, more than -j8, since immediately before -j1 resp. -j4 resp. -j8 there was a -Mj2 resp. -j3 resp. -j5 (the latter heavier). But there was hardly a particular benefit for the top-left element, as I first ran v4, and so the CPU had been running -8jx then -8Mjx encoding for fifteen minutes of process time: it would have been hot.

Larger benefits at heavier jobs might be expected from a well-cooled computer, but not on one that is passively cooled and runs hot to the touch: there I would rather expect that trying to increase the workload by employing more threads would cause throttling and diminish the speedup, even more so on heavier jobs where a thread isn't idling so much.

To point out how even this fanless computer - when running hot for a long time - still utilizes multi-threading:
Here are results from multithreading -8pel32. A two-day job on the now-obsolete version 4.
Code: [Select]
version4 -8epl32 -j1

Commit   =     14224 KB  =     14 MB
Work Set =     15328 KB  =     15 MB

Kernel Time  =    19.031 =    0%
User Time    = 60512.734 =   98%
Process Time = 60531.765 =   98%
Global Time  = 61573.994 =  100%
 
version4 -8epl32 -j2

Commit   =     16772 KB  =     17 MB
Work Set =     19516 KB  =     20 MB

Kernel Time  =    16.609 =    0%
User Time    = 93122.156 =  196%
Process Time = 93138.765 =  196%
Global Time  = 47361.310 =  100%
 
version4 -8epl32 -j4

Commit   =     18588 KB  =     19 MB
Work Set =     20804 KB  =     21 MB

Kernel Time  =    25.937 =    0%
User Time    = 98499.734 =  385%
Process Time = 98525.671 =  385%
Global Time  = 25562.915 =  100%
 
version4 -8epl32 -j8

Commit   =     22164 KB  =     22 MB
Work Set =     23360 KB  =     23 MB

Kernel Time  =   172.843 =    0%
User Time    = 138908.546 =  771%
Process Time = 139081.390 =  772%
Global Time  = 17994.765 =  100%
Sure, process time was twice as large on -j8 as on -j1, but this was version 4.

So I'd say this is better than expected: it was run on a computer where multithreading benefits would be expected to be smaller, and still it speeds up quite a lot.

Re: More multithreading

Reply #123
Yes, looks good. For now, there seems no obvious way to improve efficiency or scaling further, so I think it is time to write some documentation and get this merged.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #124
For now, there seems no obvious way to improve efficiency or scaling further
Except -M? Which probably will not be prioritized; maybe carry a "WARNING: -M limits multithreading to -j2"?
Anyway, on this computer it didn't seem that -Mj2 would even improve over -Mj1, but to get more data I have now tried a big number of -1fj1 vs -1fj2 runs on this and on two Dell laptops, and it looks slightly more optimistic. (Global) time saved by going -j2 was measured at 4% on this computer over twenty-something runs unattended, and then 10% and 20% on those two laptops - and although I didn't fire up more runs on the desktop in the above table, it could even be slightly more. At least it got the right sign, although you would have had to expect GitHub issues with "multithreading doesn't work!" had you implemented a previous suggestion of putting a -M in the -0 preset ;)

(Only Intel CPUs tested here, I should add.)

so I think it is time to write some documentation and get this merged.
In the course of that, there is a decision coming up - or one may postpone it: what should "-j0" signify? Suggestion:
Implement (or at least, make no decision that would preclude it in the future) -j0 as "allow multithreading, let the encoder decide". It could for now invoke -j1, but, thinking aloud and proposing something that actually multithreads yet not too aggressively:
 -j0 invokes -j2, except if there is -M, then it single-threads.
My loose idea behind that was to signal to users that -j0 is not supposed to be synonymous with any of the others - so stop whining when you find out it is neither -j1 nor -j2 - nor when it changes! Users cannot expect it to stay constant when it is tuned to be something smarter than a fixed number, so make it "smarter than -j2" from day one.
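
A minimal sketch of that proposed rule (purely illustrative; resolve_threads and its parameters are made up, not flac's actual code):
Code: [Select]
#include <stdbool.h>

static unsigned resolve_threads(unsigned requested, bool adaptive_mid_side)
{
    if (requested != 0)
        return requested;              /* an explicit -jN is taken as-is */
    return adaptive_mid_side ? 1 : 2;  /* -j0: modest default; -M gains little */
}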

I take it that for the sake of applications that pass one file to one thread (like fb2k), the default will be -j1.