
Re: More multithreading

Reply #100
Quote
Yep, something's funky with the scaling already, with -j1 it's fine.
What do you mean? That -1 and -4 are different? This has been mentioned in the thread start and in replies #2, #18, #19, #30 and #68. Otherwise, I don't know what you mean.

Quote
Suggestion for that case with four concurrent files:
[...]
Very messy. I would get lost in that.

Quote
Say, we have taken note that it is hard to make good use of multi-threading a short file to be encoded with low preset.
Have you tested with short files? The impact doesn't seem to be too severe. If I take CDDA input files of 1 second (so 44100 samples) I'm still seeing net gains, not losses, when multithreading. For example, using -8 -j4 on my 4-core machine gives a 1.9x speedup. With preset -0 I still get a 1.4x speedup with 4 threads. So the overhead of setting up and destroying threads isn't too much.

Quote
Possibly you could consider the following line of arguments - subject to being anywhere remotely close to the fact, I am quiiiite ignorant here:
  • If it is just one single file, it will be done in one second anyway, you can get it down to half a second but who cares if you cannot get it down to a third of a second - end-users get impatient over seconds to wait, not over percentages;
We're not talking about seconds here, but milliseconds with current CPUs. Seriously, encoding 1600 such 1-second files takes 10 seconds in total when single-threading with preset -8, and 5 seconds with -j4.

Also, a program should act somewhat predictably to an end user. If the command line tool processes files in a different order than the user supplied them, to improve throughput, that is going to be confusing.

Quote
Maybe this could eliminate the work of trying to improve scenarios where the impact won't matter to the users?
The problem is that I cannot determine, for all the systems flac can run on, which scenarios have impact and which do not.

Quote
Quote
Reason to ask this first is this question about what we should measure - and what utilities to use and read off the numbers.
Most importantly wall time.
Obviously for the end result. But for testing, you don't get much useful extra information from including anything else?
I don't know any.

Re: More multithreading

Reply #101
Quote
Say, we have taken note that it is hard to make good use of multi-threading a short file to be encoded with low preset.
Have you tested with short files? The impact doesn't seem to be too severe. If I take CDDA input files of 1 second (so 44100 samples) I'm still seeing net gains, not losses, when multithreading. For example, using -8 -j4 on my 4-core machine gives a 1.9x speedup.
I was thinking about the -0 end and not the -8 end ... but anyway:
Testing a compilation album with short songs, not atypical for the genre: https://nocleansinging.bandcamp.com/album/hold-fast-grindviolence-compilation - free download for anyone to replicate the experiment on their computers
30:26 long, 20 tracks. 23:49 is CDDA, 997 kbit/s at -5 (yes noisy), 4:07 is 44.1/24 at 1698, and 2:31 is 96/24 at 3409.
for %j IN (1,2,4,8,16) DO (timeout /t 8 & \bin\timer64.exe flac-multithreading-v4.exe -ss -j%j -f <setting> *.flac )
8 seconds is maybe not much cooldown time, but it is quite a lot compared to the busy times. And -j16 is supposed to be useless on a 4-core/8-thread i5-1135G7 (throwing it in just to verify it doesn't make a mess of anything):

--lax -0b16384 --no-md5-sum where -j4 takes more time than -j2 even if occupying more threads. Times -j1 -j2 -j4 -j8 -j16 are:
     2.955    2.166    2.193    2.316    2.337   
-0b4096  where again -j4 takes more time than -j2
     3.414    2.144    2.311    2.310    2.444   
-2e and at this stage I wonder if I should have run -j3 and -j5 and the whole thing
     4.118    2.640    2.895    2.990    3.091   
-5 and finally -j4 catches -j2, but -j8 doesn't improve over -j4
     4.554    3.045    2.859    2.953    3.068   
-8 and here -j4 does save considerable time.
     9.578    5.624    3.895    3.844    3.826   

So up to -5-ish, running -j4 / -j8 (/ -j16) just means firing up 4 / all 8 (/ ditto) threads to do the same work that -j2 does.
Do I interpret it correctly as follows: you fire up 2 or 6 extra threads only to do the extra work created by the overhead? That is a waste. If I want to put my CPU to work for two seconds, and can get it done in four thread-seconds, then spending sixteen thread-seconds probably makes for several times as much heat - which would translate to a huge increase in duration if I were to run this as a week-long job where the CPU would be pretty much throttled over the heat.


Quote
Maybe this could eliminate the work of trying to improve scenarios where the impact won't matter to the users?
The problem is that I cannot determine for all systems that flac can run on, what scenarios stuff has impact and which do not.
And that just makes my argument even better (for you): if multi-threading multiple files means that users are not going to invoke <particular single-file scenario> so often, you don't need to worry as much about it as you would if you presume that all multi-threading is run on single files.

Re: More multithreading

Reply #102
I think the main drawback of wall time is that it includes everything: antivirus updating in the background, Windows telemetry, and things like that.

Honestly, at this moment I hope the main focus is still single-file multithreading, with a secret (now that I've mentioned it, no longer secret) wish for variable-blocksize development that may utilize some threads, which would be much more rewarding than the pathetic -pe combination.

Re: More multithreading

Reply #103
I don't know how Windows calculates the time a process uses.  On Linux, the time command gives 3 results: real, user and sys. Real is the wall time, user is how much time the process itself takes outside of the kernel, and sys is how much time the process takes within the kernel.
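The kernel accounts those two separately per process; here is a tiny standalone C illustration (unrelated to flac) that reads them back via getrusage():
Code: [Select]
/* Where "user" and "sys" come from: the kernel accounts them separately,
   as getrusage() shows. Standalone illustration, nothing to do with flac. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 300000000UL; i++)
        x += i;                        /* pure computation: user time */

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);       /* sys time comes from kernel work, e.g. I/O */
    printf("user %ld.%06lds  sys %ld.%06lds\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}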

Running FLAC with one thread to a ramdisk (tmpfs) on my input gives me this:
Code: [Select]
real    0m43.619s
user    0m43.106s
sys     0m0.512s
user + sys = 0m43.618s.  I don't really have anything else on my system using resources other than the browser.


Running with two threads to ramdisk:
Code: [Select]
real    0m23.948s
user    0m47.475s
sys     0m0.376s
(user + sys) / jobs = 0m23.925s


Running with 8 threads to ramdisk:
Code: [Select]
real    0m8.575s
user    1m7.709s
sys     0m0.568s
(user + sys) / jobs = 0m8.535s


Running with 8 threads to disk (zfs):
Code: [Select]
real    0m40.068s
user    1m14.153s
sys     0m2.573s
(user + sys) / jobs = 0m9.590s
So in this case, FLAC only needed 9.59s to do its thing, but writing to disk slowed the process down by an additional 30s (I'm running ZFS on a single disk and random I/O suffers).

Re: More multithreading

Reply #104
Warning for information overload here.

I ran a variety of settings through version 3 and version 4 (note: only -j1, -j2, -j4 and -j8). Every figure is after a 120-second pause for cooldown. I suspect that wasn't always enough.
Times were recorded with the timer64 utility. I don't know what process time is worth, but those figures are surprising: there are big differences from version 3 to version 4, where the latter frequently measures much higher; on the two computers with fans that happens at the -5 settings and -2 settings, but on my fanless (hence throttling) home desktop it happens at the heavier -8xx settings.
But when process time gets so high, is that because it wastes processing power on overhead, or is it something else?

Ran on three computers, all with Intel 4-core/8-thread CPUs.
Common observation: for the -0 settings, one can stick to -j2.

Results from an HP Prodesk with i5-7500T (same as here). In version 4, -j8 slows down Global time compared to -j4 (and sometimes makes -j8 slower than version 3).
(times in seconds)   j1 process  j1 global  j2 process  j2 global  j4 process  j4 global  j8 process  j8 global
-8pr7   version3        638         639         667         633         683         224         688         189
        version4        617         618         648         325         665         168         690         176
-8er7   version3        692         692         718         683         743         243         747         207
        version4        670         671         694         348         728         184         750         191
-8r7    version3        176         177         190         161         191          57         191          51
        version4        172         172         180          91         186          48         213          56
-8r0    version3        156         157         171         144         177          63         169          46
        version4        156         157         161          82         166          43         198          52
-5q14   version3         67          67          73          44          75          30          75          31
        version4         66          67          70          36          73          21         108          30
-5q6    version3         67          68          74          44          75          30          75          31
        version4         67          68          70          36          73          21         102          30
-2er7   version3        105         106         122          83         120          34         125          39
        version4        101         102         106          54         111          30         121          33
-0mr0   version3         51          52          57          34          65          35          65          35
        version4         51          52          55          29          81          28         117          42
-0Mr0   version3         47          48          55          33          55          33          56          34
        version4         48          48          59          36          60          36          60          36
-0r0    version3         46          47          54          33          60          36          59          36
        version4         46          46          50          26          78          28         120          44
You notice that there are some -j8 settings where version 4 boosts "Process time" quite a lot: The "-5" settings and the "-0" settings, except the "-0Mr0" (the "soft" mid/side).


Same test run on a Dell business laptop, i7-1185G7. Here -j8 is a good thing for the -8-based settings; but compare to version 3 at the -8 -j8 settings.
(times in seconds)   j1 process  j1 global  j2 process  j2 global  j4 process  j4 global  j8 process  j8 global
-8pr7   version3        581         592         713         683         799         300         964         153
        version4        688         696         849         435         860         229        1008         150
-8er7   version3        716         723         751         719         852         319        1018         162
        version4        692         698         902         458         903         237        1197         173
-8r7    version3        170         178         203         185         258          97         231          50
        version4        198         210         218         126         209          71         239          47
-8r0    version3        154         174         167         138         199          81         214          45
        version4        145         161         194         108         198          66         224          49
-5q14   version3         62          77          71          60          73          35          73          42
        version4         61          77          77          53          79          35          98          31
-5q6    version3         58          65          72          62          74          41          75          38
        version4         59          72          78          55          78          34         102          32
-2er7   version3         95         108         118         101         134          57         130          45
        version4         94         106         122          75         128          46         172          44
-0mr0   version3         44          60          57          44          58          45          58          44
        version4         45          60          54          38          62          37         113          44
-0Mr0   version3         40          50          52          47          53          43          51          38
        version4         40          57          54          54          54          52          54          53
-0r0    version3         39          55          50          36          50          39          51          43
        version4         45          63          49          43          64          37         121          58
Process time numbers jump at the same spots in the table, but also at -2er7.
The top-left result (-8pr7 -j1 on version 3) was the first to be run, and if 2 minutes of cooldown was too little (which I suspect), it might read too low because it started from a longer cooldown while I fiddled a little back and forth.

Now for my usual fanless desktop, which throttles at will and produces unreliable numbers (CPU: i5-1135G7); here the bottom of the table deviates slightly:
(times in seconds)   j1 process  j1 global  j2 process  j2 global  j4 process  j4 global  j8 process  j8 global
-8pr7   version3        449         451         442         452         378         194         603         143
        version4        441         450         475         248         546         160         978         135
-8er7   version3        475         485         461         470         398         200         698         156
        version4        473         481         509         260         645         178        1037         137
-8r7    version3        123         123         105         110          63          42          89          31
        version4        123         123         117          63          84          37         162          31
-8r0    version3        107         107          93          97          48          37          81          28
        version4        107         107         106          59          65          33         136          28
-5q14   version3         41          47          25          34          23          24          24          24
        version4         41          51          34          28          11          18          13          20
-5q6    version3         42          46          24          34          24          23          23          23
        version4         41          46          34          28          10          18          14          20
-2er7   version3         64          68          33          58          13          26          28          28
        version4         64          67          55          40          45          25          25          22
-0mr0   version3         30          34          12          23          23          25          22          25
        version4         30          36           9          23           8          17          13          31
-0Mr0   version3         27          32          14          24          13          24          13          24
        version4         27          32          16          30          16          30          14          30
no-md5  version3         18          23           6          19          11          15          11          15
        version4         18          23          14          15           8          16           8          16

TAK -p0 (wall time only, j1/j2/j4)
        with MD5         55                      40                      41
        no MD5           47                      32                      28          n/a for TAK at -j8
Where it says "no-md5", that is -0r0 --no-md5-sum, instead of the ordinary -0r0 I ran above.
But anyway, here the high "Process" times are on the -8 settings.
Also included, for comparison: TAK at its fastest setting, -p0. MD5 summing is optional in TAK, and seems to remove some of the benefit of the multithreading, which for TAK is capped at 4 threads. Times here were recorded differently, with echo:|time .

Re: More multithreading

Reply #105
Two remarks on apparently "slow" speeds: TAK and the Dell laptop.

TAK. I had expected it to run faster, but it boils down to how fast (single-threaded) flac has become. Bragging rights to @ktf here.
In ktf's comparison studies, nothing encodes as fast as TAK -p0 - also verified on a couple of Intel CPUs in addition to the main study. Here it didn't run any faster than flac -5. (Curiously too, on these eleven CDs - the *j*.wav part of my signature - it didn't even compress better. But that doesn't generalize ...)
So I casually ran 1.3.4 at -5. Process/global times 52 and 58 seconds, indicating that the new builds are 1/6th faster. And -0Mr0: 35 and 47. Ran again and got exactly the same.
So on this computer, TAK -p0 was tied to old flac -0Mr0. But the fixed-predictor speedups since 1.3.4 are quite formidable, so finally TAK -p0 is getting dethroned at plain speed ... at least on a modern CPU.

Then there is the Dell laptop in the middle table, surprisingly slow given that its CPU is supposed to be better on every parameter: https://www.cpubenchmark.net/compare/2917vs3793vs3830 . I see it is set up with a pagefile, but if I/O were a concern it should be much more visible at the -0 settings. RAM is 16 GB on all three.
There must be some more aggressive BIOS-controlled throttling going on, to save the user's lap from getting burned, I guess. Whereas the fanless computer, which has a heatsink body around a NUC board, runs too hot to touch ... maybe that actually dissipates more heat than an awfully noisy laptop fan would, but I am surprised at the impact. Maybe I should check whether I can downclock it slightly.

Re: More multithreading

Reply #106
Quote
I don't know how Windows calculates the time a process uses. On Linux, the time command gives 3 results: real, user and sys. [...] So in this case, FLAC only needed 9.59s to do its thing, but writing to disk slowed the process down by an additional 30s (I'm running ZFS on a single disk and random I/O suffers).

So after some testing, it seems that instead of dividing user+system by the number of jobs run, I should have divided by the percentage of CPU actually used by the jobs.  When writing to ramdisk, there's no I/O bottleneck, so running FLAC with higher settings will get each thread to (nearly) 100%.  When writing to disk, the process is waiting on I/O to catch up (might not happen so much with smaller files), so each thread might only be using 50% or 25%, etc.  Using lower presets won't cause each thread to run at 100% either.  So for a process that actively uses the CPU for the duration of the task, the real (wall) time and user time will be the same (within a few milliseconds); only if a process sits idle during its task will the real time and user time differ.  I always test on ramdisk and use the real time to show performance.  Looks like that is still the best way, without any extra math involved.  Hope that makes sense, I'm awful at explaining things.

Re: More multithreading

Reply #107
Quote
But when process time gets so high, is that because it wastes processing power on overhead, or is it something else?
It was supposed to wait for work, but by mistake it did 'busy waiting'.

Anyway, attached is a new win64 binary. It should be much more efficient when the user asks for (way) too many threads. It lets threads properly wait when out of work, and also pauses threads for a longer time when they have to wait often. That dramatically reduces the amount of overhead. Also, it raises the maximum number of threads to 64.
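The difference, roughly: a busy-waiting thread spins in a loop asking "is there work yet?" and racks up process time while producing nothing, whereas a thread waiting on a condition variable sleeps until it is woken. A minimal sketch of the sleeping pattern (illustration only - not the actual libFLAC code, and without the adaptive pausing described above):
Code: [Select]
/* Minimal sketch: a worker sleeps on a condition variable when there is
   no work, instead of busy-waiting. Illustrative only, not libFLAC code. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  work_ready = PTHREAD_COND_INITIALIZER;
static int  pending = 0;            /* frames waiting to be encoded */
static bool shutting_down = false;

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (pending == 0 && !shutting_down)
            pthread_cond_wait(&work_ready, &lock); /* sleeps, burns no CPU */
        if (pending == 0) {          /* woken only for shutdown */
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        pending--;
        pthread_mutex_unlock(&lock);
        puts("encoding one frame");  /* stand-in for the real work */
    }
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);

    pthread_mutex_lock(&lock);
    pending = 3;                     /* hand the worker three "frames" */
    pthread_cond_signal(&work_ready);
    pthread_mutex_unlock(&lock);

    pthread_mutex_lock(&lock);
    shutting_down = true;
    pthread_cond_signal(&work_ready);
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    return 0;
}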

In my own tests, asking for 16 threads on a 4 core, 8 thread machine with preset -0 results in a 10% slower time than the sweet spot at 4 threads, whereas the previous binary could get **much** slower, sometimes even getting slower than single threaded.

This new version should not change much for slow presets like -8 with a sane number of threads, but it makes a huge difference when selecting a number of threads that is way too high, and with fast presets. I think it will also make quite a difference when run on a CPU that is already intermittently busy, because it scales the number of active threads up and down based on how well they run. This is difficult to measure, however.

Re: More multithreading

Reply #108
Questions before firing up the next FOR loops - in case there is anything that could be omitted / should be included:

* Is -M still at this stage limited to two threads? No matter what other settings? Anything else particular about -m vs -M vs --no-mid-side?
(Above I just didn't bother to make an exception for -M in the FOR loop, but -0 would anyway max out speed at low threads count.)

* Anything special about re-encoding? (Decoding is fast, but is it fast enough not to matter much for the housekeeping thread under any reasonable circumstances? Should that be tested?)

* In particular about MD5 computation and recompressing: Does flac (these builds, at least) compute the MD5sum "in the same workflow" for recompressing .flac as for compressing PCM? (AFAIUnderstand, flac --verify wavefile.wav will verify by creating a second MD5 sum and comparing it to the one for the source - but in principle, flac -f --verify flacfile.flac doesn't need to compute MD5 from the source if that is stored in the source file ... not saying it is worth it; if users ask for -8pel32 they might want to test the source first rather than waiting eons just to be told that nah, the source was corrupted.)

* Also, I just discovered that there is not only an undocumented --no-md5-sum, but also a --no-md5 - do those work the same? (Also asking in case these builds have some exceptional behaviour implemented for only one of them.)

Re: More multithreading

Reply #109
Own standard compile of v4 without limit vs own v5, again 12-core/24-thread 5900X, -8ep -V:
       v4     v5
j12   173x   173x
j16   183x   183x
j24   193x   194x

For this scenario it works well, thanks!

Re: More multithreading

Reply #110
Quote
* Is -M still at this stage limited to two threads? No matter what other settings?
Yes and yes.

Quote
but -0 would anyway max out speed at low threads count.
It did max out at 3 threads with v4 in my tests, now it does at 4. But that CPU only has 4 cores anyway.

Quote
* Anything special about re-encoding?
Yes, decoding does hold up encoding on (very) fast presets.

Quote
* In particular about MD5 computation and recompressing: Does flac (these builds, at least) compute the MD5sum "in the same workflow" for recompressing .flac as for compressing PCM?
The first thread crosses the API boundary, and is for (1) the internals of the flac command line program, (2) the WAV reading or FLAC decoding, (3) verify decoding and (4) some internal copying and moving of data. If this thread is idle, it will start working on a frame. One of the other threads does MD5 calculation on the data that is to be encoded, and the others create frames.
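Schematically, that division of labour might be pictured like this (illustrative sketch only, with made-up names - not the actual libFLAC source):
Code: [Select]
/* Sketch of the division of labour described above; purely illustrative. */
#include <stdio.h>

enum thread_role {
    ROLE_MAIN,    /* CLI internals, WAV read / FLAC decode, verify
                     decoding, data copying; encodes a frame when idle */
    ROLE_MD5,     /* MD5 of the data that is to be encoded */
    ROLE_ENCODER  /* everything else: creating frames */
};

static enum thread_role role_of(unsigned thread_index)
{
    if (thread_index == 0) return ROLE_MAIN;
    if (thread_index == 1) return ROLE_MD5;
    return ROLE_ENCODER;
}

int main(void)
{
    for (unsigned i = 0; i < 4; i++)
        printf("thread %u -> role %d\n", i, (int)role_of(i));
    return 0;
}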

I just found out that the flac command line program does NOT calculate and/or check MD5 of the original file on reencoding. It only calculates a new MD5. It also doesn't check whether the original MD5 and the new one are the same. Probably something that should be fixed at some point.

Quote
(AFAIUnderstand, flac --verify wavefile.wav will verify by creating a second MD5 sum
No, it does not. It decodes and checks whether each and every decoded sample is the same as every input sample. It does not verify the stored MD5.

Quote
Also, I just discovered that there are not only one undocumented --no-md5-sum, but also a --no-md5 - do those work the same? (Also, in case these builds have some exceptional behaviour implemented for only one of them.)
I think that is a feature of the getopt functions: if an 'abbreviation' is unique, it will be accepted. So --no-md will also work. --no-m does not work because it is ambiguous; it could also mean --no-mid-side.
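A standalone demonstration of that getopt_long behaviour (glibc; hypothetical two-entry option table, not flac's actual one):
Code: [Select]
/* Demo of getopt_long's unique-prefix matching. --no-md5 and --no-md
   both match the first option; glibc rejects --no-m with an
   "ambiguous" error, because it could also start --no-mid-side. */
#include <getopt.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    static const struct option opts[] = {
        { "no-md5-sum",  no_argument, 0, 1 },
        { "no-mid-side", no_argument, 0, 2 },
        { 0, 0, 0, 0 }
    };
    int c;
    while ((c = getopt_long(argc, argv, "", opts, NULL)) != -1) {
        if (c == 1) puts("matched --no-md5-sum");
        if (c == 2) puts("matched --no-mid-side");
    }
    return 0;
}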

Re: More multithreading

Reply #111
flac git-5500690f 20230726

Code: [Select]
        -0      -1      -2     -3      -4      -5      -6      -7     -8 
 -j1    3.82s   3.99s   4.25s   4.37s   5.02s   6.02s   8.18s   10.27s 15.35s
 -j2    2.13s   3.16s   2.35s   2.37s   4.20s   3.28s   4.55s   5.59s  8.16s
 -j3    1.61s   3.19s   1.74s   1.77s   4.21s   2.41s   3.32s   4.09s  6.13s
 -j4    1.64s   3.18s   1.66s   1.60s   4.22s   2.00s   2.72s   3.32s  4.99s
 -j5    1.71s   3.18s   1.70s   1.64s   4.20s   1.83s   2.37s   2.85s  4.27s
 -j6    1.66s   3.18s   1.71s   1.66s   4.21s   1.82s   2.07s   2.55s  3.84s
 -j7    1.72s   3.21s   1.72s   1.64s   4.22s   1.83s   2.05s   2.31s  3.47s
 -j8    1.74s   3.17s   1.78s   1.64s   4.22s   1.84s   2.06s   2.21s  3.23s
 -j9    1.71s   3.31s   1.77s   1.64s   4.21s   1.87s   2.08s   2.22s  3.16s
 -j10   1.72s   3.17s   1.75s   1.62s   4.21s   1.85s   2.09s   2.24s  3.10s
 -j11   1.73s   3.18s   1.82s   1.69s   4.21s   1.87s   2.10s   2.33s  3.04s
 -j12   1.78s   3.16s   1.87s   1.63s   4.27s   1.93s   2.11s   2.24s  2.97s
 -j13   1.76s   3.21s   1.80s   1.69s   4.20s   1.89s   2.11s   2.29s  2.93s
 -j14   1.70s   3.17s   1.79s   1.66s   4.22s   1.88s   2.13s   2.32s  2.91s
 -j15   1.82s   3.18s   1.85s   1.67s   4.21s   1.92s   2.11s   2.30s  2.85s
 -j16   1.76s   3.20s   1.91s   1.65s   4.23s   1.89s   2.12s   2.27s  2.84s

Code: [Select]
        -0p     -1p     -2p     -3p     -4p     -5p     -6p     -7p    -8p
 -j1    3.82s   3.99s   4.24s   5.43s   6.37s   8.30s   16.39s  20.10s 44.02s
 -j2    2.16s   3.17s   2.34s   3.02s   5.57s   4.61s   9.19s   11.06s 24.25s
 -j3    1.61s   3.17s   1.74s   2.17s   5.57s   3.40s   6.76s   8.20s  18.19s
 -j4    1.64s   3.21s   1.65s   1.79s   5.60s   2.77s   5.48s   6.69s  14.85s
 -j5    1.65s   3.19s   1.68s   1.76s   5.59s   2.39s   4.75s   5.76s  12.82s
 -j6    1.63s   3.18s   1.69s   1.75s   5.57s   2.13s   4.21s   5.12s  11.44s
 -j7    1.65s   3.16s   1.73s   1.78s   5.58s   2.09s   3.86s   4.68s  10.38s
 -j8    1.69s   3.17s   1.78s   1.80s   5.58s   2.11s   3.54s   4.33s  9.59s
 -j9    1.75s   3.19s   1.78s   1.78s   5.60s   2.15s   3.55s   4.27s  9.58s
 -j10   1.78s   3.16s   1.79s   1.75s   5.58s   2.12s   3.48s   4.24s  9.72s
 -j11   1.77s   3.18s   1.77s   1.73s   5.57s   2.17s   3.44s   4.17s  9.51s
 -j12   1.75s   3.17s   1.84s   1.79s   5.60s   2.18s   3.39s   4.17s  9.45s
 -j13   1.72s   3.19s   1.87s   1.84s   5.57s   2.16s   3.35s   4.06s  9.22s
 -j14   1.78s   3.17s   1.87s   1.79s   5.57s   2.15s   3.32s   3.99s  9.16s
 -j15   1.76s   3.18s   1.82s   1.82s   5.59s   2.19s   3.28s   3.95s  9.09s
 -j16   1.76s   3.25s   1.81s   1.82s   5.59s   2.22s   3.30s   3.93s  9.03s

Didn't notice this before, but it seems presets 1 and 4 don't benefit from more than 2 threads. (Those two presets use -M, which as noted above is limited to two threads.)



Re: More multithreading

Reply #114
@Wombat made a couple of builds from the same source as the above version 5, and here follow some measurements against the one with "v3" flags, requiring AVX2 but not AVX512 (did I get that right?). This on the HP Prodesk, which cannot run the AVX512 one.

Compiles compare kinda how they should ... ? At least, no nasty surprises and no miracles, just a mild improvement from the instruction set of 3 to 5 percent on most settings - although yes, some exceptions in either direction, and far down-right in the table there are a few positive numbers where the Wombat v3 build takes slightly more time.

That explains the  (Wv3) line in the table: time difference in percent (negative means faster), against ktf's latest build, which appears in the "main" line.
That line first has compression time in seconds. Then I thought, why not represent the others as penalty relative to the benchmark where speed is proportional to number of cores. Say, if times are not 40/20/10 but 40/21/12, the penalties of 1 and 2 seconds show up as 5% (of the 20) and 20% (of the 10).
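In code, that penalty arithmetic is simply (using the 40/21/12 example from above):
Code: [Select]
/* The "overhead penalty" arithmetic used in the table below; the
   40/21/12 numbers are the worked example from the text. */
#include <stdio.h>

static double penalty_percent(double t_j1, double t_jn, int n_threads)
{
    double ideal = t_j1 / n_threads;        /* perfect scaling */
    return 100.0 * (t_jn - ideal) / ideal;  /* percent above ideal */
}

int main(void)
{
    printf("j2: %.0f %%\n", penalty_percent(40.0, 21.0, 2)); /* 5 %  */
    printf("j4: %.0f %%\n", penalty_percent(40.0, 12.0, 4)); /* 20 % */
    return 0;
}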

Although those %s might be misleading when numbers become small (I mean, it is the seconds that make us impatient!), that is anyway where -j4 doesn't unleash much. E.g. -0r1 -j2 was done in 27 seconds, and -j4 saves less than four more.
Also, I manually deleted the two -1j4 cells because -M caps the -j at 2 anyway. That in turn is because the multithreading is not (yet) optimized for -M, which is also pretty clear from the overhead on the two -1j2's.


... why this choice of settings? Because it seems the "-r" makes a difference between Clang and GCC compiles (#349 explains a mistake), so why not try a very fine partitioning and a very coarse one. Not so much for the number of seconds, more to verify that it doesn't behave unexpectedly stupidly under -r variations.

         j1 time   j2 ovrh   j4 ovrh
8pl32      3552      3 %       7 %
 (Wv3)     -5 %     -5 %      -5 %
8per7      5317      3 %       8 %
 (Wv3)     -5 %     -5 %      -5 %
8pr7        646      3 %       7 %
 (Wv3)     -7 %     -5 %      -5 %
8er7        688      4 %      10 %
 (Wv3)     -5 %     -5 %      -5 %
8r7         176      5 %      11 %
 (Wv3)     -5 %     -4 %      -4 %
8r1         153      4 %      11 %
 (Wv3)     -5 %     -3 %      -3 %
5r7          71      7 %      22 %
 (Wv3)     -3 %     -2 %      -2 %
5r1          66      8 %      25 %
 (Wv3)     -3 %     -2 %      -4 %
2er7        105      7 %      16 %
 (Wv3)     -4 %     -4 %      -4 %
2er1         61     11 %      52 %
 (Wv3)     -7 %     -6 %       1 %
2r7          61     11 %      48 %
 (Wv3)     -3 %     -3 %       2 %
2r1          52     12 %      74 %
 (Wv3)     -4 %     -3 %      -3 %
1r7          53     58 %       -
 (Wv3)     -3 %     -5 %      -5 %
1r1          48     51 %       -
 (Wv3)     -3 %     -6 %      -7 %
0r7          51     13 %      76 %
 (Wv3)     -3 %     -2 %       1 %
0r1          47     15 %      98 %
 (Wv3)     -3 %     -3 %       2 %
Times are median of 3. No CPU cooldown, rather the opposite: three -j1 runs discarded, then -j1, -j2, -j4 and Wombat build -j1, -j2, -j4. If anything, that would mean that Wombat -j1 got different conditions, not the Wombat -j4.

Re: More multithreading

Reply #115
To whom it may concern...
Just for the fun of it, I ran my -7 test on an up-to-date computer, which I happen to have for a couple of days to set it up.
It has a Raptor Lake Intel Core i9 13900F, 5.6 GHz max., 8/16 performance/efficient cores, 32 threads.
Code: [Select]
-j1: Average time =  15.172 seconds (3 rounds), Encoding speed = 712.63x
-j2: Average time =   8.047 seconds (3 rounds), Encoding speed = 1343.55x
-j3: Average time =   5.704 seconds (3 rounds), Encoding speed = 1895.62x
-j4: Average time =   4.434 seconds (3 rounds), Encoding speed = 2438.43x
-j5: Average time =   3.658 seconds (3 rounds), Encoding speed = 2955.71x
-j6: Average time =   3.182 seconds (3 rounds), Encoding speed = 3397.51x
-j7: Average time =   2.885 seconds (3 rounds), Encoding speed = 3747.23x
-j8: Average time =   2.808 seconds (3 rounds), Encoding speed = 3850.43x
-j10: Average time =   2.807 seconds (3 rounds), Encoding speed = 3851.80x
-j12: Average time =   2.841 seconds (3 rounds), Encoding speed = 3806.15x
-j14: Average time =   2.868 seconds (3 rounds), Encoding speed = 3769.87x
-j16: Average time =   2.935 seconds (3 rounds), Encoding speed = 3683.40x
Tests were done with ktf's v5 binary (flac git-5500690f 20230726).

Re: More multithreading

Reply #116
And since the performance plateau was reached @ -j7 in the test above, I tried with some heavier loads, too:
-8:
Code: [Select]
-j1:    Average time =  23.259 seconds (3 rounds), Encoding speed = 464.85x
-j2:    Average time =  12.256 seconds (3 rounds), Encoding speed = 882.18x
-j3:    Average time =   8.570 seconds (3 rounds), Encoding speed = 1261.61x
-j4:    Average time =   6.566 seconds (3 rounds), Encoding speed = 1646.66x
-j5:    Average time =   5.374 seconds (3 rounds), Encoding speed = 2011.78x
-j6:    Average time =   4.679 seconds (3 rounds), Encoding speed = 2310.59x
-j7:    Average time =   4.207 seconds (3 rounds), Encoding speed = 2569.80x
-j8:    Average time =   3.908 seconds (3 rounds), Encoding speed = 2766.87x
-j10:   Average time =   3.732 seconds (3 rounds), Encoding speed = 2896.85x
-j12:   Average time =   3.672 seconds (3 rounds), Encoding speed = 2944.44x
-j14:   Average time =   3.657 seconds (3 rounds), Encoding speed = 2956.79x
-j16:   Average time =   3.706 seconds (3 rounds), Encoding speed = 2917.17x
Performance peak here is -j10 .. -j12, so even when the CPU ran out of performance cores there is some benefit.

-8p:
Code: [Select]
-j1:    Average time =  74.252 seconds (3 rounds), Encoding speed = 145.61x
-j2:    Average time =  37.895 seconds (3 rounds), Encoding speed = 285.32x
-j3:    Average time =  26.564 seconds (3 rounds), Encoding speed = 407.02x
-j4:    Average time =  20.738 seconds (3 rounds), Encoding speed = 521.36x
-j5:    Average time =  17.305 seconds (3 rounds), Encoding speed = 624.79x
-j6:    Average time =  15.060 seconds (3 rounds), Encoding speed = 717.94x
-j7:    Average time =  13.406 seconds (3 rounds), Encoding speed = 806.52x
-j8:    Average time =  12.321 seconds (3 rounds), Encoding speed = 877.50x
-j10:   Average time =  12.201 seconds (3 rounds), Encoding speed = 886.13x
-j12:   Average time =  11.442 seconds (3 rounds), Encoding speed = 944.94x
-j14:   Average time =  10.573 seconds (3 rounds), Encoding speed = 1022.64x
-j16:   Average time =   9.777 seconds (3 rounds), Encoding speed = 1105.86x
-j18:   Average time =   9.352 seconds (3 rounds), Encoding speed = 1156.12x
-j20:   Average time =   8.942 seconds (3 rounds), Encoding speed = 1209.17x
-j22:   Average time =   8.547 seconds (3 rounds), Encoding speed = 1264.96x
-j24:   Average time =   8.219 seconds (3 rounds), Encoding speed = 1315.49x
-j26:   Average time =   7.949 seconds (3 rounds), Encoding speed = 1360.17x
-j28:   Average time =   7.850 seconds (3 rounds), Encoding speed = 1377.32x
-j30:   Average time =   7.836 seconds (3 rounds), Encoding speed = 1379.73x
-j32:   Average time =   7.791 seconds (3 rounds), Encoding speed = 1387.76x
-j34:   Average time =   7.746 seconds (3 rounds), Encoding speed = 1395.82x
-j38:   Average time =   7.819 seconds (3 rounds), Encoding speed = 1382.73x
-j42:   Average time =   7.928 seconds (3 rounds), Encoding speed = 1363.72x
-j46:   Average time =   7.963 seconds (3 rounds), Encoding speed = 1357.78x
Performance peak at -j32..-j34 here.

Re: More multithreading

Reply #117
flac-multithreading-v5-win
Code: [Select]
timer64.exe v5 -j1 -8p -f in.wav
Global Time  =    55.150

timer64.exe v5 -j2 -8p -f in.wav
Global Time  =    30.851

timer64.exe v5 -j3 -8p -f in.wav
Global Time  =    25.706

timer64.exe v5 -j4 -8p -f in.wav
Global Time  =    19.132

timer64.exe v5 -j5 -8p -f in.wav
Global Time  =    16.910

timer64.exe v5 -j6 -8p -f in.wav
Global Time  =    13.622

timer64.exe v5 -j7 -8p -f in.wav
Global Time  =    12.661

timer64.exe v5 -j8 -8p -f in.wav
Global Time  =    10.662

timer64.exe v5 -j9 -8p -f in.wav
Global Time  =    10.145

timer64.exe v5 -j10 -8p -f in.wav
Global Time  =     8.773

timer64.exe v5 -j11 -8p -f in.wav
Global Time  =     8.469

timer64.exe v5 -j12 -8p -f in.wav
Global Time  =     7.719

timer64.exe v5 -j13 -8p -f in.wav
Global Time  =     7.503

timer64.exe v5 -j14 -8p -f in.wav
Global Time  =     6.735

timer64.exe v5 -j15 -8p -f in.wav
Global Time  =     6.678

timer64.exe v5 -j16 -8p -f in.wav
Global Time  =     6.413

Re: More multithreading

Reply #118
Attached an analysis of sundance's preset-7/8 measurements (with the -j9 results coarsely interpolated here), revealing a somewhat (to me, at least) unexpected local efficiency optimum at 4-5 threads. That local optimum doesn't show in e.g. Replica9000's statistics for preset 7/8, if I'm not mistaken. Anyway, good multithreading performance at and below 8 threads with v5!

Chris

Re: More multithreading

Reply #119
Quote
Performance peak at -j32..-j34 here.
Thanks! I think I can conclude from that that the 'leapfrogging' works correctly. One would assume a P core can do more work in the same amount of time than an E core, so if they were bound to the same number of frames before having to wait (which is the case with v1 and v3), that should have been visible in the results.

What is a bit vague though is how the scheduler works. Does it first saturate the P cores, then the E cores, then the 'hyperthreading system of the P cores'? Or does it first go to the P cores, then hyperthreading, then E cores? Anyway, seeing no regressions, I'd say it works pretty well, even if it isn't very efficient anymore at those high thread counts.

Quote
Attached an analysis of sundance's preset-7/8 measurements (with the -j9 results coarsely interpolated here), revealing a somewhat (to me, at least) unexpected local efficiency optimum at 4-5 threads.
Thanks! I think that local minimum is because at that point the first and second thread don't have to switch context too often. Those first two threads are less 'specialized' than the other threads and this brings some inefficiency.

Quote
Anyway, good multithreading performance at and below 8 threads with v5!
I agree! I think I did pretty well, not having done multithreaded programming before.  :D

Of course, thank you all for benchmarking. This has helped tremendously!

Re: More multithreading

Reply #120
If anyone cares about my table overloads, then first there is this mistake of mine: the i5-7500T in the top table here is 4 cores/4 threads. So for that one, consider the -j8 column a "sanity check".

I let some builds loose on the i5-1135G7 equipped fanless desktop again. (A table comparing with Wombat's builds posted here.)
Lower figures better. Negative numbers better for the new build. Only one run per setting per build, so take times with a grain of salt.

I have put two different things into the table here as well.
* The "time v5 vs v4" rows compare the running time of version 5 against version 4: negative numbers are speedups, positive are slowdowns.
* For the other rows, the "ovrhd" columns show the "overhead penalty": the idealized time would be "j1 time / number of threads", and I compare the actual time taken and quote the percent extra. Say, the 53% in the top j5 cell: idealized time would be 25-ish seconds, and it took 53 percent more, i.e. 38-ish. The worst number, "452%", means it took 5.5 times the ideal.
The "j9" column is computed against j1 time / 8, since there are only 8 threads on this CPU; -j9 was run as a sanity check.
Percent penalty is quite useless on the -2r0 settings, where j4 was fastest in wall time.

Also, the percent penalty is quite useless in the rightmost column. The two columns at the end are run with -M, where -j is capped at 2 because -M isn't good for multithreading. No shit, Sherlock: -2Mer7 -j2 took more time than -2Mer7.

                j1 time  j2 ovrhd  j3 ovrhd  j4 ovrhd  j5 ovrhd  j8 ovrhd  j9 ovrhd  -M j1 time  -M j2 ovrhd
-8:
v4                124      19%       20%       35%       53%      131%      120%        96          67%
v5                121       8%       23%       40%       55%      124%      138%        83          97%
time v5 vs v4     -2%     -12%        0%        1%       -1%       -6%        6%       -14%          2%
-5:
v4                 48      23%       49%       86%      104%      271%      287%        42          84%
v5                 49      24%       56%       80%      105%      239%      246%        40          88%
time v5 vs v4      1%       1%        6%       -2%        1%       -8%      -10%        -4%         -1%
-3:
v4                 36      27%       57%      112%      166%      366%      386%        39          79%
v5                 37      26%       68%      100%      148%      312%      307%        38          92%
time v5 vs v4      1%       1%        9%       -4%       -6%      -10%      -15%        -2%          5%
-2er7:
v4                 69      19%       34%       60%       76%      175%      238%        51         100%
v5                 70      16%       38%       52%       75%      180%      212%        50         102%
time v5 vs v4      1%      -1%        5%       -4%        1%        3%       -6%        -1%          0%
-2r0:
v4                 54      21%       25%       93%      127%      452%      393%        36          80%
v5                 44      24%       57%       85%      153%      377%      320%        38          91%
time v5 vs v4    -18%     -16%        3%      -22%       -9%      -29%      -30%         6%         13%

Re: More multithreading

Reply #121
Quote
The i5-7500T in the top table here
I am confused, which table do you mean?

Re: More multithreading

Reply #122
Quote
which table
Forgot to linkify it. Here, reply 104.

Top table: -j8 takes about the same time as -j4. To be expected, there are only four threads.
Next tables: -j8 takes shorter time than -j4 for the -8 based presets, but not -5 nor fixed-predictor and that is on 4core8threads CPUs.

For version 5 "-j8" performance on the i5-1135G7 (Intel data here), these are times in seconds from the same data as the table in the previous reply 120.
Only -j1, -j4, -j8
Code: [Select]
seconds -j1 -j4 -j8
-8:     121  42  34
-5:      49  22  21
-3:      37  18  19
-2er7:   70  27  24
-2r0:    44  20  26
Although timings are to be taken with a grain of salt, I confirmed the relations between the -j's with two Wombat builds too, so I believe that -8 benefits from going from -j4 to -j8, while the others gain only marginally or get slower. Reservation: this is on an already-hot, passively cooled computer. That could potentially give a small benefit to -j1 over -j4 over -j8, since immediately before the -j1 resp. -j4 resp. -j8 run there was a -Mj2 resp. -j3 resp. -j5 run (the latter heavier). But there was hardly a particular benefit for the top-left element: as I first ran v4, the CPU had been running -8jx and then -8Mjx encoding for fifteen minutes of process time, so it would have been hot.

Larger benefits at heavier jobs might be expected from a well-cooled computer, but not on a passively cooled one that runs hot to the touch: there I would rather expect that trying to increase the workload by employing more threads would cause throttling and diminish the speedup, even more so on heavier jobs where a thread isn't idling so much.

To point out how even this fanless computer - when running hot for a long time - still utilizes multi-threading:
Here are results from multithreading -8pel32. A two-day job on the now-obsolete version 4.
Code: [Select]
versionv4-8epl32-j1 

Commit   =     14224 KB  =     14 MB
Work Set =     15328 KB  =     15 MB

Kernel Time  =    19.031 =    0%
User Time    = 60512.734 =   98%
Process Time = 60531.765 =   98%
Global Time  = 61573.994 =  100%
 
versionv4-8epl32-j2

Commit   =     16772 KB  =     17 MB
Work Set =     19516 KB  =     20 MB

Kernel Time  =    16.609 =    0%
User Time    = 93122.156 =  196%
Process Time = 93138.765 =  196%
Global Time  = 47361.310 =  100%
 
versionv4-8epl32-j4

Commit   =     18588 KB  =     19 MB
Work Set =     20804 KB  =     21 MB

Kernel Time  =    25.937 =    0%
User Time    = 98499.734 =  385%
Process Time = 98525.671 =  385%
Global Time  = 25562.915 =  100%
 
versionv4-8epl32-j8

Commit   =     22164 KB  =     22 MB
Work Set =     23360 KB  =     23 MB

Kernel Time  =   172.843 =    0%
User Time    = 138908.546 =  771%
Process Time = 139081.390 =  772%
Global Time  = 17994.765 =  100%
Sure, process time was twice as large on -j8 as on -j1, but this was version 4.

So I'd say this is better than expected: it was run on a computer where multithreading benefits would be expected smaller and still it speeds up quite a lot.

Re: More multithreading

Reply #123
Yes, looks good. For now, there seems no obvious way to improve efficiency or scaling further, so I think it is time to write some documentation and get this merged.

Re: More multithreading

Reply #124
Quote
For now, there seems no obvious way to improve efficiency or scaling further
Except -M? Which probably will not be prioritized - maybe carry a "WARNING: -M limits multithreading to -j2"?
Anyway, on this computer it didn't seem that -Mj2 would even improve over -Mj1, but to get more data I have now tried a big number of -1fj1 vs -1fj2 runs on this and on two Dell laptops, and it looks slightly more optimistic. (Global) time saved by going -j2 was measured at 4% on this computer over twenty-something unattended runs, and then 10% and 20% on those two laptops - and although I didn't fire up more runs on the desktop in the above table, it could even be slightly more. At least it got the right sign, although you would have had to expect Github issues with "multithreading doesn't work!" had you implemented a previous suggestion of putting a -M in the -0 preset ;)

(Only Intel CPUs tested here, I should add.)

Quote
so I think it is time to write some documentation and get this merged.
In the course of that, there is a decision coming up - or one may postpone it: "-j0" should signify what? Suggestion:
Implement (or at least, make no decision that would rule it out in the future) -j0 meaning "allow multithreading, let the encoder decide". It could for now invoke -j1 but, thinking aloud and proposing something that actually multithreads but not too aggressively:
 -j0 invokes -j2, except if there is -M, then it single-threads.
My loose idea behind that was to flag to users that -j0 is not supposed to be synonymous with any of the others - so stop whining when you find out it is neither -j1 nor -j2, nor when it changes! Users cannot expect it to stay constant when it is tuned to be something smarter than a fixed number, so make it "smarter than -j2" from day one.
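In code form, the suggestion would amount to something like this (hypothetical sketch with made-up names, not a patch):
Code: [Select]
/* Hypothetical sketch of the -j0 suggestion above; not flac code. */
#include <stdio.h>

static unsigned resolve_threads(unsigned requested, int adaptive_mid_side)
{
    if (requested == 0)                /* -j0: let the encoder decide */
        return adaptive_mid_side ? 1   /* -M multithreads poorly */
                                 : 2;  /* modest default, free to change later */
    return requested;                  /* explicit -jN stays -jN */
}

int main(void)
{
    printf("-j0         -> %u threads\n", resolve_threads(0, 0));
    printf("-j0 with -M -> %u thread\n",  resolve_threads(0, 1));
    return 0;
}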

I take it that for the sake of applications that pass one file to one thread (like fb2k), the default will be -j1.