Topic: More multithreading

Re: More multithreading

Reply #126
I agree that it is a good idea to warn users it will change in the future. My thinking-aloud was ... why not make that warning clearer - because it isn't synonymous to any other setting, it cannot be expected to remain synonymous to any other setting. Just to warn users about that from jay day one. But I'm not crying over it currently being same as -j1.


Anyway, I see the tests here are CDDA material. Maybe high resolution and multichannel? More for sanity checking - see if there is anything completely bonkers going on - than for sheer numbers.
But the numbers ... might be unreliable, so maybe I should ask, were there any "general" changes made from v1 to v5?
Asking because when I tried the below hi-res corpus on -8Mper7 -j1/-j2 (yes, "M" - stupid setting but the point was to test "-M") then I got the following times:
v1   2574 vs 2550
v5   2538 vs 2534
(Wombat's build: slightly faster. I didn't run more 8pe, not included below.)


So, on to the tests; v1 (first one posted in this thread) vs v5 vs Wombat's build from same source as v5, but with AVX-512 compiler flag ("v4", but I left that out not to confuse with ktf's builds):

High resolution. 62 minutes, 2848MiB compressed (-5): size contributions are 1008 of 192/24 from the 2L testbench and Linn Records, 619 of that infamous 768kHz Carmen Gomes PR stunt, 350 of DXD (that's only 5 minutes!) from the 2L testbench - and then the rest is various-rate 32-bit integer from that sample rate converter site, and then one track in 192/16 from some French stoner band.
 
First two cells were redone later, computer was apparently still busy when I started. Also did -j9 for a sanity check (got only 8 threads) -  virtually the same times as -j8, omitted from the table.
One surprise: v1 doing -0b4096 -j2 so well. Confirmed by a few re-runs.
But generally, v5 is superior on this material too. Take individual times with a grain of salt.

This time I used seconds. You can see where the benefits flatten out:
               j1   j2   j3   j4   j5   j8  Mj1  Mj2
v1@-0r0        22   23   18   17   17   17   22   22
v5             21   12   12   11   13   11   22   19
Wombat         21   13   12   12   12   13   23   19
v1@-0          21   20   18   17   17   17   23   21
v5             22   13   11   11   12   11   23   20
Wombat         22   13   12   12   12   12   23   20
v1@-0b4096     21   15   16   16   16   16   22   15
v5             21   12   10   11   10   10   22   19
Wombat         21   12   11   11   12   11   22   18
v1@-0er7       31   30   18   17   17   17   32   32
v5             31   17   14   12   11   12   32   29
Wombat         30   17   15   12   12   12   31   29
v1@-5          33   24   16   17   16   16
v5             33   18   14   11   12   11
Wombat         32   18   14   12   12   12
v1@-8         100  110   54   38   38   31
v5            100   53   39   32   32   30
Wombat         97   52   38   30   31   28
v1@-8e        477  484  249  183  178  153
v5            481  253  183  150  151  145
Wombat        476  247  181  149  148  141
v1@-8pr7      638  644  335  244  255  206
v5            642  332  241  201  201  194
Wombat        631  327  239  197  197  189
(Wombat builds produce slightly different files, size differences +/- 0.01%. )


5.1 multichannel. About an hour, DVD-sourced at 48kHz (avoiding high resolution here, one test at a time).
Since -M was off the table, I cut down to fewer -j options too. Again I have omitted a -j9 done for sanity checking, it produced numbers consistent with -j8.

There are some weirdnesses for -0; I cannot rule out that the computer might have been not completely done with some other job or whatever.  Also I cannot access that computer at the moment to re-run it (I am on the road, I had it output numbers to a text file in the cloud).
               j1   j2   j4   j8
v1@-0          13    9    9   14
v5             13   21   15    6
Wombat         14    7    7   10
v1@-2er7       17   18    9   11
v5             17   10    8    7
Wombat         20    9    7    7
v1@-5          14    9    9    9
v5             14    8    6    7
Wombat         14    9    6    7
v1@-5          34   32   24   10
v5             33   20   11    9
Wombat         31   19   11    8
v1@-8e        118  117   59   35
v5            121   92   48   34
Wombat        115   91   47   32
v1@-8pr7      162  168   79   48
v5            161  129   67   46
Wombat        153   83   47   44
So, since I was looking for anomalies, and am away from that computer since firing up that multichannel test ... well, I would have hoped I didn't have to re-run anything due to results like those -0. But with that minor reservation, I think the picture (on this computer with some Intel 4 cores 8 threads) is getting quite clear.  v5 does behave sane, but I should count myself lucky if I save much time going beyond -j4.

Re: More multithreading

Reply #127
Current git of the multithreading version c1fc2c91, CPU generic.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #128
I used that version above on my little j5005 machine and have recompressed more than 300GB over time using --threads=3, since it runs anyway. These are mixed bitrates.
All files correctly bit-compare.
@ktf Do you already have a timeline for the merge with xiph master or even a final and multithreaded 1.4.4?
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #129
Like 1.4.0, release it 09-09 in order not to confuse ISO-8601-illiterate Americans?  ;)
a final and multithreaded 1.4.4
Question arises, are the changes so minor that it will stay "1.4"?
Possible relevance: a "1.5.0" might justify some more discussion on what is to be included.

Speaking of the second digit:
1.3.4 has this error in Rice partitions with escape code zero (testbench file 64). And 1.3.4 is the last of the 1.3 series.
Is there a risk that 1.3.4 will be kept in production because of the breaking changes in 1.4.0? If so, should there be a maintenance 1.3.5 with this bugfix?

Re: More multithreading

Reply #130
@ktf Do you already have a timeline for the merge with xiph master or even a final and multithreaded 1.4.4?
Not really, no. Merge with master probably in a few weeks, release might be next year.

Question arises, are the changes so minor that it will stay "1.4"?
The reason to bump the 4 to 5 would be because of a breaking API change. That isn't the case here.

1.3.4 has this error in Rice partitions with escape code zero (testbench file 64). And 1.3.4 is the last of the 1.3 series.
Is there a risk that 1.3.4 will be kept in production because of the breaking changes in 1.4.0? If so, should there be a maintenance 1.3.5 with this bugfix?
No, that won't happen. I don't have time to backport all fixes that have happened in the meantime.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #131
No, that won't happen. I don't have time to backport all fixes that have happened in the meantime.
Fair enough - it is not that it creates invalid files (I think?)
But the changelog could maybe have been clearer recommending up- or downgrade if flac.exe version 1.3.4 errs out on a file.

Re: More multithreading

Reply #132
@ktf Thanks for the info.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #133
"Currently, passing a value of 0 is synonymous with a value of 1, but this might change in the future"
Maybe it would be better to change its meaning to "sets to amount of available cores"?

Re: More multithreading

Reply #134
There is no platform-independent way to determine the 'amount of available cores': this is different for Windows, MacOS, *nixes, microcontrollers etc. Might also differ between CPU architectures. Also, with the advent of performance and efficiency cores, using all cores might not be beneficial. Same goes for hyperthreading and similar technologies.

So auto-selecting isn't as simple as it might seem.
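
To illustrate the point about platform differences, here is a minimal Python sketch (not FLAC code, which is C). It assumes nothing beyond the standard library: `os.cpu_count()` reports logical CPUs and may return None, while `os.sched_getaffinity` only exists on Linux and some other Unix-likes. The helper names are made up for this example. Neither call says anything about performance versus efficiency cores or hyperthreading.

```python
import os


def logical_cpus():
    """Logical CPUs reported by the OS; may be None if it cannot be determined."""
    return os.cpu_count()


def usable_cpus():
    """CPUs this process may run on, where the platform can tell us."""
    if hasattr(os, "sched_getaffinity"):  # not available on Windows or macOS
        return len(os.sched_getaffinity(0))
    return os.cpu_count()  # fallback: no affinity information available


if __name__ == "__main__":
    print("logical CPUs reported:", logical_cpus())
    print("CPUs usable by this process:", usable_cpus())
```

On a machine where the process is pinned to a subset of cores (containers, taskset), the two numbers can differ, which is one more reason an "all cores" default is hard to define.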
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #135
It's a de-facto standard that 0 means "as many cores as available", if you don't want to do that I suggest removing 0 as an option entirely. Either way the default should be 1. IMO if a user requests "as many cores as available", it's on them if that's not the most effective option.

Re: More multithreading

Reply #136
It's a de-facto standard that 0 means "as many cores as available",
Wasn't it so that "-threads 0" in ffmpeg means "let application decide"?

Either way the default should be 1.
Obviously. Say fb2k will spawn one instance per available thread.

Re: More multithreading

Reply #137
You could also read 0 as "default", and that is still 1 thread.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #138
First thanks for starting to develop multithreading.

I performed some tests and compared it to https://www.rarewares.org/files/mp3/fpMP3Enc.zip multithreading behavior.
fpFLAC2FLAC used all cores @100% by default (with no options added) but flac.exe with maximum opted threads uses CPU in this way: [image]

And I would like to ask if there is any chance to add some timers to the CLI text for testing purposes?

Re: More multithreading

Reply #139
AMD Ryzen 5900x, 24 threads (--threads=24)
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #140
but flac.exe with maximum opted threads uses CPU in this way:
Please explain what options you used.

FLAC encodes very fast, the system calls used to enable multithreading take some time to execute, and some things cannot be multithreaded. This means that when multithreading with a large number of threads, full CPU usage can only be reached when using a slow FLAC preset, like -8p. If you used the default compression level of 5, then what you are seeing is probably due to parsing or decoding (which cannot be multithreaded) being a bottleneck.
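
The bottleneck argument can be sketched with Amdahl's law. This is only a rough model: the parallel fractions below are hypothetical numbers chosen for illustration, not measurements of flac, and the formula ignores the system-call overhead mentioned above.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Ideal speedup when only `parallel_fraction` of the work scales with threads."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Hypothetical fractions: a fast preset spends relatively more time in the
    # serial part (parsing/decoding the input) than a slow preset does.
    for label, fraction in (("fast preset (assumed 0.50)", 0.50),
                            ("slow preset (assumed 0.98)", 0.98)):
        for threads in (2, 4, 8, 32):
            s = amdahl_speedup(fraction, threads)
            print(f"{label}  -j{threads}: {s:.2f}x")
```

With half the work serial, no thread count can reach a 2x speedup, so most of 32 threads sit idle; only when the serial share is small does a high -j keep the CPU busy.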
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #141
Please explain what options you used.
5950x automatically runs @~4,40-4,55 GHz settings -8 -V -j32 [image]
5950x auto throttles down to ~3,25-3,35 GHz settings -8 -V -e -p -j32 [image]

Re: More multithreading

Reply #142
5950x auto @4,45 GHz settings -8 -V -p -j32 [image]

Re: More multithreading

Reply #143
@ktf Do you already have a timeline for the merge with xiph master or even a final and multithreaded 1.4.4?
Not really, no. Merge with master probably in a few weeks, release might be next year.

Any news on this? I really liked the idea of MT FLAC, I've made profile for CUETools for speedier encoding of music... :)
TAPE LOADING ERROR

Re: More multithreading

Reply #144
Fuzzing found some exotic/rare bugs in this code that probably nobody will ever encounter, which I will need to fix. I don't have time to do that soon though, especially since multithreaded code is much harder to debug.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #145
I have pushed some changes to the multithreading code. It is too soon to say for sure, but it seems to fix some of the problems.

While the changes could impact performance, in my own tests it doesn't seem measurable. If anyone wants to double-check, the last two compiles here are probably usable for that.
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #146
Health for your labor. I quickly did a small encoding test. I guess FLAC decoding is not multithreaded yet. I tried it, but I didn't see a difference.

Intel i7 3770k(4 core, 8 thread), 16 gb ram, 256 gb ssd
FLAC git-7f7da558 20240226 - "flac.exe -o output -x --no-md5 --totally-silent -jx -f input"
HALAC 0.2.6 Normal - "halac_encode input output -y -mt=x"

WAV : 1,857,654,566 bytes (Merged 3 Music album)
-------------------
HALAC Normal mt=1 : 10.359
HALAC Normal mt=2 :  6.578
HALAC Normal mt=4 :  4.328
HALAC Normal mt=8 :  3.672
HALAC Normal mt=16 : 3.609
1,245,704,379 bytes
-------------------
FLAC -0 j1 : 10.390
FLAC -0 j2 :  5.937
FLAC -0 j4 :  6.172
FLAC -0 j8 :  5.687
FLAC -0 j16 : 6.109
1,318,502,972 bytes
-------------------
FLAC -1 j1 : 11.015
FLAC -1 j2 :  6.484
FLAC -1 j4 :  6.469
FLAC -1 j8 :  7.125
FLAC -1 j16 : 6.765
1,293,667,655 bytes
-------------------
FLAC -2 j1 : 12.297
FLAC -2 j2 :  6.687
FLAC -2 j4 :  6.062
FLAC -2 j8 :  6.406
FLAC -2 j16 : 6.515
1,288,861,797 bytes
-------------------
FLAC -3 j1 : 16.453
FLAC -3 j2 :  8.750
FLAC -3 j4 :  6.000
FLAC -3 j8 :  5.562
FLAC -3 j16 : 5.219
1,254,819,663 bytes
-------------------
FLAC -4 j1 : 19.843
FLAC -4 j2 : 16.203
FLAC -4 j4 : 16.312
FLAC -4 j8 : 16.218
FLAC -4 j16 : 16.406
1,221,587,898 bytes
-------------------
FLAC -5 j1 : 27.124
FLAC -5 j2 : 14.109
FLAC -5 j4 :  8.140
FLAC -5 j8 :  7.015
FLAC -5 j16 : 7.328
1,218,712,751 bytes
-------------------
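
For anyone who wants the scaling at a glance, here is a small Python sketch that turns a few rows of the listing above into speedups relative to -j1. The times are copied from the listing; the dictionary layout and the `speedups` helper are just for this example.

```python
# Wall-clock seconds copied from the listing above (i7 3770k, FLAC git-7f7da558).
flac_times = {
    "-0": {1: 10.390, 2: 5.937, 4: 6.172, 8: 5.687, 16: 6.109},
    "-4": {1: 19.843, 2: 16.203, 4: 16.312, 8: 16.218, 16: 16.406},
    "-5": {1: 27.124, 2: 14.109, 4: 8.140, 8: 7.015, 16: 7.328},
}


def speedups(times):
    """Speedup of each thread count relative to the single-thread time."""
    base = times[1]
    return {threads: base / seconds for threads, seconds in times.items()}


for preset, times in flac_times.items():
    row = "  ".join(f"j{j}: {s:.2f}x" for j, s in speedups(times).items())
    print(f"FLAC {preset}  {row}")
```

Read this way, -5 keeps gaining up to -j8 on this 4-core/8-thread CPU, while -0 stops improving after -j2 and the -4 rows barely move at all.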

 

Re: More multithreading

Reply #147
I guess FLAC decoding is not multithread yet.
The FLAC format is unfit for multithreaded decoding. That is because reliably finding the next frame involves parsing the current one. One could offload MD5 calculation to a separate thread, and maybe parsing and decoding could be done in separate threads, but I currently don't really see a way to make more than 4 threads have any benefit at all, and even then, the workload would be very uneven.
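
A toy Python sketch of why scanning ahead for frame starts is unreliable. A FLAC frame begins with a 14-bit sync code, which in practice shows up as the bytes FF F8 or FF F9, but nothing stops the same bytes from appearing inside the compressed payload. The byte strings below are made up and are not valid FLAC frames; the function name is hypothetical.

```python
SYNC_PREFIXES = (b"\xff\xf8", b"\xff\xf9")  # sync code + reserved bit + blocking strategy bit


def find_sync_candidates(data):
    """Return every offset where the two-byte sync pattern appears."""
    return [i for i in range(len(data) - 1) if data[i:i + 2] in SYNC_PREFIXES]


# Two made-up "frames": sync bytes followed by an arbitrary payload.
# The first payload happens to contain the sync pattern itself.
frame_a = b"\xff\xf8" + bytes([0x12, 0xFF, 0xF8, 0x34])
frame_b = b"\xff\xf8" + bytes([0x56, 0x78])
stream = frame_a + frame_b

true_starts = [0, len(frame_a)]
candidates = find_sync_candidates(stream)
false_positives = [c for c in candidates if c not in true_starts]

print("true frame starts:", true_starts)
print("sync candidates:  ", candidates)
print("false positives:  ", false_positives)
```

Real decoders can reject many false candidates using the header and frame CRCs, but that is still a check after the fact. The only way to know for certain where a frame ends is to parse it, which is the serial dependency described above.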
Music: sounds arranged such that they construct feelings.

Re: More multithreading

Reply #148
I have pushed some changes to the multithreading code. It is too soon to say for sure, but it seems to fix some of the problems.

While the changes could impact performance, in my own tests it doesn't seem measurable. If anyone wants to double-check, the last two compiles here are probably usable for that.

Not sure about the problems but for many 16/44.1 and several 24/96 files i didn't find any.
The recent git, compiled the same way as the one before, gives ~191x speed for -j20 -8ep with 16/44.1 against the former ~192x speed on my 5900x.
No speed concerns for sure.
Is troll-adiposity coming from feederism?
With 24bit music you can listen to silence much louder!

Re: More multithreading

Reply #149
Not sure about the problems but for many 16/44.1 and several 24/96 files i didn't find any.
It would show up randomly about once in every 3.000.000 executions. This is the kind of bug we have to thank Google's oss-fuzz project for finding. You can probably imagine it took me quite a while to find a proper fix for it.
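
A back-of-the-envelope Python sketch of why ordinary testing would not catch this. It assumes the bug triggers independently on each execution with probability 1 in 3,000,000 (a simplification of "about once in every 3.000.000 executions"); the function name and the run counts are only for illustration.

```python
import math


def prob_at_least_one(p_per_run, runs):
    """Chance of seeing the bug at least once in `runs` independent executions."""
    # 1 - (1 - p)^runs, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(runs * math.log1p(-p_per_run))


if __name__ == "__main__":
    p = 1 / 3_000_000
    for runs in (1_000, 100_000, 3_000_000, 30_000_000):
        print(f"{runs:>10} runs: {prob_at_least_one(p, runs):.4%}")
```

Even a few thousand encodes give well under a 1% chance of hitting it once, so clean results on a personal music collection are expected either way. It takes a fuzzer running millions of executions to surface it.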

Quote
No speed concerns for sure.
Great, thanks for checking.
Music: sounds arranged such that they construct feelings.