Why test wall time for multithreading? After all, multithreading doesn't use less CPU; it only cuts waiting time, and only if the CPU isn't already busy. Conversion software that spawns one thread per file would be expected to be more efficient, so why not let it ... ?
But there are situations where you want to save to a compressed format, say in DAW plug-ins; then I'd expect even a two seconds' wait to be noticeable. And if you are opening a project, the same goes for decoding?
Anyway, swine got curious:
* ffmpeg 7.0 now encodes faster: https://hydrogenaud.io/index.php/topic,125694.msg1042555.html#msg1042555 . Its fastest encoding is now on par with single-threaded fastest official WavPack (which compresses better!), while 6.1.1 spent 60 percent more time compressing worse. Seems to be due to one thread processing source and ... ? (I have no idea of what overhead that creates.)
* ffmpeg also decodes with some degree of multithreading, unknown to me until pointed out in that same thread. Aha ... so how fast? The reference FLAC multithreading in git is only for encoding, not for decoding.
* For multiple files, I just noticed that since decoding is fast, the penalty from a FOR loop is more than noticeable. Hence decoders that can do wildcards are at an advantage. Of course that advantage is very much real - heck, the time taken to type a FOR loop is also significant here! - but for measuring the decoder ... what then?
For a more apples to apples comparison, this is one (untagged) file, 73 minutes of CDDA on (internal) SSD. Corpus isn't super-important (I hope!):
I took the first ten and a half minutes of each of 7 CDs that are neither classical music nor metal - because of the variety of signals, some old near-mono, some this and some that. Compressed better than your average I guess, numbers are given at the end.
So think of it as one full compilation CD as image (no cuesheet, no tags!)
Testing was done with the hyperfine benchmarking tool that I recently started using: for the larger part I ran the whole thing 11 times plus warmup, with a pause in between to keep the CPU reasonably stable. Because some figures looked suspicious and I would re-run a few anyway (no big changes!), I included some others like shn and mlp in that re-run and pasted them in order. (Since the fastest ran in 1.199 and then 1.200, that makes no difference.) CPU: i5-1135G7, 4 cores 8 threads.
I wish hyperfine could be set to report this nice summary with median instead of mean, for robustness to the whims of the OS - but laziness gets the better of me when the output looks as nice as this. Reformatted slightly, and commented.
ffmpeg -i =N.flac.-5.flac -f wav -y NUL ran
1.03 ± 0.01 | times faster than | ffmpeg -i =N.flac.-0r0b4096--no-md5.flac -f wav -y NUL | This dual mono FLAC with no MD5 was encoded to decode fast. Seems ffmpeg ignores MD5. |
1.16 ± 0.01 | times faster than | ffmpeg -i =N.flac.-8l32.flac -f wav -y NUL | -8l32 --lax, to be more precise. I cannot force flac.exe to use a very high order, but this was intended to decode slower and it did |
1.32 ± 0.02 | times faster than | ffmpeg -i =N.tak.-p0.tak -f wav -y NUL | TAK, fastest one |
1.43 ± 0.02 | times faster than | ffmpeg -i =N.tak.-p4m.tak -f wav -y NUL | Why this is faster than -p2 ... could be number of frames |
1.59 ± 0.02 | times faster than | ffmpeg -i =N.tta -f wav -y NUL | TTA is a surprise. Look at how much faster than the reference ... |
1.73 ± 0.02 | times faster than | ffmpeg -i =N.wv.ffmpeg-0.wv -f wav -y NUL | WavPack. ffmpeg decodes WavPack faster than multithreaded wvunpack.exe does |
1.83 ± 0.02 | times faster than | ffmpeg -i =N.wv.-f.wv -f wav -y NUL | |
1.85 ± 0.03 | times faster than | ffmpeg -i =N.flac.-2e.flac -f wav -y NUL | FLAC with smaller block size, that yields time penalty |
1.93 ± 0.03 | times faster than | .\wvunpack.exe -qy --threads=8 =N.wv.-f.wv -o NUL | wvunpack with --threads=8. Only one faster than flac -d. |
2.08 ± 0.02 | times faster than | .\flac.exe -ss -d =N.flac.-0r0b4096--no-md5.flac -fo NUL | FLAC with official decoder. Here the absence of MD5 matters. ffmpeg does it twice as fast. |
2.08 ± 0.09 | times faster than | ffmpeg -i =N.alac.refalac.m4a -f wav -y NUL | ALAC compressed with refalac |
2.20 ± 0.02 | times faster than | ffmpeg -i =N.alac.ffmpeg.m4a -f wav -y NUL | ALAC compressed with ffmpeg |
2.45 ± 0.03 | times faster than | ffmpeg -i =N.wv.-x.wv -f wav -y NUL | WavPack default mode |
2.51 ± 0.03 | times faster than | ffmpeg -i =N.alac.cuetools8.m4a -f wav -y NUL | ALAC compressed with CUETools, the slower preset "8". |
2.69 ± 0.03 | times faster than | .\wvunpack.exe -qy --threads =N.wv.ffmpeg-0.wv -o NUL | WavPack by official wvunpack --threads (selecting the thread count itself). Note, no -m used. |
2.69 ± 0.03 | times faster than | .\wvunpack.exe -qy --threads =N.wv.-f.wv -o NUL | (-q for "quiet") |
2.87 ± 0.03 | times faster than | .\flac.exe -ss -d =N.flac.-5.flac -fo NUL | (-ss for "silent") |
3.04 ± 0.03 | times faster than | .\flac.exe -ss -d =N.flac.-2e.flac -fo NUL | Because block size 1152? |
3.30 ± 0.04 | times faster than | .\wvunpack.exe -qy --threads =N.wv.-x.wv -o NUL | WavPack default mode (-x does not slow down decoding), |
3.47 ± 0.17 | times faster than | ffmpeg -i =N.wv.-hx2.wv -f wav -y NUL | ffmpeg on a high mode .wv nearly catches official on a default mode |
3.52 ± 0.04 | times faster than | .\flac.exe -ss -d =N.flac.-8l32.flac -fo NUL | heaviest flac |
4.18 ± 0.14 | times faster than | .\wvunpack.exe -qy --threads =N.wv.-hx2.wv -o NUL | wvunpack takes 27 percent more time than ffmpeg |
4.59 ± 0.07 | times faster than | ffmpeg -i =N.wv.-hhx3.wv -f wav -y NUL | |
5.19 ± 0.17 | times faster than | .\takc.exe -d -overwrite -tn4 =N.tak.-p0.tak NUL | TAK. "-tn4" would turn on multithreaded encoding, but like FLAC it doesn't multithread decoding. 4x the time of ffmpeg! |
5.24 ± 0.13 | times faster than | .\wvunpack.exe -qy --threads =N.wv.-hhx3.wv -o NUL | |
5.68 ± 0.14 | times faster than | .\takc.exe -d -overwrite -tn4 =N.tak.-p2.tak NUL | |
5.76 ± 0.12 | times faster than | ffmpeg -i =N.shn -f wav -y NUL | Shorten, for completeness. ffmpeg does that faster too. |
5.95 ± 0.15 | times faster than | .\takc.exe -d -overwrite -tn4 =N.tak.-p4m.tak NUL | |
6.48 ± 0.16 | times faster than | .\wvunpack.exe -qy =N.wv.ffmpeg-0.wv -o NUL | wvunpack, single-threaded |
6.67 ± 0.08 | times faster than | .\wvunpack.exe -qy =N.wv.-f.wv -o NUL | |
6.95 ± 0.15 | times faster than | .\shorten.exe -x =N.shn NUL | |
7.39 ± 0.08 | times faster than | .\refalac -D =N.alac.refalac.m4a -o NUL | refalac spends 3.6x the time of ffmpeg |
8.28 ± 0.16 | times faster than | .\wvunpack.exe -qy =N.wv.-x.wv -o NUL | |
9.18 ± 0.10 | times faster than | .\tta.exe -d =N.tta NUL | TTA official spends 5.8x the time of ffmpeg. Either one is good or the other is bad ... or could it be the large block size? |
9.75 ± 0.10 | times faster than | ffmpeg -i =N.ape.-c1000.ape -f wav -y NUL | Monkey's is also faster with ffmpeg, but not that much |
10.55 ± 0.12 | times faster than | .\refalac -D =N.alac.ffmpeg.m4a -o NUL | |
10.75 ± 0.24 | times faster than | .\wvunpack.exe -qy =N.wv.-hx2.wv -o NUL | |
11.09 ± 0.12 | times faster than | .\refalac -D =N.alac.cuetools8.m4a -o NUL | refalac spends 5.3x the time of ffmpeg on this heavier file. |
13.17 ± 0.30 | times faster than | .\MAC.exe =N.ape.-c1000.ape NUL -d | |
13.68 ± 0.16 | times faster than | .\wvunpack.exe -qy =N.wv.-hhx3.wv -o NUL | hh-eaviest WavPack file. 2.6x the multithreaded time. 3x ffmpeg time. |
14.72 ± 0.16 | times faster than | ffmpeg -i =N.ape.-c3000.ape -f wav -y NUL | |
20.32 ± 0.41 | times faster than | .\MAC.exe =N.ape.-c3000.ape NUL -d | |
23.38 ± 0.51 | times faster than | ffmpeg -i =N.mlp.mka -f wav -y NUL | MLP. I was curious. |
45.37 ± 0.63 | times faster than | .\MAC.exe =N.ape.-c5000.ape NUL -d | |
54.04 ± 0.57 | times faster than | ffmpeg -i =N.ape.-c5000.ape -f wav -y NUL | The only one where ffmpeg ran slower than the official. Yes I re-ran them. |
It seems ffmpeg does that thing pretty universally, but not too well on Monkey's.
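On the wish for a median-based summary: hyperfine can write its raw measurements to a file with --export-json, and as far as I know each entry in the "results" list carries the command string and the individual run times. A minimal sketch under that assumption, which rebuilds the "times faster than" column from medians instead of means (the function names and the demo timings are made up for illustration):

```python
import json
import statistics


def median_summary(results):
    """Return (command, ratio) pairs, fastest first, based on median run time.

    `results` is assumed to be the "results" list of a hyperfine JSON export,
    where every entry has a "command" string and a "times" list in seconds.
    """
    medians = [(r["command"], statistics.median(r["times"])) for r in results]
    medians.sort(key=lambda pair: pair[1])
    fastest = medians[0][1]
    return [(command, median / fastest) for command, median in medians]


def load_results(path):
    """Read the results list from a hyperfine JSON export file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["results"]


if __name__ == "__main__":
    # Made-up timings, only to show the shape of input and output.
    # The 9.0 s outlier would inflate a mean but leaves the median alone.
    demo = [
        {"command": "decoder-a", "times": [2.0, 2.1, 9.0]},
        {"command": "decoder-b", "times": [1.0, 1.1, 1.0]},
    ]
    for command, ratio in median_summary(demo):
        print(f"{ratio:6.2f} times the fastest | {command}")
```

With a real export one would call median_summary(load_results("results.json")), where the file name is whatever was passed to --export-json.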
At this speed I might wonder whether there are significant differences due to whether/how the decoders ensure that the file is properly closed - even if it is null output. Speculations, but wavpack the encoder does close and reopen upon verification ...?
Also tested: MPEG-4 ALS. ffmpeg crashed consistently on this file; of course that made for the "fastest" run and made all the other figures wrong. Instead of correcting them, I discarded the lot and did another overnight run.
Extra time to write out.wav on same SSD compared to NUL
* zero-ish: wvunpack
* 0.07 to 0.11: ffmpeg (unreliably measured on .ape)
* 0.36 to 0.55: flac.exe (Xiph and Wombat, -2e was worst) and tta.exe
* 0.7 for refalac
takc.exe writes NUL.wav which takes 1.0 (1.2 seconds) more than just test decode - how much of that is for actual file and how much is for null output, I don't know. But it leads to this:
All official decoders can do test decode - verify by decoding. Extra time for them to do -o NUL compared to verify by decoding:
* 0.06 to 0.10 for flac.exe
* 0.5 ± a little, for wvunpack --threads, and 0.85 ± a little for single-threaded wvunpack. Is this the penalty for checking that the file is properly closed? I think WavPack goes to greater lengths to do that.
* (unreliably measured on ape ... at those speeds it doesn't matter much. If you don't want to wait, use the official GUI that can spawn a thread per file.)
And more:
* wvunpack --threads=<1 through 8>. One number posted in the table.
* Did .wv files encoded with --threads take more or less time to decode? No, all within the variations. ± 0.07
* How did Wombat's most recent flac build (https://hydrogenaud.io/index.php/topic,123176.msg1041251.html#msg1041251) do? .42 to .49 slower.
Timing for ffmpeg -i =N.wav -f wav -y NUL: around 0.4 seconds. This is the only "seconds" here.
Except the latter 0.4 seconds: numbers are differences in the "times faster than", so add twenty percent to get it in seconds.
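For anyone who wants seconds rather than ratios, the conversion is just a multiplication by the fastest run time (1.199 seconds, as quoted above). A small sketch; the function names are mine, the figures are taken from the table and the lists above:

```python
FASTEST_SECONDS = 1.199  # fastest run in the table, as quoted in the post


def to_seconds(ratio, fastest=FASTEST_SECONDS):
    """Convert a 'times faster than' figure (or a difference of two) to seconds."""
    return ratio * fastest


def relative(slower_ratio, faster_ratio):
    """How many times longer one table row took than another."""
    return slower_ratio / faster_ratio


if __name__ == "__main__":
    print(round(to_seconds(54.04), 1))     # slowest row (ffmpeg on -c5000 .ape): 64.8
    print(round(to_seconds(0.7), 2))       # refalac's extra time writing a file: 0.84
    print(round(relative(9.18, 1.59), 1))  # tta.exe vs ffmpeg on the same .tta: 5.8
```

The 0.84 is the "add twenty percent" rule of thumb applied to the 0.7 quoted for refalac.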
Finally, file sizes. WAV is 773972684, and the following are compression ratios - the content of old jazz/soul makes for smaller files:
45.8% =N.ofr.--presetmax.ofr
46.3% =N.ofr.--preset7.ofr
47.0% =N.ofr.--preset2.ofr
47.4% =N.tak.-p4m.tak
47.9% =N.ofr.--preset0.ofr
48.1% =N.tak.-p2.tak
49.1% =N.wv--threads.-hhx3.wv
49.1% =N.wv.-hhx3.wv
49.3% =N.wv.-hx2.wv
49.4% =N.wv--threads.-hx2.wv
49.5% =N.flac.-8l32.flac
49.6% =N.als
49.6% =N.als.m4a
49.8% =N.tta
50.0% =N.tak.-p0.tak
50.1% =N.flac.-5.flac
50.1% =N.wv.-x.wv
50.1% =N.wv--threads.-x.wv
50.4% =N.alac.cuetools8.m4a
51.1% =N.alac.refalac.m4a
51.6% =N.alac.ffmpeg.m4a
51.9% =N.wv--threads.-f.wv
51.9% =N.wv.-f.wv
53.0% =N.flac.-2e.flac
58.4% =N.wv.ffmpeg-0.wv
59.9% =N.shn
60.7% =N.flac.-0r0b4096--no-md5.flac
70.9% =N.mlp.mka
Soon to be posted: fast-verification times.
TTA have fixed number of encoded samples for each packet, except last packet in file. There is no ways to do any optimizations here except bruteforce threading.
FFmpeg still does so much better than tta.exe that it adds to the suspicion that the reference implementation isn't very good.
I don't speak code, but the following also indicate that reference tta.exe isn't particularly stellar:
* FFmpeg-tta does things that tta.exe cannot - like detect errors.
* Official foobar2000 component errs out on certain files (I think it is 8-bits, fixed in case's component).
* I have not tested this rewrite, but it claims speedups: https://hydrogenaud.io/index.php/topic,125048.0.html
* tta.exe is picky about WAVE version, and thinks that WAVE sample count is signed integer.
Impressive improvements!
Are FFMPEG lossy encoders also multithreaded? It could be also very interesting for video tools (handbrake…).
Thanks for this table Porcus, it's very interesting :)
Note, I have tested DEcoding here, not ENcoding. What has happened in ffmpeg 7.0 on the encoding ... quoting from https://ffmpeg.org/#cli_threading
Thanks to a major refactoring of the ffmpeg command-line tool, all the major components of the transcoding pipeline (demuxers, decoders, filters, encodes, muxers) now run in parallel. This should improve throughput and CPU utilization, decrease latency, and open the way to other exciting new features.
Note that you should not expect significant performance improvements in cases where almost all computational time is spent in a single component (typically video encoding).
Note, I have tested DEcoding here, not ENcoding.
Ah yes, it's mentioned in the title :-*
Fast-verification times.
WavPack (from format 5, decoder 5.40), Monkey's (CLI from ... year twentytwenty-something) and OptimFROG can verify a file without carrying out the decoding - especially good on the latter two, that incur some CPU load doing decoding. Of course, no decoding does not verify that the audio is what it is supposed to be, but block-level checksums should protect against bit-flips and general corruption.
Other formats, like FLAC, do have block-level checksums and could do the same, but with no application supporting it.
Whether it would add much value for FLAC, which decodes fast and whose users are so accustomed to audio MD5 being included that a vendor supplying FLAC downloads without MD5 gets the evil eye, is up to opinion; but at least here is a take on the differences in speed.
Same single file as above. Take note that the fastest of these, WavPack in high mode (fewer blocks?) ran in
0.239 seconds <--- 18358x realtime!
.\wvunpack.exe -q -vv =N.wv.-hhx3.wv ran
1.00 ± 0.02 times faster than .\wvunpack.exe -q -vv =N.wv.-hx2.wv
1.22 ± 0.02 times faster than .\wvunpack.exe -q -vv =N.wv.-x.wv
1.23 ± 0.04 times faster than .\wvunpack.exe -q -vv =N.wv.-f.wv
3.02 ± 0.05 times faster than .\MAC.exe =N.ape.-c5000.ape -v
3.06 ± 0.05 times faster than .\MAC.exe =N.ape.-c3000.ape -v
3.13 ± 0.05 times faster than .\MAC.exe =N.ape.-c1000.ape -v
3.47 ± 0.06 times faster than .\ofr.exe --verify =N.ofr.--presetmax.ofr
3.54 ± 0.06 times faster than .\ofr.exe --verify =N.ofr.--preset7.ofr
3.59 ± 0.07 times faster than .\ofr.exe --verify =N.ofr.--preset2.ofr
3.66 ± 0.06 times faster than .\ofr.exe --verify =N.ofr.--preset0.ofr
10.20 ± 0.17 times faster than .\flac.exe -ss -t =N.flac.-0r0b4096--no-md5.flac
12.37 ± 0.23 times faster than .\flac-wombat.exe -ss -t =N.flac.-0r0b4096--no-md5.flac
14.36 ± 0.24 times faster than .\flac.exe -ss -t =N.flac.-5.flac
15.09 ± 0.29 times faster than .\flac.exe -ss -t =N.flac.-2e.flac
16.35 ± 0.28 times faster than .\flac-wombat.exe -ss -t =N.flac.-5.flac
17.00 ± 0.27 times faster than .\flac-wombat.exe -ss -t =N.flac.-2e.flac
17.62 ± 0.28 times faster than .\flac.exe -ss -t =N.flac.-8l32.flac
19.69 ± 0.33 times faster than .\flac-wombat.exe -ss -t =N.flac.-8l32.flac
20.55 ± 0.36 times faster than .\takc.exe -t =N.tak.-p0.tak
23.21 ± 0.42 times faster than .\takc.exe -t =N.tak.-p2.tak
24.26 ± 0.42 times faster than .\takc.exe -t =N.tak.-p4m.tak
28.61 ± 0.48 times faster than .\wvunpack.exe -q -vv =N.wv.ffmpeg-0.wv
No "fast" verification in the latter, which is a WavPack version 4 file - that is what ffmpeg creates. Included as a "(s)low anchor".
-q for quiet, -ss for silent, I am not sure if it matters since hyperfine does not display a console, but ... habits, habits. "flac-wombat.exe": renamed the exe of the latest build (link in original post).
hyperfine command in the bat file, the pings take a second each and are for pause in between:
hyperfine.exe -i --style full -r 11 -w 1 --prepare "(for /l %%t IN (1,1,8) DO ping 127.0.0.1 )" <and the command list>
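For those who would rather drive hyperfine from a script than from a bat file, here is a rough Python equivalent that only assembles the argument list. The flags are the ones from the command above plus --export-json; the default pause command (a single ping with a count) is my substitute for the FOR loop and an assumption, as are the function name and the output file name.

```python
def build_hyperfine_argv(commands, runs=11, warmup=1,
                         pause="ping -n 8 127.0.0.1",
                         export_json="results.json"):
    """Assemble a hyperfine command line as an argument list.

    `pause` is run before every timed run to let the CPU settle; the default
    here is a stand-in for the FOR/ping loop used in the bat file.
    """
    argv = [
        "hyperfine",
        "-i",                        # keep going when a command exits non-zero
        "--style", "full",
        "-r", str(runs),             # timed runs per command
        "-w", str(warmup),           # warmup runs per command
        "--prepare", pause,          # executed before each timed run
        "--export-json", export_json,
    ]
    argv.extend(commands)
    return argv


if __name__ == "__main__":
    argv = build_hyperfine_argv([
        "ffmpeg -i =N.flac.-5.flac -f wav -y NUL",
        r".\flac.exe -ss -d =N.flac.-5.flac -fo NUL",
    ])
    print(argv)
    # To actually run it on a machine with hyperfine and the test files:
    # import subprocess; subprocess.run(argv, check=True)
```

A word of caution on -i: because failures are ignored, a decoder that crashes or rejects the file immediately shows up as a very fast run, so the decoders' own output is worth checking separately.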
Summarizing:
* WavPack (fastest) verifies around 3x as quickly as Monkey's and OptimFROG. WavPack's block-level checksum is evidently fast.
* Still the slowest frog verifies 73 minutes CDDA in less than a second ...
* ... which in turn is 4x to 5x the speed of FLAC, at least if your flac files have MD5 as they reasonably should.
* The TAK-to-FLAC ratios are what you would expect from decoding, because decoding is what they do. Same goes for that old WavPack format.
Also tested:
* On USB3-connected spinning drive: tested the fastest .\wvunpack.exe -q -vv =N.wv.-hhx3.wv , at like 10 percent time penalty. Also a cursory test on Monkey's confirms that I/O doesn't do that much here.
* Multithreading the fastest wvunpack, that is .\wvunpack.exe --threads -q -vv =N.wv.-hhx3.wv . Somewhat surprisingly, that incurred an additional nine percent-ish penalty on the USB3 spinning drive, but saved nine percent-ish on the SSD.
More discussion on error detection capabilities and robustness at https://hydrogenaud.io/index.php/topic,122094 . Note that the reference FLAC decoder has in the meantime been changed to mute corrupted blocks (so output has the right length) rather than to drop them.
@Porcus
Thanks for your always thorough tests! Interestingly your results differ from mine somewhat (e.g., slower WavPack) and I'm not sure exactly what's going on, but I'll post them here in a table for reference. My technique is not nearly as automated nor exhaustive as yours, but I did run the tests enough times to convince myself that I was getting reasonably accurate results. I tested on FFmpeg 7.0, WavPack 5.7.0, and one of the most recent FLAC builds on a double-album CDDA file (2h18m) encoded to WavPack and FLAC (w/ and w/o MD5) at modes suited for fast decoding.
Your system has 8 threads and mine 12, but I see the same relative speeds on my other Intel 8-thread machine and my 16-core AMD (but I don't test on those because neither are Windows).
One of the limits of WavPack multithreading in its current form is that it can't keep all physical threads continuously busy because it only runs worker threads during the actual client call into libwavpack. So each call splits the work into the requested number of threads and then waits until the last one finishes before returning to the caller. This might be why adding additional threads beyond those physically available continues to improve performance in sort of a linear way.
Also, using just --threads is the equivalent (for now) to --threads=5. There is no determination based on available threads or anything like that, although that could obviously be added at some future date. That value (5) is the point where the trade-off between CPU work and speed starts to significantly deteriorate. In other words, --threads=12 will almost always be faster than the default (unless the CPU starts throttling down), but will use significantly more total CPU time/power due to context switching.
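To illustrate the "split, then wait for the last worker" behaviour described above, here is a toy model in Python. It is not how libwavpack schedules anything: the round-robin assignment of blocks to threads and the block costs are assumptions made up for the example. It only shows why per-call joining leaves threads idle compared with an ideal pipeline that keeps every thread busy.

```python
def fork_join_wall_time(calls, n_threads):
    """Wall time when every call waits for its slowest worker.

    `calls` is a list of calls; each call is a list of per-block costs.
    Blocks are dealt round-robin to the workers (an assumption), and the
    call only returns once the most loaded worker is done.
    """
    total = 0.0
    for blocks in calls:
        loads = [0.0] * n_threads
        for index, cost in enumerate(blocks):
            loads[index % n_threads] += cost
        total += max(loads)
    return total


def ideal_wall_time(calls, n_threads):
    """Lower bound: all work spread perfectly evenly, no waiting at call borders."""
    return sum(sum(blocks) for blocks in calls) / n_threads


if __name__ == "__main__":
    # Two calls of four blocks each, with one expensive block per call.
    calls = [[1, 1, 1, 3], [2, 1, 1, 1]]
    print(fork_join_wall_time(calls, 4))  # 5.0
    print(ideal_wall_time(calls, 4))      # 2.75
```

In this made-up case three of the four workers sit idle for part of every call, which is the kind of gap the paragraph above attributes to running worker threads only during the client call.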
Multithreaded Decoding Test
- File details: duration is 2:18:54.44, 16-bit, 44.1-kHz, stereo
- System: Win 10, Intel i7-10710U, 6-core, 12-thread
- FLAC encoding: default, file size = 811883811 bytes (55.22%)
- WavPack encoding: -fx6, file size = 814957694 bytes (55.43%)
- FFmpeg command: ffmpeg [-threads 1] -i <file> -f wav -y NUL
- wvunpack command: wvunpack [--threads=N] <file> -z0qyo NUL
- FLAC command: flac -ss -d <file> -fo NUL
Format | Program | Options | Time | Comment |
flac | FFmpeg | | 2.10 sec | 3968 xRT (3.5 x single-threaded) |
WavPack | wvunpack | --threads=12 | 2.94 sec | 2835 xRT (5.4 x single-threaded) |
WavPack | wvunpack | --threads=8 | 3.62 sec | 2302 xRT (4.4 x single-threaded) |
WavPack | FFmpeg | | 4.14 sec | 2013 xRT (5.4 x single-threaded) |
WavPack | wvunpack | --threads | 4.91 sec | 1697 xRT (3.3 x single-threaded) |
flac-no-md5 | flac | | 6.24 sec | 1336 xRT |
flac | FFmpeg | -threads 1 | 7.28 sec | 1145 xRT |
flac-md5 | flac | | 8.68 sec | 960 xRT |
WavPack | wvunpack | | 16.01 sec | 521 xRT |
WavPack | FFmpeg | -threads 1 | 22.53 sec | 370 xRT |
Final notes:
- I added -z0 to the wvunpack commands to avoid updating the console window title (helps a little)
- I did not discover -hide_banner -loglevel error for FFmpeg until after these tests, so that disadvantages it a little
- It is interesting that FFmpeg manages to implement decent multi-threaded FLAC decoding despite the frame length not being present in the header. How does it do that?
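For reference, the xRT and "x single-threaded" columns in the table follow directly from the file duration and the wall times. A short Python sketch of that arithmetic (function names are mine; the durations and times are the ones listed above, so results agree with the table only to within the rounding of the displayed times):

```python
def parse_duration(text):
    """Convert an 'h:mm:ss.ss' string to seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def x_realtime(duration_s, wall_s):
    """Decoding speed as a multiple of playback speed."""
    return duration_s / wall_s


def speedup(single_threaded_s, multi_threaded_s):
    """How many times faster the multithreaded run was."""
    return single_threaded_s / multi_threaded_s


DURATION = parse_duration("2:18:54.44")  # 8334.44 seconds

if __name__ == "__main__":
    # wvunpack --threads=12 (2.94 s) against single-threaded wvunpack (16.01 s)
    print(round(x_realtime(DURATION, 2.94)))  # 2835
    print(round(speedup(16.01, 2.94), 1))     # 5.4
```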
Quick Verify
As for the significantly faster performance of the WavPack quick-verify mode, your guess is probably right that it's because the checksum I use is very fast. It's far simpler than an MD5 or even a CRC, but it's not quite as simple (or as weak cryptographically) as a true checksum (there's an additional shift and add each byte). However, there is absolutely no support of multithreading with the quick verify, so those differences you show are suspect.
It is interesting that FFmpeg manages to implement decent multi-threaded FLAC decoding despite the frame length not being present in the header. How does it do that?
ffmpeg has strictly separated decoding and demuxing. So for FLAC it looks for sync codes and does some short integrity checks as part of the demuxing. When decoding FLAC in ffmpeg, you'll see warnings every now and then because of that, when it stumbles upon something it thinks is a frame, but isn't. This has been the case for many years already, because of this strict separation.
Of course, with this mechanism in place, multithreading decoding is rather trivial.
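To make the "looks for sync codes" part concrete, here is a minimal Python sketch of the first step only. A FLAC frame header begins with the 14-bit sync code 11111111111110 followed by a reserved zero bit and the blocking-strategy bit, so a frame can only start at a byte pair 0xFF 0xF8 or 0xFF 0xF9. This is not ffmpeg's code; the function name is made up, and a real demuxer would go on to sanity-check the rest of the header (including its CRC-8) before accepting a candidate.

```python
def candidate_frame_starts(data):
    """Return byte offsets where a FLAC frame header could begin.

    Only the two-byte sync pattern is tested, so audio data that happens
    to contain 0xFF 0xF8 or 0xFF 0xF9 is reported as well.
    """
    offsets = []
    pos = data.find(b"\xff")
    while pos != -1 and pos + 1 < len(data):
        if data[pos + 1] in (0xF8, 0xF9):
            offsets.append(pos)
        pos = data.find(b"\xff", pos + 1)
    return offsets


if __name__ == "__main__":
    # Fabricated bytes, not a real FLAC stream: two sync-like pairs and one near miss.
    blob = b"\x00\xff\xf8abc\xff\xf9\xff\xf0"
    print(candidate_frame_starts(blob))  # [1, 6]
```

Every false candidate has to be weeded out by the later checks, which is where the occasional warnings come from.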
Hm, definitely some confusion on me, as usual:
* I also found out that not only TAK, but also wvunpack filename.wv -yo NUL writes to NUL.wav, and that seemingly takes more time than stdout redirected to NUL: wvunpack filename.wv -yo - > NUL
* Weird about that fast-verification --threads, the numbers looked consistent enough to conclude, and I didn't think it would tax the CPU that much. Seven seconds in between a quarter of a second work?!
(Does Windows keep the executable in memory or something?)
Of course, with this mechanism in place, multithreading decoding is rather trivial.
So ... the obvious question is, any reason why not?
The odd event that "a valid frame header" shows up just by random in the data (the FLAC specification doesn't forbid junk between frames, as long as it is byte-aligned, and in any case parsing must take into account that a stream may be broken ...). Or even worse and more odd, an entire "valid frame" starting inside another?
Because of the following:
you'll see warnings every now and then
It makes decoding much more complicated, less predictable and less stable. For ffmpeg it was necessary to fit its model in which decoding and demuxing is completely separated.
In other words, MP4/Matroska/Ogg/CAF is actually better for ffmpeg than the original FLAC container format?
Among these, the native FLAC container is the only one for which fb2k cannot do real-time bitrate display.
In other words, MP4/Matroska/Ogg/CAF is actually better for ffmpeg than the original FLAC container format?
Yes, the inability to reliably skip ahead 1 frame without having to decode it is sometimes a disadvantage. For multithreading, being able to skip ahead is very valuable. However, relying solely on frame lengths is much less robust, and relying on both adds overhead of course. Maybe FLAC's design was a bit too much focussed on reducing overhead.
EDIT 29. April: Big user errors, some numbers included encoding (see replies #23 and 28) and the highest speeds were ffmpeg rejecting file rather than decoding it. Porcus facepalms and thanks mod for help committing edit - and @ktf for reacting to the numbers.
I tested FLAC in containers. Not CAF, I forgot about that one. With and without multithreading ffmpeg. This time I tried a shorter file - half an hour - because there were so many to run through.
With quite extreme settings, including blocksize 16 - that malice paid off ...
Turns out ffmpeg refused to remux the uncompressed flac streams into any of the three containers I tried.
Container overhead
* flac -5 is a sane setting, and the biggest overhead for that one was 0.44 percent (not percentage points) for OGG container
* Blocksize 16 is just nuts, but for what the file sizes are worth - .wav in the middle. No padding:
323 001 659 ¨3x.flac-8b16.flac
328 733 400 ¨3x.flac-8b16.flac.oga
331 702 604 ¨3x.wav
343 738 725 ¨3x.flac-8b16.flac.mp4
354 113 911 ¨3x.flac-8b16.flac.mka
9.6 percent penalty for putting it in Matroska. I used ffmpeg, with commands like
ffmpeg -i ¨3x.flac-8b16.flac -acodec copy -vn -sn ¨3x.flac-8b16.flac.mka
For sorting I moved the ".oga" etc. to a separate column. ¨3x.flac-5.flac <tab> .oga means the file is an Ogg-containered ¨3x.flac-5.flac.oga . (The reason for the "¨" is to make sure the test audio files had a character nothing else had.)
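The container overhead for the blocksize-16 file can be recomputed from the byte counts listed above. A small Python sketch of that arithmetic (the dictionary and function name are mine; the sizes are copied from the listing):

```python
SIZES = {
    ".flac": 323_001_659,  # native FLAC, -8b16
    ".oga": 328_733_400,
    ".wav": 331_702_604,   # uncompressed source, for comparison
    ".mp4": 343_738_725,
    ".mka": 354_113_911,
}


def overhead_percent(size, baseline):
    """Size increase over the baseline, in percent (not percentage points)."""
    return (size / baseline - 1.0) * 100.0


if __name__ == "__main__":
    for container in (".oga", ".mp4", ".mka"):
        pct = overhead_percent(SIZES[container], SIZES[".flac"])
        print(f"{container}: {pct:.1f} percent larger than native FLAC")
```

This gives 1.8 percent for Ogg, 6.4 for MP4 and 9.6 for Matroska, the last being the penalty quoted above.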
Threads | decoder | settings on encoding | container | speed x realtime | comment (in parentheses: edit April 29th thanks to mod) |
1 | flac.exe | ¨3x.flac-0b65535--no-md5--uncompressed.flac | | 500 | (number included encoding) |
1 | ffmpeg | ¨3x.flac-0b65535--no-md5--uncompressed.flac | | 8791 | (ffmpeg failed to decode this) |
7 | ffmpeg | ¨3x.flac-0b65535--no-md5--uncompressed.flac | | 8685 | (ffmpeg failed to decode this) |
| | | | | |
1 | flac.exe | ¨3x.flac-0b65535--no-md5.flac | | 527 | (number included encoding) |
1 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | | 1474 | about same for containers |
7 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | | 3544 | slower than containers |
7 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | .oga | 4919 | |
7 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | .mp4 | 6013 | mp4 very fast |
7 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | .mka | 5932 | |
| | | | | |
1 | flac.exe | ¨3x.flac-0r0--no-md5.flac | | 518 | (number included encoding) |
1 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | | 1049 | about same for containers |
7 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | | 1869 | containers are only slightly faster. |
7 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | .oga | 1879 | |
7 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | .mp4 | 1918 | |
7 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | .mka | 1924 | Not that much faster |
| | | | | |
1 | flac.exe | ¨3x.flac-5.flac | | 518 | (number included encoding) |
1 | ffmpeg | ¨3x.flac-5.flac | | 966 | about same for containers |
7 | ffmpeg | ¨3x.flac-5.flac | | 2981 | |
7 | ffmpeg | ¨3x.flac-5.flac | .oga | 3600 | noticeably faster in all containers |
7 | ffmpeg | ¨3x.flac-5.flac | .mp4 | 3827 | |
7 | ffmpeg | ¨3x.flac-5.flac | .mka | 3854 | |
| | | | | |
1 | flac.exe | ¨3x.flac-8b16.flac | | 247 | (number included encoding but still took way less time than ffmpeg decoding) |
1 | ffmpeg | ¨3x.flac-8b16.flac | | 80 | about as slow for containers |
7 | ffmpeg | ¨3x.flac-8b16.flac | | 31 | Even slower! And about as slow for containers |
| | | | | |
1 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | | 669 | about the same for containers. Forgot to run flac.exe on this one. |
7 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | | 2493 | |
7 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | .oga | 2599 | |
7 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | .mp4 | 2631 | |
7 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | .mka | 2642 |
I am not sure how ffmpeg -threads 1 works; should I use "0" to get single-threaded? Because it does decode much quicker than reference flac. I also ran ffmpeg decoding without the -threads option, which uses all 8, and that improved the flac-in-other-containers slightly (but harmed wavpack slightly, I leave that for a separate post).
So the table does not list speed for ffmpeg without -threads, nor for the following:
* the same entire thing run on USB3-connected spinning drive. Differences were just very minor. These figures are on internal SSD.
* ogg/mp4/mkv decoded with ffmpeg -threads 1, those were pretty much the same as .flac speeds
* same for the -8b16 in containers, those were just as horrible as .flac
Yes blocksize 16 decodes slow, but ffmpeg just does it terribly.
(Edit April 29: codebox also with misleading number deleted)
MOD note: The above post was edited by request of the OP.
I already mentioned that current ffmpeg cli utility is useless for extremely small packets. FFmpeg developer that rewrote ffmpeg.c related code did not care and still does not care about this bug. So for small packets use lib calls directly instead of brain-dead ffmpeg.c implementation.
BLUNDER on me and on ffmpeg.
ffmpeg errs out on the ¨3x.flac-0b65535--no-md5--uncompressed.flac also when decoding. Of course I should have checked that when it refuses to demux.
It is not about it using the only-verbatim-subframes flac - likely it is about frames being too big.
The attached 1.3 second flac file - good old Merzbow at it again - has 57330 samples and is created with
-0r0 --no-padding -fb57330 --lax
So one frame, both subframes are FIXED, order 1.
ffmpeg cannot decode it. Recompress it with smaller block size, and it will - 57300 is still too large though.
Edit: reuploaded without artwork, that is not to blame - and it seems the "r0" is superfluous.
Fiddling around with files I found out that padding-or-not could even influence the max block size. I got a file where 53207 with default padding is OK, 53208 with default padding is not, 53208 with --no-padding is OK.
@ktf, of course there is nothing wrong with the file? The blame is squarely on ffmpeg?
It makes decoding much more complicated, less predictable and less stable.
You might have had a point ...
The FLAC format is brain-dead for stream-oriented usage. It's the same as the historic Shorten format.
Suggesting that someone at ffmpeg did only consider the streamable subset ... ?
Good idea to test that, then. ffmpeg fails the file uploaded at https://hydrogenaud.io/index.php/topic,125848 [edit: botched attachment]
6ch, 96/24. Generated by:
sox -b 24 -c 6 -r 96000 -n whitenoise.wav synth 1 whitenoise
flac whitenoise.wav --channel-map=none -b16384
To the extent TTA is interesting at all, the lack of flexibility suggests that differences can be put down to code quality and not to "prioritized this type of encoding strategy" and the like. And:
ffmpeg -threads 1 DEcodes much faster than the reference, which takes 42 percent more time. Tested the half-hour long file, mean of repeated runs:
533x realtime: DEcoding by ffmpeg -threads 1
375x realtime: DEcoding by tta.exe
ffmpeg also ENcodes faster, but there the difference is small. 451x realtime vs 423x realtime.
Everything to NUL.
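The "42 percent more time" follows from the two x-realtime figures, since time is inversely proportional to speed. A tiny Python check (the function name is mine):

```python
def extra_time_percent(fast_xrt, slow_xrt):
    """Percent more wall time the slower tool needs, given x-realtime speeds."""
    return (fast_xrt / slow_xrt - 1.0) * 100.0


if __name__ == "__main__":
    print(round(extra_time_percent(533, 375)))     # decoding, tta.exe vs ffmpeg: 42
    print(round(extra_time_percent(451, 423), 1))  # encoding, tta.exe vs ffmpeg: 6.6
```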
TTA container needs buffering all packets in memory when encoding.
Meaning, there is not much to do to optimize encoding - but decoding then, is that due to more efficient WAVE writing? Or am I just guessing wrong from what you write?
Ever heard of doing actual benchmark via perf or any other professional and advanced tool?
It will show where most CPU time is wasted in binary when executing some operations.
Heard of yes, done no - so what was the output?
Since you claimed for TTA that "There is no ways to do any optimizations here except bruteforce threading", then well, ... timing suggests otherwise, and to the extent that I don't think you need a more fine-tuned setup.
Nor do I think that would have tipped over as big differences as the following:
* WavPack verifies three times as fast as Monkey's - and 10x as fast as reference FLAC (because it doesn't do fast-verify) and 28.6x as fast as ffmpeg-generated WavPack 4 files (since those files don't offer the option)
* Different FLAC builds differ in encoding time by a factor of 2.5 on -5, and even more at -8: https://hydrogenaud.io/index.php/topic,123025.msg1029768.html#msg1029768 . Sure one could be interested in an explanation, but you don't need that level of detail to point out that there are big differences.
* ffmpeg -threads 1 decodes nearly twice as fast as reference flac at -5, but several times slower at low block sizes
* ffmpeg outright rejects FLAC files instead of decoding. Heck I even got it to reject subset FLAC. And when it has encoded .wv files it cannot decode itself, then seriously: When flaws are like that, who would whine over the timing tool?!
@ktf, of course there is nothing wrong with the file? The blame is squarely on ffmpeg?
I see nothing wrong with the file. Maybe the problem is that it consists of a single block?
* ffmpeg -threads 1 decodes nearly twice as fast as reference flac at -5, but several times slower at low block sizes
Of course I dove into that, because if FLAC can be twice as fast, that would be great! But I cannot reproduce, not on Linux nor Windows, not on SSD nor ramdisk. Can you check your results and see whether a different way of collecting times gives you the same result?
I'm obviously talking to walls. Bye bye!
* ffmpeg -threads 1 decodes nearly twice as fast as reference flac at -5, but several times slower at low block sizes
Of course I dove into that, because if FLAC can be twice as fast, that would be great! But I cannot reproduce, not on Linux nor Windows, not on SSD nor ramdisk. Can you check your results and see whether a different way of collecting times gives you the same result?
My results (see reply #7 above) show FFmpeg -threads 1 decoding right in between FLAC with md5 and FLAC without md5. As Porcus mentioned, FFmpeg doesn't seem to pay attention to whether the FLAC file has an md5, but the question is whether it always calculates the sum (in which case it's faster than FLAC) or never does (in which case it's slower). My guess would be that it doesn't.
Single-threaded FFmpeg being slower than native WavPack could be explained by its lacking the ASM optimizations, which really don't make sense there for such a niche format (from a maintenance point of view).
WavPack again. Only .wv in this post. wvunpack and ffmpeg
-z0q and time. Does console output slow it down?
TL;DR: -vv is so fast that one can argue that yes it does.
An initial test wasn't dramatic: hyperfine on single untagged -f-encoded file indicated a cost of 0.05 seconds on decoding to NUL and 0.02 seconds for fast verify.
But, then. Since there is only wvunpack against itself to test on this matter, and it supports wildcards, I took two actual albums, with tags and all, in separate tracks, on a spinning drive: one full CD, 75 minutes of Bach's organ works, 652 kbit/s, 28 tracks/files; and Black Sabbath s/t, 38:14, a CD split in 7 tracks/files, 903 kbit/s. Both using -x6.
Fast verification: -qz0 is significantly faster in percentage terms. 0.1 seconds isn't much, but if one has a full TB of CDs (like 3000 of them), it will add up to a few minutes. 25 runs with two seconds sleep in between:
wvunpack *.wv -vv -qz0: 304 ms on Bach and 187 ms on Black Sabbath
wvunpack *.wv -vv : increases by 103 ms (36 percent) resp. 81 ms (46 percent)
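To put the "adds up to a few minutes" estimate into numbers, here is a back-of-the-envelope sketch in Python. It assumes (this is an assumption, not something measured) that each of the roughly 3000 CDs costs about the same extra time as one of the two test albums above.

```python
# Extra wall time from console/progress output during fast verify,
# taken from the two measurements above (milliseconds per album).
EXTRA_MS = {"Bach": 103, "Black Sabbath": 81}

# Assumption: a 1 TB collection is roughly 3000 CDs, and every CD costs
# about the same overhead as one of the test albums.
N_ALBUMS = 3000


def total_overhead_minutes(extra_ms_per_album: float, n_albums: int) -> float:
    """Scale a per-album overhead in milliseconds to minutes for a collection."""
    return extra_ms_per_album * n_albums / 1000.0 / 60.0


for name, extra in EXTRA_MS.items():
    minutes = total_overhead_minutes(extra, N_ALBUMS)
    print(f"{name}-like albums: about {minutes:.1f} min extra over {N_ALBUMS} CDs")
```

So somewhere around four to five minutes over the whole collection, depending on which album is more typical.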
Decoding to -o - > NUL: Since that takes more time, the percentages aren't that impressive of course. The differences in time look suspicious to be honest, but 19 runs and not too outrageous standard deviations:
wvunpack *.wv -z0qy -o - > NUL : 9.265 s ± 0.024 s resp. 4.917 s ± 0.025 s
wvunpack *.wv -yo - > NUL : 9.698 s ± 0.032 s resp. 5.039 s ± 0.013 s
Writing to the same spinning drive would bring it to 30 or 15 seconds, and that kinda sets the perspective.
But the fast decoding is now so fast that it seems that reporting the progress makes for something if one is scanning
Decoding speeds again. Tested more multithreading.
TL;DR: wvunpack --threads=<N> beats ffmpeg -threads <N> at decoding - with one exception where block size was forced to maximum.
I did the "shorter" half-hour file. Commands were like the following:
ffmpeg -threads 8 -i ¨3x.wv.-fx0.wv -hide_banner -loglevel error -f wav -y NUL
.\wvunpack --threads=8 ¨3x.wv.-fx0.wv -z0qyo -o - > NUL
the latter being the fastest. The filename contains -fx0 because those were the encoding options (it means -f; -x0 in new releases means "no x", which is handy for FOR looping). And yes, there are two "o" options because I didn't spot it, but wvunpack didn't object.
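For anyone who wants to repeat this, a minimal sketch of how such a run could be assembled. It only builds the hyperfine argument list and does not run anything. The file name input.wv, the run and warmup counts and the pause command are placeholders of mine, not the values used above; --warmup, --runs and --prepare are regular hyperfine options.

```python
# Build (but do not run) a hyperfine command line that alternates
# ffmpeg and wvunpack at the same thread count.
# Assumptions: hyperfine is on PATH, the shell understands "> NUL"
# (Windows), and pause_cmd is replaced by whatever pauses in your shell.


def decode_commands(wv_file: str, threads: int) -> list:
    """Return the two decode-to-NUL commands for one thread count."""
    ffmpeg = (f"ffmpeg -threads {threads} -i {wv_file} "
              "-hide_banner -loglevel error -f wav -y NUL")
    wvunpack = f".\\wvunpack --threads={threads} {wv_file} -z0qy -o - > NUL"
    return [ffmpeg, wvunpack]


def hyperfine_argv(wv_file, thread_counts, runs=11, warmup=1,
                   pause_cmd="sleep 2"):
    """Assemble the argv; hand it to subprocess.run() to actually benchmark."""
    argv = ["hyperfine", "--warmup", str(warmup), "--runs", str(runs),
            "--prepare", pause_cmd]
    for t in thread_counts:
        argv.extend(decode_commands(wv_file, t))
    return argv


if __name__ == "__main__":
    for arg in hyperfine_argv("input.wv", [8, 7, 4, 3, 1]):
        print(arg)
```

Keeping the two tools next to each other per thread count mirrors the alternating order used in the test.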
Times first, mean ± stdev - having sorted the output by the .wv file and then by speed, fastest at top:
-f:
1.00 .\wvunpack --threads=8
1.09 ± 0.00 .\wvunpack --threads=7
1.13 ± 0.01 ffmpeg -threads 8
1.20 ± 0.01 ffmpeg -threads 7
1.54 ± 0.02 .\wvunpack --threads=4
1.90 ± 0.01 .\wvunpack --threads=3
1.95 ± 0.05 ffmpeg -threads 4
2.35 ± 0.05 ffmpeg -threads 3
3.65 ± 0.01 .\wvunpack --threads=1
4.94 ± 0.06 ffmpeg -threads 1
-x:
1.23 ± 0.04 .\wvunpack --threads=8
1.31 ± 0.01 .\wvunpack --threads=7
1.46 ± 0.01 ffmpeg -threads 8
1.55 ± 0.01 ffmpeg -threads 7
1.86 ± 0.03 .\wvunpack --threads=4
2.30 ± 0.02 .\wvunpack --threads=3
2.60 ± 0.01 ffmpeg -threads 4
2.99 ± 0.04 ffmpeg -threads 3
4.53 ± 0.02 .\wvunpack --threads=1
6.35 ± 0.02 ffmpeg -threads 1
Ran, but omitted from this list: -x --blocksize=4096 to see if it mattered. Not much except ffmpeg -threads 1 was up to 6.93.
-hx2:
1.70 ± 0.01 .\wvunpack --threads=8
1.80 ± 0.01 .\wvunpack --threads=7
2.00 ± 0.01 ffmpeg -threads 8
2.19 ± 0.18 ffmpeg -threads 7
2.48 ± 0.02 .\wvunpack --threads=4
3.08 ± 0.02 .\wvunpack --threads=3
3.43 ± 0.03 ffmpeg -threads 4
3.91 ± 0.03 ffmpeg -threads 3
5.95 ± 0.03 .\wvunpack --threads=1
8.79 ± 0.06 ffmpeg -threads 1
-hhx3:
2.21 ± 0.08 .\wvunpack --threads=8
2.27 ± 0.02 .\wvunpack --threads=7
2.66 ± 0.03 ffmpeg -threads 8
2.82 ± 0.01 ffmpeg -threads 7
3.17 ± 0.03 .\wvunpack --threads=4
3.92 ± 0.02 .\wvunpack --threads=3
4.52 ± 0.02 ffmpeg -threads 4
5.33 ± 0.11 ffmpeg -threads 3
7.67 ± 0.02 .\wvunpack --threads=1
11.77 ± 0.05 ffmpeg -threads 1
And finally, ffmpeg wins this one with maximum blocksize (except single-threaded)
-hhx4 --blocksize=131072
2.57 ± 0.01 ffmpeg -threads 8
2.75 ± 0.02 ffmpeg -threads 7
3.12 ± 0.06 .\wvunpack --threads=8
3.40 ± 0.03 .\wvunpack --threads=7
3.94 ± 0.05 ffmpeg -threads 4
4.84 ± 0.05 ffmpeg -threads 3
4.95 ± 0.03 .\wvunpack --threads=4
5.54 ± 0.02 .\wvunpack --threads=3
7.57 ± 0.04 .\wvunpack --threads=1
11.27 ± 0.09 ffmpeg -threads 1
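The lists can also be read as thread scaling: how much faster each tool gets from 1 to 8 threads. A small Python sketch using only figures copied from above (three of the five settings, to keep it short):

```python
# Decode times copied from the lists above, keyed by thread count.
TIMES = {
    "-f": {
        "wvunpack": {1: 3.65, 8: 1.00},
        "ffmpeg": {1: 4.94, 8: 1.13},
    },
    "-hhx3": {
        "wvunpack": {1: 7.67, 8: 2.21},
        "ffmpeg": {1: 11.77, 8: 2.66},
    },
    "-hhx4 --blocksize=131072": {
        "wvunpack": {1: 7.57, 8: 3.12},
        "ffmpeg": {1: 11.27, 8: 2.57},
    },
}


def speedup(times_by_threads: dict, n: int) -> float:
    """Speedup of n threads relative to the single-threaded time."""
    return times_by_threads[1] / times_by_threads[n]


for setting, tools in TIMES.items():
    for tool, times in tools.items():
        print(f"{setting:26s} {tool:9s} 1 -> 8 threads: {speedup(times, 8):.2f}x")
```

Read this way, ffmpeg scales by more than 4x in all three cases but starts from a slower single-threaded time, while wvunpack's scaling drops to roughly 2.4x on the maximum-blocksize file.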
A couple of remarks:
* I ran ffmpeg and wvunpack alternating with the same threads count, so that none of them should be disadvantaged over the CPU having just reacted to something (throttling or whatever). To make sure, I ran it twice: once with ffmpeg then wvunpack, once with wvunpack then ffmpeg. Very small differences.
* What I didn't think of ... because I am normally juggling a few versions of encoders, I put the current wvunpack.exe in the same directory - it shouldn't make for a time advantage?! Maybe fix PATH next time.
WavPack -vv fast verification redux:
TL;DR: Put the --threads differences down on my system not being more consistent than the data below ... arguably, less consistent.
In case anyone is interested, I dump the hyperfine output in here.
Same files. Ran with and without threads (and a couple extra to check for inconsistencies, not included; one was off, but betrayed by a high stdev).
USB3 spinning drive. Fastest was 105.1 ms ± 2.8 ms
.\wvunpack ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q ran
1.01 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
1.02 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
1.08 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hhx3.wv -vv -z0q
1.08 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-hx2.wv -vv -z0q
1.09 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hhx3.wv -vv -z0q
1.10 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-hx2.wv -vv -z0q
1.10 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hx2.wv -vv -z0q
1.10 ± 0.06 times faster than .\wvunpack --threads=7 ¨3x.wv.-hhx3.wv -vv -z0q
1.21 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-gx1.wv -vv -z0q
1.21 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-gx1.wv -vv -z0q
1.22 ± 0.05 times faster than .\wvunpack ¨3x.wv.-gx1.wv -vv -z0q
1.23 ± 0.03 times faster than .\wvunpack ¨3x.wv.-fx0.wv -vv -z0q
1.24 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-fx0.wv -vv -z0q
1.26 ± 0.08 times faster than .\wvunpack --threads=7 ¨3x.wv.-fx0.wv -vv -z0q
1.58 ± 0.04 times faster than .\wvunpack --threads=7 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
1.58 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
1.59 ± 0.05 times faster than .\wvunpack ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
Internal SSD. Fastest was 107.2 ms ± 2.2 ms.
.\wvunpack --threads=7 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q ran
1.00 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
1.01 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
1.09 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-hx2.wv -vv -z0q
1.09 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hhx3.wv -vv -z0q
1.10 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hhx3.wv -vv -z0q
1.10 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hx2.wv -vv -z0q
1.10 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-hhx3.wv -vv -z0q
1.11 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hx2.wv -vv -z0q
1.21 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-fx0.wv -vv -z0q
1.22 ± 0.03 times faster than .\wvunpack ¨3x.wv.-gx1.wv -vv -z0q
1.23 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-gx1.wv -vv -z0q
1.24 ± 0.06 times faster than .\wvunpack ¨3x.wv.-fx0.wv -vv -z0q
1.25 ± 0.04 times faster than .\wvunpack --threads=7 ¨3x.wv.-fx0.wv -vv -z0q
1.26 ± 0.05 times faster than .\wvunpack --threads=7 ¨3x.wv.-gx1.wv -vv -z0q
1.55 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
1.55 ± 0.04 times faster than .\wvunpack ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
1.57 ± 0.05 times faster than .\wvunpack --threads=7 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
More:
* As for running quiet: hyperfine suppresses console output, and a bit of non-rigorous testing indicates that running quiet doesn't matter for ffmpeg when tested by hyperfine (it could matter in real-life ...).
But wvunpack runs the percent progress in the window's title bar, and hyperfine does not suppress that!
flac then.
@ktf :
* I will re-run when I am back at the same computer. The chance for human error is certainly positive - like, copy + paste suddenly overwrote something in a spreadsheet. I haven't automated this.
* The files that ffmpeg rejects are not about only being one frame - the 6ch (subset!) file I posted at https://hydrogenaud.io/index.php/topic,125848 is a full second. What happened to the attachment posted in this thread was that I traversed the possible -b's and found where the trouble started - then I cut the source down to the precise number of samples to see if "single frame" would still trigger it. Of course, I carelessly posted the last file created.
* Question: it seems that reference flac omits MD5 calculation from -t and -d if the file has no MD5 in STREAMINFO? But there is no way to force it not to otherwise?
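Whether a given file carries an MD5 at all can be checked without any decoder, since the sum sits in the last 16 bytes of the 34-byte STREAMINFO block and all zeros means "not set". A minimal Python sketch, assuming the file starts directly with the fLaC marker (no ID3v2 tag in front); the function names are mine:

```python
def streaminfo_md5(data: bytes) -> bytes:
    """Return the 16-byte MD5 field from the STREAMINFO block of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("no fLaC marker at offset 0")
    header = data[4:8]
    block_type = header[0] & 0x7F               # top bit is the last-block flag
    length = int.from_bytes(header[1:4], "big")
    if block_type != 0 or length != 34:
        raise ValueError("first metadata block is not a 34-byte STREAMINFO")
    body = data[8:8 + length]
    if len(body) < length:
        raise ValueError("truncated STREAMINFO block")
    return body[18:34]                          # MD5 follows 18 bytes of other fields


def has_md5(data: bytes) -> bool:
    """True if the stored MD5 is anything other than all zeros."""
    return streaminfo_md5(data) != bytes(16)


if __name__ == "__main__":
    import sys
    # Usage: python check_md5.py some.flac   (42 bytes are enough)
    with open(sys.argv[1], "rb") as f:
        head = f.read(42)
    print("MD5 present" if has_md5(head) else "no MD5 stored (all zeros)")
```

This only tells you what is stored in the file, not what flac or ffmpeg does with it while decoding.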
FLAC: BLUNT BLUNDER on me :-[
@korth as mod: I don't want to "shield myself against my own mistakes" here, but if you think it is OK - it being on a previous page with misleading numbers right in user's face - would you maybe please moderate in an extra first line in Reply #13 (https://hydrogenaud.io/index.php/topic,125791.msg1043525.html#msg1043525) like e.g. the following:
Mod note: Porcus facepalms and suggests to read Reply #28 (https://hydrogenaud.io/index.php/topic,125791.msg1043741.html#msg1043741) for correction
Anyway, after having tried ffmpeg -threads this and that and around, I found the mistake not in the ffmpeg command, but in the flac command. It is even in the codebox, where flac was run with options -fo NUL ... without "-d". So it spent time re-encoding to FLAC. Thank you to @ktf for spotting the anomaly.
Here are some hopefully more sane numbers, where reference FLAC (1.4.2 was used ... for dumb reasons) beats ffmpeg -threads 1.
Decoding times on SSD to NUL, the 1.060 seconds means 1774x real-time
Encoded with -0b65535 --no-md5 --lax
1.060 s ± 0.009 s flac (1.4.2)
1.278 s ± 0.008 s ffmpeg -threads 1
0.842 s ± 0.007 s ffmpeg -threads 2
0.591 s ± 0.030 s ffmpeg -threads 3
0.522 s ± 0.010 s ffmpeg -threads 4
0.506 s ± 0.012 s ffmpeg -threads 6
0.538 s ± 0.007 s ffmpeg, default threads (detects all eight)
Encoded with -0r0 --no-md5, reference FLAC single-threaded beats ffmpeg -threads 3
1.144 s ± 0.015 s flac (1.4.2)
1.799 s ± 0.005 s ffmpeg -threads 1
1.642 s ± 0.005 s ffmpeg -threads 2
1.163 s ± 0.007 s ffmpeg -threads 3
0.998 s ± 0.014 s ffmpeg -threads 4
0.981 s ± 0.015 s ffmpeg -threads 6
1.019 s ± 0.012 s ffmpeg, default
Encoded with default -5
1.552 s ± 0.011 s flac (1.4.2)
1.967 s ± 0.006 s ffmpeg -threads 1
1.290 s ± 0.012 s ffmpeg -threads 2
0.899 s ± 0.008 s ffmpeg -threads 3
0.725 s ± 0.003 s ffmpeg -threads 4
0.654 s ± 0.052 s ffmpeg -threads 6
0.619 s ± 0.006 s ffmpeg, default
Encoded with -8pl32 -r8 --lax
2.056 s ± 0.019 s flac (1.4.2)
2.818 s ± 0.006 s ffmpeg -threads 1
1.746 s ± 0.029 s ffmpeg -threads 2
1.227 s ± 0.021 s ffmpeg -threads 3
1.038 s ± 0.003 s ffmpeg -threads 4
0.784 s ± 0.004 s ffmpeg -threads 6
0.739 s ± 0.013 s ffmpeg, default
Encoded with -8b16
5.476 s ± 0.034 s flac (1.4.2)
24.328 s ± 0.186 s ffmpeg -threads 1 <------- ooh bad
61.104 s ± 0.582 s ffmpeg -threads 2 <------- and the "even worse" starts already here!
61.290 s ± 0.501 s ffmpeg -threads 3
60.371 s ± 0.438 s ffmpeg -threads 4
60.236 s ± 0.709 s ffmpeg -threads 6
61.878 s ± 0.492 s ffmpeg, default
There is not much gained above 4 threads (this is a 4-core 8-thread CPU) except the 8pl32-etc. file.
Commands given:
flac <file> -ss -dfo NUL
ffmpeg -threads <T> -i <file> -hide_banner -loglevel error -f wav -y NUL
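To translate the wall times into "x real-time" figures like the 1774x above, a small Python sketch. The audio duration is not stated directly in this post; it is derived here from the reported 1774x at 1.060 s, so treat it as an inferred value.

```python
# Reported: flac 1.4.2 decoded the -0b65535 file in 1.060 s, "1774x real-time".
REPORTED_FACTOR = 1774
FLAC_WALL_S = 1.060

# Derived, not stated in the post: the implied audio duration in seconds
# (about 1880 s, a bit over 31 minutes).
AUDIO_S = REPORTED_FACTOR * FLAC_WALL_S


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """How many seconds of audio are decoded per second of wall time."""
    return audio_seconds / wall_seconds


# A few wall times (seconds) picked from the tables above.
SAMPLES = {
    "flac, -5": 1.552,
    "ffmpeg default threads, -5": 0.619,
    "flac, -8b16": 5.476,
    "ffmpeg -threads 2, -8b16": 61.104,
}

for label, wall in SAMPLES.items():
    print(f"{label}: {realtime_factor(AUDIO_S, wall):.0f}x real-time")
```

Even the worst case, ffmpeg on the -8b16 file, still decodes at roughly 30x real-time, so this is about batch jobs rather than playback.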
In a way, this is a pity of course, it would have been great if the reference FLAC decoder could learn something from a different implementation and get (much) faster.
FLAC is pity and misery and seditious.
I already told you multiple times, but you ignore it and force your brain-dead ideas.
Latest FFmpeg cli tool will be extremely slow with small packets (small number of samples encoded per frame/packet). If you use the example tool from the ffmpeg repo, or code your own app without brain-dead ideas like the current ffmpeg mt implementation, it will be 10000% faster than the pity flac tool.
Decoding speeds again. Tested more multithreading.
TL;DR: wvunpack --threads=<N> beats ffmpeg -threads <N> at decoding - with one exception where block size was forced to maximum.
So a couple things about this. First, I guess you’re using --threads because you have 8-thread hardware. In WavPack’s case the performance continues to improve even when the requested number of threads exceeds the physical threads. I just did an experiment on my 8-thread machine and got a 10% speed improvement (but 10% more total processor time) going from --threads=7 to --threads=12. Of course with thermal throttling and other factors, actual mileage may vary.
FFmpeg does not behave this way, and seems to detect the number of physical threads and ignores specification beyond that.
Interesting about the reduced performance with extra long frames, but it has a simple explanation. To achieve the temporal multithreading, sufficiently large buffers must be provided to libwavpack, and the command-line programs calculate these based on the requested number of threads and the normal frame lengths. This is done because unfortunately there’s no API provided to determine the actual frame length (this is abstracted away from the library client, and can change from frame to frame), so the best we can do is guess. I would strongly recommend that extra long frames are not used to achieve better compression! :)
@bryant on WavPack:
Whether wvunpack --threads=<N> vs ffmpeg -threads=<N> is apples to apples ... I don't know. It might be "better" than comparing wvunpack --threads vs ffmpeg (spawning all), and all I could do was to get it down to thread count matters enough to tilt the numbers.
@ktf on FLAC:
Sure, if we could just pick up some magic and speed up stuff. Anyway, I made a mess out of it: the first time I was not sure whether ffmpeg -threads 1 meant "single thread" or "one worker in addition to the bookkeeping/parsing", so that is part of my lame excuses.
But:
I got something that you could maybe speed up for: it seems 1.3.x does small blocks faster than 1.4.x. Since I got this wrong AND didn't stay completely consistent on which flac executable I used, I ran it together with a test to see when ffmpeg stops making a fool of itself (it was at block sizes below reference's -0/-1/-2, that is good):
(https://i.imgur.com/zaCPpiy.png)
What I did: explained over in the FLAC test thread: https://hydrogenaud.io/index.php/topic,123025.msg1044103.html#msg1044103 With more diagrams, including 1.2.1. And encoding.
Anyway, ffmpeg starts behaving "more normal" at "normal" block sizes.
For reference flac it might look surprising that -5 and -0 make so little difference. But most of the graph is for (too!) small block sizes, and apparently the time penalty for processing those outweighs the calculation job.
@mycroft :
"seditious" sounds like you just learned a new bad word and were waiting to use it ... couldn't you be constructive and educate instead?
For better or for worse, FLAC's design was frozen long ago, and for now it stays the biggest player in the lossless audio files market, at least disregarding silver discs.
But good that you actually contribute repairing bad code, and not only whine toxicity.
@bryant on WavPack:
Whether wvunpack --threads=<N> vs ffmpeg -threads=<N> is apples to apples ... I don't know. It might be "better" than comparing wvunpack --threads vs ffmpeg (spawning all), and all I could do was to get it down to thread count matters enough to tilt the numbers.
I believe, based on my experiments, that FFmpeg -threads <n> specifies the total number of threads to use, which is identical to WavPack’s interpretation.
When specifying nothing (option alone) WavPack simply defaults to 5 threads (a good compromise on all machines, if not necessarily the fastest). FFmpeg either determines the number of physical threads and uses that (my guess), or uses unlimited threads (like “make”). The option is there to limit the use of threading only.
Another difference is specifying zero; FFmpeg ignores this (I guess it's the default) whereas WavPack disallows it.
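The thread-count semantics described in these posts can be written down as a small model. This is only a Python sketch of what was said in this thread, not of either tool's source code: the FFmpeg part follows the "my guess" statements (detects the hardware thread count and ignores requests beyond it), and treating "no option at all" as single-threaded for wvunpack is my assumption.

```python
from typing import Optional

WAVPACK_DEFAULT_THREADS = 5  # "--threads" given without a number, per the post


def wavpack_threads(option_given: bool, value: Optional[int] = None) -> int:
    """Model of wvunpack's thread count as described in this thread."""
    if not option_given:
        return 1                      # assumption: no option means single-threaded
    if value is None:
        return WAVPACK_DEFAULT_THREADS
    if value <= 0:
        raise ValueError("WavPack disallows zero threads")
    return value                      # may exceed the hardware thread count


def ffmpeg_threads(value: Optional[int], hw_threads: int) -> int:
    """Model of ffmpeg -threads as guessed in this thread."""
    if value is None or value == 0:
        return hw_threads             # guess: default and 0 both mean "detect"
    return min(value, hw_threads)     # guess: requests beyond hardware are ignored


if __name__ == "__main__":
    print(wavpack_threads(True, 12), ffmpeg_threads(12, hw_threads=8))
```

Under this model, --threads=12 versus -threads 12 on an 8-thread CPU would not be an equal comparison, which fits the 10% gain reported for WavPack going from 7 to 12.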