HydrogenAudio

Lossless Audio Compression => Lossless / Other Codecs => Topic started by: Porcus on 2024-04-13 09:47:08

Title: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-13 09:47:08
Why test wall time for multithreading? After all, multithreading doesn't use less CPU; it only cuts waiting time, and only if the CPU isn't already busy.  Conversion software that spawns one thread per file would be expected to be more efficient, so why not let it ... ?
But, there are situations where you want to save to compressed, say in DAW plug-ins; then I'd expect even two seconds' wait to be noticeable.  And if you are opening a project, then the same goes for decoding?

Anyway, swine got curious:
 * ffmpeg 7.0 now encodes faster: https://hydrogenaud.io/index.php/topic,125694.msg1042555.html#msg1042555 .  Its fastest encoding is now on par with single-threaded fastest official WavPack (which compresses better!), while 6.1.1 spent 60 percent more time compressing worse.  Seems to be due to one thread processing source and ... ?  (I have no idea of what overhead that creates.)
 * ffmpeg also decodes with some degree of multithreading, unknown to me until pointed out in that same thread. Aha ... so how fast?  The reference FLAC multithreading in git is only for encoding, not for decoding.
 * For multiple files, I just noticed that since decoding is fast, the penalty from a FOR loop is more than noticeable.  Hence decoders that can do wildcards are at an advantage.  Of course that advantage is very much real - heck, the time taken to type a FOR loop is also significant here! - but for measuring the decoder ... what then?

For a more apples to apples comparison, this is one (untagged) file, 73 minutes of CDDA on (internal) SSD. Corpus isn't super-important (I hope!):
I took the first ten and a half minutes of each of 7 CDs that are neither classical music nor metal, for the variety of signals: some old near-mono, some this and some that. Compressed better than your average I guess, numbers are given at the end.
So think of it as one full compilation CD as image (no cuesheet, no tags!)

Test was done with the hyperfine benchmarking tool that I recently started using. I ran the whole thing 11 times + warmup for the larger part, all with a pause in between to keep the CPU reasonably stable. Then, because some figures looked suspicious and I would re-run a few anyway (no big changes!), I included some like shn and mlp in that run and pasted them in order. (Since the fastest ran in 1.199 and then 1.200, that makes no difference.)  CPU: i5-1135G7, 4 cores 8 threads.
I wish hyperfine could be set to report this nice summary with median instead of mean, for robustness to the whims of the OS - but laziness gets the better of me when the output looks as nice as this.  Reformatted slightly, and commented.

  ffmpeg -i =N.flac.-5.flac -f wav -y NUL ran
   1.03 ± 0.01 times faster than | ffmpeg -i =N.flac.-0r0b4096--no-md5.flac -f wav -y NUL | This dual mono FLAC with no MD5 was encoded to decode fast.  Seems ffmpeg ignores MD5.
   1.16 ± 0.01 times faster than | ffmpeg -i =N.flac.-8l32.flac -f wav -y NUL | -8l32 --lax, to be more precise.  I cannot force flac.exe to use a very high order, but this was intended to decode slower and it did
   1.32 ± 0.02 times faster than | ffmpeg -i =N.tak.-p0.tak -f wav -y NUL | TAK, fastest one
   1.43 ± 0.02 times faster than | ffmpeg -i =N.tak.-p4m.tak -f wav -y NUL | Why this is faster than -p2 ... could be number of frames
   1.59 ± 0.02 times faster than | ffmpeg -i =N.tta -f wav -y NUL | TTA is a surprise. Look at how much faster than the reference ...
   1.73 ± 0.02 times faster than | ffmpeg -i =N.wv.ffmpeg-0.wv -f wav -y NUL | WavPack.  ffmpeg decodes WavPack faster than multithreaded wvunpack.exe does
   1.83 ± 0.02 times faster than | ffmpeg -i =N.wv.-f.wv -f wav -y NUL
   1.85 ± 0.03 times faster than | ffmpeg -i =N.flac.-2e.flac -f wav -y NUL | FLAC with smaller block size, that yields time penalty
   1.93 ± 0.03 times faster than | .\wvunpack.exe -qy --threads=8 =N.wv.-f.wv -o NUL | wvunpack with --threads=8.  Only one faster than flac -d.
   2.08 ± 0.02 times faster than | .\flac.exe -ss -d =N.flac.-0r0b4096--no-md5.flac -fo NUL | FLAC with official decoder.  Here the absence of MD5 matters.  ffmpeg does it twice as fast.
   2.08 ± 0.09 times faster than | ffmpeg -i =N.alac.refalac.m4a -f wav -y NUL | ALAC compressed with refalac
   2.20 ± 0.02 times faster than | ffmpeg -i =N.alac.ffmpeg.m4a -f wav -y NUL | ALAC compressed with ffmpeg
   2.45 ± 0.03 times faster than | ffmpeg -i =N.wv.-x.wv -f wav -y NUL | WavPack default mode
   2.51 ± 0.03 times faster than | ffmpeg -i =N.alac.cuetools8.m4a -f wav -y NUL | ALAC compressed with CUETools, the slower preset "8".
   2.69 ± 0.03 times faster than | .\wvunpack.exe -qy --threads =N.wv.ffmpeg-0.wv -o NUL | WavPack by official wvunpack --threads (selecting the thread count itself).  Note, no -m used.
   2.69 ± 0.03 times faster than | .\wvunpack.exe -qy --threads =N.wv.-f.wv -o NUL | (-q for "quiet")
   2.87 ± 0.03 times faster than | .\flac.exe -ss -d =N.flac.-5.flac -fo NUL | (-ss for "silent")
   3.04 ± 0.03 times faster than | .\flac.exe -ss -d =N.flac.-2e.flac -fo NUL | Because block size 1152?
   3.30 ± 0.04 times faster than | .\wvunpack.exe -qy --threads =N.wv.-x.wv -o NUL | WavPack default mode (-x does not slow down decoding)
   3.47 ± 0.17 times faster than | ffmpeg -i =N.wv.-hx2.wv -f wav -y NUL | ffmpeg on a high mode .wv nearly catches official on a default mode
   3.52 ± 0.04 times faster than | .\flac.exe -ss -d =N.flac.-8l32.flac -fo NUL | heaviest flac
   4.18 ± 0.14 times faster than | .\wvunpack.exe -qy --threads =N.wv.-hx2.wv -o NUL | wvunpack takes 27 percent more time than ffmpeg
   4.59 ± 0.07 times faster than | ffmpeg -i =N.wv.-hhx3.wv -f wav -y NUL
   5.19 ± 0.17 times faster than | .\takc.exe -d -overwrite -tn4 =N.tak.-p0.tak NUL | TAK. "-tn4" would turn on multithreaded encoding, but like FLAC it doesn't multithread decoding.  4x the time of ffmpeg!
   5.24 ± 0.13 times faster than | .\wvunpack.exe -qy --threads =N.wv.-hhx3.wv -o NUL
   5.68 ± 0.14 times faster than | .\takc.exe -d -overwrite -tn4 =N.tak.-p2.tak NUL
   5.76 ± 0.12 times faster than | ffmpeg -i =N.shn -f wav -y NUL | Shorten, for completeness. ffmpeg does that faster too.
   5.95 ± 0.15 times faster than | .\takc.exe -d -overwrite -tn4 =N.tak.-p4m.tak NUL
   6.48 ± 0.16 times faster than | .\wvunpack.exe -qy =N.wv.ffmpeg-0.wv -o NUL | wvunpack, single-threaded
   6.67 ± 0.08 times faster than | .\wvunpack.exe -qy =N.wv.-f.wv -o NUL
   6.95 ± 0.15 times faster than | .\shorten.exe -x =N.shn NUL
   7.39 ± 0.08 times faster than | .\refalac -D =N.alac.refalac.m4a -o NUL | refalac spends 3.6x the time of ffmpeg
   8.28 ± 0.16 times faster than | .\wvunpack.exe -qy =N.wv.-x.wv -o NUL
   9.18 ± 0.10 times faster than | .\tta.exe -d =N.tta NUL | TTA official spends 5.8x the time of ffmpeg.  Either one is good or the other is bad ... or could it be the large block size?
   9.75 ± 0.10 times faster than | ffmpeg -i =N.ape.-c1000.ape -f wav -y NUL | Monkey's is also faster with ffmpeg, but not that much
  10.55 ± 0.12 times faster than | .\refalac -D =N.alac.ffmpeg.m4a -o NUL
  10.75 ± 0.24 times faster than | .\wvunpack.exe -qy =N.wv.-hx2.wv -o NUL
  11.09 ± 0.12 times faster than | .\refalac -D =N.alac.cuetools8.m4a -o NUL | refalac spends 5.3x the time of ffmpeg on this heavier file.
  13.17 ± 0.30 times faster than | .\MAC.exe =N.ape.-c1000.ape NUL -d
  13.68 ± 0.16 times faster than | .\wvunpack.exe -qy =N.wv.-hhx3.wv -o NUL | hh-eaviest WavPack file.  2.6x the multithreaded time.  3x ffmpeg time.
  14.72 ± 0.16 times faster than | ffmpeg -i =N.ape.-c3000.ape -f wav -y NUL
  20.32 ± 0.41 times faster than | .\MAC.exe =N.ape.-c3000.ape NUL -d
  23.38 ± 0.51 times faster than | ffmpeg -i =N.mlp.mka -f wav -y NUL | MLP. I was curious.
  45.37 ± 0.63 times faster than | .\MAC.exe =N.ape.-c5000.ape NUL -d
  54.04 ± 0.57 times faster than | ffmpeg -i =N.ape.-c5000.ape -f wav -y NUL | The only one where ffmpeg ran slower than the official.  Yes I re-ran them.
It seems ffmpeg does that thing pretty universally, but not too well on Monkey's.
At this speed I might wonder whether there are significant differences due to whether/how the decoders ensure that the file is properly closed - even if it is null output. Speculations, but wavpack the encoder does close and reopen upon verification ...?
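
On the wish for medians instead of means: hyperfine can also write its raw results with --export-json, and each entry in the exported "results" list carries a "median" next to the "mean". A minimal Python sketch, assuming such an export has been made (the function names and any file name you pass in are illustrative, not something used in this test), that rebuilds the "times faster than" list from medians:

```python
import json


def load_results(path):
    """Read the "results" list from a hyperfine --export-json file."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)["results"]


def relative_medians(results):
    """Return (command, ratio) pairs, fastest first.

    ratio is the command's median time divided by the fastest median,
    i.e. a "times faster than" figure based on medians, not means.
    """
    ranked = sorted(results, key=lambda r: r["median"])
    fastest = ranked[0]["median"]
    return [(r["command"], r["median"] / fastest) for r in ranked]
```

Calling relative_medians(load_results(path)) on an export and printing the pairs gives a list in the same order of magnitude as the summary above, but less sensitive to a single slow outlier run.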

Also tested:

MPEG4-ALS. ffmpeg crashed consistently on this file; of course that made for the "fastest" run and made all the other figures wrong. Instead of correcting them, I discarded the lot and did another overnight run.

Extra time to write out.wav on same SSD compared to NUL
* zero-ish: wvunpack
* 0.07 to 0.11: ffmpeg (unreliably measured on .ape)
* 0.36 to 0.55: flac.exe (Xiph and Wombat, -2e was worst) and tta.exe
* 0.7 for refalac

takc.exe writes NUL.wav which takes 1.0 (1.2 seconds) more than just test decode - how much of that is for actual file and how much is for null output, I don't know. But it leads to this:
All official decoders can do test decode - verify by decoding. Extra time for them to do -o NUL compared to verify by decoding:
* 0.06 to 0.10 for flac.exe
* 0.5 ± a little, for wvunpack --threads, and 0.85 ± a little for single-threaded wvunpack.  Is this the penalty for checking that the file is properly closed? I think WavPack goes to greater lengths to do that.
* (unreliably measured on ape ... at those speeds it doesn't matter much.  If you don't want to wait, use the official GUI that can spawn a thread per file.)

And more:
* wvunpack --threads=<1 through 8>. One number posted in the table.
* Did .wv files encoded with --threads take more or less time to decode?  No, all within the variations. ± 0.07
* How did Wombat's most recent flac build (https://hydrogenaud.io/index.php/topic,123176.msg1041251.html#msg1041251) do? .42 to .49 slower. 
Timing for ffmpeg -i =N.wav -f wav -y NUL: around 0.4 seconds.  This is the only "seconds" here.

Except for that last 0.4 seconds, the numbers above are differences in the "times faster than" figures. Since the baseline ran in about 1.2 seconds, add twenty percent to get them in seconds.
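
The conversion is a single multiplication; a small Python sketch, assuming the 1.2 second baseline (the fastest run, reported as 1.199 and then 1.200) and a hypothetical helper name:

```python
BASELINE_SECONDS = 1.2  # fastest run in the table: 1.199 s, then 1.200 s


def ratio_to_seconds(ratio, baseline=BASELINE_SECONDS):
    """Convert a "times faster than" figure, or a difference between
    two such figures, into seconds of wall time."""
    return ratio * baseline
```

For example, the 2.87 for flac.exe on the -5 file comes out at roughly 3.44 seconds, and the 0.5 extra for wvunpack --threads writing to NUL at roughly 0.6 seconds.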


Finally, file sizes. WAV is 773972684, and the following are compression ratios - the content of old jazz/soul makes for smaller files:
45.8%   =N.ofr.--presetmax.ofr
46.3%   =N.ofr.--preset7.ofr
47.0%   =N.ofr.--preset2.ofr
47.4%   =N.tak.-p4m.tak
47.9%   =N.ofr.--preset0.ofr
48.1%   =N.tak.-p2.tak
49.1%   =N.wv--threads.-hhx3.wv
49.1%   =N.wv.-hhx3.wv
49.3%   =N.wv.-hx2.wv
49.4%   =N.wv--threads.-hx2.wv
49.5%   =N.flac.-8l32.flac
49.6%   =N.als
49.6%   =N.als.m4a
49.8%   =N.tta
50.0%   =N.tak.-p0.tak
50.1%   =N.flac.-5.flac
50.1%   =N.wv.-x.wv
50.1%   =N.wv--threads.-x.wv
50.4%   =N.alac.cuetools8.m4a
51.1%   =N.alac.refalac.m4a
51.6%   =N.alac.ffmpeg.m4a
51.9%   =N.wv--threads.-f.wv
51.9%   =N.wv.-f.wv
53.0%   =N.flac.-2e.flac
58.4%   =N.wv.ffmpeg-0.wv
59.9%   =N.shn
60.7%   =N.flac.-0r0b4096--no-md5.flac
70.9%   =N.mlp.mka


Soon to be posted: fast-verification times.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: mycroft on 2024-04-13 09:56:43
TTA has a fixed number of encoded samples for each packet, except the last packet in the file. There is no way to do any optimizations here except bruteforce threading.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-13 10:30:46
FFmpeg still does so much better than tta.exe that it adds to the suspicion that the reference implementation isn't very good.
I don't speak code, but the following also indicate that reference tta.exe isn't particularly stellar:
* FFmpeg-tta does things that tta.exe cannot - like detect errors.
* Official foobar2000 component errs out on certain files (I think it is 8-bits, fixed in case's component).
* I have not tested this rewrite, but it claims speedups: https://hydrogenaud.io/index.php/topic,125048.0.html
* tta.exe is picky about WAVE version, and thinks that WAVE sample count is signed integer.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: guruboolez on 2024-04-13 10:47:18
Impressive improvements!
Are FFMPEG lossy encoders also multithreaded? It could be also very interesting for video tools (handbrake…).

Thanks for this table Porcus, it's very interesting :)
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-13 10:50:11
Note, I have tested DEcoding here, not ENcoding. What has happened in ffmpeg 7.0 on the encoding ... quoting from https://ffmpeg.org/#cli_threading

Thanks to a major refactoring of the ffmpeg command-line tool, all the major components of the transcoding pipeline (demuxers, decoders, filters, encodes, muxers) now run in parallel. This should improve throughput and CPU utilization, decrease latency, and open the way to other exciting new features.
Note that you should not expect significant performance improvements in cases where almost all computational time is spent in a single component (typically video encoding).
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: guruboolez on 2024-04-13 12:10:45
Quote: "Note, I have tested DEcoding here, not ENcoding."

Ah yes, it's mentioned in the title  :-*
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-13 12:21:48
Fast-verification times.

WavPack (from format 5, decoder 5.40), Monkey's (CLI from ... year twentytwenty-something) and OptimFROG can verify a file without carrying out the decoding - especially good on the latter two, that incur some CPU load doing decoding. Of course, no decoding does not verify that the audio is what it is supposed to be, but block-level checksums should protect against bit-flips and general corruption.

Other formats, like FLAC, do have block-level checksums and could do the same, but with no application supporting it.
Whether it would offer much value-added for FLAC, which decodes fast and whose users are so accustomed to having audio MD5 being included that the file vendor who supplies FLAC downloads without MD5 gets the evil eye - up to opinion, but at least here is a take on the differences in speed.

Same single file as above. Take note that the fastest of these, WavPack in high mode (fewer blocks?) ran in
0.239 seconds <--- 18358x realtime!
  .\wvunpack.exe -q -vv =N.wv.-hhx3.wv ran
    1.00 ± 0.02 times faster than .\wvunpack.exe -q -vv =N.wv.-hx2.wv
    1.22 ± 0.02 times faster than .\wvunpack.exe -q -vv =N.wv.-x.wv
    1.23 ± 0.04 times faster than .\wvunpack.exe -q -vv =N.wv.-f.wv
    3.02 ± 0.05 times faster than .\MAC.exe =N.ape.-c5000.ape -v
    3.06 ± 0.05 times faster than .\MAC.exe =N.ape.-c3000.ape -v
    3.13 ± 0.05 times faster than .\MAC.exe =N.ape.-c1000.ape -v
    3.47 ± 0.06 times faster than .\ofr.exe --verify =N.ofr.--presetmax.ofr
    3.54 ± 0.06 times faster than .\ofr.exe --verify =N.ofr.--preset7.ofr
    3.59 ± 0.07 times faster than .\ofr.exe --verify =N.ofr.--preset2.ofr
    3.66 ± 0.06 times faster than .\ofr.exe --verify =N.ofr.--preset0.ofr
   10.20 ± 0.17 times faster than .\flac.exe -ss -t =N.flac.-0r0b4096--no-md5.flac
   12.37 ± 0.23 times faster than .\flac-wombat.exe -ss -t =N.flac.-0r0b4096--no-md5.flac
   14.36 ± 0.24 times faster than .\flac.exe -ss -t =N.flac.-5.flac
   15.09 ± 0.29 times faster than .\flac.exe -ss -t =N.flac.-2e.flac
   16.35 ± 0.28 times faster than .\flac-wombat.exe -ss -t =N.flac.-5.flac
   17.00 ± 0.27 times faster than .\flac-wombat.exe -ss -t =N.flac.-2e.flac
   17.62 ± 0.28 times faster than .\flac.exe -ss -t =N.flac.-8l32.flac
   19.69 ± 0.33 times faster than .\flac-wombat.exe -ss -t =N.flac.-8l32.flac
   20.55 ± 0.36 times faster than .\takc.exe -t =N.tak.-p0.tak
   23.21 ± 0.42 times faster than .\takc.exe -t =N.tak.-p2.tak
   24.26 ± 0.42 times faster than .\takc.exe -t =N.tak.-p4m.tak
   28.61 ± 0.48 times faster than .\wvunpack.exe -q -vv =N.wv.ffmpeg-0.wv
No "fast" verification in the latter, which is a WavPack version 4 file - that is what ffmpeg creates. Included as a "(s)low anchor".
-q for quiet, -ss for silent, I am not sure if it matters since hyperfine does not display a console, but ... habits, habits. "flac-wombat.exe": renamed the exe of the latest build (link in original post).
hyperfine command in the bat file, the pings take a second each and are for pause in between:
hyperfine.exe -i --style full -r 11 -w 1 --prepare "(for /l %%t IN (1,1,8) DO ping 127.0.0.1 )" <and the command list>
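
The "x realtime" figure quoted for the fastest verification can be checked from the WAV size given in the first post. A short Python sketch, assuming a canonical 44-byte WAV header (an assumption; the helper names are made up for illustration):

```python
CDDA_BYTES_PER_SECOND = 44100 * 2 * 2  # 44.1 kHz, 2 channels, 16-bit samples
WAV_HEADER_BYTES = 44                  # assumption: minimal canonical header
WAV_SIZE = 773972684                   # WAV file size from the first post


def duration_seconds(wav_size=WAV_SIZE):
    """Playing time of a CDDA WAV file of the given size in bytes."""
    return (wav_size - WAV_HEADER_BYTES) / CDDA_BYTES_PER_SECOND


def realtime_factor(wall_seconds, wav_size=WAV_SIZE):
    """How many times faster than playback the operation ran."""
    return duration_seconds(wav_size) / wall_seconds
```

With these numbers the file plays for 4387.6 seconds (a bit over 73 minutes), and realtime_factor(0.239) rounds to the 18358 stated above.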

Summarizing:
* WavPack (fastest) verifies around 3x as quickly as Monkey's and OptimFROG. WavPack's block-level checksum is evidently fast.
* Still the slowest frog verifies 73 minutes CDDA in less than a second ...
* ... which in turn is 4x to 5x the speed of FLAC, at least if your flac files have MD5 as they reasonably should.
* TAK to FLAC ratios are what you would expect from decoding, because that is what they do. Same goes for that old WavPack format.

Also tested:
* On USB3-connected spinning drive: tested the fastest .\wvunpack.exe -q -vv =N.wv.-hhx3.wv , at like 10 percent time penalty. Also a cursory test on Monkey's confirms that I/O doesn't do that much here.
* Multithreading the fastest wvunpack, that is .\wvunpack.exe --threads -q -vv =N.wv.-hhx3.wv .  Somewhat surprising, that incurred an additional nine percent-ish penalty on the USB3 spinning drive, but saved nine percent-ish on the SSD.


More discussion on error detection capabilities and robustness at https://hydrogenaud.io/index.php/topic,122094 . Note that the reference FLAC decoder has in the meantime been changed to mute corrupted blocks (so output has the right length) rather than to drop them.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: bryant on 2024-04-21 05:11:48
@Porcus
Thanks for your always thorough tests! Interestingly your results differ from mine somewhat (e.g., slower WavPack) and I'm not sure exactly what's going on, but I'll post them here in a table for reference. My technique is not nearly as automated nor exhaustive as yours, but I did run the tests enough times to convince myself that I was getting reasonably accurate results. I tested on FFmpeg 7.0, WavPack 5.7.0, and one of the most recent FLAC builds on a double-album CDDA file (2h18m) encoded to WavPack and FLAC (w/ and w/o MD5) at modes suited for fast decoding.

Your system has 8 threads and mine 12, but I see the same relative speeds on my other Intel 8-thread machine and my 16-core AMD (but I don't test on those because neither are Windows).

One of the limits of WavPack multithreading in its current form is that it can't keep all physical threads continuously busy because it only runs worker threads during the actual client call into libwavpack. So each call splits the work into the requested number of threads and then waits until the last one finishes before returning to the caller. This might be why adding additional threads beyond those physically available continues to improve performance in sort of a linear way.

Also, using just --threads is the equivalent (for now) to --threads=5. There is no determination based on available threads or anything like that, although that could obviously be added at some future date. That value (5) is the point where the trade-off between CPU work and speed starts to significantly deteriorate. In other words, --threads=12 will almost always be faster than the default (unless the CPU starts throttling down), but will use significantly more total CPU time/power due to context switching.
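
The split-and-wait behaviour described above can be put into a toy model. This is only a sketch of the general fork-join pattern, not of libwavpack's code: the per-thread work figures are invented, and call_wall_time and utilisation are hypothetical helper names.

```python
def call_wall_time(work_per_thread):
    """A call returns only when its slowest worker has finished."""
    return max(work_per_thread)


def utilisation(work_per_thread):
    """Fraction of the reserved thread-time that did useful work.

    Each of the N workers is tied up for the whole call, so the
    reserved time is N times the wall time of the call.
    """
    busy = sum(work_per_thread)
    reserved = len(work_per_thread) * call_wall_time(work_per_thread)
    return busy / reserved
```

In this model, four workers with work [4, 4, 4, 8] give a utilisation of 0.625: three of them sit idle for half of the call while waiting for the last one, which is the kind of gap that keeps physical threads from being continuously busy.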

Multithreaded Decoding Test


Format | Program | Options | Time | Comment
flac | FFmpeg | - | 2.10 sec | 3968 xRT (3.5 x single-threaded)
WavPack | wvunpack | --threads=12 | 2.94 sec | 2835 xRT (5.4 x single-threaded)
WavPack | wvunpack | --threads=8 | 3.62 sec | 2302 xRT (4.4 x single-threaded)
WavPack | FFmpeg | - | 4.14 sec | 2013 xRT (5.4 x single-threaded)
WavPack | wvunpack | --threads | 4.91 sec | 1697 xRT (3.3 x single-threaded)
flac-no-md5 | flac | - | 6.24 sec | 1336 xRT
flac | FFmpeg | -threads 1 | 7.28 sec | 1145 xRT
flac-md5 | flac | - | 8.68 sec | 960 xRT
WavPack | wvunpack | - | 16.01 sec | 521 xRT
WavPack | FFmpeg | -threads 1 | 22.53 sec | 370 xRT


Final notes:

Quick Verify

As for the significantly faster performance of the WavPack quick-verify mode, your guess is probably right that it's because the checksum I use is very fast. It's far simpler than an MD5 or even a CRC, but it's not quite as simple (or as weak cryptographically) as a true checksum (there's an additional shift and add each byte). However, there is absolutely no support of multithreading with the quick verify, so those differences you show are suspect.
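
To illustrate what "an additional shift and add each byte" can buy over a true checksum, here is a sketch in Python. It is not taken from the WavPack source: the multiply-by-three update, the all-ones start value and the function names are assumptions chosen for illustration only.

```python
MASK32 = 0xFFFFFFFF


def plain_sum(data):
    """A true checksum: just add up the bytes (32-bit wraparound)."""
    return sum(data) & MASK32


def shift_add_checksum(data):
    """Illustrative variant with one extra shift and add per byte.

    (c << 1) + c is c * 3, so the position of each byte influences the
    result, which a plain sum cannot do.
    """
    c = MASK32  # assumed start value, for illustration
    for byte in data:
        c = ((c << 1) + c + byte) & MASK32
    return c
```

A plain sum gives the same value for b"ab" and b"ba", while the shift-and-add variant tells them apart, at the cost of one extra shift and add per byte.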
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: ktf on 2024-04-21 14:29:25
Quote: "It is interesting that FFmpeg manages to implement decent multi-threaded FLAC decoding despite the frame length not being present in the header. How does it do that?"

ffmpeg has strictly separated decoding and demuxing. So for FLAC it looks for sync codes and does some short integrity checks as part of the demuxing. When decoding FLAC in ffmpeg, you'll see warnings every now and then because of that, when it stumbles upon something it thinks is a frame, but isn't. This has been the case for many years already, because of this strict separation.

Of course, with this mechanism in place, multithreading decoding is rather trivial.
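
A rough sketch of the "look for sync codes" step, in Python. It relies only on the FLAC frame sync pattern, which makes a frame start with the bytes 0xFF 0xF8 (fixed block size) or 0xFF 0xF9 (variable block size). It is not ffmpeg's parser, and find_sync_candidates is a hypothetical name.

```python
def find_sync_candidates(buf):
    """Return byte offsets where a FLAC frame header might start.

    A hit only means "looks like a sync code". Audio data can contain
    the same two bytes by chance, so further checks are still needed
    (the frame header carries a CRC-8 that can be used for this).
    """
    hits = []
    for i in range(len(buf) - 1):
        if buf[i] == 0xFF and (buf[i + 1] & 0xFE) == 0xF8:
            hits.append(i)
    return hits
```

Once candidate boundaries have been confirmed, each frame is an independent unit of work that can be handed to a separate thread; the false hits are where the occasional warnings come from.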
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-22 10:57:37
Hm, definitely some confusion on me, as usual:
* I also found out that not only TAK, but also
wvunpack filename.wv -yo NUL writes to NUL.wav, and that seemingly takes more time than stdout redirected to NUL:
wvunpack filename.wv -yo - > NUL
* Weird about that fast-verification --threads, the numbers looked consistent enough to conclude, and I didn't think it would tax the CPU that much. Seven seconds in between a quarter of a second work?!

(Does Windows keep the executable in memory or something?)


Quote: "Of course, with this mechanism in place, multithreading decoding is rather trivial."
So ... the obvious question is, any reason why not?
The odd event that "a valid frame header" shows up just by random in the data (the FLAC specification doesn't forbid junk between frames, as long as it is byte-aligned, and in any case parsing must take into account that a stream may be broken ...). Or even worse and more odd, an entire "valid frame" starting inside another?

Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: ktf on 2024-04-22 14:57:33
Because of the following:

Quote: "you'll see warnings every now and then"

It makes decoding much more complicated, less predictable and less stable. For ffmpeg it was necessary to fit its model in which decoding and demuxing are completely separated.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: nu774 on 2024-04-22 17:06:15
In other words, MP4/Matroska/Ogg/CAF is actually better for ffmpeg than the original FLAC container format?
Among these, the original FLAC container is the only one for which fb2k cannot do real-time bitrate display.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: ktf on 2024-04-22 19:54:36
Quote: "In other words, MP4/Matroska/Ogg/CAF is actually better for ffmpeg than the original FLAC container format?"
Yes, the inability to reliably skip ahead 1 frame without having to decode it is sometimes a disadvantage. For multithreading this is very valuable. However, relying solely on frame lengths is much less robust, and relying on both adds overhead of course. Maybe FLAC's design was a bit too much focussed on reducing overhead.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-24 16:42:28
EDIT 29. April: Big user errors, some numbers included encoding (see replies #23 and 28) and the highest speeds were ffmpeg rejecting file rather than decoding it. Porcus facepalms and thanks mod for help committing edit - and @ktf for reacting to the numbers.


I tested FLAC in containers. Not CAF, I forgot about that one. With and without multithreading ffmpeg. This time I tried a shorter file - half an hour - because there were so many to run through.
With quite extreme settings, including blocksize 16 - that malice paid off ...
Turns out ffmpeg refused to remux the uncompressed flac streams into any of the three containers I tried.

Container overhead
* flac -5 is a sane setting, and the biggest overhead for that one was 0.44 percent (not percentage points) for OGG container
* Blocksize 16 is just nuts, but for what the file sizes are worth - .wav in the middle. No padding:
323 001 659 ¨3x.flac-8b16.flac
328 733 400 ¨3x.flac-8b16.flac.oga
331 702 604 ¨3x.wav
343 738 725 ¨3x.flac-8b16.flac.mp4
354 113 911 ¨3x.flac-8b16.flac.mka
9.6 percent penalty for putting it in Matroska. I used ffmpeg, with commands like ffmpeg -i ¨3x.flac-8b16.flac -acodec copy -vn -sn ¨3x.flac-8b16.flac.mka
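
For reference, the overhead percentages follow directly from the file sizes listed above. A small Python sketch (the dictionary keys and the helper name are just labels for this example):

```python
RAW_FLAC = 323001659  # ¨3x.flac-8b16.flac

CONTAINER_SIZES = {
    "oga": 328733400,
    "mp4": 343738725,
    "mka": 354113911,
}


def overhead_percent(container_bytes, raw_bytes=RAW_FLAC):
    """Size increase of the containered file over the plain .flac, in percent."""
    return 100.0 * (container_bytes / raw_bytes - 1.0)


for name, size in CONTAINER_SIZES.items():
    print(f"{name}: {overhead_percent(size):.1f} percent")
```

This gives about 1.8 percent for OGG, 6.4 percent for MP4 and 9.6 percent for Matroska at blocksize 16.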


For sorting I moved the ".oga" etc. to a separate column. ¨3x.flac-5.flac <tab> .oga means the file is an OGG containered ¨3x.flac-5.flac.oga .  (The reason for the "¨" is to make sure the test audio files had a character nothing else had.)
Threads | decoder | settings on encoding | container | speed x realtime | comment (in parentheses: edit April 29th thanks to mod)
1 | flac.exe | ¨3x.flac-0b65535--no-md5--uncompressed.flac | - | 500 | (number included encoding)
1 | ffmpeg | ¨3x.flac-0b65535--no-md5--uncompressed.flac | - | 8791 | (ffmpeg failed to decode this)
7 | ffmpeg | ¨3x.flac-0b65535--no-md5--uncompressed.flac | - | 8685 | (ffmpeg failed to decode this)
1 | flac.exe | ¨3x.flac-0b65535--no-md5.flac | - | 527 | (number included encoding)
1 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | - | 1474 | about same for containers
7 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | - | 3544 | slower than containers
7 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | .oga | 4919 |
7 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | .mp4 | 6013 | mp4 very fast
7 | ffmpeg | ¨3x.flac-0b65535--no-md5.flac | .mka | 5932 |
1 | flac.exe | ¨3x.flac-0r0--no-md5.flac | - | 518 | (number included encoding)
1 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | - | 1049 | about same for containers
7 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | - | 1869 | containers are only slightly faster.
7 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | .oga | 1879 |
7 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | .mp4 | 1918 |
7 | ffmpeg | ¨3x.flac-0r0--no-md5.flac | .mka | 1924 | Not that much faster
1 | flac.exe | ¨3x.flac-5.flac | - | 518 | (number included encoding)
1 | ffmpeg | ¨3x.flac-5.flac | - | 966 | about same for containers
7 | ffmpeg | ¨3x.flac-5.flac | - | 2981 |
7 | ffmpeg | ¨3x.flac-5.flac | .oga | 3600 | noticeably faster in all containers
7 | ffmpeg | ¨3x.flac-5.flac | .mp4 | 3827 |
7 | ffmpeg | ¨3x.flac-5.flac | .mka | 3854 |
1 | flac.exe | ¨3x.flac-8b16.flac | - | 247 | (number included encoding but still took way less time than ffmpeg decoding)
1 | ffmpeg | ¨3x.flac-8b16.flac | - | 80 | about as slow for containers
7 | ffmpeg | ¨3x.flac-8b16.flac | - | 31 | Even slower! And about as slow for containers
1 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | - | 669 | about the same for containers. Forgot to run flac.exe on this one.
7 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | - | 2493 |
7 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | .oga | 2599 |
7 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | .mp4 | 2631 |
7 | ffmpeg | ¨3x.flac-8pr8--lax-l32.flac | .mka | 2642 |
I am not sure how ffmpeg -threads 1 works, or whether I should use "0" to get single-threaded, because it does decode much quicker than reference flac.
I also ran ffmpeg decoding without the -threads option, which uses all 8 threads; that improved the flac-in-other-containers figures slightly (but harmed WavPack slightly, I leave that for a separate post).

So table does not list speed for ffmpeg without -threads, nor for the following:
* the same entire thing run on USB3-connected spinning drive. Differences were just very minor. These figures are on internal SSD.
* ogg/mp4/mkv decoded with ffmpeg -threads 1, those were pretty much the same as .flac speeds
* same for the -8b16 in containers, those were just as horrible as .flac
Yes blocksize 16 decodes slow, but ffmpeg just does it terribly.

(Edit April 29: codebox also with misleading number deleted)


MOD note: The above post was edited by request of the OP.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: mycroft on 2024-04-24 21:39:34
I already mentioned that current ffmpeg cli utility is useless for extremely small packets. FFmpeg developer that rewrote ffmpeg.c related code did not care and still does not care about this bug. So for small packets use lib calls directly instead of brain-dead ffmpeg.c implementation.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-24 21:45:28
BLUNDER on me and on ffmpeg.

ffmpeg errs out on the ¨3x.flac-0b65535--no-md5--uncompressed.flac also when decoding. Of course I should have checked that when it refuses to demux.

It is not about it using the only-verbatim-subframes flac - likely it is about frames being too big.
The attached 1.3 second flac file - good old Merzbow at it again - has 57330 samples and is created with
-0r0 --no-padding -fb57330 --lax
So one frame, both subframes are FIXED, order 1.
ffmpeg cannot decode it. Recompress it with smaller block size, and it will - 57300 is still too large though.

Edit: reuploaded without artwork, that is not to blame - and seems the "r0" is superfluous.
Fiddling around with files I found out that padding-or-not could even influence the max block size. I got a file where 53207 with default padding is OK, 53208 with default padding is not, 53208 with --no-padding is OK.


@ktf, of course there is nothing wrong with the file? The blame is squarely on ffmpeg?
Quote: "It makes decoding much more complicated, less predictable and less stable."
You might have had a point ...
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: mycroft on 2024-04-24 23:56:02
FLAC format is brain-dead for stream-oriented usage. It's the same as the historic Shorten format.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-25 10:32:22
Suggesting that someone at ffmpeg did only consider the streamable subset ... ?

Good idea to test that, then. ffmpeg fails the file uploaded at https://hydrogenaud.io/index.php/topic,125848  [edit: botched attachment]

6ch, 96/24. Generated by:

sox -b 24 -c 6 -r 96000 -n whitenoise.wav synth 1 whitenoise
flac whitenoise.wav --channel-map=none -b16384
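
For those without sox, a comparable one-second test file can be sketched with Python's standard wave module. This is an approximation under assumptions: the noise is uniform random bytes rather than whatever sox's whitenoise produces, the module writes a plain PCM header, and whether flac or ffmpeg behave the same on such a file is untested. write_white_noise is a hypothetical helper name.

```python
import os
import wave


def write_white_noise(path, seconds=1, channels=6, rate=96000, bytes_per_sample=3):
    """Write uniformly random PCM samples to a WAV file; return the frame count.

    Defaults mirror the file above: 6 channels, 96 kHz, 24-bit, 1 second.
    """
    frames = seconds * rate
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(bytes_per_sample)
        wav.setframerate(rate)
        # Random bytes give full-scale uniform noise in every channel.
        wav.writeframes(os.urandom(frames * channels * bytes_per_sample))
    return frames
```

The resulting file could then be fed to the same flac command line as above.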
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-26 12:37:13
To the extent TTA is interesting at all, the lack of flexibility suggests that differences can be put down to code quality and not to "prioritized this type of encoding strategy" and the like. And:
ffmpeg -threads 1 DEcodes much faster than the reference, which takes 42 percent more time. Tested the half-hour long file, mean of repeated runs:

533x realtime: DEcoding by ffmpeg -threads 1
375x realtime: DEcoding by tta.exe

ffmpeg also ENcodes faster, but there the difference is small. 451x realtime vs 423x realtime.
Everything to NUL.

Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: mycroft on 2024-04-26 12:52:44
TTA container needs buffering all packets in memory when encoding.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-26 13:23:49
Meaning, there is not much to do to optimize encoding - but decoding then, is that due to more efficient WAVE writing? Or am I just guessing wrong from what you write?
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: mycroft on 2024-04-26 14:36:10
Ever heard of doing actual benchmark via perf or any other professional and advanced tool?

It will show where most CPU time is wasted in binary when executing some operations.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-26 16:21:49
Heard of yes, done no - so what was the output?
Since you claimed for TTA that "There is no ways to do any optimizations here except bruteforce threading", then well, ... timing suggests otherwise, and to the extent that I don't think you need a more fine-tuned setup.

Nor do I think that would have tipped over as big differences as the following:
 * WavPack verifies three times as fast as Monkey's - and 10x as fast as reference FLAC (because it doesn't do fast-verify) and 28.6x as fast as ffmpeg-generated WavPack 4 files (since those files don't offer the option)
 * Different FLAC builds differ in encoding time by a factor of 2.5 on -5, and even more at -8: https://hydrogenaud.io/index.php/topic,123025.msg1029768.html#msg1029768 .  Sure one could be interested in an explanation, but you don't need that level of detail to point out that there are big differences.
 * ffmpeg -threads 1 decodes nearly twice as fast as reference flac at -5, but several times slower at low block sizes
 * ffmpeg outright rejects some FLAC files instead of decoding them.  Heck I even got it to reject subset FLAC.  And when it has encoded .wv files it cannot decode itself, then seriously: When flaws are like that, who would whine over the timing tool?!
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: ktf on 2024-04-26 20:13:08
@ktf, of course there is nothing wrong with the file? The blame is squarely on ffmpeg?
I see nothing wrong with the file. Maybe the problem is that it consist of a single block?

* ffmpeg -threads 1 decodes nearly twice as fast as reference flac at -5, but several times slower at low block sizes
Of course I dove into that, because if FLAC can be twice as fast, that would be great! But I cannot reproduce, not on Linux nor Windows, not on SSD nor ramdisk. Can you check your results and see whether a different way of collecting times gives you the same result?
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: mycroft on 2024-04-26 20:13:52
I'm obviously talking to walls. Bye bye!
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: bryant on 2024-04-26 20:38:34
* ffmpeg -threads 1 decodes nearly twice as fast as reference flac at -5, but several times slower at low block sizes
Of course I dove into that, because if FLAC can be twice as fast, that would be great! But I cannot reproduce, not on Linux nor Windows, not on SSD nor ramdisk. Can you check your results and see whether a different way of collecting times gives you the same result?
My results (see reply #7 above) show FFmpeg -threads 1 decoding right in between FLAC with md5 and FLAC without md5.

As Porcus mentioned, FFmpeg doesn't seem to pay attention to whether the FLAC file has an md5, but the question is whether it always calculates the sum (in which case it's faster than FLAC) or never does (in which case it's slower). My guess would be that it doesn't.

Single-threaded FFmpeg being slower than native WavPack could be explained by its lacking the ASM optimizations, which really don't make sense there for such a niche format (from a maintenance point of view).
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-26 22:55:40
WavPack again. Only .wv in this post. wvunpack and ffmpeg


-z0q and time.  Does console output slow it down? 
TL;DR: -vv is so fast that one can argue that yes it does.

An initial test wasn't dramatic: hyperfine on single untagged -f-encoded file indicated a cost of 0.05 seconds on decoding to NUL and 0.02 seconds for fast verify.
But, then. Since there is only wvunpack against itself to test on this matter, and it supports wildcards, I took two actual albums, with tags and all, in separate tracks, on a spinning drive: one full CD, 75 minutes of Bach's organ works, 652 kbit/s, 28 tracks/files; and Black Sabbath s/t, 38:14, a CD split into 7 tracks/files, 903 kbit/s. Both encoded using -x6.

Fast verification: -qz0 is significantly faster in percentage terms. 0.1 seconds isn't much, but if one has a full TB of CDs (like 3000 of them), it will add up to a few minutes. 25 runs with two seconds sleep in between:
wvunpack *.wv -vv -qz0: 304 ms on Bach and 187 ms on Black Sabbath
wvunpack *.wv -vv : increases by 103 ms (36 percent) resp. 81 ms (46 percent)
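
The "few minutes" estimate is plain multiplication; a tiny Python sketch with the round figures from the paragraph above (0.1 s per album and 3000 albums are the post's own rough numbers, not measurements):

```python
# Rough figures from the post: console output costs about 0.1 s per album,
# and a terabyte of CDs is taken to be about 3000 albums.
overhead_per_album_s = 0.1
albums = 3000

total_s = overhead_per_album_s * albums
print(f"{total_s:.0f} s = {total_s / 60:.0f} minutes")  # 300 s = 5 minutes
```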

Decoding to -o - > NUL: Since that takes more time, the percentages aren't that impressive of course. The differences in time look suspicious to be honest, but 19 runs and not too outrageous standard deviations:
wvunpack *.wv -z0qy -o - > NUL : 9.265 s ±  0.024 s  resp. 4.917 s ±  0.025 s
wvunpack *.wv -yo - > NUL :  9.698 s ±  0.032 s resp. 5.039 s ±  0.013 s
Writing to the same spinning drive would bring it to 30 or 15 seconds, and that kinda sets the perspective.
But fast verification is now so fast that reporting the progress makes a noticeable difference if one is scanning a whole collection.
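
For anyone without hyperfine, the "N runs with a sleep in between, report mean ± stdev" procedure can be sketched in a few lines of Python. This is only a rough stand-in under stated assumptions: `time_command` is a made-up helper, it discards stdout/stderr (so it measures the tool without console rendering and cannot see title-bar updates), and it includes process start-up time.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=25, sleep_s=2.0):
    """Run cmd repeatedly; return (mean, stdev) of wall time in seconds."""
    samples = []
    for i in range(runs):
        if i > 0:
            time.sleep(sleep_s)  # pause between runs, as in the test above
        start = time.perf_counter()
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        samples.append(time.perf_counter() - start)
    stdev = statistics.stdev(samples) if runs > 1 else 0.0
    return statistics.mean(samples), stdev


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # wvunpack command line under test.
    mean, stdev = time_command([sys.executable, "-c", "pass"],
                               runs=3, sleep_s=0.0)
    print(f"{mean * 1000:.1f} ms ± {stdev * 1000:.1f} ms")
```

Run once with the quiet options and once without, then compare the two means against the standard deviations.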


Decoding speeds again. Tested more multithreading.
TL;DR: wvunpack --threads=<N> beats ffmpeg -threads <N> at decoding - with one exception where block size was forced to maximum.

I did the "shorter" half-hour file. Commands were like the following:
ffmpeg -threads 8 -i   ¨3x.wv.-fx0.wv   -hide_banner -loglevel error -f wav -y NUL
.\wvunpack --threads=8   ¨3x.wv.-fx0.wv   -z0qyo -o - > NUL 
the latter being the fastest.  -fx0 is in the filename because that was the encoding option (it means -f; in new releases -x0 means "no x", which is handy for FOR looping). And yes, there are two "o" options because I didn't spot it, but wvunpack didn't object.

Times first, mean ± stdev - having sorted the output by the .wv file and then by speed, fastest at top:

-f:
1.00 .\wvunpack --threads=8   
1.09 ± 0.00      .\wvunpack --threads=7   
1.13 ± 0.01      ffmpeg -threads 8
1.20 ± 0.01      ffmpeg -threads 7
1.54 ± 0.02      .\wvunpack --threads=4   
1.90 ± 0.01      .\wvunpack --threads=3   
1.95 ± 0.05      ffmpeg -threads 4
2.35 ± 0.05      ffmpeg -threads 3
3.65 ± 0.01      .\wvunpack --threads=1   
4.94 ± 0.06      ffmpeg -threads 1

-x:
1.23 ± 0.04      .\wvunpack --threads=8   
1.31 ± 0.01      .\wvunpack --threads=7   
1.46 ± 0.01      ffmpeg -threads 8
1.55 ± 0.01      ffmpeg -threads 7
1.86 ± 0.03      .\wvunpack --threads=4   
2.30 ± 0.02      .\wvunpack --threads=3   
2.60 ± 0.01      ffmpeg -threads 4
2.99 ± 0.04      ffmpeg -threads 3
4.53 ± 0.02      .\wvunpack --threads=1   
6.35 ± 0.02      ffmpeg -threads 1

Ran, but omitted from this list: -x --blocksize=4096 to see if it mattered.  Not much except ffmpeg -threads 1 was up to 6.93.

-hx2:
1.70 ± 0.01      .\wvunpack --threads=8   
1.80 ± 0.01      .\wvunpack --threads=7   
2.00 ± 0.01      ffmpeg -threads 8
2.19 ± 0.18      ffmpeg -threads 7
2.48 ± 0.02      .\wvunpack --threads=4   
3.08 ± 0.02      .\wvunpack --threads=3   
3.43 ± 0.03      ffmpeg -threads 4
3.91 ± 0.03      ffmpeg -threads 3
5.95 ± 0.03      .\wvunpack --threads=1   
8.79 ± 0.06      ffmpeg -threads 1

-hhx3:
2.21 ± 0.08      .\wvunpack --threads=8   
2.27 ± 0.02      .\wvunpack --threads=7   
2.66 ± 0.03      ffmpeg -threads 8
2.82 ± 0.01      ffmpeg -threads 7
3.17 ± 0.03      .\wvunpack --threads=4   
3.92 ± 0.02      .\wvunpack --threads=3   
4.52 ± 0.02      ffmpeg -threads 4
5.33 ± 0.11      ffmpeg -threads 3
7.67 ± 0.02      .\wvunpack --threads=1   
11.77 ± 0.05      ffmpeg -threads 1

And finally, ffmpeg wins this one with maximum blocksize (except single-threaded)
-hhx4 --blocksize=131072
2.57 ± 0.01      ffmpeg -threads 8
2.75 ± 0.02      ffmpeg -threads 7
3.12 ± 0.06      .\wvunpack --threads=8   
3.40 ± 0.03      .\wvunpack --threads=7   
3.94 ± 0.05      ffmpeg -threads 4
4.84 ± 0.05      ffmpeg -threads 3
4.95 ± 0.03      .\wvunpack --threads=4   
5.54 ± 0.02      .\wvunpack --threads=3   
7.57 ± 0.04      .\wvunpack --threads=1   
11.27 ± 0.09      ffmpeg -threads 1

A couple of remarks:
* I ran ffmpeg and wvunpack alternating with the same threads count, so that none of them should be disadvantaged over the CPU having just reacted to something (throttling or whatever). To make sure, I ran it twice: once with ffmpeg then wvunpack, once with wvunpack then ffmpeg.  Very small differences.
* What I didn't think of ... because I am normally juggling a few versions of encoders, I copied the current wvunpack.exe to the same directory - it shouldn't make for a time advantage?!  Maybe fix PATH next time.
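
To read the tables above as thread scaling rather than raw times, one can divide the single-threaded time by the multithreaded one. A small Python sketch using the -f rows (the numbers appear to be relative to the fastest run, which does not matter for ratios; the helper names are made up, and dividing by 8 is only a rough yardstick since the CPU is 4-core/8-thread):

```python
def speedup(t_single: float, t_multi: float) -> float:
    """How many times faster the multithreaded run is."""
    return t_single / t_multi


def efficiency(t_single: float, t_multi: float, threads: int) -> float:
    """Speedup per requested thread (1.0 would be perfect scaling)."""
    return speedup(t_single, t_multi) / threads


# Times for the -f file from the table above: {threads: time}
table_f = {
    "wvunpack": {1: 3.65, 8: 1.00},
    "ffmpeg": {1: 4.94, 8: 1.13},
}

for tool, t in table_f.items():
    s = speedup(t[1], t[8])
    e = efficiency(t[1], t[8], 8)
    print(f"{tool}: {s:.2f}x faster with 8 threads, efficiency {e:.0%}")
```

On these figures ffmpeg gains more from threading than wvunpack does, but from a slower single-threaded start, so wvunpack still ends up ahead.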



WavPack -vv fast verification redux:
TL;DR:
Put the --threads differences down to my system not being more consistent than the data below ... arguably, less consistent.

In case anyone is interested, I dump the hyperfine output in here.
Same files. Ran with and without threads (and a couple extra to check for inconsistencies, not included; one was off, but betrayed by a high stdev).

USB3 spinning drive. Fastest was 105.1 ms ±   2.8 ms
Code:
  .wvunpack ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q ran
    1.01 ± 0.03 times faster than .wvunpack --threads=7 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
    1.02 ± 0.03 times faster than .wvunpack --threads=4 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
    1.08 ± 0.03 times faster than .wvunpack ¨3x.wv.-hhx3.wv -vv -z0q
    1.08 ± 0.03 times faster than .wvunpack --threads=7 ¨3x.wv.-hx2.wv -vv -z0q
    1.09 ± 0.03 times faster than .wvunpack --threads=4 ¨3x.wv.-hhx3.wv -vv -z0q
    1.10 ± 0.04 times faster than .wvunpack --threads=4 ¨3x.wv.-hx2.wv -vv -z0q
    1.10 ± 0.03 times faster than .wvunpack ¨3x.wv.-hx2.wv -vv -z0q
    1.10 ± 0.06 times faster than .wvunpack --threads=7 ¨3x.wv.-hhx3.wv -vv -z0q
    1.21 ± 0.03 times faster than .wvunpack --threads=4 ¨3x.wv.-gx1.wv -vv -z0q
    1.21 ± 0.03 times faster than .wvunpack --threads=7 ¨3x.wv.-gx1.wv -vv -z0q
    1.22 ± 0.05 times faster than .wvunpack ¨3x.wv.-gx1.wv -vv -z0q
    1.23 ± 0.03 times faster than .wvunpack ¨3x.wv.-fx0.wv -vv -z0q
    1.24 ± 0.04 times faster than .wvunpack --threads=4 ¨3x.wv.-fx0.wv -vv -z0q
    1.26 ± 0.08 times faster than .wvunpack --threads=7 ¨3x.wv.-fx0.wv -vv -z0q
    1.58 ± 0.04 times faster than .wvunpack --threads=7 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
    1.58 ± 0.04 times faster than .wvunpack --threads=4 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
    1.59 ± 0.05 times faster than .wvunpack ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
Internal SSD. Fastest was 107.2 ms ±   2.2 ms. 
Code:
  .wvunpack --threads=7 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q ran
    1.00 ± 0.03 times faster than .wvunpack --threads=4 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
    1.01 ± 0.03 times faster than .wvunpack ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
    1.09 ± 0.03 times faster than .wvunpack --threads=7 ¨3x.wv.-hx2.wv -vv -z0q
    1.09 ± 0.03 times faster than .wvunpack ¨3x.wv.-hhx3.wv -vv -z0q
    1.10 ± 0.03 times faster than .wvunpack --threads=4 ¨3x.wv.-hhx3.wv -vv -z0q
    1.10 ± 0.03 times faster than .wvunpack --threads=4 ¨3x.wv.-hx2.wv -vv -z0q
    1.10 ± 0.03 times faster than .wvunpack --threads=7 ¨3x.wv.-hhx3.wv -vv -z0q
    1.11 ± 0.03 times faster than .wvunpack ¨3x.wv.-hx2.wv -vv -z0q
    1.21 ± 0.04 times faster than .wvunpack --threads=4 ¨3x.wv.-fx0.wv -vv -z0q
    1.22 ± 0.03 times faster than .wvunpack ¨3x.wv.-gx1.wv -vv -z0q
    1.23 ± 0.03 times faster than .wvunpack --threads=4 ¨3x.wv.-gx1.wv -vv -z0q
    1.24 ± 0.06 times faster than .wvunpack ¨3x.wv.-fx0.wv -vv -z0q
    1.25 ± 0.04 times faster than .wvunpack --threads=7 ¨3x.wv.-fx0.wv -vv -z0q
    1.26 ± 0.05 times faster than .wvunpack --threads=7 ¨3x.wv.-gx1.wv -vv -z0q
    1.55 ± 0.04 times faster than .wvunpack --threads=4 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
    1.55 ± 0.04 times faster than .wvunpack ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
    1.57 ± 0.05 times faster than .wvunpack --threads=7 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-26 23:02:01
More:

* As for running quiet: hyperfine suppresses console output, and a bit of non-rigorous testing indicates that running quiet doesn't matter for ffmpeg when tested by hyperfine (it could matter in real-life ...).
But wvunpack runs the percent progress in the window's title bar, and hyperfine does not suppress that!


flac then. @ktf :
* I will re-run when I am back at the same computer. The chance for human error is certainly positive - like, copy + paste suddenly overwrote something in a spreadsheet. I haven't automated this.
* The files that ffmpeg rejects are not just single-frame ones - the 6ch (subset!) file I posted at https://hydrogenaud.io/index.php/topic,125848 is a full second long. What happened with the attachment posted in this thread was that I traversed the possible -b's and found where the trouble started - then I cut the source down to the precise number of samples to see if "single frame" would still trigger it. Of course, I carelessly posted the last file created.
* Question: it seems that reference flac omits MD5 calculation from -t and -d if the file has no MD5 in STREAMINFO? But there is no way to force it not to otherwise?
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-04-29 08:17:18
FLAC: BLUNT BLUNDER on me  :-[

@korth as mod: I don't want to "shield myself against my own mistakes" here, but if you think it is OK - it being on a previous page with misleading numbers right in users' faces - would you maybe please moderate in an extra first line in Reply #13 (https://hydrogenaud.io/index.php/topic,125791.msg1043525.html#msg1043525), such as the following:
Mod note: Porcus facepalms and suggests to read Reply #28 (https://hydrogenaud.io/index.php/topic,125791.msg1043741.html#msg1043741) for correction


Anyway, after having tried ffmpeg -threads this and that and around, I found the mistake not in the ffmpeg command, but in the flac command. It is even in the codebox, where flac was run with options -fo NUL ... without "-d". So it spent time re-encoding to FLAC. Thank you to @ktf for spotting the anomaly.

Here are some hopefully more sane numbers, with reference FLAC (1.4.2 was used ... for dumb reasons) beating ffmpeg -threads 1.
Decoding times on SSD to NUL, the 1.060 seconds means 1774x real-time

Encoded with -0b65535 --no-md5 --lax
 1.060 s ±  0.009 s    flac (1.4.2)
 1.278 s ±  0.008 s    ffmpeg -threads 1
 0.842 s ±  0.007 s    ffmpeg -threads 2
 0.591 s ±  0.030 s    ffmpeg -threads 3
 0.522 s ±  0.010 s    ffmpeg -threads 4
 0.506 s ±  0.012 s    ffmpeg -threads 6
 0.538 s ±  0.007 s    ffmpeg, default threads (detects all eight)
 
Encoded with -0r0 --no-md5, reference FLAC single-threaded beats ffmpeg -threads 3
 1.144 s ±  0.015 s    flac (1.4.2)
 1.799 s ±  0.005 s    ffmpeg -threads 1
 1.642 s ±  0.005 s    ffmpeg -threads 2
 1.163 s ±  0.007 s    ffmpeg -threads 3
 0.998 s ±  0.014 s    ffmpeg -threads 4
 0.981 s ±  0.015 s    ffmpeg -threads 6
 1.019 s ±  0.012 s    ffmpeg, default
 
Encoded with default -5   
 1.552 s ±  0.011 s    flac (1.4.2)
 1.967 s ±  0.006 s    ffmpeg -threads 1
 1.290 s ±  0.012 s    ffmpeg -threads 2
 0.899 s ±  0.008 s    ffmpeg -threads 3
 0.725 s ±  0.003 s    ffmpeg -threads 4
 0.654 s ±  0.052 s    ffmpeg -threads 6
 0.619 s ±  0.006 s    ffmpeg, default
 
Encoded with -8pl32 -r8 --lax
 2.056 s ±  0.019 s    flac (1.4.2)
 2.818 s ±  0.006 s    ffmpeg -threads 1
 1.746 s ±  0.029 s    ffmpeg -threads 2
 1.227 s ±  0.021 s    ffmpeg -threads 3
 1.038 s ±  0.003 s    ffmpeg -threads 4
 0.784 s ±  0.004 s    ffmpeg -threads 6
 0.739 s ±  0.013 s    ffmpeg, default
 
Encoded with -8b16
 5.476 s ±  0.034 s    flac (1.4.2)
24.328 s ±  0.186 s    ffmpeg -threads 1 <------- ooh bad
61.104 s ±  0.582 s    ffmpeg -threads 2 <------- and the "even worse" starts already here!
61.290 s ±  0.501 s    ffmpeg -threads 3
60.371 s ±  0.438 s    ffmpeg -threads 4
60.236 s ±  0.709 s    ffmpeg -threads 6
61.878 s ±  0.492 s    ffmpeg, default

There is not much gained above 4 threads (this is a 4-core 8-thread CPU) except the 8pl32-etc. file.
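
To put the tables above on one scale, here is a small Python sketch computing how much longer ffmpeg -threads 1 takes than reference flac for each encoding, plus an x-realtime conversion. The times are copied from the tables; the file duration is NOT stated in the post and is back-calculated here from "1.060 seconds means 1774x real-time", so treat it as an assumption.

```python
# (flac 1.4.2, ffmpeg -threads 1) decoding times in seconds, from the tables
times = {
    "-0b65535 --no-md5 --lax": (1.060, 1.278),
    "-0r0 --no-md5": (1.144, 1.799),
    "-5": (1.552, 1.967),
    "-8pl32 -r8 --lax": (2.056, 2.818),
    "-8b16": (5.476, 24.328),
}

# Assumption: duration inferred from 1.060 s being 1774x real-time
DURATION_S = 1.060 * 1774


def x_realtime(wall_s: float) -> float:
    """Decoding speed as a multiple of playback speed."""
    return DURATION_S / wall_s


for setting, (flac_s, ffmpeg_s) in times.items():
    ratio = ffmpeg_s / flac_s
    print(f"{setting}: ffmpeg takes {ratio:.2f}x the time of flac "
          f"({x_realtime(flac_s):.0f}x vs {x_realtime(ffmpeg_s):.0f}x realtime)")
```

The -8b16 row is the outlier: ffmpeg needs more than four times as long there, against a factor of roughly 1.2 to 1.6 for the other settings.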

Commands given:
flac <file> -ss -dfo NUL
ffmpeg -threads <T> -i <file> -hide_banner -loglevel error -f wav -y NUL
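
Since the earlier blunder was a missing "-d", it may be safer to generate the command lines than to type them. A minimal Python sketch that builds exactly the two commands given above for a sweep over thread counts; the function names and the file name are made up for illustration, and os.devnull is used in place of NUL so the same sketch works outside Windows:

```python
import os

NULL_DEVICE = os.devnull  # "nul" on Windows, "/dev/null" elsewhere


def flac_decode_cmd(path: str) -> list[str]:
    # flac <file> -ss -dfo NUL  (the -d is what was missing the first time)
    return ["flac", path, "-ss", "-dfo", NULL_DEVICE]


def ffmpeg_decode_cmd(path: str, threads=None) -> list[str]:
    # ffmpeg -threads <T> -i <file> -hide_banner -loglevel error -f wav -y NUL
    cmd = ["ffmpeg"]
    if threads is not None:  # None means: let ffmpeg pick its default
        cmd += ["-threads", str(threads)]
    cmd += ["-i", path, "-hide_banner", "-loglevel", "error",
            "-f", "wav", "-y", NULL_DEVICE]
    return cmd


def sweep(path: str, thread_counts=(1, 2, 3, 4, 6)) -> list[list[str]]:
    """All commands for one file: flac, ffmpeg per thread count, ffmpeg default."""
    cmds = [flac_decode_cmd(path)]
    cmds += [ffmpeg_decode_cmd(path, t) for t in thread_counts]
    cmds.append(ffmpeg_decode_cmd(path))
    return cmds


for cmd in sweep("test.flac"):  # hypothetical file name
    print(" ".join(cmd))
```

Each list can be passed to a timing tool or to subprocess.run as is.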
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: ktf on 2024-04-30 15:45:29
In a way, this is a pity of course, it would have been great if the reference FLAC decoder could learn something from a different implementation and get (much) faster.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: mycroft on 2024-04-30 16:27:24
FLAC is pity and misery and seditious.

I already told you multiple times, but you ignore it and force your brain-dead ideas.

Latest FFmpeg cli tool will be extremely slow with small packets (small number of samples encoded per frame/packet). If you use the example tool from the ffmpeg repo, or code your own app without brain-dead ideas like the current ffmpeg mt implementation, it will be 10000% faster than pity flac tool.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: bryant on 2024-05-01 00:40:17
Decoding speeds again. Tested more multithreading.
TL;DR: wvunpack --threads=<N> beats ffmpeg -threads <N> at decoding - with one exception where block size was forced to maximum.
So a couple things about this. First, I guess you’re using --threads because you have 8-thread hardware. In WavPack’s case the performance continues to improve even when the requested number of threads exceeds the physical threads. I just did an experiment on my 8-thread machine and got a 10% speed improvement (but 10% more total processor time) going from --threads=7 to --threads=12. Of course with thermal throttling and other factors, actual mileage may vary.

FFmpeg does not behave this way, and seems to detect the number of physical threads and ignores specification beyond that.

Interesting about the reduced performance with extra long frames, but it has a simple explanation. To achieve the temporal multithreading, sufficiently large buffers must be provided to libwavpack, and the command-line programs calculate these based on the requested number of threads and the normal frame lengths. This is done because unfortunately there’s no API provided to determine the actual frame length (this is abstracted away from the library client, and can change from frame to frame), so the best we can do is guess. I would strongly recommend that extra long frames are not used to achieve better compression!  :)
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: Porcus on 2024-05-07 23:23:10
@bryant on WavPack:
Whether wvunpack --threads=<N> vs ffmpeg -threads <N> is apples to apples ... I don't know. It might be "better" than comparing wvunpack --threads vs ffmpeg (spawning all), and all I could establish was that thread count matters enough to tilt the numbers.


@ktf on FLAC:
Sure, if only we could just pick up some magic and speed up stuff. Anyway, I made a mess out of it; the first time, I was not sure whether ffmpeg -threads 1 meant "single thread" or "one worker in addition to the bookkeeping/parsing", so that is part of my lame excuses.
But:
I got something that you could maybe speed up: it seems 1.3.x handles small blocks faster than 1.4.x. Since I got this wrong AND didn't stay completely consistent about which flac executable I used, I ran it together with a test to see where ffmpeg stops making a fool of itself (that was at block sizes below those of reference's -0/-1/-2, which is good):
(https://i.imgur.com/zaCPpiy.png)
What I did: explained over in the FLAC test thread: https://hydrogenaud.io/index.php/topic,123025.msg1044103.html#msg1044103 With more diagrams, including 1.2.1. And encoding.

Anyway, ffmpeg starts behaving "more normal" at "normal" block sizes.
For reference flac it might look surprising that -5 and -0 make so little difference. But most of the graph is for (too!) small block sizes, and apparently the time penalty for processing those overrides the calculation job.




@mycroft :
"seditious" sounds like you just learned a new bad word and were waiting to use it ... couldn't you be constructive and educate instead?
For better or for worse, FLAC's design was frozen long ago, and for now it remains the biggest player in the lossless audio files market, at least disregarding silver discs.
But good that you actually contribute by repairing bad code, and not only whine toxicity.
Title: Re: Tested: Lossless decoding speed, multithreaded - and fast verification
Post by: bryant on 2024-05-09 17:37:16
@bryant on WavPack:
Whether wvunpack --threads=<N> vs ffmpeg -threads <N> is apples to apples ... I don't know. It might be "better" than comparing wvunpack --threads vs ffmpeg (spawning all), and all I could establish was that thread count matters enough to tilt the numbers.
I believe, based on my experiments, that FFmpeg -threads <n> specifies the total number of threads to use, which is identical to WavPack’s interpretation.

When specifying nothing (option alone) WavPack simply defaults to 5 threads (a good compromise on all machines, if not necessarily the fastest). FFmpeg either determines the number of physical threads and uses that (my guess), or uses unlimited threads (like “make”). The option is there to limit the use of threading only.

Another difference is specifying zero; FFmpeg ignores this (I guess it's the default) whereas WavPack disallows it.
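
The thread-count semantics described in the last few posts can be summarised as a toy model in Python. This is only a sketch of the behaviour as described in this thread, not code from either project: the function names are invented, and the FFmpeg part encodes what is explicitly called a guess above (default = number of hardware threads, requests beyond that ignored).

```python
WAVPACK_DEFAULT_THREADS = 5  # "--threads" given alone, per the post above


def wavpack_threads(requested=None) -> int:
    """Threads WavPack would use when the --threads option is present.

    requested is None when the option is given without a number.
    """
    if requested is None:
        return WAVPACK_DEFAULT_THREADS
    if requested == 0:
        raise ValueError("WavPack disallows --threads=0")
    return requested  # not capped at the hardware thread count


def ffmpeg_threads(requested, hw_threads: int) -> int:
    """Threads FFmpeg would use, under the guess stated in this thread."""
    if requested is None or requested == 0:
        return hw_threads  # zero is ignored, i.e. treated like the default
    return min(requested, hw_threads)  # requests beyond hardware are ignored


# On an 8-thread machine, asking for 12 threads:
print(wavpack_threads(12))    # 12
print(ffmpeg_threads(12, 8))  # 8
```

Under this model, requesting the same N from both tools is a like-for-like comparison only while N stays within the hardware thread count.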