Topic: Tested: Lossless decoding speed, multithreaded - and fast verification

Re: Tested: Lossless decoding speed, multithreaded - and fast verification

Reply #25
* ffmpeg -threads 1 decodes nearly twice as fast as reference flac at -5, but several times slower at low block sizes
Of course I dove into that, because if FLAC decoding can be twice as fast, that would be great! But I cannot reproduce it, neither on Linux nor on Windows, neither on SSD nor on ramdisk. Can you check your results and see whether a different way of collecting times gives you the same result?
My results (see reply #7 above) show FFmpeg -threads 1 decoding right in between FLAC with md5 and FLAC without md5.

As Porcus mentioned, FFmpeg doesn't seem to pay attention to whether the FLAC file has an MD5, but the question is whether it always calculates the sum (in which case it is faster than FLAC) or never does (in which case it is slower). My guess would be that it never calculates it.

Single-threaded FFmpeg being slower than native WavPack could be explained by its lacking the ASM optimizations, which really wouldn't make sense to maintain for such a niche format.

Re: Tested: Lossless decoding speed, multithreaded - and fast verification

Reply #26
WavPack again. Only .wv in this post: wvunpack and ffmpeg.


-z0q and time: does console output slow it down?
TL;DR: -vv is so fast that one can argue that yes it does.

An initial test wasn't dramatic: hyperfine on a single untagged -f-encoded file indicated that console output costs about 0.05 seconds when decoding to NUL, and 0.02 seconds on a fast verify.
But then: since wvunpack can only be tested against itself on this matter, and it supports wildcards, I took two actual albums, with tags and all, in separate tracks, on a spinning drive: a full CD of Bach's organ works, 75 minutes, 652 kbit/s, 28 tracks/files; and Black Sabbath s/t, 38:14, a CD split into 7 tracks/files, 903 kbit/s. Both encoded with -x6.

Fast verification: -qz0 is significantly faster in percentage terms. 0.1 seconds isn't much, but if one has a full TB of CDs (like 3000 of them), it will add up to a few minutes. 25 runs with two seconds of sleep in between:
wvunpack *.wv -vv -qz0: 304 ms on Bach and 187 ms on Black Sabbath
wvunpack *.wv -vv : slower by 103 ms (36 percent) and 81 ms (46 percent), respectively
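Roughly how that can be set up with hyperfine - a sketch, not necessarily the exact invocation used (Windows cmd assumed for the pause command; use "sleep 2" elsewhere):
Code:
  :: 25 timed runs of each command, with a two-second pause before every run
  hyperfine --runs 25 --prepare "timeout /t 2 /nobreak > NUL" ^
    "wvunpack *.wv -vv -qz0" "wvunpack *.wv -vv"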

Decoding to -o - > NUL: since that takes more time, the percentages aren't as impressive, of course. The differences look suspicious to be honest, but this is 19 runs with not too outrageous standard deviations:
wvunpack *.wv -z0qy -o - > NUL : 9.265 s ± 0.024 s and 4.917 s ± 0.025 s, respectively
wvunpack *.wv -yo - > NUL : 9.698 s ± 0.032 s and 5.039 s ± 0.013 s, respectively
Actually writing the output to the same spinning drive would bring those to around 30 and 15 seconds, which kind of sets the perspective.
But decoding is now so fast that progress reporting seems to amount to something if one is just scanning through files.


Decoding speeds again. Tested more multithreading.
TL;DR: wvunpack --threads=<N> beats ffmpeg -threads <N> at decoding - with one exception where block size was forced to maximum.

I did the "shorter" half-hour file. Commands were like the following:
ffmpeg -threads 8 -i   ¨3x.wv.-fx0.wv   -hide_banner -loglevel error -f wav -y NUL
.\wvunpack --threads=8   ¨3x.wv.-fx0.wv   -z0qyo -o - > NUL 
the latter being the fastest. -fx0 is in the filename because that was the encoding option (it means -f; -x0 in new releases means "no x", which is handy for FOR looping). And yes, there are two "o" options because I didn't spot the duplicate, but wvunpack didn't object.
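For the record, a hypothetical batch sketch of that kind of FOR loop (the settings list and filenames are illustrative, not the exact set used here):
Code:
  :: in a .cmd file; use single % on the interactive command line
  for %%s in (-fx0 -gx1 -hx2 -hhx3) do wavpack %%s "3x.wav" "3x.wv.%%s.wv"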

Times first, mean ± stdev - having sorted the output by the .wv file and then by speed, fastest at top:

-f:
1.00 .\wvunpack --threads=8   
1.09 ± 0.00      .\wvunpack --threads=7   
1.13 ± 0.01      ffmpeg -threads 8
1.20 ± 0.01      ffmpeg -threads 7
1.54 ± 0.02      .\wvunpack --threads=4   
1.90 ± 0.01      .\wvunpack --threads=3   
1.95 ± 0.05      ffmpeg -threads 4
2.35 ± 0.05      ffmpeg -threads 3
3.65 ± 0.01      .\wvunpack --threads=1   
4.94 ± 0.06      ffmpeg -threads 1

-x:
1.23 ± 0.04      .\wvunpack --threads=8   
1.31 ± 0.01      .\wvunpack --threads=7   
1.46 ± 0.01      ffmpeg -threads 8
1.55 ± 0.01      ffmpeg -threads 7
1.86 ± 0.03      .\wvunpack --threads=4   
2.30 ± 0.02      .\wvunpack --threads=3   
2.60 ± 0.01      ffmpeg -threads 4
2.99 ± 0.04      ffmpeg -threads 3
4.53 ± 0.02      .\wvunpack --threads=1   
6.35 ± 0.02      ffmpeg -threads 1

Ran, but omitted from this list: -x --blocksize=4096, to see if block size mattered. Not much, except ffmpeg -threads 1 went up to 6.93.

-hx2:
1.70 ± 0.01      .\wvunpack --threads=8   
1.80 ± 0.01      .\wvunpack --threads=7   
2.00 ± 0.01      ffmpeg -threads 8
2.19 ± 0.18      ffmpeg -threads 7
2.48 ± 0.02      .\wvunpack --threads=4   
3.08 ± 0.02      .\wvunpack --threads=3   
3.43 ± 0.03      ffmpeg -threads 4
3.91 ± 0.03      ffmpeg -threads 3
5.95 ± 0.03      .\wvunpack --threads=1   
8.79 ± 0.06      ffmpeg -threads 1

-hhx3:
2.21 ± 0.08      .\wvunpack --threads=8   
2.27 ± 0.02      .\wvunpack --threads=7   
2.66 ± 0.03      ffmpeg -threads 8
2.82 ± 0.01      ffmpeg -threads 7
3.17 ± 0.03      .\wvunpack --threads=4   
3.92 ± 0.02      .\wvunpack --threads=3   
4.52 ± 0.02      ffmpeg -threads 4
5.33 ± 0.11      ffmpeg -threads 3
7.67 ± 0.02      .\wvunpack --threads=1   
11.77 ± 0.05      ffmpeg -threads 1

And finally, ffmpeg wins this one, with block size forced to maximum (except single-threaded):
-hhx4 --blocksize=131072
2.57 ± 0.01      ffmpeg -threads 8
2.75 ± 0.02      ffmpeg -threads 7
3.12 ± 0.06      .\wvunpack --threads=8   
3.40 ± 0.03      .\wvunpack --threads=7   
3.94 ± 0.05      ffmpeg -threads 4
4.84 ± 0.05      ffmpeg -threads 3
4.95 ± 0.03      .\wvunpack --threads=4   
5.54 ± 0.02      .\wvunpack --threads=3   
7.57 ± 0.04      .\wvunpack --threads=1   
11.27 ± 0.09      ffmpeg -threads 1

A couple of remarks:
* I ran ffmpeg and wvunpack alternating with the same thread count, so that neither should be disadvantaged by the CPU having just reacted to something (throttling or whatever). To make sure, I ran it twice: once with ffmpeg before wvunpack, once with wvunpack before ffmpeg. Very small differences. (A sketch of the alternation follows below.)
* What I didn't think of: because I am normally juggling a few versions of encoders, I had copied the current wvunpack.exe to the same directory - that shouldn't make for a timing advantage?! Maybe fix PATH next time.
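Since the alternation is the point, a hypothetical batch sketch of it (pass count and file illustrative; the actual timing was collected separately):
Code:
  :: decode with both tools back to back, several passes, so neither always
  :: lands on a freshly throttled (or freshly boosted) CPU
  for /L %%i in (1,1,10) do (
    ffmpeg -threads 8 -i "¨3x.wv.-fx0.wv" -hide_banner -loglevel error -f wav -y NUL
    .\wvunpack --threads=8 "¨3x.wv.-fx0.wv" -z0qy -o - > NUL
  )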



WavPack -vv fast verification redux:
TL;DR:
Put the --threads differences down to my system not being any more consistent than the data below ... arguably, less consistent.

In case anyone is interested, I'll dump the hyperfine output here.
Same files. Ran with and without --threads (and a couple of extra runs to check for inconsistencies, not included; one was off, but betrayed by a high stdev).

USB3 spinning drive. Fastest was 105.1 ms ±   2.8 ms
Code:
  .\wvunpack ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q ran
    1.01 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
    1.02 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
    1.08 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hhx3.wv -vv -z0q
    1.08 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-hx2.wv -vv -z0q
    1.09 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hhx3.wv -vv -z0q
    1.10 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-hx2.wv -vv -z0q
    1.10 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hx2.wv -vv -z0q
    1.10 ± 0.06 times faster than .\wvunpack --threads=7 ¨3x.wv.-hhx3.wv -vv -z0q
    1.21 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-gx1.wv -vv -z0q
    1.21 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-gx1.wv -vv -z0q
    1.22 ± 0.05 times faster than .\wvunpack ¨3x.wv.-gx1.wv -vv -z0q
    1.23 ± 0.03 times faster than .\wvunpack ¨3x.wv.-fx0.wv -vv -z0q
    1.24 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-fx0.wv -vv -z0q
    1.26 ± 0.08 times faster than .\wvunpack --threads=7 ¨3x.wv.-fx0.wv -vv -z0q
    1.58 ± 0.04 times faster than .\wvunpack --threads=7 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
    1.58 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
    1.59 ± 0.05 times faster than .\wvunpack ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
Internal SSD. Fastest was 107.2 ms ±   2.2 ms. 
Code:
  .\wvunpack --threads=7 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q ran
    1.00 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
    1.01 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hhx4--blocksize=131072.wv -vv -z0q
    1.09 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-hx2.wv -vv -z0q
    1.09 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hhx3.wv -vv -z0q
    1.10 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hhx3.wv -vv -z0q
    1.10 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-hx2.wv -vv -z0q
    1.10 ± 0.03 times faster than .\wvunpack --threads=7 ¨3x.wv.-hhx3.wv -vv -z0q
    1.11 ± 0.03 times faster than .\wvunpack ¨3x.wv.-hx2.wv -vv -z0q
    1.21 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-fx0.wv -vv -z0q
    1.22 ± 0.03 times faster than .\wvunpack ¨3x.wv.-gx1.wv -vv -z0q
    1.23 ± 0.03 times faster than .\wvunpack --threads=4 ¨3x.wv.-gx1.wv -vv -z0q
    1.24 ± 0.06 times faster than .\wvunpack ¨3x.wv.-fx0.wv -vv -z0q
    1.25 ± 0.04 times faster than .\wvunpack --threads=7 ¨3x.wv.-fx0.wv -vv -z0q
    1.26 ± 0.05 times faster than .\wvunpack --threads=7 ¨3x.wv.-gx1.wv -vv -z0q
    1.55 ± 0.04 times faster than .\wvunpack --threads=4 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
    1.55 ± 0.04 times faster than .\wvunpack ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
    1.57 ± 0.05 times faster than .\wvunpack --threads=7 ¨3x.wv.-gx1--blocksize=4096.wv -vv -z0q
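(Relative rankings like the above are what hyperfine prints when handed several commands in one invocation - a sketch with just two of the variants:)
Code:
  hyperfine ".\wvunpack ¨3x.wv.-fx0.wv -vv -z0q" ".\wvunpack --threads=7 ¨3x.wv.-fx0.wv -vv -z0q"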


Re: Tested: Lossless decoding speed, multithreaded - and fast verification

Reply #27
More:

* As for running quiet: hyperfine suppresses console output, and a bit of non-rigorous testing indicates that running quiet doesn't matter for ffmpeg when tested by hyperfine (it could matter in real-life ...).
But wvunpack displays the percent progress in the window's title bar, and hyperfine does not suppress that!


flac then. @ktf:
* I will re-run when I am back at the same computer. The chance of human error is certainly nonzero - like, a copy + paste suddenly overwrote something in a spreadsheet. I haven't automated this.
* The files that ffmpeg rejects are not only the single-frame ones - the 6ch (subset!) file I posted at https://hydrogenaud.io/index.php/topic,125848 is a full second. What happened with the attachment posted in this thread: I traversed the possible -b values and found where the trouble started - then I cut the source down to the precise number of samples to see whether a "single frame" would still trigger it. Of course, I carelessly posted the last file created.
* Question: it seems that reference flac omits the MD5 calculation from -t and -d if the file has no MD5 in STREAMINFO? But there is no way to force it to skip the calculation otherwise?

Re: Tested: Lossless decoding speed, multithreaded - and fast verification

Reply #28
FLAC: BLUNT BLUNDER on me  :-[

@korth as mod: I don't want to "shield myself against my own mistakes" here, but if you think it is OK - the post being on a previous page with misleading numbers right in users' faces - would you maybe please moderate in an extra first line in Reply #13, e.g. like the following:
Mod note: Porcus facepalms and suggests reading Reply #28 for the correction


Anyway, after having tried ffmpeg -threads this and that and around, I found the mistake - not in the ffmpeg command, but in the flac command. It is even in the codebox: flac was run with options -fo NUL ... without "-d". So it spent time re-encoding to FLAC. Thank you to @ktf for spotting the anomaly.
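Spelled out, the difference (filename illustrative, output to NUL as in the codebox):
Code:
  :: the blunder: without -d, this re-encodes the FLAC input to FLAC
  flac test.flac -fo NUL
  :: the fix: -d actually decodes
  flac test.flac -dfo NUL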

Here are some hopefully saner numbers, where reference FLAC (1.4.2 was used ... for dumb reasons) beats ffmpeg -threads 1.
Decoding times on SSD to NUL; the 1.060 seconds means 1774x real-time.
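(Sanity check on that figure: 1.060 s × 1774 ≈ 1880 s, i.e. about 31 minutes of audio - consistent with the half-hour test file.)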

Encoded with -0b65535 --no-md5 --lax
 1.060 s ±  0.009 s    flac (1.4.2)
 1.278 s ±  0.008 s    ffmpeg -threads 1
 0.842 s ±  0.007 s    ffmpeg -threads 2
 0.591 s ±  0.030 s    ffmpeg -threads 3
 0.522 s ±  0.010 s    ffmpeg -threads 4
 0.506 s ±  0.012 s    ffmpeg -threads 6
 0.538 s ±  0.007 s    ffmpeg, default threads (detects all eight)
 
Encoded with -0r0 --no-md5, reference FLAC single-threaded beats ffmpeg -threads 3
 1.144 s ±  0.015 s    flac (1.4.2)
 1.799 s ±  0.005 s    ffmpeg -threads 1
 1.642 s ±  0.005 s    ffmpeg -threads 2
 1.163 s ±  0.007 s    ffmpeg -threads 3
 0.998 s ±  0.014 s    ffmpeg -threads 4
 0.981 s ±  0.015 s    ffmpeg -threads 6
 1.019 s ±  0.012 s    ffmpeg, default
 
Encoded with default -5   
 1.552 s ±  0.011 s    flac (1.4.2)
 1.967 s ±  0.006 s    ffmpeg -threads 1
 1.290 s ±  0.012 s    ffmpeg -threads 2
 0.899 s ±  0.008 s    ffmpeg -threads 3
 0.725 s ±  0.003 s    ffmpeg -threads 4
 0.654 s ±  0.052 s    ffmpeg -threads 6
 0.619 s ±  0.006 s    ffmpeg, default
 
Encoded with -8pl32 -r8 --lax
 2.056 s ±  0.019 s    flac (1.4.2)
 2.818 s ±  0.006 s    ffmpeg -threads 1
 1.746 s ±  0.029 s    ffmpeg -threads 2
 1.227 s ±  0.021 s    ffmpeg -threads 3
 1.038 s ±  0.003 s    ffmpeg -threads 4
 0.784 s ±  0.004 s    ffmpeg -threads 6
 0.739 s ±  0.013 s    ffmpeg, default
 
Encoded with -8b16
 5.476 s ±  0.034 s    flac (1.4.2)
24.328 s ±  0.186 s    ffmpeg -threads 1 <------- ooh bad
61.104 s ±  0.582 s    ffmpeg -threads 2 <------- and the "even worse" starts already here!
61.290 s ±  0.501 s    ffmpeg -threads 3
60.371 s ±  0.438 s    ffmpeg -threads 4
60.236 s ±  0.709 s    ffmpeg -threads 6
61.878 s ±  0.492 s    ffmpeg, default

There is not much gained above 4 threads (this is a 4-core, 8-thread CPU), except for the -8pl32 etc. file.

Commands given:
flac <file> -ss -dfo NUL
ffmpeg -threads <T> -i <file> -hide_banner -loglevel error -f wav -y NUL
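The numbers above are hyperfine-style; a sketch of how the two commands can be raced head to head (filename and thread count illustrative):
Code:
  hyperfine --warmup 3 "flac test.flac -ss -dfo NUL" ^
    "ffmpeg -threads 4 -i test.flac -hide_banner -loglevel error -f wav -y NUL"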

Re: Tested: Lossless decoding speed, multithreaded - and fast verification

Reply #29
In a way this is a pity, of course; it would have been great if the reference FLAC decoder could learn something from a different implementation and get (much) faster.

Re: Tested: Lossless decoding speed, multithreaded - and fast verification

Reply #30
FLAC is pity and misery and seditious.

I already told you multiple times, but you ignore it and force your brain-dead ideas.

The latest FFmpeg CLI tool will be extremely slow with small packets (a small number of samples encoded per frame/packet). If you use the example tool from the ffmpeg repo, or code your own app without brain-dead ideas like the current ffmpeg mt implementation, it will be 10000% faster than the pitiful flac tool.
Please remove my account from this forum.

Re: Tested: Lossless decoding speed, multithreaded - and fast verification

Reply #31
Decoding speeds again. Tested more multithreading.
TL;DR: wvunpack --threads=<N> beats ffmpeg -threads <N> at decoding - with one exception where block size was forced to maximum.
So, a couple of things about this. First, I guess you're using --threads because you have 8-thread hardware. In WavPack's case the performance continues to improve even when the requested number of threads exceeds the physical threads. I just did an experiment on my 8-thread machine and got a 10% speed improvement (but 10% more total processor time) going from --threads=7 to --threads=12. Of course, with thermal throttling and other factors, actual mileage may vary.
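A sketch of that kind of experiment, hyperfine-style (filename illustrative):
Code:
  hyperfine ".\wvunpack --threads=7 test.wv -z0qy -o - > NUL" ^
    ".\wvunpack --threads=12 test.wv -z0qy -o - > NUL"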

FFmpeg does not behave this way, and seems to detect the number of physical threads and ignores specification beyond that.

Interesting about the reduced performance with extra long frames, but it has a simple explanation. To achieve the temporal multithreading, sufficiently large buffers must be provided to libwavpack, and the command-line programs calculate these based on the requested number of threads and the normal frame lengths. This is done because unfortunately there's no API provided to determine the actual frame length (it's abstracted away from the library client, and can change from frame to frame), so the best we can do is guess. When the actual frames are much longer than that guess, fewer of them fit in the buffer, so fewer can be decoded in parallel. I would strongly recommend not using extra long frames to achieve better compression!  :)