Final release of TAK 2.3.3 ((T)om's lossless (A)udio (K)ompressor)
This release brings a 64-bit decoder library and better Unicode support for the GUI version.
It consists of:
- TAK Decoder library 2.3.3 (x86/x64)
- A 64-bit decoder library for the SDK.
- I have also created 64-bit versions of the applications. As expected, they are bigger and slower without any advantage. As long as Windows supports 32-bit applications I see no reason to release them. But I will continuously maintain them, so that they are ready when needed.
- Unicode support of the GUI version is no longer limited to the open file dialogs. The required switch to a newer version of my development environment is responsible for a 3.5 times bigger program file.
- Tiny encoding speed improvements of not more than 3 percent for Intel CPUs based upon the Skylake microarchitecture (6th to 10th Generation Core). I could have squeezed out more, but only at the expense of significantly slower processing on older platforms. As a rule of thumb I am taking into account CPU microarchitectures of the last 10 years.
Here are the results for my primary file set.
Test system: Intel i3-8100 (3.6 GHz / 1 Thread), Windows 10.
Preset  Encoding speed            Decoding speed
         2.3.2   2.3.3   Win %     2.3.2   2.3.3   Win %
-p0     868.72  893.94    2.90    806.84  810.57    0.46
-p0e    698.08  711.16    1.87    812.44  817.09    0.57
-p0m    418.70  425.54    1.63    815.07  819.19    0.51
-p1     726.44  748.04    2.97    787.03  788.37    0.17
-p1e    482.99  491.72    1.81    788.73  791.12    0.30
-p1m    314.46  322.68    2.61    790.61  794.60    0.50
-p2     580.58  591.20    1.83    715.05  718.12    0.43
-p2e    347.07  348.35    0.37    714.52  719.10    0.64
-p2m    203.02  206.26    1.60    715.99  718.71    0.38
-p3     301.25  306.83    1.85    697.20  703.97    0.97
-p3e    241.91  244.79    1.19    697.88  702.06    0.60
-p3m    131.89  133.65    1.33    699.12  701.40    0.33
-p4     183.23  186.54    1.81    650.44  657.64    1.11
-p4e    158.75  160.61    1.17    651.01  658.40    1.14
-p4m     82.49   83.73    1.50    651.21  656.84    0.86
Speed as multiple of realtime playback.
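For readers who want to reproduce the "Win %" columns: they appear to be the relative change of the new speed over the old one. A minimal sketch under that assumption (the function name `win_percent` is only illustrative and not part of TAK):

```python
def win_percent(old_speed, new_speed):
    """Relative speed change in percent; positive means the new value is faster."""
    return round((new_speed / old_speed - 1.0) * 100.0, 2)


# Encoding speed of preset -p0, version 2.3.2 vs. 2.3.3
print(win_percent(868.72, 893.94))  # 2.9

# Encoding speed of preset -p0, 32-bit vs. 64-bit build
print(win_percent(893.94, 827.49))  # -7.43
```

The same formula covers both tables; in the 32-bit versus 64-bit comparison the 32-bit speed plays the role of the "old" value.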
And to illustrate the speed disadvantage of the 64-bit version:
Preset  Encoding speed            Decoding speed
        32 bit  64 bit   Win %    32 bit  64 bit   Win %
-p0     893.94  827.49   -7.43    810.57  700.80  -13.54
-p0e    711.16  661.05   -7.05    817.09  709.15  -13.21
-p0m    425.54  398.79   -6.29    819.19  710.48  -13.27
-p1     748.04  698.29   -6.65    788.37  689.93  -12.49
-p1e    491.72  461.79   -6.09    791.12  693.09  -12.39
-p1m    322.68  303.56   -5.93    794.60  695.48  -12.47
-p2     591.20  557.27   -5.74    718.12  634.55  -11.64
-p2e    348.35  336.29   -3.46    719.10  633.90  -11.85
-p2m    206.26  193.64   -6.12    718.71  636.49  -11.44
-p3     306.83  287.58   -6.27    703.97  619.50  -12.00
-p3e    244.79  233.77   -4.50    702.06  620.18  -11.66
-p3m    133.65  123.27   -7.77    701.40  621.06  -11.45
-p4     186.54  172.30   -7.63    657.64  577.94  -12.12
-p4e    160.61  145.73   -9.26    658.40  579.34  -12.01
-p4m     83.73   76.29   -8.89    656.84  578.45  -11.93
Speed as multiple of realtime playback.
The next release should add support for the AVX2 instruction set. I achieved encoding speed improvements of about 14 percent for preset -p4m on my primary system (Intel Skylake based CPU), less for other presets. But the results of my secondary (Haswell based) system were discouraging: a maximum improvement of 8 percent for presets p4 and p4e and up to 23 percent slower encoding for p2m, p3m and p4m!
Those presets make the most use of AVX2 instructions and should also benefit the most. But they seem to trigger the automatic down-clocking mechanism of the CPU. AVX2 base and turbo frequencies are lower than the regular ones. This wouldn't hurt too much if the encoder mostly used AVX2 instructions, but that's not the case. I haven't profiled the code yet, but I would estimate that about 30 percent of the encoding time goes to AVX2 instructions. And this is not one continuous block; instead, blocks of x86/SSE2 and AVX2 instructions alternate.
That's bad, because it will cause many transitions between the different clock rates. During such transitions the speed can be much slower than the lower clock rate would suggest. After the last AVX2 instruction the lower clock rate will be maintained for a considerable amount of time, therefore the non-AVX2 instructions that follow will also be executed more slowly.
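To make the trade-off a bit more tangible, here is a deliberately crude toy model. It is not a measurement and not how TAK works internally. It assumes the worst case, in which the whole encoder ends up running at the reduced AVX2 clock. Only the 30 percent share comes from the estimate above; the speedup of the vectorised part (`avx2_speedup`) and the clock ratio (`clock_ratio`) are invented placeholder values, and the extra cost of the transitions themselves is ignored.

```python
def relative_speed(avx2_share, avx2_speedup, clock_ratio):
    """Encoding speed relative to the non-AVX2 build (>1.0 means faster).

    avx2_share   -- fraction of the original run time spent in code that gets vectorised
    avx2_speedup -- assumed speedup of that code at equal clock rate (hypothetical)
    clock_ratio  -- AVX2 clock divided by regular clock (hypothetical)
    """
    # Amdahl-style: only the vectorised share gets faster.
    new_time = (1.0 - avx2_share) + avx2_share / avx2_speedup
    # Worst case: everything, including the non-AVX2 code, runs at the lower clock.
    new_time /= clock_ratio
    return 1.0 / new_time


# No down-clocking: the AVX2 build wins.
print(round(relative_speed(0.30, 2.0, 1.00), 3))  # 1.176

# Clock drops to 80 percent for the whole run: the AVX2 build loses.
print(round(relative_speed(0.30, 2.0, 0.80), 3))  # 0.941
```

Under these assumed numbers a fairly modest clock reduction is already enough to turn a gain into a loss, which is at least consistent with the mixed results on the low-power Haswell system.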
Well, my Haswell CPU is a 35 W low-power quad core, quite a challenge for an older desktop microarchitecture. The difference between the regular and the AVX2 clock is most likely considerably bigger than for the common 65 W+ CPUs.
Nevertheless I am really hesitant to release an AVX2 version which will make encoding slower on an unknown number of older systems. And IMHO the possible advantage isn't big enough to justify an elaborate study and implementation of a CPU-dependent code path.
Currently it's not clear what I will do next. Possibly I will try to improve the encoding speed by algorithmic modifications. Ktf's latest "Lossless codec comparison" also made me think about the (re-)introduction of higher predictor counts.
Features for later versions:
- Port to Lazarus / Free Pascal. Nice for Linux support.
- Fast integrity check without decoding based upon the checksums only.
- Transcode mode.
- Tuning of the encoder for the problem files which have been reported in the past months.
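As an illustration of what a checksum-only integrity check means in general: if every frame carries a checksum over its compressed bytes, a verifier can walk the file and compare checksums without running the decoder. The sketch below uses a made-up frame layout (4-byte length, payload, 4-byte CRC-32) purely for demonstration. It is not TAK's actual container format, and `pack_frame` and `check_integrity` are hypothetical names.

```python
import struct
import zlib


def pack_frame(payload: bytes) -> bytes:
    """Hypothetical frame: little-endian length, payload, CRC-32 of the payload."""
    return struct.pack("<I", len(payload)) + payload + struct.pack("<I", zlib.crc32(payload))


def check_integrity(stream: bytes) -> bool:
    """Verify every frame checksum without decoding any audio data."""
    pos = 0
    while pos < len(stream):
        if pos + 4 > len(stream):
            return False  # truncated length field
        (length,) = struct.unpack_from("<I", stream, pos)
        end = pos + 4 + length + 4
        if end > len(stream):
            return False  # truncated payload or checksum
        payload = stream[pos + 4 : pos + 4 + length]
        (stored,) = struct.unpack_from("<I", stream, pos + 4 + length)
        if zlib.crc32(payload) != stored:
            return False  # payload does not match its stored checksum
        pos = end
    return True


stream = pack_frame(b"frame one") + pack_frame(b"frame two")
print(check_integrity(stream))  # True
```

Such a check only proves that the stored bytes are intact; it says nothing about whether the decoded audio matches the original PCM, which is what a stored MD5 of the audio data is for.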
Thanks for the new release. :)
Currently it's not clear what I will do next.
Your website is slightly out of date, listing the following items:
Support for Unicode character sets.
A German-language version.
A little more speed and compression efficiency...
Applications for platforms other than Windows.
Support for more than 6 audio channels.
- For FLAC I think there is still a bit to gain by improving the quantization of predictor coefficients, but it seems this is not applicable to TAK, as it doesn't store raw predictor coefficients like FLAC does. I'm sharing the idea just in case it does make sense.
- You could consider switching from MD5 to something faster. I've been told most SHA checksums are faster simply because they don't have a long dependency chain and can more efficiently use the superscalar properties and out-of-order execution capabilities of modern CPUs. (That would obviously break backward compatibility, but only for checksumming) For FLAC checksumming is quite a significant part of decoding CPU load (and encoding for the fastest presets), so I imagine this is also the case for TAK.
- You could consider looking into the arithmetic coder used in Daala/AV1. That is an arithmetic coder that was specifically designed to evade any existing patents. I don't know whether it is fast enough for your liking.
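Regarding the checksum suggestion above: which hash is actually faster depends a lot on the CPU and on how the hash library was built (for example, whether hardware SHA instructions are used), so it is probably best measured on the target machine. A rough sketch using Python's standard `hashlib`; the helper names and the 16 MiB buffer size are arbitrary choices, and no particular ranking is implied:

```python
import hashlib
import time


def throughput_mb_per_s(name, data, repeats=5):
    """Best-of-N hashing throughput in MB/s for the given hashlib algorithm."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(name, data).digest()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero interval on very coarse timers.
    return len(data) / max(best, 1e-9) / 1e6


def compare(names=("md5", "sha1", "sha256"), size=16 * 1024 * 1024):
    """Hash the same zero-filled buffer with each algorithm and report MB/s."""
    data = bytes(size)
    return {name: throughput_mb_per_s(name, data) for name in names}


if __name__ == "__main__":
    for name, speed in compare().items():
        print(f"{name:8s} {speed:10.1f} MB/s")
```

Comparing these figures with the decoding speed of the codec gives a first idea of how much of the total decoding time a checksum could account for.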
Of course I don't know where you would like TAK to go from here. Do you still like to make (big) changes/additions to the format, or do you want to keep things backwards-compatible?
I know you've been talking about open-sourcing, but I can imagine this is a big step. If you'd like to see TAK gain more users, you could consider contributing a bare-essentials TAK encoder to FFmpeg, for example. I would imagine that even a simple TAK encoder, without all the specific tuning and tweaks, would already beat FLAC with ease. Or, instead of open-sourcing the software, you could open up the format by creating a document describing the structure and sharing the ideas you used. Maybe someone else will then do it for you (like the FFmpeg guys did with WavPack).
Please don't feel offended or pressured to do anything, I just wanted to contribute a few ideas.
A 64-bit decoder library for the SDK.
Thank you! I'll include it with the next release of Mp3tag.
Thank you Thomas for the new release! Your details about efforts and insights are always a pleasure to read too.
It is of course welcome that CPUs of the last decade are also kept in view; I guess most of us run those. According to https://en.wikipedia.org/wiki/Advanced_Vector_Extensions, the CPU generations of the last 5-7 years seem to have AVX2 support, and newer CPUs are targeting the next-generation AVX-512.
My CPU does not have AVX2; my next one, within the next 12 months, will. Nevertheless, I personally start encoding and do not watch the progress bar. I rarely care about a few percent of encoding speed: with the current encoding speed range, which is already at an awesome level, this is not the driving factor, at least to me.
For whom might this be an 'issue' or a real downside at all? Which use cases or scenarios? That group might get smaller and smaller anyway.
IMHO it makes sense to really go for the newer instruction set with the next version, which I guess will then be something like minor version 2.4. And for the almost impossible case of a bug-fix release, publish another 2.3.x patch version of the non-AVX2 TAK.
I also like ktf's thoughts.
You could consider switching from MD5 to something faster. I've been told most SHA checksums are faster simply because they don't have a long dependency chain and can more efficiently use the superscalar properties and out-of-order execution capabilities of modern CPUs. (That would obviously break backward compatibility, but only for checksumming) For FLAC checksumming is quite a significant part of decoding CPU load (and encoding for the fastest presets), so I imagine this is also the case for TAK.
The "Fast integrity check without decoding based upon the checksums only" would resolve most of that [note1]. So much so, I think, that the downside of replacing MD5 outweighs the benefits.
In TAK and WavPack, MD5 is optional and disabled by default - in WavPack it is viewed more as a fingerprint, and even more so after having implemented the non-decoding integrity check.
From that point of view, where MD5 is an optional fingerprint, I think it is a great advantage to have the same [note2] algorithm across FLAC/TAK/WavPack(/OptimFROG if anyone cares):
* if you want something quicker, then use the default; integrity verification will be faster than decoding anyway
* if you want a checksum as a fingerprint, you presumably want the one that everyone uses (say, if you transcode: yep, every MD5 appears precisely twice, that is source and target; and if you want to use say foo_bitcompare in the end to be sure, just sort source file list and target file list by MD5).
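The fingerprint workflow from the second bullet can be sketched roughly as follows. This assumes that both the sources and the (decoded) targets are available as WAV files and hashes the raw PCM frames with Python's standard `wave` and `hashlib` modules. The function names are illustrative, and the MD5 that a codec stores in its own files may be computed differently in some cases (see [note2]).

```python
import hashlib
import wave


def pcm_md5(wav_path, chunk_frames=65536):
    """MD5 over the raw PCM frames of a WAV file (the header is not hashed)."""
    digest = hashlib.md5()
    with wave.open(wav_path, "rb") as wav:
        while True:
            frames = wav.readframes(chunk_frames)
            if not frames:
                break
            digest.update(frames)
    return digest.hexdigest()


def unmatched(source_paths, target_paths):
    """Return the source files whose PCM fingerprint has no counterpart among the targets."""
    target_sums = {pcm_md5(path) for path in target_paths}
    return [path for path in source_paths if pcm_md5(path) not in target_sums]
```

After a transcode, an empty result from `unmatched(sources, decoded_targets)` would mean that every source fingerprint shows up among the targets.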
[note1]: Here is the exception. If one wants to verify not only integrity, but to check an encoding against the original PCM, then one could make a checksum of the PCM, encode, decode and check against the checksum. Then two checksums are calculated and a slow algorithm is a penalty - that would be "unnecessary" if the MD5 is not written to the file, never again to be used.
Now is it worth it to implement a second algorithm for the cases where MD5 is not stored?
(Even if MD5 has to be calculated once to be stored, a more than twice as fast algorithm would save time. But is it worth it?)
[note2]: except, well, codecs differ on how to calculate MD5 on 8-bit signals, since those are unsigned. Who has a big collection of compressed 8-bit .wav files?
You could consider looking into the arithmetic coder used in Daala/AV1. That is an arithmetic coder that was specifically designed to evade any existing patents. I don't know whether that is fast enough for your liking
That hasn't stopped Microsoft from patenting the rANS entropy method, regardless.
Which to me leaves us in the exact same situation as if we use ordinary arithmetic coding. Might as well just use that or range coding. :/