Speeding up codecs with faster CRC calculations

While working on the repacker for my multi-threaded MP3 converter, I noticed that LAME uses a relatively slow CRC implementation. Further investigations showed that FLAC, Ogg and Monkey's Audio use similar algorithms for their CRC needs.

I replaced these algorithms with one called slicing-by-8 to see if conversions using these codecs would benefit from it. It turned out that the benefit for lossy codecs is marginal, but lossless codecs are sped up noticeably because they generate larger output files and therefore have more data to checksum.
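For readers unfamiliar with slicing-by-8, here is a minimal sketch in Python. It is not taken from the patches; it assumes the MSB-first CRC-32 with polynomial 0x04C11DB7, zero initial value and no final XOR (the parameters used for Ogg pages), and the function names are made up for illustration. The idea is to consume 8 input bytes per loop iteration using 8 lookup tables instead of one byte per iteration with a single table.

```python
POLY = 0x04C11DB7  # CRC-32 polynomial, MSB-first form (assumed: Ogg page CRC)
MASK = 0xFFFFFFFF


def make_tables():
    """Build the 8 lookup tables (256 entries each) used by slicing-by-8."""
    t0 = []
    for i in range(256):
        r = i << 24
        for _ in range(8):
            r = ((r << 1) ^ POLY) & MASK if r & 0x80000000 else (r << 1) & MASK
        t0.append(r)
    tables = [t0]
    for _ in range(7):
        prev = tables[-1]
        # Table k holds the CRC of byte i followed by k zero bytes.
        tables.append([((v << 8) & MASK) ^ t0[v >> 24] for v in prev])
    return tables


TABLES = make_tables()


def crc32_bytewise(data, crc=0):
    """Classic table-driven CRC: one table lookup per input byte."""
    t0 = TABLES[0]
    for b in data:
        crc = ((crc << 8) & MASK) ^ t0[(crc >> 24) ^ b]
    return crc


def crc32_slicing8(data, crc=0):
    """Slicing-by-8: eight independent lookups per 8 input bytes."""
    t0, t1, t2, t3, t4, t5, t6, t7 = TABLES
    n = len(data) - len(data) % 8
    for i in range(0, n, 8):
        # Fold the first 4 bytes into the running CRC, then look up all 8.
        crc ^= int.from_bytes(data[i:i + 4], "big")
        crc = (t7[crc >> 24] ^ t6[(crc >> 16) & 0xFF] ^
               t5[(crc >> 8) & 0xFF] ^ t4[crc & 0xFF] ^
               t3[data[i + 4]] ^ t2[data[i + 5]] ^
               t1[data[i + 6]] ^ t0[data[i + 7]])
    # Handle the remaining 0..7 bytes one at a time.
    return crc32_bytewise(data[n:], crc)
```

In C the gain comes from the eight lookups being independent of each other, so the CPU can overlap them. In Python this sketch only demonstrates that both functions produce the same checksum, not the speed-up.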

Here are the relative speed ups on my system (Core i7 6900K) using default settings:

Codec            Encode   Decode
LAME             0.5%     -
Opus             0.5%     1%
Vorbis           0.5%     2%
Monkey's Audio   4%       -
FLAC             5%       5%
Ogg FLAC         10%      15%

Note that Ogg FLAC benefits twice. The FLAC frames are checksummed by a CRC16 while the Ogg pages are run through a CRC32. Hence the larger speed up.
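To illustrate why the payload bytes of Ogg FLAC are checksummed twice, here is a rough sketch. It assumes the commonly documented parameters (CRC-16 with polynomial 0x8005 for FLAC frames, CRC-32 with polynomial 0x04C11DB7 for Ogg pages, both MSB-first with zero initial value). The page layout is deliberately simplified and the header bytes are a made-up placeholder, not the real Ogg page structure.

```python
def crc_msb_first(data, width, poly, crc=0):
    """Bit-by-bit CRC, most significant bit first, no reflection, no final XOR."""
    top = 1 << (width - 1)
    mask = (1 << width) - 1
    for byte in data:
        crc ^= byte << (width - 8)
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & mask if crc & top else (crc << 1) & mask
    return crc


def crc16_flac(data):
    return crc_msb_first(data, 16, 0x8005)


def crc32_ogg(data):
    return crc_msb_first(data, 32, 0x04C11DB7)


# Stand-in for the bytes of one encoded FLAC frame (not real FLAC data).
frame_body = bytes(range(64))

# First pass: the frame is closed with its own CRC-16.
frame = frame_body + crc16_flac(frame_body).to_bytes(2, "big")

# Second pass: the Ogg layer checksums the page that carries the frame.
# The header here is a hypothetical placeholder.
page = b"hypothetical-page-header" + frame
page_crc = crc32_ogg(page)

# Every frame byte went through a CRC routine twice.
bytes_checksummed = len(frame_body) + len(page)
print(len(page), bytes_checksummed)
```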

The patches to change the CRC algorithms to slicing-by-8 can be found here:


And here's the full article about this.

Edit: I updated the FLAC patch to fix a bug when using the --enable-64-bit-words option. See here for more information.
Edit2: A proof-of-concept FLAC build can now be found under post #9 along with some performance numbers.
Edit3: The Monkey's Audio patch has been integrated into the official Monkey's Audio 4.34 release.

Re: Speeding up codecs with faster CRC calculations

Reply #1
Neat! If I ever had free time I'd love to look at optimizing lame or vorbis. I think the last time I checked, a lot of lame optimizations were implemented for an athlon or P4.

 

Re: Speeding up codecs with faster CRC calculations

Reply #2
Excellent to see more real software development being discussed on this site! Software is a major component of audio processing nowadays and should be encouraged more. Sure, there are DSP sites, but audio-specific topics are not always appropriate on general-purpose DSP sites. BTW, I think I am going to continue my posts about audio gain control, noise reduction, etc. on the DSPrelated site. It isn't a big deal, except for those who are starting a development effort. I don't have a strong financial incentive, so helping others is more important to me.

Re: Speeding up codecs with faster CRC calculations

Reply #3
Fun. Even for a codec optimized for decoding speed, you can gain five percent just for the CRC?
(I recall someone (Gregory?) complaining over the time taken to create MD5 from the FLAC audio. But MD5 is not CRC* and encoding is done once.)

Re: Speeding up codecs with faster CRC calculations

Reply #4
Quote
Neat! If I ever had free time I'd love to look at optimizing lame or vorbis. I think the last time I checked, a lot of lame optimizations were implemented for an athlon or P4.

For LAME there are TMKK's SSE optimizations - the patch uses mostly SSE2, but can use some SSE3 and SSE4.1 instructions if available.

For Vorbis there is the Lancer version of course. It uses SSE, SSE2 or SSE3 instructions.

If you would like to improve these patches, they both could use AVX and AVX2 optimizations which might provide significant performance improvements. Also, the LAME patch is using inline assembly which renders it not very portable. Reimplementing it with intrinsics would be great.

Re: Speeding up codecs with faster CRC calculations

Reply #5
Quote
Fun. Even for a codec optimized for decoding speed, you can gain five percent just for the CRC?

Yes. I guess CRC was simply overlooked in previous attempts to optimize FLAC.

However, speeding up FLAC decoding involved more than just replacing the CRC algorithm. The CRC value was updated whenever a word (4 or 8 bytes) had been processed by the decoder. I changed that so that more bytes are processed at once. The CRC is now updated only when the read buffer is flushed and at the end of each frame.
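The effect of updating the CRC less often can be sketched as follows. This is only an illustration, not the FLAC code: `zlib.crc32` stands in for FLAC's CRC-16 because it supports incremental updates, and the chunk sizes (8 bytes for a word, 4096 bytes for a buffer) are assumptions chosen for the example.

```python
import zlib


def crc_in_chunks(data, chunk_size):
    """Feed data to an incremental CRC in chunk_size pieces.

    Returns the final CRC and the number of update calls made.
    """
    crc = 0
    calls = 0
    for start in range(0, len(data), chunk_size):
        crc = zlib.crc32(data[start:start + chunk_size], crc)
        calls += 1
    return crc, calls


frame = bytes(range(256)) * 64  # 16 KiB stand-in for the bytes of one frame

per_word = crc_in_chunks(frame, 8)       # update after every 8-byte word
per_flush = crc_in_chunks(frame, 4096)   # update only when a buffer is flushed

print(per_word[0] == per_flush[0])  # the checksum is the same either way
print(per_word[1], per_flush[1])    # number of update calls in each scheme
```

The checksum does not depend on how the input is split, so batching is safe. Fewer, larger updates reduce per-call overhead and give a multi-byte algorithm such as slicing-by-8 enough data per call to pay off.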

MD5 is a different beast and I see no way to speed it up significantly at this time. Nayuki did some MD5 optimizations, but his assembler version is only 10% faster than the C version - that would not make a noticeable difference when applied to FLAC's MD5 calculation.

Re: Speeding up codecs with faster CRC calculations

Reply #6
Thanks Enzo, for the valuable work and article.

From the article:
Quote
It's possible to speed up the CRC calculations even more using other methods such as using the PCLMULQDQ instruction on modern x86 CPUs. However, that would make the code depend on that platform and probably provide only marginal additional speed gains.
SSE 4.2 also has a CRC instruction that I expect would be largely more efficient.

https://en.wikipedia.org/wiki/SSE4#SSE4.2
https://www.felixcloutier.com/x86/CRC32.html

All modern compilers support the associated intrinsics:
https://msdn.microsoft.com/en-us/library/bb514033(v=vs.120).aspx
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/X86-Built-in-Functions.html
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=crc

However the polynomial value is hardcoded and I don't know if it matches the one used by the various audio formats.

Re: Speeding up codecs with faster CRC calculations

Reply #7
Quote
SSE 4.2 also has a CRC instruction that I expect would be largely more efficient.
[...]
However the polynomial value is hardcoded and I don't know if it matches the one used by the various audio formats.

Unfortunately, the hardcoded polynomial is different from the one used by Ogg and Monkey's Audio, so that instruction cannot be used.

LAME and FLAC use CRC16, so the CRC32 instruction is not applicable there anyway.
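A small sketch of why the polynomial matters. The SSE4.2 CRC32 instruction implements the CRC-32C (Castagnoli) polynomial 0x1EDC6F41, processed in reflected bit order. The Ogg parameters below (polynomial 0x04C11DB7, MSB-first, zero initial value, no final XOR) are the commonly documented ones and are assumed here. The initial value and final XOR in `crc32c` follow the usual CRC-32C convention, not something the instruction itself applies.

```python
def crc32c(data):
    """CRC-32C (Castagnoli), bit by bit, reflected form of 0x1EDC6F41."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def crc32_ogg(data):
    """CRC-32 with polynomial 0x04C11DB7, MSB-first (assumed Ogg parameters)."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


sample = b"123456789"
print(hex(crc32c(sample)))     # checksum under the hardware polynomial
print(hex(crc32_ogg(sample)))  # checksum a file format with 0x04C11DB7 expects
```

Because the two polynomials generate unrelated checksums, a file written with one cannot be verified with the other, which is why the hardware instruction is of no use for these formats.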

Re: Speeding up codecs with faster CRC calculations

Reply #8
Quote
Fun. Even for a codec optimized for decoding speed, you can gain five percent just for the CRC?
(I recall someone (Gregory?) complaining over the time taken to create MD5 from the FLAC audio. But MD5 is not CRC* and encoding is done once.)

The stock flac decoder is (or at least was 10 years ago) relatively unoptimized. The numbers we have in Rockbox, where you can decode a flac file at 5 or 10 MHz, are for the ffmpeg flac decoder with CRC disabled and a lot of hand-written ARM assembly.

Re: Speeding up codecs with faster CRC calculations

Reply #9
I built some FLAC compiles for you to try out.

Built for Win64 with GCC 4.9.2 and -O3 -march=nehalem -funroll-loops. FLAC is configured with --enable-64-bit-words.

Here are some numbers for converting a 2.5-hour, 48 kHz, 16-bit stereo WAV:

Codec              Encode    Decode
FLAC               35.711s   13.646s
FLAC fastCRC       32.850s   12.411s
Ogg FLAC           38.531s   16.085s
Ogg FLAC fastCRC   33.517s   12.627s
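
To turn the timings above into relative figures, a short sketch follows. The function name is made up for illustration. "Speed-up" is computed here as the throughput gain (old time divided by new time, minus one), which is one of several reasonable definitions and gives slightly larger numbers than "time saved".

```python
# Seconds from the table above: (regular build, fastCRC build).
timings = {
    "FLAC encode": (35.711, 32.850),
    "FLAC decode": (13.646, 12.411),
    "Ogg FLAC encode": (38.531, 33.517),
    "Ogg FLAC decode": (16.085, 12.627),
}


def speedup_percent(baseline_s, fast_s):
    """Throughput gain in percent: how much more work per second the fast build does."""
    return (baseline_s / fast_s - 1.0) * 100.0


for name, (baseline_s, fast_s) in timings.items():
    print(f"{name}: {speedup_percent(baseline_s, fast_s):.1f}% faster")
```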

Re: Speeding up codecs with faster CRC calculations

Reply #10
Hi enzo,

just tested your x64 binary on my ancient i5-2400 with my usual set of test files.
I can confirm an encoding speed boost of 4.3% when compared with the latest git version.
Thank you very much for your efforts - quite interesting that no one thought of optimizing the CRC functions...

.sundance.

Re: Speeding up codecs with faster CRC calculations

Reply #11
I tested with a small WAV file (219 MB) and got the following result with my Intel i5 4460 CPU (4x 3.2 GHz):

Using FLAC-1.3.2_Git-2018-04-08_Win_GCC730 I got: Total encoding time: 0:07.629, 170.86x realtime

Using flac-1.3.2-fastcrc-win64 I got: Total encoding time: 0:06.240, 208.89x realtime

Re: Speeding up codecs with faster CRC calculations

Reply #12
Quote
Using FLAC-1.3.2_Git-2018-04-08_Win_GCC730 I got: Total encoding time: 0:07.629, 170.86x realtime
Using flac-1.3.2-fastcrc-win64 I got: Total encoding time: 0:06.240, 208.89x realtime

Great, but builds from different compilers and possibly built with different options are not really comparable. That's why I include a regular flac.exe and the flac-fastcrc.exe in my ZIP.

With my build being 22% faster in your run, I suspect the GCC 7.3 build was not compiled with ideal options. It should be closer to about 5% difference otherwise.

Re: Speeding up codecs with faster CRC calculations

Reply #13
Ah, I see. Sorry for not doing things properly.

EDIT:
I tested with the same wav-file again, I got these results using the flac exe's included in flac-1.3.2-fastcrc-win64
flac.exe: Total encoding time: 0:07.316, 178.17x realtime
flac fastcrc: Total encoding time: 0:06.006, 217.03x realtime

Re: Speeding up codecs with faster CRC calculations

Reply #14
Sorry for posting again, but I tested again with a larger WAV file (607 MB).
These are the results:

CLI encoder: flac.exe
Destination file: F:\flac-1.3.2-fastcrc-win64\flac\U2 - Pop.flac
Encoder stream format: 44100Hz / 2ch / 16bps
Command line: "F:\flac-1.3.2-fastcrc-win64\flac\flac.exe" -s --ignore-chunk-sizes -8 - -o "U2 - Pop.flac"
Working folder: F:\flac-1.3.2-fastcrc-win64\flac\
Encoder process still running, waiting...
Encoder process terminated cleanly.
Track converted successfully.
Total encoding time: 0:25.896, 139.46x realtime

--

CLI encoder: flac.exe
Destination file: F:\flac-1.3.2-fastcrc-win64\flac-fastcrc\U2 - Pop.flac
Encoder stream format: 44100Hz / 2ch / 16bps
Command line: "F:\flac-1.3.2-fastcrc-win64\flac-fastcrc\flac.exe" -s --ignore-chunk-sizes -8 - -o "U2 - Pop.flac"
Working folder: F:\flac-1.3.2-fastcrc-win64\flac-fastcrc\
Encoder process still running, waiting...
Encoder process terminated cleanly.
Track converted successfully.
Total encoding time: 0:16.100, 224.31x realtime

Re: Speeding up codecs with faster CRC calculations

Reply #15
Quote
I tested with the same wav-file again, I got these results using the flac exe's included in flac-1.3.2-fastcrc-win64
flac.exe: Total encoding time: 0:07.316, 178.17x realtime
flac fastcrc: Total encoding time: 0:06.006, 217.03x realtime

OK, so that's still a 22% difference. I didn't expect that, as on my system the difference between those binaries is only about 5% to 7%. It would be interesting to see more results from others.

Quote
Destination file: F:\flac-1.3.2-fastcrc-win64\flac\U2 - Pop.flac
Total encoding time: 0:25.896, 139.46x realtime
--
Destination file: F:\flac-1.3.2-fastcrc-win64\flac-fastcrc\U2 - Pop.flac
Total encoding time: 0:16.100, 224.31x realtime

Here it could be that the .wav was not yet in the file system cache for the first run. The difference seems too big. Could you run the non-fast test again (maybe twice, to see if the results are consistent)?

Re: Speeding up codecs with faster CRC calculations

Reply #16
Sure no problem, here are the results for the non-fast flac.exe:

CLI encoder: flac.exe
Destination file: F:\flac-1.3.2-fastcrc-win64\U2 - Pop_nonefast1.flac
Encoder stream format: 44100Hz / 2ch / 16bps
Command line: "F:\flac-1.3.2-fastcrc-win64\flac.exe" -s --ignore-chunk-sizes -8 - -o "U2 - Pop_nonefast1.flac"
Working folder: F:\flac-1.3.2-fastcrc-win64\
Encoder process still running, waiting...
Encoder process terminated cleanly.
Track converted successfully.
Total encoding time: 0:17.020, 212.18x realtime

CLI encoder: flac.exe
Destination file: F:\flac-1.3.2-fastcrc-win64\U2 - Pop_nonefast2.flac
Encoder stream format: 44100Hz / 2ch / 16bps
Command line: "F:\flac-1.3.2-fastcrc-win64\flac.exe" -s --ignore-chunk-sizes -8 - -o "U2 - Pop_nonefast2.flac"
Working folder: F:\flac-1.3.2-fastcrc-win64\
Encoder process still running, waiting...
Encoder process terminated cleanly.
Track converted successfully.
Total encoding time: 0:17.051, 211.80x realtime

CLI encoder: flac.exe
Destination file: F:\flac-1.3.2-fastcrc-win64\U2 - Pop_nonefast3.flac
Encoder stream format: 44100Hz / 2ch / 16bps
Command line: "F:\flac-1.3.2-fastcrc-win64\flac.exe" -s --ignore-chunk-sizes -8 - -o "U2 - Pop_nonefast3.flac"
Working folder: F:\flac-1.3.2-fastcrc-win64\
Encoder process still running, waiting...
Encoder process terminated cleanly.
Track converted successfully.
Total encoding time: 0:17.004, 212.38x realtime

CLI encoder: flac.exe
Destination file: F:\flac-1.3.2-fastcrc-win64\U2 - Pop_nonefast4.flac
Encoder stream format: 44100Hz / 2ch / 16bps
Command line: "F:\flac-1.3.2-fastcrc-win64\flac.exe" -s --ignore-chunk-sizes -8 - -o "U2 - Pop_nonefast4.flac"
Working folder: F:\flac-1.3.2-fastcrc-win64\
Encoder process still running, waiting...
Encoder process terminated cleanly.
Track converted successfully.
Total encoding time: 0:17.144, 210.65x realtime

CLI encoder: flac.exe
Destination file: F:\flac-1.3.2-fastcrc-win64\U2 - Pop_nonefast5.flac
Encoder stream format: 44100Hz / 2ch / 16bps
Command line: "F:\flac-1.3.2-fastcrc-win64\flac.exe" -s --ignore-chunk-sizes -8 - -o "U2 - Pop_nonefast5.flac"
Working folder: F:\flac-1.3.2-fastcrc-win64\
Encoder process still running, waiting...
Encoder process terminated cleanly.
Track converted successfully.
Total encoding time: 0:16.957, 212.97x realtime

Re: Speeding up codecs with faster CRC calculations

Reply #17
Quote
Sure no problem, here are the results for the non-fast flac.exe:

Thanks! So it seems to settle around 212x while the fast version was 224x realtime. That's almost exactly what I expected: a 5 to 6% improvement.
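
The numbers posted in this exchange can be checked with a few lines. This is only a sketch of the arithmetic: the variable names are made up, and the fast build was measured just once, so the resulting gain is indicative rather than statistically solid.

```python
from statistics import mean, stdev

# Encoding times in seconds, taken from the posts above.
first_baseline_run = 25.896                        # suspected cold file cache
baseline_runs = [17.020, 17.051, 17.004, 17.144, 16.957]
fast_run = 16.100                                  # single fastCRC measurement

baseline_mean = mean(baseline_runs)
baseline_spread = stdev(baseline_runs)

# How far the first run sits from the repeated runs, in standard deviations.
outlier_score = (first_baseline_run - baseline_mean) / baseline_spread

# Throughput gain of the fast build over the averaged baseline.
gain_percent = (baseline_mean / fast_run - 1.0) * 100.0

print(f"baseline mean {baseline_mean:.3f}s, spread {baseline_spread:.3f}s")
print(f"first run is {outlier_score:.0f} standard deviations off")
print(f"fastCRC gain: {gain_percent:.1f}%")
```

Repeating the baseline run several times makes it clear that the first measurement was an outlier and that the two builds should only be compared once the input file is cached.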

Re: Speeding up codecs with faster CRC calculations

Reply #18
Hello,

I'm still running a Core 2 Duo E7200@2.86GHz and your binary works, even though this CPU is not supposed to support SSE4.2 and POPCNT.

Converting a 592MB wav file (time in seconds):
               Run1     Run2     Run3
flac           15.009   15.007   15.013
flac-fastcrc   13.912   13.961   14.092

Quite interesting.

    AiZ

Re: Speeding up codecs with faster CRC calculations

Reply #19
Hi,

I eventually got some parts to assemble my new PC, now sporting an incredible...
Pentium G5400.
Woohoo!  :D

Same wav file as above (time in seconds):
               Run1    Run2    Run3
flac           7.090   7.066   7.058
flac-fastcrc   6.180   6.127   6.237

Still great.

    AiZ

Re: Speeding up codecs with faster CRC calculations

Reply #20
Enzo,

Do you have any performance profile data for these codecs on current gen CPUs? I don't have time to dig into this right now, but I'd like to eventually and I'm curious if things have changed the last 10-15 years of new CPU hardware.

Re: Speeding up codecs with faster CRC calculations

Reply #21
Quote
Do you have any performance profile data for these codecs on current gen CPUs? I don't have time to dig into this right now, but I'd like to eventually and I'm curious if things have changed the last 10-15 years of new CPU hardware.

You mean regarding optimizations in general (with new instruction sets like AVX), yes? Not related to the CRC thing.

No, I didn't do any performance profiling. Same for me as for you: I'd like to take a deeper dive into this and try to further optimize codecs, but don't really have time to do it.

Re: Speeding up codecs with faster CRC calculations

Reply #22
Yeah, was just curious what the breakdown of codec runtime was by function. I think I last looked at lame performance in 2006 on a 32 bit Pentium 4. Probably things have changed since then.