Monkey's Audio 9.10 beta with additional SIMD optimizations

Hi all,

I have worked with @MonkeysAudio over the past few weeks to add runtime CPU feature detection and additional SIMD optimizations to the codec.

The highlights of this release when it comes to optimization are:

  • Runtime feature detection for x86/x64 optimizations (SSE2/SSE4.1/AVX2)
  • New SSE4.1 and AVX2 optimizations for 24/32 bit coding
  • New ARM Neon optimizations
  • Further optimized existing SSE2 and AVX2 code for improved performance
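
For readers who have not seen runtime dispatch before, here is a minimal sketch of the general idea, not the actual Monkey's Audio code. It assumes GCC or Clang on x86 for the `__builtin_cpu_supports` builtin and falls back to scalar everywhere else. `SimdLevel`, `DetectSimdLevel`, `DotScalar` and `SelectDot` are hypothetical names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// All names here are hypothetical; this is not Monkey's Audio's actual code.
enum class SimdLevel { Scalar, SSE2, SSE41, AVX2 };

// Query the CPU once at run time. __builtin_cpu_supports is a GCC/Clang
// builtin for x86 targets; anything else takes the scalar path in this sketch.
inline SimdLevel DetectSimdLevel()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))   return SimdLevel::AVX2;
    if (__builtin_cpu_supports("sse4.1")) return SimdLevel::SSE41;
    if (__builtin_cpu_supports("sse2"))   return SimdLevel::SSE2;
#endif
    return SimdLevel::Scalar;
}

using DotFn = std::int32_t (*)(const std::int16_t*, const std::int16_t*, std::size_t);

// Portable reference kernel, always available.
inline std::int32_t DotScalar(const std::int16_t* a, const std::int16_t* b, std::size_t n)
{
    assert(a != nullptr && b != nullptr);
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += static_cast<std::int32_t>(a[i]) * b[i];
    return sum;
}

// Pick a kernel for the detected level. A real build would return separately
// compiled SSE2/SSE4.1/AVX2 kernels here; the sketch only has the scalar one.
inline DotFn SelectDot(SimdLevel level)
{
    switch (level) {
    case SimdLevel::AVX2:   // would return an AVX2 kernel
    case SimdLevel::SSE41:  // would return an SSE4.1 kernel
    case SimdLevel::SSE2:   // would return an SSE2 kernel
    case SimdLevel::Scalar:
        break;
    }
    return &DotScalar;
}
```

The point of the function pointer is that detection happens once (for example at codec start-up), and the hot loop then calls `dot(a, b, n)` without re-checking CPU features.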

You can find the full changelog on monkeysaudio.com.

The new optimizations do not make a big difference when using the Normal compression level, but the higher the compression level, the greater the performance gain.

On x86/x64 you should see the most significant improvements with 24/32 bit material due to the newly added SIMD code for high resolution processing.

ARM devices like Apple Silicon Macs benefit greatly from the newly added Neon optimizations even with 16 bit audio. On a MacBook Air I measured a performance gain of 25-30% using the High and Extra high compression levels.
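
To give an idea of what a 16 bit SIMD kernel of this kind can look like, here is a rough sketch of a dot product with a Neon path (AArch64), an SSE2 path and a scalar fallback chosen at compile time. This is an illustration only and not taken from the Monkey's Audio source; `DotProduct16` and the `SKETCH_HAVE_*` macros are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#elif defined(__SSE2__)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

// Dot product of two 16 bit vectors, 8 samples per step where SIMD is available.
std::int32_t DotProduct16(const std::int16_t* a, const std::int16_t* b, std::size_t n)
{
    assert(a != nullptr && b != nullptr);
    std::int32_t sum = 0;
    std::size_t i = 0;
#if defined(SKETCH_HAVE_NEON)
    int32x4_t acc = vdupq_n_s32(0);
    for (; i + 8 <= n; i += 8) {
        const int16x8_t va = vld1q_s16(a + i);
        const int16x8_t vb = vld1q_s16(b + i);
        acc = vmlal_s16(acc, vget_low_s16(va), vget_low_s16(vb));   // samples 0..3
        acc = vmlal_s16(acc, vget_high_s16(va), vget_high_s16(vb)); // samples 4..7
    }
    sum = vaddvq_s32(acc); // horizontal add across the four lanes
#elif defined(SKETCH_HAVE_SSE2)
    __m128i acc = _mm_setzero_si128();
    for (; i + 8 <= n; i += 8) {
        const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb)); // pairwise multiply-add
    }
    alignas(16) std::int32_t lanes[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; ++i) // scalar tail, or the whole loop when no SIMD path is compiled in
        sum += static_cast<std::int32_t>(a[i]) * b[i];
    return sum;
}
```

Higher compression levels use longer filters, so more of the total time is spent in loops like this one, which fits the observation that the gain grows with the compression level.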

The Monkey's Audio 9.10 beta is available from the official page, and the code has already been integrated into fre:ac continuous builds.

It would be great if those interested in Monkey's Audio could give it a try and report bugs or performance regressions.

Thanks!

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #1
Thank you.   8)
EZ CD Audio Converter

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #2
Thanks!

Here are some test results, comparing the version I used in my last lossless codec comparison with 9.04 and 9.10. Machine has a not too new Intel Kaby Lake-R i5-7200U.

These results are for 16 bit PCM input, I'll do 24 bit PCM later.

[Two attached graphs not reproduced here.]

It seems decoding got a bit faster too.
Music: sounds arranged such that they construct feelings.

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #3
Here are the results for 24-bit PCM. Machine and methodology are the same.

[Two attached graphs not reproduced here.]

However, I noticed a very slight regression in compression on one particular track: 1,000,000, which is track 2 of NIN's album The Slip (available for download in various places, like archive.org, under a Creative Commons license). This is an older regression, and it is only slight, but still curious, as I'm not used to seeing any change at all in how much Monkey's Audio compresses between releases. Here is the graph:

[Attached graph not reproduced here.]
Music: sounds arranged such that they construct feelings.

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #4
I was unaware that this was even a thing; I thought a -cX000 .ape was a -cX000 .ape, except for those releases that should have been tagged beta.
But, browsing the changelog, it could be this 8.50 change?
The encoder uses what was previously "legacy mode" for all 24-bit encodings. This provides wider compatibility with only slightly worse compression.
At least, 8.43 compresses bit-identically to the old and 8.51 bit-identically to the new, on the same track. Monkey's MD5 returned for -c4000 "Extra high":
1D4EF35A8AEF8C8625FB298630E93A0C resp. AC6454F131264C710CA1BF552B877D63


As for timings, this i5-1135G7 is in a cooling-constrained fanless computer that throttles randomly, but it seems I gain a very few percent speed (at -c4000) on high resolution. Like, for the above track I got encode times 4.225, 4.274, 3.897 - each being the median of several runs with, respectively, versions 7.58 (because ktf used that), 9.00 (because that's the one I had installed) and 9.10, all 64-bit versions called from command-line.
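
A small sketch of the bookkeeping behind figures like these, in case anyone wants to repeat it: take the median of several runs per version, then compare the medians. `Median` and `TimeSaved` are hypothetical helper names, not part of any tool mentioned in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of run times in seconds. The vector is taken by value so
// the caller's data stays in its original order.
double Median(std::vector<double> times)
{
    assert(!times.empty());
    std::sort(times.begin(), times.end());
    const std::size_t mid = times.size() / 2;
    return (times.size() % 2 == 1) ? times[mid]
                                   : 0.5 * (times[mid - 1] + times[mid]);
}

// Fraction of time saved when going from the old median to the new one.
double TimeSaved(double oldSeconds, double newSeconds)
{
    assert(oldSeconds > 0.0);
    return 1.0 - newSeconds / oldSeconds;
}
```

With the medians quoted above, `TimeSaved(4.225, 3.897)` comes out at roughly 0.078, i.e. about 8% less encode time for 9.10 compared with 7.58 on this track.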



Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #5
Thanks @ktf and @Porcus!

Yes, there was an incompatibility introduced with the 5.00 release which added 32 bit sample support. Older decoders would not be able to decode most 24 bit files made after that change. The reason was that the old code could cause a 32 bit overflow in one calculation and the new code used 64 bit types (to enable 32 bit sample encoding), so the overflow did not happen.

This was fixed in 8.50 to restore compatibility with older decoders, but reverting to the old overflow behavior caused a slightly worse compression rate for the affected files.
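
To illustrate why the accumulator width matters for compatibility, here is a toy example, not the actual calculation from the codec (which one overflowed is not spelled out here, so a plain dot product stands in for it). Encoder and decoder have to arrive at exactly the same prediction, so a decoder that wraps at 32 bits cannot follow an encoder that computed the same sum in 64 bits. `DotWrap32` and `DotWide64` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 32 bit accumulator that wraps on overflow. Unsigned arithmetic is used so
// the wrap-around is well defined; the final cast assumes two's complement.
std::int32_t DotWrap32(const std::int32_t* x, const std::int32_t* w, std::size_t n)
{
    assert(x != nullptr && w != nullptr);
    std::uint32_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<std::uint32_t>(x[i]) * static_cast<std::uint32_t>(w[i]);
    return static_cast<std::int32_t>(acc);
}

// 64 bit accumulator: same inputs, but the sum fits and nothing wraps.
std::int64_t DotWide64(const std::int32_t* x, const std::int32_t* w, std::size_t n)
{
    assert(x != nullptr && w != nullptr);
    std::int64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<std::int64_t>(x[i]) * w[i];
    return acc;
}
```

For two full-scale 24 bit samples (8388607) with a weight of 256 each, the 64 bit version returns 4294966784 while the wrapping 32 bit version returns -512. Both are "valid" as long as encoder and decoder agree on which one is used.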

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #6
Good day to everybody. First of all, thanks to the developers for this beta! It's good to see that development of Monkey's Audio goes on. Also thanks to all of the people who participated in testing!
I decided to add my 5 cents so here it comes )
Please note that my test is kinda dirty and synthetic.
Test material: Generated 300 seconds of spatial stereo brown noise with Adobe Audition and saved it as 32bit float, with further conversion to 16/24/32 bit via LibSndFile.
Hardware: i7-3770K @ 4500 MHz, SSD (SATA3 speed limited)
Software: Windows 7 x64, measurements with ProcProfile 1.5.1
I'm posting results as screenshot from Excel.
Actually I haven't posted here for a long time, so I don't know if it will look OK. Sorry if the post looks a little bit clumsy )
In case somebody would like to have the original Excel file, I've attached it too.

[Attached Excel screenshot not reproduced here.]

So, results are kinda interesting and bring a couple of strange things.
1. Decoding is notably slower than encoding. Well, it's a curse of almost all symmetric encoders I have seen so far, and the solution here is either PGO (profile guided optimization), pushing the decoder out to a standalone executable, or simply not caring :)
2. -c1000 mode for 24bit resolution presumably broken or lacks the optimizations. Both -c2000 and -c3000 are actually faster than -c1000 for 24bit audio. Well, at least for this file.
Can anybody confirm it ?
3. Results for 32bit -c5000 look strange to me. For 9.10 beta we have:
For 16bit, -c5000 mode 2.25 times slower than -c4000
For 24bit, -c5000 mode 2.19 times slower than -c4000
For 32bit, -c5000 mode 2.57 times slower than -c4000
Of course I do not expect that there should be some kind of linear coefficients for speed but anyway, just pointing to this.

Anyway, with more than 25% (maximum) and almost 9% (average) speedup, the results are simply astonishing!
My congratulations !

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #7
I am not encoding APE files anymore, so some decoding benchmarks from my existing files, all of them are CDDA images. All decoding done on RAM, i3-12100, win10 x64 and x64 executable.

Fast:
--- Monkey's Audio Console Front End (v 8.19) (c) Matthew T. Ashland ---
Progress: 100.0% (0.0 seconds remaining, 10.4 seconds total)

--- Monkey's Audio Console Front End (v 9.10) (c) Matthew T. Ashland ---
Progress: 100.0% (0.0 seconds remaining, 10.5 seconds total)
Normal:
--- Monkey's Audio Console Front End (v 8.19) (c) Matthew T. Ashland ---
Progress: 100.0% (0.0 seconds remaining, 15.8 seconds total)

--- Monkey's Audio Console Front End (v 9.10) (c) Matthew T. Ashland ---
Progress: 100.0% (0.0 seconds remaining, 15.5 seconds total)
High:
--- Monkey's Audio Console Front End (v 8.19) (c) Matthew T. Ashland ---
Progress: 100.0% (0.0 seconds remaining, 18.0 seconds total)

--- Monkey's Audio Console Front End (v 9.10) (c) Matthew T. Ashland ---
Progress: 100.0% (0.0 seconds remaining, 17.2 seconds total)

Extra high:
--- Monkey's Audio Console Front End (v 8.19) (c) Matthew T. Ashland ---
Progress: 100.0% (0.0 seconds remaining, 18.3 seconds total)

--- Monkey's Audio Console Front End (v 9.10) (c) Matthew T. Ashland ---
Progress: 100.0% (0.0 seconds remaining, 17.4 seconds total)

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #8
Thank you all for testing!

-c1000 mode for 24bit resolution presumably broken or lacks the optimizations. Both -c2000 and -c3000 are actually faster than -c1000 for 24bit audio. Well, at least for this file.
Can anybody confirm it ?
I was curious about your 24 bit -c1000 results, but cannot reproduce those here. Tried my own MinGW and the official MSVC binaries and -c1000 is consistently much faster than -c2000 or -c3000. Really don't know what might go wrong there on your machine.

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #9
Results for 32bit -c5000 look strange to me. For 9.10 beta we have:
For 16bit, -c5000 mode 2.25 times slower than -c4000
For 24bit, -c5000 mode 2.19 times slower than -c4000
For 32bit, -c5000 mode 2.57 times slower than -c4000
Just guessing, but the higher encoding time ratio for 32 bit might be because of more overflows happening in the filter stage with 32 bit encoding.

Both 24 and 32 bit encoding use 64 bit types for calculations, and the larger numbers operated on with 32 bit samples significantly increase the chance of overflows.

Overflows de-correlate the filter output from the input signal and make it harder (i.e. more time consuming) to encode the resulting values.

This combined with the additional high order filter used at the -c5000 level could potentially explain the larger ratio for 32 bit vs. 24 bit encoding.

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #10
@Skymmer, I was curious about your 24 bit -c1000 results, but cannot reproduce those here.
You're completely right. I looked into my table and found that my results are a little bit incorrect. Pure human factor :)
Somehow I duplicated the 16bit -c4000 results into the 24bit -c1000 entry. Shame on me...

So, first thing in the morning, I re-worked the batch to exclude any semi-manual parsing of results.
All tests were executed one more time and in a cleaner environment, i.e. with no background processes or activity.
New results are below, along with the fixed Excel file.

[Attached results not reproduced here.]


Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #11
You're completely right. I looked into my table and found that my results are a little bit incorrect. Pure human factor :)
Somehow I duplicated the 16bit -c4000 results into the 24bit -c1000 entry. Shame on me...
Thanks for checking again! Good to know that there's no actual issue there.

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #12
Got this computer to confess to some numbers that are consistent enough to report. Biggest boost for high resolution, you said; then a ten percent speed-up for Extra High encoding is pretty decent for CDDA?

24 hours of CDDA in 23 files (23 albums in my signature, selected to be as good as spot-on 24 hours)
9.04 encoding Extra High: 554, 554, 555, 556, 557 all to stdout; 562 and 577 to file.
9.10 encoding Extra High: 492, 492, 492, 492, 492 all to stdout; then a 498 to file.

In CMD, running a
mac.exe "long_file_to_warm_up_the_cpu_hoping_for_consistent_results.wav" throwaway.ape -c5000 & timer64 this_bat_file_FORloops_mac.exe_through_the_twentythree_files.bat 

Now doing the same for decoding ... oh shrugs. The fastest I got was 565 (that's using 9.10, as you might have guessed). Since Monkey's seems to generally decode slower than it encodes, then at least new 9.10 is nearly as fast decoding as 9.04 encodes.
But the times were too inconsistent really, so that is why I don't report a list - when times are unreliable, the difference between the two fastest (that is likely the ones where Windows did not start something behind my back?) would be even less reliable. But anyway, I guess the encode times are enough evidence that you did something right.

this i5-1135G7 is in a cooling-constrained fanless computer that throttles randomly
Not bought to time codecs. Bought to decode and play music and STFU while doing so. Didn't get what I didn't pay for.

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #13
Since Monkey's seems to generally decode slower than it encodes, then at least new 9.10 is nearly as fast decoding as 9.04 encodes.
Yes, unfortunately decoding is always a bit slower than encoding with Monkey's Audio. I might attempt to improve decoding performance at some point. Got some ideas, but I'm not sure those will actually work in practice.

I just noticed today that I have an i7-1185G7 in one of my systems, which is very similar to your i5 - and just like yours it supports AVX-512...

Up to now I wasn't aware I had an AVX-512 capable system at my disposal and thus couldn't develop and test such code. Turns out I was wrong.

You'll get some additional speedup from that in the upcoming Monkey's Audio release. ;)

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #14
I just wanted to share a thank you to Enzo.  He's a genius and I'm so grateful for all his help.

A while back I had slowed Monkey's Audio down a lot when I added 32-bit support.  I wasn't sure how to speed it up again.

Enzo (Robert) showed up and switched everything to templates and it was just pure magic to me!

And now he just helped some more and brought my faster AVX assembly to Linux and Mac.

Thanks!

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #15
For extra credit Enzo had another genius idea today that resulted in about a 10% speed-up on my computer.

Download here:
https://monkeysaudio.com/download.html

Thanks to everyone for the help :)

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #16
For extra credit Enzo had another genius idea today that resulted in about a 10% speed-up on my computer.
Thanks for the kudos!

That change is for decoding only and should bring decoding performance much closer to encoding speed.

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #17
9.15 also seems to speed up encoding over 9.10, although the impact is not as big as for decoding. Not sure if that can be explained by anything but my CPU?
But the thing is, I got some fairly consistent results now by looping decoding-encoding-decoding during nighttime and just deleting a couple of "bad" exceptions. Maybe I should have randomized order ...
 
Times, anyway, all done for Extra High, encoding/decoding to stdout > NUL with the same corpus.

9.04 & 9.10 encoding: representative figures of 553 and 494, which is pretty much the same as the following (reported earlier):
9.04 encoding Extra High: 554, 554, 555, 556, 557 all to stdout; 562 and 577 to file.
9.10 encoding Extra High: 492, 492, 492, 492, 492 all to stdout; then a 498 to file.
Yet 9.15 improves:
9.15 encoding Extra High: 462, 466, 467.

Then decoding, here is where some "651" goes out. To get reasonably comparable numbers, here is for each version the range of second fastest to fifth fastest:
9.04 decoding: 608 to 613
9.10 decoding: 563 to 565
9.15 decoding: 505 to 507
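
For anyone wanting to summarize noisy timings the same way, a tiny sketch of the "second fastest to fifth fastest" range. `SecondToFifthFastest` is a hypothetical helper name, and the idea is simply to drop the single best run and everything slower than the fifth.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Returns {second fastest, fifth fastest} from a list of run times.
// Needs at least five runs; the outliers on the slow end are ignored.
std::pair<double, double> SecondToFifthFastest(std::vector<double> times)
{
    assert(times.size() >= 5);
    std::sort(times.begin(), times.end());
    return {times[1], times[4]};
}
```

With made-up run times of 10, 12, 11, 30, 9 and 13 seconds this gives the range 10 to 13, so a single slow outlier like the 30 no longer affects the summary.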

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #18
9.15 also seems to speed up encoding over 9.10, although the impact is not as big as for decoding.
That should be because of the AVX-512 optimizations added in 9.11.
Then decoding, here is where some "651" goes out. To get reasonably comparable numbers, here is for each version the range of second fastest to fifth fastest:
9.04 decoding: 608 to 613
9.10 decoding: 563 to 565
9.15 decoding: 505 to 507
Great! Good to see you can reproduce the performance gains. Thank you for testing!

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #19
For speeding up the decoder, consider merging the CalculateDotProduct() and Adapt() functions. FFmpeg did that in its implementation and it helps with higher compression levels.
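
A scalar sketch of what such a merge could look like, assuming the adaptation direction is already known before the dot product is computed. This is not the Monkey's Audio or FFmpeg code; the function names, the `direction` parameter and the buffer layout are hypothetical and only meant to show the loop fusion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Two-pass reference: one loop for the dot product, one for the adaptation.
std::int32_t DotThenAdapt(std::int16_t* coeffs, const std::int16_t* input,
                          const std::int16_t* adapt, int direction, std::size_t n)
{
    assert(coeffs != nullptr && input != nullptr && adapt != nullptr);
    std::int32_t dot = 0;
    for (std::size_t i = 0; i < n; ++i)
        dot += static_cast<std::int32_t>(coeffs[i]) * input[i];
    for (std::size_t i = 0; i < n; ++i)
        coeffs[i] = static_cast<std::int16_t>(coeffs[i] + direction * adapt[i]);
    return dot;
}

// Fused version: each coefficient is loaded once, used for the dot product
// with its old value, then updated and stored in the same iteration.
std::int32_t DotAndAdaptFused(std::int16_t* coeffs, const std::int16_t* input,
                              const std::int16_t* adapt, int direction, std::size_t n)
{
    assert(coeffs != nullptr && input != nullptr && adapt != nullptr);
    std::int32_t dot = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::int16_t c = coeffs[i];
        dot += static_cast<std::int32_t>(c) * input[i];
        coeffs[i] = static_cast<std::int16_t>(c + direction * adapt[i]);
    }
    return dot;
}
```

Both versions return the same dot product and leave the same coefficients behind; the fused one just walks the coefficient array once instead of twice, which matters more as the filter order grows.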

Re: Monkey's Audio 9.10 beta with additional SIMD optimizations

Reply #20
For speeding up the decoder, consider merging the CalculateDotProduct() and Adapt() functions. FFmpeg did that in its implementation and it helps with higher compression levels.

Thanks.  Are you proficient enough to suggest how to merge AdaptAVX2 and CalculateDotProductAVX2?  Those are the two functions used by modern processors.

Thanks again.