HydrogenAudio

Lossless Audio Compression => WavPack => Topic started by: GeezerHz on 2019-09-10 15:56:24

Title: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-10 15:56:24
I have two machines:

#1 is a Ryzen 3900X with 32GB of 3200MHz RAM running Windows 10 x64
#2 is an i7-4770 with 16GB DDR3 running Windows 7 x64

Running Windows 64-bit wavpack.exe 5.1.0 on machine #1 is 64% SLOWER than machine #2.

Code: [Select]
wavpack -hh -x3 -m test.wav

Machine #1 takes 1:23 to complete the task while machine #2 takes 0:50. This is completely counterintuitive to me since machine #1 bests machine #2 in all benchmarks, even single-threaded integer performance (24% faster). What am I missing? More importantly, is there anything I can do to speed up wavpack on machine #1?

EDIT - Just in case anyone was wondering, it's not a hard drive issue. Machine #1 has an NVMe drive and machine #2 is just spinning platters.
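
As a sanity check on the figures above, the slowdown can be recomputed from the two wall-clock times. This is only a sketch using the rounded times quoted in the post (1:23 and 0:50); the helper names are made up for illustration, and with these rounded inputs the result lands near 66%, in the same ballpark as the 64% quoted.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'M:SS' wall-clock string into whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def percent_slower(slow_s: float, fast_s: float) -> float:
    """How much longer the slow run takes, as a percentage of the fast run."""
    return (slow_s / fast_s - 1.0) * 100.0


ryzen_s = to_seconds("1:23")  # machine #1 (Ryzen 3900X), rounded time from the post
i7_s = to_seconds("0:50")     # machine #2 (i7-4770), rounded time from the post

print(f"Machine #1 needs {percent_slower(ryzen_s, i7_s):.0f}% more time")
```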
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Garf on 2019-09-10 18:43:32
WavPack has a lot of hand-optimized MMX code so it's quite possible it accidentally has a sequence in a critical loop that incurs a penalty on non-Intel CPUs.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-10 18:52:32
Interesting. AMD has all of the instruction sets that the i7 has. I wondered if maybe the binaries were built with an Intel compiler, which is known to kneecap other processors on purpose.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Garf on 2019-09-10 19:00:35
The code is hand written so that's not an issue here.

Most time is spent in this loop:
https://github.com/dbry/WavPack/blob/master/src/pack_x64.asm#L599-L633

But nothing jumps out as being an issue on Ryzen.

What's the effective clock speed while the wavpack binary is running? Is it possible the Intel machine is getting a much higher boost?
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-10 19:09:53
The issue, if I recall correctly, with Intel's compiler is that it will not generate MMX, SSE, etc. for any processor other than Intel, even if they support those instructions. It's not about how the source is written, but the machine code generated by the Intel compiler.

I'm looking through the sources right now to see if there's something checking for an Intel CPU instead of checking for MMX support.

On the i7 machine I'm getting 3.7 GHz during encoding. When I get to the other machine I'll get you the actual clock running the wavpack process. I don't suspect this is the issue since the benchmarks on machine #1 far surpass the same benchmarks on machine #2. If there were a boost clock issue I wouldn't think that would be the case.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Garf on 2019-09-10 19:45:35
Quote
It's not about how the source is written, but the machine code generated by the Intel compiler.

Let me repeat this once again because it clearly didn't register: the relevant part of the source code is written directly in machine language by hand. There is nothing for the compiler to compile. The code IS MMX code.

The 64-bit binary assumes MMX because that's guaranteed to be available in AMD64 mode. It has no need to detect the CPU vendor to know it can use MMX.

Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-10 19:53:04
Quote
Let me repeat this once again because it clearly didn't register: the relevant part of the source code is written directly in machine language by hand. There is nothing for the compiler to compile. The code IS MMX code.

You're not repeating yourself, you're clarifying your earlier statement:

Quote
The code is hand written so that's not an issue here.

Any ideas on tracking this down other than checking the obvious (i.e., Ryzen dropping into low frequencies while running wavpack)?
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Garf on 2019-09-10 20:01:11
Quote
Any ideas on tracking this down other than checking the obvious (i.e., Ryzen dropping into low frequencies while running wavpack)?

A precise enough profiler (e.g. Linux perf) might be able to show if there's a particular hotspot that is unexpected (and that might be compared to the same output on an Intel system).

I wonder if it's worth compiling the WavPack code with the assembler code disabled to see what it does to performance (likely worse, but you never know...)
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-10 20:03:49
Quote
I wonder if it's worth compiling the WavPack code with the assembler code disabled to see what it does to performance (likely worse, but you never know...)

I'm happy to test that, but trying to avoid downloading MSVC. I wish it was easy to build on Cygwin.

The weird thing is that Process Explorer shows that wavpack is only one thread using about 11.5% CPU on the i7 system the whole time. It doesn't seem like it's constrained by computational power.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: lvqcl on 2019-09-10 23:46:12
What's weird? Almost all audio encoders are single-threaded, AFAIK.
So yes, wavpack is constrained by single-threaded performance.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-11 00:03:55
Quote
What's weird? Almost all audio encoders are single-threaded, AFAIK.
So yes, wavpack is constrained by single-threaded performance.

It's weird that it's not even using 100% of the single core it's running on.

Anyways, I broke down and downloaded the 2GB of Visual Studio and built wavpack without the ASM. It STILL runs faster on this i7-4770 than the assembly-enabled version runs on the Ryzen 3900X ... by about 30%.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: m14u on 2019-09-11 00:30:53
[offtopic]
[...] Almost all audio encoders are single-threaded, AFAIK.[...]
TAK, qaac...
[/offtopic]
@TS - try a Linux live CD to rule out the Windows 10 factor (hypothetical).
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-11 03:37:04
Set wavpack affinity for core #0 on the Ryzen 3900X and core #0 boosts to 4.5GHz while running wavpack. Still 60% slower than the i7-4770 running at 3.7GHz. Clock-for-clock, that makes the i7 almost precisely twice as fast.

I tried the non-asm version of wavpack I built and it's even slower than the asm-enabled version.
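
The clock-for-clock figure can be reproduced by scaling the run-time ratio by the clock ratio. A minimal sketch, assuming the "60% slower" means the Ryzen run takes 1.6x as long and that both chips held the quoted clocks for the whole encode; the function name is hypothetical.

```python
def cycles_ratio(time_ratio: float, clock_a_ghz: float, clock_b_ghz: float) -> float:
    """Ratio of clock cycles spent by machine A versus machine B.

    cycles = seconds * GHz, so the ratio of cycles is the ratio of
    run times multiplied by the ratio of clock speeds.
    """
    return time_ratio * clock_a_ghz / clock_b_ghz


# Ryzen 3900X at 4.5 GHz taking 1.6x as long as the i7-4770 at 3.7 GHz
ratio = cycles_ratio(1.60, 4.5, 3.7)
print(f"Ryzen spends about {ratio:.2f}x the cycles of the i7 on the same job")
```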
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: saratoga on 2019-09-11 04:42:54
Quote
What's weird? Almost all audio encoders are single-threaded, AFAIK.
So yes, wavpack is constrained by single-threaded performance.
It's weird that it's not even using 100% of the single core it's running on.

Since you have 8 hardware threads, and you're getting 1/8th utilization from a single threaded program, it sounds like everything is normal.

Quote
Anyways, I broke down and downloaded the 2GB of Visual Studio and built wavpack without the ASM. It STILL runs faster on this i7-4770 than the assembly-enabled version runs on the Ryzen 3900X ... by about 30%.

VS has a profiler built in: https://docs.microsoft.com/en-us/visualstudio/profiling/beginners-guide-to-performance-profiling?view=vs-2019

You could profile on both of your systems and see what the difference is.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: sven_Bent on 2019-09-11 04:48:30
Quote
What's weird? Almost all audio encoders are single-threaded, AFAIK.
So yes, wavpack is constrained by single-threaded performance.
It's weird that it's not even using 100% of the single core it's running on.

It's because you don't understand what the numbers represent.
Your CPU has 8 logical cores.
100% of 1/8 = 12.5%
So yes, it's using close to 100% of a core, but that is only close to 12.5% of the CPU.
11.5% of your CPU = 92% of a core

So that does not seem weird at all for a single threaded process
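
The same arithmetic as a small sketch, assuming the tool reports utilization as a share of all logical cores combined (8 on the i7-4770 with Hyper-Threading); the function name is made up for illustration.

```python
def per_core_utilization(total_percent: float, logical_cores: int) -> float:
    """Convert whole-CPU utilization into utilization of a single logical core."""
    return total_percent * logical_cores


logical_cores = 8                      # i7-4770: 4 cores x 2 threads
max_share = 100.0 / logical_cores      # most one thread can ever show
core_load = per_core_utilization(11.5, logical_cores)

print(f"One saturated thread shows at most {max_share}% of the CPU")
print(f"11.5% of the CPU is {core_load}% of one logical core")
```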



Going back to your Ryzen: some of the performance you are missing could also be coming from CCX jumping, which results in large cache latency penalties.
Try using affinity to lock it to just one of your CCXs.
Avoiding CCX jumping has been shown to give up to a 23% FPS boost in games, so I would think it could give some boost for something even more CPU bound.

The Ryzen CCX layout really hurts thread performance if you don't take control of your threads.
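
A sketch of how an affinity bitmask for a single CCX could be built. It assumes the 3900X layout of 3 cores per CCX with SMT enabled, and that the OS numbers SMT siblings adjacently so the first CCX is logical CPUs 0 to 5; both the numbering and the helper names are assumptions, so check the real topology before relying on the mask.

```python
def ccx_logical_cpus(ccx_index: int, cores_per_ccx: int = 3, threads_per_core: int = 2) -> range:
    """Logical CPU numbers belonging to one CCX (assumes contiguous numbering)."""
    per_ccx = cores_per_ccx * threads_per_core
    start = ccx_index * per_ccx
    return range(start, start + per_ccx)


def affinity_mask(logical_cpus) -> int:
    """Build a bitmask with one bit set per allowed logical CPU."""
    mask = 0
    for cpu in logical_cpus:
        mask |= 1 << cpu
    return mask


mask = affinity_mask(ccx_logical_cpus(0))
print(f"Affinity mask for CCX 0: {mask:X}")
```

A hex mask like this is the kind of value that affinity settings such as cmd.exe's `start /affinity` take.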
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-11 05:33:37
Quote
It's because you don't understand what the numbers represent.

I understand what they mean. I think some people are misled because they're thinking 11.5% = 12.5%. That's like Intel floating point math. The i7-4770 is 4 cores, with 4 pretend cores. 100% utilization of a core should be at least 12.5%. 11.5% means it's not using 100% of the core. If there's no I/O bottleneck, what's the deal with the 8% speed penalty?

Quote
Going back to your Ryzen: some of the performance you are missing could also be coming from CCX jumping, which results in large cache latency penalties.
Try using affinity to lock it to just one of your CCXs.
Avoiding CCX jumping has been shown to give up to a 23% FPS boost in games, so I would think it could give some boost for something even more CPU bound.

The Ryzen CCX layout really hurts thread performance if you don't take control of your threads.

That's why I was saying I locked the affinity to core #0 to no effect. The fact that it's close enough to exactly 2x clock-for-clock to be within the margin of calculation error got me to thinking about the way the first generation Ryzen processors did two AVX128 to simulate AVX256, which incurred a 50% speed penalty relative to Intel CPUs with AVX256, but this is different. This is the first program I've noticed this issue with, so there's something peculiar about that loop that really hurts on Ryzen for some reason, even with double the RAM at twice the speed.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: saratoga on 2019-09-11 05:39:34
Quote
It's because you don't understand what the numbers represent.
I understand what they mean. I think some people are misled because they're thinking 11.5% = 12.5%. That's like Intel floating point math. The i7-4770 is 4 cores, with 4 pretend cores. 100% utilization of a core should be at least 12.5%. 11.5% means it's not using 100% of the core.

That is normal sampling error.  If you want the exact stats you'll need to log them yourself using a profiler.

Also, you don't have 4 real and 4 pretend cores.  You have 4 cores, each of which is 2 threads wide. 
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-11 14:58:15
If I profile wavpack release version, I don't get the function names, but if I profile a debug build it's like 75% slower and might be masking the speed issue. Not sure which is better. I'm not knowledgeable enough to interpret the results, but this is what I see profiling the release version on the i7-4770:

(https://i.imgur.com/48Fvdk8.png)

And this is with the debug version:

(https://i.imgur.com/BZ1yDFm.png)
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Peter on 2019-09-11 15:16:16
Thanks for sharing the results.

Unfortunately, results from the debug build are likely not meaningful because debug build should have no optimizations whatsoever, as well as additional debug checks adding overhead.

In order to get function names in release build, you need to enable debug symbols in release build, rebuild and run again.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-11 16:30:08
Quote
In order to get function names in release build, you need to enable debug symbols in release build, rebuild and run again.

Not sure how to do that with Visual Studio. I'll try and figure it out, but if someone knows how it would probably save a lot of time figuring it out.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-11 16:45:45
Got it. It ran at full speed. Profiler report looks like this:

(https://i.imgur.com/SBFAZS3.png)

If I'm reading things right, the most time is spent on the:

Code: [Select]
pmaddwd mm1, mm5

instruction that appears in a couple of places. But this is the fast i7-4770 machine. Now I need to see what's happening on the 3900X.
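
For readers unfamiliar with the instruction: the MMX form of pmaddwd multiplies four pairs of signed 16-bit words and adds adjacent products into two signed 32-bit results. The following is a plain-Python emulation of that behaviour, not WavPack's code; the function name and the list-based register representation (words listed low to high) are assumptions made for illustration.

```python
def pmaddwd(a_words, b_words):
    """Emulate MMX pmaddwd on two 64-bit registers.

    Each argument is a list of 4 signed 16-bit integers. The result is a
    list of 2 signed 32-bit integers: a0*b0 + a1*b1 and a2*b2 + a3*b3.
    """
    assert len(a_words) == len(b_words) == 4
    result = []
    for i in range(0, 4, 2):
        total = a_words[i] * b_words[i] + a_words[i + 1] * b_words[i + 1]
        total &= 0xFFFFFFFF                # keep the low 32 bits, as the hardware does
        if total & 0x80000000:             # reinterpret as signed 32-bit
            total -= 1 << 32
        result.append(total)
    return result


print(pmaddwd([1, 2, 3, 4], [5, 6, 7, 8]))
```

The only case that wraps is when all four words of a pair are -32768, where the sum of products is 2^31 and comes out as -2147483648.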
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-11 20:36:14
It occurs to me that this information may not be helpful. The 3900X could be 10X slower but still spend the same percentage of time in the same loops. Is there a way to quantify the actual number of instructions or the amount of time to execute an instruction using this profiler? Is that what the numbers next to the percentages show? For example, the 18,656 next to the pack_decorr_stereo_pass_cont_common: Is that the number of times that particular loop executed in the 30.685 seconds of profiling (i.e., 608 per second)?
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: saratoga on 2019-09-12 02:25:43
Quote
It occurs to me that this information may not be helpful. The 3900X could be 10X slower but still spend the same percentage of time in the same loops. Is there a way to quantify the actual number of instructions or the amount of time to execute an instruction using this profiler?

You would need a lower level profiler (or at least I don't think VS can do it but I could be wrong) like Intel's VTune or AMD's CodeXL.  The problem with that is that you'd have to use each vendor's tool (most likely) since you want access to low level timing information and hardware counters.

However, your profiling above says that essentially the entire runtime is spent in that one bit of hand written MMX.  Unfortunately since it looks like the linker inlines, you're not able to see finer detail.  Have you tried profiling the version with MMX disabled and just letting the compiler generate its own code?

Quote
Is that what the numbers next to the percentages show? For example, the 18,656 next to the pack_decorr_stereo_pass_cont_common: Is that the number of times that particular loop executed in the 30.685 seconds of profiling (i.e., 608 per second)?

That is the number of milliseconds spent in that function and anything it calls.
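
Read that way, the figure converts directly into a share of the profiled run. A minimal sketch using the two numbers quoted above (18,656 ms inclusive time, 30.685 s total), assuming both come from the same profiling session; the function name is hypothetical.

```python
def share_of_runtime(inclusive_ms: float, total_s: float) -> float:
    """Percentage of the profiled run spent in a function and its callees."""
    return inclusive_ms / (total_s * 1000.0) * 100.0


share = share_of_runtime(18656, 30.685)
print(f"pack_decorr_stereo_pass_cont_common: {share:.1f}% of the profiled time")
```

This also explains the "608 per second" guess: 18,656 divided by 30.685 is about 608, which is the same ratio expressed per second instead of as a percentage.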
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-12 02:49:05
Looks like this on the 3900X:

(https://i.imgur.com/q0UDepp.png)

You can see the time is longer by a long shot.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-12 02:59:46
Looks like this after compiling with /Ob0 (inline expansion disabled):

(https://i.imgur.com/uUX5NST.png)

OUCH!!!

(https://i.imgur.com/zQSEjNe.png)
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: kode54 on 2019-09-12 04:07:22
Maybe this is something that would benefit from having a path for the newer FMA instructions? You know, fused multiply-add.

Maybe you could PM me an uploaded copy of your test.wav and I can see how my Ryzen 7 2700 handles it?
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Garf on 2019-09-12 14:17:27
The profile confirms that the loop identified earlier is where all the time is spent, but the per-instruction sampling doesn't look like it can be accurate. You'll likely need CodeXL or something like that to get per-instruction data.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Garf on 2019-09-12 16:05:09
Samuel Neves has pointed out that the loop in question is *faster* on Ryzen than it is on Haswell:
https://gcc.godbolt.org/z/ZhKJkE

Assuming everything is in cache! Some bad cache interaction or access pattern that is fooling the prefetcher?

If you can reproduce it on some .wav that you can share, I can have a look on my machine.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: GeezerHz on 2019-09-12 18:23:36
I see the same thing no matter what wav file I'm processing. It's not peculiar to any particular wav. Everything I've been processing (over 150 of them so far) is a 500 - 800 MB 44.1 kHz 16-bit wav from a CD. If other people are not seeing the same issue on their Ryzen machines, then something else is going on. I'm just not sure how that could be when the one core is fully-loaded and running at 4.5GHz. That's as fast as she can go.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Garf on 2019-09-13 12:48:14
I tested a 460M WAV file with the settings you gave on my 3900X.

The file encoded in 76s, at 4 GHz (the machine was loaded so no 4.5GHz boost...).

A friend with an i7-4771 tested the same file.

The file encoded in 92s, at 3.7 GHz.

That's both using the release binary from the WavPack site. My conclusion is that OP fucked up and you aren't using the same settings or binary on both machines, i.e. that the i7-4770 machine is using a lower compression preset. Or you magically have a Haswell machine that is twice as fast as anyone else's...
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: Zao on 2019-09-13 13:52:07
Ran a quick test with the hardware I had up and running:


743MB WAV, converted from a 320kbps MP3 with foobar2000
http://splendidbeats.com/psilicon-dreams-nov-2009/
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: saratoga on 2019-09-14 00:23:59
Quote
Maybe this is something that would benefit from having a path for the newer FMA instructions? You know, fused multiply-add.

Maybe you could PM me an uploaded copy of your test.wav and I can see how my Ryzen 7 2700 handles it?

AVX2, I think, since lossless is mostly integer. Optimization for modern instruction sets would be nice in a lot of encoders.
Title: Re: Wavpack.exe Slow on AMD Ryzen
Post by: kode54 on 2019-09-16 05:00:08
Only not-nice for Intel processors, where AVX/AVX2/AVX512 actually push the CPU's TDP to 150-200%. Still plenty nice for AMD processors that support the relevant instruction sets, which is AVX and AVX2.