MMX optimised WavPack encoder

This is just a quick hack using GCC's built-in functions, so it's not really hand-optimised yet. Sorry for posting these first attempts already, but I'll be too busy to think about it for the next four weeks.

My changes affect the encoding speed of the extra modes when operating on stereo files. On my old AMD K6 (200 MHz) I noticed a speedup of 12-17%. I'm not sure whether it's worth spending more work on this. It would be nice if some people could post their own test results here, so that we can see how it performs on recent processors.

You can download the package mmx.tar.gz, which contains two Linux binaries (plain and MMX version) and a source diff - maybe someone could compile a Windows binary with MinGW?

Please test with 'wavpack -f -x' etc. and compare the compressed files. They must be identical. I'm not sure whether this works for 32-bit files too.

Thanks
Jo.

Reply #1
If this is true (a 12-17% increase!), lossless compression has really been improving recently!

Reply #2
This is very cool! I will download your sources and try some of this out myself (and peek at the code), hopefully this weekend. Thanks for looking into this!

I have been meaning to play around with x86 hand-optimization myself, but just haven't gotten the chance. I did assembly optimizations for the ColdFire and ARM CPUs for Rockbox, and the improvement was pretty significant (at least 25% faster decoding).

However, the parts I would look at first are the basic decorrelation loops in the pack.c and unpack.c modules. This is where the real time is taken, and the results could be folded back into the -x mode stuff eventually, if desired.

As for the -x mode, it was originally just an experimental mode to find out what kind of maximum compression might be achievable. I never really intended for it to remain in the code, essentially untouched for years! I'm sure that it could be optimized to be much faster by changing the algorithm around rather than using assembly language.

In fact, one of the improvements I have in mind for the next release is trying to get some portion of the -x mode improvement (maybe half...?) built into the standard modes without any encode speed hit. If I have good luck with that I might be able to retire the -x mode for only special cases.

Reply #3
[quote author=Shade[ST| link=msg=382208 date=1144976418]If this is true (12-17% increase!), lossless compression is really improving, recently! [/quote]Wow, I agree. Yalac is looking very interesting as a future prospect; there's MP4ALS, which I know little about; Garf's FLAC improvements; and now we hear that we may get some improvements in WavPack's speed and compression rates for "free"!

Good news all round. Best of luck with it, David, and possibly thanks to he-jo.
I'm on a horse.

Reply #4
[quote]This is very cool! I will download your sources and try some of this out myself (and peek at the code), hopefully this weekend. Thanks for looking into this!

...

As for the -x mode, it was originally just an experimental mode to find out what kind of maximum compression might be achievable. I never really intended for it to remain in the code, essentially untouched for years! I'm sure that it could be optimized to be much faster by changing the algorithm around rather than using assembly language.

In fact, one of the improvements I have in mind for the next release is trying to get some portion of the -x mode improvement (maybe half...?) built into the standard modes without any encode speed hit. If I have good luck with that I might be able to retire the -x mode for only special cases.[/quote]


Please don't waste your time with my dirty hacks! Yes, if you already have some algorithmic improvements in mind, it would be better to implement them first. I'm looking forward to trying some tuning on your next release.

Reply #5
[quote author=he-jo link=msg=382199 date=1144973417]This is just a quick hack by using GCC's built-in functions, so it's not really hand optimised yet. Sorry, but I already post these first attempts, because the next four weeks I'll be too busy to think about it.  [/quote]

You don't really have to write low-level asm code anymore. At least if you use gcc 4.x, it schedules intrinsics really well and you will have much less hassle than with asm code. (Especially if you decide to use -fPIC or the like, gcc will handle register allocation, so there is no register starvation.) Furthermore, the code will just work on x86 *and* x86_64.

If you want to make your asm code more "portable", i.e. let other compilers compile it, take a look at the OpenAL portable module. I did some "magic" in a header file to achieve this.

gcc 3.x is still faster with its own built-ins than with the Intel-style intrinsics, which would otherwise be more portable, so I decided to use the gcc built-in style. But the code compiles with gcc, icc and msvc++ (current express edition).


Reply #7
I uploaded a new MMX patch (wp-4.32-jh_mmx.diff.gz), which converts the entire function 'decorr_stereo_pass' of the "extra" modes. But I'm afraid the speedup (if any) over the previous version isn't that noticeable: on my AMD K6 with Linux and gcc 4.0.2, it seems to be only 4%. If I remember right, my former tests showed that the loops for the negative 'term' values run less often and with fewer iterations. So the reason for the speedup could also be the removal of the unions, as suggested by PrakashP.

wisodev will probably provide MSVC binaries soon. I'm very interested in test results on different processors.

Thanks for your support!
Jo.


Reply #9
I just tested your binary on a Celeron and found that there must be a bug in your code - the output files differ. If you (or someone else) want to test this with Linux (GCC 4 required) you could do something like
Code:
cd wavpack-4.32
./configure --disable-shared
make CFLAGS='-pipe -O3 -march=athlon-xp'
mv wavpack wavpackORI
zcat wp-4.32-jh_mmx.diff.gz | patch -p0
make CFLAGS='-pipe -O3 -march=athlon-xp -mmmx'
mv wavpack wavpackMMX

You'll have two binaries: 'wavpackORI' and 'wavpackMMX'. Run some tests like 'time ./wavpackORI -q -f -x6 -o t_ori.wv test.wav' and 'time ./wavpackMMX -q -f -x6 -o t_mmx.wv test.wav'. Compare the output files with 'cmp t_ori.wv t_mmx.wv'.
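
For testers who don't have 'cmp' at hand, the same byte-for-byte check can be sketched in a few lines of C. This is only a minimal illustration of the comparison step; the function name files_identical is made up for this example and is not part of WavPack or the patch:

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if both files have identical contents, 0 if they differ,
   and -1 if either file cannot be opened. */
static int files_identical(const char *path_a, const char *path_b)
{
    assert(path_a != NULL && path_b != NULL);

    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = 1;

    if (fa == NULL || fb == NULL) {
        result = -1;
    } else {
        int ca, cb;
        do {
            ca = fgetc(fa);
            cb = fgetc(fb);
            if (ca != cb) {          /* also catches one file ending early */
                result = 0;
                break;
            }
        } while (ca != EOF);
    }

    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return result;
}
```

Calling files_identical("t_ori.wv", "t_mmx.wv") should return 1 when the MMX build produces the same output as the original build.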

It would be good if you could also test on larger files again. To me it seems that they benefit a bit more from the new MMX code than the files I suggested for the "jfl2b" tests.

Reply #10
OK, I will do tests on Linux (SUSE LINUX 10.1, gcc 4.1.0) and check the output files. I compiled 4.32 some time ago, but only to check whether it works.

The latest version was not tested for identical output (binary comparison), but this time I will carefully do binary comparison tests.

Besides, the problem (on my side) was that this new patch was very different from the previous sources, and I have to do some more tweaking of the extra2.c code. You can compare your sources with my version to see what I have changed; maybe you can find a solution. Anyway, I will also check the code.

Reply #11
Could you please check whether this patch helps for MSVC? The first hunk just removes duplicate defines.

Code:
--- extra2.c
+++ extra2.c
@@ -57,8 +57,6 @@
 #else
    #include <mmintrin.h>
    typedef __m64 int_mmx;
-    #define __builtin_ia32_pslld(m1, m2) _m_pslld(m1, m2)
-    #define __builtin_ia32_psubd(m1, m2) _m_psubd(m1, m2)
    #define __builtin_ia32_psrld(m1, m2) _m_psrld(m1, m2)
    #define __builtin_ia32_pand(m1, m2) _m_pand(m1, m2)
    #define __builtin_ia32_pmaddwd(m1, m2) _m_pmaddwd(m1, m2)
@@ -74,7 +72,7 @@
    #define __builtin_ia32_punpckldq(m1, m2) _m_punpckldq(m1, m2)
    #define __builtin_ia32_paddsw(m1, m2) _m_paddsw(m1, m2)
    #define __builtin_ia32_emms() _mm_empty()
-    #define set_int_mmx(m1, m2) _mm_set_pi32(m1, m2)
+    #define set_int_mmx(m1, m2) _mm_set_pi32(m2, m1)
 #endif // __GNUC__
 
 // MMX optimized decorr_stereo_pass (-x switch only for stereo input files).

Reply #12
Earlier it was this:
Code:
#if __GNUC__ && !__INTEL_COMPILER
        const int_mmx
            delta = { dpp->delta, dpp->delta },
            msk0 = { 0x7fff, 0x7fff },
            msk1 = { 0xffff, 0xffff },
            round = { 512, 512 },
            zero = { 0, 0 };
#else // NO GCC
        const int_mmx
            delta = _mm_set_pi32(dpp->delta, dpp->delta),
            msk0 = _mm_set_pi32(0x7fff, 0x7fff),
            msk1 = _mm_set_pi32(0xffff, 0xffff),
            round = _mm_set_pi32(512, 512),
            zero = _mm_set_pi32(0, 0);
#endif // __GNUC__


Now it is:
Code:
    const int_mmx
        delta = set_int_mmx(dpp->delta, dpp->delta),
        fill = set_int_mmx(0x7bff, 0x7bff),
        msk0 = set_int_mmx(0x7fff, 0x7fff),
        msk1 = set_int_mmx(0xffff, 0xffff),
        round = set_int_mmx(512, 512),
        zero = set_int_mmx(0, 0);
    int_mmx
        sum_AB = set_int_mmx(0, 0),
        weight_AB = set_int_mmx(
            restore_weight (store_weight (dpp->weight_A)),
            restore_weight (store_weight (dpp->weight_B))
        ),
        left_right, sam_AB, tmp0, tmp1, samples_AB [MAX_TERM];


set_int_mmx is mostly called with two identical arguments (values), so swapping m1 and m2 doesn't change anything there.
The only difference is with weight_AB, where A and B are swapped!

But I will try this patch.
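
Why the argument order matters can be modelled in plain C. The sketch below assumes (the thread does not show this) that the GCC build fills the vector with an initializer list like { m1, m2 }, where the first value lands in the low 32 bits. Intel's _mm_set_pi32 takes its arguments the other way round, high half first. The pair32 struct and all function names are hypothetical stand-ins for illustration, not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Two 32-bit lanes of a 64-bit MMX value; lane[0] is the low half. */
typedef struct { int32_t lane[2]; } pair32;

/* Models a GCC vector initializer { a, b }: first value -> low lane. */
static pair32 from_initializer(int32_t a, int32_t b)
{
    pair32 v = { { a, b } };
    return v;
}

/* Models _mm_set_pi32(hi, lo): FIRST argument -> high lane. */
static pair32 model_set_pi32(int32_t hi, int32_t lo)
{
    pair32 v = { { lo, hi } };
    return v;
}

/* Mapping before the patch: arguments passed straight through. */
static pair32 set_int_mmx_unswapped(int32_t m1, int32_t m2)
{
    return model_set_pi32(m1, m2);
}

/* Mapping after the patch: set_int_mmx(m1, m2) -> _mm_set_pi32(m2, m1). */
static pair32 set_int_mmx_swapped(int32_t m1, int32_t m2)
{
    return model_set_pi32(m2, m1);
}

static void check_lane_order(void)
{
    pair32 gcc_style = from_initializer(11, 22);
    pair32 before = set_int_mmx_unswapped(11, 22);
    pair32 after = set_int_mmx_swapped(11, 22);

    /* Unswapped: the two values end up in the wrong lanes. */
    assert(before.lane[0] == 22 && before.lane[1] == 11);
    /* Swapped: same layout as the initializer-list form. */
    assert(after.lane[0] == gcc_style.lane[0]);
    assert(after.lane[1] == gcc_style.lane[1]);
}
```

For constants such as round or zero both lanes hold the same value, so either order gives the same vector; only a pair with two different values, like weight_AB, exposes the difference.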


Reply #14
As I am going to be away from the computer for a few days, I have already uploaded an updated patch (wp-4.32-jh_mmx.diff.gz). It contains some code cleanup and other minor changes, which make it a bit faster on some of my test files. Maybe someone wants to give it a try.

I actually plan to apply some more changes, but this could take some time, because I now carefully evaluate each step. The next release will hopefully also be a bit more "wisodev friendly".

wisodev, could you please change your defines to use those shift instructions with a constant argument, like
Code:
#define __builtin_ia32_pslld(m1, m2) _m_pslldi(m1, m2)
#define __builtin_ia32_psrld(m1, m2) _m_psrldi(m1, m2)
#define __builtin_ia32_psrad(m1, m2) _m_psradi(m1, m2)
(GCC has only one form and chooses the right one automatically). It makes the code a bit more compact, and reduces register load. This way you also shouldn't need to change my part of the sources.
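
A rough sketch of the two shift forms side by side, in case it helps. It uses _m_pslld and _m_pslldi from the defines above plus a few other standard mmintrin.h intrinsics; the helper names (store_lanes, shl3_immediate, shl3_register, check_shift_forms) are made up for this example, and a scalar fallback is included only so that the sketch also builds where MMX is unavailable:

```c
#include <assert.h>
#include <stdint.h>

#if defined(__MMX__)
#include <mmintrin.h>

/* Splits a 64-bit MMX value into its two 32-bit lanes (out[0] = low). */
static void store_lanes(__m64 v, int32_t out[2])
{
    out[0] = _mm_cvtsi64_si32(v);
    out[1] = _mm_cvtsi64_si32(_mm_srli_si64(v, 32));
}

/* Immediate form: the shift count is a constant argument. */
static void shl3_immediate(const int32_t in[2], int32_t out[2])
{
    __m64 v = _mm_set_pi32(in[1], in[0]);   /* in[0] goes to the low lane */
    store_lanes(_m_pslldi(v, 3), out);
    _mm_empty();                            /* leave MMX state before returning */
}

/* Register form: the shift count has to be loaded into a second __m64. */
static void shl3_register(const int32_t in[2], int32_t out[2])
{
    __m64 v = _mm_set_pi32(in[1], in[0]);
    __m64 count = _mm_cvtsi32_si64(3);      /* the count occupies another register */
    store_lanes(_m_pslld(v, count), out);
    _mm_empty();
}
#else
/* Scalar fallback with the same results, for builds without MMX. */
static void shl3_immediate(const int32_t in[2], int32_t out[2])
{
    out[0] = (int32_t)((uint32_t)in[0] << 3);
    out[1] = (int32_t)((uint32_t)in[1] << 3);
}

static void shl3_register(const int32_t in[2], int32_t out[2])
{
    shl3_immediate(in, out);
}
#endif

static void check_shift_forms(void)
{
    const int32_t in[2] = { 5, -7 };
    int32_t a[2], b[2];

    shl3_immediate(in, a);
    shl3_register(in, b);
    assert(a[0] == 40 && a[1] == -56);      /* each lane shifted left by 3 */
    assert(b[0] == a[0] && b[1] == a[1]);   /* both forms agree */
}
```

Both forms give the same result for a constant count; the difference is only that the register form needs the count in a second MMX register.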

Thank you very much for your work! It really helps a lot. I would like to encourage more people to publish their test results.


Have a nice weekend!
Jo.

Reply #15
OK, I will add this to the code and the new patch, and post results, sources and binaries on Monday!

---

The bug was fixed; your diagnosis was correct! I have done some more tests and found interesting things. I will run more tests in the near future on a lot of different files. he-jo, please take a look at the results. The fixed sources, test results and binaries are in the upload thread.

---

Have a nice weekend too!!!

 

Reply #16
A new patch is available: wp-4.32-jh_mmx.diff.gz

I applied a small change to the code (very minor speedup) and converted everything to Intel intrinsics, since they are a kind of standard. Tests showed that they are slower with GCC 4.0, so I included some defines to get the best results (like wisodev did). It also compiles with GCC 3.4 now.

To compile with the MMX code, you need to define OPT_MMX - e.g. when using gcc run make CFLAGS='-O3 -mmmx -DOPT_MMX'
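
How such a define typically gates the code can be sketched as follows. This only illustrates the general pattern and is not the layout of the actual patch; decorr_code_path and check_code_path are hypothetical names. The __MMX__ macro is the one GCC predefines when MMX code generation is enabled, for example by -mmmx:

```c
#include <assert.h>
#include <string.h>

/* Catch a build that asks for the MMX code without enabling MMX. */
#if defined(OPT_MMX) && defined(__GNUC__) && !defined(__MMX__)
#error "OPT_MMX is set but MMX is not enabled; add -mmmx to CFLAGS"
#endif

/* Hypothetical helper: reports which code path this build selected. */
static const char *decorr_code_path(void)
{
#ifdef OPT_MMX
    return "mmx";      /* the MMX routines would be compiled in here */
#else
    return "plain";    /* the unmodified C routines */
#endif
}

static void check_code_path(void)
{
    const char *path = decorr_code_path();
    assert(strcmp(path, "mmx") == 0 || strcmp(path, "plain") == 0);
}
```

Without -DOPT_MMX the plain branch is compiled, so a build with default flags should behave exactly like the unpatched sources.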

This patch should work with MSVC without modifications. wisodev, please have a look at it.

Thanks,
Jo.

 