optimised WavPack encoder

Hi,

I've had a short look at the sources of WavPack 4.31 and tried to tune it a bit on my iMac G5 by avoiding expensive instructions such as conditional branches. Some of my changes may be of interest for the mainline too, but I achieved the highest gain by writing a log2 function in optimised PowerPC assembly, so this hack is mostly useless for Intel users.

On average I noticed a speedup of about 20% compared to the original encoder, e.g. the time for encoding a 64.5 MiB WAV file with 'wavpack -q -hx' went down from 3:44 min to 2:57 min.

You can find the patch and a profiled GCC 4.0.1 build for Mac OS X at http://base91.sourceforge.net/download/wavpack/ (this binary should work on a G3 and higher; run 'chmod 555 wavpack' after gunzip).
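
For readers who want a feel for this kind of change, here is a minimal sketch. It is not the actual patch, and WavPack's own log2 routine is not reproduced here. It only computes the integer part of log2 (the position of the highest set bit) without a loop, using the GCC/Clang extension `__builtin_clz`, which on PowerPC typically compiles to the `cntlzw` instruction. The function names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: floor(log2(x)) for x != 0, without a loop.
 * __builtin_clz is a GCC/Clang extension and is undefined for 0. */
static int floor_log2_clz(unsigned int x)
{
    assert(x != 0);
    return (int)(sizeof x * CHAR_BIT) - 1 - __builtin_clz(x);
}

/* Portable reference: one conditional branch per bit examined. */
static int floor_log2_loop(unsigned int x)
{
    int n = -1;
    while (x != 0) {
        x >>= 1;
        ++n;
    }
    return n;
}
```

Both functions return the same value for every non-zero input; the first one simply trades the loop for a single bit-scan instruction, which is only a win where that instruction is cheap.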

optimised WavPack encoder

Reply #1
Do you have (or know where I can get) a Universal Binary version of WavPack? I run an Intel-based iMac and would prefer a WavPack build with native Intel support for the extra speed. A Universal Binary would contain both PowerPC and Intel code.

Thanks!


optimised WavPack encoder

Reply #2
Do contribute the changes back to David if you haven't already, thanks.

optimised WavPack encoder

Reply #3
Yes, I could provide a Universal Binary, but first I want to see whether I can tune the encoder for Intel processors too. I hope to have some test results soon.

optimised WavPack encoder

Reply #4
Now I've ported my changes to x86. Since I only have an AMD K6 (200 MHz) for testing, I cannot say for sure how this will behave on a modern Intel processor. I'm afraid that it won't be as beneficial as on PowerPC: my code uses instructions that were very expensive on older processors.

Could somebody please test my binaries on a Pentium (3/4/M etc.) or Athlon? I uploaded two packages to http://base91.sourceforge.net/download/wavpack/:

linux-x86.tar.gz
macosx-x86.tar.gz

Choose the right one for your OS (sorry, Windows users). After unpacking you'll find two binaries: 'wavpackA' and 'wavpackB'. I would be very grateful if you could tell me which one encodes faster on your system. Please also compare the output files; they must be identical.
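
If no checksum tool is at hand, the comparison of the two output files can be sketched in a few lines of C. This is only an illustration: `streams_identical` is a made-up name, and in practice md5sum or cmp does the same job.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if both streams deliver exactly the
 * same bytes from their current positions to end of file, else 0. */
static int streams_identical(FILE *a, FILE *b)
{
    int ca, cb;

    assert(a != NULL && b != NULL);
    do {
        ca = getc(a);
        cb = getc(b);
        if (ca != cb)
            return 0;          /* differing byte or differing length */
    } while (ca != EOF);

    /* Both returned EOF together; treat a read error as "not identical". */
    return !ferror(a) && !ferror(b);
}
```

One would open the two encoder outputs with fopen in "rb" mode and pass both handles to the function.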

optimised WavPack encoder

Reply #5
Cool... I hope Bryant catches wind of this soon.

 

optimised WavPack encoder

Reply #6
wavpackB is about 15% faster than vanilla wavpack, which I compiled with CFLAGS="-O2 -march=athlon-xp", on my system (Athlon XP 2500+ Barton). wavpackA is slightly slower than wavpackB (3 seconds slower on a 4-minute encode with -hx).

Edit: the speed gain is about the same with wavpack -hx6 (21m 38s vs. 25m 23s).
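
Since "x% faster" can mean either less time or more throughput, here is a small sketch of both calculations, applied to timings quoted in this thread (3:44 vs. 2:57 from the first post, 25m 23s vs. 21m 38s from this one). The helper names are made up for illustration.

```c
#include <assert.h>

/* Convert a "minutes:seconds" reading to seconds. */
static int to_seconds(int minutes, int seconds)
{
    assert(minutes >= 0 && seconds >= 0 && seconds < 60);
    return minutes * 60 + seconds;
}

/* Percentage of encoding time saved: 100 * (old - new) / old. */
static double time_saved_percent(int old_s, int new_s)
{
    assert(old_s > 0);
    return 100.0 * (old_s - new_s) / old_s;
}

/* Percentage gain in throughput: 100 * (old / new - 1). */
static double throughput_gain_percent(int old_s, int new_s)
{
    assert(new_s > 0);
    return 100.0 * ((double)old_s / new_s - 1.0);
}
```

With these definitions, 3:44 down to 2:57 is roughly 21% less time or roughly 27% more throughput, and 25m 23s down to 21m 38s is roughly 15% less time or roughly 17% more throughput.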

optimised WavPack encoder

Reply #7
This is probably a stupid question, but do you guys check whether the file produced by the optimised encoder is still bit-identical to the one produced by the original encoder?

optimised WavPack encoder

Reply #8
I did.

optimised WavPack encoder

Reply #9
Can you upload a diff of the sources for the x86 optimizations? I would like to compile WavPack for Windows and test it on an Athlon 2000+.

optimised WavPack encoder

Reply #10
I tested the Linux binaries on a P2-450 and md5summed the results to ensure they were identical.  Testing was done out of a tmpfs to keep disk cache out of the picture.

The B version is faster for me.  Not by much on -m, but -fx completed in 86% of the time the A version took to do the same file.  I only tested on one file.

optimised WavPack encoder

Reply #11
Thanks to the testers so far!

Please note that binary A was compiled from the unmodified sources, while version B contains my changes. So only the difference between both binaries matters.

On my K6, variant B was much slower. Now it seems that only processors from Intel can really benefit from my changes. It would be interesting to see how recent Athlons perform with B.

I'm currently trying to further increase the encoder speed and will provide the sources when I've found a cleaner solution. My changes currently depend on GCC extensions.

optimised WavPack encoder

Reply #12
I rewrote the x86 stuff for NASM. Since dch reported a speedup of 14% on a P2, I think it's worth adding this to the MMX optimised compile.

wisodev, could you build a Windows binary, please? You can download the package wavpack-bsr.tar.gz from http://he-jo.net/download/wavpack/. Apply the diff to the wavpack sources, assemble opt.asm with 'nasm -O2', and link everything together. You'll probably need to adjust the global labels in the asm file for Windows, but I'm sure you know what to do.

When the binary is available, please test the extra modes. I'd suggest comparing the speed against the latest MMX binaries. It would be nice if you could also verify the output files.

Thanks,
Jo.

optimised WavPack encoder

Reply #13
Sorry for the late answer, but in my country we had a very long (5 days) weekend ;-)

I will build the binaries today (if everything goes OK) and run tests.

Thanks for the update and your work.

wisodev

optimised WavPack encoder

Reply #14
Hm, I only had a 3-day weekend.

Please don't hurry! I've made a minor change to the asm code in the meantime. Today I'll probably also have the opportunity to do a MinGW build. I hope to be able to post my results later.

Anyway, thanks for your help!

optimised WavPack encoder

Reply #15

No problem! But anyway ;-) it would be nice if you posted (or updated) the modified asm source code, or just posted the changes in a reply.

optimised WavPack encoder

Reply #16
Since MinGW still uses gcc-3.4 (with inferior MMX builtins), I have just built a binary with my latest asm changes. You'll find everything you need in the file 'wavpack-bsr-zip' on http://he-jo.net/download/wavpack/.

There are two binaries in this package: wavpackA.exe (built from the original sources) and wavpackB.exe (with my asm optimisations). I was able to run a quick test with '-f -x6' on some kind of Celeron machine and measured a speedup of about 12%. It would be nice if somebody could test this on an Athlon processor.

I'm sure wisodev will provide a binary which will also include the MMX optimisations.

optimised WavPack encoder

Reply #17
Well, I have included your asm optimizations (not from wavpack-bsr-zip, but the previous version) in the MMX version and done some quick tests. I am not sure of the results yet: the output files are the same as from the original version (binary comparison), but the speedup is only 1% to 3% relative to the MMX version. One more thing: the -O2 switch was slower than -O1 with NASM. As I said, more testing is needed.

I will compare your build to mine and find out the best solution. I will post results and binaries (including sources) later today.

wisodev

optimised WavPack encoder

Reply #18
Did you test on an AMD processor? That would explain the small speedup. At least I can be happy that it isn't slower. This code probably gives the best results on Intel's 6th-generation processors (Pentium Pro/2/3); I'm not sure about Pentium M and Core Duo. Could somebody please run a test on a Pentium 4?

Btw.: If you want to do exact timing, you can use the command line tool 'timer' from http://7-zip.org/igor.html and calculate the speed differences from the reported 'User Time'.

optimised WavPack encoder

Reply #19

Yes, on an AMD Athlon XP 2000+ with WinXP SP2.

I have uploaded my binaries here. This package contains the original, the latest MMX-optimized and the MMX-BSR-optimized binaries, plus source code and some quick tests on my PC.

It looks like your binaries are slower than mine, but this is a compiler issue.

Thanks for the tip about timing, this will be very useful.

wisodev

optimised WavPack encoder

Reply #20
I tested the MMX version (the intrinsics version; the NASM stuff only seems to have the BSR optimisation, and I haven't tested it) on Linux using gcc-4.1. It is not faster for me on my Athlon XP (at least without additional parameters for wavpack). Well, the MMX code only parallelises by 2, and MMX has higher latency on the Athlon than on the P3 core, which explains it, especially as the MMX code needs quite a few instructions to emulate the 32-bit multiply.
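
To illustrate what "emulating the 32-bit multiply" involves, here is a scalar sketch of my reading of the MMX patch quoted later in this thread: the sample is split at bit 15, each part is multiplied by the weight with a 16-bit multiply (pmaddwd), and the parts are recombined with shifts. The function names, the assumed weight range of -1024..1024, the rounding formula and the final addition of the two partial results (which is not visible in the quoted hunks) are assumptions on my side.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: (sam * weight + 512) >> 10 using a 64-bit product.
 * Right shifts of negative values are assumed to be arithmetic. */
static int32_t apply_weight_ref(int32_t weight, int32_t sam)
{
    return (int32_t)(((int64_t)sam * weight + 512) >> 10);
}

/* Same result from two small products, as 16-bit multiplies allow. */
static int32_t apply_weight_split(int32_t weight, int32_t sam)
{
    int32_t hi, lo;

    assert(weight >= -1024 && weight <= 1024);
    assert(sam >= -(1 << 30) && sam < (1 << 30));

    hi = sam >> 15;      /* upper part, fits a signed 16-bit word   */
    lo = sam & 0x7fff;   /* lower 15 bits, always non-negative      */

    /* sam == hi * 32768 + lo, and 32768 / 1024 == 32 */
    return hi * weight * 32 + ((lo * weight + 512) >> 10);
}
```

Each output sample needs two multiplies plus several shifts, masks and adds, which fits the observation that the two-way parallelism of MMX does not buy much here.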

But I found a rather easy method to optimise run-time on Linux: compile the library statically (thus non-PIC) and link it into the executable. Run-time was immediately more than 10% faster for me (using a small test case, though).

Interestingly, this patch seems to give me a few % in the static case (but slows things down a bit in the shared-lib case). Actually the patch makes the loop a bit slower, but eliminating code seems to make the cache happier, at least for me in a quick test.

Code: [Select]
--- extra2.c	2006-04-06 06:42:25.000000000 +0200
+++ extra2-opt.c 2006-05-06 16:11:08.000000000 +0200
@@ -63,36 +63,14 @@
  dpp->samples_B [i] = exp2s (log2s (dpp->samples_B [i]));
}
 
- if (dpp->term == 17) {
- while (num_samples--) {
- int32_t left, right;
- int32_t sam_A, sam_B;
-
- sam_A = 2 * dpp->samples_A [0] - dpp->samples_A [1];
- dpp->samples_A [1] = dpp->samples_A [0];
- dpp->samples_A [0] = left = in_samples [0];
- left -= apply_weight (dpp->weight_A, sam_A);
- update_weight (dpp->weight_A, dpp->delta, sam_A, left);
- dpp->sum_A += dpp->weight_A;
- out_samples [0] = left;
-
- sam_B = 2 * dpp->samples_B [0] - dpp->samples_B [1];
- dpp->samples_B [1] = dpp->samples_B [0];
- dpp->samples_B [0] = right = in_samples [1];
- right -= apply_weight (dpp->weight_B, sam_B);
- update_weight (dpp->weight_B, dpp->delta, sam_B, right);
- dpp->sum_B += dpp->weight_B;
- out_samples [1] = right;
- in_samples += dir;
- out_samples += dir;
- }
- }
- else if (dpp->term == 18) {
+ if (dpp->term == 17 || dpp->term == 18) {
+ int term17 = dpp->term - 17;
+ int term15 = dpp->term - 15;
  while (num_samples--) {
  int32_t left, right;
  int32_t sam_A, sam_B;
 
- sam_A = (3 * dpp->samples_A [0] - dpp->samples_A [1]) >> 1;
+ sam_A = (term15 * dpp->samples_A [0] - dpp->samples_A [1]) >> term17;
  dpp->samples_A [1] = dpp->samples_A [0];
  dpp->samples_A [0] = left = in_samples [0];
  left -= apply_weight (dpp->weight_A, sam_A);
@@ -100,7 +78,7 @@
  dpp->sum_A += dpp->weight_A;
  out_samples [0] = left;
 
- sam_B = (3 * dpp->samples_B [0] - dpp->samples_B [1]) >> 1;
+ sam_B = (term15 * dpp->samples_B [0] - dpp->samples_B [1]) >> term17;
  dpp->samples_B [1] = dpp->samples_B [0];
  dpp->samples_B [0] = right = in_samples [1];
  right -= apply_weight (dpp->weight_B, sam_B);
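
The idea of the patch above, reduced to the prediction formula alone, can be sketched as follows. The function names are hypothetical; only the two formulas and the term17/term15 variables are taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

/* The two original loops: one formula per term. */
static int32_t predict_separate(int term, int32_t s0, int32_t s1)
{
    assert(term == 17 || term == 18);
    if (term == 17)
        return 2 * s0 - s1;
    return (3 * s0 - s1) >> 1;
}

/* The merged loop: multiplier and shift count derived from the term. */
static int32_t predict_merged(int term, int32_t s0, int32_t s1)
{
    int term17, term15;

    assert(term == 17 || term == 18);
    term17 = term - 17;   /* shift count: 0 for term 17, 1 for term 18 */
    term15 = term - 15;   /* multiplier:  2 for term 17, 3 for term 18 */
    return (term15 * s0 - s1) >> term17;
}
```

The merged form costs a variable multiply and a variable shift per sample instead of constants, which matches the remark that the loop itself gets a bit slower while the code gets smaller.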

optimised WavPack encoder

Reply #21
Quote
and one more thing -O2 switch was slower than -O1 with NASM, but like I said more testing is needed.

That's really strange, since shorter instructions are decoded faster. Maybe it's an alignment issue.

Quote
It looks like your binaries are slower than mine, but this is a compiler issue.

Probably - I used '-mtune=pentium3' for my builds.

Quote
I tested the MMX version (the intrinsics version; the NASM stuff only seems to have the BSR optimisation, and I haven't tested it) on Linux using gcc-4.1. It is not faster for me on my Athlon XP (at least without additional parameters for wavpack). [...]

But I found a rather easy method to optimise run-time on Linux: compile the library statically (thus non-PIC) and link it into the executable. Run-time was immediately more than 10% faster for me (using a small test case, though).

Yes, the NASM code only includes the BSR optimisation, and I always linked libwavpack statically. The normal modes of WavPack are generally not affected by my optimisations. You'll probably measure the highest speedup with '-x6'.

I'm already working on a few alternatives to the BSR code. We'll see whether I can post some positive news next week.

optimised WavPack encoder

Reply #22
Ah OK, I tried -x6 and now I see a difference, but in fact the MMX version is now slower (~20%). As I know that gcc is a bit finicky, I modified your patch, and now it is actually faster (~10%). (I cannot test reliably, as I have background processes running.)

This is not cleaned up, but the trick is to *not* use unions. gcc unfortunately treats them differently. Perhaps someone wants to test whether my version is faster for them as well (using gcc/mingw)? I hope I didn't mess anything up.
Code: [Select]
--- extra2.c	2006-05-06 23:28:07.000000000 +0200
+++ extra2mmx.c 2006-05-06 15:42:30.000000000 +0200
@@ -57,42 +57,44 @@
if (dpp->term > 0) {
  const int_mmx
  delta = { dpp->delta, dpp->delta },
- msk0 = { 0x7fff, 0x7fff },
- msk1 = { 0xffff, 0xffff },
+ msk0 = { 0x00007fffL, 0x00007fffL },
+ msk1 = { 0x0000ffffL, 0x0000ffffL },
  round = { 512, 512 },
  zero = { 0, 0 };
  int_mmx left_right, sam_AB, tmp0, tmp1;
- union {
+ /*union {
  int_mmx q [MAX_TERM];
  int d [2 * MAX_TERM];
- } samples_AB;
- union {
+ } samples_AB;*/
+ int_mmx samples_AB[MAX_TERM];
+ /*union {
  int_mmx q;
  int d [2];
- } weight_AB, sum_AB;
+ } weight_AB, sum_AB;*/
+ int_mmx weight_AB, sum_AB ={0,0};
 
- sum_AB.d [0] = 0;
- sum_AB.d [1] = 0;
- weight_AB.d [0] = restore_weight (store_weight (dpp->weight_A));
- weight_AB.d [1] = restore_weight (store_weight (dpp->weight_B));
+ //sum_AB.d [0] = 0;
+ //sum_AB.d [1] = 0;
+ *(int*)&weight_AB = restore_weight (store_weight (dpp->weight_A));
+ *((int*)&weight_AB+1) = restore_weight (store_weight (dpp->weight_B));
  for (k = 0; k < MAX_TERM; ++k) {
- samples_AB.d [k * 2] = exp2s (log2s (dpp->samples_A [k]));
- samples_AB.d [k * 2 + 1] = exp2s (log2s (dpp->samples_B [k]));
+ *((int*)&samples_AB + k * 2) = exp2s (log2s (dpp->samples_A [k]));
+ *((int*)&samples_AB + k * 2 + 1) = exp2s (log2s (dpp->samples_B [k]));
  }
 
  if (dpp->term == 17) {
  while (num_samples--) {
- sam_AB = __builtin_ia32_pslld (samples_AB.q [0], 1);
- sam_AB = __builtin_ia32_psubd (sam_AB, samples_AB.q [1]);
+ sam_AB = __builtin_ia32_pslld (samples_AB [0], 1);
+ sam_AB = __builtin_ia32_psubd (sam_AB, samples_AB [1]);
 
- samples_AB.q [1] = samples_AB.q [0];
- samples_AB.q [0] = left_right = *(int_mmx *) in_samples;
+ samples_AB [1] = samples_AB [0];
+ samples_AB [0] = left_right = *(int_mmx *) in_samples;
 
  tmp0 = __builtin_ia32_psrld (sam_AB, 15);
  tmp1 = __builtin_ia32_pand (sam_AB, msk0);
  tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
  tmp1 = __builtin_ia32_paddd (tmp1, round);
  tmp0 = __builtin_ia32_pslld (tmp0, 5);
  tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -107,9 +109,9 @@
  tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
  tmp0 = __builtin_ia32_por (tmp0, sam_AB);
  tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
 
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
 
  *(int_mmx *) out_samples = left_right;
 
@@ -119,20 +121,20 @@
  }
  else if (dpp->term == 18) {
  while (num_samples--) {
- tmp0 = samples_AB.q [0];
- sam_AB = __builtin_ia32_psubd (tmp0, samples_AB.q [1]);
+ tmp0 = samples_AB [0];
+ sam_AB = __builtin_ia32_psubd (tmp0, samples_AB [1]);
  tmp0 = __builtin_ia32_pslld (tmp0, 1);
  sam_AB = __builtin_ia32_paddd (sam_AB, tmp0);
  sam_AB = __builtin_ia32_psrad (sam_AB, 1);
 
- samples_AB.q [1] = samples_AB.q [0];
- samples_AB.q [0] = left_right = *(int_mmx *) in_samples;
+ samples_AB [1] = samples_AB [0];
+ samples_AB [0] = left_right = *(int_mmx *) in_samples;
 
  tmp0 = __builtin_ia32_psrld (sam_AB, 15);
  tmp1 = __builtin_ia32_pand (sam_AB, msk0);
  tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
  tmp1 = __builtin_ia32_paddd (tmp1, round);
  tmp0 = __builtin_ia32_pslld (tmp0, 5);
  tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -147,9 +149,9 @@
  tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
  tmp0 = __builtin_ia32_por (tmp0, sam_AB);
  tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
 
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
 
  *(int_mmx *) out_samples = left_right;
 
@@ -161,14 +163,14 @@
  while (num_samples--) {
  k = (m + dpp->term) & (MAX_TERM - 1);
 
- sam_AB = samples_AB.q [m];
- samples_AB.q [k] = left_right = *(int_mmx *) in_samples;
+ sam_AB = samples_AB [m];
+ samples_AB [k] = left_right = *(int_mmx *) in_samples;
 
  tmp0 = __builtin_ia32_psrld (sam_AB, 15);
  tmp1 = __builtin_ia32_pand (sam_AB, msk0);
  tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
  tmp1 = __builtin_ia32_paddd (tmp1, round);
  tmp0 = __builtin_ia32_pslld (tmp0, 5);
  tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -183,9 +185,9 @@
  tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
  tmp0 = __builtin_ia32_por (tmp0, sam_AB);
  tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
 
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
 
  *(int_mmx *) out_samples = left_right;
 
@@ -194,13 +196,13 @@
  m = (m + 1) & (MAX_TERM - 1);
  }
  }
- dpp->sum_A = sum_AB.d [0];
- dpp->sum_B = sum_AB.d [1];
- dpp->weight_A = weight_AB.d [0];
- dpp->weight_B = weight_AB.d [1];
+ dpp->sum_A = *(int*)&sum_AB;
+ dpp->sum_B = *((int*)&sum_AB+1);
+ dpp->weight_A = *(int*)&weight_AB;
+ dpp->weight_B = *((int*)&weight_AB+1);
  for (k = 0; k < MAX_TERM; ++k) {
- dpp->samples_A [k] = samples_AB.d [m * 2];
- dpp->samples_B [k] = samples_AB.d [m * 2 + 1];
+ dpp->samples_A [k] = *((int*)&samples_AB+m * 2);
+ dpp->samples_B [k] = *((int*)&samples_AB+ m * 2 + 1);
  m = (m + 1) & (MAX_TERM - 1);
  }
  __builtin_ia32_emms ();
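
A side note on the pointer casts in the patch above: reading a vector through an `int *` may run into C's aliasing rules, depending on how `int_mmx` is declared (its typedef is not shown in this thread). A third way to get at the individual lanes, besides unions and casts, is memcpy. The sketch below uses a stand-in type `pair32` and made-up helper names; whether gcc 4.1 optimises this as well as the casts is not something this thread establishes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for int_mmx: two 32-bit lanes in 8 bytes
 * (GCC/Clang vector extension; the real typedef is an assumption). */
typedef int32_t pair32 __attribute__ ((vector_size (8)));

/* Build a vector from two scalars without a union or pointer cast. */
static pair32 make_pair32(int32_t lane0, int32_t lane1)
{
    int32_t lanes[2] = { lane0, lane1 };
    pair32 v;

    memcpy(&v, lanes, sizeof v);
    return v;
}

/* Read one lane back out, again via memcpy. */
static int32_t get_lane(pair32 v, int index)
{
    int32_t lanes[2];

    assert(index == 0 || index == 1);
    memcpy(lanes, &v, sizeof lanes);
    return lanes[index];
}
```

For example, the patch's `dpp->sum_A = *(int*)&sum_AB;` would correspond to `get_lane(sum_AB, 0)` if `sum_AB` had this stand-in type.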


optimised WavPack encoder

Reply #23
Thanks for looking into this! Did you use gcc 4.1 for your tests again? It's too bad that it still has problems with unions and builtins. I'll do some testing without unions next week. That will hopefully enable other minor optimisations as well.