HydrogenAudio

Lossless Audio Compression => WavPack => Topic started by: he-jo on 2006-04-02 11:07:53

Title: optimised WavPack encoder
Post by: he-jo on 2006-04-02 11:07:53
Hi,

I've had a short look at the sources of WavPack 4.31, and tried to tune it a bit on my iMac G5 by avoiding expensive instructions like conditional branches. Some of my changes may be of interest for the mainline too, but I achieved the highest gain by writing a PowerPC asm-optimised log2 function, so this hack is mostly useless for Intel users.

On average I noticed a speedup of about 20% compared to the original encoder, e.g. the time for encoding a 64.5 MiB WAV file with 'wavpack -q -hx' went from 3:44 min down to 2:57 min.

You can find the patch and a profiled GCC 4.0.1 build for Mac OS X at http://base91.sourceforge.net/download/wavpack/ (http://base91.sourceforge.net/download/wavpack/). This binary should work on a G3 and higher; do a 'chmod 555 wavpack' after gunzip.
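
To illustrate the idea of replacing a branchy log2 with a single bit-scan instruction, here is a minimal sketch. It is not the code from the patch: the function names are hypothetical, it only computes the integer part of log2, and it assumes GCC or Clang (for `__builtin_clz`) and a 32-bit `unsigned int`. On PowerPC the builtin typically compiles to `cntlzw`, on x86 to `bsr`.

```c
#include <stdint.h>

/* Reference version: one conditional branch per bit (hypothetical name). */
static int floor_log2_loop(uint32_t x)
{
    int n = -1;

    while (x != 0) {
        x >>= 1;
        n++;
    }
    return n;                       /* -1 for x == 0 */
}

/* Branch-free version using the GCC/Clang count-leading-zeros builtin.
 * Assumes unsigned int is 32 bits wide; undefined for x == 0. */
static int floor_log2_clz(uint32_t x)
{
    return 31 - __builtin_clz(x);
}
```

Both functions agree for every non-zero input; only the second avoids the data-dependent loop.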
Title: optimised WavPack encoder
Post by: goodnews on 2006-04-02 21:33:13
Do you have (or know where I can acquire) a Universal Binary version of WavPack? I run an Intel-based iMac and would prefer a WavPack build with native Intel support for better speed on my Mac. A Universal Binary version would have both PowerPC and Intel code.

Thanks!

Quote
Hi,

I've had a short look at the sources of WavPack 4.31, and tried to tune it a bit on my iMac G5 by avoiding expensive instructions like conditional branches. Some of my changes may be in interest for the mainline too, but I achieved the highest gain by writing a PowerPC asm optimised log2 function, so this hack is mostly useless for Intel users.

On average I noticed a speedup of about 20% compared to the original encoder, e.g. the time for encoding a 64.5 MiB WAV file with 'wavpack -q -hx' went from 3:44 min down to 2:57 min.

You can find the patch and a profiled GCC 4.0.1 build for Mac OS X on http://he-jo.net/download/wavpack/ (http://he-jo.net/download/wavpack/) (this binary should work on a G3 and higher  - do a 'chmod 555 wavpack' after gunzip).
Title: optimised WavPack encoder
Post by: DARcode on 2006-04-02 22:41:35
Do contribute the changes back to David if you haven't already, thanks.
Title: optimised WavPack encoder
Post by: he-jo on 2006-04-03 07:40:52
Quote
Do you have (or know where I can acquire) a Universal Binary version of WavPack, as I run a Intel based iMac and prefer a WavPack compile that has Intel support native to speedup on my Mac? A Universal Binary version would have both Power PC and Intel code.

Thanks!

Yes, I could provide a Universal Binary, but first I want to see whether I can tune the encoder for Intel processors too. I hope to have some test results soon.
Title: optimised WavPack encoder
Post by: he-jo on 2006-04-03 22:06:02
Now I've ported my changes to x86. Since I only have an AMD K6 (200 MHz) for testing, I cannot say for sure how this will behave on a modern Intel processor. I'm afraid that it won't be as beneficial as on PowerPC: my code uses instructions that were very expensive on older processors.

Could somebody please test my binaries on a Pentium (3/4/M etc.) or Athlon? I uploaded two packages to http://base91.sourceforge.net/download/wavpack/ (http://base91.sourceforge.net/download/wavpack/):

linux-x86.tar.gz
macosx-x86.tar.gz

Choose the right one for your OS (sorry, Windows users). After unpacking you'll find two binaries: 'wavpackA' and 'wavpackB'. I would be very grateful if you could tell me which one encodes faster on your system. Please also compare the output files. They must be identical.
Title: optimised WavPack encoder
Post by: Supacon on 2006-04-03 22:49:35
Cool... I hope Bryant catches wind of this soon.
Title: optimised WavPack encoder
Post by: skamp on 2006-04-03 23:23:32
wavpackB is about 15% faster than vanilla wavpack on my system (Athlon XP2500+ Barton), compiled with CFLAGS="-O2 -march=athlon-xp". wavpackA is slightly slower than wavpackB (3 seconds slower on a 4-minute encode with -hx).

Edit: the speed gain is about the same with wavpack -hx6 (21m 38s vs. 25m 23s).
Title: optimised WavPack encoder
Post by: Shade[ST] on 2006-04-04 11:00:17
This is probably a stupid question, but do you guys check whether the compressed file is still bit-identical to the encoded one?
Title: optimised WavPack encoder
Post by: skamp on 2006-04-04 11:11:51
I did.
Title: optimised WavPack encoder
Post by: wisodev on 2006-04-04 11:31:55
Quote
Now I've ported my changes to x86. Since I only have an AMD K6 (200 MHz) for testing, I cannot say for sure, how this will behave on a modern Intel processor. I'm afraid that it won't be as beneficial as on PowerPC: My code uses instructions that have been very expensive on older processors.

Could somebody please test my binaries on a Pentium (3/4/M etc.) or Athlon? I uploaded two packages to http://he-jo.net/download/wavpack/ (http://he-jo.net/download/wavpack/) :

linux-x86.tar.gz
macosx-x86.tar.gz

Choose the right one for your OS (sorry Windows users). After unpacking you'll find two binaries: 'wavpackA' and 'wavpackB'. I would be very grateful, if you could tell me, which one encodes faster on your system. Please also compare the output files. They must be identical.


Can you upload a diff of the sources for the x86 optimizations? I would like to compile WavPack for Windows and test it on an Athlon 2000+.
Title: optimised WavPack encoder
Post by: dch on 2006-04-04 21:49:55
I tested the Linux binaries on a P2-450 and md5summed the results to ensure they were identical.  Testing was done out of a tmpfs to keep disk cache out of the picture.

The B version is faster for me.  Not by much on -m, but -fx completed in 86% of the time the A version took to do the same file.  I only tested on one file.
Title: optimised WavPack encoder
Post by: he-jo on 2006-04-06 07:11:54
Thanks to the testers so far!

Please note that binary A was compiled from the unmodified sources, while version B contains my changes. So only the difference between both binaries matters.

On my K6, variant B was much slower. Now it seems that only processors from Intel can really benefit from my changes. It would be interesting to see how recent Athlons perform with B.

I'm currently trying to further increase the encoder speed, and will provide the sources when I've found a cleaner solution. My changes currently depend on GCC extensions.
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-02 21:35:02
I rewrote the x86 stuff for NASM. Since dch reported a speedup of 14% on a P2, I think it's worth adding this to the MMX-optimised build.

wisodev, could you build a Windows binary, please? You can download the package wavpack-bsr.tar.gz (http://base91.sourceforge.net/download/wavpack/). Apply the diff to the wavpack sources, assemble opt.asm with 'nasm -O2', and link everything together. You'll probably need to adjust the global labels in the asm file for Windows - I'm sure you know what to do.

When the binary is available, please test the extra modes. I'd suggest comparing the speed against the latest MMX binaries. It would be nice if you could also verify the output files.

Thanks,
Jo.
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-04 06:36:57
Quote
I rewrote the x86 stuff for NASM. Since dch reported a speedup of 14% on a P2, I think it's worth to add this to the MMX optimised compile.

wisodev, could you build a Windows binary, please? You can download the package wavpack-bsr.tar.gz from http://he-jo.net/download/wavpack/ (http://he-jo.net/download/wavpack/). Apply the diff to the wavpack sources, assemble opt.asm with 'nasm -O2', and link everything together. You'll probably need to adjust the global labels in the asm file for Windows - I'm sure, you know what to do

When the binary is available, please test the extra modes. I'd suggest to compare the speed against the latest MMX binaries. Would be nice, if you could also verify the output files.

Thanks,
Jo.


Sorry for the late answer, but in my country we had a very long (5-day) weekend ;-)

I will build the binaries today (if everything goes OK) and run tests.

Thanks for the update and your work.

wisodev
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-04 08:24:12
Quote
Sorry for so late answer, but in my country we had a very long (5 days) weekend ;-)

I will build today the binarys (if everything goes OK) and run tests.


Hm, I only had a 3-day weekend.

Please, don't hurry! I've made a minor change to the asm code in the meantime. Today, I'll probably also have the opportunity to do a MinGW build. I hope to be able to post my results later.

Anyway, thanks for your help!
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-04 08:42:59
Quote

Sorry for so late answer, but in my country we had a very long (5 days) weekend ;-)

I will build today the binarys (if everything goes OK) and run tests.


Hm, I only had a 3 days weekend 



Please, don't hurry! I've done a minor change to the asm code in the meanwhile. Today, I'll probably also have the opportunity to do a MinGW build. I hope to be able to post my results later.

Anyway, thanks for your help!


No problem! But anyway ;-) it would be nice if you posted (or updated) the modified asm source code, or just posted the changes in a reply.
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-04 13:51:09
Since MinGW still uses gcc-3.4 (with inferior MMX builtins), I have just built a binary with my latest asm changes. You'll find everything you need in the archive wavpack-bsr-zip (http://base91.sourceforge.net/download/wavpack/).

There are two binaries in this package: wavpackA.exe (built from original sources) and wavpackB.exe (with my asm optimisations). I was able to run a quick test with '-f -x6' on some kind of a Celeron machine and measured a speedup of about 12%. It would be nice if somebody could test this on an Athlon processor.

I'm sure wisodev will provide a binary which will also include the MMX optimisations.
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-05 07:25:17
Quote
Since MinGW still uses gcc-3.4 (with inferior MMX builtins), I have just built a binary with my latest asm changes. You find everything you need in the file 'wavpack-bsr-zip' on http://he-jo.net/download/wavpack/ (http://he-jo.net/download/wavpack/)

There are two binaries in this package: wavpackA.exe (built from original sources) and wavpackB.exe (with my asm optimisations). I was able to run a quick test with '-f -x6' on some kind of a Celeron machine and measured a speedup of about 12%. Would be nice, if somebody could test this on an Athlon processor.

I'm sure wisodev will provide a binary which will also include the MMX optimisations.


Well, I have included your asm optimizations (not the ones from wavpack-bsr-zip, but the previous ones) in the MMX version and done some quick tests. I am not sure about the results yet: the output files are the same as from the original version (binary comparison), but the speedup is only 1% to 3% relative to the MMX version (more testing is needed). One more thing: the -O2 switch was slower than -O1 with NASM, but like I said, more testing is needed.

I will compare your build to mine and find out the best solution. I will post results and binaries (including sources) later today.

wisodev
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-05 13:06:14
Quote
Well I have included your asm optimizations (not from wavpack-bsr-zip, but previous one) in MMX version, and done some quick tests. But I am not sure of results, output files are same as from original version (binary comparison), but speedup is from 1% to 3% relating to MMX version (need more testing) and one more thing -O2 switch was slower then -O1 with NASM, but like I said more testing is needed.

Did you test on an AMD processor? This would explain the small speedup. At least, I could be happy that it isn't slower.  This code probably gives the best results on Intel's 6th generation processors (Pentium Pro/2/3) - I'm not sure about Pentium M and Core Duo. Please, can somebody do a test on a Pentium 4?

Btw.: If you want to do exact timing, you can use the command line tool 'timer' from http://7-zip.org/igor.html (http://7-zip.org/igor.html) and calculate the speed differences from the reported 'User Time'.
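
As a small aid for turning two measured times into a percentage, here is a hedged sketch. The helper names are hypothetical, and "speedup" is taken to mean the relative reduction in encoding time, which is an assumption about how the figures in this thread were computed.

```c
/* Convert a minutes:seconds reading into seconds (hypothetical helper). */
static double to_seconds(int minutes, double seconds)
{
    return 60.0 * minutes + seconds;
}

/* Percentage by which the encoding time dropped from old to new. */
static double time_saved_percent(double old_seconds, double new_seconds)
{
    return 100.0 * (old_seconds - new_seconds) / old_seconds;
}
```

With timings reported earlier in this thread, `time_saved_percent(to_seconds(3, 44), to_seconds(2, 57))` gives roughly 21%, and the -hx6 timings of 25m 23s vs. 21m 38s give roughly 15%.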
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-05 20:07:42
Quote

Well I have included your asm optimizations (not from wavpack-bsr-zip, but previous one) in MMX version, and done some quick tests. But I am not sure of results, output files are same as from original version (binary comparison), but speedup is from 1% to 3% relating to MMX version (need more testing) and one more thing -O2 switch was slower then -O1 with NASM, but like I said more testing is needed.

Did you test on an AMD processor? This would explain the small speedup. At least, I could be happy that it isn't slower.  This code probably gives the best results on Intel's 6th generation processors (Pentium Pro/2/3) - I'm not sure about Pentium M and Core Duo. Please, can somebody do a test on a Pentium 4?

Btw.: If you want to do exact timing, you can use the command line tool 'timer' from http://7-zip.org/igor.html (http://7-zip.org/igor.html) and calculate the speed differences from the reported 'User Time'.


Yes, on an AMD Athlon XP 2000+, WinXP SP2.

I have uploaded my binaries here (http://www.hydrogenaudio.org/forums/index.php?showtopic=43731&st=0&gopid=389669&). This package contains the original, the latest MMX-optimized and the MMX-BSR-optimized binaries, plus source code and some quick tests on my PC.

It looks like your binaries are slower than mine, but this is a compiler issue.

Thanks for the tip about timing, this will be very useful.

wisodev
Title: optimised WavPack encoder
Post by: PrakashP on 2006-05-06 15:25:21
I tested the MMX version (the intrinsics version; the nasm stuff only seems to have the bsr optimisation? I haven't tested it...) on Linux using gcc-4.1. It is not faster for me on my Athlon XP (at least using no additional parameters for wavpack). Well, as the MMX code only parallelises by 2 and the Athlon has higher MMX latency than the P3 core, this explains it - especially as the MMX code needs quite a few instructions to emulate the 32-bit multiply.

But I found a rather easy method to optimise run-time on Linux: compile the lib statically (thus non-PIC) and link it into the executable. Run-time immediately was more than 10% faster for me (using a small test case though).

Interestingly this patch seems to give me a few % in the static case (but slows things down a bit in the shared lib case). Actually the patch makes the loop a bit slower, but eliminating code seems to make the cache happier - at least for me in a quick test.

Code: [Select]
--- extra2.c	2006-04-06 06:42:25.000000000 +0200
+++ extra2-opt.c 2006-05-06 16:11:08.000000000 +0200
@@ -63,36 +63,14 @@
  dpp->samples_B [i] = exp2s (log2s (dpp->samples_B [i]));
}
 
- if (dpp->term == 17) {
- while (num_samples--) {
- int32_t left, right;
- int32_t sam_A, sam_B;
-
- sam_A = 2 * dpp->samples_A [0] - dpp->samples_A [1];
- dpp->samples_A [1] = dpp->samples_A [0];
- dpp->samples_A [0] = left = in_samples [0];
- left -= apply_weight (dpp->weight_A, sam_A);
- update_weight (dpp->weight_A, dpp->delta, sam_A, left);
- dpp->sum_A += dpp->weight_A;
- out_samples [0] = left;
-
- sam_B = 2 * dpp->samples_B [0] - dpp->samples_B [1];
- dpp->samples_B [1] = dpp->samples_B [0];
- dpp->samples_B [0] = right = in_samples [1];
- right -= apply_weight (dpp->weight_B, sam_B);
- update_weight (dpp->weight_B, dpp->delta, sam_B, right);
- dpp->sum_B += dpp->weight_B;
- out_samples [1] = right;
- in_samples += dir;
- out_samples += dir;
- }
- }
- else if (dpp->term == 18) {
+ if (dpp->term == 17 || dpp->term == 18) {
+ int term17 = dpp->term - 17;
+ int term15 = dpp->term - 15;
  while (num_samples--) {
  int32_t left, right;
  int32_t sam_A, sam_B;
 
- sam_A = (3 * dpp->samples_A [0] - dpp->samples_A [1]) >> 1;
+ sam_A = (term15 * dpp->samples_A [0] - dpp->samples_A [1]) >> term17;
  dpp->samples_A [1] = dpp->samples_A [0];
  dpp->samples_A [0] = left = in_samples [0];
  left -= apply_weight (dpp->weight_A, sam_A);
@@ -100,7 +78,7 @@
  dpp->sum_A += dpp->weight_A;
  out_samples [0] = left;
 
- sam_B = (3 * dpp->samples_B [0] - dpp->samples_B [1]) >> 1;
+ sam_B = (term15 * dpp->samples_B [0] - dpp->samples_B [1]) >> term17;
  dpp->samples_B [1] = dpp->samples_B [0];
  dpp->samples_B [0] = right = in_samples [1];
  right -= apply_weight (dpp->weight_B, sam_B);
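
The merged loop works because the term 17 and term 18 predictors are the same formula with different constants. A minimal sketch to check this, with hypothetical function names and assuming an arithmetic right shift for negative values (as GCC provides):

```c
#include <stdint.h>

/* The two predictors as written in the original, separate loops. */
static int32_t predict_original(int term, int32_t s0, int32_t s1)
{
    if (term == 17)
        return 2 * s0 - s1;
    return (3 * s0 - s1) >> 1;      /* term == 18 */
}

/* The single expression used by the merged loop in the patch above. */
static int32_t predict_merged(int term, int32_t s0, int32_t s1)
{
    int term17 = term - 17;         /* shift count: 0 or 1 */
    int term15 = term - 15;         /* multiplier: 2 or 3 */

    return (term15 * s0 - s1) >> term17;
}
```

The price is a multiply and a shift by variables instead of constants inside the loop, which fits the observation that the loop itself gets a bit slower while the code gets smaller.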
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-06 17:08:19
Quote
and one more thing -O2 switch was slower then -O1 with NASM, but like I said more testing is needed.

That's really strange since shorter instructions are decoded faster. Maybe it's an alignment issue.

Quote
It looks like your binarys are slower the my, but this is compiler issue.

Probably - I used '-mtune=pentium3' for my builds.

Quote
I tested the MMX (intrinsics version, the nasm stuff only seems to have bsr optimisiation? I haven't tested it...) version on Linux using gcc-4.1. It is not faster for me on my Athlon XP (at least using no additional parameters for wavpack). Well, as the MMX code only parallelises by 2 and Athlon has higher latency compared to P3 core with MMX, this explains it - esp as the MMX code needs quite a few instructions to emulate the 32bit multiply.

But I found a rather easy method to optimise run-time on Linux: Compile the lib static (thus non-PIC) and link it into the executable. Run-time immediately was more than 10% faster for me (using a small test case though).

Yes, the NASM code only includes the BSR optimisation, and I always linked libwavpack statically. The normal modes of WavPack are generally not affected by my optimisations. You'll probably measure the highest speedup with '-x6'.

I'm already working on a few alternatives to the BSR code. Will see, if I can post some positive news next week.
Title: optimised WavPack encoder
Post by: PrakashP on 2006-05-06 22:38:11
Quote
Yes, the NASM code only includes the BSR optimisation, and I always linked libwavpack statically. The normal modes of WavPack are generally not affected by my optimisations. You'll probably measure the highest speedup with '-x6'.

I'm already working on a few alternatives to the BSR code. Will see, if I can post some positive news next week.

Ah OK, I tried -x6 and now I see a difference - but now in fact the MMX version is slower (~20%). But, as I know that gcc is a bit bitchy, I modified your patch and now it is actually faster (~10%). (I cannot test reliably, as I have background processes running.)

This is not cleaned up... but the trick is to *not* use unions. gcc unfortunately treats them differently... Perhaps someone wants to test whether my version is faster for them as well (using gcc/mingw)? I hope I didn't mess anything up...
Code: [Select]
--- extra2.c	2006-05-06 23:28:07.000000000 +0200
+++ extra2mmx.c 2006-05-06 15:42:30.000000000 +0200
@@ -57,42 +57,44 @@
if (dpp->term > 0) {
  const int_mmx
  delta = { dpp->delta, dpp->delta },
- msk0 = { 0x7fff, 0x7fff },
- msk1 = { 0xffff, 0xffff },
+ msk0 = { 0x00007fffL, 0x00007fffL },
+ msk1 = { 0x0000ffffL, 0x0000ffffL },
  round = { 512, 512 },
  zero = { 0, 0 };
  int_mmx left_right, sam_AB, tmp0, tmp1;
- union {
+ /*union {
  int_mmx q [MAX_TERM];
  int d [2 * MAX_TERM];
- } samples_AB;
- union {
+ } samples_AB;*/
+ int_mmx samples_AB[MAX_TERM];
+ /*union {
  int_mmx q;
  int d [2];
- } weight_AB, sum_AB;
+ } weight_AB, sum_AB;*/
+ int_mmx weight_AB, sum_AB ={0,0};
 
- sum_AB.d [0] = 0;
- sum_AB.d [1] = 0;
- weight_AB.d [0] = restore_weight (store_weight (dpp->weight_A));
- weight_AB.d [1] = restore_weight (store_weight (dpp->weight_B));
+ //sum_AB.d [0] = 0;
+ //sum_AB.d [1] = 0;
+ *(int*)&weight_AB = restore_weight (store_weight (dpp->weight_A));
+ *((int*)&weight_AB+1) = restore_weight (store_weight (dpp->weight_B));
  for (k = 0; k < MAX_TERM; ++k) {
- samples_AB.d [k * 2] = exp2s (log2s (dpp->samples_A [k]));
- samples_AB.d [k * 2 + 1] = exp2s (log2s (dpp->samples_B [k]));
+ *((int*)&samples_AB + k * 2) = exp2s (log2s (dpp->samples_A [k]));
+ *((int*)&samples_AB + k * 2 + 1) = exp2s (log2s (dpp->samples_B [k]));
  }
 
  if (dpp->term == 17) {
  while (num_samples--) {
- sam_AB = __builtin_ia32_pslld (samples_AB.q [0], 1);
- sam_AB = __builtin_ia32_psubd (sam_AB, samples_AB.q [1]);
+ sam_AB = __builtin_ia32_pslld (samples_AB [0], 1);
+ sam_AB = __builtin_ia32_psubd (sam_AB, samples_AB [1]);
 
- samples_AB.q [1] = samples_AB.q [0];
- samples_AB.q [0] = left_right = *(int_mmx *) in_samples;
+ samples_AB [1] = samples_AB [0];
+ samples_AB [0] = left_right = *(int_mmx *) in_samples;
 
  tmp0 = __builtin_ia32_psrld (sam_AB, 15);
  tmp1 = __builtin_ia32_pand (sam_AB, msk0);
  tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
  tmp1 = __builtin_ia32_paddd (tmp1, round);
  tmp0 = __builtin_ia32_pslld (tmp0, 5);
  tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -107,9 +109,9 @@
  tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
  tmp0 = __builtin_ia32_por (tmp0, sam_AB);
  tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
 
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
 
  *(int_mmx *) out_samples = left_right;
 
@@ -119,20 +121,20 @@
  }
  else if (dpp->term == 18) {
  while (num_samples--) {
- tmp0 = samples_AB.q [0];
- sam_AB = __builtin_ia32_psubd (tmp0, samples_AB.q [1]);
+ tmp0 = samples_AB [0];
+ sam_AB = __builtin_ia32_psubd (tmp0, samples_AB [1]);
  tmp0 = __builtin_ia32_pslld (tmp0, 1);
  sam_AB = __builtin_ia32_paddd (sam_AB, tmp0);
  sam_AB = __builtin_ia32_psrad (sam_AB, 1);
 
- samples_AB.q [1] = samples_AB.q [0];
- samples_AB.q [0] = left_right = *(int_mmx *) in_samples;
+ samples_AB [1] = samples_AB [0];
+ samples_AB [0] = left_right = *(int_mmx *) in_samples;
 
  tmp0 = __builtin_ia32_psrld (sam_AB, 15);
  tmp1 = __builtin_ia32_pand (sam_AB, msk0);
  tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
  tmp1 = __builtin_ia32_paddd (tmp1, round);
  tmp0 = __builtin_ia32_pslld (tmp0, 5);
  tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -147,9 +149,9 @@
  tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
  tmp0 = __builtin_ia32_por (tmp0, sam_AB);
  tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
 
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
 
  *(int_mmx *) out_samples = left_right;
 
@@ -161,14 +163,14 @@
  while (num_samples--) {
  k = (m + dpp->term) & (MAX_TERM - 1);
 
- sam_AB = samples_AB.q [m];
- samples_AB.q [k] = left_right = *(int_mmx *) in_samples;
+ sam_AB = samples_AB [m];
+ samples_AB [k] = left_right = *(int_mmx *) in_samples;
 
  tmp0 = __builtin_ia32_psrld (sam_AB, 15);
  tmp1 = __builtin_ia32_pand (sam_AB, msk0);
  tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
  tmp1 = __builtin_ia32_paddd (tmp1, round);
  tmp0 = __builtin_ia32_pslld (tmp0, 5);
  tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -183,9 +185,9 @@
  tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
  tmp0 = __builtin_ia32_por (tmp0, sam_AB);
  tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
 
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
 
  *(int_mmx *) out_samples = left_right;
 
@@ -194,13 +196,13 @@
  m = (m + 1) & (MAX_TERM - 1);
  }
  }
- dpp->sum_A = sum_AB.d [0];
- dpp->sum_B = sum_AB.d [1];
- dpp->weight_A = weight_AB.d [0];
- dpp->weight_B = weight_AB.d [1];
+ dpp->sum_A = *(int*)&sum_AB;
+ dpp->sum_B = *((int*)&sum_AB+1);
+ dpp->weight_A = *(int*)&weight_AB;
+ dpp->weight_B = *((int*)&weight_AB+1);
  for (k = 0; k < MAX_TERM; ++k) {
- dpp->samples_A [k] = samples_AB.d [m * 2];
- dpp->samples_B [k] = samples_AB.d [m * 2 + 1];
+ dpp->samples_A [k] = *((int*)&samples_AB+m * 2);
+ dpp->samples_B [k] = *((int*)&samples_AB+ m * 2 + 1);
  m = (m + 1) & (MAX_TERM - 1);
  }
  __builtin_ia32_emms ();

Moderation: CODE to CODEBOX
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-07 06:57:57
Quote
Ah OK, I tried -x6 and now I see a difference - but now in fact the MMX version is slower (~20%).  But, as I know that gcc is a bit bitchy, I modified your patch and now it is actually faster (~10%). (I cannot reliably test, as I have backround processes running).

This is not cleaned up...but the trick is to *not* use unions. gcc unfortunately treats them differently... Perhaps one wants to test, whether my version is faster for one, as well (using gcc/mingw)? I hope I didn't mess anything up...

Thanks for looking into this! Did you use gcc 4.1 for your tests again? It's too bad that it still has problems with unions and builtins. I'll do some testing without unions next week. That will hopefully enable other minor optimisations as well.
Title: optimised WavPack encoder
Post by: PrakashP on 2006-05-07 09:12:50
Quote
Thanks for looking into this! Did you use gcc 4.1 for your tests again?


Yup.
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-20 09:11:05
Sorry, I'm a bit busy these days. Now that I've finally found some time, I've finished one of the alternatives to the BSR stuff. This one gives me a speedup of up to 8% on a P3 Celeron (-f -x6), and it's also faster than the original code on my old AMD K6.

wisodev, could you do some testing again? Please try 'nasm -O2' first. You'll find the patch in this file: wp-4.32-jfl2b.diff.gz
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-22 05:20:51
Quote
wisodev, could you do some testing again?


OK, I will post results (and binaries + sources) tomorrow.
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-23 19:00:28
Quote
Sorry, I'm a bit busy these days. Now, that I finally found some time, I finished one of the alternatives to the BSR stuff. This one gives me a speedup of up to 8% on a P3 Celeron (-f -x6), and it's also faster than the original code on my old AMD K6.

wisodev, could you do some testing again? Please try 'nasm -O2' first. You find the patch in this file: wp-4.32-jfl2b.diff.gz


OK, I have added these optimizations, but I have only done one quick test and the results are not so good. The binaries and sources are here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=43731&view=findpost&p=395239). Maybe someone else can do some tests to confirm my results.
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-28 11:05:56
I've updated the patch wp-4.32-jfl2b.diff.gz (http://base91.sourceforge.net/download/wavpack/) to avoid a memory stall. On a Celeron (Tualatin) I measured a speedup of 10% compared to the original wavpack sources.

wisodev, it would be nice if you could provide binaries and test results again.

Thanks a lot,
Jo.
Title: optimised WavPack encoder
Post by: krmathis on 2006-05-28 11:34:14
What happened to the PowerPC optimized version?
The wavpack-4.31-ppc.diff fails on the WavPack 4.31 and 4.32 source code:
Quote
patch -p0 < wavpack-4.31-ppc.diff
patching file bits.c
Hunk #1 FAILED at 149.
1 out of 1 hunk FAILED -- saving rejects to file bits.c.rej
patching file pack.c
Hunk #1 FAILED at 567.
1 out of 1 hunk FAILED -- saving rejects to file pack.c.rej
patching file unpack3.c
Hunk #1 FAILED at 1583.
Hunk #2 FAILED at 1604.
Hunk #3 FAILED at 1988.
3 out of 3 hunks FAILED -- saving rejects to file unpack3.c.rej
patching file wavpack.h
Hunk #1 FAILED at 415.
Hunk #2 FAILED at 518.
2 out of 2 hunks FAILED -- saving rejects to file wavpack.h.rej
patching file words.c
Hunk #1 FAILED at 69.
Hunk #2 FAILED at 93.
Hunk #3 FAILED at 144.
Hunk #4 FAILED at 419.
Hunk #5 FAILED at 1302.
Hunk #6 FAILED at 1325.
Hunk #7 FAILED at 1376.
7 out of 7 hunks FAILED -- saving rejects to file words.c.rej

Am I missing something, or do you simply need to update the patch (hopefully for 4.32)?
Thanks!
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-28 23:06:29
The problem is that the WavPack sources for Unix have DOS line endings too. You need to remove the carriage returns from these files before you can apply the patch. You could do that with the following commands:

Code: [Select]
mkdir src
for FILE in *.[ch]; do tr -d '\r' < "$FILE" > "src/$FILE"; done
rm *.[ch]
mv src/* .
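
For readers without `tr` at hand, the same carriage-return removal can be sketched in C. This is a hypothetical helper, not part of WavPack or the patch; it works in place on a buffer that has already been read into memory and, like `tr -d '\r'`, drops every CR byte.

```c
#include <stddef.h>

/* Remove all '\r' bytes from buf in place and return the new length. */
static size_t strip_cr(char *buf, size_t len)
{
    size_t out = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] != '\r')
            buf[out++] = buf[i];    /* keep everything except CR */
    }
    return out;
}
```

Reading each source file, passing its contents through `strip_cr` and writing the first `out` bytes back would have the same effect as the shell loop above.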
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-29 14:30:06
Quote
I've updated the patch 'wp-4.32-jfl2b.diff.gz' on http://he-jo.net/download/wavpack/ (http://he-jo.net/download/wavpack/) to avoid a memory stall. On a Celeron (Tualatin) I measured a speedup of 10% compared to the original wavpack sources.

wisodev, it would be nice, if you could provide binaries and test results again.

Thanks a lot,
Jo.


As stated in the Upload thread, the new binaries will be available very soon.

PS: Sorry for the double post, but I missed this post.
Title: optimised WavPack encoder
Post by: krmathis on 2006-05-29 20:23:19
Quote
The problem is, that the WavPack sources for Unix have DOS line endings too.

Ok, I see.
I'll change the line endings and try again...

Thanks!
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-29 22:08:50
No problem. It would be nice if you could publish your own test results here.
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-30 07:04:08
I have uploaded the latest optimizations; the binaries and sources are available here (http://www.hydrogenaudio.org/forums/index.php?showtopic=43731) for download. It looks very good; for more details check the Upload thread.
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-30 11:06:14
Oh, this is great! I'm very happy now that we have a faster binary for Athlons as well. I hoped to find a solution that is good for all common processors, but the P4 may still be a problem. We'll see whether we can get rid of the BSR code in the future.
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-31 05:46:41
Are you considering adding MMX optimizations for negative values of dpp->term? This was discussed some time ago. I think there is a few percent of improvement to be had there too. This part of the code is executed in about 25% of all calls to the function decorr_stereo_pass. Or is there a reason not to optimize it? This is just a suggestion; I know it takes a lot of time to do such things.

Besides, the one test on a P4 showed that BSR and MMX work pretty nicely on these machines. The newest optimization has not been tested on a P4, but I think it should do as well. But I am just speculating here ;-)

I am looking forward to your future optimizations!!!
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-31 08:20:19
Quote
Are you considering adding MMX optimizations for negatives values (dpp->term), this was discussed some time ago. I think there are few percent of improvement there too. This part of code executes about 25% of all calls to function decorr_stereo_pass. Or there is reason to not optimize this? This is just a suggestion, I now it takes lot of time to do such things.
I already did this one month ago, but tests on my old AMD K6 showed that the code was actually slower. You can imagine that this result wasn't very motivating. But in the meantime, I got the impression that testing on this obsolete processor isn't really reliable. So maybe it's worth having a look at the code again.
Quote
Beside the one test on P4 showed that BSR and MMX works pretty nice on this machines. The newest optimization was not tested on P4, but I think it should do as well. But I am just speculating here ;-)
Yes, that's what I meant. bsr could still be faster on a P4, because the 'jfl2b' code uses floating point instructions, which have a long latency on these processors. Theoretically, I could nearly double the throughput of the function by improving parallelism, but I'm afraid that this wouldn't give much gain in the end.
Quote
I am looking for your future optimizations!!!
Me too. I really like doing this work, and it's nice to make people happy that way. But, you know, it's often just a matter of time. We'll see what I'm able to do.
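
As a rough illustration of how floating point instructions can stand in for a bit scan: converting an integer to a double normalises it, so the exponent field ends up holding the integer log2. The sketch below shows the general technique only and is an assumption, not the 'jfl2b' code itself; the function name is hypothetical and IEEE 754 doubles are assumed.

```c
#include <stdint.h>
#include <string.h>

/* Integer part of log2(x) taken from the exponent of a double.
 * x must be non-zero. */
static int floor_log2_fp(uint32_t x)
{
    double d = (double) x;          /* exact: a double holds any 32-bit integer */
    uint64_t bits;

    memcpy(&bits, &d, sizeof bits); /* reinterpret the bit pattern safely */
    return (int) ((bits >> 52) & 0x7ff) - 1023;   /* remove the exponent bias */
}
```

Using a double rather than a float matters here: a float would round values just below a power of two upwards and report an exponent that is one too high.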
Title: optimised WavPack encoder
Post by: askoff on 2006-05-31 08:37:31
I made the test with my P4. I can see some improvements but BSR is still quite a lot faster. Keep up the good work he-jo.
Title: optimised WavPack encoder
Post by: smz on 2006-05-31 10:52:47
Quote
I have uploaded latest optimizations, the binarys and sources are available here (http://www.hydrogenaudio.org/forums/index.php?showtopic=43731) for download. It looks very good, for more details check the Upload thread.

I find this a bit confusing: your link points to post #1, where one is redirected to post #21, but that post seems to be old as well (last edited on May 23, while the new binary should be of May 30) so I'm not sure if it is really the correct one.

Wouldn't it be better to just have post #1 point to the latest and greatest version available?

Cheers and compliments for the great work you're all doing!

Sergio
Title: optimised WavPack encoder
Post by: he-jo on 2006-05-31 10:56:16
Quote
I made the test with my P4. I can see some improvements but BSR is still quite a lot faster. Keep up the good work he-jo.

Thanks! At least I can be happy, that it isn't slower than the original code anymore. 
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-31 11:23:07
Quote
I find this a bit confusing: your link points to post #1, where one is redirected to post #21, but that post seems to be old as well (last edited on May 23, while the new binary should be of May 30) so I'm not sure if it is really the correct one.

Wouldn't be better to just have Post #1 to point to the latest and greatest version available?

Cheers and compliments for the great work you're all doing!

Sergio


Yes, it is my fault. I have not updated these redirections, and I know that this is not the best solution. This will be corrected. I hope this time I can solve the problem and make things as clear as possible; I am only human ;-).
Title: optimised WavPack encoder
Post by: smz on 2006-05-31 12:17:36
Don't worry, wisodev, we all are human or at least we should be! :-))
Anyway, the link in post #21 is the good one, isn't it?

Cheers

Sergio

Edit: spelling

Edit 2: Oh, now I see! You edited post #1 and the correct link is now in post #31. Thanks!
Title: optimised WavPack encoder
Post by: wisodev on 2006-05-31 12:28:45
No problem!

---

I think when the next update is available, I will edit post #1 and place all downloads there, and a redirection to post #1 will be added to the original posts (which now contain links). This will be less confusing, and keeping all downloads in one post will clarify the situation. But I am open to other solutions too.