Hi,
I've had a quick look at the WavPack 4.31 sources and tried to tune them a bit on my iMac G5 by avoiding expensive instructions such as conditional branches. Some of my changes may be of interest for the mainline too, but the biggest gain came from a log2 function written in PowerPC assembly, so this hack is mostly useless for Intel users.
On average I measured a speedup of about 20% over the original encoder; for example, encoding a 64.5 MiB WAV file with 'wavpack -q -hx' went from 3:44 min down to 2:57 min.
You can find the patch and a profiled GCC 4.0.1 build for Mac OS X at http://base91.sourceforge.net/download/wavpack/ (the binary should work on a G3 or later; run 'chmod 555 wavpack' after gunzip).
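For readers wondering what a branch-free integer log2 can look like, here is a minimal sketch. It is not the code from the patch: the function names are made up for illustration, and it assumes GCC or Clang (for __builtin_clz) and a 32-bit unsigned int. GCC typically turns this builtin into a single count-leading-zeros instruction on PowerPC and into bsr (or lzcnt) on x86.

```c
#include <stdint.h>

/* Reference version: one conditional branch per loop iteration. */
static int ilog2_loop(uint32_t v)
{
    int n = 0;
    while (v >>= 1)
        n++;
    return n;               /* floor(log2(v)) for v >= 1, 0 for v == 0 */
}

/* Branch-free version (illustrative, not taken from the patch).
 * __builtin_clz is undefined for 0, so bit 0 is forced on first;
 * this maps both 0 and 1 to a result of 0. */
static int ilog2_clz(uint32_t v)
{
    return 31 - __builtin_clz(v | 1u);
}
```

Both functions return the position of the highest set bit, so they can be compared against each other over any range of inputs.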
Do you have (or know where I can get) a Universal Binary of WavPack? I run an Intel-based iMac and would prefer a build with native Intel code for better speed on my Mac. A Universal Binary would contain both PowerPC and Intel code.
Thanks!
Do contribute the changes back to David if you haven't already, thanks.
Yes, I could provide a Universal Binary, but first I want to see whether I can tune the encoder for Intel processors too. I hope to have some test results soon.
Now I've ported my changes to x86. Since I only have an AMD K6 (200 MHz) for testing, I cannot say for sure how this will behave on a modern Intel processor. I'm afraid that it won't be as beneficial as on PowerPC: my code uses instructions that were very expensive on older processors.
Could somebody please test my binaries on a Pentium (3/4/M etc.) or Athlon? I uploaded two packages to http://base91.sourceforge.net/download/wavpack/:
linux-x86.tar.gz
macosx-x86.tar.gz
Choose the right one for your OS (sorry, Windows users). After unpacking you'll find two binaries: 'wavpackA' and 'wavpackB'. I would be very grateful if you could tell me which one encodes faster on your system. Please also compare the output files. They must be identical.
Cool... I hope Bryant catches wind of this soon.
wavpackB is about 15% faster than vanilla wavpack on my system (Athlon XP2500+ Barton), compiled with CFLAGS="-O2 -march=athlon-xp". wavpackA is slightly slower than wavpackB (3 seconds slower on a 4-minute encode with -hx).
Edit: the speed gain is about the same with wavpack -hx6 (21m 38s vs. 25m 23s).
This is probably a stupid question, but do you guys check whether the compressed file is still bit-identical to the encoded one?
I did.
Could you upload a diff of the sources for the x86 optimizations? I would like to compile WavPack for Windows and test it on an Athlon 2000+.
I tested the Linux binaries on a P2-450 and md5summed the results to ensure they were identical. Testing was done out of a tmpfs to keep disk cache out of the picture.
The B version is faster for me. Not by much on -m, but -fx completed in 86% of the time the A version took to do the same file. I only tested on one file.
Thanks to the testers so far!
Please note that binary A was compiled from the unmodified sources, while version B contains my changes. So only the difference between both binaries matters.
On my K6, variant B was much slower. It now seems that only Intel processors can really benefit from my changes. It would be interesting to see how recent Athlons perform with B.
I'm currently trying to increase the encoder speed further, and will provide the sources when I've found a cleaner solution. My changes currently depend on GCC extensions.
I rewrote the x86 code for NASM. Since dch reported a speedup of 14% on a P2, I think it's worth adding this to the MMX-optimised build.
wisodev, could you build a Windows binary, please? You can download the package wavpack-bsr.tar.gz from http://base91.sourceforge.net/download/wavpack/. Apply the diff to the wavpack sources, assemble opt.asm with 'nasm -O2', and link everything together. You'll probably need to adjust the global labels in the asm file for Windows; I'm sure you know what to do.
When the binary is available, please test the extra modes. I'd suggest comparing the speed against the latest MMX binaries. It would be nice if you could also verify the output files.
Thanks,
Jo.
Sorry for the late answer; we had a very long (5-day) weekend in my country ;-)
I will build the binaries today (if everything goes OK) and run tests.
Thanks for the update and your work.
wisodev
Hm, I only had a 3-day weekend.
Please don't hurry! I've made a minor change to the asm code in the meantime. Today I'll probably also have the opportunity to do a MinGW build. I hope to be able to post my results later.
Anyway, thanks for your help!
No problem! But anyway ;-) it would be nice if you posted (or updated) the modified asm source code, or just posted the changes in a reply.
Since MinGW still uses gcc-3.4 (with inferior MMX builtins), I have just built a binary with my latest asm changes. You'll find everything you need in the archive wavpack-bsr-zip at http://base91.sourceforge.net/download/wavpack/.
There are two binaries in this package: wavpackA.exe (built from the original sources) and wavpackB.exe (with my asm optimisations). I was able to run a quick test with '-f -x6' on some kind of Celeron machine and measured a speedup of about 12%. It would be nice if somebody could test this on an Athlon processor.
I'm sure wisodev will provide a binary which will also include the MMX optimisations.
Well, I have included your asm optimizations (from the previous package, not from wavpack-bsr-zip) in the MMX version and done some quick tests. I am not sure of the results yet: the output files are the same as those from the original version (binary comparison), but the speedup is only 1% to 3% relative to the MMX version. One more thing: the -O2 switch was slower than -O1 with NASM. As I said, more testing is needed.
I will compare your build to mine and find out the best solution. I will post results and binaries (including sources) later today.
wisodev
Did you test on an AMD processor? That would explain the small speedup. At least I can be happy that it isn't slower. This code probably gives the best results on Intel's 6th-generation processors (Pentium Pro/2/3); I'm not sure about the Pentium M and Core Duo. Could somebody please run a test on a Pentium 4?
Btw.: If you want exact timing, you can use the command line tool 'timer' from http://7-zip.org/igor.html and calculate the speed differences from the reported 'User Time'.
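When comparing user times, it helps to say which percentage is meant. A small sketch of the two usual conventions follows; the function names are made up for this illustration, and the figures in the note below are the encode times quoted earlier in this thread.

```c
/* Share of the run time that was saved, in percent. */
static double time_saved_percent(double old_seconds, double new_seconds)
{
    return 100.0 * (old_seconds - new_seconds) / old_seconds;
}

/* Gain in throughput (work per second), in percent. */
static double throughput_gain_percent(double old_seconds, double new_seconds)
{
    return 100.0 * (old_seconds / new_seconds - 1.0);
}
```

For the PowerPC result (3:44 min = 224 s down to 2:57 min = 177 s), time_saved_percent gives about 21%, while throughput_gain_percent gives about 26.6%. For the Athlon XP -hx6 run (25m 23s = 1523 s versus 21m 38s = 1298 s), the time saved is about 14.8%.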
Yes, on an AMD Athlon XP 2000+ with WinXP SP2.
I have uploaded my binaries here (http://www.hydrogenaudio.org/forums/index.php?showtopic=43731&st=0&gopid=389669&). The package contains the original, the latest MMX-optimized and the MMX+BSR-optimized binaries, plus source code and some quick test results from my PC.
It looks like your binaries are slower than mine, but this is a compiler issue.
Thanks for the tip about timing, this will be very useful.
wisodev
I tested the MMX version on Linux using gcc-4.1 (the intrinsics version; the NASM code only seems to contain the BSR optimisation, and I haven't tested it). It is not faster on my Athlon XP, at least without additional wavpack parameters. Since the MMX code only parallelises by 2 and the Athlon's MMX latency is higher than that of the P3 core, this is not surprising, especially as the MMX code needs quite a few instructions to emulate the 32-bit multiply.
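To illustrate why the 32-bit multiply costs several MMX instructions: pmaddwd only multiplies signed 16-bit words, so the sample has to be split into a low 15-bit part and a high 16-bit part that are multiplied separately. The scalar sketch below mirrors the instruction sequence visible in the MMX patch later in this thread. It rests on some assumptions: that the lines elided between the hunks simply add the two partial results, that the weight fits in a signed 16-bit word, that the sample lies in the range -2^30 to 2^30 - 1, and that >> on negative values is an arithmetic shift. The function names are made up for this illustration.

```c
#include <stdint.h>

/* Direct form: (weight * sample + 512) >> 10, computed in 64 bits. */
static int32_t apply_weight_ref(int32_t weight, int32_t sample)
{
    return (int32_t)(((int64_t)weight * sample + 512) >> 10);
}

/* Split form, one step per MMX instruction in the patch. */
static int32_t apply_weight_split(int32_t weight, int32_t sample)
{
    /* pand with msk0: low 15 bits, always non-negative */
    int32_t lo = sample & 0x7fff;

    /* psrld 15 and pand with msk1: next 16 bits; pmaddwd reads this
     * word as signed, so sign-extend it from bit 15 */
    int32_t hi = (int32_t)((((uint32_t)sample >> 15) & 0xffff) ^ 0x8000) - 0x8000;

    /* pmaddwd, paddd with round, psrad 10 */
    int32_t part_lo = (weight * lo + 512) >> 10;

    /* pmaddwd, pslld 5: the high part carries a factor 2^15,
     * and 2^15 / 2^10 = 2^5 = 32 */
    int32_t part_hi = (weight * hi) * 32;

    return part_lo + part_hi;
}
```

The two forms agree exactly because the high partial product is a multiple of 1024 before the shift, so the rounding and the flooring shift only ever act on the low part.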
But I found a rather easy way to improve run time on Linux: compile the library statically (thus non-PIC) and link it into the executable. Run time was immediately more than 10% better for me (using a small test case, though).
Interestingly, the following patch gives me a few percent in the static case (but is a bit slower in the shared-library case). The patch actually makes the loop a bit slower, but eliminating code seems to make the cache happier, at least for me in a quick test.
--- extra2.c 2006-04-06 06:42:25.000000000 +0200
+++ extra2-opt.c 2006-05-06 16:11:08.000000000 +0200
@@ -63,36 +63,14 @@
dpp->samples_B [i] = exp2s (log2s (dpp->samples_B [i]));
}
- if (dpp->term == 17) {
- while (num_samples--) {
- int32_t left, right;
- int32_t sam_A, sam_B;
-
- sam_A = 2 * dpp->samples_A [0] - dpp->samples_A [1];
- dpp->samples_A [1] = dpp->samples_A [0];
- dpp->samples_A [0] = left = in_samples [0];
- left -= apply_weight (dpp->weight_A, sam_A);
- update_weight (dpp->weight_A, dpp->delta, sam_A, left);
- dpp->sum_A += dpp->weight_A;
- out_samples [0] = left;
-
- sam_B = 2 * dpp->samples_B [0] - dpp->samples_B [1];
- dpp->samples_B [1] = dpp->samples_B [0];
- dpp->samples_B [0] = right = in_samples [1];
- right -= apply_weight (dpp->weight_B, sam_B);
- update_weight (dpp->weight_B, dpp->delta, sam_B, right);
- dpp->sum_B += dpp->weight_B;
- out_samples [1] = right;
- in_samples += dir;
- out_samples += dir;
- }
- }
- else if (dpp->term == 18) {
+ if (dpp->term == 17 || dpp->term == 18) {
+ int term17 = dpp->term - 17;
+ int term15 = dpp->term - 15;
while (num_samples--) {
int32_t left, right;
int32_t sam_A, sam_B;
- sam_A = (3 * dpp->samples_A [0] - dpp->samples_A [1]) >> 1;
+ sam_A = (term15 * dpp->samples_A [0] - dpp->samples_A [1]) >> term17;
dpp->samples_A [1] = dpp->samples_A [0];
dpp->samples_A [0] = left = in_samples [0];
left -= apply_weight (dpp->weight_A, sam_A);
@@ -100,7 +78,7 @@
dpp->sum_A += dpp->weight_A;
out_samples [0] = left;
- sam_B = (3 * dpp->samples_B [0] - dpp->samples_B [1]) >> 1;
+ sam_B = (term15 * dpp->samples_B [0] - dpp->samples_B [1]) >> term17;
dpp->samples_B [1] = dpp->samples_B [0];
dpp->samples_B [0] = right = in_samples [1];
right -= apply_weight (dpp->weight_B, sam_B);
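The patch above folds the term 17 and term 18 loops into one by deriving the multiplier and the shift count from the term itself. Here is a minimal sketch for checking that the merged expression matches the two original ones. The function names are made up for this illustration, and like the original code it relies on >> being an arithmetic shift for negative values, which C leaves implementation-defined.

```c
#include <stdint.h>

/* Original predictor for term 17. */
static int32_t predict_term17(int32_t s0, int32_t s1)
{
    return 2 * s0 - s1;
}

/* Original predictor for term 18. */
static int32_t predict_term18(int32_t s0, int32_t s1)
{
    return (3 * s0 - s1) >> 1;
}

/* Merged predictor as in the patch; valid for term 17 and 18 only. */
static int32_t predict_merged(int term, int32_t s0, int32_t s1)
{
    int term17 = term - 17;   /* shift count: 0 for term 17, 1 for term 18 */
    int term15 = term - 15;   /* multiplier:  2 for term 17, 3 for term 18 */

    return (term15 * s0 - s1) >> term17;
}
```

The merged loop trades a constant multiplier and shift for variable ones, which matches the remark above that the loop itself gets slightly slower while the code gets smaller.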
One more thing: the -O2 switch was slower than -O1 with NASM. As I said, more testing is needed.
That's really strange since shorter instructions are decoded faster. Maybe it's an alignment issue.
It looks like your binaries are slower than mine, but this is a compiler issue.
Probably - I used '-mtune=pentium3' for my builds.
Yes, the NASM code only includes the BSR optimisation, and I always linked libwavpack statically. The normal modes of WavPack are generally not affected by my optimisations. You'll probably measure the highest speedup with '-x6'.
I'm already working on a few alternatives to the BSR code. We'll see if I can post some positive news next week.
Ah OK, I tried -x6 and now I see a difference, but in fact the MMX version is now slower (~20%). Since I know that gcc is a bit picky, I modified your patch, and now it is actually faster (~10%). (I cannot test reliably, as I have background processes running.)
This is not cleaned up, but the trick is to *not* use unions; gcc unfortunately treats them differently. Perhaps someone else wants to test whether my version is faster for them as well (using gcc/mingw)? I hope I didn't mess anything up...
--- extra2.c 2006-05-06 23:28:07.000000000 +0200
+++ extra2mmx.c 2006-05-06 15:42:30.000000000 +0200
@@ -57,42 +57,44 @@
if (dpp->term > 0) {
const int_mmx
delta = { dpp->delta, dpp->delta },
- msk0 = { 0x7fff, 0x7fff },
- msk1 = { 0xffff, 0xffff },
+ msk0 = { 0x00007fffL, 0x00007fffL },
+ msk1 = { 0x0000ffffL, 0x0000ffffL },
round = { 512, 512 },
zero = { 0, 0 };
int_mmx left_right, sam_AB, tmp0, tmp1;
- union {
+ /*union {
int_mmx q [MAX_TERM];
int d [2 * MAX_TERM];
- } samples_AB;
- union {
+ } samples_AB;*/
+ int_mmx samples_AB[MAX_TERM];
+ /*union {
int_mmx q;
int d [2];
- } weight_AB, sum_AB;
+ } weight_AB, sum_AB;*/
+ int_mmx weight_AB, sum_AB ={0,0};
- sum_AB.d [0] = 0;
- sum_AB.d [1] = 0;
- weight_AB.d [0] = restore_weight (store_weight (dpp->weight_A));
- weight_AB.d [1] = restore_weight (store_weight (dpp->weight_B));
+ //sum_AB.d [0] = 0;
+ //sum_AB.d [1] = 0;
+ *(int*)&weight_AB = restore_weight (store_weight (dpp->weight_A));
+ *((int*)&weight_AB+1) = restore_weight (store_weight (dpp->weight_B));
for (k = 0; k < MAX_TERM; ++k) {
- samples_AB.d [k * 2] = exp2s (log2s (dpp->samples_A [k]));
- samples_AB.d [k * 2 + 1] = exp2s (log2s (dpp->samples_B [k]));
+ *((int*)&samples_AB + k * 2) = exp2s (log2s (dpp->samples_A [k]));
+ *((int*)&samples_AB + k * 2 + 1) = exp2s (log2s (dpp->samples_B [k]));
}
if (dpp->term == 17) {
while (num_samples--) {
- sam_AB = __builtin_ia32_pslld (samples_AB.q [0], 1);
- sam_AB = __builtin_ia32_psubd (sam_AB, samples_AB.q [1]);
+ sam_AB = __builtin_ia32_pslld (samples_AB [0], 1);
+ sam_AB = __builtin_ia32_psubd (sam_AB, samples_AB [1]);
- samples_AB.q [1] = samples_AB.q [0];
- samples_AB.q [0] = left_right = *(int_mmx *) in_samples;
+ samples_AB [1] = samples_AB [0];
+ samples_AB [0] = left_right = *(int_mmx *) in_samples;
tmp0 = __builtin_ia32_psrld (sam_AB, 15);
tmp1 = __builtin_ia32_pand (sam_AB, msk0);
tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
tmp1 = __builtin_ia32_paddd (tmp1, round);
tmp0 = __builtin_ia32_pslld (tmp0, 5);
tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -107,9 +109,9 @@
tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
tmp0 = __builtin_ia32_por (tmp0, sam_AB);
tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
*(int_mmx *) out_samples = left_right;
@@ -119,20 +121,20 @@
}
else if (dpp->term == 18) {
while (num_samples--) {
- tmp0 = samples_AB.q [0];
- sam_AB = __builtin_ia32_psubd (tmp0, samples_AB.q [1]);
+ tmp0 = samples_AB [0];
+ sam_AB = __builtin_ia32_psubd (tmp0, samples_AB [1]);
tmp0 = __builtin_ia32_pslld (tmp0, 1);
sam_AB = __builtin_ia32_paddd (sam_AB, tmp0);
sam_AB = __builtin_ia32_psrad (sam_AB, 1);
- samples_AB.q [1] = samples_AB.q [0];
- samples_AB.q [0] = left_right = *(int_mmx *) in_samples;
+ samples_AB [1] = samples_AB [0];
+ samples_AB [0] = left_right = *(int_mmx *) in_samples;
tmp0 = __builtin_ia32_psrld (sam_AB, 15);
tmp1 = __builtin_ia32_pand (sam_AB, msk0);
tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
tmp1 = __builtin_ia32_paddd (tmp1, round);
tmp0 = __builtin_ia32_pslld (tmp0, 5);
tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -147,9 +149,9 @@
tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
tmp0 = __builtin_ia32_por (tmp0, sam_AB);
tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
*(int_mmx *) out_samples = left_right;
@@ -161,14 +163,14 @@
while (num_samples--) {
k = (m + dpp->term) & (MAX_TERM - 1);
- sam_AB = samples_AB.q [m];
- samples_AB.q [k] = left_right = *(int_mmx *) in_samples;
+ sam_AB = samples_AB [m];
+ samples_AB [k] = left_right = *(int_mmx *) in_samples;
tmp0 = __builtin_ia32_psrld (sam_AB, 15);
tmp1 = __builtin_ia32_pand (sam_AB, msk0);
tmp0 = __builtin_ia32_pand (tmp0, msk1);
- tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB.q);
- tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB.q);
+ tmp1 = __builtin_ia32_pmaddwd (tmp1, weight_AB);
+ tmp0 = __builtin_ia32_pmaddwd (tmp0, weight_AB);
tmp1 = __builtin_ia32_paddd (tmp1, round);
tmp0 = __builtin_ia32_pslld (tmp0, 5);
tmp1 = __builtin_ia32_psrad (tmp1, 10);
@@ -183,9 +185,9 @@
tmp0 = __builtin_ia32_pcmpeqd (left_right, zero);
tmp0 = __builtin_ia32_por (tmp0, sam_AB);
tmp0 = __builtin_ia32_pandn (tmp0, tmp1);
- weight_AB.q = __builtin_ia32_paddd (weight_AB.q, tmp0);
+ weight_AB = __builtin_ia32_paddd (weight_AB, tmp0);
- sum_AB.q = __builtin_ia32_paddd (sum_AB.q, weight_AB.q);
+ sum_AB = __builtin_ia32_paddd (sum_AB, weight_AB);
*(int_mmx *) out_samples = left_right;
@@ -194,13 +196,13 @@
m = (m + 1) & (MAX_TERM - 1);
}
}
- dpp->sum_A = sum_AB.d [0];
- dpp->sum_B = sum_AB.d [1];
- dpp->weight_A = weight_AB.d [0];
- dpp->weight_B = weight_AB.d [1];
+ dpp->sum_A = *(int*)&sum_AB;
+ dpp->sum_B = *((int*)&sum_AB+1);
+ dpp->weight_A = *(int*)&weight_AB;
+ dpp->weight_B = *((int*)&weight_AB+1);
for (k = 0; k < MAX_TERM; ++k) {
- dpp->samples_A [k] = samples_AB.d [m * 2];
- dpp->samples_B [k] = samples_AB.d [m * 2 + 1];
+ dpp->samples_A [k] = *((int*)&samples_AB+m * 2);
+ dpp->samples_B [k] = *((int*)&samples_AB+ m * 2 + 1);
m = (m + 1) & (MAX_TERM - 1);
}
__builtin_ia32_emms ();
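The patch above replaces union member access with pointer casts such as *(int*)&weight_AB to reach the two 32-bit lanes of an MMX value. A third option, shown here only as a sketch, is to copy the lanes with memcpy, which is the portable way to reinterpret bytes in C. The type v2si and the helper names are assumptions made for this illustration (the thread does not show how int_mmx is defined), the code assumes GCC or Clang vector extensions and a 4-byte int, and whether gcc generates equally fast code for it was not tested in this thread.

```c
#include <string.h>

/* Two 32-bit integers in one 8-byte vector (GCC/Clang extension). */
typedef int v2si __attribute__ ((vector_size (8)));

/* Write both lanes of *v without a union or a pointer cast. */
static void lanes_store(v2si *v, int lane0, int lane1)
{
    int tmp[2] = { lane0, lane1 };

    memcpy(v, tmp, sizeof tmp);
}

/* Read one lane (index 0 or 1) of *v. */
static int lane_load(const v2si *v, int index)
{
    int tmp[2];

    memcpy(tmp, v, sizeof tmp);
    return tmp[index];
}
```

With helpers like these, the setup and write-back code around the loops could stay free of unions, while the loops themselves keep using the builtins unchanged.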
Thanks for looking into this! Did you use gcc 4.1 for your tests again? It's too bad that it still has problems with unions and builtins. I'll do some testing without unions next week. That will hopefully enable other minor optimisations as well.
Thanks for looking into this! Did you use gcc 4.1 for your tests again?
Yup.
Sorry, I'm a bit busy these days. Now that I've finally found some time, I've finished one of the alternatives to the BSR code. This one gives me a speedup of up to 8% on a P3 Celeron (-f -x6), and it's also faster than the original code on my old AMD K6.
wisodev, could you do some testing again? Please try 'nasm -O2' first. You'll find the patch in this file: wp-4.32-jfl2b.diff.gz
wisodev, could you do some testing again?
OK, I will post results (and binaries plus sources) tomorrow.
OK, I have added these optimizations, but I have only run one quick test and the results are not so good. The binaries and sources are here (http://www.hydrogenaudio.org/forums/index.php?s=&showtopic=43731&view=findpost&p=395239). Maybe someone else can run some tests to confirm my results.
I've updated the patch wp-4.32-jfl2b.diff.gz (http://base91.sourceforge.net/download/wavpack/) to avoid a memory stall. On a Celeron (Tualatin) I measured a speedup of 10% compared to the original wavpack sources.
wisodev, it would be nice if you could provide binaries and test results again.
Thanks a lot,
Jo.
What happened to the PowerPC-optimized version?
The wavpack-4.31-ppc.diff fails on the WavPack 4.31 and 4.32 source code:
patch -p0 < wavpack-4.31-ppc.diff
patching file bits.c
Hunk #1 FAILED at 149.
1 out of 1 hunk FAILED -- saving rejects to file bits.c.rej
patching file pack.c
Hunk #1 FAILED at 567.
1 out of 1 hunk FAILED -- saving rejects to file pack.c.rej
patching file unpack3.c
Hunk #1 FAILED at 1583.
Hunk #2 FAILED at 1604.
Hunk #3 FAILED at 1988.
3 out of 3 hunks FAILED -- saving rejects to file unpack3.c.rej
patching file wavpack.h
Hunk #1 FAILED at 415.
Hunk #2 FAILED at 518.
2 out of 2 hunks FAILED -- saving rejects to file wavpack.h.rej
patching file words.c
Hunk #1 FAILED at 69.
Hunk #2 FAILED at 93.
Hunk #3 FAILED at 144.
Hunk #4 FAILED at 419.
Hunk #5 FAILED at 1302.
Hunk #6 FAILED at 1325.
Hunk #7 FAILED at 1376.
7 out of 7 hunks FAILED -- saving rejects to file words.c.rej
Am I missing something, or do you simply need to update the patch (hopefully for 4.32)?
Thanks!
The problem is that the WavPack sources for Unix have DOS line endings too. You need to remove the carriage returns from these files before you can apply the patch. You can do that with the following commands:
mkdir src
for FILE in *.[ch]; do tr -d '\r' < "$FILE" > "src/$FILE"; done
rm *.[ch]
mv src/* .
As stated in the Upload thread, the new binaries will be available very soon.
PS. Sorry for the double post, but I missed this post.
The problem is, that the WavPack sources for Unix have DOS line endings too.
Ok, I see.
I'll change the line endings and try again...
Thanks!
No problem. It would be nice if you could publish your own test results here.
I have uploaded the latest optimizations; the binaries and sources are available for download here (http://www.hydrogenaudio.org/forums/index.php?showtopic=43731). It looks very good. For more details, check the Upload thread.
Oh, this is great! I'm very happy now that we have a faster binary for Athlons as well. I had hoped to find a solution that is good for all common processors, but the P4 may still be a problem. We'll see if we can get rid of the BSR code in the future.
Are you considering adding MMX optimizations for negative values of dpp->term? This was discussed some time ago. I think there are a few percent of improvement to be had there too. This part of the code handles about 25% of all calls to the function decorr_stereo_pass. Or is there a reason not to optimize it? This is just a suggestion; I know it takes a lot of time to do such things.
Besides, the one test on a P4 showed that BSR and MMX work pretty nicely on these machines. The newest optimization has not been tested on a P4, but I think it should do as well. But I am just speculating here ;-)
I am looking forward to your future optimizations!
Are you considering adding MMX optimizations for negative values of dpp->term? This was discussed some time ago. I think there are a few percent of improvement to be had there too. This part of the code handles about 25% of all calls to the function decorr_stereo_pass. Or is there a reason not to optimize it? This is just a suggestion; I know it takes a lot of time to do such things.
I already did this a month ago, but tests on my old AMD K6 showed that the code was actually slower. You can imagine that this result wasn't very motivating. In the meantime, though, I've got the impression that testing on this obsolete processor isn't really reliable. So maybe it's worth having another look at the code.
Besides, the one test on a P4 showed that BSR and MMX work pretty nicely on these machines. The newest optimization has not been tested on a P4, but I think it should do as well. But I am just speculating here ;-)
Yes, that's what I meant. BSR could still be faster on a P4, because the 'jfl2b' code uses floating point instructions, which have a long latency on these processors. Theoretically, I could nearly double the throughput of the function by improving parallelism, but I'm afraid that this wouldn't give much gain in the end.
I am looking forward to your future optimizations!
Me too. I really like doing this work, and it's nice to make people happy that way. But, you know, it's often just a matter of time. We'll see what I'm able to do.
I made the test with my P4. I can see some improvements but BSR is still quite a lot faster. Keep up the good work he-jo.
I find this a bit confusing: your link points to post #1, where one is redirected to post #21, but that post seems to be old as well (last edited on May 23, while the new binary should be of May 30) so I'm not sure if it is really the correct one.
Wouldn't it be better to just have post #1 point to the latest and greatest version available?
Cheers and compliments for the great work you're all doing!
Sergio
I made the test with my P4. I can see some improvements but BSR is still quite a lot faster. Keep up the good work he-jo.
Thanks! At least I can be happy that it isn't slower than the original code anymore.
Yes, it is my fault. I have not updated these redirections, and I know that this is not the best solution. This will be corrected. I hope that this time I can solve the problem and make things as clear as possible. I am only human ;-).
Don't worry, wisodev, we all are human or at least we should be! :-))
Anyway, the link in post #21 is the good one, isn't it?
Cheers
Sergio
Edit: spelling
Edit 2: Oh, now I see! You edited post #1 and the correct link is now in post #31. Thanks!
No problem!
---
I think that when the next update is available, I will edit post #1 and place all downloads there. The original posts that currently contain links will get a redirection to post #1. This will be less confusing, and keeping all downloads in one post will clarify the situation. But I am open to other solutions too.