FLACCL: CUDA-enabled FLAC encoder by Gregory S. Chudov (prev. FlaCuda)

Reply #50 – 2009-09-15 04:57:18

Quote from: Case on 2009-09-14 21:18:31

Impressive.

Confirmed.

Code:

                  CPU -8    GPUv3 lv6   GPUv4 lv6   GPUv3 lv7   GPUv4 lv7   GPUv4 lv11
ZUN                49.8x       60.4x      71.4x       46.8x      52.0x      13.57x
Rammstein          49.5x       63.4x      74.7x       48.9x      54.3x      13.59x

And the file sizes are down again. Decode times don't seem to suffer against the CPU Flac-encoded files at similar sizes either (592x vs 600x and the file is 0.3% smaller).
Keep up the good work!

Reply #51 – 2009-09-15 05:24:02

So far I wasn't much interested in the small compression gain, but the speed gain of V0.4 is impressive.
Anyway, I will not buy an Nvidia card just for CUDA... sadly I own a GeForce 7600GS, and my next graphics card will be integrated into my future Core i3 530, so I guess I will never use flacuda ;(

It makes me wonder how fast a multi-threaded flacuda -4 encode could be when run on a Sandy Bridge octo-core with a GeForce 300 ... more than 16x faster than on my old Athlon XP 3000+ (Barton), I guess.

The sad thing for flacuda is that cheap GPUs will be integrated into low-end CPUs as soon as 2010 (Clarkdale, 2 cores + 45nm GPU) and into mid-range CPUs as soon as 2011 (Sandy Bridge, 4 cores + 32nm GPU) on the Intel side, and AMD will follow (one year late, as always). All these integrated GPUs will have hardware acceleration for Blu-ray video codecs, so unless you're a die-hard gamer, buying an Nvidia card will be a pure waste of money.

The coming years will be hard for nvidia. I am not even sure it will survive.

Reply #52 – 2009-09-15 06:50:25

I wouldn't kill nVidia just yet. AFAIK, as of now, it is the only card that supports GPU video transcoding, and it is heavily used in newer encoding applications, as well as in the new Photoshop for calculating some effects.
While we are on the subject, where is this multithreaded flac encoder, BINARY, so I can test it?

Reply #53 – 2009-09-15 07:05:35

IMHO audio or video encoding will not help Nvidia survive for long, because if the only purpose of buying a GPU becomes accelerating encoding, then you'd better buy a higher-end CPU. Being manufactured on a smaller process than GPUs, CPUs will always have the advantage in brute encoding force vs. power consumption & heat.

As for a multithreaded flac encoder, AFAIK there is none. I think I recall reading about some very experimental proof-of-concept code on some mailing list, but nothing serious.

Maybe we should start a donation to buy a quad core for Josh; it cannot be more useless than buying a PC for Klemm, after all.

Reply #54 – 2009-09-15 12:43:45

Quote from: sauvage78 on 2009-09-15 07:05:35

As for a multithreaded flac encoder, AFAIK there is none, ..

The simplest way to use multithreading with any encoder is to run multiple encoders simultaneously (foobar2000 can do that). The number of usable threads depends on when the hard disk becomes the bottleneck.

Reply #55 – 2009-09-15 20:34:05

Just converted my entire FLAC -8 library to FlaCuda -8 and it became 0.157% smaller. That's a slightly smaller difference than my sample file suggested (0.166%).

Reply #56 – 2009-09-16 08:45:13

It seems I found some strange behaviour. If you have a 16-bit file that doesn't use all of its bits, it gets much larger than with flac or even your CUETools.Flake.exe encoder.

For the example I took a 16-bit file, converted it to 8-bit and back to 16-bit, so 8 bits are unused.
All at -8:

flac 1.2.1     8.561.886 bytes
your flake     8.572.572 bytes
FlaCuda       45.509.060 bytes
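
For reference, the FLAC format covers this case with a per-subframe "wasted bits" count: when every sample in a block has the same number of zero low-order bits, the encoder can shift them out and code the block at a reduced bit depth. A minimal sketch of the detection step in C, assuming samples are held in 32-bit buffers; the names `wasted_bits` and `drop_wasted_bits` are hypothetical and not taken from any of the encoders mentioned here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count how many low-order bits are zero in every sample of a block.
 * Hypothetical helper. Returns 0 for an all-zero or empty block, so a
 * silent block is never shifted. */
static int wasted_bits(const int32_t *samples, size_t n)
{
    uint32_t seen = 0;
    for (size_t i = 0; i < n; i++)
        seen |= (uint32_t)samples[i];   /* collect every bit that is ever set */
    if (seen == 0)
        return 0;
    int shift = 0;
    while ((seen & 1u) == 0) {          /* trailing zeros are shared by all samples */
        seen >>= 1;
        shift++;
    }
    return shift;
}

/* Remove the shared zero bits in place. The predictor and residual coder
 * would then work on (bits_per_sample - shift)-bit values. */
static void drop_wasted_bits(int32_t *samples, size_t n, int shift)
{
    assert(shift >= 0 && shift < 31);
    for (size_t i = 0; i < n; i++)
        samples[i] /= (int32_t)1 << shift;  /* exact, because the low bits are zero */
}
```

For the file described above (8-bit content padded back to 16 bits), such a check would report 8 wasted bits per block, and the encoder would effectively compress 8-bit audio.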

Reply #57 – 2009-09-16 09:11:35

Quote from: GeSomeone on 2009-09-15 12:43:45

Quote from: sauvage78 on 2009-09-15 07:05:35
As for a multithreaded flac encoder, AFAIK there is none, ..

The simplest way to use multithreading with any encoder is to run multiple encoders simultaneously (foobar2000 can do that). The number of usable threads depends on when the hard disk becomes the bottleneck.

Now we just need a way to simultaneously run CUDA and CPU encoders.

Reply #58 – 2009-09-16 18:36:25

Quote from: Wombat on 2009-09-16 08:45:13

It seems I found some strange behaviour. If you have a 16-bit file that doesn't use all of its bits, it gets much larger than with flac or even your CUETools.Flake.exe encoder.

Could you try to encode a lossywav-processed file to see if it shows the same behaviour? (In that case, wav -> flacuda would give the same size as wav -> lossywav -> flacuda.)

If that is true, it would seem that this FLAC implementation misses part of the specification (and maybe could reduce the size even further).

Reply #59 – 2009-09-16 19:09:26

FlaCuda_0.4 with "-8" switch, original test file: 975 kbps;

After LossyWAV --standard:

FlaCuda_0.4 -8: 996 kbps;
FlaCuda_0.4 -8 -b 512: 1011 kbps.

Flake_0.11 -8: 1000 kbps (Flake encoder from Winamp Essentials Pack 5.55).

Flac_1.2.1 -5 -b 512: 462 kbps.

Reply #60 – 2009-09-16 19:14:14

Quote from: lvqcl on 2009-09-16 19:09:26

FlaCuda_0.4 with "-8" switch, original test file: 975 kbps;

After LossyWAV --standard:

FlaCuda_0.4 -8: 996 kbps;
FlaCuda_0.4 -8 -b 512: 1011 kbps.

Flake_0.11 -8: 1000 kbps (Flake encoder from Winamp Essentials Pack 5.55).

Flac_1.2.1 -5 -b 512: 462 kbps.

I can confirm that behaviour. One thing to note is that Mr. Chudov's CUETools.Flake.exe also compresses lossywav files about as well as flac 1.2.1 does.
So hopefully it will only need a small fix.

Reply #61 – 2009-09-17 08:51:32

Quote from: odyssey on 2009-09-16 09:11:35

Now we just need a way to simultaneously run CUDA and CPU encoders.

Sounds fun, though I'm afraid we'd bump into a strong bottleneck because of disk head positioning. Even converting with 2 threads, one HDD seeks like crazy - but it's still a lot faster than 1 thread.
NCQ in AHCI mode should help a lot with more threads, but it didn't when I tested it a while ago. Physically different source/target drives can alleviate this bottleneck quite a bit.
Fast SSDs are worth a try too.
This CUDA encoder can be a different solution: even a single instance is faster than the reference encoder running on one core of my CPU (converting one file at a time is the least disk-bottlenecked way to do it).
A natively multithreaded CPU-based encoder (working on segments of one single track) is another option.

Reply #62 – 2009-09-17 23:12:25

Added lossyWav support. It shouldn't make any difference for normal wavs.

Reply #63 – 2009-09-17 23:30:04

Thanks for your fast work on that!!
It works flawlessly now on the 8-bit and lossywav files.
Using it on normal files at -8 gives results even a few bytes smaller than 0.4.

Edit: For those who use lossywav: standard flac seems to compress these files a bit better, but not by much.

Reply #64 – 2009-09-18 19:05:07

Is there anybody here who knows the math behind the Cholesky decomposition used in ffmpeg as an alternative method for the LPC coefficient search?
This method is too slow for the CPU, but I thought I'd give it a shot on the GPU.
The problem is, the GPU doesn't do double precision very well.
The lls code from ffmpeg doesn't work on single precision floats due to overflows.
My first idea was to scale down the signal to avoid overflows, but the results were poor.
There's something I don't understand about this algorithm: in theory, LPC coeffs shouldn't depend on the scale of the signal - after all, they are linear.
I have a suspicion that in practice this algorithm does depend on the scale of the signal a lot. I don't pretend to understand this math, but:
The first suspicious piece of code is this (from av_solve_lls):

Code:

            double sum= covar[i][j];
            for(k=i-1; k>=0; k--)
                sum -= factor[i][k]*factor[j][k];

When the signal is multiplied by 10, covar[i][j] is multiplied by 100, and both factor[i][k] and factor[j][k] are multiplied by 100, so factor[i][k]*factor[j][k] is multiplied by 10000. So this sum doesn't scale in any predictable fashion.
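
For what it's worth, if `factor` holds the usual Cholesky factor (diagonal entries are square roots of the running sum, off-diagonal entries are the running sum divided by a diagonal entry), the bookkeeping works out differently. Writing $C$ for `covar`, $L$ for `factor` and $s > 0$ for the signal scale, and ignoring the threshold clamp for the moment:

```latex
C = L L^{\mathsf T}
\quad\Longrightarrow\quad
s^{2} C = (sL)(sL)^{\mathsf T},
\qquad
s^{2} C_{ij} - \sum_{k<i} (s L_{ik})(s L_{jk})
  = s^{2} \Bigl( C_{ij} - \sum_{k<i} L_{ik} L_{jk} \Bigr).
```

Under that assumption, multiplying the signal by 10 multiplies `covar` by 100 but `factor` by only 10, so every term of the sum picks up the same factor of 100.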

I also don't understand this magic 'threshold' business.

Code:

                if(sum < threshold)
                    sum= 1.0;

How should the threshold scale with the signal? Should the sum always be set to 1.0 if it's below threshold, or to some value depending on the scale of the signal? Or am i on the wrong track completely?

I also found this old post from Josh:

Quote from: jcoalson on 2006-07-24 07:04:38

I have actually been doing experiments solving the full prediction linear system with SVD; this should give a lower bound on the compression achievable by the FLAC filter.

Is there any working code left from those experiments, and how successful were they?

Reply #65 – 2009-09-18 19:54:14

I must add that when the computations are done in double precision, the lls coeffs do not depend much on the scale of the signal, so the algorithm works despite the non-linear scaling of intermediate values.
But in single precision they start to drift much more. Which is weird, because in the literature Cholesky decomposition is said to be more stable than Levinson-Durbin recursion with regard to rounding errors.

Here is a sample of this drift in double precision:

Code:

SCALE: 1.0/6 COEFF[31,0..2]: 0.523100 0.287037 0.204438; COVAR[31,0..2]: 43226.383239 170398.007602 -241511.245261
SCALE: 1.0/7 COEFF[31,0..2]: 0.523086 0.287057 0.204432; COVAR[31,0..2]: 37051.186185 146055.437263 -207009.641880
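
For comparison, here is a minimal sketch of the Levinson-Durbin recursion mentioned above, working from autocorrelation values. Each reflection coefficient is a ratio of two quantities that scale the same way with the signal, so in exact arithmetic the resulting coefficients do not depend on the scale. The function name `levinson_durbin` and the `MAX_ORDER` limit are assumptions for illustration; the sketch uses doubles and does not reproduce the single-precision drift.

```c
#include <assert.h>

#define MAX_ORDER 32  /* illustrative upper bound */

/* Levinson-Durbin recursion. Given autocorrelation r[0..order], fill
 * a[0..order-1] so that x[n] is predicted by sum_k a[k] * x[n-1-k].
 * Returns the remaining prediction error energy. */
static double levinson_durbin(const double *r, int order, double *a)
{
    double tmp[MAX_ORDER];
    double err = r[0];

    assert(order >= 1 && order <= MAX_ORDER);
    for (int i = 0; i < order; i++) {
        if (err <= 0.0) {               /* nothing left to predict */
            for (int j = i; j < order; j++)
                a[j] = 0.0;
            break;
        }
        double acc = r[i + 1];
        for (int j = 0; j < i; j++)
            acc -= a[j] * r[i - j];
        double k = acc / err;           /* reflection coefficient */

        for (int j = 0; j < i; j++)     /* update the lower-order coefficients */
            tmp[j] = a[j] - k * a[i - 1 - j];
        for (int j = 0; j < i; j++)
            a[j] = tmp[j];
        a[i] = k;
        err *= 1.0 - k * k;
    }
    return err;
}
```

Feeding it the autocorrelation of a signal and of the same signal multiplied by a constant should give the same coefficients up to rounding, with only the returned error energy changing.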

Reply #66 – 2009-09-19 09:38:56

Quote from: alvaro84 on 2009-09-17 08:51:32

Sounds fun, though I'm afraid we'd bump into a strong bottleneck because of disk head positioning. Even converting with 2 threads, one HDD seeks like crazy - but it's still a lot faster than 1 thread. […] A natively multithreaded CPU-based encoder (working on segments of one single track) is another option.

Ideally you would run multiple instances of a single-threaded encoder (one track per CPU core) and one instance of the CUDA encoder per GPU at the same time - it's just a matter of making sure that all instances are kept busy.

When the number of remaining tracks gets lower than the number of available cores, you prioritize the GPU instance (since it's faster than a single-threaded encoder on a single CPU core), but also run (if available) a multi-threaded encoder; one MT encoder over two cores is likely to be slower than two instances of a ST encoder over the same number of cores (see the Lancer builds of the Ogg Vorbis encoder). In other words, an MT encoder is particularly useful for keeping CPU cores busy when the workload dries up.

In short, the priorities go like this (if you have a multi-core CPU, that is):
ST * n CPU cores > GPU > MT

As for the I/O bottlenecks, that's when a large enough RAMdisk comes in very handy. Even just 1GiB is often enough for encoding a whole album (WAV + FLAC or FLAC + Ogg Vorbis or whatever on the RAMdisk).

I already use all available CPU cores when I encode my rips to FLAC or any other codec (one track per core); what I could really use, even before a MT FLAC encoder comes up, is a simple, command-line, multi-threaded Replay Gain utility. As I've said in the past, computing RG values on an album now takes longer than encoding it in the first place (because the former uses only one core while the latter uses all 4 cores on my quadcore CPU).

Reply #67 – 2009-09-19 13:04:26

Quote from: skamp on 2009-09-19 09:38:56

As for the I/O bottlenecks, that's when a large enough RAMdisk comes in very handy. Even just 1GiB is often enough for encoding a whole album (WAV + FLAC or FLAC + Ogg Vorbis or whatever on the RAMdisk).

You're absolutely right, I don't know how I could forget about RAMdisks. I used them all the time when 8MiB felt like plenty of RAM, but somehow I never thought about them since we have multiple GiBs at our disposal... talk about contradictions...

Reply #68 – 2009-09-19 22:39:37

I've gotten flacuda to work with the old but still handy Flac Frontend. The only little issue is that flacuda doesn't recognize the -V option as verify like the flac.exe does, so I can't use the verify checkbox in the Frontend. It's a tiny thing, but it would be cool if, maybe along with a future update, -V was added to flacuda. If not, I'll just go about setting it up to work with Foobar.

Thank you again, Gregory. Very cool stuff.

Reply #69 – 2009-09-19 23:10:38

Quote from: gib on 2009-09-19 22:39:37

I've gotten flacuda to work with the old but still handy Flac Frontend. The only little issue is that flacuda doesn't recognize the -V option as verify like the flac.exe does, so I can't use the verify checkbox in the Frontend. It's a tiny thing, but it would be cool if, maybe along with a future update, -V was added to flacuda. If not, I'll just go about setting it up to work with Foobar.

Thank you again, Gregory. Very cool stuff.

Since you can't use replaygain with Flac Frontend and FlaCuda, and you still want its simple layout, just try Multi Frontend from the same author. There you can define your command line with --verify.
I even resurrected frontah for mirroring old files to a new folder with FlaCuda and tags in one click. Its ini is simple to adjust to make it work. Too bad frontah development was stopped.

Edit: If anyone recommends foobar now, please tell me how you can simply mirror (re-encode) folders while copying tags and replaygain info in one go. I didn't manage to do it that simply, but I read here and there "Use foobar" with no detailed info on how. Maybe I misunderstand its functionality.

Edit2:
Finished the re-encode of my collection. Since I used flac 1.1.0-1.2.1, flake and some other builds over the years, I suppose it is of no use to calculate my space savings as a guiding value.
On some albums there were big savings. A few albums came out bigger, mainly very quiet music or music with many silent parts in it. I can imagine that on some collections with special kinds of music it won't save as much space as expected.

Reply #70 – 2009-09-20 00:41:23

Wombat, thanks for the suggestion of using Multi Frontend. I can't believe that I haven't downloaded it before. Thanks again!

Reply #71 – 2009-09-25 23:45:07

Here is a version that is a tiny bit faster, I hope. Since the HDD was the bottleneck for the previous version, I was able to measure the speed improvement only when using a RAMDisk.

I'm still curious about alternative algorithms to Levinson-Durbin (I commented above on my problems with ffmpeg's least-squares model). Any help would be appreciated.

Reply #72 – 2009-09-26 01:58:10

I really can't tell if your FlaCuda became any faster, because it was damn fast before. All I can say is that it is kind of fun having the GPU do its job while you don't notice your system being under heavy stress. So while encoding with FlaCuda you can still do heavy tasks in the foreground. I love it.

Reply #73 – 2009-09-26 11:19:30

This is getting ridiculous. New FlaCuda 0.6 is faster even in mode -8 than 0.4 was in -0.
[attachment=5408:flac_vs_flacuda.png]

Reply #74 – 2009-09-26 12:54:37

I second that, it's ridiculous.
Now FlaCuda 0.6 at -6 is almost as fast as Flac 1.2.1 -8 running on two threads... and this is a stock 8600GT standing up against a pretty heavily overclocked Core 2 Duo... If I give the GeForce a little bit of overclock, it comes out faster than the 2 instances of Flac 1.2.1 together... the file sizes are even a bit smaller than with the CPU encoder and there are 'more hardcore' settings... it's true that heavier compression takes a toll on decoding speed too, so I stick with the original -8-ish compression when I use FLAC.
TAK is somewhat slower to decode, but it compresses better than even FlaCuda does at -11 and that's beyond the speed crossover point: that -11 FLAC is slower to decode than the -p2m TAK (which is 18kbps smaller in case of my test material).
No, it's not a TAK marketing remark, I'm just testing that too, it's interesting for me to compare these codecs.
