Tech Demo: Parallel AC3 Decoding

2003-02-15 12:38:53

If you are a newbie, this is probably too complicated for you, so better forget about it...

So, for exercise purposes I made a parallel (or rather distributed) azid-based AC3 decoder. Well, not a decoder actually: it only searches for the peak of an AC3 file. As I used MPI, you can use machines connected via LAN as a cluster and thus accelerate the process, so you don't need any high-cost machines.
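
The idea of a distributed peak search can be sketched as follows. This is not the actual tech demo code (that decodes via azid.dll and communicates over MPI); it is only a minimal Python illustration, assuming each process already holds its decoded samples. The sample values and the function names are made up for the example.

```python
def local_peak(samples):
    """Largest absolute sample value in one chunk (0.0 for an empty chunk)."""
    return max((abs(s) for s in samples), default=0.0)


def global_peak(chunks):
    """Combine the per-chunk peaks; in an MPI program this would be a max-reduction."""
    return max((local_peak(c) for c in chunks), default=0.0)


# Hypothetical decoded samples, one list per process.
chunks = [[0.10, -0.50, 0.25], [0.30, -0.20, 0.45]]
print(global_peak(chunks))
```

Only one number per process has to travel back to the server, which is why a peak search is an easy first exercise for a cluster.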

You need this to run the app:

- At least one Win NT machine. The version of MPI I used only runs on NT machines.
- Get "NT MPICH" (search in google) and install it. You need to install the cluster service on every machine and add the lib (and optionally bin) dir to the path (see quickstart.pdf).
- Place my app with azid.dll on every machine in the SAME dir, eg. c:\mpi
- Place any ac3 you want to test in the c:\mpi dir on the server only. Rename the ac3 to input.ac3.
- As I couldn't get rexecshell to run, I used mpiexec.exe. Use it the following way on your "server":

mpiexec -wdir c:\MPI -user USER -domain local -password PASS -n 2 -host machine1,machine2 c:\MPI\MPI.exe

where USER and PASS are your NT login data (and if you have installed it in a domain, specify the domain). The -n parameter specifies the number of processes to use and the -host parameter which machines to use. List your server first. You can specify a machine multiple times, which makes sense for testing purposes or if you have a dual-CPU system. The -wdir parameter specifies the working dir, and finally you tell mpiexec which app to execute.

If you have trouble running the app, don't ask me for help. Try reading the docs which come with NT MPICH. If anything goes wrong with the app, you will probably get a deadlock, as I didn't put any effort into shutting it down properly. So just close the DOS box and wait till MPI has finished closing the processes. You should probably disable any firewall, as it could interfere with MPI.

I am still coding, so either wait till I upload a useful, fast version or mail me for a test version.

Here is a quick bench I did with a 212MB 2ch ac3:

server only:
3:19.116 min:sec.ms

cluster (2 machines):
2:59.738 min:sec.ms

Yeah, not much of an improvement, but consider that I only have a crappy 10Mbit BNC network!!! And the code is very unoptimized, i.e. it was a debug build I used for testing and it still uses blocking communication. So the CPU on the second machine was only used to about 60% and the main one to about 88%. As you can see, there is plenty of room. Expect a nice boost as soon as I have non-blocking communication in.

Machine 1: Athlon XP @ 1.5GHz, w2k
Machine 2: Duron (old) @ 0.8GHz, w2k

I put in a weighting factor to take into account that the machines are of different speeds. A good start would be using the GHz values as weights. I found the current optimum at about 1.5 for the server and 0.65 for the Duron. Even from the weights you can see there is plenty of room to optimize.
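
How such weights could translate into work shares can be sketched like this. It is an illustration only, not the code of the tech demo: it assumes the input is cut into contiguous byte ranges proportional to the weights, and the function name is hypothetical. A real decoder would also have to move the cut points to AC3 frame boundaries, which is left out here.

```python
def split_by_weight(total_bytes, weights):
    """Return one (start, end) byte range per process, sized proportionally to its weight."""
    total_weight = sum(weights)
    ranges = []
    start = 0
    cumulative = 0.0
    for i, weight in enumerate(weights):
        cumulative += weight
        if i == len(weights) - 1:
            end = total_bytes  # last process takes whatever is left
        else:
            end = int(total_bytes * cumulative / total_weight)
        ranges.append((start, end))
        start = end
    return ranges


# Weights from the post: about 1.5 for the server, 0.65 for the Duron.
for start, end in split_by_weight(212 * 1024 * 1024, [1.5, 0.65]):
    print(start, end, end - start)
```

With these weights the server gets roughly 70% of the data and the Duron the remaining 30%.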

Some questions which might arise:

- Will you release the source? - Probably not. I might explain in detail how this app works, so "everyone" could code it him/herself and maybe create better code.
- Will you incorporate MPI into HeadAC3he? - No.
- So what did you do all this crap for? - I need to get familiar with MPI and, as I said, it is a personal exercise for me. Furthermore, one should see how many possibilities MPI opens up to save time and use otherwise unused resources...
- Where to get your tech demo? - As I said, wait a bit and try to set up NT MPICH in the meantime, or, if you can't wait, send me an email AFTER you have done the latter successfully.

[later]
So, non-blocking comm is in, and I used a release build. New times with the same AC3:

server only:
2:40.281

cluster:
2:21.934

[even later]
Well, I decided to thread I/O and send/receive (using blocking communication again, non-blocking makes no sense). I want to use sync sends, which is even faster, but that errors out. Dunno whether I did something stupid or it is NT MPICH's fault.

server only:
2:36.xxx (don't remember exactly)

cluster:
2:14.053

Well, the speed-up is quite OK if you calc it in %, but far from the theoretical possibilities. I guess my 10Mbit net really is the bottleneck.
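
For reference, the speed-up in % can be calculated from the posted times like this. A small Python sketch; the helper names are made up, and the threaded run is left out because its exact server-only time was not recorded.

```python
def to_seconds(stamp):
    """Convert a 'min:sec.ms' time such as '3:19.116' to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def speed_up(server_only, cluster):
    """Ratio of server-only time to cluster time (1.0 means no gain)."""
    return to_seconds(server_only) / to_seconds(cluster)


runs = {
    "debug build, blocking": ("3:19.116", "2:59.738"),
    "release build, non-blocking": ("2:40.281", "2:21.934"),
}
for name, (server_only, cluster) in runs.items():
    ratio = speed_up(server_only, cluster)
    saved = (1 - 1 / ratio) * 100  # share of the server-only time that was saved
    print(f"{name}: {ratio:.3f}x, {saved:.1f}% less time")
```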

So, I attached the file for you to test it on your own "cluster". I would really be interested in how much speed-up you get on a fast LAN with several machines.

To specify weights (see above), create an MPITD.INI in the local dir on each machine. Just put your number inside, in my case: 1.533 on the server and 0.8 on the second machine. If you have trouble getting it to run with mpiexec, you can try it manually. To do so, you have to go to each machine you want to use and start the process there (or use remote-control software; executing the exe via the network will start the process on the machine you are sitting at, so don't do this...).

So manually (see above for parameter):
server: MPITD.exe -n 2
clients: MPITD.exe -m SERVER

Start the server first and have fun! Please post your results here or email them to me [da1r1k1av@g1mx.1ne1t remove the 1s....].

The file will hopefully be available here
http://www.everwicked.com/forums/showthrea...10721#post10721
http://forum.doom9.org/showthread.php?s=&p...1117#post261117


Reply #1 – 2003-04-07 17:37:11

So, finally I have got a 100Mbit link between the two machines and did another test. (BTW, I was mistaken, the second machine is clocked at 900MHz.) I tested with another ac3, as I can't recall which one I used in the first test. This one is a 500MB 5.1 AC3 file. Times:

server:
5:19.830

cluster:
3:39.636

So, in the end I reached about the expected speed-up! The weights used were 0.9 and 1.533, just the GHz values. CPU usage was usually >93%. I suspect that the CPU not being used at 100% is on the one hand due to the dependency between both processes, as I coded it in a simple manner: at each iteration both have to finish before the next iteration begins. Another problem seems to be HD reading: using the cluster stresses the HD, as the first process gets the beginning of the ac3 and the second one the second "half", so in each iteration the head has to reposition.
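
What "expected speed-up" means here can be put into numbers. The sketch below assumes that throughput scales linearly with the GHz weights and that communication and disk access cost nothing; both are simplifications, and the function name is hypothetical.

```python
def ideal_speed_up(weights, baseline_index=0):
    """Best-case speed-up over the baseline machine if work scales with the weights."""
    return sum(weights) / weights[baseline_index]


weights = [1.533, 0.9]  # server first, then the second machine
server_only = 5 * 60 + 19.830  # 5:19.830 in seconds
cluster = 3 * 60 + 39.636      # 3:39.636 in seconds

ideal = ideal_speed_up(weights)
measured = server_only / cluster
print(f"ideal {ideal:.2f}x, measured {measured:.2f}x, efficiency {measured / ideal:.0%}")
```

Under these assumptions the measured run reaches a bit over 90% of the ideal value, which fits the observed CPU usage of above 93%.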
