FAAD2 optimization

Long shot, but has anyone performed, or seen any references to, optimization of the FAAD2 decoder? I'm trying to decode a stream with SBR and PS on a microcontroller and each frame is taking too long: around 30ms, when I need low 20-something.

This effort might be futile, but if an algorithm's weak point (or CPU cycle heavy functions) is known, it's easier to attempt optimization for a particular architecture.

Re: FAAD2 optimization

Reply #1
A few people as well as myself put a lot of effort into libfaad optimization for rockbox about 10 years ago:

https://git.rockbox.org/?p=rockbox.git;a=history;f=apps/codecs/libfaad;hb=a0009907de7a0107d49040d8a180f140e2eff299

I mostly worked on the LC part (getting a faster MDCT helps a lot there), but you can look at Andree Buschmann's work on the SBR parts, which made it substantially faster.  You can find a lot of our benchmarking and discussion on the patch tracker:

https://www.rockbox.org/tracker/task/11461
https://www.rockbox.org/tracker/task/11445

Also, some benchmark results for various devices (mostly ARM and Coldfire):

https://www.rockbox.org/wiki/CodecPerformanceComparison

Re: FAAD2 optimization

Reply #2
Cool thank you! I'll have to study this.

Re: FAAD2 optimization

Reply #3
What is your microcontroller by the way?

Re: FAAD2 optimization

Reply #4
Quote: What is your microcontroller by the way?
It's a PIC32MZ. Not having much luck getting the execution time low enough to be workable. I'm at about 27ms/frame with -O3 optimization at 150MHz, and I need around 22ms absolute maximum, but even that might be slightly too high since it's doing other stuff simultaneously. I can crank the clock to 200MHz but then I'll have LCD problems, which I can deal with but uggghh.

I tried moving some functions to RAM but it made no difference, probably because those functions end up running out of cache anyway.

Helix AAC takes around 13ms with SBR, and without SBR around 2 or 3ms! It might be the PS that's killing it, or maybe Helix is just way more efficient.

Re: FAAD2 optimization

Reply #5
That should be quite a lot faster than many of the old ARM devices we had, and it looks like most (all?) of the PIC32MZ series have hardware fixed point instructions, which are extremely helpful.  Things to check:

1)  Do you have assembly enabled for all of the fixed point macros (MUL_R, ComplexMult, etc)?  If not, you need to fix that before doing anything else.  Since you have instructions for fixed point operations, that will probably make a very large difference.

2) Do you have SRAM on your chip?  If so, take a look at what data tables Rockbox puts in SRAM and make sure yours are there too.  Unless you have a very large dcache, they probably won't fit and you'll spend a lot of time waiting on memory.

And yes, standard libfaad is not at all well optimized for fixed point.  With moderate optimization, the Sansa Clip+ (ARM9E with fast RAM) needed about 25 MHz for real-time LC, 73 MHz for real-time HE and 92 MHz for real-time HE-PS.  Those could have been improved a lot, but it gives you a rough idea where your microcontroller should be since it's vaguely like the ARM9E.




Re: FAAD2 optimization

Reply #6
I just realized something from turning on Helix's SBR that might make a 2X difference! But first:

1) I do not. I looked at those for awhile, but didn't put much thought into it.
2) Yes, 512K on chip and I have a considerable amount free. Good advice. That's easy to do.

I'm actually running libfaad in floating point as the PIC has that in hardware. I tried integer (fixed point?) and it was slightly slower.

Ok, so back to my initial discovery. I've been using Helix AAC for around 10 years on a Cortex M3. SBR was not enabled because that alone required around 50K of RAM, which is over half of what that chip had. PCM buffers were 2048 values of 16 bits each. Now I'm running it on this PIC, with lots of RAM and more MIPS, and found out the hard way that with SBR enabled and some AAC streams, it's generating 4096 16-bit values. Needless to say this trampled lots of other variables and caused immediate crashes until I increased the PCM buffer size.  Still, Helix doesn't do PS, which is what I'm after.

But now I wonder....is libfaad2 producing 4096 bytes of PCM, or 4096 samples of PCM? Because if it's 4096 samples, then I'm probably already ok since this is now not 23-24ms audio, but double that.

All I've tested with libfaad2 so far is performance tests by timing how long it takes to decode a frame stored in an array, otherwise I'd obviously know the answer. There's little documentation for this library.

Re: FAAD2 optimization

Reply #7
Quote: I'm actually running libfaad in floating point as the PIC has that in hardware. I tried integer (fixed point?) and it was slightly slower.

Without ASM, most fixed point operations are going to be converted to 64 bit multiplies and shifts, which are (probably) being emulated in software, so if floating point isn't many times faster than emulated 64 bit operations, you have a very, very slow FPU.  What is the throughput for single precision multiplies per cycle on your system?  If it is a lot less than 1, you will want to use fixed point.
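To make the "64 bit multiplies and shifts" point concrete, here is a minimal sketch of a portable fixed-point multiply in C. The Q4.28 format, the name `fixmul_q28` and the helper `to_q28` are assumptions chosen for illustration; they are not FAAD2's actual macros, which define their own formats in the library source.

```c
#include <stdint.h>

/* Assumed format for this sketch: 28 fractional bits (Q4.28). */
#define Q28_FRAC_BITS 28

/* Convert a small real value to Q4.28 (hypothetical helper, for testing only). */
static inline int32_t to_q28(double x)
{
    return (int32_t)(x * (double)(1L << Q28_FRAC_BITS));
}

/* Portable fixed-point multiply: widen to 64 bits, multiply, add half an LSB
 * for rounding, then shift back down. On a 32-bit core the compiler has to
 * build the 64-bit product and shift from several instructions unless it can
 * map them onto a multiply-high or multiply-accumulate instruction.
 * Note: right-shifting a negative value is implementation-defined in C;
 * this sketch assumes the usual arithmetic shift. */
static inline int32_t fixmul_q28(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    wide += (int64_t)1 << (Q28_FRAC_BITS - 1);
    return (int32_t)(wide >> Q28_FRAC_BITS);
}
```

An architecture-specific version would replace the body with inline assembly or intrinsics, which is what enabling the ASM variants of the macros amounts to.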

On the other hand, if you have an FPU that is reasonably fast, you could use a better optimized decoder like ffmpeg.  The advantage of libfaad is that it supports integer (fixed point) and so does not need an FPU.

Quote: But now I wonder....is libfaad2 producing 4096 bytes of PCM, or 4096 samples of PCM? Because if it's 4096 samples, then I'm probably already ok since this is now not 23-24ms audio, but double that.

SBR files return double the number of samples, since the decoder reconstructs a PCM stream at double the sampling rate of the underlying LC stream (so if you decode a 22 kHz LC file with SBR, you will get a 44 kHz stream out).  I think all decoders are going to work like that unless the API is hiding the underlying stream details from you.
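The real-time budget follows directly from the sample count. Below is a rough sketch of that arithmetic; `frame_duration_us` is a hypothetical helper, and it assumes the decoder reports a sample count summed over all channels, which should be checked against the decoder's own header before relying on it.

```c
#include <stdint.h>

/* Duration of one decoded frame in microseconds (truncated).
 * total_samples:  samples in the frame summed over all channels (assumption)
 * channels:       number of output channels
 * sample_rate_hz: output sampling rate, i.e. after SBR doubling */
static uint32_t frame_duration_us(uint32_t total_samples,
                                  uint32_t channels,
                                  uint32_t sample_rate_hz)
{
    if (channels == 0 || sample_rate_hz == 0)
        return 0;  /* avoid dividing by zero on uninitialised frame info */

    uint64_t per_channel = total_samples / channels;
    return (uint32_t)((per_channel * 1000000ULL) / sample_rate_hz);
}
```

With these assumptions, a stereo LC frame of 2048 values at 44.1 kHz lasts about 23.2 ms, while a stereo SBR frame of 4096 values at 44.1 kHz lasts about 46.4 ms, so the decode-time budget doubles.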

Quote: All I've tested with libfaad2 so far is performance tests by timing how long it takes to decode a frame stored in an array, otherwise I'd obviously know the answer. There's little documentation for this library.

You should really be benchmarking an entire file (and checking that it is correctly decoded), since not all frames take the same amount of time to decode. 
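One way to benchmark a whole file is to accumulate per-frame timings and look at the worst case as well as the mean. The sketch below only does the bookkeeping; `decode_stats`, `stats_add` and the other names are hypothetical, and the elapsed time per frame is assumed to come from whatever cycle counter or timer the target provides.

```c
#include <stdint.h>

/* Running statistics over every frame of a file (hypothetical helper type). */
typedef struct {
    uint32_t frames;    /* frames measured so far */
    uint32_t worst_us;  /* slowest single frame */
    uint64_t total_us;  /* sum of all frame times */
    uint32_t overruns;  /* frames that took longer than their budget */
} decode_stats;

static void stats_init(decode_stats *s)
{
    s->frames = 0;
    s->worst_us = 0;
    s->total_us = 0;
    s->overruns = 0;
}

/* Record one frame: elapsed_us is the measured decode time,
 * budget_us is the playback duration of that frame. */
static void stats_add(decode_stats *s, uint32_t elapsed_us, uint32_t budget_us)
{
    s->frames++;
    s->total_us += elapsed_us;
    if (elapsed_us > s->worst_us)
        s->worst_us = elapsed_us;
    if (elapsed_us > budget_us)
        s->overruns++;
}

static uint32_t stats_mean_us(const decode_stats *s)
{
    return s->frames ? (uint32_t)(s->total_us / s->frames) : 0;
}
```

The worst-case figure matters more than the mean when the output buffer is only one or two frames deep, since a single slow frame is enough to cause an underrun.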

Re: FAAD2 optimization

Reply #8
Not sure of the throughput for multiplies, but apparently I've been ok all along performance-wise. I've had this assumption that AAC frames decode to 2048 samples because that's all I've known until now. Even without using any ASM defines, an SBR/PS stream requires 30 or 35ms to decode, and they last for, what, around 48ms. So I'm good there.

And as of now, it works in my project, streaming from Ethernet and all....sort of. One stream, for example, is 128K, doesn't use SBR or PS (according to VLC) and is playing correctly. Another is low bitrate, using SBR+PS (again according to VLC) and works great. A third is similar and also works. But there's one stream that I cannot get to work. It's SBR only, in stereo.

When I tried playing this stream, I kept getting unexplained CPU exception errors. Eventually I cranked the stack to nearly 100K from a previous 4K and that stopped, but now it's clear that malloc is failing in faad_malloc(). I have over 220K allocated to the heap and malloc is still returning 0 while attempting to play this stream. I noticed one function in particular has nearly 400 local float variables, so some stack is required, but this still doesn't seem normal. I haven't tried to check my stack utilization yet as my plan was to throw everything at it to get it working, then optimize.

Do you recall approximately what the RAM requirements were for this?

Re: FAAD2 optimization

Reply #9
If you look at our git logs, a huge fraction of the optimizations mention memory allocations.  Lot of pain there. 

MP4 parsing can require a lot of memory, especially for pathological files.  AAC w/ SBR also requires a lot, and FAAD is not well optimized here.  I would check whether your broken file has a pathological MP4 stream by repacking it using MP4Box's -ipod option.  If that fixes your memory problem, it was probably the MP4 container.

Re: FAAD2 optimization

Reply #10
At this point I know speed isn't a problem anymore, even without any ASM or any other optimizations. So that's a "hurdle" I'm over.

And the stream that caused issues simply didn't start with a 0xFFF header. I sort of assumed NeAACDecInit() scanned through the stream to synchronize and returned an offset, but I'm not seeing that in the code. Oddly enough, it never returns any other number but zero. I'm a little confused about that. So I'm using the old Helix function to do this, AACFindSyncWord(), and it took care of that. Cool.
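For reference, scanning for the sync pattern before handing data to the decoder can be sketched as below. This is not the Helix implementation; `find_adts_sync` is a hypothetical helper that assumes an ADTS stream, where a header starts with twelve 1 bits followed by a version bit and a two-bit layer field that is always zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the byte offset of the first candidate ADTS header in buf,
 * or -1 if none is found. A candidate is 0xFF followed by a byte whose
 * top four bits are set (completing the 12-bit syncword) and whose
 * layer bits (mask 0x06) are zero. */
static long find_adts_sync(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        if (buf[i] == 0xFF && (buf[i + 1] & 0xF6) == 0xF0)
            return (long)i;
    }
    return -1;
}
```

The same bit pattern can occur by chance inside compressed audio data, so a more careful version would also parse the frame length from the candidate header and confirm that another syncword follows at that distance.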

However, it's crazy how bonkers NeAACDecDecode() gets when fed a stream that doesn't start right on the beginning of a frame. In this case, it attempts to malloc so much RAM I'm not sure where it even ends as I can't free up enough. But in any case, at least I understand this now.

The issue I'm dealing with now is heap corruption where calling malloc even outside FAAD is causing exception faults, but only after FAAD has been used. After a couple days of fighting this I have a suspicion FAAD is causing it, and tracking this down is very difficult. I looked through the typedefs in common.h and those look ok. Malloc is such a black box and not easy to deal with when it doesn't work!