I don't understand how e.g. a single biquad working on continuous data can (efficiently) fit into the 128-bit or 256-bit SIMD registers of Intel (or similar ARM) instruction sets, where you (as I recall) typically have access to parallel float adds, multiplies, etc., but very little in terms of intra-register arithmetic. How do you exploit all of those 8 shiny single-cycle (?) floating-point multipliers if every single output sample depends on the previous one?
IIUC, you'd have 3 independent parallel multiply-adds per channel in that case: on x[n], x[n-1], and y[n-1]. For a stereo pair, that means you can fill six 32-bit MAC operations per sample. A typical ARM Cortex-A9 system has a 64-bit-wide multiply/accumulate path, or 128 bits on the newer Qualcomm parts. So for mobile applications you would be fine.
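For concreteness, the recurrence with those three products is the first-order difference equation y[n] = b0*x[n] + b1*x[n-1] - a1*y[n-1]. A minimal scalar sketch in Python (the coefficient values in the example call are placeholders, not from any particular design):

```python
def one_pole_one_zero(x, b0, b1, a1):
    """Direct-form I first-order filter:
        y[n] = b0*x[n] + b1*x[n-1] - a1*y[n-1]
    Three multiplies (multiply-accumulates) per output sample."""
    y = []
    x_prev = 0.0  # x[n-1], zero initial state
    y_prev = 0.0  # y[n-1], zero initial state
    for xn in x:
        yn = b0 * xn + b1 * x_prev - a1 * y_prev
        y.append(yn)
        x_prev, y_prev = xn, yn
    return y

# Identity coefficients (b0=1, b1=0, a1=0) just pass the input through.
print(one_pole_one_zero([1.0, 2.0, 3.0], 1.0, 0.0, 0.0))  # -> [1.0, 2.0, 3.0]
```

The y_prev term is the loop-carried dependency being discussed: y[n] cannot start until y[n-1] is known.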
However, I think that filter is a bit unrealistic. With plain scalar ARMv4 instructions, you'd be looking at about 1 MHz of CPU load per output channel at 44.1 kHz. Most likely you would either use a higher-order filter, or else simply not bother optimizing because the CPU time is too low to matter.
I am not disputing your claims as much as updating myself on IIR filters (I've been mostly into FIR filtering).
If we can normalize by b(1), assuming that the gain will be applied later on anyway and there is no numerical penalty:
N=1000;
x = rand(N,1);
L = 2;
[b,a] = butter(L,0.42);
%% calculate reference
y_ref = filter(b,a,x);
% normalize against b(1) (assume that gain is applied later on anyway)
b2 = b./b(1);
gain = b(1);
x2 = [zeros(L,1); x];
y2 = zeros(size(x2));
c = [b2(2:L+1)'; a(2:L+1)'];
for n=(L+1):N+L
buf = [x2(n-(1:L)); y2(n-(1:L))];
buf = buf .* c;
y2(n) = sum(buf(1:L)) - sum(buf(L+1:2*L));
y2(n) = y2(n) + x2(n);
end
y2 = y2(L+1:end);
y2 = gain*y2;
%% pass/fail
if(max(abs(y_ref-y2))<=1.5*eps)
disp('test passed')
else
disp('test failed')
end
I figure that my pseudo-implementation will have to:
1. Read 2 inputs and 2 previous outputs into one 4-element register
2. Do a 4-element dot product
3. Do 3 intra-register add/sub into scalar
4. Read 1 input
5. Do 1 add
6. Store scalar output
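A scalar emulation of those steps for L = 2 might look like the following, with the 4-element dot product and the horizontal reduction written out explicitly (the coefficient values in the example call are placeholders):

```python
def biquad_sample(xn, x1, x2, y1, y2, c):
    """One second-order output sample, mimicking the register layout above:
    buf = [x[n-1], x[n-2], y[n-1], y[n-2]], c = [b2(2), b2(3), a(2), a(3)]."""
    buf = [x1, x2, y1, y2]                           # step 1: gather into one "register"
    prod = [bi * ci for bi, ci in zip(buf, c)]       # step 2: 4-element multiply
    yn = (prod[0] + prod[1]) - (prod[2] + prod[3])   # step 3: horizontal add/sub to scalar
    return yn + xn                                   # steps 4-5: add x[n] (b2(1) == 1)

# All-zero coefficients: the output is just x[n].
c = [0.0, 0.0, 0.0, 0.0]
print(biquad_sample(5.0, 1.0, 2.0, 3.0, 4.0, c))  # -> 5.0
```

On AVX the step-3 reduction would map to shuffle/horizontal-add instructions, which is exactly the intra-register arithmetic the original question worried about.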
I don't see any potential to unroll the loop, as there is a dependency between iterations.
While this may well be "fast enough" for many audio applications, it seems far from the 8-fold speedup one might hope for on a 256-bit AVX CPU compared to a similarly clocked scalar CPU. There are also some cumbersome load and intra-register operations, unlike a streamlined FIR filter that can load neat "chunks" from memory, do tidy multiply-accumulates, and store output.
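To make the FIR contrast concrete: with no feedback, every output sample is an independent dot product, so a SIMD machine can compute 4 or 8 of them at once from contiguous chunks of x. A tiny scalar illustration (the taps are made-up moving-average values):

```python
def fir(h, x):
    """FIR: y[n] = sum_k h[k] * x[n-k].  Each y[n] depends only on inputs,
    never on previous outputs, so all samples can be computed in parallel."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

# Two-tap moving average (placeholder taps).
print(fir([0.5, 0.5], [1.0, 3.0, 5.0]))  # -> [0.5, 2.0, 4.0]
```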
-k