Re: FLAC v1.4.x Performance Tests
Reply #281 – 2023-03-26 19:18:51
I don't know how to explain this in simple terms, but let's say that for each order up to and including 12, there is code optimized for that specific order. For orders above 12, there is generic code.

A compiler can optimize a loop much better if it knows in advance how many times that loop will run: it can 'unroll' the loop. In the generic code, the CPU has to check after each multiplication and addition whether it needs to do another one for this sample, or whether it can move on to the next sample. When a loop is unrolled, there is simply a run of multiplications and additions one after another before the next check.

So, generic code looks like this:

    repeat the following code for each sample {
        repeat the following code for each order {
            do multiplication
            do addition
        }
    }

In FLAC, this is unrolled for orders up to and including 12 to the following.

[...]

Use this code for order 2:

    repeat the following code for each sample {
        do multiplication
        do addition
        do multiplication
        do addition
    }

Use this code for order 3:

    repeat the following code for each sample {
        do multiplication
        do addition
        do multiplication
        do addition
        do multiplication
        do addition
    }

Use this code for order 4:

    repeat the following code for each sample {
        do multiplication
        do addition
        do multiplication
        do addition
        do multiplication
        do addition
        do multiplication
        do addition
    }

This is pretty much what happens for residual calculation, strictly up to order 12. This is the change you're seeing for the red line, because when using -p the residual calculation code dominates the execution time. Just look at the code here: https://github.com/xiph/flac/blob/master/src/libFLAC/lpc.c#L1101

For the blue line, the change between 15 and 16 is a little bit more complicated. This has to do with the autocorrelation calculation, which can be optimized in groups of 4, more or less. So, there is code for order below 8, below 12 and below 16.
You see this with the blue line, because when not using -p (or -e) the autocorrelation calculation dominates the execution time. Look at the code here: https://github.com/xiph/flac/blob/68f605bd281a37890ed696555a52c6180457164f/src/libFLAC/lpc.c#L158
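To make the residual-calculation part concrete, here is a rough C sketch of the generic loop next to a version unrolled for order 2, plus a small dispatcher that picks between them. This is only an illustration under simplifying assumptions, not the libFLAC code: the function names (residual_generic, residual_order2, residual_dispatch) and the parameter layout are made up for this example, and it assumes the first `order` samples of the buffer are warm-up samples that get no residual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic version: `order` is only known at run time, so the inner loop
 * condition is checked after every multiply-add. */
void residual_generic(const int32_t *x, size_t n, const int32_t *coeff,
                      size_t order, int shift, int32_t *res)
{
    for (size_t i = order; i < n; i++) {
        int32_t sum = 0;
        for (size_t j = 0; j < order; j++)
            sum += coeff[j] * x[i - j - 1];   /* multiplication, addition */
        res[i] = x[i] - (sum >> shift);       /* residual = sample - prediction */
    }
}

/* Unrolled for order 2: two multiply-adds in a row, with only the
 * per-sample loop check remaining. */
void residual_order2(const int32_t *x, size_t n, const int32_t *coeff,
                     int shift, int32_t *res)
{
    for (size_t i = 2; i < n; i++) {
        int32_t sum = 0;
        sum += coeff[0] * x[i - 1];
        sum += coeff[1] * x[i - 2];
        res[i] = x[i] - (sum >> shift);
    }
}

/* Hypothetical dispatcher: use the specialized code when one exists,
 * otherwise fall back to the generic loop. Only order 2 is specialized
 * in this sketch. */
void residual_dispatch(const int32_t *x, size_t n, const int32_t *coeff,
                       size_t order, int shift, int32_t *res)
{
    assert(shift >= 0 && shift < 32);
    switch (order) {
    case 2:
        residual_order2(x, n, coeff, shift, res);
        break;
    default:
        residual_generic(x, n, coeff, order, shift, res);
        break;
    }
}
```

Both functions compute the same residuals; the unrolled one just gives the compiler a fixed number of multiply-adds per sample to schedule. Having one such function per order is what makes the timing change as soon as the order goes past the last specialized case.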