Re: Lossless codec comparison - part 3: CDDA (May '22)
Reply #16 – 2022-05-29 15:10:29
"Regarding the speed I would like to make one remark: if the CPU of the test system really is an A4-5000, the most time-consuming function in the encoder is performing lots of SSE2/SSSE3 multiplications. This particular CPU can start one every 2 clock cycles. But in between I am using the instruction palignr, which usually is very fast but takes 19 clock cycles to execute on this CPU. And since the next multiplication has to wait for the result of palignr, you get only a fraction of the usual speed, about 1/3 to 1/2 I would estimate. This applies to this function only; overall the effect is considerably smaller. The decoder is not affected. On the other hand, maybe other codecs are similarly affected. Who knows. But the risk of distorted speed results seems to be bigger if a non-mainstream or outdated CPU is used."
I wouldn't say this one wasn't mainstream when it was on sale, but I see your point. The past two revisions were done with an AMD A4-3400; revision 2 was the last one on an Intel CPU.
Compare revision 2 with revision 3, that is, TAK 2.2.0 with TAK 2.3.0 and FLAC 1.2.1 with FLAC 1.3.0. According to the changelog, preset -p0 should decode 29% faster and encode 44% faster on TAK 2.3.0. FLAC 1.3.0 had no noteworthy speed improvements over 1.2.1. Taking FLAC -0 as the baseline, TAK -p0 encodes 22% faster and decodes 30% faster in revision 3 than in revision 2. So yes, it seems moving from the old Intel Core 2 Duo T9600 to an AMD A4-3400 gave TAK a penalty of 20% on encoding compared to FLAC, but no penalty on decoding.
Anyway, there is a bit of reasoning behind choosing this old, slow CPU. First, it makes timing easier. When timing FLAC or TAK decodes, one runs into the limitations of measuring CPU time in Windows on modern CPUs combined with short tracks. This is less of a problem on Linux, but not all codecs run there of course. Maybe I'll take a newer CPU next time and downclock it by a lot.
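To illustrate the timing point, here is a minimal sketch. It assumes the limitation is the coarse step size of the CPU-time counter, which makes a single short decode hard to measure. One common workaround is to repeat the work until the total is well above that step and then divide by the number of runs. `decode_once` is a hypothetical stand-in for a real decoder run, not anything from the comparison itself.

```python
import time


def observed_tick():
    """Smallest step by which the process CPU-time clock is seen to advance."""
    t0 = time.process_time()
    while True:
        t1 = time.process_time()
        if t1 != t0:
            return t1 - t0


def decode_once():
    """Hypothetical stand-in for decoding one short track."""
    return sum(i * i for i in range(20_000))


def time_repeated(work, min_total=0.2):
    """Call `work` until at least `min_total` CPU seconds have accumulated.

    Returns (mean CPU seconds per call, number of calls).
    """
    calls = 0
    start = time.process_time()
    while True:
        work()
        calls += 1
        elapsed = time.process_time() - start
        if elapsed >= min_total:
            return elapsed / calls, calls


if __name__ == "__main__":
    per_call, calls = time_repeated(decode_once)
    print(f"observed clock step: {observed_tick() * 1e3:.3f} ms")
    print(f"{calls} calls, {per_call * 1e3:.3f} ms CPU time per call")
```

If the observed clock step is of the same order as one decode, a single measurement is mostly noise. That is the situation a slower CPU or longer tracks avoid.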
Second, it lessens the advantage of using the latest CPU extensions like AVX2, FMA and BMI. I feel this comparison would be too x86-focussed if these were included, and especially with Apple moving to ARM and lots of playback devices using ARM hardware, it seems wrong to include AVX2 and beyond. NEON, which is available in a lot of ARM hardware, is AFAIK very comparable to the full SSE stack, with the added advantage of having many more registers. So, the A4-5000 is a CPU with all the SSE extensions, but its AVX implementation is crippled (read: is being split up into SSE by microcode).
With these limitations in mind, this A4-5000 was the only thing I had lying around somewhere. If you have recommendations for a next revision (if there ever will be one), let me know. I just wanted to show I really put some thought into this.
edit: BTW, with Microsoft also doing stuff on ARM, I wouldn't be surprised if a next revision would be better done on an ARM CPU. But then again, seeing lots of programs today still cling to 32-bit x86...
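The revision 2 vs revision 3 numbers earlier in this post can also be worked through as a ratio. This is a small sketch using only the percentages quoted there. It assumes "X% faster" means a speed ratio of 1 + X/100 and that FLAC's own speed did not change between versions. The function name is made up for illustration.

```python
def shortfall(expected_gain, observed_gain):
    """Fraction by which observed speed falls short of expected speed.

    Gains are fractional speedups, e.g. 0.44 for "44% faster".
    A negative result means the observed speed beat the expectation.
    """
    return 1.0 - (1.0 + observed_gain) / (1.0 + expected_gain)


# Changelog expectation vs. what the comparison measured relative to FLAC -0.
encode = shortfall(0.44, 0.22)
decode = shortfall(0.29, 0.30)

print(f"encode shortfall: {encode:.1%}")  # -> 15.3%
print(f"decode shortfall: {decode:.1%}")  # -> -0.8%
```

Read this way, the 20% figure is roughly the gap in percentage points (44 vs 22), while as a speed ratio the encoding shortfall is closer to 15%. Decoding shows essentially no shortfall either way.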