Continuing the optimizations and investigation from my last post, I made some further discoveries. I debugged my program that calls the glibc implementation of ffs to see which source code was actually being executed. Last time I thought the library was giving us a 32-bit ARM version, but in the end we were getting the C implementation, as shown in this picture:
I could not find the exact flags that the makefile uses to build ffs.c, but the closest clue I found was:
So I decided to compile my programs with -fno-builtin as the only flag.
This time I tested four different function calls:
Test1 – called ffs through installed glibc
Test2 – called inline AARCH64 assembler ffs implementation
Test3 – pasted the C implementation into a function and called it
Test4 – wrote a full assembler implementation of ffs in AARCH64 assembler
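To make the inline assembler variant more concrete, here is a minimal sketch of how such a function might look. It is not the code used in these tests: the names ffs_loop and ffs_inline are hypothetical, and the rbit/clz pairing is only one common way to find the lowest set bit on AARCH64. On other architectures the sketch falls back to a plain C loop so that it still compiles.

```c
#include <assert.h>
#include <limits.h>

/* The "w" registers used below are 32 bits wide, so this sketch assumes a 32-bit int. */
static_assert(sizeof(int) * CHAR_BIT == 32, "sketch assumes 32-bit int");

/* Portable reference: 1-based index of the lowest set bit, 0 if no bit is set. */
int ffs_loop(int value)
{
    unsigned int bits = (unsigned int)value;
    int position = 1;

    if (bits == 0)
        return 0;
    while ((bits & 1u) == 0) {
        bits >>= 1;
        position++;
    }
    return position;
}

/* Hypothetical inline-assembler variant; not the author's actual code. */
int ffs_inline(int value)
{
#if defined(__aarch64__) && defined(__GNUC__)
    unsigned int bits = (unsigned int)value;
    unsigned int zeros;

    if (bits == 0)
        return 0;               /* clz would report 32 here, so handle zero first */
    /* rbit reverses the bit order, so trailing zeros become leading zeros;
       clz then counts them. */
    __asm__("rbit %w0, %w1\n\t"
            "clz  %w0, %w0"
            : "=r"(zeros)
            : "r"(bits));
    return (int)zeros + 1;      /* ffs counts bit positions from 1 */
#else
    return ffs_loop(value);     /* fallback so the sketch builds elsewhere */
#endif
}
```

For example, ffs_inline(8) returns 4 because bit 3 is the lowest set bit and ffs numbers bits from 1.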
First, I tested the implementations and compared their actual results:
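A comparison of this kind could be sketched roughly as follows. This is an illustration, not the test program from this post: ffs_candidate is a hypothetical stand-in for any of the hand-written versions, and check_against_library is an assumed helper name. It compares the candidate against the library ffs from strings.h on a few edge cases and on every single-bit value.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <strings.h>    /* ffs() */

typedef int (*ffs_fn)(int);

/* Hypothetical stand-in for one of the hand-written implementations. */
int ffs_candidate(int value)
{
    unsigned int bits = (unsigned int)value;
    int position = 1;

    if (bits == 0)
        return 0;
    while ((bits & 1u) == 0) {
        bits >>= 1;
        position++;
    }
    return position;
}

/* Aborts through assert() on the first input where the candidate and the
   library ffs disagree. Building with -fno-builtin keeps the compiler from
   replacing the ffs call with its own builtin. */
void check_against_library(ffs_fn candidate)
{
    static const int edge_cases[] = {
        0, 1, 2, 3, 12, -1, -2, INT_MIN, INT_MAX, 0x100, 0x10000
    };
    size_t index;
    int bit;

    for (index = 0; index < sizeof edge_cases / sizeof edge_cases[0]; index++)
        assert(candidate(edge_cases[index]) == ffs(edge_cases[index]));

    /* Every value with exactly one bit set; assumes a 32-bit int. */
    for (bit = 0; bit < 32; bit++) {
        int value = (int)(1u << bit);
        assert(candidate(value) == ffs(value));
        assert(candidate(value) == bit + 1);
    }
}
```

Calling check_against_library(ffs_candidate) returns silently when every input agrees.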
Now that everything returns the same results we can test the speeds.
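For reference, a timing loop for tests like these could look roughly like the sketch below. It is an assumption about how such a measurement might be set up, not the harness behind the numbers in this post. The names time_calls and result_sink are hypothetical, and it measures CPU time with the standard clock() function.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>    /* ffs(), used in the usage note below */
#include <time.h>

typedef int (*ffs_fn)(int);

/* Writing every result to a volatile variable keeps the compiler from
   discarding the calls once optimizations are turned on. */
static volatile int result_sink;

/* Calls function `calls` times and returns the CPU time used, in seconds. */
double time_calls(ffs_fn function, unsigned long calls)
{
    clock_t start;
    clock_t stop;
    unsigned long i;

    assert(function != NULL);

    start = clock();
    for (i = 0; i < calls; i++)
        result_sink = function((int)(i & 0x7fffffffUL));  /* keep the argument non-negative */
    stop = clock();

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Something like time_calls(ffs, 100000000UL) would then time 100m calls of the library version. Going through a function pointer means every iteration is a real call, so the loop and call overhead are included in each measurement.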
Test 1 – glibc ffs call – 100m function calls:
Test 1 – glibc ffs call – 1b function calls:
Test 2 – inline assembler function – 100m calls:
Test 2 – inline assembler function – 1b calls:
Test 3 – copied c-impl (hard coded library call) – 100m calls:
Test 3 – copied c-impl (hard coded library call) – 1b calls:
Test 4 – assembler function – 100m calls:
Test 4 – assembler function – 1b calls:
The fastest implementation was the assembler function of ffs. At 100m function calls, the assembler function and the glibc function were tied. At 1 billion function calls there is a clear difference: the assembler function ran in about 3 seconds while the glibc function call ran in about 4 seconds, roughly a 25 percent reduction in run time. The inline assembler performed worse than those two but was still much faster than the hard-coded C implementation.
I believe the assembler implementation of ffs would noticeably speed up code that uses this function on AARCH64. I think it has the potential to be upstreamed in its current form.
If the compiler flags I used are not the same as the ones glibc is built with, all of my testing could turn out to be useless. Once compiler optimizations kick in, the performance of each function could change, and the relationship between them could change greatly. The fact that glibc's C implementation runs faster than my hard-coded copy of the same code could itself be due to compiler flags.