Phase 1: For this phase I tried to optimize wcschr by following the algorithm they used in strchr.S. I was able to load the characters in the vector registers bu the algorithm only in strchr.S did not carry over for 32 bit wide characters. I then decided to try my luck at the function ffs. This function is located in the string directory and returns the lowest significant bit that is set in a given integer. The function:
I ran a search on glibc and came up with these results:
finding the function in x86_64 and not the AARCH64 folder made me think ffs is an ideal candidate for optimization. First we look at x86_64’s optimization:
Basically one line of code containing an asm call with two assembly instructions. Bfsl is the instruction that looks for the lowest significant bit and cmovel is used to return -1 if 0 is passed in. Then a return + 1 gives us the correct position of the lowest bit.
To recreate this in AARCH64 assembler we have to use two instructions, rbit to reverse the bits then clz to count leading zeroes. AARCH64 doesn’t have an instruction to count trailing zeroes hence the need for rbit. My attempt at optimizing:
I then set up a tester, filled an array of 100 million random integers and passed each of them in a loop to the function. I timed the built in function and my made function. The results:
library built in ffs: 937 ms
my custom ffs: 973 ms
c implementation: 1034 ms
The results show that my in line assembler function actually performs slower (about 3.7%) than the library function which i assumed was a c program. That didn’t make sense to me so I pasted the code for the c implementation into my file and gave it a new name and timed it. I got the above time of 1034 ms, making the inline assembler almost 6 percent faster. I believe the key to this lies within the find command we ran earlier. I think the built in function/library function is calling arm/arvt62/ffs.S an optimized ffs written in assembly for using 32 bit registers. The next step is to try and rewrite ffs.S using 64 bit registers to work (potentially) better with aarch64.
Note: Tests were ran multiple times and averaged results to deal with fluctuations in results. All the programs were compiled with -std=gnu99 and that’s it. Functions were tested in separate programs. Functions were also tested for sizes as low as 1 million(where the run time difference was still measurable) and results were consistent.