Phase 3

Phase 3 – upstream our changes.

To upstream our changes, glibc has a checklist to make sure they comply with the project's guidelines. They want an email with a subject line containing a patch number, a properly formatted ChangeLog entry, and for you to go over the coding guidelines and run the built-in test suite to ensure nothing has broken or regressed. It is also required that a patch file showing the changes you've made (basically a unified diff) is attached to the email.

The email subject line:

[PATCH] aarch64: Assembly implementation of ffs for aarch64 systems.

This is the unified diff I will attach in my patch email:
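The attached diff isn't reproduced here; as a sketch, one common way to produce a unified diff like this from a git checkout of glibc (the patch file name is just an example) is:

```shell
# Stage the new file, then write a unified diff suitable for attaching
# to the patch email (patch file name is illustrative).
git add sysdeps/aarch64/ffs.S
git diff --cached > aarch64-ffs.patch
```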


And this is the change log entry for the body of the email:

2017-04-23  Joshua Longhi  <>

* sysdeps/aarch64/ffs.S: Add AArch64 assembly implementation of string/ffs.c.
Performs around 25% faster than the C implementation.


Phase 2 – ffs.c

Continuing my optimizations and investigations from the last post, I was able to make some further discoveries. I debugged my program that was calling the glibc implementation of ffs and saw which source code was actually being executed. Last time I thought the library was giving us the 32-bit Arm version, but in the end we were getting the C implementation, as shown in this picture:


I could not find the exact flags that the makefile was configuring ffs.c with, but the closest clue I found was:


So I went ahead and decided to compile my programs with -fno-builtin as the only flag.

This time I tested four different function calls:

Test 1 – called ffs through the installed glibc

Test 2 – called an inline AArch64 assembler ffs implementation

Test 3 – pasted the C implementation into a function and called it

Test 4 – wrote a full AArch64 assembler implementation of ffs


First, testing the implementations and comparing their actual results:

Now that everything returns the same results we can test the speeds.

Test 1 – glibc ffs call – 100m function calls:


Test 1 – glibc ffs call – 1b function calls:


Test 2 – inline assembler function – 100m calls:


Test 2 – inline assembler function – 1b calls:


Test 3 – copied c-impl (hard coded library call) – 100m calls:


Test 3 – copied c-impl (hard coded library call) – 1b calls:


Test 4 – assembler function – 100m calls:


Test 4 – assembler function – 1b calls:



The fastest implementation was the assembler function of ffs. At 100m function calls the speed of the assembler function and the glibc function were tied. After testing 1 billion function calls we can see a clear difference: the assembler function ran in about 3 seconds while the glibc function call ran in about 4 seconds, roughly a 25 percent improvement. The inline assembler performed worse than the previous two but still much faster than the hard-coded implementation.

I believe the assembler implementation of ffs would greatly improve speeds on AArch64 wherever this function is used. The assembler implementation has the potential to be upstreamed in its current form.

If the compiler flags I used were not the same as glibc's, there is a potential for all of my testing to be completely useless. When compiler optimizations kick in, the functions' performance could vary and the relationship between them could change greatly. The fact that glibc's C implementation performs faster than the hard-coded one could itself be due to compiler flags.

Course Project Phase 1 – ffs.c

Phase 1: For this phase I first tried to optimize wcschr by following the algorithm used in strchr.S. I was able to load the characters into the vector registers, but the algorithm in strchr.S did not carry over to 32-bit wide characters. I then decided to try my luck with the function ffs. This function is located in the string directory and returns the position of the lowest set bit in a given integer. The function:


I ran a search on glibc and came up with these results:


Finding the function in the x86_64 folder but not the AARCH64 folder made me think ffs is an ideal candidate for optimization. First we look at x86_64's optimization:


Basically one line of code containing an asm call with two assembly instructions. bsfl is the instruction that finds the lowest set bit and cmovel is used to return -1 if 0 is passed in. Then a return + 1 gives us the correct 1-based position of the lowest bit.

To recreate this in AArch64 assembler we have to use two instructions: rbit to reverse the bits, then clz to count leading zeroes. AArch64 doesn't have an instruction to count trailing zeroes, hence the need for rbit. My attempt at optimizing:


I then set up a tester, filled an array with 100 million random integers, and passed each of them in a loop to the function. I timed the built-in function and my own function. The results:

library built in ffs: 937 ms

my custom ffs: 973 ms

c implementation: 1034 ms

The results show that my inline assembler function actually performs about 3.7% slower than the library function, which I had assumed was a C implementation. That didn't make sense to me, so I pasted the code of the C implementation into my file, gave it a new name, and timed it. I got the above time of 1034 ms, making the inline assembler almost 6 percent faster. I believe the key to this lies in the find command we ran earlier: I think the library function is calling sysdeps/arm/armv6t2/ffs.S, an optimized ffs written in assembly using 32-bit registers. The next step is to try to rewrite ffs.S using 64-bit registers to work (potentially) better with AArch64.

Note: Tests were run multiple times and the results averaged to deal with fluctuations. All the programs were compiled with -std=gnu99 and nothing else. Functions were tested in separate programs. Functions were also tested with sizes as low as 1 million (where the run-time difference was still measurable) and the results were consistent.

Tester code:


Course Project

After some initial testing I found that the cmp and binary search functions were not only hard to call but also did not already have a platform-specific optimization. When you download glibc there is a directory containing architecture-specific optimizations of certain functions. Comparing the sysdeps/x86_64 folder, with its 500 or so files, to sysdeps/aarch64's 100 files, I decided it would be best to choose a function that x86_64 had and aarch64 did not. I decided to switch my project to optimizing the wcschr function, a wide-character function that finds the first occurrence of a wide character in a wide string and returns a pointer to it.

Course Project – Phase 0

For this project we are going to be optimizing functions from glibc, the standard C library. The function I am choosing to optimize is mpn_cmp() in the stdlib cmp.c file. This function takes two low-level (multi-precision) integers and returns 1 if int1 > int2, 0 if they are equal, and -1 if int1 < int2. The core of the loop looks like:

for (i = size - 1; i >= 0; i--)
{
    op1_word = op1_ptr[i];
    op2_word = op2_ptr[i];
    if (op1_word != op2_word)
        goto diff;
}
return 0;


In this lab we are going to be looking at the vector capabilities of the arm64 processor and how they make it possible to speed up our code.

In this experiment we are going to take two 1000-element integer arrays, fill them with random numbers, sum them into a third array, and print the results. In the first attempt I filled the arrays with random numbers, summed them, and printed them all in the same loop. *Note that array3 is of type long and arrays 1 & 2 are of type int.

Compiling this code with -O3 (to turn on auto-vectorization) yielded no vectorized loop. Recompiling with a flag to see the compiler's vectorization attempts (-ftree-vectorizer-verbose=2) made it clear that the compiler could not vectorize this loop because of the function calls. I could confirm the loop was not vectorized because the verbose output told me so, and when I ran objdump -d on the ELF file I could see that no vector registers were used.

To get the compiler to vectorize the code I moved the filling of arrays 1 and 2 into a separate loop, and put the printf and the summation into separate loops as well. Compiling with the same flags shows a message that the compiler was able to vectorize the summation loop. Looking at the objdump -d output you can see vector registers being used. It looks like this:

ld1 {v1.4s}, [x21]
ld1 {v0.4s}, [x20]
add v0.4s, v1.4s, v0.4s

What this code is doing is pulling 4 32-bit values into v1 (v1.4s), pulling 4 32-bit values into v0, and then summing them in a single add, effectively summing 4 array elements with one instruction.

To vectorize our previous lab we could make array1 our sound samples, fill array2 (an array of identical size) with the volume scaling factor, and then reuse the code above but multiply instead of add. We could also try simply recompiling our existing code with -O3: since the scaling is the only operation in the loop, it is possible -O3 will suffice to vectorize the code the way we currently have it.

Lab 5

In this lab we are going to be testing two algorithms that scale 16-bit integer sound samples by a scaling factor. The first algorithm is straight multiplication, and the second is multiplication with bit shifting. Both algorithms will be tested on arm64 and x86 architectures.

The first algorithm we tested was straight multiplying. We took the volume and multiplied it by the sound sample in a loop and stored the new scaled sample in the same array. It looked like:

float volume = 1.0;
for (int i = 0; i < size; i++) {
    myArray[i] = myArray[i] * volume;
}

The second algorithm we tested was similar, but we converted the volume to a 32-bit integer, cast the array value to a 32-bit integer, multiplied the two values, then bit-shifted the result right by 16 bits to convert it back to a 16-bit integer and stored it. It looked like this:

float volume = 1.0;
int32_t result = volume * 65536;
for (int i = 0; i < size; i++) {
    myArray[i] = (int16_t)(((int32_t)myArray[i] * result) >> 16);
}

This approach has the theoretical advantage of speeding up the conversion with bit shifting. When both approaches were tested on 1 million samples they scaled the samples in about 4 seconds each. A difference in algorithm speed could be seen only when testing 100 million samples, and even then the difference was only a couple hundred ms in favour of the bit-shifting algorithm. The goal of this lab was to test whether the algorithms could keep up with scaling 44,100 samples per second on two channels. Both algorithms could scale 500 thousand samples in less than two seconds, safely allowing them to process the two 44.1k sound channels.

Testing optimization settings on the compiler greatly speed up the algorithms. Both of these tests were conducted without optimizing to avoid the risk of the compiler deleting code as it is not used later. However after performing initial tests I recompiled both algorithms with -03 and tested 10 million samples. Algorithm 1 and 2 both went from 43 down to 10 ms run time.