in Compilers

x86: Generated code optimizations and tricks

I’ve been playing with generated (native) code in order to test how compilers (specific or in general) translate a = a + 1 and a++, and found some other interesting optimizations. I used gcc compiler for this simple program:

int main() {
  int a = 0;
  a = a + 1;

  int b = 0;
  b++;

  return a + b;
}

Running the:

gcc -S test.cpp

We get the assembly source code of this program:

  .file "test.cpp"
  .text
.globl main
  .type main, @function
main:
.LFB0:
  .cfi_startproc
  .cfi_personality 0x3,__gxx_personality_v0
  pushq %rbp
  .cfi_def_cfa_offset 16
  .cfi_offset 6, -16
  movq  %rsp, %rbp
  .cfi_def_cfa_register 6
  movl  $0, -8(%rbp)
  addl  $1, -8(%rbp)
  movl  $0, -4(%rbp)
  addl  $1, -4(%rbp)
  movl  -4(%rbp), %eax
  movl  -8(%rbp), %edx
  leal  (%rdx,%rax), %eax
  leave
  .cfi_def_cfa 7, 8
  ret
  .cfi_endproc
.LFE0:
  .size main, .-main
  .ident  "GCC: (GNU) 4.4.7 20120313 (Red Hat 4.4.7-11)"
  .section  .note.GNU-stack,"",@progbits

(Note: gcc generated assembly code in AT&T syntax, that slighly differs from Intel syntax, which examples you may find usually in books)

Let’s strip all unrelated code (this generic initialization, known as prologue, and epilogue), and concentrate only on interesting parts, which directly correspond to our program:

movl  $0, -8(%rbp)
addl  $1, -8(%rbp)
movl  $0, -4(%rbp)
addl  $1, -4(%rbp)
movl  -4(%rbp), %eax
movl  -8(%rbp), %edx
leal  (%rdx,%rax), %eax

First thing to note: parameters are passed on the stack here (there could be optimizations when some parameters are passed in registers for faster access, and some are pushed on the stack, but in this generated code we see them on the stack). We know that when a function is called, a new stack frame (also known as Activation Record) is allocated and pushed on the stack. Parameters in this frame are laid out one after another, so technically we may reach every parameter by its offsets from the beginning of the frame.

We see this in the operands like: -8(%rbp). The %rbp register (called Base pointer, or Frame pointer sometimes) points to the beginning of the stack frame, and is initialized with the movq %rsp, %rbp instruction, where %rsp register (the Stack Pointer), points to the beginning of the newly allocated stack frame. It can change during execution, that’s why we save it to %rbp and then access parameters through it.

So the following instruction assigns value 0 to the variable a (moves value $0 to the parameter at offset 8 from the base pointer):

movl  $0, -8(%rbp)        ; int a = 0;

The next one clearly adds 1 to it:

addl  $1, -8(%rbp)        ; a = a + 1;

Having this in mind, we may clearly see now what the next two instructions do (basically the same, but with the b variable):

movl  $0, -4(%rbp)        ; int b = 0;
addl  $1, -4(%rbp)        ; b++;

And here first interesting observation: absolutely the same code is generated for the a = a + 1 and b++ (well, with different variables of course). A naïve compiler would probably be first loading current value of a, then adding 1 to it, and storing this temporary result in to the a again. And this is what would actually happen in case of a simple (e.g. AST-based) interpreter: we would need to resolve a, make addition, then resolve a again in the environment and scope chain, and store the result. But not in this case 🙂 In this case we see that it was efficiently optimized, the same as b++.

The final line of code we’re interested in is:

leal  (%rdx,%rax), %eax   ; return a + b;

What’s going on here? How the leal instruction is related to returning the sum of a and b? Why addl wasn’t used for addition as it was with a + 1? Or movl? Let’s clarify one thing at a time.

First, the return value by this convention is stored in the %eax register (known as Accumulator). Combined with the leave instruction, that restores stack pointer, and ret instruction, that actually returns — we get our return, and the caller will be able to find the returned value in the %eax.

Now the leal. The leal instruction stands for Load Effective Address, and when is used in user-level code, performs some operations on addresses. It can do different address arithmetics: addition, multiplication, etc. And compiler used this (in fact “unrelated” instruction) to perform optimization for addition of variables. It works on registers, hence the values from stack where moved to the %eax and %edx respectively:

movl  -4(%rbp), %eax
movl  -8(%rbp), %edx

In contrast with addl instruction that works with ALU module of CPU, leal in this case is much more optimized, since is related to working with memory addresses as a core operation. Also leal in contrast with addl allows addition with either two or three operands, and, that is important, to store the result in any register, not just one of the source operands as in case of addl. So the leal (%rdx,%rax), %eax will do %rdx + %rax, and saves it directly to the %eax which is the place of the return value, i.e. 2.

Notice also, that in the above experiment we haven’t even yet applied any optimization flags, like -O1, -O2. This is how the code will look like with the -O1 flag:

gcc -O1 -S test.cpp
movl  $2, %eax
ret

Semantics is preserved, we still get a correct value 2 as the result, however optimizing compiler actually completely eliminated all our code, and was able to infer correct final result as a direct constant. I leave other flags for you to experiment on bigger programs.

Have fun with the native code 😉

Write a Comment

Comment