
Auto-vectorization with gcc 4.7

One way of speeding up code which contains many loops is to use vectorization. On modern machines, this means the use of SSE or AVX instructions. However, if one had to manually rewrite code in assembly language in order to do this, then very few programs would ever be vectorized. The reason is obvious: assembly language is error-prone and a difficult skill to master. It is also intrinsically non-portable. Fortunately, compiler writers have a few tricks up their sleeves to help us.

At the lowest level, gcc exposes compiler intrinsics that match one-to-one with the vector instructions on the CPU. Therefore, it is possible to write code using them in a higher-level language like C or C++, and still have nearly complete control over the output. Unfortunately, since these intrinsics are still very low-level, code written with them can be a little hard to read. The code is also not portable to machines that lack the instructions corresponding to the intrinsics used.
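
For example, a minimal sketch of an array addition written directly with SSE2 intrinsics from <emmintrin.h> might look like the following (the function name is ours, and it assumes 16-byte aligned arrays with an even element count):


#include <emmintrin.h>
#include <stddef.h>

/* Add b[] into a[], two doubles at a time, with SSE2 intrinsics.
   Assumes both pointers are 16-byte aligned and n is even. */
void add_intrin(double *a, const double *b, size_t n)
{
	size_t i;

	for (i = 0; i < n; i += 2)
	{
		__m128d va = _mm_load_pd(&a[i]);
		__m128d vb = _mm_load_pd(&b[i]);

		_mm_store_pd(&a[i], _mm_add_pd(va, vb));
	}
}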

gcc has another extension that helps with vectorization: vector types. It is possible to construct types that represent arrays (i.e. vectors) of smaller, more basic types. Then, code can use normal C (or C++) operations on those types. The result will be vectorized over the whole vector of basic types. In other words, you can add two vectors together to make a third, and gcc will understand exactly what you mean and use the instruction you want. More recent versions of gcc allow array accesses into the vector types, making them even more flexible and easy to use.
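
As a rough sketch (the type and function names are ours), the same addition written with gcc's vector types could look like this, again assuming 16-byte aligned arrays and an even element count:


#include <stddef.h>

/* A vector of two doubles (16 bytes), built with gcc's vector_size extension. */
typedef double v2df __attribute__((vector_size(16)));

/* Add b[] into a[], two doubles at a time, using vector types.
   Assumes 16-byte aligned pointers and an even element count. */
void add_vtype(double *a, const double *b, size_t n)
{
	v2df *va = (v2df *) a;
	const v2df *vb = (const v2df *) b;
	size_t i;

	for (i = 0; i < n / 2; i++)
	{
		va[i] += vb[i];	/* gcc can emit a single packed addpd here */
	}
}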

Vector types have a disadvantage, though: code needs to be rewritten in order to use them. Instead of passing around standard types, the non-portable compiler-specific ones need to be used instead. Also, since there are many vector operations that don't correspond to C operators, you may need to fall back to intrinsics in order to get the instructions you want.

At an even higher level is auto-vectorization. There, code isn't rewritten at all. It remains portable C, and the compiler automagically determines how to vectorize it. Ideally, no work needs to be done: one can simply recompile with a new compiler and get all the speed advantages of vectorization with very little effort. The big question, though, is how much code gcc can vectorize. If very few loops can be vectorized, then this feature isn't too useful. Conversely, if gcc is very smart, then the lower-level techniques aren't necessary any more.

Auto-vectorization

We use the most recent version of gcc at the time of writing (4.7) to test how well auto-vectorization works in practice. We will try a series of test routines with simple loops to see what gcc does with them. To start with, we will use a constant loop count. (It is possible to have a variable loop count, but in most code the trip count of the important inner loop doesn't change much.)


#include <stdlib.h>
#include <math.h>

#define SIZE	(1L << 16)

void test1(double *a, double *b)
{
	int i;

	for (i = 0; i < SIZE; i++)
	{
		a[i] += b[i];
	}
}

The first test simply adds one array of doubles onto another. We compile with -O3, which in modern gcc turns on auto-vectorization for us. We use no other optimization flags, so we get the compiler's default behaviour. The resulting routine in asm is:


<+0>:	  lea	 0x10(%rsi),%rax
<+4>:	  cmp	 %rax,%rdi
<+7>:	  jb	 0x47 <test1+71>
<+9>:	  xor	 %eax,%eax
<+11>:    nopl   0x0(%rax,%rax,1)
<+16>:    movsd  (%rdi,%rax,1),%xmm1
<+21>:    movsd  (%rsi,%rax,1),%xmm2
<+26>:    movhpd 0x8(%rdi,%rax,1),%xmm1
<+32>:    movhpd 0x8(%rsi,%rax,1),%xmm2
<+38>:    movapd %xmm1,%xmm0
<+42>:    addpd  %xmm2,%xmm0
<+46>:    movlpd %xmm0,(%rdi,%rax,1)
<+51>:    movhpd %xmm0,0x8(%rdi,%rax,1)
<+57>:    add	 $0x10,%rax
<+61>:    cmp	 $0x80000,%rax
<+67>:    jne	 0x10 <test1+16>
<+69>:    repz retq 
<+71>:    lea	 0x10(%rdi),%rax
<+75>:    cmp	 %rax,%rsi
<+78>:    jae	 0x9 <test1+9>
<+80>:    xor	 %eax,%eax
<+82>:    nopw   0x0(%rax,%rax,1)
<+88>:    movsd  (%rdi,%rax,1),%xmm0
<+93>:    addsd  (%rsi,%rax,1),%xmm0
<+98>:    movsd  %xmm0,(%rdi,%rax,1)
<+103>:   add	 $0x8,%rax
<+107>:   cmp	 $0x80000,%rax
<+113>:   jne	 0x58 <test1+88>
<+115>:   repz retq

The above code is interesting... but not particularly optimal. gcc first tests to see if the arrays partially overlap; if so, it jumps to a slow loop that does one addition at a time. If the arrays are not directly on top of each other, gcc then does a rather strange thing: it loads each double individually, stuffing them into %xmm1 and %xmm2. It then wastes time copying into %xmm0 before finally doing the packed addition. Once the addition is done, it moves the data out one double at a time.

The above can at best be called partially vectorized. The problem is that the compiler is constrained by what we tell it about the arrays. If we tell it more, then perhaps it can do more optimization. The most obvious thing is to inform the compiler that no overlap is possible. This is done in standard C by using the restrict qualifier for the pointers. The resulting code looks like:


void test2(double * restrict a, double * restrict b)
{
	int i;

	for (i = 0; i < SIZE; i++)
	{
		a[i] += b[i];
	}
}

This compiles into:


<+0>:	  mov	 %rdi,%rax
<+3>:	  push   %rbp
<+4>:	  shl	 $0x3c,%rax
<+8>:	  sar	 $0x3f,%rax
<+12>:    push   %rbx
<+13>:    mov	 %rax,%rdx
<+16>:    and	 $0x1,%edx
<+19>:    test   %rdx,%rdx
<+22>:    mov	 %edx,%ecx
<+24>:    je	 0x14c <test2+204>
<+30>:    movsd  (%rdi),%xmm0
<+34>:    mov	 $0xffff,%ebx
<+39>:    mov	 $0x1,%edx
<+44>:    addsd  (%rsi),%xmm0
<+48>:    movsd  %xmm0,(%rdi)
<+52>:    mov	 $0x10000,%r11d
<+58>:    and	 $0x1,%eax
<+61>:    sub	 %ecx,%r11d
<+64>:    mov	 %r11d,%r10d
<+67>:    shr	 %r10d
<+70>:    mov	 %r10d,%ebp
<+73>:    add	 %ebp,%ebp
<+75>:    je	 0x112 <test2+146>
<+77>:    lea	 0x0(,%rax,8),%r8
<+85>:    xor	 %ecx,%ecx
<+87>:    xor	 %eax,%eax
<+89>:    lea	 (%rdi,%r8,1),%r9
<+93>:    add	 %rsi,%r8
<+96>:    movsd  (%r8,%rax,1),%xmm1
<+102>:   add	 $0x1,%ecx
<+105>:   movapd (%r9,%rax,1),%xmm0
<+111>:   movhpd 0x8(%r8,%rax,1),%xmm1
<+118>:   addpd  %xmm1,%xmm0
<+122>:   movapd %xmm0,(%r9,%rax,1)
<+128>:   add	 $0x10,%rax
<+132>:   cmp	 %r10d,%ecx
<+135>:   jb	 0xe0 <test2+96>
<+137>:   add	 %ebp,%edx
<+139>:   sub	 %ebp,%ebx
<+141>:   cmp	 %ebp,%r11d
<+144>:   je	 0x149 <test2+201>
<+146>:   movslq %edx,%rcx
<+149>:   lea	 0x0(,%rcx,8),%rdx
<+157>:   lea	 (%rdi,%rdx,1),%rax
<+161>:   add	 %rsi,%rdx
<+164>:   lea	 -0x1(%rbx),%esi
<+167>:   add	 %rsi,%rcx
<+170>:   lea	 0x8(%rdi,%rcx,8),%rcx
<+175>:   nop
<+176>:   movsd  (%rax),%xmm0
<+180>:   addsd  (%rdx),%xmm0
<+184>:   add	 $0x8,%rdx
<+188>:   movsd  %xmm0,(%rax)
<+192>:   add	 $0x8,%rax
<+196>:   cmp	 %rcx,%rax
<+199>:   jne	 0x130 <test2+176>
<+201>:   pop	 %rbx
<+202>:   pop	 %rbp
<+203>:   retq   
<+204>:   mov	 $0x10000,%ebx
<+209>:   xor	 %edx,%edx
<+211>:   jmpq   0xb4 <test2+52>

The above is an enormous monstrosity. The problem here is alignment. gcc is not yet allowed to assume that the arrays are 16-byte aligned. This means that it has to go through all sorts of gymnastics to handle the cases where they are not. (Most SSE instructions do not work with unaligned accesses.) It also means that the inner loop above can't assume that both arrays are aligned. If gcc were smarter, it could test for the cases where the arrays are either both aligned or both unaligned, and use a fast inner loop. However, it doesn't do that currently.
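
As an aside, before we can promise alignment to the compiler, the caller has to actually provide it. A rough sketch of two common ways to do that, gcc's (non-standard) aligned attribute for static or stack arrays and posix_memalign() for heap allocations, follows:


#include <stdlib.h>

#define SIZE	(1L << 16)	/* as above */

/* A global array with guaranteed 16-byte alignment. */
double aligned_array[SIZE] __attribute__((aligned(16)));

/* A heap allocation with guaranteed 16-byte alignment.
   posix_memalign() returns zero on success. */
double *alloc_aligned(size_t n)
{
	void *p;

	if (posix_memalign(&p, 16, n * sizeof(double))) return NULL;

	return p;
}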

So in order to get the performance we are looking for, we need to tell gcc that the arrays are aligned. There are a couple of ways to do that. The first is to construct a (non-portable) aligned type, and use that in the function interface. The second is to add an intrinsic or two within the function itself. The second option is easier to implement on older code bases, as other functions calling the one to be vectorized do not have to be modified. The built-in function gcc provides for this is called __builtin_assume_aligned:


void test3(double * restrict a, double * restrict b)
{
	int i;

	__builtin_assume_aligned(a, 16);
	__builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		a[i] += b[i];
	}
}

Which produces:


<+0>:	  mov	 %rdi,%rax
<+3>:	  push   %rbp
<+4>:	  shl	 $0x3c,%rax
<+8>:	  sar	 $0x3f,%rax
<+12>:    push   %rbx
<+13>:    mov	 %rax,%rdx
<+16>:    and	 $0x1,%edx
<+19>:    test   %rdx,%rdx
<+22>:    mov	 %edx,%ecx
<+24>:    je	 0x22c <test3+204>
<+30>:    movsd  (%rdi),%xmm0
<+34>:    mov	 $0xffff,%ebx
<+39>:    mov	 $0x1,%edx
<+44>:    addsd  (%rsi),%xmm0
<+48>:    movsd  %xmm0,(%rdi)
<+52>:    mov	 $0x10000,%r11d
<+58>:    and	 $0x1,%eax
<+61>:    sub	 %ecx,%r11d
<+64>:    mov	 %r11d,%r10d
<+67>:    shr	 %r10d
<+70>:    mov	 %r10d,%ebp
<+73>:    add	 %ebp,%ebp
<+75>:    je	 0x1f2 <test3+146>
<+77>:    lea	 0x0(,%rax,8),%r8
<+85>:    xor	 %ecx,%ecx
<+87>:    xor	 %eax,%eax
<+89>:    lea	 (%rdi,%r8,1),%r9
<+93>:    add	 %rsi,%r8
<+96>:    movsd  (%r8,%rax,1),%xmm1
<+102>:   add	 $0x1,%ecx
<+105>:   movapd (%r9,%rax,1),%xmm0
<+111>:   movhpd 0x8(%r8,%rax,1),%xmm1
<+118>:   addpd  %xmm1,%xmm0
<+122>:   movapd %xmm0,(%r9,%rax,1)
<+128>:   add	 $0x10,%rax
<+132>:   cmp	 %r10d,%ecx
<+135>:   jb	 0x1c0 <test3+96>
<+137>:   add	 %ebp,%edx
<+139>:   sub	 %ebp,%ebx
<+141>:   cmp	 %ebp,%r11d
<+144>:   je	 0x229 <test3+201>
<+146>:   movslq %edx,%rcx
<+149>:   lea	 0x0(,%rcx,8),%rdx
<+157>:   lea	 (%rdi,%rdx,1),%rax
<+161>:   add	 %rsi,%rdx
<+164>:   lea	 -0x1(%rbx),%esi
<+167>:   add	 %rsi,%rcx
<+170>:   lea	 0x8(%rdi,%rcx,8),%rcx
<+175>:   nop
<+176>:   movsd  (%rax),%xmm0
<+180>:   addsd  (%rdx),%xmm0
<+184>:   add	 $0x8,%rdx
<+188>:   movsd  %xmm0,(%rax)
<+192>:   add	 $0x8,%rax
<+196>:   cmp	 %rcx,%rax
<+199>:   jne	 0x210 <test3+176>
<+201>:   pop	 %rbx
<+202>:   pop	 %rbp
<+203>:   retq   
<+204>:   mov	 $0x10000,%ebx
<+209>:   xor	 %edx,%edx
<+211>:   jmpq   0x194 <test3+52>

Which is unfortunately identical to the non-aligned version. It looks like gcc didn't understand what we were telling it. Let's be more explicit:


void test4(double * restrict a, double * restrict b)
{
	int i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		x[i] += y[i];
	}
}

This compiles to produce:


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movapd (%rdi,%rax,1),%xmm0
<+13>:    addpd  (%rsi,%rax,1),%xmm0
<+18>:    movapd %xmm0,(%rdi,%rax,1)
<+23>:    add	 $0x10,%rax
<+27>:    cmp	 $0x80000,%rax
<+33>:    jne	 0x248 <test4+8>
<+35>:    repz retq 

Now, finally, we get the nice tight vectorized code we were looking for. gcc has used packed SSE instructions to add two doubles at a time. It also manages to load and store two at a time, which it didn't do last time. Now that we understand what we need to tell the compiler, the question is how much more complex the loop can be before auto-vectorization fails.

Well, something that's more complex than a single loop is a doubly nested loop:


void test5(double * restrict a, double * restrict b)
{
	int i, j;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (j = 0; j < SIZE; j++)
	{
		for (i = 0; i < SIZE; i++)
		{
			x[i + j * SIZE] += y[i + j * SIZE];
		}
	}
}

The above uses a common code idiom where a matrix is represented by a block of contiguous memory. The explicit indexing is usually hidden within a macro in normal usage, but the above is what it would expand out into. Let's see how well gcc does:


<+0>:	  push   %r15
<+2>:	  movabs $0x800000000,%rax
<+12>:    lea	 0x8(%rdi),%r15
<+16>:    add	 %rdi,%rax
<+19>:    mov	 %rdi,%r10
<+22>:    push   %r14
<+24>:    mov	 %rsi,%r14
<+27>:    push   %r13
<+29>:    mov	 %rdi,%r13
<+32>:    push   %r12
<+34>:    xor	 %r12d,%r12d
<+37>:    push   %rbp
<+38>:    push   %rbx
<+39>:    mov	 %rax,-0x10(%rsp)
<+44>:    nopl   0x0(%rax)
<+48>:    mov	 %r10,%rax
<+51>:    shl	 $0x3c,%rax
<+55>:    sar	 $0x3f,%rax
<+59>:    mov	 %rax,%rdx
<+62>:    and	 $0x1,%edx
<+65>:    test   %rdx,%rdx
<+68>:    mov	 %edx,%ecx
<+70>:    je	 0x3a4 <test5+308>
<+76>:    movsd  (%r10),%xmm0
<+81>:    mov	 $0xffff,%ebx
<+86>:    mov	 $0x1,%r11d
<+92>:    addsd  (%rsi),%xmm0
<+96>:    movsd  %xmm0,(%r10)
<+101>:   mov	 $0x10000,%r9d
<+107>:   and	 $0x1,%eax
<+110>:   sub	 %ecx,%r9d
<+113>:   mov	 %r9d,%r8d
<+116>:   shr	 %r8d
<+119>:   mov	 %r8d,%ebp
<+122>:   add	 %ebp,%ebp
<+124>:   je	 0x337 <test5+199>
<+126>:   lea	 0x0(,%rax,8),%rcx
<+134>:   xor	 %edx,%edx
<+136>:   xor	 %eax,%eax
<+138>:   lea	 (%r10,%rcx,1),%rdi
<+142>:   add	 %rsi,%rcx
<+145>:   nopl   0x0(%rax)
<+152>:   movsd  (%rcx,%rax,1),%xmm1
<+157>:   add	 $0x1,%edx
<+160>:   movapd (%rdi,%rax,1),%xmm0
<+165>:   movhpd 0x8(%rcx,%rax,1),%xmm1
<+171>:   addpd  %xmm1,%xmm0
<+175>:   movapd %xmm0,(%rdi,%rax,1)
<+180>:   add	 $0x10,%rax
<+184>:   cmp	 %r8d,%edx
<+187>:   jb	 0x308 <test5+152>
<+189>:   add	 %ebp,%r11d
<+192>:   sub	 %ebp,%ebx
<+194>:   cmp	 %ebp,%r9d
<+197>:   je	 0x379 <test5+265>
<+199>:   movslq %r11d,%rcx
<+202>:   lea	 -0x1(%rbx),%edi
<+205>:   add	 %r12,%rcx
<+208>:   lea	 0x0(,%rcx,8),%rdx
<+216>:   add	 %rdi,%rcx
<+219>:   lea	 (%r15,%rcx,8),%rcx
<+223>:   lea	 0x0(%r13,%rdx,1),%rax
<+228>:   add	 %r14,%rdx
<+231>:   nopw   0x0(%rax,%rax,1)
<+240>:   movsd  (%rax),%xmm0
<+244>:   addsd  (%rdx),%xmm0
<+248>:   add	 $0x8,%rdx
<+252>:   movsd  %xmm0,(%rax)
<+256>:   add	 $0x8,%rax
<+260>:   cmp	 %rcx,%rax
<+263>:   jne	 0x360 <test5+240>
<+265>:   add	 $0x80000,%r10
<+272>:   add	 $0x80000,%rsi
<+279>:   add	 $0x10000,%r12
<+286>:   cmp	 -0x10(%rsp),%r10
<+291>:   jne	 0x2a0 <test5+48>
<+297>:   pop	 %rbx
<+298>:   pop	 %rbp
<+299>:   pop	 %r12
<+301>:   pop	 %r13
<+303>:   pop	 %r14
<+305>:   pop	 %r15
<+307>:   retq   
<+308>:   mov	 $0x10000,%ebx
<+313>:   xor	 %r11d,%r11d
<+316>:   jmpq   0x2d5 <test5+101>

Ouch! What a difference a single extra loop makes. gcc hasn't understood what we are trying to say. Perhaps the problem here is that the amount of memory accessed is larger than what an int can index. Let's use size_t as a hint that the two loop indices can be collapsed into one.


void test6(double * restrict a, double * restrict b)
{
	size_t i, j;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (j = 0; j < SIZE; j++)
	{
		for (i = 0; i < SIZE; i++)
		{
			x[i + j * SIZE] += y[i + j * SIZE];
		}
	}
}

<+0>:	  push   %r12
<+2>:	  mov	 %rsi,%r9
<+5>:	  mov	 $0x10000,%r12d
<+11>:    push   %rbp
<+12>:    push   %rbx
<+13>:    nopl   (%rax)
<+16>:    mov	 %rdi,%rcx
<+19>:    shl	 $0x3c,%rcx
<+23>:    shr	 $0x3f,%rcx
<+27>:    test   %rcx,%rcx
<+30>:    je	 0x496 <test6+214>
<+36>:    movsd  (%rdi),%xmm0
<+40>:    mov	 $0xffff,%ebx
<+45>:    mov	 $0x1,%eax
<+50>:    addsd  (%r9),%xmm0
<+55>:    movsd  %xmm0,(%rdi)
<+59>:    mov	 $0x10000,%r11d
<+65>:    sub	 %rcx,%r11
<+68>:    mov	 %r11,%r10
<+71>:    shr	 %r10
<+74>:    mov	 %r10,%rbp
<+77>:    add	 %rbp,%rbp
<+80>:    je	 0x45b <test6+155>
<+82>:    shl	 $0x3,%rcx
<+86>:    xor	 %edx,%edx
<+88>:    lea	 (%rdi,%rcx,1),%r8
<+92>:    lea	 (%r9,%rcx,1),%rsi
<+96>:    xor	 %ecx,%ecx
<+98>:    nopw   0x0(%rax,%rax,1)
<+104>:   movsd  (%rsi,%rdx,1),%xmm1
<+109>:   add	 $0x1,%rcx
<+113>:   movapd (%r8,%rdx,1),%xmm0
<+119>:   movhpd 0x8(%rsi,%rdx,1),%xmm1
<+125>:   addpd  %xmm1,%xmm0
<+129>:   movapd %xmm0,(%r8,%rdx,1)
<+135>:   add	 $0x10,%rdx
<+139>:   cmp	 %r10,%rcx
<+142>:   jb	 0x428 <test6+104>
<+144>:   add	 %rbp,%rax
<+147>:   sub	 %rbp,%rbx
<+150>:   cmp	 %rbp,%r11
<+153>:   je	 0x479 <test6+185>
<+155>:   lea	 (%rbx,%rax,1),%rdx
<+159>:   nop
<+160>:   movsd  (%rdi,%rax,8),%xmm0
<+165>:   addsd  (%r9,%rax,8),%xmm0
<+171>:   movsd  %xmm0,(%rdi,%rax,8)
<+176>:   add	 $0x1,%rax
<+180>:   cmp	 %rdx,%rax
<+183>:   jne	 0x460 <test6+160>
<+185>:   add	 $0x80000,%rdi
<+192>:   add	 $0x80000,%r9
<+199>:   sub	 $0x1,%r12
<+203>:   jne	 0x3d0 <test6+16>
<+209>:   pop	 %rbx
<+210>:   pop	 %rbp
<+211>:   pop	 %r12
<+213>:   retq   
<+214>:   mov	 $0x10000,%ebx
<+219>:   xor	 %eax,%eax
<+221>:   jmpq   0x3fb <test6+59>

The above is a bit better (200 instead of 300 bytes of instructions), but is still nowhere near where we want it to be. It looks like gcc doesn't understand that we are just iterating over the array in segments. Let's be more explicit that we touch the whole array:


void test7(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE * SIZE; i++)
	{
		x[i] += y[i];
	}
}

This accomplishes the exact same thing as the previous two functions, but uses one loop. This routine is very similar to test4(), and we would expect gcc to do a fairly good job:


<+0>:	  xor	 %eax,%eax
<+2>:	  movabs $0x800000000,%rdx
<+12>:    nopl   0x0(%rax)
<+16>:    movapd (%rdi,%rax,1),%xmm0
<+21>:    addpd  (%rsi,%rax,1),%xmm0
<+26>:    movapd %xmm0,(%rdi,%rax,1)
<+31>:    add	 $0x10,%rax
<+35>:    cmp	 %rdx,%rax
<+38>:    jne	 0x4c0 <test7+16>
<+40>:    repz retq 

Indeed, it does produce what we are looking for. So, a rule of thumb might be: two loops bad, one loop good.

We have used addition in the above because it is both a C operator and a vectorizable instruction. Let's see what happens when we try something that should vectorize, but doesn't directly correspond to a C operator. A simple example is the max() operation:


void test8(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		/* max() */
		if (y[i] > x[i]) x[i] = y[i];
	}
}

This produces:


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movsd  (%rsi,%rax,8),%xmm0
<+13>:    ucomisd (%rdi,%rax,8),%xmm0
<+18>:    jbe	 0x4f9 <test8+25>
<+20>:    movsd  %xmm0,(%rdi,%rax,8)
<+25>:    add	 $0x1,%rax
<+29>:    cmp	 $0x10000,%rax
<+35>:    jne	 0x4e8 <test8+8>
<+37>:    repz retq 

And there is no vectorization at all. What has gone wrong? The problem here is the C memory model. gcc isn't allowed to introduce extra writes to a memory location. If some other thread modifies x[] while the above routine is running, then those extra writes could stomp on its modifications. We need a way to tell gcc that using the vectorized max() instruction is okay. Fortunately, this isn't too hard: we just use an unconditional write:


void test9(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		/* max() */
		x[i] = ((y[i] > x[i]) ? y[i] : x[i]);
	}
}

Which produces:


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movapd (%rsi,%rax,1),%xmm0
<+13>:    maxpd  (%rdi,%rax,1),%xmm0
<+18>:    movapd %xmm0,(%rdi,%rax,1)
<+23>:    add	 $0x10,%rax
<+27>:    cmp	 $0x80000,%rax
<+33>:    jne	 0x518 <test9+8>
<+35>:    repz retq 

This code looks just like the vectorized code for addition, and we didn't have to use any explicit SSE intrinsics to do so. gcc has pattern-matched the operation and found the instruction we were looking for, which is very encouraging.

So what happens when we try something a little more complex, like a conditional add?


void test10(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		/* conditional add */
		x[i] = ((y[i] > x[i]) ? x[i] + y[i] : x[i]);
	}
}

This produces:


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movsd  (%rsi,%rax,8),%xmm1
<+13>:    movsd  (%rdi,%rax,8),%xmm0
<+18>:    ucomisd %xmm0,%xmm1
<+22>:    jbe	 0x55c <test10+28>
<+24>:    addsd  %xmm1,%xmm0
<+28>:    movsd  %xmm0,(%rdi,%rax,8)
<+33>:    add	 $0x1,%rax
<+37>:    cmp	 $0x10000,%rax
<+43>:    jne	 0x548 <test10+8>
<+45>:    repz retq 

Which is unfortunately not vectorized. Perhaps our expression is too complex, so let's try a simpler version:


void test11(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		/* conditional add */
		x[i] += ((y[i] > x[i]) ? y[i] : 0);
	}
}

This produces:


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movapd (%rdi,%rax,1),%xmm1
<+13>:    movapd (%rsi,%rax,1),%xmm2
<+18>:    movapd %xmm1,%xmm0
<+22>:    cmpltpd %xmm2,%xmm0
<+27>:    andpd  %xmm2,%xmm0
<+31>:    addpd  %xmm1,%xmm0
<+35>:    movapd %xmm0,(%rdi,%rax,1)
<+40>:    add	 $0x10,%rax
<+44>:    cmp	 $0x80000,%rax
<+50>:    jne	 0x578 <test11+8>
<+52>:    repz retq

So the form of the expression matters. In this case, a different arrangement of the same expression was vectorizable. In general, the simpler the expression, the greater the chance of auto-vectorization.

All of the above have had fixed-length loops. What happens if the loop exit condition is more complex:


void test12(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		/* early stop */
		if (x[i] < y[i]) break;

		x[i] += y[i];
	}
}

Here we can exit early if the (hopefully vectorizable) condition occurs. Unfortunately, gcc produces:


<+0>:	  xor	 %eax,%eax
<+2>:	  jmp	 0x5cd <test12+29>
<+4>:	  nopl   0x0(%rax)
<+8>:	  addsd  %xmm1,%xmm0
<+12>:    movsd  %xmm0,(%rdi,%rax,8)
<+17>:    add	 $0x1,%rax
<+21>:    cmp	 $0x10000,%rax
<+27>:    je	 0x5dd <test12+45>
<+29>:    movsd  (%rdi,%rax,8),%xmm1
<+34>:    movsd  (%rsi,%rax,8),%xmm0
<+39>:    ucomisd %xmm1,%xmm0
<+43>:    jbe	 0x5b8 <test12+8>
<+45>:    repz retq 

The above obviously isn't vectorized. For some reason gcc isn't using a packed compare followed by a move-mask to determine whether the loop can exit early, whilst maintaining the ability to do two adds at a time.
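
To make the missed opportunity concrete, here is a hedged sketch of what such a compare-plus-movemask version of test12() could look like when written by hand with SSE2 intrinsics (the function name is ours; the alignment and SIZE assumptions are the same as before):


#include <emmintrin.h>
#include <stddef.h>

/* Hand-vectorized sketch of test12(): a packed compare plus movemask
   detects the early-exit condition two elements at a time.
   Assumes 16-byte aligned pointers and the even SIZE used above. */
void test12_manual(double * restrict x, const double * restrict y)
{
	size_t i;

	for (i = 0; i < SIZE; i += 2)
	{
		__m128d vx = _mm_load_pd(&x[i]);
		__m128d vy = _mm_load_pd(&y[i]);

		/* A non-zero mask means the scalar loop would break within this pair. */
		if (_mm_movemask_pd(_mm_cmplt_pd(vx, vy)))
		{
			/* Finish the pair with scalar code so we stop at the right element. */
			if (!(x[i] < y[i])) x[i] += y[i];
			return;
		}

		_mm_store_pd(&x[i], _mm_add_pd(vx, vy));
	}
}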

In all the above, we have used simple indexing. Let's see what happens if we make it more complex:


void test13(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		x[i] = y[i] + y[i + 1];
	}
}

This compiles into:


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movsd  0x8(%rsi,%rax,1),%xmm1
<+14>:    movapd (%rsi,%rax,1),%xmm0
<+19>:    movhpd 0x10(%rsi,%rax,1),%xmm1
<+25>:    addpd  %xmm1,%xmm0
<+29>:    movapd %xmm0,(%rdi,%rax,1)
<+34>:    add	 $0x10,%rax
<+38>:    cmp	 $0x80000,%rax
<+44>:    jne	 0x5e8 <test13+8>
<+46>:    repz retq

This is only half-vectorized. Instead of using an unaligned load instruction, gcc has used two half-load instructions for the upper and lower halves of %xmm1. The unaligned load is probably better. Another possibility is to use pipelined aligned loads followed by shuffles.
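
For comparison, a hedged sketch of what test13() could look like written by hand with an explicit unaligned load (movupd) for the shifted y[i + 1] term (the function name is ours; the alignment and SIZE assumptions match test13()):


#include <emmintrin.h>
#include <stddef.h>

/* Sketch of test13() using one aligned and one unaligned load instead of
   gcc's two half-loads. Assumes 16-byte aligned x and y and an even SIZE. */
void test13_manual(double * restrict x, const double * restrict y)
{
	size_t i;

	for (i = 0; i < SIZE; i += 2)
	{
		__m128d lo = _mm_load_pd(&y[i]);	/* y[i], y[i+1] (aligned)     */
		__m128d hi = _mm_loadu_pd(&y[i + 1]);	/* y[i+1], y[i+2] (unaligned) */

		_mm_store_pd(&x[i], _mm_add_pd(lo, hi));
	}
}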

If we switch to an even offset, gcc no longer needs to do unaligned accesses, and switches back to full vectorization:


void test14(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		x[i] = y[i] + y[i + 2];
	}
}

giving


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movapd (%rsi,%rax,1),%xmm0
<+13>:    addpd  0x10(%rsi,%rax,1),%xmm0
<+19>:    movapd %xmm0,(%rdi,%rax,1)
<+24>:    add	 $0x10,%rax
<+28>:    cmp	 $0x80000,%rax
<+34>:    jne	 0x618 <test14+8>
<+36>:    repz retq 

So the quality of the code gcc produces can subtly depend on which index offsets are used. Confirming this is an even simpler routine:


void test15(double * restrict a, double * restrict b)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);

	for (i = 0; i < SIZE; i++)
	{
		x[i] = y[i + 1];
	}
}

which gives:


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movsd  0x8(%rsi,%rax,1),%xmm1
<+14>:    movhpd 0x10(%rsi,%rax,1),%xmm1
<+20>:    movapd %xmm1,(%rdi,%rax,1)
<+25>:    add	 $0x10,%rax
<+29>:    cmp	 $0x80000,%rax
<+35>:    jne	 0x648 <test15+8>
<+37>:    repz retq 

Again, the half-loads appear.

Everything so far has used two input arrays. What does gcc do with more?


void test16(double * restrict a, double * restrict b, double * restrict c)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);
	double *z = __builtin_assume_aligned(c, 16);

	for (i = 0; i < SIZE; i++)
	{
		x[i] = y[i] + z[i];
	}
}

Which when compiled, gives:


<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movapd (%rsi,%rax,1),%xmm0
<+13>:    addpd  (%rdx,%rax,1),%xmm0
<+18>:    movapd %xmm0,(%rdi,%rax,1)
<+23>:    add	 $0x10,%rax
<+27>:    cmp	 $0x80000,%rax
<+33>:    jne	 0x678 <test16+8>
<+35>:    repz retq 

This is good, fully vectorized code.

How about something more complex, like our max function implementation?


void test17(double * restrict a, double * restrict b, double * restrict c)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);
	double *z = __builtin_assume_aligned(c, 16);

	for (i = 0; i < SIZE; i++)
	{
		x[i] = ((y[i] > z[i]) ? y[i] : z[i]);
	}
}

<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movapd (%rsi,%rax,1),%xmm0
<+13>:    maxpd  (%rdx,%rax,1),%xmm0
<+18>:    movapd %xmm0,(%rdi,%rax,1)
<+23>:    add	 $0x10,%rax
<+27>:    cmp	 $0x80000,%rax
<+33>:    jne	 0x6a8 <test17+8>
<+35>:    repz retq 

Which again works nicely.

So, now that we have three arrays, let's use the destination within the expression on the right:


void test18(double * restrict a, double * restrict b, double * restrict c)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);
	double *z = __builtin_assume_aligned(c, 16);

	for (i = 0; i < SIZE; i++)
	{
		x[i] = ((y[i] > z[i]) ? x[i] : z[i]);
	}
}

<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movsd  (%rdx,%rax,8),%xmm0
<+13>:    movsd  (%rsi,%rax,8),%xmm1
<+18>:    ucomisd %xmm0,%xmm1
<+22>:    jbe	 0x6ed <test18+29>
<+24>:    movsd  (%rdi,%rax,8),%xmm0
<+29>:    movsd  %xmm0,(%rdi,%rax,8)
<+34>:    add	 $0x1,%rax
<+38>:    cmp	 $0x10000,%rax
<+44>:    jne	 0x6d8 <test18+8>
<+46>:    repz retq 

Oops, no vectorization here. It looks like a conditional store is a bit too complex. What if we make it a bit simpler, using an if statement instead:


void test19(double * restrict a, double * restrict b, double * restrict c)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double *y = __builtin_assume_aligned(b, 16);
	double *z = __builtin_assume_aligned(c, 16);

	for (i = 0; i < SIZE; i++)
	{
		if (y[i] <= z[i]) x[i] = z[i];
	}
}

<+0>:	  xor	 %eax,%eax
<+2>:	  nopw   0x0(%rax,%rax,1)
<+8>:	  movsd  (%rdx,%rax,8),%xmm0
<+13>:    ucomisd (%rsi,%rax,8),%xmm0
<+18>:    jb	 0x719 <test19+25>
<+20>:    movsd  %xmm0,(%rdi,%rax,8)
<+25>:    add	 $0x1,%rax
<+29>:    cmp	 $0x10000,%rax
<+35>:    jne	 0x708 <test19+8>
<+37>:    repz retq 

This is slightly tighter code, but it still isn't vectorized. We may need to go to intrinsics to get something like the output from test11().
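
If we do drop down to intrinsics, a hedged sketch of test18() using an SSE2 compare plus a bitwise and/andnot/or select (there is no packed blend instruction before SSE4.1) could look like the following; doing the same for test19() would reintroduce the unconditional writes discussed earlier. The function name is ours.


#include <emmintrin.h>
#include <stddef.h>

/* Hand-vectorized sketch of test18(): a packed compare builds a mask,
   and a bitwise select picks x or z per lane. Assumes 16-byte aligned
   pointers and the even SIZE used above. */
void test18_manual(double * restrict x, const double * restrict y,
		   const double * restrict z)
{
	size_t i;

	for (i = 0; i < SIZE; i += 2)
	{
		__m128d vx = _mm_load_pd(&x[i]);
		__m128d vy = _mm_load_pd(&y[i]);
		__m128d vz = _mm_load_pd(&z[i]);

		/* All-ones in lanes where y > z, all-zeros elsewhere. */
		__m128d mask = _mm_cmpgt_pd(vy, vz);

		/* Select x where the mask is set, z where it is not. */
		__m128d sel = _mm_or_pd(_mm_and_pd(mask, vx),
					_mm_andnot_pd(mask, vz));

		_mm_store_pd(&x[i], sel);
	}
}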

Finally, let's look at operations that scan over an array and reduce it to a single output. gcc should be able to vectorize these as well.


double test20(double * restrict a)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double y = -INFINITY;

	for (i = 0; i < SIZE; i++)
	{
		if (x[i] > y) y = x[i];
	}

	return y;
}

<+0>:	  movsd  0x0(%rip),%xmm0		# y
<+8>:	  xor	 %eax,%eax
<+10>:    nopw   0x0(%rax,%rax,1)
<+16>:    movsd  (%rdi,%rax,8),%xmm1
<+21>:    add	 $0x1,%rax
<+25>:    cmp	 $0x10000,%rax
<+31>:    maxsd  %xmm0,%xmm1
<+35>:    movapd %xmm1,%xmm0
<+39>:    jne	 0x740 <test20+16>
<+41>:    repz retq

Unfortunately, this global maximum operation isn't vectorized. Let's try a simpler function that calculates a global sum instead:


double test21(double * restrict a)
{
	size_t i;

	double *x = __builtin_assume_aligned(a, 16);
	double y = 0;

	for (i = 0; i < SIZE; i++)
	{
		y += x[i];
	}

	return y;
}

Which produces:


<+0>:	  xorpd  %xmm0,%xmm0
<+4>:	  xor	 %eax,%eax
<+6>:	  nopw   %cs:0x0(%rax,%rax,1)
<+16>:    addsd  (%rdi,%rax,8),%xmm0
<+21>:    add	 $0x1,%rax
<+25>:    cmp	 $0x10000,%rax
<+31>:    jne	 0x770 <test21+16>
<+33>:    repz retq 

Which also isn't vectorized. What is going on? The problem here is that gcc isn't allowed to re-order the operations we give it. Even though the maximum and addition operations are associative for real numbers, they aren't for floating-point numbers. (Consider what happens with signed zeros, or with rounding, for example.)
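
As a concrete illustration of the addition case (a small standalone example, not one of the test routines above), re-associating a floating-point sum can change the rounded result:


#include <stdio.h>

/* Floating-point addition is not associative: with IEEE 754 doubles,
   regrouping the same three values gives two different answers. */
int main(void)
{
	double a = 1e16, b = -1e16, c = 1.0;

	printf("%g\n", (a + b) + c);	/* prints 1 */
	printf("%g\n", a + (b + c));	/* prints 0 */

	return 0;
}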

We need to tell gcc that re-ordering operations is okay with us. To do this, we need to add another compile-time flag, "-ffast-math". If we do this, we find that the output for functions 20 and 21 is improved:


<+0>:	  lea	 0x80000(%rdi),%rax
<+7>:	  movapd 0x0(%rip),%xmm0		# 0x75f <test20+15>
<+15>:    nop
<+16>:    maxpd  (%rdi),%xmm0
<+20>:    add	 $0x10,%rdi
<+24>:    cmp	 %rax,%rdi
<+27>:    jne	 0x760 <test20+16>
<+29>:    movapd %xmm0,%xmm1
<+33>:    unpckhpd %xmm0,%xmm0
<+37>:    maxsd  %xmm1,%xmm0
<+41>:    retq 

gcc is very intelligent, and does a vectorized maximum calculation, followed by a horizontal maximum at the end.


<+0>:	  xorpd  %xmm1,%xmm1
<+4>:	  lea	 0x80000(%rdi),%rax
<+11>:    nopl   0x0(%rax,%rax,1)
<+16>:    addpd  (%rdi),%xmm1
<+20>:    add	 $0x10,%rdi
<+24>:    cmp	 %rax,%rdi
<+27>:    jne	 0x790 <test21+16>
<+29>:    movapd %xmm1,%xmm2
<+33>:    unpckhpd %xmm2,%xmm2
<+37>:    movapd %xmm2,%xmm0
<+41>:    addsd  %xmm1,%xmm0
<+45>:    retq 

And again gcc manages to vectorize the summation, even though it has to do a horizontal add at the end.

Summary

gcc is very good, and can auto-vectorize many inner loops. However, if the expressions get too complex, vectorization will fail. gcc also may not produce the optimal form of the loop kernel. In general, the simpler the code, the more likely gcc is to give good results.

However, you cannot expect gcc to produce the code you want without a few tweaks. You may need to add the "-ffast-math" flag to allow re-association. You will definitely need to tell the compiler about alignment and array-overlap considerations to get good code.

On the other hand, gcc will still attempt to vectorize code that hasn't been changed at all. It just won't be able to get nearly as much of a performance improvement as you might hope.

However, as time passes, more inner loop patterns will be added to the vectorizable list. Thus if you are using later versions of gcc, don't take the above results for granted. Check the output of the compiler yourself to see if it is behaving as you might expect. You might be pleasantly surprised by what it can do.

Comments

bobbyprani said...
I wonder how well ICC would vectorize this code. Last I heard, ICC was much better at auto-vectorization than GCC. I think I will do a comparison with ICC for the above cases.
pikachu said...
Just read software.intel.com/file/31848
Bdog said...
Rather than use -ffast-math, which sets a number of potentially dangerous options, it would be safer to use -fassociative-math to enable the specific optimization desired here.
ssam said...
Could it be considered a GCC bug that:
__builtin_assume_aligned(a, 16);
was not recognised, and you had to switch to:
double *x = __builtin_assume_aligned(a, 16);

Also, what are the practical implications of making that assumption? Do I have to do something special when I create the array?

(PS: your captcha is hard)
sfuerst said...
You may consider it a bug. The gcc maintainers may disagree. I haven't tried submitting a PR for it, though.

For 16-byte alignment: Anything allocated by malloc() and friends will already be 16-byte aligned due to the ABI. Arrays allocated statically or on the stack probably will not be 16-byte aligned. You will need to add a non-standard gcc alignment attribute for them.

The Captcha is unusual... but we get hundreds of spam attempts per day, and it filters out all of them.
Albert said...
Hi and thanks for this great article! What flags did you give gcc when compiling these examples? I couldn't get it to recognize the restrict keyword.

Then I tried your examples with icc and used:

#pragma vector always
#pragma vector aligned

before the loop and flags:

-restrict -O3 -march=corei7-avx -vec-report2

to icc. That worked and it vectorized nicely. But I had to manually edit the generated .s file a bit before I could feed it back to gcc. Basically what I did was remove some Intel stuff, add .cfi_startproc and .cfi_endproc, and change to the .p2align syntax, like the one gcc -O3 usually generates.

It would be nice to see a vectorization comparison between gcc and icc, as the first comment suggested. Keep up the good work!
hsroh said...
This is what Sun's suncc generates for test2:


test2:
.CG4:
.CG5:
        movl %edi,%edx
        andl $15,%edx
        movl %esi,%eax
        andl $15,%eax
        movl %edi,%ecx
        andl $7,%ecx
        xorl %edx,%eax
        xorl %r8d,%r8d
        orl %ecx,%eax
        je .L77000045.44
.L77000069.43:
        xorq %rax,%rax
        .align 16
.L77000057.52:
        prefetcht0 256(%rdi,%rax)
        movupd (%rdi,%rax),%xmm0
        movupd (%rsi,%rax),%xmm1
        addpd %xmm1,%xmm0
        movupd %xmm0,(%rdi,%rax)
        prefetcht0 272(%rsi,%rax)
        movupd 16(%rdi,%rax),%xmm0
        movupd 16(%rsi,%rax),%xmm1
        addpd %xmm1,%xmm0
        movupd %xmm0,16(%rdi,%rax)
        movupd 32(%rdi,%rax),%xmm0
        movupd 32(%rsi,%rax),%xmm1
        addpd %xmm1,%xmm0
        movupd %xmm0,32(%rdi,%rax)
        movupd 48(%rdi,%rax),%xmm0
        movupd 48(%rsi,%rax),%xmm1
        addpd %xmm1,%xmm0
        movupd %xmm0,48(%rdi,%rax)
        addq $64,%rax
        addl $8,%r8d
        cmpl $65534,%r8d
        jle .L77000057.52
.LX2.123:
        jmp .LE5.143
.L77000045.44:
        movl $16,%ecx
        subl %edx,%ecx
        sarl $3,%ecx
        andl $1,%ecx
        jle .LE0.120
.L77000071.46:
        addl $-1,%ecx
        movq $65535,%rax
        movslq %ecx,%rcx
        cmpq %rcx,%rax
        movq $0,%rdx
        cmovlq %rax,%rcx
        .zalign 16,8
.L77000049.48:
        movsd (%rdi,%rdx),%xmm0
        addsd (%rsi,%rdx),%xmm0
        movsd %xmm0,(%rdi,%rdx)
        addq $8,%rdx
        addl $1,%r8d
        cmpl %ecx,%r8d
        jle .L77000049.48
.LX0.119:
.LE0.120:
.L77000044.45:
        movl $65536,%ecx
        subl %r8d,%ecx
        leal 1(%rcx),%edx
        testl %ecx,%ecx
        cmovnsl %ecx,%edx
        sarl $1,%edx
        testl %edx,%edx
        jle .LE5.143
.L77000072.50:
        movl $-2147483647,%eax
        andl %ecx,%eax
        jns .CG3.144
        andl $1,%eax
        negl %eax
.CG3.144:
        subl %eax,%ecx
        addl %r8d,%ecx
        movslq %r8d,%rax
        movl %ecx,%edx
        subl %r8d,%edx
        addl $-2,%edx
        leal -2(%rcx),%r10d
        leal 1(%rdx),%r9d
        shlq $3,%rax
        testl %edx,%edx
        cmovnsl %edx,%r9d
        sarl $1,%r9d
        addl $1,%r9d
        cmpl %r10d,%r8d
        jle .CG2.134
        xorl %r9d,%r9d
.CG2.134:
        cmpl $4,%r9d
        jl .LU0.135
.LP0.139:
        addl $-8,%ecx
        .zalign 16,8
.L77000052.51:
        prefetcht0 256(%rdi,%rax)
        movapd (%rdi,%rax),%xmm0
        addpd (%rsi,%rax),%xmm0
        movapd %xmm0,(%rdi,%rax)
        prefetcht0 272(%rsi,%rax)
        movapd 16(%rdi,%rax),%xmm0
        addpd 16(%rsi,%rax),%xmm0
        movapd %xmm0,16(%rdi,%rax)
        movapd 32(%rdi,%rax),%xmm0
        addpd 32(%rsi,%rax),%xmm0
        movapd %xmm0,32(%rdi,%rax)
        movapd 48(%rdi,%rax),%xmm0
        addpd 48(%rsi,%rax),%xmm0
        movapd %xmm0,48(%rdi,%rax)
        addq $64,%rax
        addl $8,%r8d
        cmpl %ecx,%r8d
        jle .L77000052.51
.LX4.140:
.LE4.141:
        cmpl %r10d,%r8d
        jg .LE5.143
.LU0.135:
        .zalign 16,8
.LU1.136:
        movapd (%rdi,%rax),%xmm0
        addpd (%rsi,%rax),%xmm0
        movapd %xmm0,(%rdi,%rax)
        addq $16,%rax
        addl $2,%r8d
        cmpl %r10d,%r8d
        jle .LU1.136
.LX5.142:
.LE5.143:
.LX1.121:
.LE1.122:
.LE2.124:
.L77000046.49:
        cmpl $65535,%r8d
        jg .LE3.126
.L77000070.54:
        movslq %r8d,%rax
        leaq (,%rax,8),%rcx
        .zalign 16,8
.L77000054.55:
        movsd (%rdi,%rcx),%xmm0
        addsd (%rsi,%rcx),%xmm0
        movsd %xmm0,(%rdi,%rcx)
        addq $8,%rcx
        addq $1,%rax
        cmpq $65535,%rax
        jle .L77000054.55
.LX3.125:
.LE3.126:
.L77000034.53:
        repz
        ret

Yann said...
Thanks for your excellent article. It helps a lot when trying to write "standard" C code in a way which benefits from auto-vectorization when it's present.


Quick question: I don't understand the requirement for the "16-byte aligned" property.

Both SSE and AVX offer instructions to load unaligned memory segments.
Look at movdqu (SSE) or vmovdqu (AVX).

They are very good. Not only do they work properly on unaligned memory, they also transparently benefit from loading aligned memory (i.e. they run faster in such case).

Considering the gigantic flexibility these instructions offer, I would always use them instead of their "aligned" counterparts (movdqa, vmovdqa).

Not sure, from your example, if GCC is able to use them automatically (without intrinsics).
Nat said...
The compilation issues are due to the restrict keyword being C99-specific. Add the flag -std=c99 and it should compile just fine with restrict.

Great article! Thanks from somebody just getting started with vectorization!
James said...
Thank you for the clear details on what vectorization is. It's so nice that people take the time to write articles like this up for the community!