Auto-vectorization with gcc 4.7
One way of speeding up code that contains many loops is to use vectorization. On modern machines, this means using SSE or AVX instructions. However, if one had to manually rewrite code into assembly language in order to do this, then very few programs would be. The reason is obvious: assembly language is error-prone and a difficult skill to master. It is also intrinsically non-portable. Fortunately, compiler writers have a few tricks up their sleeves to help us.
At the lowest level, gcc exposes compiler intrinsics that match one-to-one with the vector instructions on the CPU. It is therefore possible to write code using them in a higher-level language like C or C++, and still have nearly complete control over the output. Unfortunately, since these intrinsics are still very low-level, code written using them can be a little hard to read. The code is also not portable to machines that lack the instructions corresponding to the intrinsics used.
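As an illustration (not one of the tests below), a loop adding one array of doubles into another written with SSE2 intrinsics from <emmintrin.h> might look like the following sketch. The function name is made up, and it assumes 16-byte aligned pointers and an even element count:

#include <stddef.h>
#include <emmintrin.h>

/* Sketch: add b[] into a[], two doubles at a time, using SSE2 intrinsics. */
static void add_intrin(double *a, const double *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i += 2)
    {
        __m128d va = _mm_load_pd(&a[i]);   /* aligned load of a[i], a[i+1] */
        __m128d vb = _mm_load_pd(&b[i]);   /* aligned load of b[i], b[i+1] */

        _mm_store_pd(&a[i], _mm_add_pd(va, vb));   /* packed add, then store */
    }
}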
gcc has another extension that helps with vectorization: vector types. It is possible to construct types that represent arrays (i.e. vectors) of smaller, more basic types. Code can then use normal C (or C++) operations on those types, and the result will be vectorized over the whole vector of basic types. In other words, you can add two vectors together to make a third, and gcc will understand exactly what you mean and use the instruction you want. More recent versions of gcc allow array accesses into the vector types, making them even more flexible and easy to use.
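For example, the same addition could be written with a gcc vector type roughly like the following sketch. The typedef and function names are ours, and the usual caveats about alignment and aliasing apply:

#include <stddef.h>

/* gcc extension: a vector of two doubles occupying 16 bytes */
typedef double v2df __attribute__((vector_size(16)));

/* Sketch: add b[] into a[] by viewing both arrays as vectors of two doubles.
   Assumes 16-byte aligned data and an even element count. */
static void add_vector(double *a, const double *b, size_t n)
{
    v2df *va = (v2df *)a;
    const v2df *vb = (const v2df *)b;
    size_t i;

    for (i = 0; i < n / 2; i++)
    {
        va[i] += vb[i];   /* the addition becomes a packed addpd */
    }
}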
Vector types have a disadvantage, though: code needs to be rewritten in order to use them. Instead of passing around standard types, the non-portable compiler-specific ones need to be used instead. Also, since there are many vector operations that don't correspond to C operators, you may need to fall back to intrinsics in order to get the instructions you want.
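For instance, an elementwise max() has no C operator, so a helper mixing the vector type with an SSE2 intrinsic might look like this sketch. gcc allows converting between same-sized vector types, but that is itself compiler-specific:

#include <emmintrin.h>

typedef double v2df __attribute__((vector_size(16)));

/* Sketch: elementwise maximum of two vectors of doubles via the maxpd
   instruction, since no C operator maps to it. */
static inline v2df v2df_max(v2df x, v2df y)
{
    return (v2df)_mm_max_pd((__m128d)x, (__m128d)y);
}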
At an even higher level is auto-vectorization. There, code isn't rewritten at all. It remains portable C, and the compiler automagically determines how to vectorize it. Since ideally no work needs to be done, one can simply recompile with a new compiler and get all the speed advantages of vectorization with very little effort. The big question, though, is how much code can gcc vectorize? If very few loops can be, then this feature isn't too useful. Conversely, if gcc is very smart, then the lower-level techniques aren't necessary any more.
Auto-vectorization
We use the most recent version of gcc at the time of writing (4.7) to test how well auto-vectorization works in practice. We will try a series of test routines with simple loops to see what gcc does with them. To start with, we will use a constant loop count. (It is possible to have a variable loop count, but in most code that doesn't change too much within the important inner loop.)
#include <stdlib.h>
#include <math.h>

#define SIZE (1L << 16)

void test1(double *a, double *b)
{
    int i;

    for (i = 0; i < SIZE; i++)
    {
        a[i] += b[i];
    }
}
The first test simply adds an array of doubles onto a second array. We compile with -O3, which in modern gcc turns on auto-vectorization for us. We use no other optimization flags, so we get the compiler's default behaviour. The resulting routine in asm is:
<+0>: lea 0x10(%rsi),%rax
<+4>: cmp %rax,%rdi
<+7>: jb 0x47 <test1+71>
<+9>: xor %eax,%eax
<+11>: nopl 0x0(%rax,%rax,1)
<+16>: movsd (%rdi,%rax,1),%xmm1
<+21>: movsd (%rsi,%rax,1),%xmm2
<+26>: movhpd 0x8(%rdi,%rax,1),%xmm1
<+32>: movhpd 0x8(%rsi,%rax,1),%xmm2
<+38>: movapd %xmm1,%xmm0
<+42>: addpd %xmm2,%xmm0
<+46>: movlpd %xmm0,(%rdi,%rax,1)
<+51>: movhpd %xmm0,0x8(%rdi,%rax,1)
<+57>: add $0x10,%rax
<+61>: cmp $0x80000,%rax
<+67>: jne 0x10 <test1+16>
<+69>: repz retq
<+71>: lea 0x10(%rdi),%rax
<+75>: cmp %rax,%rsi
<+78>: jae 0x9 <test1+9>
<+80>: xor %eax,%eax
<+82>: nopw 0x0(%rax,%rax,1)
<+88>: movsd (%rdi,%rax,1),%xmm0
<+93>: addsd (%rsi,%rax,1),%xmm0
<+98>: movsd %xmm0,(%rdi,%rax,1)
<+103>: add $0x8,%rax
<+107>: cmp $0x80000,%rax
<+113>: jne 0x58 <test1+88>
<+115>: repz retq
The above code is interesting... but not particularly optimal. gcc first tests to see if the arrays partially overlap; if so, it jumps to a slow loop that does one addition at a time. If the arrays are not directly on top of each other, gcc then does a rather strange thing: it loads each double manually and stuffs them into %xmm1 and %xmm2. It then wastes time copying into %xmm0 before finally doing the packed addition. Once the addition is done, it moves the data out one double at a time.
The above can at best be called partially vectorized. The problem is that the compiler is constrained by what we tell it about the arrays. If we tell it more, then perhaps it can do more optimization. The most obvious thing is to inform the compiler that no overlap is possible. This is done in standard C by using the restrict qualifier for the pointers. The resulting code looks like:
void test2(double * restrict a, double * restrict b)
{
    int i;

    for (i = 0; i < SIZE; i++)
    {
        a[i] += b[i];
    }
}
This compiles into:
<+0>: mov %rdi,%rax
<+3>: push %rbp
<+4>: shl $0x3c,%rax
<+8>: sar $0x3f,%rax
<+12>: push %rbx
<+13>: mov %rax,%rdx
<+16>: and $0x1,%edx
<+19>: test %rdx,%rdx
<+22>: mov %edx,%ecx
<+24>: je 0x14c <test2+204>
<+30>: movsd (%rdi),%xmm0
<+34>: mov $0xffff,%ebx
<+39>: mov $0x1,%edx
<+44>: addsd (%rsi),%xmm0
<+48>: movsd %xmm0,(%rdi)
<+52>: mov $0x10000,%r11d
<+58>: and $0x1,%eax
<+61>: sub %ecx,%r11d
<+64>: mov %r11d,%r10d
<+67>: shr %r10d
<+70>: mov %r10d,%ebp
<+73>: add %ebp,%ebp
<+75>: je 0x112 <test2+146>
<+77>: lea 0x0(,%rax,8),%r8
<+85>: xor %ecx,%ecx
<+87>: xor %eax,%eax
<+89>: lea (%rdi,%r8,1),%r9
<+93>: add %rsi,%r8
<+96>: movsd (%r8,%rax,1),%xmm1
<+102>: add $0x1,%ecx
<+105>: movapd (%r9,%rax,1),%xmm0
<+111>: movhpd 0x8(%r8,%rax,1),%xmm1
<+118>: addpd %xmm1,%xmm0
<+122>: movapd %xmm0,(%r9,%rax,1)
<+128>: add $0x10,%rax
<+132>: cmp %r10d,%ecx
<+135>: jb 0xe0 <test2+96>
<+137>: add %ebp,%edx
<+139>: sub %ebp,%ebx
<+141>: cmp %ebp,%r11d
<+144>: je 0x149 <test2+201>
<+146>: movslq %edx,%rcx
<+149>: lea 0x0(,%rcx,8),%rdx
<+157>: lea (%rdi,%rdx,1),%rax
<+161>: add %rsi,%rdx
<+164>: lea -0x1(%rbx),%esi
<+167>: add %rsi,%rcx
<+170>: lea 0x8(%rdi,%rcx,8),%rcx
<+175>: nop
<+176>: movsd (%rax),%xmm0
<+180>: addsd (%rdx),%xmm0
<+184>: add $0x8,%rdx
<+188>: movsd %xmm0,(%rax)
<+192>: add $0x8,%rax
<+196>: cmp %rcx,%rax
<+199>: jne 0x130 <test2+176>
<+201>: pop %rbx
<+202>: pop %rbp
<+203>: retq
<+204>: mov $0x10000,%ebx
<+209>: xor %edx,%edx
<+211>: jmpq 0xb4 <test2+52>
The above is an enormous monstrosity. The problem here is alignment. gcc is not yet allowed to assume that the arrays are 16-byte aligned. This means that it has to go through all sorts of gymnastics to handle the cases which are not. (Most SSE instructions with memory operands fault on unaligned accesses; unaligned data requires separate, slower, load instructions.) It also means that the inner loop above can't assume that both arrays are aligned. If gcc were smart, it could test for the cases where the arrays are either both aligned or both unaligned, and have a fast inner loop. However, it doesn't do that currently.
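One way to make that promise, sketched here for reference, is the aligned-type route mentioned as the first option in the next paragraph. The typedef name is made up, and whether a given gcc version fully exploits it in the vectorizer is another matter:

/* Sketch: a double type whose objects are assumed 16-byte aligned (gcc extension) */
typedef double aligned_double __attribute__((aligned(16)));

void test_aligned_type(aligned_double * restrict a, aligned_double * restrict b);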
So in order to get the performance we are looking for, we need to tell gcc that the arrays are aligned. There are a couple of ways to do that. The first is to construct a (non-portable) aligned type, like the sketch above, and use that in the function interface. The second is to add an intrinsic or two within the function itself. The second option is easier to retrofit onto older code bases, as the other functions calling the one to be vectorized do not have to be modified. The intrinsic gcc has for this is called __builtin_assume_aligned:
void test3(double * restrict a, double * restrict b)
{
    int i;

    __builtin_assume_aligned(a, 16);
    __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        a[i] += b[i];
    }
}
Which produces:
<+0>: mov %rdi,%rax
<+3>: push %rbp
<+4>: shl $0x3c,%rax
<+8>: sar $0x3f,%rax
<+12>: push %rbx
<+13>: mov %rax,%rdx
<+16>: and $0x1,%edx
<+19>: test %rdx,%rdx
<+22>: mov %edx,%ecx
<+24>: je 0x22c <test3+204>
<+30>: movsd (%rdi),%xmm0
<+34>: mov $0xffff,%ebx
<+39>: mov $0x1,%edx
<+44>: addsd (%rsi),%xmm0
<+48>: movsd %xmm0,(%rdi)
<+52>: mov $0x10000,%r11d
<+58>: and $0x1,%eax
<+61>: sub %ecx,%r11d
<+64>: mov %r11d,%r10d
<+67>: shr %r10d
<+70>: mov %r10d,%ebp
<+73>: add %ebp,%ebp
<+75>: je 0x1f2 <test3+146>
<+77>: lea 0x0(,%rax,8),%r8
<+85>: xor %ecx,%ecx
<+87>: xor %eax,%eax
<+89>: lea (%rdi,%r8,1),%r9
<+93>: add %rsi,%r8
<+96>: movsd (%r8,%rax,1),%xmm1
<+102>: add $0x1,%ecx
<+105>: movapd (%r9,%rax,1),%xmm0
<+111>: movhpd 0x8(%r8,%rax,1),%xmm1
<+118>: addpd %xmm1,%xmm0
<+122>: movapd %xmm0,(%r9,%rax,1)
<+128>: add $0x10,%rax
<+132>: cmp %r10d,%ecx
<+135>: jb 0x1c0 <test3+96>
<+137>: add %ebp,%edx
<+139>: sub %ebp,%ebx
<+141>: cmp %ebp,%r11d
<+144>: je 0x229 <test3+201>
<+146>: movslq %edx,%rcx
<+149>: lea 0x0(,%rcx,8),%rdx
<+157>: lea (%rdi,%rdx,1),%rax
<+161>: add %rsi,%rdx
<+164>: lea -0x1(%rbx),%esi
<+167>: add %rsi,%rcx
<+170>: lea 0x8(%rdi,%rcx,8),%rcx
<+175>: nop
<+176>: movsd (%rax),%xmm0
<+180>: addsd (%rdx),%xmm0
<+184>: add $0x8,%rdx
<+188>: movsd %xmm0,(%rax)
<+192>: add $0x8,%rax
<+196>: cmp %rcx,%rax
<+199>: jne 0x210 <test3+176>
<+201>: pop %rbx
<+202>: pop %rbp
<+203>: retq
<+204>: mov $0x10000,%ebx
<+209>: xor %edx,%edx
<+211>: jmpq 0x194 <test3+52>
Which is unfortunately identical to the non-aligned version. It looks like gcc didn't understand what we were telling it. Let's be more explicit:
void test4(double * restrict a, double * restrict b)
{
    int i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        x[i] += y[i];
    }
}
This compiles to produce:
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movapd (%rdi,%rax,1),%xmm0
<+13>: addpd (%rsi,%rax,1),%xmm0
<+18>: movapd %xmm0,(%rdi,%rax,1)
<+23>: add $0x10,%rax
<+27>: cmp $0x80000,%rax
<+33>: jne 0x248 <test4+8>
<+35>: repz retq
Now, finally, we get the nice tight vectorized code we were looking for. gcc has used packed SSE instructions to add two doubles at a time. It also manages to load and store two at a time, which it didn't manage last time. The question now is: once we know what we need to tell the compiler, how much more complex can the loop be before auto-vectorization fails?
Well, something that's more complex than a single loop is a doubly nested loop:
void test5(double * restrict a, double * restrict b)
{
    int i, j;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (j = 0; j < SIZE; j++)
    {
        for (i = 0; i < SIZE; i++)
        {
            x[i + j * SIZE] += y[i + j * SIZE];
        }
    }
}
The above uses a common code idiom where a matrix is represented by a block of contiguous memory. The explicit indexing is usually hidden within a macro in normal usage, but the above is what it would expand into. Let's see how well gcc does:
<+0>: push %r15
<+2>: movabs $0x800000000,%rax
<+12>: lea 0x8(%rdi),%r15
<+16>: add %rdi,%rax
<+19>: mov %rdi,%r10
<+22>: push %r14
<+24>: mov %rsi,%r14
<+27>: push %r13
<+29>: mov %rdi,%r13
<+32>: push %r12
<+34>: xor %r12d,%r12d
<+37>: push %rbp
<+38>: push %rbx
<+39>: mov %rax,-0x10(%rsp)
<+44>: nopl 0x0(%rax)
<+48>: mov %r10,%rax
<+51>: shl $0x3c,%rax
<+55>: sar $0x3f,%rax
<+59>: mov %rax,%rdx
<+62>: and $0x1,%edx
<+65>: test %rdx,%rdx
<+68>: mov %edx,%ecx
<+70>: je 0x3a4 <test5+308>
<+76>: movsd (%r10),%xmm0
<+81>: mov $0xffff,%ebx
<+86>: mov $0x1,%r11d
<+92>: addsd (%rsi),%xmm0
<+96>: movsd %xmm0,(%r10)
<+101>: mov $0x10000,%r9d
<+107>: and $0x1,%eax
<+110>: sub %ecx,%r9d
<+113>: mov %r9d,%r8d
<+116>: shr %r8d
<+119>: mov %r8d,%ebp
<+122>: add %ebp,%ebp
<+124>: je 0x337 <test5+199>
<+126>: lea 0x0(,%rax,8),%rcx
<+134>: xor %edx,%edx
<+136>: xor %eax,%eax
<+138>: lea (%r10,%rcx,1),%rdi
<+142>: add %rsi,%rcx
<+145>: nopl 0x0(%rax)
<+152>: movsd (%rcx,%rax,1),%xmm1
<+157>: add $0x1,%edx
<+160>: movapd (%rdi,%rax,1),%xmm0
<+165>: movhpd 0x8(%rcx,%rax,1),%xmm1
<+171>: addpd %xmm1,%xmm0
<+175>: movapd %xmm0,(%rdi,%rax,1)
<+180>: add $0x10,%rax
<+184>: cmp %r8d,%edx
<+187>: jb 0x308 <test5+152>
<+189>: add %ebp,%r11d
<+192>: sub %ebp,%ebx
<+194>: cmp %ebp,%r9d
<+197>: je 0x379 <test5+265>
<+199>: movslq %r11d,%rcx
<+202>: lea -0x1(%rbx),%edi
<+205>: add %r12,%rcx
<+208>: lea 0x0(,%rcx,8),%rdx
<+216>: add %rdi,%rcx
<+219>: lea (%r15,%rcx,8),%rcx
<+223>: lea 0x0(%r13,%rdx,1),%rax
<+228>: add %r14,%rdx
<+231>: nopw 0x0(%rax,%rax,1)
<+240>: movsd (%rax),%xmm0
<+244>: addsd (%rdx),%xmm0
<+248>: add $0x8,%rdx
<+252>: movsd %xmm0,(%rax)
<+256>: add $0x8,%rax
<+260>: cmp %rcx,%rax
<+263>: jne 0x360 <test5+240>
<+265>: add $0x80000,%r10
<+272>: add $0x80000,%rsi
<+279>: add $0x10000,%r12
<+286>: cmp -0x10(%rsp),%r10
<+291>: jne 0x2a0 <test5+48>
<+297>: pop %rbx
<+298>: pop %rbp
<+299>: pop %r12
<+301>: pop %r13
<+303>: pop %r14
<+305>: pop %r15
<+307>: retq
<+308>: mov $0x10000,%ebx
<+313>: xor %r11d,%r11d
<+316>: jmpq 0x2d5 <test5+101>
Ouch! What a difference a single extra loop makes. gcc hasn't understood what we are trying to say. Perhaps the problem here is that the amount of memory accessed is larger than an int can index. Let's use size_t as a hint that the two loop indices can be collapsed into one.
void test6(double * restrict a, double * restrict b)
{
    size_t i, j;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (j = 0; j < SIZE; j++)
    {
        for (i = 0; i < SIZE; i++)
        {
            x[i + j * SIZE] += y[i + j * SIZE];
        }
    }
}
<+0>: push %r12
<+2>: mov %rsi,%r9
<+5>: mov $0x10000,%r12d
<+11>: push %rbp
<+12>: push %rbx
<+13>: nopl (%rax)
<+16>: mov %rdi,%rcx
<+19>: shl $0x3c,%rcx
<+23>: shr $0x3f,%rcx
<+27>: test %rcx,%rcx
<+30>: je 0x496 <test6+214>
<+36>: movsd (%rdi),%xmm0
<+40>: mov $0xffff,%ebx
<+45>: mov $0x1,%eax
<+50>: addsd (%r9),%xmm0
<+55>: movsd %xmm0,(%rdi)
<+59>: mov $0x10000,%r11d
<+65>: sub %rcx,%r11
<+68>: mov %r11,%r10
<+71>: shr %r10
<+74>: mov %r10,%rbp
<+77>: add %rbp,%rbp
<+80>: je 0x45b <test6+155>
<+82>: shl $0x3,%rcx
<+86>: xor %edx,%edx
<+88>: lea (%rdi,%rcx,1),%r8
<+92>: lea (%r9,%rcx,1),%rsi
<+96>: xor %ecx,%ecx
<+98>: nopw 0x0(%rax,%rax,1)
<+104>: movsd (%rsi,%rdx,1),%xmm1
<+109>: add $0x1,%rcx
<+113>: movapd (%r8,%rdx,1),%xmm0
<+119>: movhpd 0x8(%rsi,%rdx,1),%xmm1
<+125>: addpd %xmm1,%xmm0
<+129>: movapd %xmm0,(%r8,%rdx,1)
<+135>: add $0x10,%rdx
<+139>: cmp %r10,%rcx
<+142>: jb 0x428 <test6+104>
<+144>: add %rbp,%rax
<+147>: sub %rbp,%rbx
<+150>: cmp %rbp,%r11
<+153>: je 0x479 <test6+185>
<+155>: lea (%rbx,%rax,1),%rdx
<+159>: nop
<+160>: movsd (%rdi,%rax,8),%xmm0
<+165>: addsd (%r9,%rax,8),%xmm0
<+171>: movsd %xmm0,(%rdi,%rax,8)
<+176>: add $0x1,%rax
<+180>: cmp %rdx,%rax
<+183>: jne 0x460 <test6+160>
<+185>: add $0x80000,%rdi
<+192>: add $0x80000,%r9
<+199>: sub $0x1,%r12
<+203>: jne 0x3d0 <test6+16>
<+209>: pop %rbx
<+210>: pop %rbp
<+211>: pop %r12
<+213>: retq
<+214>: mov $0x10000,%ebx
<+219>: xor %eax,%eax
<+221>: jmpq 0x3fb <test6+59>
The above is a bit better (roughly 225 instead of 320 bytes of instructions), but is still nowhere near where we want it to be. It looks like gcc doesn't understand that we are just iterating over the array in segments. Let's be more explicit that we touch the whole array:
void test7(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE * SIZE; i++)
    {
        x[i] += y[i];
    }
}
This accomplishes the exact same thing as the previous two functions, but uses one loop. This routine is very similar to test4(), and we would expect gcc to do a fairly good job:
<+0>: xor %eax,%eax
<+2>: movabs $0x800000000,%rdx
<+12>: nopl 0x0(%rax)
<+16>: movapd (%rdi,%rax,1),%xmm0
<+21>: addpd (%rsi,%rax,1),%xmm0
<+26>: movapd %xmm0,(%rdi,%rax,1)
<+31>: add $0x10,%rax
<+35>: cmp %rdx,%rax
<+38>: jne 0x4c0 <test7+16>
<+40>: repz retq
Indeed, it does produce what we are looking for. So a rule of thumb might be: two loops bad, one loop good.
We have used addition in the above because it is both a C operator and a vectorizable instruction. Let's see what happens when we try something that should vectorize, but doesn't directly correspond to a C operator. A simple example is the max() operation:
void test8(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        /* max() */
        if (y[i] > x[i]) x[i] = y[i];
    }
}
This produces:
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movsd (%rsi,%rax,8),%xmm0
<+13>: ucomisd (%rdi,%rax,8),%xmm0
<+18>: jbe 0x4f9 <test8+25>
<+20>: movsd %xmm0,(%rdi,%rax,8)
<+25>: add $0x1,%rax
<+29>: cmp $0x10000,%rax
<+35>: jne 0x4e8 <test8+8>
<+37>: repz retq
And there is no vectorization at all. What has gone wrong? The problem here is the C memory model. gcc isn't allowed to introduce extra writes to a memory location. If some other thread modifies x[] while the above routine runs, those extra writes could stomp on its modifications. We need a way to tell gcc that using the vectorized max() instruction is okay. Fortunately, this isn't too hard: we just use an unconditional write:
void test9(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        /* max() */
        x[i] = ((y[i] > x[i]) ? y[i] : x[i]);
    }
}
Which produces:
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movapd (%rsi,%rax,1),%xmm0
<+13>: maxpd (%rdi,%rax,1),%xmm0
<+18>: movapd %xmm0,(%rdi,%rax,1)
<+23>: add $0x10,%rax
<+27>: cmp $0x80000,%rax
<+33>: jne 0x518 <test9+8>
<+35>: repz retq
This looks just like the vectorized code for addition, and we didn't have to use any explicit SSE intrinsics to get it. gcc has pattern-matched the operation and found the instruction we are looking for, which is very encouraging.
So what happens when we try something a little more complex, like a conditional add?
void test10(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        /* conditional add */
        x[i] = ((y[i] > x[i]) ? x[i] + y[i] : x[i]);
    }
}
This produces:
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movsd (%rsi,%rax,8),%xmm1
<+13>: movsd (%rdi,%rax,8),%xmm0
<+18>: ucomisd %xmm0,%xmm1
<+22>: jbe 0x55c <test10+28>
<+24>: addsd %xmm1,%xmm0
<+28>: movsd %xmm0,(%rdi,%rax,8)
<+33>: add $0x1,%rax
<+37>: cmp $0x10000,%rax
<+43>: jne 0x548 <test10+8>
<+45>: repz retq
Which is unfortunately not vectorized. Perhaps our expression is too complex, so let's try a simpler version:
void test11(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        /* conditional add */
        x[i] += ((y[i] > x[i]) ? y[i] : 0);
    }
}
This produces:
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movapd (%rdi,%rax,1),%xmm1
<+13>: movapd (%rsi,%rax,1),%xmm2
<+18>: movapd %xmm1,%xmm0
<+22>: cmpltpd %xmm2,%xmm0
<+27>: andpd %xmm2,%xmm0
<+31>: addpd %xmm1,%xmm0
<+35>: movapd %xmm0,(%rdi,%rax,1)
<+40>: add $0x10,%rax
<+44>: cmp $0x80000,%rax
<+50>: jne 0x578 <test11+8>
<+52>: repz retq
So the form of the expression matters. In this case, one rearrangement of the expression was vectorizable while the other was not. In general, the simpler the expression, the greater the chance of auto-vectorization.
All of the above have had fixed-length loops. What happens if the loop exit condition is more complex:
void test12(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        /* early stop */
        if (x[i] < y[i]) break;
        x[i] += y[i];
    }
}
Here we can exit early if the (hopefully vectorizable) condition occurs. Unfortunately, gcc produces:
<+0>: xor %eax,%eax
<+2>: jmp 0x5cd <test12+29>
<+4>: nopl 0x0(%rax)
<+8>: addsd %xmm1,%xmm0
<+12>: movsd %xmm0,(%rdi,%rax,8)
<+17>: add $0x1,%rax
<+21>: cmp $0x10000,%rax
<+27>: je 0x5dd <test12+45>
<+29>: movsd (%rdi,%rax,8),%xmm1
<+34>: movsd (%rsi,%rax,8),%xmm0
<+39>: ucomisd %xmm1,%xmm0
<+43>: jbe 0x5b8 <test12+8>
<+45>: repz retq
The above obviously isn't vectorized. For some reason, gcc doesn't use a packed compare followed by a move-mask to determine whether the loop can exit early while still doing two adds at a time.
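If vectorizing this by hand, one could imagine something like the following sketch: a packed compare plus movmskpd detects the exit condition for a pair of elements, and a scalar tail finds the exact exit point. It uses SSE2 intrinsics, assumes 16-byte aligned arrays, and whether it is actually a win depends on how often the early exit triggers:

#include <stddef.h>
#include <emmintrin.h>

#define SIZE (1L << 16)

/* Sketch of a hand-vectorized early-exit loop equivalent to test12(). */
void test12_intrin(double * restrict a, double * restrict b)
{
    size_t i;

    for (i = 0; i < SIZE; i += 2)
    {
        __m128d x = _mm_load_pd(&a[i]);
        __m128d y = _mm_load_pd(&b[i]);

        /* Non-zero if a[i] < b[i] or a[i+1] < b[i+1] */
        if (_mm_movemask_pd(_mm_cmplt_pd(x, y))) break;

        _mm_store_pd(&a[i], _mm_add_pd(x, y));
    }

    /* Scalar tail: finish the pair in which the exit condition was detected. */
    for (; i < SIZE; i++)
    {
        if (a[i] < b[i]) break;
        a[i] += b[i];
    }
}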
In all the above we have used simple indexing. Let's see what happens if we make it more complex:
void test13(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        x[i] = y[i] + y[i + 1];
    }
}
This compiles into:
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movsd 0x8(%rsi,%rax,1),%xmm1
<+14>: movapd (%rsi,%rax,1),%xmm0
<+19>: movhpd 0x10(%rsi,%rax,1),%xmm1
<+25>: addpd %xmm1,%xmm0
<+29>: movapd %xmm0,(%rdi,%rax,1)
<+34>: add $0x10,%rax
<+38>: cmp $0x80000,%rax
<+44>: jne 0x5e8 <test13+8>
<+46>: repz retq
This is half-vectorized. Instead of using an unaligned load instruction, gcc has used two half-load instructions for the lower and upper halves of %xmm1. The unaligned load is probably better. Another possibility is to use pipelined aligned loads followed by shuffles.
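By hand, the shifted stream could instead be read with an explicit unaligned load, roughly like this sketch (SSE2 intrinsics, 16-byte aligned arrays assumed; like the C version it reads one element past y[SIZE - 1]):

#include <stddef.h>
#include <emmintrin.h>

#define SIZE (1L << 16)

/* Sketch: x[i] = y[i] + y[i + 1] using one aligned and one unaligned load. */
void test13_intrin(double * restrict a, double * restrict b)
{
    size_t i;

    for (i = 0; i < SIZE; i += 2)
    {
        __m128d lo = _mm_load_pd(&b[i]);        /* y[i], y[i+1]   (aligned)   */
        __m128d hi = _mm_loadu_pd(&b[i + 1]);   /* y[i+1], y[i+2] (unaligned) */

        _mm_store_pd(&a[i], _mm_add_pd(lo, hi));
    }
}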
If we switch to an even offset, gcc no longer needs to do unaligned accesses, and switches back to full vectorization:
void test14(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        x[i] = y[i] + y[i + 2];
    }
}
giving
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movapd (%rsi,%rax,1),%xmm0
<+13>: addpd 0x10(%rsi,%rax,1),%xmm0
<+19>: movapd %xmm0,(%rdi,%rax,1)
<+24>: add $0x10,%rax
<+28>: cmp $0x80000,%rax
<+34>: jne 0x618 <test14+8>
<+36>: repz retq
So the quality of the code gcc produces can subtly depend on which index offsets are used. Confirming this is the simpler routine:
void test15(double * restrict a, double * restrict b)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);

    for (i = 0; i < SIZE; i++)
    {
        x[i] = y[i + 1];
    }
}
which gives:
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movsd 0x8(%rsi,%rax,1),%xmm1
<+14>: movhpd 0x10(%rsi,%rax,1),%xmm1
<+20>: movapd %xmm1,(%rdi,%rax,1)
<+25>: add $0x10,%rax
<+29>: cmp $0x80000,%rax
<+35>: jne 0x648 <test15+8>
<+37>: repz retq
Again, the half-loads appear.
Everything so far has used two input arrays. What does gcc do with more?
void test16(double * restrict a, double * restrict b, double * restrict c)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);
    double *z = __builtin_assume_aligned(c, 16);

    for (i = 0; i < SIZE; i++)
    {
        x[i] = y[i] + z[i];
    }
}
Which when compiled, gives:
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movapd (%rsi,%rax,1),%xmm0
<+13>: addpd (%rdx,%rax,1),%xmm0
<+18>: movapd %xmm0,(%rdi,%rax,1)
<+23>: add $0x10,%rax
<+27>: cmp $0x80000,%rax
<+33>: jne 0x678 <test16+8>
<+35>: repz retq
This is good, fully vectorized code.
How about something more complex, like our max function implementation?
void test17(double * restrict a, double * restrict b, double * restrict c)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);
    double *z = __builtin_assume_aligned(c, 16);

    for (i = 0; i < SIZE; i++)
    {
        x[i] = ((y[i] > z[i]) ? y[i] : z[i]);
    }
}
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movapd (%rsi,%rax,1),%xmm0
<+13>: maxpd (%rdx,%rax,1),%xmm0
<+18>: movapd %xmm0,(%rdi,%rax,1)
<+23>: add $0x10,%rax
<+27>: cmp $0x80000,%rax
<+33>: jne 0x6a8 <test17+8>
<+35>: repz retq
Which again works nicely.
So, now that we have a third array, let's use it within the expression on the right:
void test18(double * restrict a, double * restrict b, double * restrict c)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);
    double *z = __builtin_assume_aligned(c, 16);

    for (i = 0; i < SIZE; i++)
    {
        x[i] = ((y[i] > z[i]) ? x[i] : z[i]);
    }
}
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movsd (%rdx,%rax,8),%xmm0
<+13>: movsd (%rsi,%rax,8),%xmm1
<+18>: ucomisd %xmm0,%xmm1
<+22>: jbe 0x6ed <test18+29>
<+24>: movsd (%rdi,%rax,8),%xmm0
<+29>: movsd %xmm0,(%rdi,%rax,8)
<+34>: add $0x1,%rax
<+38>: cmp $0x10000,%rax
<+44>: jne 0x6d8 <test18+8>
<+46>: repz retq
Oops, no vectorization here. It looks like a conditional store is a bit too complex. How about making it a bit simpler, using an if statement instead:
void test19(double * restrict a, double * restrict b, double * restrict c)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double *y = __builtin_assume_aligned(b, 16);
    double *z = __builtin_assume_aligned(c, 16);

    for (i = 0; i < SIZE; i++)
    {
        if (y[i] <= z[i]) x[i] = z[i];
    }
}
<+0>: xor %eax,%eax
<+2>: nopw 0x0(%rax,%rax,1)
<+8>: movsd (%rdx,%rax,8),%xmm0
<+13>: ucomisd (%rsi,%rax,8),%xmm0
<+18>: jb 0x719 <test19+25>
<+20>: movsd %xmm0,(%rdi,%rax,8)
<+25>: add $0x1,%rax
<+29>: cmp $0x10000,%rax
<+35>: jne 0x708 <test19+8>
<+37>: repz retq
This is a bit tighter code, but it still isn't vectorized. We may need to go to intrinsics to get something like the output from test11().
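A hand-written version of test18() using an SSE2 compare-and-select might look like the sketch below. Note that, unlike the original, it unconditionally writes every element of x[], which is exactly the memory-model change discussed earlier; 16-byte aligned arrays are assumed:

#include <stddef.h>
#include <emmintrin.h>

#define SIZE (1L << 16)

/* Sketch: x[i] = (y[i] > z[i]) ? x[i] : z[i], two elements at a time. */
void test18_intrin(double * restrict a, double * restrict b,
                   double * restrict c)
{
    size_t i;

    for (i = 0; i < SIZE; i += 2)
    {
        __m128d x = _mm_load_pd(&a[i]);
        __m128d y = _mm_load_pd(&b[i]);
        __m128d z = _mm_load_pd(&c[i]);

        __m128d mask = _mm_cmpgt_pd(y, z);   /* all-ones lanes where y > z */

        /* Select x where the mask is set, z elsewhere */
        __m128d result = _mm_or_pd(_mm_and_pd(mask, x),
                                   _mm_andnot_pd(mask, z));

        _mm_store_pd(&a[i], result);
    }
}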
Finally, let's look at operations that scan over an array and produce a single output. gcc should be able to vectorize these reductions as well.
double test20(double * restrict a)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double y = -INFINITY;

    for (i = 0; i < SIZE; i++)
    {
        if (x[i] > y) y = x[i];
    }

    return y;
}
<+0>: movsd 0x0(%rip),%xmm0 # y
<+8>: xor %eax,%eax
<+10>: nopw 0x0(%rax,%rax,1)
<+16>: movsd (%rdi,%rax,8),%xmm1
<+21>: add $0x1,%rax
<+25>: cmp $0x10000,%rax
<+31>: maxsd %xmm0,%xmm1
<+35>: movapd %xmm1,%xmm0
<+39>: jne 0x740 <test20+16>
<+41>: repz retq
Unfortunately, this global maximum operation isn't vectorized. Let's try a simpler function that calculates a global sum instead:
double test21(double * restrict a)
{
    size_t i;
    double *x = __builtin_assume_aligned(a, 16);
    double y = 0;

    for (i = 0; i < SIZE; i++)
    {
        y += x[i];
    }

    return y;
}
Which produces:
<+0>: xorpd %xmm0,%xmm0
<+4>: xor %eax,%eax
<+6>: nopw %cs:0x0(%rax,%rax,1)
<+16>: addsd (%rdi,%rax,8),%xmm0
<+21>: add $0x1,%rax
<+25>: cmp $0x10000,%rax
<+31>: jne 0x770 <test21+16>
<+33>: repz retq
Which also isn't vectorized. What is going on? The problem here is that gcc isn't allowed to re-order the operations we give it. Even though the maximum and addition operations are associative over the real numbers, they aren't for floating-point numbers. (Consider what happens with rounding, or with signed zeros, for example.)
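A small, standalone illustration of the addition case (ordinary C, nothing gcc-specific): the two groupings round differently on IEEE-754 doubles, so the compiler may not swap them on its own.

#include <stdio.h>

int main(void)
{
    double left  = (0.1 + 0.2) + 0.3;   /* one rounding order */
    double right = 0.1 + (0.2 + 0.3);   /* another rounding order */

    /* The two results differ by one unit in the last place. */
    printf("%.17g\n%.17g\n", left, right);
    return 0;
}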
We need to tell gcc that re-ordering operations is okay with us. To do this, we need to add another compile-time flag, -ffast-math. If we do this, we find that the output for functions 20 and 21 is improved:
<+0>: lea 0x80000(%rdi),%rax
<+7>: movapd 0x0(%rip),%xmm0 # 0x75f <test20+15>
<+15>: nop
<+16>: maxpd (%rdi),%xmm0
<+20>: add $0x10,%rdi
<+24>: cmp %rax,%rdi
<+27>: jne 0x760 <test20+16>
<+29>: movapd %xmm0,%xmm1
<+33>: unpckhpd %xmm0,%xmm0
<+37>: maxsd %xmm1,%xmm0
<+41>: retq
gcc is very intelligent, and does a vectorized maximum calculation, followed by a horizontal maximum at the end.
<+0>: xorpd %xmm1,%xmm1
<+4>: lea 0x80000(%rdi),%rax
<+11>: nopl 0x0(%rax,%rax,1)
<+16>: addpd (%rdi),%xmm1
<+20>: add $0x10,%rdi
<+24>: cmp %rax,%rdi
<+27>: jne 0x790 <test21+16>
<+29>: movapd %xmm1,%xmm2
<+33>: unpckhpd %xmm2,%xmm2
<+37>: movapd %xmm2,%xmm0
<+41>: addsd %xmm1,%xmm0
<+45>: retq
And again gcc manages to vectorize the summation, even though it has to do a horizontal add at the end.
Summary
gcc is very good, and can auto-vectorize many inner loops. However, if the expressions get too complex, vectorization will fail. gcc also may not produce the most optimal form of the loop kernel. In general, the simpler the code, the more likely gcc is to give good results.
However, you cannot expect gcc to give you what you want without a few tweaks. You may need to add the -ffast-math flag to allow floating-point operations to be re-ordered. You will definitely need to tell the compiler about alignment and array-overlap constraints to get good code.
On the other hand, gcc will still attempt to vectorize code that hasn't been changed at all. It just won't be able to get nearly as much of a performance improvement as you might hope.
However, as time passes, more inner loop patterns will be added to the vectorizable list. Thus if you are using later versions of gcc, don't take the above results for granted. Check the output of the compiler yourself to see if it is behaving as you might expect. You might be pleasantly surprised by what it can do.
Comments
bobbyprani said...
__builtin_assume_aligned(a, 16);
was not recognised, and you had to switch to:
double *x = __builtin_assume_aligned(a, 16);
Also, what are the practical implications of making that assumption? Do I have to do something special when I create the array?
(PS: your captcha is hard)
For 16-byte alignment: Anything allocated by malloc() and friends will already be 16-byte aligned due to the ABI. Arrays allocated statically or on the stack probably will not be 16-byte aligned. You will need to add a non-standard gcc alignment attribute for them.
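For example, a declaration with the attribute might look like this (a sketch; the array name is arbitrary):

/* gcc extension: force 16-byte alignment on this array */
static double buf[SIZE] __attribute__((aligned(16)));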
The Captcha is unusual... but we get hundreds of spam attempts per day, and it filters out all of them.
Then I tried your examples with icc and used:
#pragma vector always
#pragma vector aligned
before the loop and flags:
-restrict -O3 -march=corei7-avx -vec-report2
to icc. That worked and it vectorized nicely. But I had to manually edit the generated .s file a bit before I could feed it back to gcc. Basically what I did was remove some Intel-specific stuff, add .cfi_startproc and .cfi_endproc, and change to the .p2align syntax like the one gcc -O3 usually generates.
It would be nice to see a vectorization comparison between gcc and icc, as the first comment suggested. Keep up the good work!
test2:
.CG4:
.CG5:
movl %edi,%edx
andl $15,%edx
movl %esi,%eax
andl $15,%eax
movl %edi,%ecx
andl $7,%ecx
xorl %edx,%eax
xorl %r8d,%r8d
orl %ecx,%eax
je .L77000045.44
.L77000069.43:
xorq %rax,%rax
.align 16
.L77000057.52:
prefetcht0 256(%rdi,%rax)
movupd (%rdi,%rax),%xmm0
movupd (%rsi,%rax),%xmm1
addpd %xmm1,%xmm0
movupd %xmm0,(%rdi,%rax)
prefetcht0 272(%rsi,%rax)
movupd 16(%rdi,%rax),%xmm0
movupd 16(%rsi,%rax),%xmm1
addpd %xmm1,%xmm0
movupd %xmm0,16(%rdi,%rax)
movupd 32(%rdi,%rax),%xmm0
movupd 32(%rsi,%rax),%xmm1
addpd %xmm1,%xmm0
movupd %xmm0,32(%rdi,%rax)
movupd 48(%rdi,%rax),%xmm0
movupd 48(%rsi,%rax),%xmm1
addpd %xmm1,%xmm0
movupd %xmm0,48(%rdi,%rax)
addq $64,%rax
addl $8,%r8d
cmpl $65534,%r8d
jle .L77000057.52
.LX2.123:
jmp .LE5.143
.L77000045.44:
movl $16,%ecx
subl %edx,%ecx
sarl $3,%ecx
andl $1,%ecx
jle .LE0.120
.L77000071.46:
addl $-1,%ecx
movq $65535,%rax
movslq %ecx,%rcx
cmpq %rcx,%rax
movq $0,%rdx
cmovlq %rax,%rcx
.zalign 16,8
.L77000049.48:
movsd (%rdi,%rdx),%xmm0
addsd (%rsi,%rdx),%xmm0
movsd %xmm0,(%rdi,%rdx)
addq $8,%rdx
addl $1,%r8d
cmpl %ecx,%r8d
jle .L77000049.48
.LX0.119:
.LE0.120:
.L77000044.45:
movl $65536,%ecx
subl %r8d,%ecx
leal 1(%rcx),%edx
testl %ecx,%ecx
cmovnsl %ecx,%edx
sarl $1,%edx
testl %edx,%edx
jle .LE5.143
.L77000072.50:
movl $-2147483647,%eax
andl %ecx,%eax
jns .CG3.144
andl $1,%eax
negl %eax
.CG3.144:
subl %eax,%ecx
addl %r8d,%ecx
movslq %r8d,%rax
movl %ecx,%edx
subl %r8d,%edx
addl $-2,%edx
leal -2(%rcx),%r10d
leal 1(%rdx),%r9d
shlq $3,%rax
testl %edx,%edx
cmovnsl %edx,%r9d
sarl $1,%r9d
addl $1,%r9d
cmpl %r10d,%r8d
jle .CG2.134
xorl %r9d,%r9d
.CG2.134:
cmpl $4,%r9d
jl .LU0.135
.LP0.139:
addl $-8,%ecx
.zalign 16,8
.L77000052.51:
prefetcht0 256(%rdi,%rax)
movapd (%rdi,%rax),%xmm0
addpd (%rsi,%rax),%xmm0
movapd %xmm0,(%rdi,%rax)
prefetcht0 272(%rsi,%rax)
movapd 16(%rdi,%rax),%xmm0
addpd 16(%rsi,%rax),%xmm0
movapd %xmm0,16(%rdi,%rax)
movapd 32(%rdi,%rax),%xmm0
addpd 32(%rsi,%rax),%xmm0
movapd %xmm0,32(%rdi,%rax)
movapd 48(%rdi,%rax),%xmm0
addpd 48(%rsi,%rax),%xmm0
movapd %xmm0,48(%rdi,%rax)
addq $64,%rax
addl $8,%r8d
cmpl %ecx,%r8d
jle .L77000052.51
.LX4.140:
.LE4.141:
cmpl %r10d,%r8d
jg .LE5.143
.LU0.135:
.zalign 16,8
.LU1.136:
movapd (%rdi,%rax),%xmm0
addpd (%rsi,%rax),%xmm0
movapd %xmm0,(%rdi,%rax)
addq $16,%rax
addl $2,%r8d
cmpl %r10d,%r8d
jle .LU1.136
.LX5.142:
.LE5.143:
.LX1.121:
.LE1.122:
.LE2.124:
.L77000046.49:
cmpl $65535,%r8d
jg .LE3.126
.L77000070.54:
movslq %r8d,%rax
leaq (,%rax,8),%rcx
.zalign 16,8
.L77000054.55:
movsd (%rdi,%rcx),%xmm0
addsd (%rsi,%rcx),%xmm0
movsd %xmm0,(%rdi,%rcx)
addq $8,%rcx
addq $1,%rax
cmpq $65535,%rax
jle .L77000054.55
.LX3.125:
.LE3.126:
.L77000034.53:
repz
ret
Quick question: I don't understand the requirement for the "16-byte aligned" property.
Both SSE and AVX offer instructions to load from unaligned memory.
Look at movdqu (SSE) or vmovdqu (AVX).
They are very good. Not only do they work properly on unaligned memory, they also transparently benefit from loading aligned memory (i.e. they run faster in that case).
Considering the gigantic flexibility these instructions offer, I would always use them instead of their "aligned" counterparts (movdqa, vmovdqa).
Not sure, from your example, if GCC is able to use them automatically (without intrinsics).
Great article! Thanks from somebody just getting started with vectorization!