Does the restrict keyword provide significant benefits in gcc / g++ ?
It can reduce the number of instructions as shown on the example below, so use it whenever possible.
GCC 4.8 Linux x86-64 exmample
Input:
void f(int *a, int *b, int *x) {
*a += *x;
*b += *x;
}
void fr(int *restrict a, int *restrict b, int *restrict x) {
*a += *x;
*b += *x;
}
Compile and decompile:
gcc -g -std=c99 -O0 -c main.c
objdump -S main.o
With -O0
, they are the same.
With -O3
:
void f(int *a, int *b, int *x) {
*a += *x;
0: 8b 02 mov (%rdx),%eax
2: 01 07 add %eax,(%rdi)
*b += *x;
4: 8b 02 mov (%rdx),%eax
6: 01 06 add %eax,(%rsi)
void fr(int *restrict a, int *restrict b, int *restrict x) {
*a += *x;
10: 8b 02 mov (%rdx),%eax
12: 01 07 add %eax,(%rdi)
*b += *x;
14: 01 06 add %eax,(%rsi)
For the uninitiated, the calling convention is:
rdi
= first parameter
rsi
= second parameter
rdx
= third parameter
Conclusion: 3 instructions instead of 4.
Of course, instructions can have different latencies, but this gives a good idea.
Why GCC was able to optimize that?
The code above was taken from the Wikipedia example which is very illuminating.
Pseudo assembly for f
:
load R1 ← *x ; Load the value of x pointer
load R2 ← *a ; Load the value of a pointer
add R2 += R1 ; Perform Addition
set R2 → *a ; Update the value of a pointer
; Similarly for b, note that x is loaded twice,
; because x may point to a (a aliased by x) thus
; the value of x will change when the value of a
; changes.
load R1 ← *x
load R2 ← *b
add R2 += R1
set R2 → *b
For fr
:
load R1 ← *x
load R2 ← *a
add R2 += R1
set R2 → *a
; Note that x is not reloaded,
; because the compiler knows it is unchanged
; "load R1 ← *x" is no longer needed.
load R2 ← *b
add R2 += R1
set R2 → *b
Is it really any faster?
Ermmm... not for this simple test:
.text
.global _start
_start:
mov $0x10000000, %rbx
mov $x, %rdx
mov $x, %rdi
mov $x, %rsi
loop:
# START of interesting block
mov (%rdx),%eax
add %eax,(%rdi)
mov (%rdx),%eax # Comment out this line.
add %eax,(%rsi)
# END ------------------------
dec %rbx
cmp $0, %rbx
jnz loop
mov $60, %rax
mov $0, %rdi
syscall
.data
x:
.int 0
And then:
as -o a.o a.S && ld a.o && time ./a.out
on Ubuntu 14.04 AMD64 CPU Intel i5-3210M.
I confess that I still don't understand modern CPUs. Let me know if you:
- found a flaw in my method
- found an assembler test case where it becomes much faster
- understand why there wasn't a difference