Does the restrict keyword provide significant benefits in gcc/g++?

Asked 27/12, 2009 at 8:4 Answered 14/6, 2015 at 10:43

Has anyone ever seen any numbers/analysis on whether or not use of the C/C++ restrict keyword in gcc/g++ actual provides any significant performance boost in reality (and not just in theory)?

I've read various articles recommending / disparaging its use, but I haven't ran across any real numbers practically demonstrating either sides arguments.

EDIT

I know that restrict is not officially part of C++, but it is supported by some compilers and I've read a paper by Christer Ericson which strongly recommends its usage.

Impressment answered 27/12, 2009 at 8:4 Comment(2)

Aliasing issues are commonly considered the #1 reason why C/C++ are less efficient at many computational tasks than Fortran. So I'd say any feature which helps avoid aliasing can make a big difference. – Alysonalysoun 27/12, 2009 at 23:58

possible duplicate of Realistic usage of the C99 'restrict' keyword? – Lobeline 14/6, 2015 at 10:48

The restrict keyword does a difference.

I've seen improvements of factor 2 and more in some situations (image processing). Most of the time the difference is not that large though. About 10%.

Here is a little example that illustrate the difference. I've written a very basic 4x4 vector * matrix transform as a test. Note that I have to force the function not to be inlined. Otherwise GCC detects that there aren't any aliasing pointers in my benchmark code and restrict wouldn't make a difference due to inlining.

I could have moved the transform function to a different file as well.

#include <math.h>

#ifdef USE_RESTRICT
#else
#define __restrict
#endif


void transform (float * __restrict dest, float * __restrict src, 
                float * __restrict matrix, int n) __attribute__ ((noinline));

void transform (float * __restrict dest, float * __restrict src, 
                float * __restrict matrix, int n)
{
  int i;

  // simple transform loop.

  // written with aliasing in mind. dest, src and matrix 
  // are potentially aliasing, so the compiler is forced to reload
  // the values of matrix and src for each iteration.

  for (i=0; i<n; i++)
  {
    dest[0] = src[0] * matrix[0] + src[1] * matrix[1] + 
              src[2] * matrix[2] + src[3] * matrix[3];

    dest[1] = src[0] * matrix[4] + src[1] * matrix[5] + 
              src[2] * matrix[6] + src[3] * matrix[7];

    dest[2] = src[0] * matrix[8] + src[1] * matrix[9] + 
              src[2] * matrix[10] + src[3] * matrix[11];

    dest[3] = src[0] * matrix[12] + src[1] * matrix[13] + 
              src[2] * matrix[14] + src[3] * matrix[15];

    src  += 4;
    dest += 4;
  }
}

float srcdata[4*10000];
float dstdata[4*10000];

int main (int argc, char**args)
{
  int i,j;
  float matrix[16];

  // init all source-data, so we don't get NANs  
  for (i=0; i<16; i++)   matrix[i] = 1;
  for (i=0; i<4*10000; i++) srcdata[i] = i;

  // do a bunch of tests for benchmarking. 
  for (j=0; j<10000; j++)
    transform (dstdata, srcdata, matrix, 10000);
}

Results: (on my 2 Ghz Core Duo)

nils@doofnase:~$ gcc -O3 test.c
nils@doofnase:~$ time ./a.out

real    0m2.517s
user    0m2.516s
sys     0m0.004s

nils@doofnase:~$ gcc -O3 -DUSE_RESTRICT test.c
nils@doofnase:~$ time ./a.out

real    0m2.034s
user    0m2.028s
sys     0m0.000s

Over the thumb 20% faster execution, on that system.

To show how much it depends on the architecture I've let the same code run on a Cortex-A8 embedded CPU (adjusted the loop count a bit cause I don't want to wait that long):

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp test.c
root@beagleboard:~# time ./a.out

real    0m 7.64s
user    0m 7.62s
sys     0m 0.00s

root@beagleboard:~# gcc -O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -DUSE_RESTRICT test.c 
root@beagleboard:~# time ./a.out

real    0m 7.00s
user    0m 6.98s
sys     0m 0.00s

Here the difference is just 9% (same compiler btw.)

Fiorin answered 27/12, 2009 at 18:31 Comment(4)

Nice work. There is an article on the use of restrict on a Cell processor here: cellperformance.beyond3d.com/articles/2006/05/… that may be relevant to the discussion architecture specific benefits. – Incontinent 27/12, 2009 at 20:31

@Nils Pipenbrinck: Why do you have to disable inlining for the function? It seems like an awfully big function for the compiler to automatically inline. – Impressment 28/12, 2009 at 7:40

@Nils Pipenbrinck: By the way Ulrich Drepper posted code for a superoptimized matrix multiply as part of his discussion of optimizing cache and memory usage. It's here: lwn.net/Articles/258188 . His discussion of each step he went through to arrive at that solution is here: lwn.net/Articles/255364 . He was able to reduce the execution time by 90% over a standard MM. – Impressment 28/12, 2009 at 7:45

@Nils Pipenbrinck: I ran your test. When I compiled with -O3 I got the same result as you, about a 20% speed increase. But when I compiled without any optimization flag the two runs where identical. This is interesting: No Op flag = 0m5.022s for both, -O3 0m3.186s & 0m2.583s, -Os 0m2.391s & 0m2.314s. Optimizing for size gives the best result. Adding restrict in this case only bought an extra 3.2% performance. Wonder why that is? – Impressment 28/12, 2009 at 7:58

Does the restrict keyword provide significant benefits in gcc / g++ ?

It can reduce the number of instructions as shown on the example below, so use it whenever possible.

GCC 4.8 Linux x86-64 exmample

Input:

void f(int *a, int *b, int *x) {
  *a += *x;
  *b += *x;
}

void fr(int *restrict a, int *restrict b, int *restrict x) {
  *a += *x;
  *b += *x;
}

Compile and decompile:

gcc -g -std=c99 -O0 -c main.c
objdump -S main.o

With -O0, they are the same.

With -O3:

void f(int *a, int *b, int *x) {
    *a += *x;
   0:   8b 02                   mov    (%rdx),%eax
   2:   01 07                   add    %eax,(%rdi)
    *b += *x;
   4:   8b 02                   mov    (%rdx),%eax
   6:   01 06                   add    %eax,(%rsi)  

void fr(int *restrict a, int *restrict b, int *restrict x) {
    *a += *x;
  10:   8b 02                   mov    (%rdx),%eax
  12:   01 07                   add    %eax,(%rdi)
    *b += *x;
  14:   01 06                   add    %eax,(%rsi)

For the uninitiated, the calling convention is:

rdi = first parameter
rsi = second parameter
rdx = third parameter

Conclusion: 3 instructions instead of 4.

Of course, instructions can have different latencies, but this gives a good idea.

Why GCC was able to optimize that?

The code above was taken from the Wikipedia example which is very illuminating.

Pseudo assembly for f:

load R1 ← *x    ; Load the value of x pointer
load R2 ← *a    ; Load the value of a pointer
add R2 += R1    ; Perform Addition
set R2 → *a     ; Update the value of a pointer
; Similarly for b, note that x is loaded twice,
; because x may point to a (a aliased by x) thus 
; the value of x will change when the value of a
; changes.
load R1 ← *x
load R2 ← *b
add R2 += R1
set R2 → *b

For fr:

load R1 ← *x
load R2 ← *a
add R2 += R1
set R2 → *a
; Note that x is not reloaded,
; because the compiler knows it is unchanged
; "load R1 ← *x" is no longer needed.
load R2 ← *b
add R2 += R1
set R2 → *b

Is it really any faster?

Ermmm... not for this simple test:

.text
    .global _start
    _start:
        mov $0x10000000, %rbx
        mov $x, %rdx
        mov $x, %rdi
        mov $x, %rsi
    loop:
        # START of interesting block
        mov (%rdx),%eax
        add %eax,(%rdi)
        mov (%rdx),%eax # Comment out this line.
        add %eax,(%rsi)
        # END ------------------------
        dec %rbx
        cmp $0, %rbx
        jnz loop
        mov $60, %rax
        mov $0, %rdi
        syscall
.data
    x:
        .int 0

And then:

as -o a.o a.S && ld a.o && time ./a.out

on Ubuntu 14.04 AMD64 CPU Intel i5-3210M.

I confess that I still don't understand modern CPUs. Let me know if you:

found a flaw in my method
found an assembler test case where it becomes much faster
understand why there wasn't a difference

Lobeline answered 14/6, 2015 at 10:43 Comment(0)

The article Demystifying The Restrict Keyword refers to the paper Why Programmer-specified Aliasing is a Bad Idea (pdf) which says it generally doesn't help and provides measurements to back this up.

Fumatorium answered 15/5, 2011 at 20:18 Comment(1)

There are many kinds of code where it provides little benefit, but there are some where it provides a huge benefit. Are you aware of any papers that would show that judicious use of "restrict" would not offer benefits greater than those compilers can realize through type-based aliasing? – Lobby 9/3, 2016 at 0:28

-1

Note that C++ compilers that allow the restrict keyword may still ignore it. That is the case for example here.

Incontinent answered 27/12, 2009 at 9:42 Comment(5)

Actually, if you read down the page you'll notice that while restrict is ignored in C++ because of a potential conflict with user variables of the same name, __restrict__ is supported for C++. – Impressment 27/12, 2009 at 9:59

@Robert: And ignored. The difference is only that identifiers with a double underscore are reserved for system usage. Thus a __restrict__ should not clash with any user declared identifiers. – Paragraph 27/12, 2009 at 20:2

@Martin: How do you know it's ignored? It's not completely clear from the documentation - seems like you could read it either way. – Impressment 28/12, 2009 at 7:35

I agree that it is not clear, but it would seem inconsistent to ignore restrict but not __restrict__. Either way, it remains non-portable, and beneficial in very specific cases. I suggest that if you know it is beneficial in a particular situation, and you need that benefit to achieve success, then use it; otherwise why make the code gratuitously non-portable? I would not use it habitually, but as a last resort and after testing the actual benefit. – Incontinent 28/12, 2009 at 9:32

@Clifford: Of course, but it's like that with pretty much any optimization - test this way and that way and then use what works. – Impressment 28/12, 2009 at 11:17

-2

I tested this C-Program. Without restrict it took 12.640 seconds to complete, with restrict 12.516. Looks like it can save some time.

Coumas answered 27/12, 2009 at 9:33 Comment(6)

That difference is almost certainly insignificant, however, you should also declare c as restricted since each time c is written to at the moment the compiler may be considering that *a *b and *inc might have been changed. – Lariat 27/12, 2009 at 16:17

In your example the optimizer can detect that the parameters don't have aliasing. Try to disable inlining and you'll see a bigger difference. – Fiorin 27/12, 2009 at 18:41

But if you disable inlining, you're artificially crippling the compiler, so you no longer get an accurate picture of how much restricthelps on real-world code. – Alysonalysoun 28/12, 2009 at 0:0

@raphaelr: It seems like you need to use optimization flags for restrict to be useful. Try either -O3 or -Os. – Impressment 28/12, 2009 at 8:11

@DrewHall — A difference of 1/8 second may indeed be noise, but it is not necessarily noise. The only way to make that claim with certainty is to know for sure that the measurement was carried out poorly, for example a single run on a loaded system. But for all we know, raphaelr averaged the runtimes over a dozen runs, and the 1% difference is valid. Unfortunately, the link he gave is dead now, so there's no way to run an independent test of his code. – Egidio 11/8, 2014 at 0:9

Dead link, and no info on what compiler or target architecture this was for, so we can't see if there was a code-gen difference or not. – Cobol 13/3, 2019 at 5:49

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags