How to vectorize with gcc?

The v4 series of the gcc compiler can automatically vectorize loops using the SIMD instructions available on some modern CPUs, such as the AMD Athlon or Intel Pentium/Core chips. How is this done?

Contented answered 3/1, 2009 at 16:22 Comment(1)
By "how is this done", do you mean how to enable gcc's autovectorization support, or how the compiler actually recognizes vectorizable code and implements that support? - Malca

The original page offers details on getting gcc to automatically vectorize loops, including a few examples:

http://gcc.gnu.org/projects/tree-ssa/vectorization.html

While the examples are great, the syntax for those options seems to have changed a bit in more recent GCC versions.

In summary, the following options will work for x86 chips with SSE2, giving a log of loops that have been vectorized:

gcc -O2 -ftree-vectorize -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5

Note that -msse is also a possibility, but it will only vectorize loops using floats, not doubles or ints. (SSE2 is baseline for x86-64. For 32-bit code use -mfpmath=sse as well. That's the default for 64-bit but not 32-bit.)


Modern versions of GCC enable -ftree-vectorize at -O3 so just use that in GCC4.x and later:

gcc -O3 -msse2 -mfpmath=sse -ftree-vectorizer-verbose=5

(Clang enables auto-vectorization at -O2. ICC defaults to optimization enabled + fast-math.)
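As a minimal sketch of the kind of loop these options target, here is an element-wise array addition. The function name add_arrays and the file name add_arrays.c are made up for illustration and do not come from the GCC page. It assumes C99 or later (for restrict and the loop-scoped index).

```c
#include <stddef.h>

/* Hypothetical example: dst[i] = a[i] + b[i].
 * `restrict` promises the compiler that the three arrays do not overlap,
 * so it does not have to guard the vectorized loop with an overlap check. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    /* Each iteration is independent of the others, which is what lets
     * the compiler process several elements per SIMD instruction. */
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Something like gcc -O3 -msse2 -c add_arrays.c should be enough to try it. On newer GCC releases, where -ftree-vectorizer-verbose is no longer accepted, -fopt-info-vec is the option that reports which loops were vectorized.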


Most of the following was written by Peter Cordes, who could have just written a new answer. Over time, as compilers change, options and compiler output will change. I am not entirely sure whether it is worth tracking it in great detail here. Comments? -- Author

To also use instruction set extensions supported by the hardware you're compiling on, and tune for it, use -march=native.

Reduction loops (like the sum of an array) need OpenMP or -ffast-math so that the compiler may treat FP math as associative and vectorize. An example on the Godbolt compiler explorer with -O3 -march=native -ffast-math includes a reduction (array sum) that is scalar without -ffast-math. (Well, GCC8 and later do a SIMD load and then unpack it to scalar elements, which is pointless vs. simple unrolling. The loop bottlenecks on the latency of the one addss dependency chain.)
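A sketch of the OpenMP route for such a reduction, assuming a GCC new enough to support -fopenmp-simd. The function name sum_array is hypothetical. The pragma grants reordering permission for this one loop only, instead of relaxing FP rules for the whole file as -ffast-math does.

```c
#include <stddef.h>

/* Hypothetical example: sum of a float array.
 * The pragma tells the compiler it may split `sum` into several partial
 * sums and combine them at the end. GCC acts on it when compiling with
 * -fopenmp or -fopenmp-simd; without those flags the pragma is ignored
 * and the additions stay in strict source order. */
float sum_array(const float *a, size_t n)
{
    float sum = 0.0f;
#pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Because the additions may be reordered, the result can differ in the last bits from the strictly sequential sum. That difference is the reason the compiler will not vectorize this loop on its own.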

Sometimes you don't need -ffast-math; -fno-math-errno alone can help gcc inline math functions and vectorize something involving sqrt and/or rint / nearbyint.

Other useful options include -flto (link-time optimization for cross-file inlining, constant propagation, etc.) and/or profile-guided optimization with -fprofile-generate / test run(s) with realistic input(s) / -fprofile-use. PGO enables loop unrolling for "hot" loops; in modern GCC that's off by default even at -O3.

Contented answered 3/1, 2009 at 16:24 Comment(7)
-ftree-vectorizer-verbose=5 is the old syntax; one needs to use the newer syntax now. - Zebrawood
Does GCC have a more up-to-date document about vectorization? - Drooff
That flag and the ones specified in the link @Zebrawood gave no longer exist in gcc 8.3. Trying to pin down the flags that gcc offers is a bit difficult. The link in my original post has not been updated in 8 years, either. - Contented
GCC enables auto-vectorization at -O3. Prefer that. (It doesn't enable loop unrolling by default these days; ideally use -fprofile-generate + -fprofile-use to get hot loops unrolled.) Also prefer -O3 -march=native -ffast-math when compiling only for your own computer. See also "C loop optimization help for final assignment" for some examples of GCC auto-vectorization and auto-parallelization with non-ancient gcc. - Nomology
@PeterCordes I didn't know -march=native and it works really well. Just specifying the flag made my code 1.19 times faster. Thank you. - Praetorian
@casualcoder: fixed that for you. GCC -O3 includes auto-vectorization as early as 4.4, not GCC8. I also added some other relevant options and a link to an example on Godbolt. - Nomology
I could have written a new answer, but maintaining existing answers on SO is generally a good thing. Leaving the top-voted answer with something as misleading as -O3 only vectorizing with GCC8 seemed like a bad thing. (I guess you could have changed your accept vote if I'd written a new answer, and you editing proved you were active; more often we have answers accepted long ago by an asker who's not coming back so that's an even less good option, and this answer did need some fixing. I do tend to get carried away and make edits larger than they need to be...) - Nomology

GCC has a pass, pass_vectorize, that operates on GIMPLE (one of GCC's intermediate representations). This pass performs auto-vectorization at the GIMPLE level.

To enable auto-vectorization for a target (GCC 4.4.0), the following steps are needed:

  1. Specify the number of words in a vector for the target architecture. This can be done by defining the macro UNITS_PER_SIMD_WORD.
  2. Define the possible vector modes in a separate file, usually <target>-modes.def. This file has to reside in the directory that holds the other machine description files. (This is what the configuration script expects. If you can change the script, you can place the file in whatever directory you want.)
  3. State which modes are to be considered for vectorization on the target architecture: for instance, whether four words, eight half-words, or two double-words constitute a vector. These details need to be given in the <target>-modes.def file. For example:

    VECTOR_MODES (INT, 8);     /*       V8QI V4HI V2SI */
    VECTOR_MODES (INT, 16);    /* V16QI V8HI V4SI V2DI */
    VECTOR_MODES (FLOAT, 8);   /*            V4HF V2SF */
  4. Build the port. Vectorization can be enabled using the command line options -O2 -ftree-vectorize.

Wakefield answered 3/11, 2009 at 12:10 Comment(0)
