How to help gcc vectorize C code

I have the following C code. The first part just reads a matrix of complex numbers from standard input into a matrix called M. The interesting part is the second part.

#include <stdio.h>
#include <complex.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>

int main() {
    int n, m, c, d;
    float re, im;

    scanf("%d %d", &n, &m);
    assert(n==m);
    complex float M[n][n];

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
    scanf("%f%fi", &re, &im);
    M[c][d] = re + im * I;
      }
    }

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
        printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d]));
      }
      printf("\n");
    }
/*
Example:input   
2 3
1+2i 2+3i 74-4i
3+4i 4+5i -7-8i
*/
    /* Part 2. M is now an n by n matrix of complex numbers */
    int s=1, i, j;
    int *f = malloc(n * sizeof *f);
    complex float *delta = malloc(n * sizeof *delta);
    complex float *v = malloc(n * sizeof *v);
    complex float p = 1, prod;

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      f[i] = i;
      delta[i] = 1;
    }
    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }
    free(delta);
    free(f);
    free(v);
    printf("%f + i%f\n", creal(p/pow(2.,(n-1))), cimag(p/pow(2.,(n-1))));
    return 0;
}

I compile with gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 permanent-in-c.c -lm. The -fopt-info-vec-all output explains why almost none of the loops are being vectorized.

The most important part for performance is lines 47--50 which are:

for (i = 0; i < n; i++) {
    v[i] -= 2.*delta[j]*M[j][i];
    prod *= v[i];
}

gcc tells me:

permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: Unsupported pattern.
permanent-in-c.c:47:7: note: not vectorized: unsupported use in stmt.
permanent-in-c.c:47:7: note: unexpected pattern.
[...]
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: IMAGPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: REALPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
[...]
permanent-in-c.c:48:26: note: Build SLP failed: unrolling required in basic block SLP
permanent-in-c.c:48:26: note: Failed to SLP the basic block.
permanent-in-c.c:48:26: note: not vectorized: failed to find SLP opportunities in basic block.

How can I fix the problems that are stopping this part from being vectorized?


Curiously, this part is vectorized, but I am not sure why:

for (j = 0; j <n; j++) {
    v[i] += M[j][i];
}

The full output of gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 permanent-in-c.c -lm is at https://bpaste.net/show/18ebc3d66a53.

Schnell answered 13/1, 2017 at 16:52 Comment(14)
SIMD instructions require 16 byte alignment and malloc makes no such guarantees. What happens if you switch to posix_memalign?Adaurd
@SeanBright That's interesting. What would I replace the line "complex float *v = malloc(n * sizeof *v);" with for example to use posix_memalign?Schnell
Something like complex float *v; if (posix_memalign((void **) &v, 16, n * sizeof *v)) { /* error */ }Adaurd
Actually, I take back everything I've said. I'm seeing sources that say that malloc will always return 16 byte aligned pointers (on 64-bit systems), so this is probably a red herring.Adaurd
You're asking the compiler to vectorize a complex number multiply. While it's certainly possible (by hand), I'm not sure if compilers are smart enough to do it automatically.Antiscorbutic
for GCC you can use __builtin_assume_aligned; also, if you want more alignment than the default, try aligned_alloc(), which is available in C11Marileemarilin
@Marileemarilin If I do delta = __builtin_assume_aligned(malloc(n * sizeof *delta), ALIGNMENT ); what should ALIGNMENT be and how do I ensure it's actually true? Do I need to do both aligned_alloc and __builtin_assume_aligned ?Schnell
@Antiscorbutic The loop involving addition that I showed is vectorised. Do you think gcc simply can't vectorise multiplication?Schnell
@eleanora Most likely. And even when you do it by hand, there's a lot of shuffling overhead. If you really care about performance, you cannot use the complex type since it's an AOS packing. You need to reorder the data to SOA.Antiscorbutic
@Antiscorbutic Any help doing that would be really great!Schnell
@eleanora Read up on AOS/SOA (array-of-structs/struct-of-arrays). The complex number type is a struct with the real and imaginary parts. So rather than use that, use separate arrays for each part.Antiscorbutic
I don't trust, or expect, the GCC vectorizer to yield truly good results; I use the GCC vector extensions instead (with separate matrices and vectors for the real and imaginary parts). The code ends up being pretty much write-only, so I need to keep a non-vectorized version of the same function around (for reference, and for unit/comparison testing).Rimple
@eleanora: The values of j in the latter loop seem to be OEIS A007814 (I'm partial to binary trees!). Is there a reason you do not just initialize delta[] to all 2.0f, and avoid the extra multiplication per matrix entry? As I mentioned, using GCC vector extensions the code becomes a lot more obtuse, but it does vectorize quite efficiently, and the code is not specific to any arch extension, only vector size. Do you need suggestions with regards to that?Rimple
@NominalAnimal You are quite right about OEIS A007814 and you are right that it would be a little quicker to initialize delta[] to all 2.0f. However the main problem is vectorization. I would love some help in that regard.Schnell

Let's examine the code in detail, first. We have

complex float  M[rows][cols];
complex float  v[cols];
float          delta[rows];
complex float  p = 1.0f;
float          s = 1.0f;

While delta[] is of complex float type in OP's code, it only ever contains -1.0f or +1.0f. (Furthermore, the calculations could be simplified if it were -2.0f or +2.0f instead.) For that reason, I defined it as real and not complex.

Similarly, OP defines s as int, but uses it effectively as -1.0f and +1.0f only (in the calculations). That's why I defined it explicitly as float.

I omit the f array, because there is a trivial way of avoiding it entirely.

The first loop of the interesting part of the code,

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      delta[i] = 1;
    }

performs several things. It initializes all elements in the delta[] array to 1; it could (and probably should) be split into a separate loop.

Since the outer loop is increasing in i, p will be the product of the elements in v; it could be split off into a separate loop, too.

Because the inner loop sums all the elements in column i into v[i], the outer and inner loops together simply add each row of the matrix, as a vector, to vector v.

We can thus rewrite the above, in pseudocode, as

Copy first row of matrix M to vector v
For r = 1 .. rows-1:
    Add complex values in row r of matrix M to vector v

p = product of complex elements in vector v

delta = 1.0f, 1.0f, 1.0f, .., 1.0f, 1.0f
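
For reference, a direct C rendering of this pseudocode, keeping the question's variable names but with delta as a plain float array (a sketch, not the only way to arrange it):

    /* v = first row of M */
    for (i = 0; i < n; i++)
        v[i] = M[0][i];
    /* add the remaining rows, row by row */
    for (j = 1; j < n; j++)
        for (i = 0; i < n; i++)
            v[i] += M[j][i];

    /* p = product of the column sums */
    p = 1;
    for (i = 0; i < n; i++)
        p *= v[i];

    /* delta[] starts out as all +1 */
    for (i = 0; i < n; i++)
        delta[i] = 1.0f;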

 
Let's next look at the second nested loop:

    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }

It is hard to see unless you examine the values of j as the loop progresses, but the last 4 lines in the body of the outer loop implement the OEIS A007814 integer sequence in j (0,1,0,2,0,1,0,3,0,1,0,2,0,1,0,4,...). The number of iterations in this loop is 2^(rows-1) - 1. This part of the sequence is symmetric, and implements a binary tree of height rows-1:

               4
       3               3
   2       2       2       2     (Read horizontally)
 1   1   1   1   1   1   1   1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

It turns out that if we loop over i = 1 .. 2^(rows-1)-1, then r is the number of zero low bits in i. GCC provides a __builtin_ctz() built-in function that computes exactly this. (Note that __builtin_ctz(0) yields an undefined value; so don't do that, even if it happens to produce a specific value on your computer.)
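
In C, that means the f[] bookkeeping can be replaced by a plain counter and a trailing-zero count; a minimal sketch using the question's variable names (k is a new counter; __builtin_ctzl() is the unsigned-long variant):

    unsigned long k;
    for (k = 1; k < (1UL << (n - 1)); k++) {
        j = __builtin_ctzl(k);  /* k is never zero here, so this is well defined */
        /* ... body of the original while loop, without the f[] updates ... */
    }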

The inner loop subtracts the complex values on row j of the matrix, scaled by 2*delta[j], from vector v[]. It also calculates the product of the complex entries in vector v[] (after the subtraction) into variable prod.

After the inner loop, delta[j] is negated, as is scale factor s. The value of variable prod, scaled by s, is added to p.

After the loop, the final (complex) result is p divided by 2^(rows-1). This is better done using the C99 ldexp() function (separately on the real and imaginary parts).
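
For instance, a sketch with the float variants (ldexpf(), crealf(), cimagf()), using p and n from the question:

    /* p / 2^(n-1), done exactly by adjusting the exponents */
    complex float result = ldexpf(crealf(p), -(n - 1))
                         + ldexpf(cimagf(p), -(n - 1)) * I;
    printf("%f + i%f\n", crealf(result), cimagf(result));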

We can therefore rewrite the second nested loop, in pseudocode, as

s = 1.0f

For k = 1 .. 2^(rows-1)-1, inclusive:
    r = __builtin_ctz(k), i.e. number of least
                          significant bits that
                          are zero in k

    Subtract the complex values on row r of matrix M,
        scaled by delta[r], from vector v[]

    prod = the product of (complex) elements in vector v[]

    Negate scale factor s (changing its sign)

    Add prod times s to result p

In my experience, it is best to split the real and imaginary parts into separate vectors and matrices. Consider the following definitions:

typedef struct {
    float  *real;
    float  *imag;
    size_t  floats_per_row;  /* Often 'stride' */
    size_t  rows;
    size_t  cols;
} complex_float_matrix;

/* Set an array of floats to a specific value */
void float_set(float *, float, size_t);

/* Copy an array of floats */
void float_copy(float *, const float *, size_t);

/* malloc() vector-aligned memory; size in floats */
float *float_malloc(size_t);

/* Elementwise addition of floats */
void float_add(float *, const float *, size_t);

/* Elementwise addition of floats, scaled by a real scale factor */
void float_add_scaled(float *, const float *, float, size_t);

/* Complex float product, separate real and imag arrays */
complex float complex_float_product(const float *, const float *, size_t);

All of the above are easily vectorized, as long as float_malloc() yields sufficiently aligned pointers (and the compiler is told that, via e.g. the GCC function attribute __attribute__ ((__assume_aligned__ (BYTES_IN_VECTOR)))), and the floats_per_row member in the matrix is a multiple of the number of floats in a vector.

(I do not know whether GCC can automatically vectorize the above functions, but I do know that they can be vectorized "by hand" using the GCC vector extensions.)
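
For concreteness, here is a plain-C sketch of float_add() with such an alignment hint (the 32-byte alignment constant is my assumption; whether GCC auto-vectorizes it depends on the version and flags, as noted above):

    #include <stddef.h>

    #define VEC_ALIGN 32  /* assumed vector alignment, in bytes */

    /* Elementwise dst[i] += src[i]. The restrict qualifiers and the
       alignment hint are what give GCC a chance to vectorize this. */
    void float_add(float *restrict dst, const float *restrict src, size_t count)
    {
        float       *d = __builtin_assume_aligned(dst, VEC_ALIGN);
        const float *s = __builtin_assume_aligned(src, VEC_ALIGN);
        size_t       i;

        for (i = 0; i < count; i++)
            d[i] += s[i];
    }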

With the above, the entire interesting part of the code, in pseudo-C, becomes

complex float permanent(const complex_float_matrix *m)
{
    float         *v_real, *v_imag;
    float         *scale;   /* OP used 'delta' */
    complex float  result;  /* OP used 'p'     */
    complex float  product; /* OP used 'prod'  */
    float          coeff = 1.0f; /* OP used 's' */
    size_t         i = 1 << (m->rows - 1);
    size_t         r;

    if (!m || m->cols < 1 || m->rows < 1 || !i) {
        /* TODO: No input matrix, or too many rows in input matrix */
    }

    v_real = float_malloc(m->cols);
    v_imag = float_malloc(m->cols);
    scale  = float_malloc(m->rows - 1);
    if (!v_real || !v_imag || !scale) {
        free(scale);
        free(v_imag);
        free(v_real);
        /* TODO: Out of memory error */
    }

    float_set(scale, 2.0f, m->rows - 1);

    /* Sum matrix rows to v. */

    float_copy(v_real, m->real, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_real, m->real + r * m->floats_per_row, m->cols);

    float_copy(v_imag, m->imag, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_imag, m->imag + r * m->floats_per_row, m->cols);

    result = complex_float_product(v_real, v_imag, m->cols);

    while (--i) {
        r = __builtin_ctz(i);

        scale[r] = -scale[r];

        float_add_scaled(v_real, m->real + r * m->floats_per_row, scale[r], m->cols);
        float_add_scaled(v_imag, m->imag + r * m->floats_per_row, scale[r], m->cols);

        product = complex_float_product(v_real, v_imag, m->cols);

        coeff = -coeff;

        result += coeff * product;
    }

    free(scale);
    free(v_imag);
    free(v_real);

    return result;
}

At this point, I would personally implement the above without vectorization, then test it extensively, until I was sure it works correctly.

Then, I'd examine the GCC assembly output (-S) to see if it can sufficiently vectorize the individual operations (the functions I listed earlier).

Hand-vectorizing the functions using GCC vector extensions is quite straightforward. Declaring a float vector is trivial:

typedef  float  vec2f __attribute__((vector_size (8), aligned (8)));  /* 64 bits; MMX, 3DNow! */
typedef  float  vec4f __attribute__((vector_size (16), aligned (16))); /* 128 bits; SSE */
typedef  float  vec8f __attribute__((vector_size (32), aligned (32))); /* 256 bits; AVX, AVX2 */
typedef  float  vec16f __attribute__((vector_size (64), aligned (64))); /* 512 bits; AVX512F */

The individual components in each vector can be addressed using the array notation (v[0] and v[1] for vec2f v;). GCC can do basic operations on entire vectors element-wise; we only really need addition and multiplication here. Horizontal operations (operations that apply between elements in the same vector) should be avoided, and instead elements reordered.
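
For example, with the vec4f type above:

    vec4f a = { 1.0f, 2.0f, 3.0f, 4.0f };
    vec4f b = { 5.0f, 6.0f, 7.0f, 8.0f };
    vec4f c = a * b + a;   /* element-wise: { 6.0f, 14.0f, 24.0f, 36.0f } */
    float x = c[0];        /* individual lanes via array notation */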

GCC will generate working code for the above vector sizes even on architectures without such vectorization, but the resulting code may be slow. (GCC versions up to 5.4 at least will generate a lot of unnecessary moves, typically to stack and back.)

The memory allocated for a vector needs to be sufficiently aligned. malloc() does not provide sufficiently aligned memory in all cases; you ought to use posix_memalign() instead. The aligned attribute can be used to increase the alignment GCC uses for the vector type, when allocating one locally or statically. In a matrix, you need to ensure that rows start at a sufficiently aligned boundary; that's the reason I have the floats_per_row variable in the structure.
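
A possible float_malloc() along these lines (a sketch only; the 32-byte alignment is my assumption, and posix_memalign() requires a POSIX system):

    #include <stdlib.h>

    /* Declaration carrying the attribute mentioned earlier, so callers benefit too. */
    float *float_malloc(size_t count) __attribute__ ((__assume_aligned__ (32)));

    /* Allocate 'count' floats with vector-friendly alignment; NULL on failure. */
    float *float_malloc(size_t count)
    {
        void *ptr = NULL;
        if (posix_memalign(&ptr, 32, count * sizeof (float)))
            return NULL;
        return ptr;
    }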

In cases where the number of elements in a vector (or row) is large, but not a multiple of the number of floats in a vector, you should pad the vector with "inert" values -- values that do not affect the result, like 0.0f for addition and subtraction, and 1.0f for multiplication.
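
For example, rounding a count up to a multiple of 4 floats (for vec4f) is just:

    size_t padded = (count + 3) & ~(size_t) 3;  /* e.g. 10 becomes 12 */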

At least on x86 and x86-64, GCC will generate better code for loops using pointers only. For example, this

void float_set(float *array, float value, size_t count)
{
    float *const limit = array + count;
    while (array < limit)
        *(array++) = value;
}

yields better code than

void float_set(float *array, float value, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        array[i] = value;
}

or

void float_set(float *array, float value, size_t count)
{
    while (count--)
        *(array++) = value;
}

(which tend to produce similar code). Personally, I'd implement it as

void float_set(float *array, float value, size_t count)
{
    if (!((uintptr_t)array & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)array, 8);
        uint64_t        val;

        __builtin_memcpy(&val, &value, 4);
        __builtin_memcpy(4 + (char *)&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t        val;

        __builtin_memcpy(&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    }
}

and float_copy() as

void float_copy(float *target, const float *source, size_t count)
{
    if (!((uintptr_t)source & 7) &&
        !((uintptr_t)target & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(target + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)target, 8);
        uint64_t       *src = (uint64_t *)__builtin_assume_aligned((void *)source, 8);

        while (ptr < end)
            *(ptr++) = *(src++);

    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(target + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)target, 4);
        uint32_t       *src = (uint32_t *)__builtin_assume_aligned((void *)source, 4);

        while (ptr < end)
            *(ptr++) = *(src++);
    }
}

or something close to that.

The hardest to vectorize is complex_float_product(). If you fill in the unused elements in the final vector with 1.0f for the real part and 0.0f for the imaginary part, you can easily compute the complex product for each vector. Remember that

      (a + b i) × (c + d i) = (a c - b d) + (a d + b c) i

The hard part here is to efficiently calculate the complex product for the elements in the vector. Fortunately, that part is not at all critical for overall performance (except for very short vectors, or matrices with very few columns), so it should not matter much in practice.

(In short, the "hard" part is to find the way to reorder the elements to maximally use the packed vector multiplication, and not need so many shuffles/moves to slow it down.)
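
As one possible shape (a sketch only, assuming the vec4f type above, a count that is a nonzero multiple of 4, and 16-byte-aligned arrays padded as described): each vector lane accumulates the product of every fourth element, and the four per-lane partial products are combined with ordinary scalar complex multiplications at the end.

    /* Product of 'count' complex numbers stored as separate real/imag arrays. */
    complex float complex_float_product(const float *real, const float *imag,
                                        size_t count)
    {
        const vec4f  *re = (const vec4f *)__builtin_assume_aligned(real, 16);
        const vec4f  *im = (const vec4f *)__builtin_assume_aligned(imag, 16);
        vec4f         acc_re = { 1.0f, 1.0f, 1.0f, 1.0f };
        vec4f         acc_im = { 0.0f, 0.0f, 0.0f, 0.0f };
        complex float result = 1.0f;
        size_t        i, vecs = count / 4;

        for (i = 0; i < vecs; i++) {
            /* (a + b i) × (c + d i) = (a c - b d) + (a d + b c) i, per lane */
            vec4f new_re = acc_re * re[i] - acc_im * im[i];
            vec4f new_im = acc_re * im[i] + acc_im * re[i];
            acc_re = new_re;
            acc_im = new_im;
        }

        /* Horizontal step: multiply the four per-lane partial products together. */
        for (i = 0; i < 4; i++)
            result *= acc_re[i] + acc_im[i] * I;

        return result;
    }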

For the float_add_scaled() function, you should create a vector filled with the scale factor; something like the following,

void float_add_scaled(float *array, const float *source, float scale, size_t count)
{
    const vec4f  coeff = { scale, scale, scale, scale };
    vec4f       *ptr = (vec4f *)__builtin_assume_aligned((void *)array, 16);
    vec4f *const end = (vec4f *)__builtin_assume_aligned((void *)(array + count), 16);
    const vec4f *src = (vec4f *)__builtin_assume_aligned((void *)source, 16);

    while (ptr < end)
        *(ptr++) += *(src++) * coeff;
}

if we ignore the alignment and size checks, and the fallback implementation.
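
As a usage note, the subtraction in the original inner loop then becomes an add with a negative scale factor, e.g. (row_real being a hypothetical pointer to the real parts of the current matrix row):

    float_add_scaled(v_real, row_real, -2.0f, cols);  /* v_real[i] -= 2 * row_real[i] */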

Rimple answered 24/1, 2017 at 18:30 Comment(0)

I think I might have figured it out. After a lot of trial and error, it became clear that gcc's built-in vectorization optimizations are somewhat hard-coded and it doesn't 'understand' complex numbers properly. I made some changes in the code and got your performance-sensitive inner loop to vectorize, confirmed by the gcc output (though I am not sure the result is computationally equivalent to what you want). While my understanding of what you want the code to do is limited, the finding is that it works fine if you compute the real and imaginary parts separately. Have a look:

    float t_r = 0.0, t_im = 0.0; // two new temporaries  
    while (j < n-1) {
        prod = 1.;
        for (i = 0; i < n; i++) {
// fill the temps after subtraction from V to avoid stmt error
            t_r = creal (v[i]) - (2. * creal(delta[j]) * creal (M[j][i]));
            t_im = cimag(v[i]) - (2. * cimag(delta[j]) * cimag (M[j][i])) * I;
            //v[i] = 2.*delta[j]*M[j][i];
            v[i] = t_r + t_im; // sum of real and img
            prod *= v[i];
        }
        delta[j] = -delta[j];
        s = -s;            
        p += s*prod;
        f[0] = 0;
        f[j] = f[j+1];
        f[j+1] = j+1;
        j = f[0];
    }
Menard answered 15/1, 2017 at 20:52 Comment(12)
This doesn't help as far as I can tell. I think the problem is in the prod *= v[i]; part which gcc doesn't know how to vectorize.Schnell
I forgot to mention it in my answer: the 'prod' part vectorizes just fine if you split the real and imaginary parts. I tested it after I wrote the comment (I can confirm seeing two LOOP VECTORIZED statements, one for v[i] and another for prod). I think you'd be better off using a struct to hold the real and imag parts yourself for better control over the multiplication.Menard
I think it might just be vectorizing the part before prod *= v[i];. I tried splitting it into two loops and for the second loop for (i = 0; i < n; i++) { prod *= v[i]; } it says it has not vectorized it. It seems you have succeeded in persuading gcc to half vectorize the loop.Schnell
For reproducibility, this is the version of the code with the loop split into two bpaste.net/show/84ea4b8b518e. This is the output of gcc bpaste.net/show/f40f95567536 . The loop with prod *= v[i] starts at line 55. The final message is "test.c:55:2: note: not vectorized: unsupported use in stmt.".Schnell
Hmm, I may have fixed it. If you convert the whole inner loop to a function, you can vectorize both the v[i] and prod calculations. Check out: bpaste.net/show/5ed62a3275eb We return prod from the function since we are only concerned with its end value, which in turn keeps gcc happy and it vectorizes the calculation.Menard
Unfortunately your bpaste is broken now (you can select the length of time you want a link to last and the default is 1 day).Schnell
I think this has just moved the inevitable problem to a different part of the code. See e.g. godbolt.org/g/WdMvIh and search for mulss. I think the only way to do this with gcc is to use its vector extensions (gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), sadly.Schnell
@eleanora The loop looks vectorized. But since the compiler doesn't know the trip-count, it needs to generate scalar code to handle the edge cases. That's a normal pattern for compiler-vectorization.Antiscorbutic
On the other hand, there is some unnecessary intermediate promotion to double since the literal 2. is a double.Antiscorbutic
Oh I think you are right. We should check it actually computes the same thing. Do you want to add your code as an answer? The double promotion should be fixed by using 2 instead of 2., right?Schnell
@eleanora Just to clarify, even though GCC did manage to vectorize it, it's extremely poorly vectorized. To repeat an earlier comment, if you want it vectorized well, you cannot use the complex type. You need split the real and imaginary parts into separate arrays. There are other approaches if you're willing to manually vectorize, but either way you need to change your data layout.Antiscorbutic
@Antiscorbutic Would you be able to show how to vectorize the matrix M?Schnell

Optimizer logs indicate clearly

Unknown alignment for access:...

when trying to vectorize

printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d])); //24
v[i] += M[j][i]; //38
p *= v[i]; //40
v[i] -= 2.*delta[j]*M[j][i]; //48

It really seems linked to alignment: you need to force the alignment of your arrays M, delta and v in memory.

Auto-vectorization in GCC

Handling of aligned memory accesses only (do not attempt to vectorize loops that contain unaligned accesses)

As mentioned in the previous comments, I would suggest you use posix_memalign for that purpose.

complex float * restrict delta;
posix_memalign((void **) &delta, 64, n * sizeof *delta); //to adapt
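
The same applies to v and, as raised in the comments, the alignment can then also be asserted to the compiler; a sketch:

    complex float *restrict v;
    if (posix_memalign((void **) &v, 64, n * sizeof *v)) {
        /* handle allocation failure */
    }
    v = __builtin_assume_aligned(v, 64);  /* tell GCC about the alignment */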

What is your target environment? (OS, CPU)

Please have a look at data-alignment-to-assist-vectorization

Carton answered 15/1, 2017 at 18:43 Comment(6)
That certainly made some difference but it still fails to vectorize sadly. See bpaste.net/show/561f6b8d2e21 . Lines 47 and 48 are still the relevant ones.Schnell
The target OS is ubuntu and the CPU is the amd fx 8350. This is recognized by gcc as -march=bdver2 .Schnell
If you separate the loop at lines 47--50 into two parts it seems the problem is in fact entirely in the prod *= v[i]; part. gcc doesn't seem to know how to vectorize a function which takes the product of an array of complex numbers.Schnell
prod *= v[i]; breaks the vectorization because prod is not a vector.Carton
OK but the Intel C compiler seems to be able to vectorize the product of an array. See e.g. godbolt.org/g/q3neAUSchnell
Indeed, the Intel C compiler looks to be able to vectorize and unroll the multiplication sequence. I suppose the Intel C compiler can manage more complex SLP patterns than gcc.Carton
