Let's first examine the code in detail. We have
complex float M[rows][cols];
complex float v[cols];
float delta[rows];
complex float p = 1.0f;
float s = 1.0f;
While delta[] is of complex float type in OP's code, it only ever contains -1.0f or +1.0f. (Furthermore, the calculations could be simplified if it were -2.0f or +2.0f instead.) For that reason, I defined it as real and not complex.
Similarly, OP defines s as int, but effectively uses it only as -1.0f and +1.0f in the calculations. That's why I defined it explicitly as float.
I omit the f array, because there is a trivial way of avoiding it entirely.
The first loop of the interesting part of the code,
for (i = 0; i < n; i++) {
    v[i] = 0;
    for (j = 0; j < n; j++) {
        v[i] += M[j][i];
    }
    p *= v[i];
    delta[i] = 1;
}
performs several things. It initializes all elements in the delta[] array to 1; it could (and probably should) be split into a separate loop. Since the outer loop is increasing in i, p will be the product of the elements in v; it could be split off into a separate loop, too. Because the inner loop sums all the elements in column i to v[i], the outer and inner loops together simply sum each row, as a vector, into vector v.
We can thus rewrite the above, in pseudocode, as
Copy first row of matrix M to vector v
For r = 1 .. rows-1:
    Add complex values in row r of matrix M to vector v
p = product of complex elements in vector v
delta = 1.0f, 1.0f, 1.0f, .., 1.0f, 1.0f
Let's next look at the second nested loop:
j = 0;
while (j < n-1) {
    prod = 1.;
    for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
    }
    delta[j] = -delta[j];
    s = -s;
    p += s*prod;
    f[0] = 0;
    f[j] = f[j+1];
    f[j+1] = j+1;
    j = f[0];
}
It is hard to see unless you examine the values of j as the loop progresses, but the last four lines in the body of the outer loop implement the OEIS A007814 integer sequence in j (0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0, 4, ...). The number of iterations in this loop is 2^(rows-1) - 1. This part of the sequence is symmetric, and implements a binary tree of height rows-1:
                4
        3               3
    2       2       2       2        (Read horizontally)
  1   1   1   1   1   1   1   1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
It turns out that if we loop over i = 1 .. 2^(rows-1) - 1, then r is the number of zero low bits in i. GCC provides the __builtin_ctz() built-in function that computes exactly this. (Note that __builtin_ctz(0) yields an undefined value; so don't do that, even if it happens to produce a specific value on your computer.)
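For example, this standalone snippet (a quick sketch, not from OP's code) prints the start of that sequence:

#include <stdio.h>

int main(void)
{
    /* For i = 1 .. 15, __builtin_ctz(i) yields the A007814 values
       0 1 0 2 0 1 0 3 0 1 0 2 0 1 0, matching j above. */
    for (unsigned int i = 1U; i < 16U; i++)
        printf("%d ", __builtin_ctz(i));
    putchar('\n');
    return 0;
}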
The inner loop subtracts the complex values on row j of the matrix, scaled by 2*delta[j], from vector v[]. It also calculates the product of the complex entries in vector v[] (after the subtraction) into variable prod.
After the inner loop, delta[j] is negated, as is scale factor s. The value of variable prod, scaled by s, is added to p.
After the loop, the final (complex) result is p divided by 2^(rows-1). This is better done using the C99 ldexp() family of functions, separately on the real and imaginary parts.
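For example, a minimal sketch (using rows for the row count, and the float variants crealf(), cimagf(), and ldexpf() from <complex.h> and <math.h>):

/* Divide p by 2^(rows-1) exactly; ldexpf() only adjusts the exponent,
   so the scaling itself loses no precision. */
p = ldexpf(crealf(p), 1 - (int)rows)
  + ldexpf(cimagf(p), 1 - (int)rows) * I;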
We can therefore rewrite the second nested loop, in pseudocode, as
s = 1.0f
For k = 1 .. 2^(rows-1) - 1, inclusive:
    r = __builtin_ctz(k), i.e. the number of zero
        least significant bits in k
    Subtract the complex values on row r of matrix M,
        scaled by delta[r], from vector v[]
    prod = the product of (complex) elements in vector v[]
    Negate scale factor s (changing its sign)
    Add prod times s to result p
In my experience, it is best to split the real and imaginary parts into separate vectors and matrices. Consider the following definitions:
typedef struct {
    float *real;
    float *imag;
    size_t floats_per_row;  /* Often 'stride' */
    size_t rows;
    size_t cols;
} complex_float_matrix;
/* Set an array of floats to a specific value */
void float_set(float *, float, size_t);
/* Copy an array of floats */
void float_copy(float *, const float *, size_t);
/* malloc() vector-aligned memory; size in floats */
float *float_malloc(size_t);
/* Elementwise addition of floats */
void float_add(float *, const float *, size_t);
/* Elementwise addition of floats, scaled by a real scale factor */
void float_add_scaled(float *, const float *, float, size_t);
/* Complex float product, separate real and imag arrays */
complex float complex_float_product(const float *, const float *, size_t);
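Before vectorizing, scalar reference versions are straightforward; for example (minimal sketches matching the two addition declarations above):

void float_add(float *target, const float *source, size_t count)
{
    while (count--)
        *(target++) += *(source++);
}

void float_add_scaled(float *target, const float *source,
                      float scale, size_t count)
{
    while (count--)
        *(target++) += *(source++) * scale;
}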
All of the above are easily vectorized, as long as float_malloc() yields sufficiently aligned pointers (and the compiler is told so, via e.g. the GCC function attribute __attribute__ ((__assume_aligned__ (BYTES_IN_VECTOR)))), and the floats_per_row member in the matrix is a multiple of the number of floats in a vector.
(I do not know whether GCC can automatically vectorize the above functions, but I do know that they can be vectorized "by hand" using the GCC vector extensions.)
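For example, on POSIX systems float_malloc() might be a thin wrapper around posix_memalign(); BYTES_IN_VECTOR is an assumed compile-time constant naming the vector width in bytes:

#include <stdlib.h>

#define BYTES_IN_VECTOR 32  /* assumed: one AVX vector */

/* Sketch: posix_memalign() guarantees the alignment that plain
   malloc() does not. 'size' is in floats. */
__attribute__ ((__assume_aligned__ (BYTES_IN_VECTOR)))
float *float_malloc(size_t size)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, BYTES_IN_VECTOR, size * sizeof (float)))
        return NULL;
    return ptr;
}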
With the above, the entire interesting part of the code, in pseudo-C, becomes
complex float permanent(const complex_float_matrix *m)
{
    float *v_real, *v_imag;
    float *scale;           /* OP used 'delta' */
    complex float result;   /* OP used 'p' */
    complex float product;  /* OP used 'prod' */
    float coeff = 1.0f;     /* OP used 's' */
    size_t i, r;

    if (!m || m->cols < 1 || m->rows < 1 ||
        m->rows > CHAR_BIT * sizeof i) {  /* CHAR_BIT is from <limits.h> */
        /* TODO: No input matrix, or too many rows in input matrix */
    }

    i = (size_t)1 << (m->rows - 1);

    v_real = float_malloc(m->cols);
    v_imag = float_malloc(m->cols);
    scale = float_malloc(m->rows - 1);
    if (!v_real || !v_imag || !scale) {
        free(scale);
        free(v_imag);
        free(v_real);
        /* TODO: Out of memory error */
    }

    float_set(scale, 2.0f, m->rows - 1);

    /* Sum matrix rows to v. */
    float_copy(v_real, m->real, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_real, m->real + r * m->floats_per_row, m->cols);
    float_copy(v_imag, m->imag, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_imag, m->imag + r * m->floats_per_row, m->cols);

    result = complex_float_product(v_real, v_imag, m->cols);

    while (--i) {
        r = __builtin_ctz(i);
        scale[r] = -scale[r];
        float_add_scaled(v_real, m->real + r * m->floats_per_row, scale[r], m->cols);
        float_add_scaled(v_imag, m->imag + r * m->floats_per_row, scale[r], m->cols);
        product = complex_float_product(v_real, v_imag, m->cols);
        coeff = -coeff;
        result += coeff * product;
    }

    free(scale);
    free(v_imag);
    free(v_real);

    /* Divide by 2^(rows-1), exactly, via ldexpf() on the real and
       imaginary parts. */
    return ldexpf(crealf(result), 1 - (int)m->rows)
         + ldexpf(cimagf(result), 1 - (int)m->rows) * I;
}
At this point, I would personally implement the above without vectorization, then test it extensively, until I was sure it works correctly.
Then, I'd examine the GCC assembly output (-S) to see whether it can sufficiently vectorize the individual operations (the functions I listed earlier).
Hand-vectorizing the functions using GCC vector extensions is quite straightforward. Declaring a float vector is trivial:
typedef float vec2f __attribute__((vector_size (8), aligned (8))); /* 64 bits; MMX, 3DNow! */
typedef float vec4f __attribute__((vector_size (16), aligned (16))); /* 128 bits; SSE */
typedef float vec8f __attribute__((vector_size (32), aligned (32))); /* 256 bits; AVX, AVX2 */
typedef float vec16f __attribute__((vector_size (64), aligned (64))); /* 512 bits; AVX512F */
The individual components in each vector can be addressed using array notation (v[0] and v[1] for vec2f v;). GCC can do basic operations on entire vectors element-wise; we only really need addition and multiplication here. Horizontal operations (operations that apply between elements in the same vector) should be avoided; instead, the elements should be reordered.
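For illustration (a sketch using the vec4f type above; not part of OP's code):

vec4f a = { 1.0f, 2.0f, 3.0f, 4.0f };
vec4f b = { 5.0f, 6.0f, 7.0f, 8.0f };
vec4f sum  = a + b;   /* element-wise: { 6.0f, 8.0f, 10.0f, 12.0f } */
vec4f prod = a * b;   /* element-wise: { 5.0f, 12.0f, 21.0f, 32.0f } */
float x = prod[0];    /* array notation on a component: 5.0f */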
GCC will generate working code for the above vector sizes even on architectures without such vectorization, but the resulting code may be slow. (GCC versions up to 5.4 at least will generate a lot of unnecessary moves, typically to stack and back.)
The memory allocated for a vector needs to be sufficiently aligned. malloc() does not provide sufficiently aligned memory in all cases; you ought to use posix_memalign() instead. The aligned attribute can be used to increase the alignment GCC uses for the vector type, when allocating one locally or statically. In a matrix, you need to ensure that rows start at a sufficiently aligned boundary; that's the reason for the floats_per_row member in the structure.
In cases where the number of elements in a vector (or row) is large, but not a multiple of the number of floats in a hardware vector, you should pad the vector with "inert" values -- values that do not affect the result, like 0.0f for addition and subtraction, and 1.0f for multiplication.
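For example, with an assumed 8-float (AVX) vector and hypothetical row buffers row_real and row_imag:

/* Round the row length up to a multiple of 8 floats, then fill the
   tail with values that are inert for addition. */
size_t floats_per_row = (cols + 7) & ~(size_t)7;
float_set(row_real + cols, 0.0f, floats_per_row - cols);
float_set(row_imag + cols, 0.0f, floats_per_row - cols);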
At least on x86 and x86-64, GCC will generate better code for loops using pointers only. For example, this
void float_set(float *array, float value, size_t count)
{
    float *const limit = array + count;
    while (array < limit)
        *(array++) = value;
}
yields better code than
void float_set(float *array, float value, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        array[i] = value;
}
or
void float_set(float *array, float value, size_t count)
{
    while (count--)
        *(array++) = value;
}
(which tend to produce similar code). Personally, I'd implement it as
#include <stdint.h>  /* uintptr_t, uint32_t, uint64_t */

void float_set(float *array, float value, size_t count)
{
    if (!((uintptr_t)array & 7) && !(count & 1)) {
        /* Destination is 8-byte aligned and count is even:
           store two floats at a time via 64-bit writes. */
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t *ptr = (uint64_t *)__builtin_assume_aligned((void *)array, 8);
        uint64_t val;
        __builtin_memcpy(&val, &value, 4);
        __builtin_memcpy(4 + (char *)&val, &value, 4);
        while (ptr < end)
            *(ptr++) = val;
    } else {
        /* Fallback: plain 32-bit stores. */
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t val;
        __builtin_memcpy(&val, &value, 4);
        while (ptr < end)
            *(ptr++) = val;
    }
}
and float_copy() as
void float_copy(float *target, const float *source, size_t count)
{
    if (!((uintptr_t)source & 7) &&
        !((uintptr_t)target & 7) && !(count & 1)) {
        /* Both pointers 8-byte aligned and count even:
           copy two floats at a time via 64-bit accesses. */
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(target + count), 8);
        uint64_t *ptr = (uint64_t *)__builtin_assume_aligned((void *)target, 8);
        const uint64_t *src = (const uint64_t *)__builtin_assume_aligned((const void *)source, 8);
        while (ptr < end)
            *(ptr++) = *(src++);
    } else {
        /* Fallback: 32-bit accesses. */
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(target + count), 4);
        uint32_t *ptr = (uint32_t *)__builtin_assume_aligned((void *)target, 4);
        const uint32_t *src = (const uint32_t *)__builtin_assume_aligned((const void *)source, 4);
        while (ptr < end)
            *(ptr++) = *(src++);
    }
}
or something close to that.
The hardest to vectorize is complex_float_product(). If you fill in the unused elements in the final vector with 1.0f for the real part and 0.0f for the imaginary part, you can easily compute the complex product for each vector. Remember that
(a + b i) × (c + d i) = (a c - b d) + (a d + b c) i
The hard part here is to efficiently calculate the complex product for the elements in the vector. Fortunately, that part is not at all critical for overall performance (except for very short vectors, or matrices with very few columns), so it should not matter much in practice.
(In short, the "hard" part is to find the way to reorder the elements to maximally use the packed vector multiplication, and not need so many shuffles/moves to slow it down.)
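For reference, a plain scalar version might look like this (a sketch matching the earlier declaration; the vectorized variant reorders the same computation):

/* Multiply together 'count' complex numbers stored as separate real
   and imaginary arrays; requires <complex.h>. */
complex float complex_float_product(const float *real, const float *imag,
                                    size_t count)
{
    float re = 1.0f, im = 0.0f;  /* running product, starts at 1 + 0i */
    while (count--) {
        /* (re + im i)(c + d i) = (re c - im d) + (re d + im c) i */
        const float c = *(real++), d = *(imag++);
        const float t = re * c - im * d;
        im = re * d + im * c;
        re = t;
    }
    return re + im * I;
}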
For the float_add_scaled() function, you should create a vector filled with the scale factor; something like the following,
void float_add_scaled(float *array, const float *source, float scale, size_t count)
{
    const vec4f coeff = { scale, scale, scale, scale };
    vec4f *ptr = (vec4f *)__builtin_assume_aligned((void *)array, 16);
    vec4f *const end = (vec4f *)__builtin_assume_aligned((void *)(array + count), 16);
    const vec4f *src = (const vec4f *)__builtin_assume_aligned((const void *)source, 16);
    while (ptr < end)
        *(ptr++) += *(src++) * coeff;
}
if we ignore the alignment and size checks, and the fallback implementation.
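As a quick sanity check of the whole algorithm, here is a hypothetical test harness, assuming the helper sketches above (the vectorized variants would additionally need padded, aligned rows). The permanent of [[1,2],[3,4]] is 1*4 + 2*3 = 10:

#include <stdio.h>
#include <complex.h>

int main(void)
{
    complex_float_matrix m;
    m.rows = m.cols = 2;
    m.floats_per_row = 8;  /* padded to an assumed 8-float vector */
    m.real = float_malloc(m.rows * m.floats_per_row);
    m.imag = float_malloc(m.rows * m.floats_per_row);
    if (!m.real || !m.imag)
        return 1;
    float_set(m.real, 0.0f, m.rows * m.floats_per_row);
    float_set(m.imag, 0.0f, m.rows * m.floats_per_row);
    m.real[0] = 1.0f;  m.real[1] = 2.0f;  /* row 0: 1, 2 */
    m.real[8] = 3.0f;  m.real[9] = 4.0f;  /* row 1: 3, 4 */
    const complex float p = permanent(&m);
    printf("%g %+gi\n", crealf(p), cimagf(p));  /* expect 10 +0i */
    return 0;
}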