While trying to compute the eigenvalues and eigenvectors of several matrices in parallel, I found that LAPACK's dsyevr function does not seem to be thread-safe.
- Is this known to anyone?
- Is there something wrong with my code? (see minimal example below)
- Any suggestion of an eigensolver implementation for dense matrices that is not too slow and is definitely thread-safe is welcome.
Here is a minimal code example in C which demonstrates the problem:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>
#include <omp.h>
#include "lapacke.h"
#define M 8 /* number of matrices to be diagonalized */
#define N 1000 /* size of each matrix (real, symmetric) */
typedef double vec_t[N]; /* type for length N vector */
typedef double mtx_t[N][N]; /* type for N x N matrices */
void
init(int m, int n, mtx_t *A){
    /* init m symmetric n x n matrices */
    srand(0);
    for (int i = 0; i < m; ++i){
        for (int j = 0; j < n; ++j){
            for (int k = 0; k <= j; ++k){
                A[i][j][k] = A[i][k][j] = (rand()%100-50) / (double)100.;
            }
        }
    }
}
void
solve(int n, double *A, double *E, double *Q){
    /* diagonalize one matrix */
    double tol = 0.;
    int *isuppz = malloc(2*n*sizeof(int)); assert(isuppz);
    int k;
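    /* dsyevr arguments: 'V' = compute eigenvectors, 'A' = all eigenvalues,
       'L' = use the lower triangle; vl/vu/il/iu are ignored for range 'A';
       k receives the number of eigenvalues found */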
    int info = LAPACKE_dsyevr(LAPACK_COL_MAJOR, 'V', 'A', 'L',
                              n, A, n, 0., 0., 0, 0, tol, &k, E, Q, n, isuppz);
    assert(!info);
    free(isuppz);
}
void
s_solve(int m, int n, mtx_t *A, vec_t *E, mtx_t *Q){
    /* serial solve */
    for (int i = 0; i < m; ++i){
        solve(n, (double *)A[i], (double *)E[i], (double *)Q[i]);
    }
}
void
p_solve(int m, int n, mtx_t *A, vec_t *E, mtx_t *Q, int nt){
    /* parallel solve */
    int i;
    #pragma omp parallel for schedule(static) num_threads(nt) \
            private(i) \
            shared(m, n, A, E, Q)
    for (i = 0; i < m; ++i){
        solve(n, (double *)A[i], (double *)E[i], (double *)Q[i]);
    }
}
void
analyze_results(int m, int n, vec_t *E0, vec_t *E1, mtx_t *Q0, mtx_t *Q1){
    /* compare eigenvalues */
    printf("\nmax. abs. diff. of eigenvalues:\n");
    for (int i = 0; i < m; ++i){
        double t, dE = 0.;
        for (int j = 0; j < n; ++j){
            t = fabs(E0[i][j] - E1[i][j]);
            if (t > dE) dE = t;
        }
        printf("%i: %5.1e\n", i, dE);
    }
    /* compare eigenvectors (ignoring sign) */
    printf("\nmax. abs. diff. of eigenvectors (ignoring sign):\n");
    for (int i = 0; i < m; ++i){
        double t, dQ = 0.;
        for (int j = 0; j < n; ++j){
            for (int k = 0; k < n; ++k){
                t = fabs(fabs(Q0[i][j][k]) - fabs(Q1[i][j][k]));
                if (t > dQ) dQ = t;
            }
        }
        printf("%i: %5.1e\n", i, dQ);
    }
}
int main(void){
    mtx_t *A = malloc(M*N*N*sizeof(double)); assert(A);
    init(M, N, A);
    /* allocate space for matrices, eigenvalues and eigenvectors */
    mtx_t *s_A = malloc(M*N*N*sizeof(double)); assert(s_A);
    vec_t *s_E = malloc(M*N*sizeof(double)); assert(s_E);
    mtx_t *s_Q = malloc(M*N*N*sizeof(double)); assert(s_Q);
    /* copy initial matrix */
    memcpy(s_A, A, M*N*N*sizeof(double));
    /* solve serial */
    s_solve(M, N, s_A, s_E, s_Q);
    /* allocate space for matrices, eigenvalues and eigenvectors */
    mtx_t *p_A = malloc(M*N*N*sizeof(double)); assert(p_A);
    vec_t *p_E = malloc(M*N*sizeof(double)); assert(p_E);
    mtx_t *p_Q = malloc(M*N*N*sizeof(double)); assert(p_Q);
    /* copy initial matrix */
    memcpy(p_A, A, M*N*N*sizeof(double));
    /* use one thread, to check that the algorithm is deterministic */
    p_solve(M, N, p_A, p_E, p_Q, 1);
    analyze_results(M, N, s_E, p_E, s_Q, p_Q);
    /* copy initial matrix */
    memcpy(p_A, A, M*N*N*sizeof(double));
    /* use several threads, and see what happens */
    p_solve(M, N, p_A, p_E, p_Q, 4);
    analyze_results(M, N, s_E, p_E, s_Q, p_Q);
    free(A);
    free(s_A);
    free(s_E);
    free(s_Q);
    free(p_A);
    free(p_E);
    free(p_Q);
    return 0;
}
and this is what you get (see the differences in the last output block, which tell you that the eigenvectors are wrong, although the eigenvalues are ok):
max. abs. diff. of eigenvalues:
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00
max. abs. diff. of eigenvectors (ignoring sign):
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00
max. abs. diff. of eigenvalues:
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00
max. abs. diff. of eigenvectors (ignoring sign):
0: 0.0e+00
1: 1.2e-01
2: 1.6e-01
3: 1.4e-01
4: 2.3e-01
5: 1.8e-01
6: 2.6e-01
7: 2.6e-01
The code was compiled with gcc 4.4.5 and linked against OpenBLAS (which contains LAPACK, libopenblas_sandybridge-r0.2.8.so), but it was also tested with another LAPACK version. Calling LAPACK directly from C (without LAPACKE) was also tested, with the same results. Substituting dsyevr with the dsyevd function (and adjusting the arguments) also had no effect.
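Roughly, such a dsyevd-based solve could look like the following sketch (the name solve_dsyevd is only illustrative and this is not the exact code that was tested; dsyevd overwrites its input matrix with the eigenvectors, hence the copy into Q):
void
solve_dsyevd(int n, double *A, double *E, double *Q){
    /* diagonalize one matrix with dsyevd; on exit A holds the eigenvectors */
    int info = LAPACKE_dsyevd(LAPACK_COL_MAJOR, 'V', 'L', n, A, n, E);
    assert(!info);
    memcpy(Q, A, n*n*sizeof(double));
}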
Finally, here is the compilation instruction I used:
gcc -std=c99 -fopenmp -L/path/to/openblas/lib -Wl,-R/path/to/openblas/lib/ \
-lopenblas -lgomp -I/path/to/openblas/include main.c -o main
Unfortunately, Google did not answer my questions, so any hint is welcome!
EDIT: To make sure that everything is ok with the BLAS and LAPACK versions, I took the reference LAPACK (including BLAS and LAPACKE) from http://www.netlib.org/lapack/ (version 3.4.2). Compiling the example code was a bit tricky, but it finally worked with separate compilation and linking:
gcc -c -std=c99 -fopenmp -I../lapack-3.4.2/lapacke/include \
netlib_dsyevr.c -o netlib_main.o
gfortran netlib_main.o ../lapack-3.4.2/liblapacke.a \
../lapack-3.4.2/liblapack.a ../lapack-3.4.2/librefblas.a \
-lgomp -o netlib_main
The netlib LAPACK/BLAS and the example program were built on both a Darwin 12.4.0 x86_64 and a Linux 3.2.0-0.bpo.4-amd64 x86_64 platform. The program misbehaves consistently on both.
Comments:
- […] malloc, for example, are not. Is yours? – Pleadings
- [Having used] malloc for quite some time in multi-threaded programs, I assume it is thread-safe. This is, of course, not a proof. To make sure that this is not the issue, I inlined the solve function in the parallel for loop and did the malloc and free calls in critical sections. Unfortunately, the (mis)behavior of the program remains unchanged. (I also rechecked the version where LAPACK is directly called, to make sure that LAPACKE does not cause the trouble due to internal malloc calls when allocating the workspace.) – Jhelum
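For clarity, here is a sketch of the critical-section variant mentioned in the comment above (using the same m, n, A, E, Q and nt as in p_solve; illustrative only, not the exact code that was run):
    /* solve() inlined into the parallel loop, with malloc/free serialized
       in critical sections; the LAPACKE call itself runs in parallel */
    #pragma omp parallel for schedule(static) num_threads(nt)
    for (int i = 0; i < m; ++i){
        int *isuppz, k;
        #pragma omp critical
        { isuppz = malloc(2*n*sizeof(int)); assert(isuppz); }
        int info = LAPACKE_dsyevr(LAPACK_COL_MAJOR, 'V', 'A', 'L',
                                  n, (double *)A[i], n, 0., 0., 0, 0, 0., &k,
                                  (double *)E[i], (double *)Q[i], n, isuppz);
        assert(!info);
        #pragma omp critical
        { free(isuppz); }
    }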
- […] malloc statements and change the header file to mkl_lapacke.h … – Corny
- [What do you mean by removing the malloc] statements? Unfortunately, I don't have access to the MKL library right now, so I can't test it myself. However, I am wondering about the numerical errors: is this also the case for the comparison of the first two serial calls (i.e. the first two output blocks)? If those are exactly zero, the parallel run should also give exactly the same results, because all problems are independent. The only source of numerical noise could be multithreaded BLAS calls from within LAPACK, which does not make sense here. Correct me if I am wrong. – Jhelum
- icpc complained about assigning a void pointer to a pointer of another type, so I had to add e.g. (int*) to the malloc in solve. The numerical errors occur only for the parallel part. The sequential part gives exactly zero for all results. – Corny
- […] (icpc)… Anyway, I take the presence of numerical errors in the parallel loop as confirmation of the bug, since, as explained above, there shouldn't be any difference from the serial loop. Thank you for testing the code. – Jhelum
- icc (Intel's C compiler) complains that you are not allowed to define a variable in a loop (for (int i = 0...). That's the reason why I assumed C++. – Corny
- [The code is C99] (gcc -std=c99 ...), but compiling it with a C++ compiler is also fine, except that you have to do the explicit void * casts. – Jhelum