While trying to compute the eigenvalues and eigenvectors of several matrices in parallel, I found that LAPACK's dsyevr function does not seem to be thread-safe.
- Is this known to anyone?
- Is there something wrong with my code? (see minimal example below)
- Any suggestion of an eigensolver implementation for dense matrices that is not too slow and is definitely thread-safe is welcome.
Here is a minimal code example in C which demonstrates the problem:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <assert.h>
#include <omp.h>
#include "lapacke.h"
#define M 8 /* number of matrices to be diagonalized */
#define N 1000 /* size of each matrix (real, symmetric) */
typedef double vec_t[N]; /* type for length N vector */
typedef double mtx_t[N][N]; /* type for N x N matrices */
void
init(int m, int n, mtx_t *A){
    /* init m symmetric n x n matrices */
    srand(0);
    for (int i = 0; i < m; ++i){
        for (int j = 0; j < n; ++j){
            for (int k = 0; k <= j; ++k){
                A[i][j][k] = A[i][k][j] = (rand()%100-50) / (double)100.;
            }
        }
    }
}
void
solve(int n, double *A, double *E, double *Q){
    /* diagonalize one matrix */
    double tol = 0.;
    int *isuppz = malloc(2*n*sizeof(int)); assert(isuppz);
    int k;
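    /* dsyevr arguments: 'V' = compute eigenvectors, 'A' = all eigenvalues,
       'L' = use the lower triangle; vl/vu/il/iu are ignored for range 'A';
       k receives the number of eigenvalues found */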
    int info = LAPACKE_dsyevr(LAPACK_COL_MAJOR, 'V', 'A', 'L',
                              n, A, n, 0., 0., 0, 0, tol, &k, E, Q, n, isuppz);
    assert(!info);
    free(isuppz);
}
void
s_solve(int m, int n, mtx_t *A, vec_t *E, mtx_t *Q){
    /* serial solve */
    for (int i = 0; i < m; ++i){
        solve(n, (double *)A[i], (double *)E[i], (double *)Q[i]);
    }
}
void
p_solve(int m, int n, mtx_t *A, vec_t *E, mtx_t *Q, int nt){
    /* parallel solve */
    int i;
    #pragma omp parallel for schedule(static) num_threads(nt) \
            private(i) \
            shared(m, n, A, E, Q)
    for (i = 0; i < m; ++i){
        solve(n, (double *)A[i], (double *)E[i], (double *)Q[i]);
    }
}
void
analyze_results(int m, int n, vec_t *E0, vec_t *E1, mtx_t *Q0, mtx_t *Q1){
    /* compare eigenvalues */
    printf("\nmax. abs. diff. of eigenvalues:\n");
    for (int i = 0; i < m; ++i){
        double t, dE = 0.;
        for (int j = 0; j < n; ++j){
            t = fabs(E0[i][j] - E1[i][j]);
            if (t > dE) dE = t;
        }
        printf("%i: %5.1e\n", i, dE);
    }
    /* compare eigenvectors (ignoring sign) */
    printf("\nmax. abs. diff. of eigenvectors (ignoring sign):\n");
    for (int i = 0; i < m; ++i){
        double t, dQ = 0.;
        for (int j = 0; j < n; ++j){
            for (int k = 0; k < n; ++k){
                t = fabs(fabs(Q0[i][j][k]) - fabs(Q1[i][j][k]));
                if (t > dQ) dQ = t;
            }
        }
        printf("%i: %5.1e\n", i, dQ);
    }
}
int main(void){
    mtx_t *A = malloc(M*N*N*sizeof(double)); assert(A);
    init(M, N, A);
    /* allocate space for matrices, eigenvalues and eigenvectors */
    mtx_t *s_A = malloc(M*N*N*sizeof(double)); assert(s_A);
    vec_t *s_E = malloc(M*N*sizeof(double)); assert(s_E);
    mtx_t *s_Q = malloc(M*N*N*sizeof(double)); assert(s_Q);
    /* copy initial matrix */
    memcpy(s_A, A, M*N*N*sizeof(double));
    /* solve serial */
    s_solve(M, N, s_A, s_E, s_Q);
    /* allocate space for matrices, eigenvalues and eigenvectors */
    mtx_t *p_A = malloc(M*N*N*sizeof(double)); assert(p_A);
    vec_t *p_E = malloc(M*N*sizeof(double)); assert(p_E);
    mtx_t *p_Q = malloc(M*N*N*sizeof(double)); assert(p_Q);
    /* copy initial matrix */
    memcpy(p_A, A, M*N*N*sizeof(double));
    /* use one thread, to check that the algorithm is deterministic */
    p_solve(M, N, p_A, p_E, p_Q, 1);
    analyze_results(M, N, s_E, p_E, s_Q, p_Q);
    /* copy initial matrix */
    memcpy(p_A, A, M*N*N*sizeof(double));
    /* use several threads, and see what happens */
    p_solve(M, N, p_A, p_E, p_Q, 4);
    analyze_results(M, N, s_E, p_E, s_Q, p_Q);
    free(A);
    free(s_A);
    free(s_E);
    free(s_Q);
    free(p_A);
    free(p_E);
    free(p_Q);
    return 0;
}
and this is what you get (see the differences in the last output block, which tell you that the eigenvectors are wrong, although the eigenvalues are ok):
max. abs. diff. of eigenvalues:
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00
max. abs. diff. of eigenvectors (ignoring sign):
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00
max. abs. diff. of eigenvalues:
0: 0.0e+00
1: 0.0e+00
2: 0.0e+00
3: 0.0e+00
4: 0.0e+00
5: 0.0e+00
6: 0.0e+00
7: 0.0e+00
max. abs. diff. of eigenvectors (ignoring sign):
0: 0.0e+00
1: 1.2e-01
2: 1.6e-01
3: 1.4e-01
4: 2.3e-01
5: 1.8e-01
6: 2.6e-01
7: 2.6e-01
The code was compiled with gcc 4.4.5 and linked against OpenBLAS (which contains LAPACK, libopenblas_sandybridge-r0.2.8.so), but it was also tested with another LAPACK version. Calling LAPACK directly from C (without LAPACKE) was also tested, with the same results. Substituting dsyevr with the dsyevd function (and adjusting the arguments) also had no effect.
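Roughly, such a dsyevd-based solve could look like the following sketch (the name solve_dsyevd is only illustrative and this is not the exact code that was tested; dsyevd overwrites its input matrix with the eigenvectors, hence the copy into Q):
void
solve_dsyevd(int n, double *A, double *E, double *Q){
    /* diagonalize one matrix with dsyevd; on exit A holds the eigenvectors */
    int info = LAPACKE_dsyevd(LAPACK_COL_MAJOR, 'V', 'L', n, A, n, E);
    assert(!info);
    memcpy(Q, A, n*n*sizeof(double));
}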
Finally, here is the compilation instruction I used:
gcc -std=c99 -fopenmp -L/path/to/openblas/lib -Wl,-R/path/to/openblas/lib/ \
-lopenblas -lgomp -I/path/to/openblas/include main.c -o main
Unfortunately, Google did not answer my questions, so any hint is welcome!
EDIT: To make sure that everything is ok with the BLAS and LAPACK versions, I took the reference LAPACK (including BLAS and LAPACKE) from http://www.netlib.org/lapack/ (version 3.4.2). Compiling the example code was a bit tricky, but it finally worked with separate compilation and linking:
gcc -c -std=c99 -fopenmp -I../lapack-3.4.2/lapacke/include \
netlib_dsyevr.c -o netlib_main.o
gfortran netlib_main.o ../lapack-3.4.2/liblapacke.a \
../lapack-3.4.2/liblapack.a ../lapack-3.4.2/librefblas.a \
-lgomp -o netlib_main
The netlib LAPACK/BLAS and the example program were built on both a Darwin 12.4.0 x86_64 and a Linux 3.2.0-0.bpo.4-amd64 x86_64 platform. The program misbehaves consistently on both.
Comments:
- […] malloc, for example, are not. Is yours? – Pleadings
- [Having used] malloc for quite some time in multi-threaded programs, I assume it is thread-safe. This is, of course, not a proof. To make sure that this is not the issue, I inlined the solve function in the parallel for loop and did the malloc and free calls in critical sections. Unfortunately, the (mis)behavior of the program remains unchanged. (I also rechecked the version where LAPACK is directly called, to make sure that LAPACKE does not cause the trouble due to internal malloc calls when allocating the workspace.) – Jhelum
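For clarity, here is a sketch of the critical-section variant mentioned in the comment above (using the same m, n, A, E, Q and nt as in p_solve; illustrative only, not the exact code that was run):
    /* solve() inlined into the parallel loop, with malloc/free serialized
       in critical sections; the LAPACKE call itself runs in parallel */
    #pragma omp parallel for schedule(static) num_threads(nt)
    for (int i = 0; i < m; ++i){
        int *isuppz, k;
        #pragma omp critical
        { isuppz = malloc(2*n*sizeof(int)); assert(isuppz); }
        int info = LAPACKE_dsyevr(LAPACK_COL_MAJOR, 'V', 'A', 'L',
                                  n, (double *)A[i], n, 0., 0., 0, 0, 0., &k,
                                  (double *)E[i], (double *)Q[i], n, isuppz);
        assert(!info);
        #pragma omp critical
        { free(isuppz); }
    }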
- […] malloc statements and change the header file to mkl_lapacke.h … – Corny
- [What do you mean by removing the malloc] statements? Unfortunately, I don't have access to the MKL library right now, so I can't test it myself. However, I am wondering about the numerical errors: is this also the case for the comparison of the first two serial calls (i.e. the first two output blocks)? If those are exactly zero, the parallel run should also give exactly the same results, because all problems are independent. The only source of numerical noise could be multithreaded BLAS calls from within LAPACK, which does not make sense here. Correct me if I am wrong. – Jhelum
- icpc complained about assigning a void pointer to a pointer of another type, so I had to add e.g. (int*) to the malloc in solve. The numerical errors occur only for the parallel part. The sequential part gives exactly zero for all results. – Corny
- […] (icpc)… Anyway, I take the presence of numerical errors in the parallel loop as confirmation of the bug, since, as explained above, there shouldn't be any difference from the serial loop. Thank you for testing the code. – Jhelum
- icc (Intel's C compiler) complains that you are not allowed to define a variable in a loop (for (int i = 0...). That's the reason why I assumed C++. – Corny
- [The code is C99] (gcc -std=c99 ...), but compiling it with a C++ compiler is also fine, except that you have to do the explicit void * casts. – Jhelum