Cholesky decomposition with OpenMP

I have a project where we solve the inverse of large (over 3000x3000) positive definite dense matrices using Cholesky decomposition. The project is in Java and we are using the CERN Colt BLAS library. Profiling the code shows that the Cholesky decomposition is the bottleneck.

I decided to try and parallelize the Cholesky decomposition using OpenMP and use it as a DLL in Java (with JNA). I started with the Cholesky decomposition code in C from Rosetta Code.

What I noticed is that the values in a column, except for the diagonal element, are independent of each other. So I decided to calculate the diagonal element in serial and the rest of the column in parallel. I also swapped the order of the loops so that the inner loop runs over the rows and the outer loop over the columns. The serial version is slightly slower than the one from Rosetta Code, but the parallel version is six times faster than the Rosetta Code version on my 4-core (8 HT) system. Using the DLL in Java speeds up our results by six times as well. Here is the code:

double *cholesky(double *A, int n) {
    double *L = (double*)calloc(n * n, sizeof(double));
    if (L == NULL)
        exit(EXIT_FAILURE);

    for (int j = 0; j < n; j++) {
        // diagonal element of column j (serial)
        double s = 0;
        for (int k = 0; k < j; k++) {
            s += L[j * n + k] * L[j * n + k];
        }
        L[j * n + j] = sqrt(A[j * n + j] - s);
        // remaining elements of column j (parallel over rows)
        #pragma omp parallel for
        for (int i = j + 1; i < n; i++) {
            double s = 0;
            for (int k = 0; k < j; k++) {
                s += L[i * n + k] * L[j * n + k];
            }
            L[i * n + j] = (1.0 / L[j * n + j] * (A[i * n + j] - s));
        }
    }
    return L;
}

You can find the full code for testing this at http://coliru.stacked-crooked.com/a/6f5750c20d456da9

I initially thought that false sharing would be a problem when the number of remaining elements in a column was small compared to the number of threads, but that does not seem to be the case. I tried

#pragma omp parallel for schedule(static, 8) // a cache line is 8 doubles

I have not found clear examples of how to parallelize Cholesky decomposition. I don't know if what I have done is ideal. For example, will it work well on a NUMA system?

Perhaps a task-based approach is better in general? In slides 7-9 at http://courses.engr.illinois.edu/cs554/fa2013/notes/07_cholesky.pdf there is an example of parallel Cholesky decomposition using "fine grained tasks". It's not clear to me how to implement this yet.
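
I imagine a tasked version would look roughly like the sketch below (untested; tile(), factor_diag(), solve_panel(), update_block(), nb, and bs are hypothetical names for a blocked layout and per-block kernels, and dep[][] is only a dependence token matrix for the OpenMP depend clauses):

/* Sketch only: view the n x n matrix as nb x nb blocks of size bs.
 * factor_diag  = Cholesky of a diagonal block
 * solve_panel  = triangular solve of a block below the diagonal
 * update_block = subtract the rank-bs contribution of column k */
char (*dep)[nb] = malloc(sizeof(char[nb][nb])); /* dependence tokens only */

#pragma omp parallel
#pragma omp single
{
    for (int k = 0; k < nb; k++) {
        #pragma omp task depend(inout: dep[k][k])
        factor_diag(tile(k, k), bs);

        for (int i = k + 1; i < nb; i++) {
            #pragma omp task depend(in: dep[k][k]) depend(inout: dep[i][k])
            solve_panel(tile(i, k), tile(k, k), bs);
        }

        for (int i = k + 1; i < nb; i++) {
            for (int j = k + 1; j <= i; j++) {
                #pragma omp task depend(in: dep[i][k], dep[j][k]) depend(inout: dep[i][j])
                update_block(tile(i, j), tile(i, k), tile(j, k), bs);
            }
        }
    }
}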

I have two questions, one specific and one general. Do you have any suggestions on how to improve my implementation of Cholesky decomposition with OpenMP? Can you suggest a different implementation of Cholesky decomposition with OpenMP, e.g. with tasks?

Edit: As requested, here is the AVX function I used to compute s. It did not help.

// requires <immintrin.h>
double inner_sum_AVX(double *li, double *lj, int n) {
    __m256d s4;
    int i;
    double s;

    s4 = _mm256_set1_pd(0.0);
    // process four doubles per iteration
    for (i = 0; i < (n & (-4)); i += 4) {
        __m256d li4, lj4;
        li4 = _mm256_loadu_pd(&li[i]);
        lj4 = _mm256_loadu_pd(&lj[i]);
        s4 = _mm256_add_pd(_mm256_mul_pd(li4, lj4), s4);
    }
    // horizontal sum of the vector accumulator
    double out[4];
    _mm256_storeu_pd(out, s4);
    s = out[0] + out[1] + out[2] + out[3];
    // scalar remainder
    for (; i < n; i++) {
        s += li[i] * lj[i];
    }
    return s;
}
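
For reference, this is roughly how it replaces the scalar k-loops in the first listing (a sketch only, assuming the same row-major storage of L):

/* Sketch: inner_sum_AVX in place of the scalar dot products */
for (int j = 0; j < n; j++) {
    L[j * n + j] = sqrt(A[j * n + j] - inner_sum_AVX(&L[j * n], &L[j * n], j));
    #pragma omp parallel for
    for (int i = j + 1; i < n; i++) {
        double s = inner_sum_AVX(&L[i * n], &L[j * n], j);
        L[i * n + j] = (1.0 / L[j * n + j]) * (A[i * n + j] - s);
    }
}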
Stentor answered 18/3, 2014 at 12:22 Comment(19)
Your speedup is fine, and I don't think you could gain much more performance just by using OpenMP. You could try AVX/SSE for the computation of s. Maybe there is more improvement to be made, but that would be on the mathematical side.Evacuate
@user3018144, I agree that 6x is pretty good already. I guess the main question is whether I will get the same speedup on a NUMA system, or whether the single-threaded code can be improved (the fact that hyper-threading is helping so much tells me it can). Good point about AVX/SSE on s. I have been thinking of that for a few days but have not tried it yet. It would be better to do it on multiple rows at the same time with SIMD, but the diagonal makes it difficult.Stentor
Correct me if I'm wrong, but you seem to be parallelising the inner loop with omp. If you want to have multiple threads calculating in parallel, you don't want to start a lot of short-lived threads, but keep a number of threads similar to the number of CPUs busy continuously. I'd try parallelising the outer loop, that way thread overhead (creating, scheduling, running, killing) is lower.Revetment
@EOF, if only it were that simple...Each column depends on the values of all the columns before it. They have to be computed sequentially. But the values within a column can be done in parallel except for the first element.Stentor
@user3018144, I tried using SSE and AVX for computing s. They did not help. Probably I'm memory bound and not compute bound. It's an O(n^3) algorithm, so I think if done right it should be compute bound, but I would have to change the whole algorithm for this. I can't just make tiles like in GEMM.Stentor
@Zboson Could you post your code with vectorization?Evacuate
@user3018144, see the function cholesky7 at coliru.stacked-crooked.com/a/1d13b7806f333650Stentor
@user3018144, I just edited my question with the AVX code for computing s.Stentor
@user3018144, I just tried aligning the arrays to 64 bytes so the loads/stores should be aligned. It did not make a difference either.Stentor
There's no difference at all? The compiler might have vectorized the code itself; otherwise it's really weird.Evacuate
You might be memory bound in OpenMP with AVX.Evacuate
@user3018144, there is no significant difference. I'm using MSVC2012. I put #pragma loop(no_vector) before the loops to turn off auto-vectorization. It makes no difference.Stentor
Try to align your vector L and avoid using unaligned vector loads.Evacuate
@user3018144, I already tried that. See a few comments above.Stentor
Yes, but you are doing an unaligned load, aren't you?Evacuate
@user3018144, on SB aligned and unaligned loads/stores have the same latency/throughput. The only thing that matters is the alignment, not the instructions, and alignment only matters if the 256-bit word crosses a cache line. In any case I tried the aligned load instructions. It makes no difference. This is why I avoided implementing SIMD. The easy way is not the right way, and the right way I don't know how to do (yet).Stentor
you might check this post: #22511114Evacuate
@EOF, no OpenMP runtime that exists nowadays kills the worker threads at the end of the parallel region. Rather all threads are kept in a pool and summoned (cheaply) when a new parallel region is entered. MSVC's OpenMP runtime uses the Windows native thread pool implementation, therefore maximum performance with minimum overhead.Conrado
@EOF, I got SSE, AVX, and FMA working with the Cholesky factorization. See my answer. It was not easy.Stentor

I managed to get SIMD working with the Cholesky decomposition. I did this using loop tiling, as I have done before in matrix multiplication. The solution was not trivial. Here are the times for a 5790x5790 matrix on my 4-core / 8-HT Ivy Bridge system (eff = GFLOPS/(peak GFLOPS)):

double floating point peak GFLOPS 118.1
1 thread       time 36.32 s, GFLOPS  1.78, eff  1.5%
8 threads      time  7.99 s, GFLOPS  8.10, eff  6.9%
4 threads+AVX  time  1.36 s, GFLOPS 47.64, eff 40.3%
4 threads MKL  time  0.68 s, GFLOPS 95.14, eff 80.6% // from LAPACKE_dpotrf

single floating point peak GFLOPS 236.2
1 thread       time 33.88 s, GFLOPS  1.91, eff  0.8%
8 threads      time  4.74 s, GFLOPS 13.64, eff  5.8%
4 threads+AVX  time  0.78 s, GFLOPS 82.61, eff 35.0%

The new method is 25 times faster for double and 40 times faster for single. The efficiency is about 35-40% of the peak FLOPS now. With matrix multiplication I get up to 70% with AVX in my own code. I don't know what to expect from Cholesky decomposition. The algorithm is partially serial (when calculating the diagonal block, called triangle in my code below) unlike matrix multiplication.

Update: I'm within a factor of 2 of MKL. I don't know whether I should be proud of that or embarrassed by it, but apparently my code can still be improved significantly. I found a PhD thesis on this which shows that my block algorithm is a common solution, so I managed to reinvent the wheel.

I use 32x32 tiles for double and 64x64 tiles for float. I also reorder the memory for each tile so that it is contiguous and transposed. I defined a new matrix product function. Matrix multiplication is defined as:

C_i,j = A_i,k * B_k,j //sum over k

I realized that in the Cholesky algorithm there is something very similar:

C_j,i = A_i,k * B_j,k //sum over k

By writing the transpose of the tiles I was able to use my optimized function for matrix multiplication here almost exactly (I only had to change one line of code). Here is the main function:

reorder(tmp,B,n2,bs);  // pack A into contiguous, transposed tiles
for(int j=0; j<nb; j++) {
    // update block column j with the already factored columns (GEMM-like)
    #pragma omp parallel for schedule(static) num_threads(ncores)
    for(int i=j; i<nb; i++) {
        for(int k=0; k<j; k++) {
            product(&B[stride*(nb*j+k)],&B[stride*(nb*i+k)],&B[stride*(nb*i+j)],bs);
        }
    }
    // factor the diagonal tile (serial)
    triangle(&B[stride*(nb*j+j)], bs);
    // triangular solve for the tiles below the diagonal (parallel)
    #pragma omp parallel for schedule(static)
    for(int i=j+1; i<nb; i++) {
        block(&B[stride*(nb*i+j)],&B[stride*(nb*j+j)],bs);
    }
}
reorder_inverse(B,tmp,n2,bs);  // unpack the result back to row-major
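
For context, here is roughly how the sizes above are set up (a hypothetical sketch, not the real driver; n is the original matrix dimension, padded up to a multiple of the tile size):

int bs = 32;                     /* tile size (32x32 for double)      */
int nb = (n + bs - 1) / bs;      /* number of tiles per row/column    */
int n2 = nb * bs;                /* padded matrix dimension           */
int stride = bs * bs;            /* elements in one packed tile       */
int ncores = 4;                  /* physical cores                    */
double *tmp = (double*)_mm_malloc(sizeof(double) * n2 * n2, 64); /* padded copy of A   */
double *B   = (double*)_mm_malloc(sizeof(double) * n2 * n2, 64); /* packed tile buffer */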

Here are the other functions. I have six product functions, for SSE2, AVX, and FMA, each with a double and a float version. I only show the one for AVX and double here:

template <typename Type>
void triangle(Type *A, int n) {
    for (int j = 0; j < n; j++) {
        Type s = 0;
        for(int k=0; k<j; k++) s+= A[k*n+j]*A[k*n+j];
        //if((A[j * n + j] - s)<0) printf("asdf3 j %d, %f %f\n", j, A[j * n + j] - s, sqrt(A[j * n + j] - s));
        A[j*n+j] = sqrt(A[j*n+j] - s);
        Type fact = 1.0/A[j*n+j];
        for (int i = j+1; i<n; i++) {
            Type s = 0;
            for(int k=0; k<j; k++) s+=A[k*n+i]*A[k*n+j];
            A[j*n+i] = fact * (A[j*n+i] - s);
        }
    }
}

template <typename Type>
void block(Type *A, Type *B, int n) {   
    for (int j = 0; j <n; j++) {
        Type fact = 1.0/B[j*n+j];   
        for (int i = 0; i<n; i++) {
            Type s = 0;
            for(int k=0; k<j; k++) {
                s += A[k*n+i]*B[k*n+j];
            }
            A[j*n+i] = fact * (A[j*n+i] - s);
        }
    }
}

template <typename Type>
void reorder(Type *A, Type *B, int n, int bs) {
    int nb = n/bs;
    int stride = bs*bs;
    //printf("%d %d %d\n", bs, nb, stride);
    #pragma omp parallel for schedule(static)
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    B[stride*(nb*i+j) + bs*j2+i2] = A[n*bs*i + j*bs + n*i2 + j2];
                }
            }
        }
    }
}

template <typename Type>
void reorder_inverse(Type *A, Type *B, int n, int bs) {
    int nb = n/bs;
    int stride = bs*bs;
    //printf("%d %d %d\n", bs, nb, stride);
    #pragma omp parallel for schedule(static)
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    B[n*bs*i + j*bs + n*i2 + j2] = A[stride*(nb*i+j) + bs*j2+i2];
                }
            }
        }
    }
}

extern "C" void product32x32_avx(double *a, double *b, double *c, int n) 
{
    for(int i=0; i<n; i++) {    
        __m256d t1 = _mm256_loadu_pd(&c[i*n +  0]);
        __m256d t2 = _mm256_loadu_pd(&c[i*n +  4]);
        __m256d t3 = _mm256_loadu_pd(&c[i*n +  8]);
        __m256d t4 = _mm256_loadu_pd(&c[i*n + 12]);
        __m256d t5 = _mm256_loadu_pd(&c[i*n + 16]);
        __m256d t6 = _mm256_loadu_pd(&c[i*n + 20]);
        __m256d t7 = _mm256_loadu_pd(&c[i*n + 24]);
        __m256d t8 = _mm256_loadu_pd(&c[i*n + 28]);
        for(int k=0; k<n; k++) {
            __m256d a1 = _mm256_set1_pd(a[k*n+i]);

            __m256d b1 = _mm256_loadu_pd(&b[k*n+0]);
            t1 = _mm256_sub_pd(t1,_mm256_mul_pd(a1,b1));

            __m256d b2 = _mm256_loadu_pd(&b[k*n+4]);
            t2 = _mm256_sub_pd(t2,_mm256_mul_pd(a1,b2));

            __m256d b3 = _mm256_loadu_pd(&b[k*n+8]);
            t3 = _mm256_sub_pd(t3,_mm256_mul_pd(a1,b3));

            __m256d b4 = _mm256_loadu_pd(&b[k*n+12]);
            t4 = _mm256_sub_pd(t4,_mm256_mul_pd(a1,b4));

            __m256d b5 = _mm256_loadu_pd(&b[k*n+16]);
            t5 = _mm256_sub_pd(t5,_mm256_mul_pd(a1,b5));

            __m256d b6 = _mm256_loadu_pd(&b[k*n+20]);
            t6 = _mm256_sub_pd(t6,_mm256_mul_pd(a1,b6));

            __m256d b7 = _mm256_loadu_pd(&b[k*n+24]);
            t7 = _mm256_sub_pd(t7,_mm256_mul_pd(a1,b7));

            __m256d b8 = _mm256_loadu_pd(&b[k*n+28]);
            t8 = _mm256_sub_pd(t8,_mm256_mul_pd(a1,b8));
        }
        _mm256_storeu_pd(&c[i*n +  0], t1);
        _mm256_storeu_pd(&c[i*n +  4], t2);
        _mm256_storeu_pd(&c[i*n +  8], t3);
        _mm256_storeu_pd(&c[i*n + 12], t4);
        _mm256_storeu_pd(&c[i*n + 16], t5);
        _mm256_storeu_pd(&c[i*n + 20], t6);
        _mm256_storeu_pd(&c[i*n + 24], t7);
        _mm256_storeu_pd(&c[i*n + 28], t8);
    }
}
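
For reference, a scalar version of what this kernel computes (an illustrative reference, not part of my code): with the tiles stored transposed by reorder(), the update is C -= A^T * B on a bs x bs tile, i.e. an ordinary GEMM accumulation with one index swapped.

/* Scalar reference for product32x32_avx (n = bs). The only difference from
 * plain GEMM is that A is indexed as a[k*n + i] instead of a[i*n + k]. */
void product_ref(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[k * n + i] * b[k * n + j];
            c[i * n + j] -= s;
        }
    }
}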
Stentor answered 14/4, 2014 at 15:8 Comment(5)
Reinventing the wheel isn't something to be ashamed of. It simply shows that you're thinking along the same lines as other accomplished people who did it before you. You still had to figure it out.Dissatisfy
Would you be so kind as to write an example of using this code? I think I figured it out, but I'm not sure what parameters it should be called with. bs = block size, nb = number of blocks, right?Parsonage
@ТимофейЛомоносов, there are parts of my code I can't release yet but here is the main function coliru.stacked-crooked.com/a/9c00d5ac7332e1c8Stentor
@ТимофейЛомоносов, and here is the product function for AVX coliru.stacked-crooked.com/a/4c934a4775dcd2f1Stentor
@ТимофейЛомоносов, if you want the product function for SSE2 and FMA let me know but that should be enough for you to figure it out. If I find time I'll clean up the code I can't release and make the whole thing public.Stentor
