Cholesky decomposition with OpenMP

I have a project where we solve the inverse of large (over 3000x3000) positive definite dense matrices using Cholesky decomposition. The project is in Java and we are using the CERN Colt BLAS library. Profiling the code shows that the Cholesky decomposition is the bottleneck.

I decided to try and parallelize the Cholesky decomposition using OpenMP and use it as a DLL in Java (with JNA). I started with the Cholesky decomposition code in C from Rosetta Code.

What I noticed is that the values in a column, except for the diagonal element, are independent of each other. So I decided to calculate the diagonal element in serial and the rest of the column in parallel. I also swapped the order of the loops so that the inner loop runs over the rows and the outer loop over the columns. The serial version is slightly slower than the one from Rosetta Code, but the parallel version is six times faster than the Rosetta Code version on my 4-core (8 HT) system. Using the DLL in Java speeds up our results by six times as well. Here is the code:

double *cholesky(double *A, int n) {
    double *L = (double*)calloc(n * n, sizeof(double));
    if (L == NULL)
        exit(EXIT_FAILURE);

    for (int j = 0; j < n; j++) {
        // diagonal element of column j (serial)
        double s = 0;
        for (int k = 0; k < j; k++) {
            s += L[j * n + k] * L[j * n + k];
        }
        L[j * n + j] = sqrt(A[j * n + j] - s);
        // remaining elements of column j (parallel over rows)
        #pragma omp parallel for
        for (int i = j + 1; i < n; i++) {
            double s = 0;
            for (int k = 0; k < j; k++) {
                s += L[i * n + k] * L[j * n + k];
            }
            L[i * n + j] = (1.0 / L[j * n + j] * (A[i * n + j] - s));
        }
    }
    return L;
}

You can find the full code for testing this at http://coliru.stacked-crooked.com/a/6f5750c20d456da9

I initially thought that false sharing would be a problem when the number of remaining elements in a column was small compared to the number of threads, but that does not seem to be the case. I tried

#pragma omp parallel for schedule(static, 8) // a cache line is 8 doubles

I have not found clear examples of how to parallelize Cholesky decomposition. I don't know if what I have done is ideal. For example, will it work well on a NUMA system?

Perhaps a task-based approach is better in general? In slides 7-9 at http://courses.engr.illinois.edu/cs554/fa2013/notes/07_cholesky.pdf there is an example of parallel Cholesky decomposition using "fine grained tasks". It's not clear to me how to implement this yet.
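
I imagine a tasked version would look roughly like the sketch below (untested; tile(), factor_diag(), solve_panel(), update_block(), nb, and bs are hypothetical names for a blocked layout and per-block kernels, and dep[][] is only a dependence token matrix for the OpenMP depend clauses):

/* Sketch only: view the n x n matrix as nb x nb blocks of size bs.
 * factor_diag  = Cholesky of a diagonal block
 * solve_panel  = triangular solve of a block below the diagonal
 * update_block = subtract the rank-bs contribution of column k */
char (*dep)[nb] = malloc(sizeof(char[nb][nb])); /* dependence tokens only */

#pragma omp parallel
#pragma omp single
{
    for (int k = 0; k < nb; k++) {
        #pragma omp task depend(inout: dep[k][k])
        factor_diag(tile(k, k), bs);

        for (int i = k + 1; i < nb; i++) {
            #pragma omp task depend(in: dep[k][k]) depend(inout: dep[i][k])
            solve_panel(tile(i, k), tile(k, k), bs);
        }

        for (int i = k + 1; i < nb; i++) {
            for (int j = k + 1; j <= i; j++) {
                #pragma omp task depend(in: dep[i][k], dep[j][k]) depend(inout: dep[i][j])
                update_block(tile(i, j), tile(i, k), tile(j, k), bs);
            }
        }
    }
}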

I have two questions, one specific and one general. Do you have any suggestions on how to improve my implementation of Cholesky decomposition with OpenMP? Can you suggest a different implementation of Cholesky decomposition with OpenMP, e.g. with tasks?

Edit: As requested, here is the AVX function I used to compute s. It did not help.

// requires <immintrin.h>
double inner_sum_AVX(double *li, double *lj, int n) {
    __m256d s4;
    int i;
    double s;

    s4 = _mm256_set1_pd(0.0);
    // process four doubles per iteration
    for (i = 0; i < (n & (-4)); i += 4) {
        __m256d li4, lj4;
        li4 = _mm256_loadu_pd(&li[i]);
        lj4 = _mm256_loadu_pd(&lj[i]);
        s4 = _mm256_add_pd(_mm256_mul_pd(li4, lj4), s4);
    }
    // horizontal sum of the vector accumulator
    double out[4];
    _mm256_storeu_pd(out, s4);
    s = out[0] + out[1] + out[2] + out[3];
    // scalar remainder
    for (; i < n; i++) {
        s += li[i] * lj[i];
    }
    return s;
}
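
For reference, this is roughly how it replaces the scalar k-loops in the first listing (a sketch only, assuming the same row-major storage of L):

/* Sketch: inner_sum_AVX in place of the scalar dot products */
for (int j = 0; j < n; j++) {
    L[j * n + j] = sqrt(A[j * n + j] - inner_sum_AVX(&L[j * n], &L[j * n], j));
    #pragma omp parallel for
    for (int i = j + 1; i < n; i++) {
        double s = inner_sum_AVX(&L[i * n], &L[j * n], j);
        L[i * n + j] = (1.0 / L[j * n + j]) * (A[i * n + j] - s);
    }
}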
Stentor answered 18/3, 2014 at 12:22 Comment(19)
Your speedup is fine, and I don't think you could gain much more performance just by using OpenMP. You could try AVX/SSE for the computation of s. Maybe there is more improvement to be made, but that would be on the mathematical side.Evacuate
@user3018144, I agree that 6x is pretty good already. I guess the main question is whether I will get the same speedup on a NUMA system, or whether the single-threaded code can be improved (the fact that hyper-threading is helping so much tells me it can). Good point about AVX/SSE on s. I have been thinking of that for a few days but have not tried it yet. It would be better to do it on multiple rows at the same time with SIMD, but the diagonal makes it difficult.Stentor
Correct me if I'm wrong, but you seem to be parallelising the inner loop with omp. If you want to have multiple threads calculating in parallel, you don't want to start a lot of short-lived threads, but keep a number of threads similar to the number of CPUs busy continuously. I'd try parallelising the outer loop, that way thread overhead (creating, scheduling, running, killing) is lower.Revetment
@EOF, if only it were that simple...Each column depends on the values of all the columns before it. They have to be computed sequentially. But the values within a column can be done in parallel except for the first element.Stentor
@user3018144, I tried using SSE and AVX for computing s. They did not help. Probably I'm memory bound and not compute bound. It's an O(n^3) algorithm, so I think if done right it should be compute bound, but I would have to change the whole algorithm for this. I can't just make tiles like in GEMM.Stentor
@Zboson Could you post your code with vectorization?Evacuate
@user3018144, see the function cholesky7 at coliru.stacked-crooked.com/a/1d13b7806f333650Stentor
@user3018144, I just edited my question with the AVX code for computing s.Stentor
@user3018144, I just tried aligning the arrays to 64 bytes so the loads/stores should be aligned. It did not make a difference either.Stentor
There's no difference at all? The compiler might have vectorized the code itself; otherwise it's really weird.Evacuate
You might be memory bound in OpenMP with AVX.Evacuate
@user3018144, there is no significant difference. I'm using MSVC2012. I put #pragma loop(no_vector) before the loops to turn off auto-vectorization. It makes no difference.Stentor
Try to align your vector L and avoid using unaligned vector loads.Evacuate
@user3018144, I already tried that. See a few comments above.Stentor
Yes, but you are doing an unaligned load, aren't you?Evacuate
@user3018144, on SB aligned and unaligned loads/stores have the same latency/throughput. The only thing that matters is the alignment, not the instructions, and alignment only matters if the 256-bit word crosses a cache line. In any case I tried the aligned load instructions. It makes no difference. This is why I avoided implementing SIMD. The easy way is not the right way, and the right way I don't know how to do (yet).Stentor
you might check this post: #22511114Evacuate
@EOF, no OpenMP runtime that exists nowadays kills the worker threads at the end of the parallel region. Rather all threads are kept in a pool and summoned (cheaply) when a new parallel region is entered. MSVC's OpenMP runtime uses the Windows native thread pool implementation, therefore maximum performance with minimum overhead.Conrado
@EOF, I got SSE, AVX, and FMA working with the Cholesky factorization. See my answer. It was not easy.Stentor

I managed to get SIMD working with the Cholesky decomposition. I did this using loop tiling, as I have done before in matrix multiplication. The solution was not trivial. Here are the times for a 5790x5790 matrix on my 4-core / 8-HT Ivy Bridge system (eff = GFLOPS/(peak GFLOPS)):

double floating point peak GFLOPS 118.1
1 thread       time 36.32 s, GFLOPS  1.78, eff  1.5%
8 threads      time  7.99 s, GFLOPS  8.10, eff  6.9%
4 threads+AVX  time  1.36 s, GFLOPS 47.64, eff 40.3%
4 threads MKL  time  0.68 s, GFLOPS 95.14, eff 80.6% // from LAPACKE_dpotrf

single floating point peak GFLOPS 236.2
1 thread       time 33.88 s, GFLOPS  1.91, eff  0.8%
8 threads      time  4.74 s, GFLOPS 13.64, eff  5.8%
4 threads+AVX  time  0.78 s, GFLOPS 82.61, eff 35.0%

The new method is 25 times faster for double and 40 times faster for single. The efficiency is about 35-40% of the peak FLOPS now. With matrix multiplication I get up to 70% with AVX in my own code. I don't know what to expect from Cholesky decomposition. The algorithm is partially serial (when calculating the diagonal block, called triangle in my code below) unlike matrix multiplication.

Update: I'm within a factor of 2 of MKL. I don't know whether I should be proud of that or embarrassed by it, but apparently my code can still be improved significantly. I found a PhD thesis on this which shows that my block algorithm is a common solution, so I managed to reinvent the wheel.

I use 32x32 tiles for double and 64x64 tiles for float. I also reorder the memory for each tile so that it is contiguous and transposed. I defined a new matrix product function. Matrix multiplication is defined as:

C_i,j = A_i,k * B_k,j //sum over k

I realized that in the Cholesky algorithm there is something very similar:

C_j,i = A_i,k * B_j,k //sum over k

By writing the transpose of the tiles I was able to use my optimized function for matrix multiplication here almost exactly (I only had to change one line of code). Here is the main function:

reorder(tmp,B,n2,bs);  // pack A into contiguous, transposed tiles
for(int j=0; j<nb; j++) {
    // update block column j with the already factored columns (GEMM-like)
    #pragma omp parallel for schedule(static) num_threads(ncores)
    for(int i=j; i<nb; i++) {
        for(int k=0; k<j; k++) {
            product(&B[stride*(nb*j+k)],&B[stride*(nb*i+k)],&B[stride*(nb*i+j)],bs);
        }
    }
    // factor the diagonal tile (serial)
    triangle(&B[stride*(nb*j+j)], bs);
    // triangular solve for the tiles below the diagonal (parallel)
    #pragma omp parallel for schedule(static)
    for(int i=j+1; i<nb; i++) {
        block(&B[stride*(nb*i+j)],&B[stride*(nb*j+j)],bs);
    }
}
reorder_inverse(B,tmp,n2,bs);  // unpack the result back to row-major
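
For context, here is roughly how the sizes above are set up (a hypothetical sketch, not the real driver; n is the original matrix dimension, padded up to a multiple of the tile size):

int bs = 32;                     /* tile size (32x32 for double)      */
int nb = (n + bs - 1) / bs;      /* number of tiles per row/column    */
int n2 = nb * bs;                /* padded matrix dimension           */
int stride = bs * bs;            /* elements in one packed tile       */
int ncores = 4;                  /* physical cores                    */
double *tmp = (double*)_mm_malloc(sizeof(double) * n2 * n2, 64); /* padded copy of A   */
double *B   = (double*)_mm_malloc(sizeof(double) * n2 * n2, 64); /* packed tile buffer */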

Here are the other functions. I have six product functions, for SSE2, AVX, and FMA, each with a double and a float version. I only show the one for AVX and double here:

template <typename Type>
void triangle(Type *A, int n) {
    for (int j = 0; j < n; j++) {
        Type s = 0;
        for(int k=0; k<j; k++) s+= A[k*n+j]*A[k*n+j];
        //if((A[j * n + j] - s)<0) printf("asdf3 j %d, %f %f\n", j, A[j * n + j] - s, sqrt(A[j * n + j] - s));
        A[j*n+j] = sqrt(A[j*n+j] - s);
        Type fact = 1.0/A[j*n+j];
        for (int i = j+1; i<n; i++) {
            Type s = 0;
            for(int k=0; k<j; k++) s+=A[k*n+i]*A[k*n+j];
            A[j*n+i] = fact * (A[j*n+i] - s);
        }
    }
}

template <typename Type>
void block(Type *A, Type *B, int n) {   
    for (int j = 0; j <n; j++) {
        Type fact = 1.0/B[j*n+j];   
        for (int i = 0; i<n; i++) {
            Type s = 0;
            for(int k=0; k<j; k++) {
                s += A[k*n+i]*B[k*n+j];
            }
            A[j*n+i] = fact * (A[j*n+i] - s);
        }
    }
}

template <typename Type>
void reorder(Type *A, Type *B, int n, int bs) {
    int nb = n/bs;
    int stride = bs*bs;
    //printf("%d %d %d\n", bs, nb, stride);
    #pragma omp parallel for schedule(static)
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    B[stride*(nb*i+j) + bs*j2+i2] = A[n*bs*i + j*bs + n*i2 + j2];
                }
            }
        }
    }
}

template <typename Type>
void reorder_inverse(Type *A, Type *B, int n, int bs) {
    int nb = n/bs;
    int stride = bs*bs;
    //printf("%d %d %d\n", bs, nb, stride);
    #pragma omp parallel for schedule(static)
    for(int i=0; i<nb; i++) {
        for(int j=0; j<nb; j++) {
            for(int i2=0; i2<bs; i2++) {
                for(int j2=0; j2<bs; j2++) {
                    B[n*bs*i + j*bs + n*i2 + j2] = A[stride*(nb*i+j) + bs*j2+i2];
                }
            }
        }
    }
}

extern "C" void product32x32_avx(double *a, double *b, double *c, int n) 
{
    for(int i=0; i<n; i++) {    
        __m256d t1 = _mm256_loadu_pd(&c[i*n +  0]);
        __m256d t2 = _mm256_loadu_pd(&c[i*n +  4]);
        __m256d t3 = _mm256_loadu_pd(&c[i*n +  8]);
        __m256d t4 = _mm256_loadu_pd(&c[i*n + 12]);
        __m256d t5 = _mm256_loadu_pd(&c[i*n + 16]);
        __m256d t6 = _mm256_loadu_pd(&c[i*n + 20]);
        __m256d t7 = _mm256_loadu_pd(&c[i*n + 24]);
        __m256d t8 = _mm256_loadu_pd(&c[i*n + 28]);
        for(int k=0; k<n; k++) {
            __m256d a1 = _mm256_set1_pd(a[k*n+i]);

            __m256d b1 = _mm256_loadu_pd(&b[k*n+0]);
            t1 = _mm256_sub_pd(t1,_mm256_mul_pd(a1,b1));

            __m256d b2 = _mm256_loadu_pd(&b[k*n+4]);
            t2 = _mm256_sub_pd(t2,_mm256_mul_pd(a1,b2));

            __m256d b3 = _mm256_loadu_pd(&b[k*n+8]);
            t3 = _mm256_sub_pd(t3,_mm256_mul_pd(a1,b3));

            __m256d b4 = _mm256_loadu_pd(&b[k*n+12]);
            t4 = _mm256_sub_pd(t4,_mm256_mul_pd(a1,b4));

            __m256d b5 = _mm256_loadu_pd(&b[k*n+16]);
            t5 = _mm256_sub_pd(t5,_mm256_mul_pd(a1,b5));

            __m256d b6 = _mm256_loadu_pd(&b[k*n+20]);
            t6 = _mm256_sub_pd(t6,_mm256_mul_pd(a1,b6));

            __m256d b7 = _mm256_loadu_pd(&b[k*n+24]);
            t7 = _mm256_sub_pd(t7,_mm256_mul_pd(a1,b7));

            __m256d b8 = _mm256_loadu_pd(&b[k*n+28]);
            t8 = _mm256_sub_pd(t8,_mm256_mul_pd(a1,b8));
        }
        _mm256_storeu_pd(&c[i*n +  0], t1);
        _mm256_storeu_pd(&c[i*n +  4], t2);
        _mm256_storeu_pd(&c[i*n +  8], t3);
        _mm256_storeu_pd(&c[i*n + 12], t4);
        _mm256_storeu_pd(&c[i*n + 16], t5);
        _mm256_storeu_pd(&c[i*n + 20], t6);
        _mm256_storeu_pd(&c[i*n + 24], t7);
        _mm256_storeu_pd(&c[i*n + 28], t8);
    }
}
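
For reference, a scalar version of what this kernel computes (an illustrative reference, not part of my code): with the tiles stored transposed by reorder(), the update is C -= A^T * B on a bs x bs tile, i.e. an ordinary GEMM accumulation with one index swapped.

/* Scalar reference for product32x32_avx (n = bs). The only difference from
 * plain GEMM is that A is indexed as a[k*n + i] instead of a[i*n + k]. */
void product_ref(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += a[k * n + i] * b[k * n + j];
            c[i * n + j] -= s;
        }
    }
}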
Stentor answered 14/4, 2014 at 15:8 Comment(5)
Reinventing the wheel isn't something to be ashamed of. It simply shows that you're thinking along the same lines as other accomplished people who did it before you. You still had to figure it out.Dissatisfy
Would you be so kind as to write an example of using this code? I think I figured it out, but I'm not sure what parameters it should be called with. bs = block size, nb = number of blocks, right?Parsonage
@ТимофейЛомоносов, there are parts of my code I can't release yet but here is the main function coliru.stacked-crooked.com/a/9c00d5ac7332e1c8Stentor
@ТимофейЛомоносов, and here is the product function for AVX coliru.stacked-crooked.com/a/4c934a4775dcd2f1Stentor
@ТимофейЛомоносов, if you want the product function for SSE2 and FMA let me know but that should be enough for you to figure it out. If I find time I'll clean up the code I can't release and make the whole thing public.Stentor
