I have a project where we compute the inverse of large (over 3000x3000) positive-definite dense matrices using Cholesky decomposition. The project is in Java and we are using the CERN Colt BLAS library. Profiling the code shows that the Cholesky decomposition is the bottleneck.
I decided to try to parallelize the Cholesky decomposition using OpenMP and call it from Java as a DLL (with JNA). I started with the C Cholesky decomposition code from Rosetta Code.
What I noticed is that, within a column, all values except the diagonal element are independent of one another. So I decided to compute the diagonal elements serially and the rest of the column's values in parallel. I also swapped the order of the loops so that the inner loop runs over the rows and the outer loop over the columns. The serial version is slightly slower than the one from Rosetta Code, but the parallel version is six times faster than the Rosetta Code version on my 4-core (8 HT) system. Using the DLL from Java speeds up our results by six times as well. Here is the code:
double *cholesky(double *A, int n) {
    double *L = (double *)calloc(n * n, sizeof(double));
    if (L == NULL)
        exit(EXIT_FAILURE);

    for (int j = 0; j < n; j++) {
        /* diagonal element of column j (serial) */
        double s = 0;
        for (int k = 0; k < j; k++) {
            s += L[j * n + k] * L[j * n + k];
        }
        L[j * n + j] = sqrt(A[j * n + j] - s);

        /* remaining elements of column j are independent, so parallel */
        #pragma omp parallel for
        for (int i = j + 1; i < n; i++) {
            double s = 0;
            for (int k = 0; k < j; k++) {
                s += L[i * n + k] * L[j * n + k];
            }
            L[i * n + j] = (1.0 / L[j * n + j]) * (A[i * n + j] - s);
        }
    }
    return L;
}
You can find the full code for testing this at http://coliru.stacked-crooked.com/a/6f5750c20d456da9
I initially thought that false sharing would be a problem when the number of remaining elements in a column was small compared to the number of threads, but that does not seem to be the case. I tried
#pragma omp parallel for schedule(static, 8) // a cache line is 8 doubles
I have not found clear examples of how to parallelize Cholesky decomposition. I don't know if what I have done is ideal. For example, will it work well on a NUMA system?
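On the NUMA point: with `calloc` all pages of `L` typically land on the node of the allocating thread. A common mitigation (a general technique, not from the post; the name `alloc_first_touch` is mine) is a parallel "first touch" initialization, so each page is placed on the node of the thread that will later work on it:

```c
#include <stdlib.h>

/* First-touch allocation sketch: most OSes place a page on the NUMA node
   of the thread that first writes to it, so zeroing L in a parallel loop
   with a static schedule spreads the pages across nodes, whereas
   calloc-then-compute tends to put everything on one node. */
double *alloc_first_touch(int n) {
    double *L = malloc((size_t)n * n * sizeof(double));
    if (L == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            L[(size_t)i * n + j] = 0.0;
    return L;
}
```

This only pays off if the compute loops use a matching `schedule(static)` so the same threads touch the same rows, and it is a heuristic: the allocator may hand back pages that were already touched elsewhere.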
Perhaps a task-based approach is better in general? Slides 7-9 at http://courses.engr.illinois.edu/cs554/fa2013/notes/07_cholesky.pdf give an example of parallel Cholesky decomposition using "fine grained tasks". It's not clear to me how to implement this yet.
I have two questions, one specific and one general. Do you have any suggestions on how to improve my implementation of Cholesky decomposition with OpenMP? Can you suggest a different implementation of Cholesky decomposition with OpenMP, e.g. one using tasks?
Edit: as requested, here is the AVX function I used to compute s. It did not help.
double inner_sum_AVX(double *li, double *lj, int n) {
    __m256d s4 = _mm256_set1_pd(0.0);
    int i;
    double s;
    /* process four doubles at a time */
    for (i = 0; i < (n & (-4)); i += 4) {
        __m256d li4 = _mm256_loadu_pd(&li[i]);
        __m256d lj4 = _mm256_loadu_pd(&lj[i]);
        s4 = _mm256_add_pd(_mm256_mul_pd(li4, lj4), s4);
    }
    /* horizontal sum of the vector accumulator */
    double out[4];
    _mm256_storeu_pd(out, s4);
    s = out[0] + out[1] + out[2] + out[3];
    /* scalar tail for the remaining n % 4 elements */
    for (; i < n; i++) {
        s += li[i] * lj[i];
    }
    return s;
}
Comments:

- Evacuate: Maybe there's improvement which could be done, but that would be on the mathematical side.
- Stentor: cholesky7 at coliru.stacked-crooked.com/a/1d13b7806f333650
- Stentor: I put `#pragma loop(no_vector)` before the loops to turn off auto-vectorization. It makes no difference.
- Evacuate: ...and avoid using unaligned vecto