Calculate eigenvalues/eigenvectors of hundreds of small matrices using CUDA

Asked 9/7, 2012 at 18:51 Answered 6/2, 2024 at 18:40

matrix cuda opencl linear-algebra numerical-methods

I have a question on the eigen-decomposition of hundreds of small matrices using CUDA.

I need to calculate the eigenvalues and eigenvectors of hundreds (e.g. 500) of small (64-by-64) real symmetric matrices concurrently. I tried to implement it by the Jacobi method using chess tournament ordering (see this paper (PDF) for more information).

In this algorithm, each block has 32 threads and handles one small matrix; the 32 threads work together to zero out 32 off-diagonal elements at a time, repeating until convergence. However, I am not very satisfied with its performance.
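For readers unfamiliar with the scheme, here is a minimal CPU-side sketch of cyclic Jacobi with chess tournament (round-robin) ordering, written in plain C++ so it can serve as a reference when checking GPU output. It is not the kernel described above: the function names (`tournament_round`, `off_norm2`, `jacobi_tournament`), the row-major layout, the absolute tolerance and the sweep limit are all assumptions made for illustration. For n = 64 each round yields 32 disjoint index pairs, which is what would map to the 32 threads of a block.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// One round of the round-robin ("chess tournament") schedule for n indices
// (n even). Returns n/2 disjoint pairs (p, q) with p < q. Index n-1 stays
// fixed while the other n-1 indices rotate with the round number.
std::vector<std::pair<int, int>> tournament_round(int n, int round) {
    std::vector<std::pair<int, int>> pairs;
    const int m = n - 1;  // number of rotating indices (odd)
    for (int k = 0; k < n / 2; ++k) {
        const int a = (round + k) % m;
        const int b = (k == 0) ? n - 1 : (round + m - k) % m;
        pairs.emplace_back(a < b ? a : b, a < b ? b : a);
    }
    return pairs;
}

// Sum of squared off-diagonal entries; Jacobi drives this toward zero.
double off_norm2(const std::vector<double>& A, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            if (i != j) sum += A[i * n + j] * A[i * n + j];
    return sum;
}

// Cyclic Jacobi on a row-major symmetric n x n matrix A (n even).
// On return, the diagonal of A holds the eigenvalues and the columns of V
// hold the corresponding eigenvectors. Returns the number of sweeps started.
// tol and max_sweeps are illustrative defaults, not tuned values.
int jacobi_tournament(std::vector<double>& A, std::vector<double>& V, int n,
                      double tol = 1e-20, int max_sweeps = 50) {
    assert(n % 2 == 0 && static_cast<int>(A.size()) == n * n);
    V.assign(n * n, 0.0);
    for (int i = 0; i < n; ++i) V[i * n + i] = 1.0;

    for (int sweep = 0; sweep < max_sweeps; ++sweep) {
        if (off_norm2(A, n) <= tol) return sweep;
        for (int round = 0; round < n - 1; ++round) {
            // Pairs in one round are disjoint, so on a GPU each pair could
            // be given to its own thread; here they are applied in sequence.
            for (const auto& pq : tournament_round(n, round)) {
                const int p = pq.first, q = pq.second;
                const double apq = A[p * n + q];
                if (apq == 0.0) continue;
                // Rotation angle that annihilates A[p][q].
                const double theta = (A[q * n + q] - A[p * n + p]) / (2.0 * apq);
                const double t = (theta >= 0.0 ? 1.0 : -1.0) /
                                 (std::fabs(theta) + std::sqrt(theta * theta + 1.0));
                const double c = 1.0 / std::sqrt(t * t + 1.0);
                const double s = t * c;
                for (int k = 0; k < n; ++k) {  // A <- A*J and V <- V*J
                    const double akp = A[k * n + p], akq = A[k * n + q];
                    A[k * n + p] = c * akp - s * akq;
                    A[k * n + q] = s * akp + c * akq;
                    const double vkp = V[k * n + p], vkq = V[k * n + q];
                    V[k * n + p] = c * vkp - s * vkq;
                    V[k * n + q] = s * vkp + c * vkq;
                }
                for (int k = 0; k < n; ++k) {  // A <- J^T * A
                    const double apk = A[p * n + k], aqk = A[q * n + k];
                    A[p * n + k] = c * apk - s * aqk;
                    A[q * n + k] = s * apk + c * aqk;
                }
            }
        }
    }
    return max_sweeps;
}
```

Calling `jacobi_tournament(A, V, 64)` on a copy of one input matrix gives reference eigenvalues (the diagonal of `A`) and eigenvectors (the columns of `V`) to compare against the GPU result. Over the n-1 rounds of one sweep, every off-diagonal pair is visited exactly once.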

I am wondering whether there is a better algorithm for my problem, i.e. the eigen-decomposition of many 64-by-64 real symmetric matrices. I guess Householder's method may be a better choice, but I am not sure whether it can be implemented efficiently in CUDA. There is not much useful information online, since most other programmers are more interested in using CUDA/OpenCL to decompose one large matrix rather than many small matrices.

Dail answered 9/7, 2012 at 18:51 Comment(6)

What do you want to compute? The entire decomposition? Or only the Eigenvalues? Or only a few of the eigenvalues/eigenvectors? – Cyclops 9/7, 2012 at 19:20

What are your performance goals? Have you spent any time profiling? What are the results? – Elyn 9/7, 2012 at 19:29

@TimChild is correct - your "I am not satisfied with its performance" doesn't tell us much. – Minim 9/7, 2012 at 21:26

Yifei, if you don't supply more details on your problem, we won't be able to give you any reasonable answers. – Cyclops 10/7, 2012 at 9:55

@yifei-huang we would like to help. If you can provide more info on what you mean by "not satisfied with performance", it might help, otherwise I would vote to close... – Gleiwitz 28/2, 2013 at 2:58

That paper you linked says they only get a 1.8x speedup on CUDA for 64x64 matrices. What speedup are you getting? In general, GPUs need a LOT of work to cover global memory latency sufficiently (and stay busy), so it's not surprising that "hundreds" of threads can barely utilize the GPU. – Inductor 24/7, 2013 at 15:49

At least for the eigenvalues, a sample can be found in the CUDA SDK:

http://www.nvidia.de/content/cudazone/cuda_sdk/Linear_Algebra.html

The images seem to be broken, but downloading the samples still works. I would suggest downloading the full SDK and having a look at that example. This paper could also be helpful:

http://docs.nvidia.com/cuda/samples/6_Advanced/eigenvalues/doc/eigenvalues.pdf

Lowminded answered 17/4, 2013 at 13:24 Comment(0)

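The batched routines in cuSOLVER (cusolverDn*Batched) seem to do exactly what you are looking for. For example, cusolverDnSsyevjBatched computes the eigenvalues, and optionally the eigenvectors, of real symmetric matrices in single precision using the Jacobi method; cusolverDnCheevjBatched is the counterpart for complex Hermitian matrices. You can pass many matrices into the function with one call.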

https://docs.nvidia.com/cuda/cusolver/index.html?highlight=cusolverDnCheevjBatched#cusolverdn-t-gesvdjbatched

Blowy answered 6/2, 2024 at 18:40 Comment(0)
