SVD very slow when using cuSolver as compared to MATLAB
I'm trying to use the gesvd function from cuSOLVER, and I find it to be much slower than the svd function in MATLAB, whether MATLAB operates on a double array or a gpuArray.

C++ code [using cuSolver]:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>
// Macro for timing kernel runs
#define START_METER {\
    cudaEvent_t start, stop;\
    float elapsedTime;\
    cudaEventCreate(&start);\
    cudaEventRecord(start, 0);
#define STOP_METER cudaEventCreate(&stop);\
    cudaEventRecord(stop, 0);\
    cudaEventSynchronize(stop);\
    cudaEventElapsedTime(&elapsedTime, start, stop);\
    printf("Elapsed time : %f ms\n", elapsedTime);\
    cudaEventDestroy(start);\
    cudaEventDestroy(stop);\
                }

void cusolverSVD_Test()
{
    const int m = 64;
    const int rows = m;
    const int cols = m;
    /* A is a random matrix with 1 added to each diagonal entry,
     * e.g. (3x3 illustration of the structure):
     *       | 3.5 0.5 0 |
     *   A = | 0.5 3.5 0 |
     *       | 0   0   2 |
     */
    double A[rows*cols];
    for (int i = 0; i < cols; i++)
    {
        for (int j = 0; j < rows; j++)
        {
            A[i*rows + j] = (double)rand() / RAND_MAX;
            if (i == j){
                A[i*rows + j] += 1;
            }
        }
    }

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);
    int lwork;

    cusolverDnDgesvd_bufferSize(
        handle,
        rows,
        cols,
        &lwork);

    double *d_A;
    cudaMalloc(&d_A, sizeof(double)*rows*cols);
    cudaMemcpy(d_A, A, sizeof(double)*rows*cols, cudaMemcpyHostToDevice);

    double *d_S;
    cudaMalloc(&d_S, sizeof(double)*rows);

    double *d_U;
    cudaMalloc(&d_U, sizeof(double)*rows*rows);

    double *d_VT;
    cudaMalloc(&d_VT, sizeof(double)*rows*rows);

    double *d_work;
    cudaMalloc(&d_work, sizeof(double)*lwork);

    double *d_rwork;
    cudaMalloc(&d_rwork, sizeof(double)*(rows - 1));

    int *devInfo;
    cudaMalloc(&devInfo, sizeof(int));

    for (int t = 0; t < 10; t++)
    {
        // gesvd overwrites d_A, so restore it before each timed run
        cudaMemcpy(d_A, A, sizeof(double)*rows*cols, cudaMemcpyHostToDevice);
        signed char jobu = 'A';
        signed char jobvt = 'A';
        START_METER
            cusolverDnDgesvd(
            handle,
            jobu,
            jobvt,
            rows,
            cols,
            d_A,
            rows,
            d_S,
            d_U,
            rows,
            d_VT,
            rows,
            d_work,
            lwork,
            d_rwork,
            devInfo);
        STOP_METER
    }

    cudaFree(d_A);
    cudaFree(d_rwork);
    cudaFree(d_S);
    cudaFree(d_U);
    cudaFree(d_VT);
    cudaFree(d_work);
    cudaFree(devInfo);
    cusolverDnDestroy(handle);
}

int main()
{
    cusolverSVD_Test();
}

Output:

Elapsed time : 63.318016 ms
Elapsed time : 66.745316 ms
Elapsed time : 65.966530 ms
Elapsed time : 65.999939 ms
Elapsed time : 64.821053 ms
Elapsed time : 65.184547 ms
Elapsed time : 65.722916 ms
Elapsed time : 60.618786 ms
Elapsed time : 54.937569 ms
Elapsed time : 53.751263 ms
Press any key to continue . . .

**MATLAB code using the svd function:**

%% SVD on gpu
A = rand(64, 64) + eye(64);
tic
[~, ~, ~] = svd(A);
t = toc;
fprintf('CPU time: %f ms\n', t*1000);


d_A = gpuArray(A);
tic
[~, ~, ~] = svd(d_A);
t = toc;
fprintf('GPU time: %f ms\n', t*1000);

%% Output
% >> CPU time: 0.947754 ms
% >> GPU time: 2.168100 ms

Does MATLAB use some faster algorithm, or am I making a mistake somewhere? I really need a fast SVD implementation/algorithm that I can use in CUDA.

UPDATE: Execution times when using a 1000 x 1000 matrix

C++:

3655 ms (Double Precision)
2970 ms (Single Precision)

Matlab:

CPU time: 280.641123 ms
GPU time: 646.033498 ms
Danella answered 20/1, 2017 at 9:24 Comment(14)
A 64x64 matrix is not a size at which you would notice gains. – Premillenarian
Even then, how come the MATLAB function is so much faster? – Danella
Let me change the size and see... – Danella
MATLAB is incredibly optimized for these operations. Still, try something bigger! – Robertoroberts
@AnderBiguri No, increasing the size also does nothing. For a 1000x1000 matrix the C++ code takes 3655 ms, whereas the MATLAB code takes only around 250 ms on both GPU and CPU... – Danella
When you say "I need a good implementation", what do you mean? Is this just a question of performance? If so, what GPU are you using and what is its double-precision performance? Is double precision actually necessary for the SVD? – Spermatophyte
@Spermatophyte It's a GTX 960M, and performance is the real issue right now. – Danella
@Spermatophyte I tried the single-precision version too; it didn't make much of a difference. It took 2970 ms, whereas double precision took 3655 ms, even though Wikipedia lists 1317 and 41.16 GFLOPS for single and double precision respectively. – Danella
So basically you are running on a modest mobile GPU and complaining about its performance. – Spermatophyte
@Spermatophyte I am not complaining about its performance; I am just trying to achieve what's already possible on my system (as in MATLAB)... – Danella
Implicitly, you are assuming that it is possible to perform an SVD on a mobile GPU as fast as or faster than your CPU can. But you have no sound basis for that assumption. – Spermatophyte
@Spermatophyte Yes, I understand that routines like the SVD can be faster on CPUs than on mobile GPUs. But my intention was to learn how MATLAB achieves such good performance on both CPU and GPU; note that I have added the MATLAB GPU running time as well. I want to stick to the GPU because the SVD is a small part of a program I am writing in CUDA. – Danella
How do you even know that the MATLAB SVD is running on the GPU? For all you know it might download the gpuArray to the CPU and run it there. – Spermatophyte
@Spermatophyte in.mathworks.com/help/distcomp/…; it says svd will run on the GPU if any of the arguments is a gpuArray. – Danella
It is a known issue that the SVD algorithm does not parallelize well. You will find that you need very large arrays to see a benefit in double precision. You may get better results in single precision on your GPU. You will also get better results if you request only one output, since computing the singular values alone uses a much faster algorithm.

This is also highly dependent on the quality of your GPU. If you are using a consumer graphics card such as a GeForce GTX, you really aren't going to see much benefit in double precision for an algorithm like the SVD.

Fundamentally, individual GPU cores have much lower performance than modern CPU cores, and they make up for this with very wide parallelism. The SVD algorithm depends too heavily on a serial factorization iteration. Perhaps you can solve your problem by rethinking the algebra so that you don't need to compute the complete factorization every time.
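The "one output" suggestion can be sketched with the same cuSOLVER API used in the question: passing jobu = 'N' and jobvt = 'N' to cusolverDnDgesvd asks for singular values only, skipping the formation of U and VT. Note this is a hedged sketch, not a drop-in fix: early cuSOLVER releases (around CUDA 8, the era of this question) documented gesvd as supporting only jobu = jobvt = 'A', so the 'N' options require a newer toolkit; check the return status on your version.

```cpp
#include <cstdio>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Singular values only: jobu = jobvt = 'N' skips forming U and VT,
// which is the faster "one output" path. Requires a cuSOLVER version
// whose gesvd accepts 'N' (older releases allowed only 'A').
int main()
{
    const int m = 64, n = 64;

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0;
    cusolverDnDgesvd_bufferSize(handle, m, n, &lwork);

    double *d_A, *d_S, *d_work, *d_rwork;
    int *devInfo;
    cudaMalloc(&d_A, sizeof(double) * m * n);
    cudaMalloc(&d_S, sizeof(double) * n);
    cudaMalloc(&d_work, sizeof(double) * lwork);
    cudaMalloc(&d_rwork, sizeof(double) * (n - 1));
    cudaMalloc(&devInfo, sizeof(int));
    // ... fill d_A with the matrix, as in the question ...

    cusolverStatus_t status = cusolverDnDgesvd(
        handle, 'N', 'N', m, n,
        d_A, m,        // input matrix (overwritten by gesvd)
        d_S,           // singular values, in descending order
        NULL, m,       // U not referenced when jobu == 'N'
        NULL, n,       // VT not referenced when jobvt == 'N'
        d_work, lwork, d_rwork, devInfo);
    if (status != CUSOLVER_STATUS_SUCCESS)
        printf("gesvd failed; this cuSOLVER version may only support 'A'\n");

    cudaFree(d_A); cudaFree(d_S); cudaFree(d_work);
    cudaFree(d_rwork); cudaFree(devInfo);
    cusolverDnDestroy(handle);
    return 0;
}
```

In MATLAB the equivalent single-output call is s = svd(A). For small matrices like 64x64, batched or Jacobi-based routines added in later cuSOLVER releases (e.g. gesvdj) are also worth investigating.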

Cotillion answered 22/1, 2017 at 16:30 Comment(0)
