SVD speed on CPU and GPU
I'm testing svd in Matlab R2014a and it seems that there is no CPU-vs-GPU speedup. I'm using a GTX 460 card and a Core 2 Duo E8500.

Here is my code:

%test SVD
n=10000;
%host
Mh= rand(n,1000);
tic
%[Uh,Sh,Vh]= svd(Mh);
svd(Mh);
toc
%device
Md = gpuArray.rand(n,1000);
tic
%[Ud,Sd,Vd]= svd(Md);
svd(Md);
toc

Also, the run times differ from run to run, but the CPU and GPU versions take about the same time. Why is there no speedup?

Here are some tests:

for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n);
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n);
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 43.124130 seconds.
Elapsed time is 43.842277 seconds.
Elapsed time is 42.993283 seconds.
Elapsed time is 44.293410 seconds.
Elapsed time is 42.924541 seconds.
Elapsed time is 43.730343 seconds.
Elapsed time is 43.125938 seconds.
Elapsed time is 43.645095 seconds.
Elapsed time is 43.492129 seconds.
Elapsed time is 43.459277 seconds.
Elapsed time is 43.327012 seconds.
Elapsed time is 44.040959 seconds.
Elapsed time is 43.242291 seconds.
Elapsed time is 43.390881 seconds.
Elapsed time is 43.275379 seconds.
Elapsed time is 43.408705 seconds.
Elapsed time is 43.320387 seconds.
Elapsed time is 44.232156 seconds.
Elapsed time is 42.984002 seconds.
Elapsed time is 43.702430 seconds.


for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n,'single');
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n,'single');
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 21.140301 seconds.
Elapsed time is 21.334361 seconds.
Elapsed time is 21.275991 seconds.
Elapsed time is 21.582602 seconds.
Elapsed time is 21.093408 seconds.
Elapsed time is 21.305413 seconds.
Elapsed time is 21.482931 seconds.
Elapsed time is 21.327842 seconds.
Elapsed time is 21.120969 seconds.
Elapsed time is 21.701752 seconds.
Elapsed time is 21.117268 seconds.
Elapsed time is 21.384318 seconds.
Elapsed time is 21.359225 seconds.
Elapsed time is 21.911570 seconds.
Elapsed time is 21.086259 seconds.
Elapsed time is 21.263040 seconds.
Elapsed time is 21.472175 seconds.
Elapsed time is 21.561370 seconds.
Elapsed time is 21.330314 seconds.
Elapsed time is 21.546260 seconds.
Defamatory answered 7/11, 2014 at 8:38 Comment(4)
What runtimes are you getting? What's your CPU model? Did you run this benchmark code a few times to warm up the GPU before noting down the runtimes?Une
Also, for a fair benchmark, I think you need to use gather, i.e. gather(svd(Md)); see the sketch after these comments.Une
svd(Md) calculates only the singular values, 1000 for the matrix at hand. Do you still see no speedup when using [Ud,Sd,Vd] = svd(Md)?Alenealenson
Please note that the CPU version of the Matlab function is also well optimized, probably using multiple cores behind the scenes if you have the Parallel Computing Toolbox. So Matlab is not usually a good tool for demonstrating GPU speedup.Pleochroism
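
Putting the warm-up and gather suggestions from these comments together, a minimal timing sketch (assumes the Parallel Computing Toolbox):

Md = gpuArray.rand(10000, 1000);
svd(Md);               % warm-up run, not timed
tic
s = gather(svd(Md));   % gather waits for the GPU result and copies it back
toc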

Generally, SVD is a difficult routine to parallelize. You can check here that even with a high-end Tesla card, the speedup is not very impressive.

You have a GTX 460 card, which is Fermi architecture. The card is optimized for gaming (single-precision computations), not HPC (double-precision computations). The single-precision to double-precision throughput ratio is 12:1. So the card has 873 GFLOPS SP / 72 GFLOPS DP. Check here.

So if the Md array uses double-precision elements, the computation on it will be rather slow. Also, there's a high chance that the CPU routine utilizes all CPU cores, reducing the possible gain of running the routine on the GPU. Plus, in the GPU run you pay for transferring the buffer to the device.

Per Divakar's suggestion, you could use Md = single(Md) to convert your array to single precision and run the benchmark again. You can also try a bigger dataset size to see if anything changes. I don't expect too much gain for this routine on your GPU.
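
A minimal sketch of that single-precision variant; wait(gpuDevice) blocks until the GPU has actually finished, so the measured time is honest (assumes the Parallel Computing Toolbox):

Md = gpuArray.rand(10000, 1000);
Md = single(Md);       % convert to single precision on the device
tic
svd(Md);
wait(gpuDevice);       % block until the GPU work is done
toc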

Update 1:

After you posted the results, I saw that the DP/SP time ratio is 2. On the CPU side this is normal, because SSE registers hold half as many double values as single values. However, a ratio of only 2 on the GPU side means that the GPU code does not make the best use of the SM cores, because the theoretical ratio is 12. In other words, I would have expected much better SP performance from optimized code, compared to DP. It seems that this is not the case.
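
To check how far the SVD is from the card's raw SP/DP ratio, one can time a compute-bound kernel such as matrix multiply; a minimal sketch (gputimeit needs R2013b or later):

n  = 2000;
Ad = gpuArray.rand(n, 'double');
As = gpuArray.rand(n, 'single');
tDP = gputimeit(@() Ad * Ad);
tSP = gputimeit(@() As * As);
fprintf('DP/SP time ratio: %.1f\n', tDP / tSP);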

Trifurcate answered 7/11, 2014 at 9:47 Comment(3)
To extend the statement about single/double precision, you can suggest using Md = single(Md) before timing the GPU code, if a little loss in precision is fine with the OP.Une
@Une Thank you! I added the comment to the post.Trifurcate
Dongarra and his co-authors outline some of the challenges with accelerating SVD on GPUs in "Accelerating Numerical Dense Linear Algebra Calculations with GPUs". Quote: "There are many ways to formulate mathematically and solve these problems numerically, but in all cases, designing an efficient computation is challenging because of the nature of the algorithms."Cabin

As VAndrei has already stated, the SVD is an algorithm which is difficult to parallelize.

Your main problem is the size of your matrix. The performance of the SVD drops rapidly with growing matrix size, so your main goal should be to reduce the size of the matrix. This can be accomplished using the Gaussian normal equations (basically a reduction of an overdetermined linear system in the least-squares sense).

This can be done by simply multiplying the transpose onto the matrix:

MhReduced = Mh' * Mh;

This reduces your matrix to size cols x cols (where cols is the number of columns of Mh). Then you just call [U,S,V] = svd(MhReduced);

Note: Using this method may yield singular vectors with opposite sign (only important if you're comparing these methods).

If your matrix is well-conditioned, this should work without problems. However, for an ill-conditioned matrix this method may fail to produce a usable result, whereas applying the SVD directly could still yield one thanks to the SVD's robustness; a sketch below illustrates why.
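
A minimal sketch of the conditioning issue (hypothetical sizes): forming A'*A squares the condition number, which is exactly what makes the normal equations fragile for ill-conditioned matrices.

A = rand(1000, 50);
cond(A)        % some value k
cond(A' * A)   % approximately k^2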

This should increase your performance immensely, at least with big enough matrices. Another advantage is that you can use much larger matrices. You probably won't have to use the GPU at all (either the matrices are so big that copying to the GPU costs too much, or after the reduction the matrix is so small that the GPU speedup won't be big enough).

Also note that a large chunk of performance is lost if you use return values. If you're only interested in the performance of the SVD calculation, don't take any return values. If you are only interested in the "solution vector", just get V (and access its last column): [~,~,V] = svd(Mh);

EDIT:

I've looked at your sample code, but I'm not sure what it is you are calculating. I also realized that it's rather hard to understand what I did with A'*A, so I will explain it in detail.

Given a linear system A*x=b, with A denoting the coefficient matrix with m rows and n cols, x the solution vector (n rows) and b the constant vector (m rows), a solution can be calculated as follows:

  • if A is square (m=n): x = A^-1 * b,
  • if A is not square (m!=n, m > n):

    A * x = b

    A'* A * x = A' * b

    x = (A' * A)^-1 * A'*b

A" = (A'*A)^-1 * A' is typically called pseudo-inverse. However this calculation does influence the condition number of the matrix negatively. A solution to this problem is using a singular value decomposition (SVD). If USV = svd(A) denotes the results of the SVD, the pseudo-inverse is given by VS"U', with S" is formed by taking the inverse of the non-zero elements of S. So A" = VS"U'.

x = A"*b

However, an SVD is rather costly, especially with large matrices. If matrix A is well-conditioned and very precise results are not necessarily required (we're talking 1e-13 or 1e-14), the much faster approach of calculating the pseudo-inverse via (A'*A)^-1 * A' can be used.
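
A minimal sketch of the SVD route to the pseudo-inverse, assuming a tall, full-rank A; the comparison against MATLAB's pinv is just a sanity check:

A = rand(500, 20);
[U, S, V] = svd(A, 0);       % economy-size SVD
Sinv = diag(1 ./ diag(S));   % invert the non-zero singular values (S")
Apinv = V * Sinv * U';       % A" = V * S" * U'
norm(Apinv - pinv(A))        % small, on the order of machine precision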

If your case actually is A*x=0, just use an SVD and read the last column of V; it is the solution.
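
A minimal sketch of that homogeneous case (hypothetical data):

A = rand(50, 4);
[~, ~, V] = svd(A, 0);
x = V(:, end);   % right singular vector of the smallest singular value
norm(A * x)      % equals the smallest singular value of A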

If you use the SVD not to solve a linear system but for the results of U and S (as your example suggests), I'm not sure what I've posted will help you.

Sources: 1, 2, 3

Here is some sample code for you to test. Try it with large matrices; you will see that using (A'*A)^-1 * A' is much faster than the alternatives.

clear all

nbRows = 30000;
nbCols = 100;
% Matrix A
A = rand(nbRows,nbCols);

% Vector b
b = rand(nbRows,1);

% A*x=b

% Solve for x, using SVD
% [U,S,V]=svd(A,0);
% x= V*((U'*b)./diag(S))
tic
[U1,S1,V1]=svd(A,0);
x1= V1*((U1'*b)./diag(S1));
toc

tic
[U1,S1,V1]=svd(A,0);
x2 = V1*inv(S1)*U1'*b;
toc

% Solve for x, using manual pseudo-inverse
% A*x=b
% A'*A*x = A'*b
% x = (A'*A)^-1 * A'*b
tic
x3 = inv(A'*A) * A'*b;
toc

% Solve for x, let Matlab decide how (QR for a dense rectangular system)
tic
x4 = A\b;
toc
Mclane answered 8/11, 2014 at 14:58 Comment(14)
What is this MhReduced = Mh' * Mh; trick called? Is it something like a covariance matrix?Defamatory
I only know it by the name of "normal equations". I have only a little knowledge of matrices and linear equations, but this could help: math.wsu.edu/faculty/genz/448/lessons/l401.pdfMclane
It seems a = rand(3,2); svd(a) and svd(a'*a) give different answers.Defamatory
Yes, U and S are different. V, however, seems to be identical (apart from a very small epsilon, depending on the matrix size), and in most cases you use an SVD to access the last column vector in V (since it is the vector corresponding to the smallest singular value and represents the best solution to the linear system). However, if you need U and S, I'm not sure how to preserve them with this method. What parts of the results of the SVD do you use?Mclane
something like this gist.github.com/mrgloom/3943410759f04265f7cb also test a= rand(3,2);svd(a)-svd(a'*a) by your own.Defamatory
I've updated my answer to make clear, how linear systems can be solved using A'*A. Let me know if it helps for your stuff.Mclane
I'm not using SVD to solve A*x=b, but I use SVD for PCA calculation. Anyway in your example you didn't put A'*A directly to SVD.Defamatory
If you're just solving linear equations in the form of A*x=0, you're only interested in V anyway, so calling svd(A) is equivalent to svd(A'*A) (apart from numerical stability). Since you are doing a PCA, you're interested in the singular vectors (contained in V) and the corresponding singular values (diagonal of S). When calling [U1,S1,V1] = svd(A,0); and [U2,S2,V2] = svd(A'*A); V1 and V2 are identical, and S1 and S2 only differ in that S1 = sqrt(S2).Mclane
OK, I understand, but the columns of the V1 and V2 matrices can differ in sign.Defamatory
This is an effect of A'*A (which only happens sometimes) but does not matter, since a singular vector is still valid if multiplied by -1 (can't find the source for this just now). However, I've read up on PCA and am a little bit confused, especially about the SVD's role. Anyway, if you want to calculate the principal components, there are two classical ways: 1. eigenVectors = princomp(A); 2. [eigenVectors, eigenValues] = eig(cov(A)). Currently I'm getting different results when using the SVD. Maybe I'm doing something wrong.Mclane
OK, I found out what the problem is. I used an uncentered data set (and so did you, by the looks of it). If you plan on also using the singular values, you will have to correct this mistake by either using princomp(A) (it centers the data itself) or passing the covariance matrix to the eigen function: eig(covarianceMat). However, if you just use SVD(A) on uncentered data, the singular values and even the vectors are simply wrong.Mclane
Short version: SVD(A) will not give the correct principal components; you have to pass either a centered data set (basically row-centered) or the covariance matrix itself into the SVD. Or you could just use eig(cov(A)); see the sketch after this thread.Mclane
math.stackexchange.com/questions/3869/…Defamatory
I have read several articles on the topic, including the link you have posted. Note that the example used there assumes the dataset to have zero mean. This is not the case when you create matrices using M = rand(n,m); you have to first demean your dataset (see my last comment). Also note that the covariance matrix is given by matCovar = (X'*X)/(nbRows-1) if the columns are the variables and the rows are the observations. Not sure why it's handled differently in the link you posted.Mclane
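
To make the centering point from this thread concrete, a minimal sketch (hypothetical data; bsxfun instead of implicit expansion keeps it compatible with older MATLAB versions):

A  = rand(200, 5);
Ac = bsxfun(@minus, A, mean(A, 1));   % demean each column (variable)

[~, S, Vsvd] = svd(Ac, 0);            % PCA via SVD of the centered data
evalsSvd = diag(S).^2 / (size(A,1) - 1);

[Veig, D] = eig(cov(A));              % PCA via eigendecomposition
[evalsEig, idx] = sort(diag(D), 'descend');
Veig = Veig(:, idx);

% evalsSvd matches evalsEig; the columns of Vsvd and Veig agree up to sign.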

The issue

First of all, I have replicated your issue in Matlab R2016b using the following code:

clear all
close all
clc

Nrows = 2500;
Ncols = 2500;

NumTests = 10;

h_A = rand(Nrows, Ncols);
d_A = gpuArray.rand(Nrows, Ncols);

timingCPU = 0;
timingGPU = 0;

for k = 1 : NumTests
    % --- Host
    tic
    [h_U, h_S, h_V] = svd(h_A);
%     h_S = svd(h_A);
    timingCPU = timingCPU + toc;

    % --- Device
    tic
    [d_U, d_S, d_V] = svd(d_A);
%     d_S = svd(d_A);
    timingGPU = timingGPU + toc;
end

fprintf('Timing CPU = %f; Timing GPU = %f\n', timingCPU / NumTests, timingGPU / NumTests);

With the above code, it is possible to compute either the singular values only or the full SVD including the singular vectors, and to compare the behavior of the CPU and GPU versions of the SVD code.

The timing is reported in the following table (timing in s; Intel Core i7-6700K CPU @ 4.00GHz, 16288 MB, Max threads(8), GTX 960):

              Sing. values only | Full SVD         | cuSOLVER (single precision)
                                |                  | Sing. val. only | Full SVD
Matrix size   CPU      GPU      | CPU       GPU    | GPU             | GPU
                                |                  |                 |
 200 x  200   0.0021    0.043   |  0.0051    0.024 |   0.098         |  0.15
1000 x 1000   0.0915    0.3     |  0.169     0.458 |   0.5           |  2.3
2500 x 2500   3.35      2.13    |  4.62      3.97  |   2.9           |  23
5000 x 5000   5.2      13.1     | 26.6      73.8   |  16.1           | 161

The first four columns compare the CPU and GPU Matlab versions of the svd routine when used to calculate the singular values only or the full SVD. As can be seen, the GPU version can be significantly slower than the CPU one. The reason has already been pointed out in the answers above: there is an inherent difficulty in parallelizing the SVD computation.

Using cuSOLVER?

At this point, the obvious question is: can we get some speedup with cuSOLVER? Indeed, we could use mexFiles to make the cuSOLVER routines run under Matlab. Unfortunately, the situation with cuSOLVER is even worse, as can be deduced from the last two columns of the above table. Those columns report the timings of the codes at Singular values calculation only with CUDA and Parallel implementation for multiple SVDs using CUDA, which use cusolverDnSgesvd for the singular-values-only calculation and the full SVD calculation, respectively. As can be seen, cuSOLVER's cusolverDnSgesvd performs even worse than Matlab, especially considering that it works in single precision while Matlab works in double precision.

The motivation for this behavior is further explained at cusolverDnCgesvd performance vs MKL where Joe Eaton, manager of cuSOLVER library, says

I understand the confusion here. We do provide a decent speedup for LU, QR and LDL^t factorizations, which is what we would like to say for SVD as well. Our purpose with cuSOLVER is to provide dense and sparse direct solvers as part of the CUDA toolkit for the first time; we have to start somewhere. Since CULA is no longer supported, we felt it was urgent to get some functionality into the hands of developers in CUDA 7.0. Since CUDA runs on more than x86 host CPUs these days, cuSOLVER fills a need where there is no MKL. That being said, we can do better with SVD, but it will have to wait for the next CUDA release, priorities and timelines being tight already.

Using other libraries

At this point, other possibilities are using other libraries like

  1. CULA;
  2. MAGMA;
  3. ArrayFire.

CULA is not offered for free, so I have not tried it.

I had some installation issues with the MAGMA dependencies, so I have not investigated this point further (disclaimer: I expect that, with some more time, I would be able to solve those issues).

I finally ended up using ArrayFire.

Using ArrayFire, I had the following timing for the full SVD computation:

 200 x  200      0.036
1000 x 1000      0.2
2500 x 2500      4.5
5000 x 5000     29

As can be seen, the timing is slightly higher than, but now comparable to, the CPU case.

Here is the ArrayFire code:

#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>

using namespace af;

int main(int argc, char *argv[])
{
    const int N = 1000;

    try {

        // --- Select a device and display arrayfire info
        int device = argc > 1 ? atoi(argv[1]) : 0;
        af::setDevice(device);
        af::info();

        array A = randu(N, N, f64);
        af::array U, S, Vt;

        // --- Warm-up run; the first call includes initialization overhead
        timer time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        double elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

        time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

    }
    catch (af::exception& e) {

        fprintf(stderr, "%s\n", e.what());
        throw;
    }

    return 0;
}
Alenealenson answered 15/5, 2017 at 11:36 Comment(0)

I spent over a month trying to parallelize SVD on my laptop equipped with a GTX 460; this was also part of my undergraduate thesis. After many experiments I discovered that MATLAB is extremely fast and outperforms my code. By the way, I used one-sided Jacobi, and I have not yet seen any paper that reveals an algorithm faster than MATLAB's svd. On a GPU, the time cost of memory copies can be very high if you are not using an elegant model; I refer you to read more about CUDA. If you need any help, please contact me. A sketch of the one-sided Jacobi idea follows.
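
For illustration, a minimal one-sided (Hestenes) Jacobi SVD sketch in MATLAB; it is unoptimized, assumes A has full column rank, and only shows the structure of the technique, not the author's CUDA code:

function [U, S, V] = jacobiSvdSketch(A)
% One-sided Jacobi: rotate pairs of columns of A until all pairs are
% orthogonal; the column norms are then the singular values.
n = size(A, 2);
V = eye(n);
converged = false;
while ~converged
    converged = true;
    for p = 1:n-1
        for q = p+1:n
            alpha = A(:,p)' * A(:,p);
            beta  = A(:,q)' * A(:,q);
            gamma = A(:,p)' * A(:,q);
            if abs(gamma) > 1e-12 * sqrt(alpha * beta)
                converged = false;
                zeta = (beta - alpha) / (2 * gamma);
                if zeta == 0
                    t = 1;                          % 45-degree rotation
                else
                    t = sign(zeta) / (abs(zeta) + sqrt(1 + zeta^2));
                end
                c = 1 / sqrt(1 + t^2);  s = c * t;
                J = [c s; -s c];
                A(:, [p q]) = A(:, [p q]) * J;      % orthogonalize the pair
                V(:, [p q]) = V(:, [p q]) * J;      % accumulate V
            end
        end
    end
end
sv = sqrt(sum(A.^2, 1));                 % singular values = column norms
[sv, idx] = sort(sv, 'descend');
S = diag(sv);
U = bsxfun(@rdivide, A(:, idx), sv);     % normalized columns -> U
V = V(:, idx);
end

On random test matrices, norm(U*S*V' - A0) (with A0 the original input) should come out near machine precision; the runtime, on the other hand, makes clear why the LAPACK-backed built-in wins.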

Andromache answered 28/10, 2016 at 6:3 Comment(1)
This should not be an answer but a comment. :)Incurrent
