What is the fastest way to calculate first two principal components in R?

I am using princomp in R to perform PCA. My data matrix is huge (10K x 10K, with each value given to 4 decimal places). It takes ~3.5 hours and ~6.5 GB of physical memory on a Xeon 2.27 GHz processor.

Since I only want the first two components, is there a faster way to do this?

Update:

In addition to speed, is there a memory-efficient way to do this?

It takes ~2 hours and ~6.3 GB of physical memory to calculate the first two components using svd(,2,).
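A sketch of such a call (assuming the columns of M are centered first so the truncated SVD corresponds to PCA; the exact call may differ):

Mc <- scale(M, center = TRUE, scale = FALSE)  # column-center so the SVD matches PCA
s <- svd(Mc, nu = 2, nv = 0)                  # keep only two left singular vectors
scores <- s$u %*% diag(s$d[1:2])              # scores on the first two components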

Atticism answered 28/11, 2011 at 17:3 Comment(1)
The NIPALS algorithm can be used. Search the R packages for that.Euterpe
21

You sometimes get access to so-called 'economical' decompositions which allow you to cap the number of eigenvalues / eigenvectors. It looks like eigen() and prcomp() do not offer this, but svd() allows you to specify the maximum number to compute.

On small matrices, the gains seem modest:

R> set.seed(42); N <- 10; M <- matrix(rnorm(N*N), N, N)
R> library(rbenchmark)
R> benchmark(eigen(M), svd(M,2,0), prcomp(M), princomp(M), order="relative")
          test replications elapsed relative user.self sys.self user.child
2 svd(M, 2, 0)          100   0.021  1.00000      0.02        0          0
3    prcomp(M)          100   0.043  2.04762      0.04        0          0
1     eigen(M)          100   0.050  2.38095      0.05        0          0
4  princomp(M)          100   0.065  3.09524      0.06        0          0
R> 

but the factor of three relative to princomp() may make it worth your while to reconstruct princomp() from svd(), since svd() allows you to stop after two values.
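A minimal sketch of that reconstruction (the function name and conventions here are my own; it column-centers the data, as prcomp() does):

first_two_pcs <- function(M) {
  Mc <- scale(M, center = TRUE, scale = FALSE)  # center columns, as prcomp() does
  s <- svd(Mc, nu = 2, nv = 2)                  # request only two vectors per side
  list(rotation = s$v,                          # loadings of PC1 and PC2
       x = s$u %*% diag(s$d[1:2]))              # scores on PC1 and PC2
}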

Tugboat answered 28/11, 2011 at 20:4 Comment(3)
With N=200 my machine does princomp the fastest (not by much, basically equal to svd(,2,)), so the results may vary with processor and with scaling.Saideman
benchmark() is in the rbenchmark package; there is also the microbenchmark package.Tugboat
fast.svd in the corpcor package is wicked fast.Gibert
6

The 'svd' package provides routines for truncated SVD / eigendecomposition via the Lanczos algorithm. You can use it to calculate just the first two principal components.

Here is what I get:

> library(svd)
> set.seed(42); N <- 1000; M <- matrix(rnorm(N*N), N, N)
> system.time(svd(M, 2, 0))
   user  system elapsed 
  7.355   0.069   7.501 
> system.time(princomp(M))
   user  system elapsed 
  5.985   0.055   6.085 
> system.time(prcomp(M))
   user  system elapsed 
  9.267   0.060   9.368 
> system.time(trlan.svd(M, neig = 2))
   user  system elapsed 
  0.606   0.004   0.614 
> system.time(trlan.svd(M, neig = 20))
   user  system elapsed 
  1.894   0.009   1.910
> system.time(propack.svd(M, neig = 20))
   user  system elapsed 
  1.072   0.011   1.087 
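To turn the truncated SVD into PCA proper, center the columns first; a sketch using propack.svd (which returns d, u, and v):

Mc <- scale(M, center = TRUE, scale = FALSE)  # PCA works on column-centered data
p <- propack.svd(Mc, neig = 2)                # truncated SVD, two singular triplets
scores <- p$u %*% diag(p$d[1:2])              # like prcomp(M)$x[, 1:2] (up to sign)
loadings <- p$v                               # like prcomp(M)$rotation[, 1:2]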
Wendiwendie answered 29/11, 2011 at 16:47 Comment(2)
As my data is a square matrix, is there a hack to feed only the upper/lower triangular matrix into any of the functions (svd, princomp, prcomp)? That would save the memory-consuming step of duplicating the lower triangle as the upper triangle!Atticism
I don't think that this is possible for "usual" functions. For stuff from "svd" package you can use the so-called "external matrix interface" where you just define how to multiply matrix by a vector, and that's all. Right now this API is C-level only, but rumors are that everything will be propagated to ordinary R level soon, so one can write their own routines in R (and surely exploit the symmetry or sparseness of the matrix).Wendiwendie
4

I tried the pcaMethods package's implementation of the NIPALS algorithm. By default it calculates the first 2 principal components. It turns out to be slower than the other suggested methods.

set.seed(42); N <- 10; M <- matrix(rnorm(N*N), N, N)
library(pcaMethods)
library(rbenchmark)
m1 <- pca(M, method="nipals", nPcs=2)
benchmark(pca(M, method="nipals"),
          eigen(M), svd(M,2,0), prcomp(M), princomp(M), order="relative")

                       test replications elapsed relative user.self sys.self
3              svd(M, 2, 0)          100    0.02      1.0      0.02        0
2                  eigen(M)          100    0.03      1.5      0.03        0
4                 prcomp(M)          100    0.03      1.5      0.03        0
5               princomp(M)          100    0.05      2.5      0.05        0
1 pca(M, method = "nipals")          100    0.23     11.5      0.24        0
Dkl answered 28/11, 2012 at 15:27 Comment(0)
4

I am surprised nobody has mentioned the irlba package yet:

It is even a bit faster than the svd package's propack.svd, provides a stats::prcomp-like interface for convenience via irlba::prcomp_irlba(X, n=2), and needed no parameter adjustments in the following benchmark on rectangular matrices (2:1) of varying size. For matrices of size 6000x3000 it is 50 times faster than stats::prcomp. For matrices smaller than 100x50, stats::svd is still faster, though.
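For the 10K x 10K case from the question, the convenience interface boils down to a single call (a sketch; prcomp_irlba centers the columns by default, like prcomp):

p <- irlba::prcomp_irlba(M, n = 2)  # truncated PCA, two components
head(p$x)                           # scores on the first two PCs
p$rotation                          # the two loading vectors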

[benchmark results plot, produced by the code below: log-log time vs. matrix size for each method]

library(microbenchmark)
library(tidyverse)
#install.packages(c("svd", "corpcor", "irlba", "rsvd"))

exprs <- rlang::exprs(
  svd(M, 2, 2)$v,
  prcomp(M)$rotation[,1:2],
  irlba::prcomp_irlba(M, n=2)$rotation,
  irlba::svdr(M, k=2)$v,
  rsvd::rsvd(M, 2)$v,
  svd::propack.svd(M, neig=2, opts=list(maxiter=100))$v,
  corpcor::fast.svd(M)$v[,1:2]
)

set.seed(42)
tibble(N=c(10,30,100,300,1000,3000)) %>%
  group_by(N) %>%
  do({
    M <- scale(matrix(rnorm(.$N*.$N*2), .$N*2, .$N))
    microbenchmark(!!!exprs,
      times=min(100, ceiling(3000/.$N)))%>%
      as_tibble
  }) %>% 
ggplot(aes(x=N, y=time/1E9,color=expr)) +
  geom_jitter(width=0.05) +
  scale_x_log10("matrix size (2N x N)") +
  scale_y_log10("time [s]") +
  stat_summary(fun.y = median, geom="smooth") +
  scale_color_discrete(labels = partial(str_wrap, width=30))

The randomized SVD provided by rsvd is even faster, but unfortunately the result is quite far off:

set.seed(42)
N <- 1000
M <- scale(matrix(rnorm(N^2*2), N*2, N))
cor(magrittr::set_colnames(sapply(exprs, function(x) eval(x)[,1]), sapply(exprs, deparse)))  # set_colnames is from magrittr, not attached by tidyverse
                                                       svd(M, 2, 2)$v prcomp(M)$rotation[, 1:2] irlba::prcomp_irlba(M, n = 2)$rotation irlba::svdr(M, k = 2)$v rsvd::rsvd(M, 2)$v svd::propack.svd(M, neig = 2, opts = list(maxiter = 100))$v corpcor::fast.svd(M)$v[, 1:2]
svd(M, 2, 2)$v                                                   1.0000000                 1.0000000                             -1.0000000               0.9998748           0.286184                                                   1.0000000                     1.0000000
prcomp(M)$rotation[, 1:2]                                        1.0000000                 1.0000000                             -1.0000000               0.9998748           0.286184                                                   1.0000000                     1.0000000
irlba::prcomp_irlba(M, n = 2)$rotation                          -1.0000000                -1.0000000                              1.0000000              -0.9998748          -0.286184                                                  -1.0000000                    -1.0000000
irlba::svdr(M, k = 2)$v                                          0.9998748                 0.9998748                             -0.9998748               1.0000000           0.290397                                                   0.9998748                     0.9998748
rsvd::rsvd(M, 2)$v                                               0.2861840                 0.2861840                             -0.2861840               0.2903970           1.000000                                                   0.2861840                     0.2861840
svd::propack.svd(M, neig = 2, opts = list(maxiter = 100))$v      1.0000000                 1.0000000                             -1.0000000               0.9998748           0.286184                                                   1.0000000                     1.0000000
corpcor::fast.svd(M)$v[, 1:2]                                    1.0000000                 1.0000000                             -1.0000000               0.9998748           0.286184                                                   1.0000000                     1.0000000

rsvd might do better when the data actually has low-rank structure, though.

Donner answered 5/4, 2019 at 20:51 Comment(0)
1

The power method might be what you want. If you code it in R, which is not hard at all, you may find that it is no faster than the SVD approach suggested in another answer, which makes use of compiled LAPACK routines.
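For what it's worth, a minimal power-iteration sketch (my own code, assuming M is the data matrix; it multiplies by X and t(X) separately so the covariance matrix is never formed, then deflates to get the second component):

power_pc <- function(X, tol = 1e-9, maxit = 1000) {
  v <- rnorm(ncol(X))
  v <- v / sqrt(sum(v^2))                 # random unit start vector
  for (i in seq_len(maxit)) {
    w <- crossprod(X, X %*% v)            # t(X) %*% X %*% v, covariance-free
    w <- w / sqrt(sum(w^2))
    if (sum(abs(w - v)) < tol) break
    v <- w
  }
  v
}

Xc <- scale(M, center = TRUE, scale = FALSE)
v1 <- power_pc(Xc)                        # first principal direction
Xd <- Xc - (Xc %*% v1) %*% t(v1)          # deflate: project out PC1
v2 <- power_pc(Xd)                        # second principal direction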

Everyway answered 29/11, 2011 at 8:23 Comment(2)
I would advise against this as the power method has extremely slow convergence.Fremd
This is true in many cases. Convergence speed depends on the ratio of the largest eigenvalue to the next, so it will be problem-dependent. Still, I think the method might be competitive if only two eigenvectors are sought and the matrix is very large. No way of knowing without trying.Everyway
0

You can use a neural-network approach to find the principal components. A basic description is given here: http://www.heikohoffmann.de/htmlthesis/node26.html

The first principal component is y = w1*x1 + w2*x2, and (for two input variables) the second, orthogonal component can be calculated as q = w2*x1 - w1*x2.
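In case it helps, here is a toy sketch of one standard neural-network PCA update, Oja's rule (my own illustration for two input variables, not necessarily the exact algorithm from the linked thesis):

set.seed(1)
X <- scale(cbind(rnorm(1000), rnorm(1000)))  # toy 2-D data, centered and scaled
X[, 2] <- X[, 2] + 0.8 * X[, 1]              # correlate the two variables
w <- c(1, 0); eta <- 0.01                    # initial weights, learning rate
for (i in sample(nrow(X), 5000, replace = TRUE)) {
  y <- sum(w * X[i, ])                       # neuron output: projection onto w
  w <- w + eta * y * (X[i, ] - y * w)        # Oja's update, keeps w near unit length
}
w                                            # ~ first principal direction (w1, w2)
c(w[2], -w[1])                               # the orthogonal second direction q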

Quean answered 29/11, 2011 at 7:43 Comment(0)
0

The "gmodels" and "corpcor" R packages come with faster implementations of SVD and PCA. These perform similarly to the core versions for small matrices:

> set.seed(42); N <- 10; M <- matrix(rnorm(N*N), N*N, N)
> library("rbenchmark")
> library("gmodels")    
> benchmark(svd(M,2,0), svd(M), gmodels::fast.svd(M), corpcor::fast.svd(M), prcomp(M), gmodels::fast.prcomp(M), princomp(M), order="relative")
                     test replications elapsed relative user.self sys.self user.child sys.child
1            svd(M, 2, 0)          100   0.005      1.0     0.005    0.000          0         0
2                  svd(M)          100   0.006      1.2     0.005    0.000          0         0
3    gmodels::fast.svd(M)          100   0.007      1.4     0.006    0.000          0         0
4    corpcor::fast.svd(M)          100   0.007      1.4     0.007    0.000          0         0
6 gmodels::fast.prcomp(M)          100   0.014      2.8     0.014    0.000          0         0
5               prcomp(M)          100   0.015      3.0     0.014    0.001          0         0
7             princomp(M)          100   0.030      6.0     0.029    0.001          0         0
> 

However, they provide a faster result for larger matrices (especially those with many rows).

> set.seed(42); N <- 10; M <- matrix(rnorm(N*N), N*N*N, N)
> library("rbenchmark")
> library("gmodels")
> benchmark(svd(M,2,0), svd(M), gmodels::fast.svd(M), corpcor::fast.svd(M), prcomp(M), gmodels::fast.prcomp(M), order="relative")

                     test replications elapsed relative user.self sys.self user.child sys.child
4    corpcor::fast.svd(M)          100   0.029    1.000     0.028    0.001          0         0
3    gmodels::fast.svd(M)          100   0.035    1.207     0.033    0.001          0         0
2                  svd(M)          100   0.037    1.276     0.035    0.002          0         0
1            svd(M, 2, 0)          100   0.039    1.345     0.037    0.001          0         0
5               prcomp(M)          100   0.068    2.345     0.061    0.006          0         0
6 gmodels::fast.prcomp(M)          100   0.068    2.345     0.060    0.007          0         0
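If you only need the first two components, you can combine fast.svd with centering (a sketch; fast.svd computes the full decomposition, so the speedup comes from the algorithm, not from truncation):

Mc <- scale(M, center = TRUE, scale = FALSE)  # column-center for PCA
s <- corpcor::fast.svd(Mc)                    # SVD via the smaller crossproduct
loadings <- s$v[, 1:2]                        # first two loading vectors
scores <- Mc %*% loadings                     # scores on the first two components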
Swansdown answered 13/6, 2018 at 5:38 Comment(4)
Your benchmark nicely shows that the gmodels functions are not actually faster.Donner
That depends on whether you're using PCA or SVD. The question also specifically pertains to performance on large matrices.Swansdown
35 ms instead of 37 ms is not really faster. 1000x10 is still very small compared to OP's 10000 squared. You probably meant to add the *Ns also in the rnorm calls; currently you are testing matrices where all columns are identical (one of the many design flaws of R), which is probably not an ideal test case. Both packages claim an advantage only for fat/wide matrices, but even there I did not observe a real advantage in a quick test. If you find the time to address these issues, your answer would be as useful as Kevin Wright's answer.Donner
Yes, this is not ideal benchmarking. I did not have much time to run large matrices at the time of posting. The purpose was not to test extensively or to give the definitive answer, but to put more options on the table (using the same benchmarking as that answer). I'd recommend that anyone seriously applying this try out larger test jobs before applying it to a significantly larger matrix, and consider that results may differ from those on smaller matrices due to overheads.Swansdown
-1

You could write the function yourself and stop at 2 components. It is not too difficult. I have it lying around somewhere; if I find it I will post it.

Yoo answered 28/11, 2011 at 17:8 Comment(4)
Maybe you can give the logic of the function; I can try to code it myself!Atticism
As a primer to PCA, I did a blog post where I tried to explain this in terms of OLS: cerebralmastication.com/2010/09/… Down at the bottom there's a link to an article by Lindsay I Smith which I found really helpful. Link to Smith PDF: cs.otago.ac.nz/cosc453/student_tutorials/…Pizor
@JD Long: That's an interesting article. Let me try !Atticism
it might be worth looking at the pcaMethods package from the Bioc project. I have no idea how fast it is, but it's another reference point. bioconductor.org/packages/release/bioc/html/pcaMethods.htmlPizor
