crossprod(m1, m2) is running slower than t(m1) %*% m2 on my machine
Asked Answered
J

2

5

Why does t(mat1) %*% mat2 work quicker than crossprod(mat1, mat2). Isn't the whole point of the latter that it calls a more efficient low-level routine?

r$> mat1 <- array(rnorm(100 * 600), dim = c(100, 600))

r$> mat2 <- array(rnorm(100 * 800), dim = c(100, 800))

r$> microbenchmark::microbenchmark(crossprod(mat1, mat2), t(mat1) %*% mat2)
Unit: milliseconds
                  expr     min       lq     mean   median       uq     max neval
 crossprod(mat1, mat2) 37.1905 44.35975 48.49262 47.61035 50.99725 76.1250   100
      t(mat1) %*% mat2 25.0813 31.68405 35.46811 35.18380 39.16565 49.8131   100

Running on

r$> version
               _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
crt            ucrt
system         x86_64, mingw32
status
major          4
minor          4.1
year           2024
month          06
day            14
svn rev        86737
language       R
version.string R version 4.4.1 (2024-06-14 ucrt)
nickname       Race for Your Life

EDIT: local machine env

r$> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default

r$> Sys.getenv("R_BLAS_LIBS")
    Sys.getenv("R_LAPACK_LIBS")

[1] ""
[1] ""

r$> getOption("matprod")
[1] "default"

r$> extSoftVersion()
                     zlib                     bzlib                        xz 
                  "1.3.1"      "1.0.8, 13-Jul-2019"                   "5.4.6" 
               libdeflate                      PCRE                       ICU 
                   "1.19"        "10.43 2024-02-16"                    "74.2" 
                      TRE                     iconv                  readline 
"TRE 0.8.0 R_fixes (BSD)"               "win_iconv"                        "" 
                     BLAS 
                       "" 
Jessejessee answered 3/10 at 5:24 Comment(8)
collinerickson.github.io/2018/04/23/using-crossprod-in-rHeater
@Heater DYK by chance which BLAS/LAPACK your link is using? Speed might heavily depend on that.Snowblind
@Snowblind unfortunately it was not revealed in that postHeater
@Snowblind Sys.getenv is just showing the environment variables, there are no explicit BLAS or LAPACK libs linked, so from my understanding it's just using the BLAS library R ships with on Windows.Jessejessee
I'm getting similar results with /usr/local/lib/R/lib/libRblas.so, R-devel, with 10x as many rows crossprod median is 343 ms vs. t(x) %*% y median of 240 ms. If no-one can answer here, maybe worth bringing this up on [email protected] ?Blotter
@Jessejessee Might be using default Netlib BLAS since no other BLAS is set. Could you try sth like dir(file.path(R.home(), "bin", "x64"), pattern="blas|lapack", full.names=TRUE)?Snowblind
@Snowblind r$> dir(file.path(R.home(), "bin", "x64"), pattern="blas|lapack", full.names=TRUE) [1] ".../R/R-44~1.1/bin/x64/Rblas.dll" ".../R/R-44~1.1/bin/x64/Rlapack.dll" just the default binaries for Windows...right?Jessejessee
@Jessejessee I think they are NETLIB and coming with R.Snowblind
B
4

This is not an answer, but too long for a comment:

Both crossprod and simple matrix multiplication can either do simple brute-force triple loops, or call BLAS routine gemm ["general matrix-matrix (multiplication)"] with appropriate arguments, depending on (1) how R is configured and (2) whether there are any NA values in the arguments. Thus, given that t(x) %*% y necessarily involves the intermediate steps of allocating space to hold the transposed matrix and doing the actual transposition of values, one would expect that either t(x) %*% y and crossprod would both use BLAS, or neither would, and therefore that crossprod would never be slower than transposition + matrix multiplication; if the transposition step is very quick relative to all of the floating-point operations involved in the multiplication, though, the difference might be lost in the noise.

It makes sense that as the matrices get bigger, the cost of doing the transposition (O(n^2), not counting allocation time) shrinks relative to the cost of the multiplication (O(n^3) if being boneheaded, looks like BLAS uses Strassen's algorithm which is O(n^2.8)), so the advantage of crossprod should shrink. (Furthermore, transposition just has to move stuff around/assign values, whereas matrix multiplication has to do actual arithmetic, so the matrix multiplication should have a larger constant/be more expensive, all other things being equal). This pattern is noted by the blog post linked by @ThomasIsCoding in the comments. But that still doesn't explain why crossprod should do worse ...

If you want to dig through the code, you can find e.g.

   R_xlen_t NRX = nrx, NRY = nry, NCX = ncx;
    for (int i = 0; i < ncx; i++)
    for (int k = 0; k < ncy; k++) {
        sum = 0.0;                  
        for (int j = 0; j < nrx; j++)   
        sum += x[j + i * NRX] * y[j + k * NRY];
        z[i + k * NCX] = (double) sum;  
    }

Note related questions here, here, but neither seems to explain this pattern; the latter (which does a better job than I did comparing the transposition and multiplication costs) explains why t(x) %*% y might be faster under some circumstances when y contains lots of zeros — not the case here ...

Blotter answered 4/10 at 1:47 Comment(1)
Thanks for the insight! In my answer, I ran benchmarks across different BLAS libraries—crossprod() is consistently faster for all BLAS except NETLIB, where t(m1) %*% m2 unexpectedly outperforms. My results also support your explanation about transposition costs since crossprod() performance decreases with growing matrices.Snowblind
S
4

Also too long for a comment:

Since I use flexiblas, I was able to do a cross-BLAS benchmarking relatively easily.

The matrices m1, m2 were created with increasing scaling factors m1 was created with dim = [100, 6*fac], m2 was created with dim = [100, 8*fac], where the size factor fac = [1, 8, 64, 512].

Using NETLIB, t(mat1) %*% mat2 was significantly faster than crossprod(mat1, mat2) (fac=64; d=15.959; p<0.0001). In contrast, for all other BLAS libraries, crossprod() consistently outperformed t(mat1) %*% mat2, with OPENBLAS-SERIAL showing the fastest performance (fac=64; d=-0.460; p<0.0001) among single-threaded BLAS libraries compared to NETLIB. Notably, OPENBLAS-OPENMP, being multi-threaded, showed minimal difference between the two operations (fac=64; d=-0.145; p=0.3206).

Conclusion: The results suggest a potential inefficiency in NETLIB's handling of crossprod(). Overall, the within effect sizes decrease, which supports @Ben Bolker's' mechanistic explanation of transposition costs.

Results

In the result tables below, wth refers to within BLAS and btw to between BLAS wit NETLIB as reference BLAS. d refers to Cohen'd d, and p to the p value. Benchmarks were performed using microbenchmark() with 100 repetitions for each configuration. Times are measured in milliseconds.

Scaling factor 1

library                  expr                fct   mean_time   median_time   sd_time   p_wth       d_wth    p_btw       d_btw   
NETLIBv3.9.0*            crossprod(m1, m2)   1     0.008       0.008         0.001     1.430e-08   -0.616   NA              NA  
NETLIBv3.9.0*            t(m1) %*% m2        1     0.011       0.011         0.006     1.430e-08   -0.616   NA              NA  
ATLASv3.10.3             crossprod(m1, m2)   1     0.005       0.005         0.001     8.934e-13   -0.148   1.550e-57   -3.382  
ATLASv3.10.3             t(m1) %*% m2        1     0.011       0.010         0.008     8.934e-13   -0.148   4.034e-01   -0.006  
BLISv0.9.0               crossprod(m1, m2)   1     0.011       0.011         0.001     4.456e-03   -0.215   4.203e-77    6.260  
BLISv0.9.0               t(m1) %*% m2        1     0.021       0.017         0.032     4.456e-03   -0.215   7.246e-04    0.020  
OPENBLASv0.3.21-OPENMP   crossprod(m1, m2)   1     0.004       0.004         0.001     6.543e-10   -0.351   5.952e-70   -4.883  
OPENBLASv0.3.21-OPENMP   t(m1) %*% m2        1     0.010       0.009         0.010     6.543e-10   -0.351   9.297e-03   -0.042  
OPENBLASv0.3.21-SERIAL   crossprod(m1, m2)   1     0.004       0.004         0.001     1.850e-09   -0.164   1.577e-71   -4.749  
OPENBLASv0.3.21-SERIAL   t(m1) %*% m2        1     0.011       0.009         0.011     1.850e-09   -0.164   1.557e-01   -0.019  

Scaling factor 8

library                  expr                fct   mean_time   median_time   sd_time   p_wth       d_wth    p_btw       d_btw   
NETLIBv3.9.0*            crossprod(m1, m2)   8     0.437       0.420         0.079     9.181e-22    1.464   NA              NA  
NETLIBv3.9.0*            t(m1) %*% m2        8     0.332       0.324         0.063     9.181e-22    1.464   NA              NA  
ATLASv3.10.3             crossprod(m1, m2)   8     0.142       0.142         0.009     6.875e-47   -3.022   1.083e-60   -4.984  
ATLASv3.10.3             t(m1) %*% m2        8     0.171       0.173         0.010     6.875e-47   -3.022   1.872e-46   -3.210  
BLISv0.9.0               crossprod(m1, m2)   8     0.071       0.066         0.039     1.535e-10   -0.946   2.060e-65   -5.887  
BLISv0.9.0               t(m1) %*% m2        8     0.099       0.099         0.015     1.535e-10   -0.946   1.614e-60   -4.769  
OPENBLASv0.3.21-OPENMP   crossprod(m1, m2)   8     0.088       0.072         0.167     2.691e-01   -0.129   2.640e-35   -2.665  
OPENBLASv0.3.21-OPENMP   t(m1) %*% m2        8     0.106       0.107         0.022     2.691e-01   -0.129   1.186e-57   -4.709  
OPENBLASv0.3.21-SERIAL   crossprod(m1, m2)   8     0.054       0.053         0.012     2.108e-40   -2.472   1.146e-71   -6.384  
OPENBLASv0.3.21-SERIAL   t(m1) %*% m2        8     0.082       0.085         0.011     2.108e-40   -2.472   7.725e-64   -4.982  

Scaling factor 64

library                  expr                fct   mean_time   median_time   sd_time   p_wth        d_wth    p_btw        d_btw    
NETLIBv3.9.0*            crossprod(m1, m2)   64    26.041      25.967        0.412     5.163e-104   15.959   NA                NA  
NETLIBv3.9.0*            t(m1) %*% m2        64    15.590      15.529        0.841     5.163e-104   15.959   NA                NA  
ATLASv3.10.3             crossprod(m1, m2)   64     7.569       7.443        0.283     8.863e-05    -0.511   3.913e-182   -48.035  
ATLASv3.10.3             t(m1) %*% m2        64     7.755       7.587        0.422     8.863e-05    -0.511   7.258e-98    -11.433  
BLISv0.9.0               crossprod(m1, m2)   64     2.258       2.073        0.347     4.938e-10    -0.743   1.732e-195   -60.983  
BLISv0.9.0               t(m1) %*% m2        64     2.552       2.366        0.433     4.938e-10    -0.743   1.217e-119   -18.932  
OPENBLASv0.3.21-OPENMP   crossprod(m1, m2)   64     1.421       0.836        1.063     2.079e-01    -0.145   4.644e-148   -22.831  
OPENBLASv0.3.21-OPENMP   t(m1) %*% m2        64     1.658       1.012        1.948     2.079e-01    -0.145   4.666e-84     -9.233  
OPENBLASv0.3.21-SERIAL   crossprod(m1, m2)   64     2.168       2.050        0.277     1.798e-04    -0.460   5.601e-198   -59.534  
OPENBLASv0.3.21-SERIAL   t(m1) %*% m2        64     2.330       2.173        0.406     1.798e-04    -0.460   4.006e-121   -19.346  

Scaling factor 512

library                  expr                fct   mean_time   median_time   sd_time   p_wth        d_wth    p_btw        d_btw     
NETLIBv3.9.0*            crossprod(m1, m2)   512   1733.759    1728.830      16.205    5.592e-129   25.400   NA                 NA  
NETLIBv3.9.0*            t(m1) %*% m2        512   1241.799    1247.721      21.981    5.592e-129   25.400   NA                 NA  
ATLASv3.10.3             crossprod(m1, m2)   512    489.746     486.333      10.618    1.794e-01    -0.153   4.427e-181    -90.790  
ATLASv3.10.3             t(m1) %*% m2        512    491.527     486.131      12.480    1.794e-01    -0.153   6.680e-149    -41.802  
BLISv0.9.0               crossprod(m1, m2)   512    147.600     146.746       4.762    1.699e-17    -1.431   3.411e-204   -109.626  
BLISv0.9.0               t(m1) %*% m2        512    152.862     152.876       1.994    1.699e-17    -1.431   1.190e-170    -61.923  
OPENBLASv0.3.21-OPENMP   crossprod(m1, m2)   512    100.148      96.742      14.906    3.206e-01    -0.136   2.568e-189   -104.908  
OPENBLASv0.3.21-OPENMP   t(m1) %*% m2        512    102.031     101.905      12.570    3.206e-01    -0.136   1.421e-162    -64.349  
OPENBLASv0.3.21-SERIAL   crossprod(m1, m2)   512    139.406     137.633       6.863    4.804e-03    -0.324   1.387e-202   -118.417  
OPENBLASv0.3.21-SERIAL   t(m1) %*% m2        512    141.396     139.434       5.187    4.804e-03    -0.324   8.653e-169    -69.579  

* NETLIB is reference for between (btw) comparisons.

Note

R version 4.4.1 (2024-06-14)
Platform: x86_64-redhat-linux-gnu
Running under: AlmaLinux 9.4 (Seafoam Ocelot)
CPUs: 8

Additional results with one NA in matrix m1

Scaling factor 1

library                  expr                fct   mean_time   median_time   sd_time   p_wth       d_wth    p_btw       d_btw   
NETLIBv3.9.0*            crossprod(m1, m2)   1     0.007       0.007         0.002     1.450e-26   -1.967   NA             NA   
NETLIBv3.9.0*            t(m1) %*% m2        1     0.013       0.012         0.003     1.450e-26   -1.967   NA             NA   
ATLASv3.10.3             crossprod(m1, m2)   1     0.007       0.007         0.000     1.272e-35   -1.011   2.891e-02   -0.296  
ATLASv3.10.3             t(m1) %*% m2        1     0.013       0.012         0.003     1.272e-35   -1.011   2.127e-03   -0.060  
BLISv0.9.0               crossprod(m1, m2)   1     0.007       0.007         0.001     6.845e-26   -1.854   5.205e-01   -0.091  
BLISv0.9.0               t(m1) %*% m2        1     0.013       0.012         0.004     6.845e-26   -1.854   3.049e-01   0.019   
OPENBLASv0.3.21-OPENMP   crossprod(m1, m2)   1     0.007       0.007         0.000     2.076e-34   -0.982   2.691e-02   -0.301  
OPENBLASv0.3.21-OPENMP   t(m1) %*% m2        1     0.012       0.012         0.003     2.076e-34   -0.982   1.230e-05   -0.094  
OPENBLASv0.3.21-SERIAL   crossprod(m1, m2)   1     0.007       0.007         0.001     7.888e-30   -1.909   3.296e-01   -0.134  
OPENBLASv0.3.21-SERIAL   t(m1) %*% m2        1     0.013       0.012         0.004     7.888e-30   -1.909   1.188e-02   0.120   

Scaling factor 8

library                  expr                fct   mean_time   median_time   sd_time   p_wth       d_wth    p_btw       d_btw   
NETLIBv3.9.0*            crossprod(m1, m2)   8     0.402       0.401         0.007     6.664e-60   -3.085   NA             NA   
NETLIBv3.9.0*            t(m1) %*% m2        8     0.437       0.439         0.012     6.664e-60   -3.085   NA             NA   
ATLASv3.10.3             crossprod(m1, m2)   8     0.404       0.402         0.009     1.349e-55   -2.863   8.134e-02   0.196   
ATLASv3.10.3             t(m1) %*% m2        8     0.439       0.440         0.014     1.349e-55   -2.863   3.782e-03   0.212   
BLISv0.9.0               crossprod(m1, m2)   8     0.409       0.409         0.010     5.239e-47   -3.026   3.474e-09   0.772   
BLISv0.9.0               t(m1) %*% m2        8     0.444       0.446         0.013     5.239e-47   -3.026   1.284e-10   0.560   
OPENBLASv0.3.21-OPENMP   crossprod(m1, m2)   8     0.409       0.409         0.007     1.867e-59   -3.908   9.066e-11   0.976   
OPENBLASv0.3.21-OPENMP   t(m1) %*% m2        8     0.446       0.444         0.011     1.867e-59   -3.908   2.657e-10   0.818   
OPENBLASv0.3.21-SERIAL   crossprod(m1, m2)   8     0.401       0.401         0.006     3.623e-58   -3.076   1.955e-01   -0.139  
OPENBLASv0.3.21-SERIAL   t(m1) %*% m2        8     0.437       0.439         0.013     3.623e-58   -3.076   6.519e-01   0.032   

Scaling factor 64

library                  expr                fct   mean_time   median_time   sd_time   p_wth        d_wth     p_btw       d_btw   
NETLIBv3.9.0*            crossprod(m1, m2)   64    25.667      25.543        0.312     5.846e-108   -16.586   NA              NA  
NETLIBv3.9.0*            t(m1) %*% m2        64    33.771      33.649        0.617     5.846e-108   -16.586   NA              NA  
ATLASv3.10.3             crossprod(m1, m2)   64    25.737      25.710        0.344     8.459e-103   -13.497   7.703e-02    0.212  
ATLASv3.10.3             t(m1) %*% m2        64    33.219      33.028        0.689     8.459e-103   -13.497   3.771e-13   -0.841  
BLISv0.9.0               crossprod(m1, m2)   64    25.669      25.548        0.536     1.261e-90    -10.111   9.735e-01    0.004  
BLISv0.9.0               t(m1) %*% m2        64    33.240      32.872        0.900     1.261e-90    -10.111   1.195e-06   -0.684  
OPENBLASv0.3.21-OPENMP   crossprod(m1, m2)   64    25.788      25.529        0.636     2.939e-94    -10.619   6.034e-02    0.233  
OPENBLASv0.3.21-OPENMP   t(m1) %*% m2        64    34.009      33.786        0.882     2.939e-94    -10.619   6.737e-03    0.308  
OPENBLASv0.3.21-SERIAL   crossprod(m1, m2)   64    25.658      25.445        0.492     6.705e-89     -9.504   8.455e-01   -0.020  
OPENBLASv0.3.21-SERIAL   t(m1) %*% m2        64    33.318      33.051        0.998     6.705e-89     -9.504   1.099e-04   -0.542

Scaling factor 512

library                  expr                fct   mean_time   median_time   sd_time   p_wth        d_wth     p_btw       d_btw   
NETLIBv3.9.0*            crossprod(m1, m2)   512   1903.292    1896.555       22.096   1.405e-95    -11.054   NA              NA  
NETLIBv3.9.0*            t(m1) %*% m2        512   3615.217    3703.760      198.884   1.405e-95    -11.054   NA              NA  
ATLASv3.10.3             crossprod(m1, m2)   512   1922.887    1913.746       25.148   1.231e-95    -12.668   5.083e-08    0.828  
ATLASv3.10.3             t(m1) %*% m2        512   3649.588    3699.465      194.301   1.231e-95    -12.668   2.758e-01    0.175  
BLISv0.9.0               crossprod(m1, m2)   512   1893.862    1888.070       18.741   1.044e-94    -11.142   3.426e-03   -0.461  
BLISv0.9.0               t(m1) %*% m2        512   3610.446    3686.645      202.273   1.044e-94    -11.142   8.833e-01   -0.024  
OPENBLASv0.3.21-OPENMP   crossprod(m1, m2)   512   1887.136    1884.890       13.131   8.125e-87     -9.971   1.191e-08   -0.890  
OPENBLASv0.3.21-OPENMP   t(m1) %*% m2        512   3701.934    3692.606      254.968   8.125e-87     -9.971   7.987e-03    0.379  
OPENBLASv0.3.21-SERIAL   crossprod(m1, m2)   512   1875.622    1871.131       17.110   8.153e-104   -14.174   2.929e-16   -1.401  
OPENBLASv0.3.21-SERIAL   t(m1) %*% m2        512   3592.483    3654.150      162.627   8.153e-104   -14.174   4.305e-01   -0.125 
Snowblind answered 4/10 at 20:31 Comment(4)
I wonder what happens if you effectively disable BLAS by including one NA value in the matrix?Blotter
@BenBolker Great idea, added additional results!Snowblind
Sorry a bit late - does there end up being any satisfying resolution to this, or is it just an example warning for unsophisticated users (like me) to avoid premature optimization? Or go link in a non-NETLIB BLAS if t(x)%*%y is unbearably slow?Jessejessee
@Jessejessee I see three things to avoid: Netlib, NAs, and not using crossprod/tcrossprod.Snowblind

© 2022 - 2024 — McMap. All rights reserved.