Numpy vs Eigen vs Xtensor Linear Algebra Benchmark Oddity

I was recently comparing different Python and C++ matrix libraries against each other for their linear algebra performance, in order to see which one(s) to use in an upcoming project. While there are multiple types of linear algebra operations, I have chosen to focus mainly on matrix inversion, as it seems to be the one giving strange results. I have written the code below for the comparison, but I am thinking I must be doing something wrong.

C++ Code

    #include <iostream>
    #include "eigen/Eigen/Dense"
    #include <xtensor/xarray.hpp>
    #include <xtensor/xio.hpp>
    #include <xtensor/xview.hpp>
    #include <xtensor/xrandom.hpp>
    #include <xtensor-blas/xlinalg.hpp> //-lblas -llapack for cblas,   -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread for openblas

    //including accurate timer
    #include <chrono>
    //including vector array
    #include <vector>

    void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats = 1000);
    void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats = 1000);
    
    int main()
    {
      std::vector<int> sizings{1, 10, 100, 1000, 10000, 100000};
    
      basicMatrixComparisonEigen(sizings, 2);
      basicMatrixComparisonXtensor(sizings,2);
      return 0;
    }
    
    
    void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats)
    {
      std::chrono::high_resolution_clock::time_point t1;
      std::chrono::high_resolution_clock::time_point t2;
      using time = std::chrono::high_resolution_clock;
    
      std::cout << "Timing Eigen: " << std::endl;
      for (auto &dim : dims)
      {
    
        std::cout << "Scale Factor: " << dim << std::endl;
        try
        {
          //Linear Operations
          auto l = Eigen::MatrixXd::Random(dim, dim);
    
          //Eigen Matrix inversion
          t1 = time::now();
          for (int i = 0; i < numrepeats; i++)
          {
            Eigen::MatrixXd pinv = l.completeOrthogonalDecomposition().pseudoInverse();
            //note this does not come out to be identity.  The inverse is wrong.
            //std::cout<<l*pinv<<std::endl;
          }
          t2 = time::now();
          std::cout << "Eigen Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
          std::cout << "\n\n\n";
        }
        catch (const std::exception &e)
        {
          std::cout << "Error:   '" << e.what() << "'\n";
        }
      }
    }
    
    
    void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats)
    {
      std::chrono::high_resolution_clock::time_point t1;
      std::chrono::high_resolution_clock::time_point t2;
      using time = std::chrono::high_resolution_clock;
    
      std::cout << "Timing Xtensor: " << std::endl;
      for (auto &dim : dims)
      {
    
        std::cout << "Scale Factor: " << dim << std::endl;
        try
        {
    
          //Linear Operations
          auto l = xt::random::randn<double>({dim, dim});
    
          //Xtensor Matrix inversion
          t1 = time::now();
          for (int i = 0; i < numrepeats; i++)
          {
            auto inverse = xt::linalg::pinv(l);
            //something is wrong here.  The inverse is not actually the inverse when you multiply it out. 
            //std::cout << xt::linalg::dot(inverse,l) << std::endl;
          }
          t2 = time::now();
          std::cout << "Xtensor Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
        
          std::cout << "\n\n\n";
        }
        catch (const std::exception &e)
        {
          std::cout << "Error:   '" << e.what() << "'\n";
        }
      }
    }

This is compiled with:

    g++ cpp_library.cpp -O2 -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread -march=native -o benchmark.exe

for OpenBLAS, and

    g++ cpp_library.cpp -O2 -lblas -llapack -march=native -o benchmark.exe

for cBLAS. The compiler is g++ 9.3.0.

And for Python 3:

    import numpy as np
    from datetime import datetime as dt

    #import timeit

    #generate the matrix outside the timed region so that only the
    #inversions are measured
    l = np.random.rand(1000, 1000)
    start = dt.now()
    for i in range(2):
        result = np.linalg.inv(l)
    end = dt.now()
    print("Completed in: " + str((end - start) / 2))
    #print(np.matmul(l, result))
    #print(np.dot(l, result))
    #Timeit also gives similar results

I will focus on the largest decade that runs in a reasonable amount of time on my computer: 1000x1000. I know that doing only 2 runs introduces a bit of variance, but I've run it with more and the results are roughly the same as below:

  • Eigen 3.3.9: 196.804 milliseconds
  • Xtensor/Xtensor-blas w/ OpenBlas: 378.156 milliseconds
  • Numpy 1.17.4: 172.582 milliseconds

Is this a reasonable result to expect? Why are the C++ libraries slower than Numpy? All 3 packages use some sort of LAPACK/BLAS backend, yet there is a significant difference between the 3. In particular, Xtensor will pin my CPU at 100% usage with OpenBLAS's threads, yet still manages to have worse performance.
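
As a side check, OpenBLAS's thread count can be capped at runtime to see whether thread oversubscription is part of the problem. A minimal sketch, assuming the binary is linked against OpenBLAS, whose C API exports openblas_set_num_threads (exporting OPENBLAS_NUM_THREADS=1 before running has the same effect without recompiling):

    //cap the BLAS backend at a single thread before timing, to rule out
    //thread oversubscription as a confounder
    extern "C" void openblas_set_num_threads(int num_threads);

    int main()
    {
      openblas_set_num_threads(1);
      //... run basicMatrixComparisonEigen / basicMatrixComparisonXtensor as above ...
      return 0;
    }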

I'm wondering if the C++ libraries are actually performing the inverse/pseudoinverse of the matrix, and if this is what is causing these results. In the commented sections of the C++ test code, I have noted that when I sanity-checked the results from both Eigen and Xtensor, the resulting product of the matrix and its inverse was not even close to the identity matrix. I tried with smaller matrices (10x10), thinking it might be a precision error, but the problem remained. In another test, I checked the rank, and these matrices are full rank. To be sure I wasn't going crazy, I tried with inv() instead of pinv() in both cases, and the results are the same. Am I using the wrong functions for this linear algebra benchmark, or is Numpy twisting the knife on 2 dysfunctional low-level libraries?

EDIT: Thank you everyone for your interest in this problem. I think I have figured out the issue. I suspect it is lazy evaluation in Eigen and Xtensor: with auto, l holds an unevaluated random-generator expression rather than a concrete matrix, so every use of l produces a fresh random matrix, and the sanity-check product multiplies the inverse against a different matrix than the one that was inverted. I was able to correct the strange numerical inversion failure with the following replacements in the code:

    auto temp = Eigen::MatrixXd::Random(dim, dim);
    Eigen::MatrixXd l(dim, dim);
    l = temp;

and

    auto temp = xt::random::randn<double>({dim, dim});
    xt::xarray<double> l = temp;
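
The same fix also works without the temporary; assigning the expression directly to a concrete matrix type (in each respective function) forces it to be evaluated exactly once:

    //equivalent one-liners: binding the lazy expression to a concrete type
    //materializes the random matrix once instead of regenerating it on every use
    Eigen::MatrixXd l = Eigen::MatrixXd::Random(dim, dim);
    xt::xarray<double> l = xt::random::randn<double>({dim, dim});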

However, the timings didn't change much:

  • Eigen 3.3.9: 201.386 milliseconds
  • Xtensor/Xtensor-blas w/ OpenBlas: 337.299 milliseconds.
  • Numpy 1.17.4: (from before) 172.582 milliseconds

Somewhat strangely, adding -O3 and -ffast-math actually slowed down the code a little. -march=native gave the biggest performance increase of anything I tried. Also, OpenBLAS is 2-3x faster than cBLAS for these problems.

Girandole answered 10/5, 2021 at 2:12. Comments (5):
Most of the important Python libraries like Numpy, Pandas, and TF have their computational kernels written in C++; the Python functions just pass the data to the C++ code for calculation, and that C++ code is very heavily optimized. I have tried many times to improve on them, but with no luck. – Decency
I suspected as much, but was thinking there might be a confounding factor, either in how I was doing the benchmark or, particularly, in the functions the C++ libraries called, since they didn't seem to actually be generating the inverse matrix. – Girandole
You could try to optimise a bit further by compiling with -O3. You could also use xsimd for xtensor, but I doubt that would help in this particular case. – Latinist
Regarding your 'precision' question, which is probably quite relevant to your efficiency question: are you sure that the matrices can in fact be inverted? – Latinist
Auto-vectorization is not enabled at -O2, while it is at -O3. For a small additional speed-up at the expense of a less portable binary, you can use -march=native. Moreover, if you do not care about having very precise results and you are sure that NaN/Inf/etc. should not appear, you can use -ffast-math; it should make your C++ code a bit faster. Besides this, the algorithms used by the 3 implementations are not necessarily the same; some can be faster or slower depending on the input matrix, and they can be more or less numerically stable too. – Scenic

Firstly, you are not computing the same things.

To compute the inverse of the matrix l, use l.inverse() for Eigen and xt::linalg::inv() for xtensor. That matches what np.linalg.inv computes; the pseudoinverse routines you were timing instead go through a much more expensive decomposition (a complete orthogonal decomposition in Eigen's case, and typically an SVD for xtensor's pinv, as for numpy's).
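
As a sketch against the question's timed loops, the replacements would look like this:

    //Eigen: plain inverse instead of the pseudoinverse
    Eigen::MatrixXd inv = l.inverse();

    //xtensor: plain inverse instead of xt::linalg::pinv
    xt::xarray<double> inverse = xt::linalg::inv(l);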

When you link a BLAS to Eigen or xtensor, these operations are automatically dispatched to your chosen BLAS.

I tried replacing the pseudoinverse calls with the plain inverse functions, replaced auto with Eigen::MatrixXd and xt::xtensor to avoid lazy evaluation, linked OpenBLAS to Eigen, xtensor, and numpy, and compiled with only the -O3 flag. The following are the results on my MacBook Pro M1:

  • Eigen 3.3.9 (with OpenBLAS): ~38 ms
  • Eigen 3.3.9 (without OpenBLAS): ~85 ms
  • xtensor-master (with OpenBLAS): ~41 ms
  • Numpy 1.21.2 (with OpenBLAS): ~35 ms

Joub answered 20/9, 2021 at 12:4. Comment (1):
Why is numpy faster than Eigen with OpenBLAS? – Rihana
