Numpy vs Eigen vs Xtensor Linear Algebra Benchmark Oddity

I recently was trying to compare different python and C++ matrix libraries against each other for their linear algebra performance in order to see which one(s) to use in an upcoming project. While there are multiple types of linear algebra operations, I have chosen to focus mainly on matrix inversion, as it seems to be the one giving strange results. I have written the following code below for the comparison, but am thinking I must be doing something wrong.

C++ Code

    #include <iostream>
    #include "eigen/Eigen/Dense"
    #include <xtensor/xarray.hpp>
    #include <xtensor/xio.hpp>
    #include <xtensor/xview.hpp>
    #include <xtensor/xrandom.hpp>
    #include <xtensor-blas/xlinalg.hpp> //-lblas -llapack for cblas,   -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread for openblas

    //including accurate timer
    #include <chrono>
    //including vector array
    #include <vector>

    void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats = 1000);
    void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats = 1000);
    
    int main()
    {
      std::vector<int> sizings{1, 10, 100, 1000, 10000, 100000};
    
      basicMatrixComparisonEigen(sizings, 2);
      basicMatrixComparisonXtensor(sizings,2);
      return 0;
    }
    
    
    void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats)
    {
      std::chrono::high_resolution_clock::time_point t1;
      std::chrono::high_resolution_clock::time_point t2;
      using time = std::chrono::high_resolution_clock;
    
      std::cout << "Timing Eigen: " << std::endl;
      for (auto &dim : dims)
      {
    
        std::cout << "Scale Factor: " << dim << std::endl;
        try
        {
          //Linear Operations
          auto l = Eigen::MatrixXd::Random(dim, dim);
    
          //Eigen Matrix inversion
          t1 = time::now();
          for (int i = 0; i < numrepeats; i++)
          {
            Eigen::MatrixXd pinv = l.completeOrthogonalDecomposition().pseudoInverse();
            //note this does not come out to be identity.  The inverse is wrong.
            //std::cout<<l*pinv<<std::endl;
          }
          t2 = time::now();
          std::cout << "Eigen Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
          std::cout << "\n\n\n";
        }
        catch (const std::exception &e)
        {
          std::cout << "Error:   '" << e.what() << "'\n";
        }
      }
    }
    
    
    void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats)
    {
      std::chrono::high_resolution_clock::time_point t1;
      std::chrono::high_resolution_clock::time_point t2;
      using time = std::chrono::high_resolution_clock;
    
      std::cout << "Timing Xtensor: " << std::endl;
      for (auto &dim : dims)
      {
    
        std::cout << "Scale Factor: " << dim << std::endl;
        try
        {
    
          //Linear Operations
          auto l = xt::random::randn<double>({dim, dim});
    
          //Xtensor Matrix inversion
          t1 = time::now();
          for (int i = 0; i < numrepeats; i++)
          {
            auto inverse = xt::linalg::pinv(l);
            //something is wrong here.  The inverse is not actually the inverse when you multiply it out. 
            //std::cout << xt::linalg::dot(inverse,l) << std::endl;
          }
          t2 = time::now();
          std::cout << "Xtensor Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
        
          std::cout << "\n\n\n";
        }
        catch (const std::exception &e)
        {
          std::cout << "Error:   '" << e.what() << "'\n";
        }
      }
    }

This is compiled with:

g++ cpp_library.cpp -O2  -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread -march=native -o benchmark.exe

for OpenBLAS, and

g++ cpp_library.cpp -O2  -lblas -llapack -march=native -o benchmark.exe

for cBLAS.
g++ version 9.3.0.

And for Python 3:

import numpy as np
from datetime import datetime as dt

#import timeit

start=dt.now()
l=np.random.rand(1000,1000)
for i in range(2):
    result=np.linalg.inv(l)
end=dt.now()
print("Completed in: "+str((end-start)/2))
#print(np.matmul(l,result))
#print(np.dot(l,result))
#Timeit also gives similar results

I will focus on the largest decade that runs in a reasonable amount of time on my computer: 1000x1000. I know that only 2 runs introduces a bit of variance, but I've run it with more and the results are roughly the same as below:

Eigen 3.3.9: 196.804 milliseconds
Xtensor/Xtensor-blas w/ OpenBlas: 378.156 milliseconds
Numpy 1.17.4: 172.582 milliseconds

Is this a reasonable result to expect? Why are the C++ libraries slower than Numpy? All 3 packages are using some sort of Lapack/BLAS backend, yet there is a significant difference between the 3. Particularly, Xtensor will pin my CPU to 100% usage with OpenBlas' threads, yet still manage to have worse performance.

I'm wondering if the C++ libraries are actually performing the inverse/pseudoinverse of the matrix, and if this is what is causing these results. In the commented sections of the C++ test code, I have noted that when I sanity-checked the results from both Eigen and Xtensor, the resulting matrix product between the matrix and its inverse was not even close to the identity matrix. I tried with smaller matrices (10x10) thinking it might be a precision error, but the problem remained. In another test, I test for rank, and these matrices are full rank. To be sure I wasn't going crazy, I tried with inv() instead of pinv() in both cases, and the results are the same. Am I using the wrong functions for this linear algebra benchmark, or is this Numpy twisting the knife on 2 disfunctional low level libraries?

EDIT: Thank you everyone for your interest in this problem. I think I have figured out the issue. I suspect Eigen and Xtensor have lazy evaluation and this actually is causing errors downstream, and outputting random matrices instead of the inversed matrices. I was able to correct the strange numerical inversion failure with the following replacements in the code:

auto temp = Eigen::MatrixXd::Random(dim, dim);
Eigen::MatrixXd l(dim,dim);
l=temp;

and

auto temp = xt::random::randn<double>({dim, dim});
xt::xarray<double> l =temp;

However, the timings didn't change much:

Eigen 3.3.9: 201.386 milliseconds
Xtensor/Xtensor-blas w/ OpenBlas: 337.299 milliseconds.
Numpy 1.17.4: (from before) 172.582 milliseconds

Actually, a little strangely, adding -O3 and -ffast-math actually slowed down the code a little. -march=native had the biggest performance increase for me when I tried it. Also, OpenBLAS is 2-3X faster than CBLAS for these problems.

Recommended topics

Hot tags