I recently was trying to compare different python and C++ matrix libraries against each other for their linear algebra performance in order to see which one(s) to use in an upcoming project. While there are multiple types of linear algebra operations, I have chosen to focus mainly on matrix inversion, as it seems to be the one giving strange results. I have written the following code below for the comparison, but am thinking I must be doing something wrong.
C++ Code
#include <iostream>
#include "eigen/Eigen/Dense"
#include <xtensor/xarray.hpp>
#include <xtensor/xio.hpp>
#include <xtensor/xview.hpp>
#include <xtensor/xrandom.hpp>
#include <xtensor-blas/xlinalg.hpp> //-lblas -llapack for cblas, -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread for openblas
//including accurate timer
#include <chrono>
//including vector array
#include <vector>
void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats = 1000);
void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats = 1000);
int main()
{
std::vector<int> sizings{1, 10, 100, 1000, 10000, 100000};
basicMatrixComparisonEigen(sizings, 2);
basicMatrixComparisonXtensor(sizings,2);
return 0;
}
void basicMatrixComparisonEigen(std::vector<int> dims, int numrepeats)
{
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
using time = std::chrono::high_resolution_clock;
std::cout << "Timing Eigen: " << std::endl;
for (auto &dim : dims)
{
std::cout << "Scale Factor: " << dim << std::endl;
try
{
//Linear Operations
auto l = Eigen::MatrixXd::Random(dim, dim);
//Eigen Matrix inversion
t1 = time::now();
for (int i = 0; i < numrepeats; i++)
{
Eigen::MatrixXd pinv = l.completeOrthogonalDecomposition().pseudoInverse();
//note this does not come out to be identity. The inverse is wrong.
//std::cout<<l*pinv<<std::endl;
}
t2 = time::now();
std::cout << "Eigen Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
std::cout << "\n\n\n";
}
catch (const std::exception &e)
{
std::cout << "Error: '" << e.what() << "'\n";
}
}
}
void basicMatrixComparisonXtensor(std::vector<int> dims, int numrepeats)
{
std::chrono::high_resolution_clock::time_point t1;
std::chrono::high_resolution_clock::time_point t2;
using time = std::chrono::high_resolution_clock;
std::cout << "Timing Xtensor: " << std::endl;
for (auto &dim : dims)
{
std::cout << "Scale Factor: " << dim << std::endl;
try
{
//Linear Operations
auto l = xt::random::randn<double>({dim, dim});
//Xtensor Matrix inversion
t1 = time::now();
for (int i = 0; i < numrepeats; i++)
{
auto inverse = xt::linalg::pinv(l);
//something is wrong here. The inverse is not actually the inverse when you multiply it out.
//std::cout << xt::linalg::dot(inverse,l) << std::endl;
}
t2 = time::now();
std::cout << "Xtensor Matrix inversion took: " << std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count() * 1000 / (double)numrepeats << " milliseconds." << std::endl;
std::cout << "\n\n\n";
}
catch (const std::exception &e)
{
std::cout << "Error: '" << e.what() << "'\n";
}
}
}
This is compiled with:
g++ cpp_library.cpp -O2 -llapack -L OpenBLAS/OpenBLAS_Install/lib -l:libopenblas.a -pthread -march=native -o benchmark.exe
for OpenBLAS, and
g++ cpp_library.cpp -O2 -lblas -llapack -march=native -o benchmark.exe
for cBLAS.
g++ version 9.3.0.
And for Python 3:
import numpy as np
from datetime import datetime as dt
#import timeit
start=dt.now()
l=np.random.rand(1000,1000)
for i in range(2):
result=np.linalg.inv(l)
end=dt.now()
print("Completed in: "+str((end-start)/2))
#print(np.matmul(l,result))
#print(np.dot(l,result))
#Timeit also gives similar results
I will focus on the largest decade that runs in a reasonable amount of time on my computer: 1000x1000. I know that only 2 runs introduces a bit of variance, but I've run it with more and the results are roughly the same as below:
- Eigen 3.3.9: 196.804 milliseconds
- Xtensor/Xtensor-blas w/ OpenBlas: 378.156 milliseconds
- Numpy 1.17.4: 172.582 milliseconds
Is this a reasonable result to expect? Why are the C++ libraries slower than Numpy? All 3 packages are using some sort of Lapack/BLAS backend, yet there is a significant difference between the 3. Particularly, Xtensor will pin my CPU to 100% usage with OpenBlas' threads, yet still manage to have worse performance.
I'm wondering if the C++ libraries are actually performing the inverse/pseudoinverse of the matrix, and if this is what is causing these results. In the commented sections of the C++ test code, I have noted that when I sanity-checked the results from both Eigen and Xtensor, the resulting matrix product between the matrix and its inverse was not even close to the identity matrix. I tried with smaller matrices (10x10) thinking it might be a precision error, but the problem remained. In another test, I test for rank, and these matrices are full rank. To be sure I wasn't going crazy, I tried with inv() instead of pinv() in both cases, and the results are the same. Am I using the wrong functions for this linear algebra benchmark, or is this Numpy twisting the knife on 2 disfunctional low level libraries?
EDIT: Thank you everyone for your interest in this problem. I think I have figured out the issue. I suspect Eigen and Xtensor have lazy evaluation and this actually is causing errors downstream, and outputting random matrices instead of the inversed matrices. I was able to correct the strange numerical inversion failure with the following replacements in the code:
auto temp = Eigen::MatrixXd::Random(dim, dim);
Eigen::MatrixXd l(dim,dim);
l=temp;
and
auto temp = xt::random::randn<double>({dim, dim});
xt::xarray<double> l =temp;
However, the timings didn't change much:
- Eigen 3.3.9: 201.386 milliseconds
- Xtensor/Xtensor-blas w/ OpenBlas: 337.299 milliseconds.
- Numpy 1.17.4: (from before) 172.582 milliseconds
Actually, a little strangely, adding -O3 and -ffast-math actually slowed down the code a little. -march=native had the biggest performance increase for me when I tried it. Also, OpenBLAS is 2-3X faster than CBLAS for these problems.
-03
. Also you could usexsimd
for xtensor, but I doubt that that would help in this particular case – Latinist-O2
, while it is in-O3
. For a small additional speed up at the expense of a less portable binary, you can use-march=native
. Moreover, if you do not care about having very precise results and you are sure thatNaN
/Inf
/etc. should not appear, you can use-ffast-math
. It should make your C++ code a bit faster. Besides this, the algorithm used by the 3 implementation are not necessary the same, some can be faster or slower regarding the input matrix. They can be more or less numerically stable too. – Scenic