Why is it faster to perform float by float matrix multiplication compared to int by int?

Given two int matrices A and B with more than 1000 rows and 10K columns, I often need to convert them to float matrices to gain a speedup (4x or more).

I'm wondering why this is the case. I realize there is a lot of optimization and vectorization, such as AVX, going on with float matrix multiplication. Yet there are instruction-set extensions such as AVX2 for integers (if I'm not mistaken). Can't one make use of SSE and AVX for integers?

Why isn't there a heuristic underneath matrix algebra libraries such as NumPy or Eigen to detect this and perform integer matrix multiplication as fast as float?
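For reference, this is roughly what the conversion workaround looks like with Eigen (a minimal sketch; the function name and the round-trip casts are illustrative, not code from my project):

#include <Eigen/Core>

// Cast the int matrices to float, multiply using the fast float kernels,
// and cast the result back. Beware: float has only 24 bits of mantissa,
// so large integer products can lose precision.
Eigen::MatrixXi mult_via_float(const Eigen::MatrixXi& A, const Eigen::MatrixXi& B)
{
    Eigen::MatrixXf Af = A.cast<float>();
    Eigen::MatrixXf Bf = B.cast<float>();
    Eigen::MatrixXf Cf = Af * Bf;
    return Cf.cast<int>();
}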

About the accepted answer: while @sascha's answer is very informative and relevant, @chatz's answer gives the actual reason why int-by-int multiplication is slow, irrespective of whether BLAS integer matrix operations exist.

Manaus answered 28/7, 2017 at 12:33 Comment(2)
It would help to make the question more specific, but since more people need it for float, more effort was made to optimize it for float (in both software and hardware).Luhey
This question is in need of a specific example code to demonstrate the performance difference (see minimal reproducible example). Particularly given that the code is tagged [c++] and [numpy] it is completely unclear what you are referring to.Shamblin

If you compile these two simple functions, which essentially just calculate a product (using the Eigen library)

#include <Eigen/Core>

int mult_int(const Eigen::MatrixXi& A, Eigen::MatrixXi& B)
{
    Eigen::MatrixXi C = A * B;
    return C(0,0);
}

int mult_float(const Eigen::MatrixXf& A, Eigen::MatrixXf& B)
{
    Eigen::MatrixXf C = A * B;
    return C(0,0);
}

using the flags -mavx2 -S -O3, you will see very similar assembly code for the integer and the float version. The main difference, however, is that vpmulld has 2-3 times the latency and only 1/2 or 1/4 the throughput of vmulps (on recent Intel architectures).

Reference: Intel Intrinsics Guide. "Throughput" here means reciprocal throughput, i.e., how many clock cycles are used per operation when there are no latency stalls (somewhat simplified).
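To see that instruction-level difference in isolation, here is a minimal intrinsics sketch (my own illustration, not taken from the generated assembly) of the packed multiplies the two versions boil down to:

#include <immintrin.h>

// Multiply 8 packed 32-bit ints; compiles to vpmulld (requires AVX2),
// the instruction with the higher latency and lower throughput.
__m256i mul_int8(__m256i a, __m256i b)
{
    return _mm256_mullo_epi32(a, b);
}

// Multiply 8 packed floats; compiles to vmulps (AVX),
// which is cheaper on recent Intel architectures.
__m256 mul_float8(__m256 a, __m256 b)
{
    return _mm256_mul_ps(a, b);
}

Compile with e.g. g++ -O3 -mavx2 -S and compare the emitted instructions.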

Fukuoka answered 28/7, 2017 at 15:39 Comment(3)
Very interesting! Never thought vpmulld and vmulps could be this different in terms of throughput and latency.Manaus
Unexpected yet unsurprising. Floating-point matrix operations are heavily used in computer graphics, prompting a large interest in hardware-optimizing them. Applications range from the obvious (video games and web apps) to research-oriented simulation engines and mathematical modeling. Also, if you think those are fast, you can get even more floating point operation throughput programming these types of operations on a video card (a good example is nVidia's CUDA platform). Video cards are purpose-built for massively parallel floating-point operations.Catadromous
Also, FP can use FMA instructions if you use -march=native - all AVX2 CPUs (except one from Via) also have FMA. x86 doesn't have integer mul-add instructions, just FP. Most of the FLOPs in a matmul can be done with FMAs, nearly doubling the throughput yet again if you can keep the execution units fed (only 1 load per FMA, otherwise you bottleneck on load throughput).Nadia
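To illustrate the FMA point in the last comment (a minimal sketch of my own, not code from the comment): the float inner loop can fuse the multiply and the accumulate into one instruction, while the integer version needs a separate multiply and add.

#include <immintrin.h>

// acc + a * b in a single instruction (vfmadd...ps); requires FMA (-mfma).
__m256 fma_float8(__m256 acc, __m256 a, __m256 b)
{
    return _mm256_fmadd_ps(a, b, acc);
}

// The integer path has no fused equivalent in x86 SIMD: multiply, then add.
__m256i muladd_int8(__m256i acc, __m256i a, __m256i b)
{
    return _mm256_add_epi32(acc, _mm256_mullo_epi32(a, b));
}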

All those vector and matrix operations use BLAS internally. BLAS, optimized over decades for different architectures, CPUs, instruction sets, and cache sizes, has no integer type!

Here is a branch of OpenBLAS working on it (and a brief discussion on Google Groups linking to it).

And I think I heard that Intel's MKL (Intel's BLAS implementation) might be working on integer types too. This talk looks interesting (mentioned in that forum), although it's short and probably more about the small integral types useful in embedded deep learning.
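To make the point concrete: the standard BLAS/CBLAS interface only exposes gemm for float, double, and complex types (sgemm/dgemm/cgemm/zgemm); there is no integer variant. A minimal sketch of the float path, assuming a CBLAS header from OpenBLAS or MKL (the wrapper function is my own illustration):

#include <cblas.h>   // provided by OpenBLAS, MKL, ...
#include <vector>

// C = A * B for row-major float matrices (A is m x k, B is k x n, C is m x n).
// This is the routine that decades of BLAS tuning target; there is no
// integer counterpart in the standard interface.
void float_matmul(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int m, int k, int n)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0f, A.data(), k,
                      B.data(), n,
                0.0f, C.data(), n);
}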

Incubator answered 28/7, 2017 at 12:37 Comment(9)
Looks like Blaze supports integers.Astrometry
Eigen works with integers, and when you compile it with g++ -O3 -march=somethingrecent it is vectorized and you see instructions like vpmulld.Luhey
Supporting and vectorizing integer ops is possible (and I would expect it in high-quality libs), sure, but the question is: can it compete with hand-tuned BLAS code?Incubator
@MarcGlisse Eigen's int matrix multiplications are usually 4 times slower than float ones. This is based on my experience though, and I'm using the flags you mentioned along with -fopenmp; otherwise it would be single-threaded and take a long time to finish.Manaus
@Astrometry thanks for the suggestion. Haven't used Blaze yet. Will try it out to see if its int by int matmul is faster.Manaus
@Manaus I have not used it either but I have watched a talk about it and they did present some decent performance numbers.Astrometry
@Incubator right, so basically we are stuck with either converting to float, or performing slow int matmul?Manaus
@Manaus I don't have much experience with this task, but it looks like you have to check all the available software if performance is that important to you, as there seem to be differences. Maybe one day OpenBLAS or MKL will add native support, but that's the future. I would be wary of float-based ops in some use cases, but if that works for you (no numerical trouble), that's good.Incubator
Eigen does not depend on a separate BLAS implementation. By default it uses its own implementations (you can tell it to use an external BLAS, however).Fukuoka