Fast SSE low precision exponential using double precision operations
I am looking for a fast SSE low-precision (~1e-3) exponential function.

I came across this great answer:

/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
    __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
    return _mm_castsi128_ps (t);
}

It is based on the work of Nicol N. Schraudolph: N. N. Schraudolph, "A fast, compact approximation of the exponential function," Neural Computation 11(4), May 1999, pp. 853-862.
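
For readers unfamiliar with the trick, here is a scalar sketch of what FastExpSse does per lane (my addition, not part of that answer): multiplying by 2^23/ln(2) puts the integer part of x/ln(2) into the exponent field of an IEEE-754 float, b adds the exponent bias (minus a small correction that centres the relative error), and the leftover fractional bits linearly interpolate the mantissa between adjacent powers of two.

#include <cmath>
#include <cstdint>
#include <cstring>

float FastExpScalar(float x)
{
    int32_t i = (int32_t)std::lrintf(12102203.0f * x)  /* round(x * 2^23 / ln(2)) */
              + (127 * (1 << 23) - 298765);            /* bias * 2^23, minus error-centring fudge */
    float f;
    std::memcpy(&f, &i, sizeof f);                     /* reinterpret the bits as a float */
    return f;
}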

Now I would need a "double precision" version: __m128d FastExpSSE (__m128d x). This is because I don't control the input and output precision, which happen to be double precision, and the two conversions (double -> float, then float -> double) are eating 50% of the CPU resources.

What changes would be needed?

I naively tried this:

// Packed double -> uint64 via the 2^52 magic-number trick.
// Only valid for inputs in [0, 2^52); the add rounds x to the nearest
// integer and leaves it verbatim in the mantissa bits.
__m128i double_to_uint64(__m128d x) {
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));   /* + 2^52 */
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)) /* strip 2^52's exponent bits */
    );
}
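
For what it is worth, the scalar equivalent of this helper (my addition) makes the magic number visible; the trick only holds for x in [0, 2^52), and the addition rounds x to the nearest integer:

#include <cstdint>
#include <cstring>

uint64_t double_to_uint64_scalar(double x)
{
    const double magic = 4503599627370496.0;   /* 2^52, i.e. 0x0010000000000000 above */
    double shifted = x + magic;                /* x now sits in the mantissa bits */
    uint64_t bits;
    std::memcpy(&bits, &shifted, sizeof bits);
    return bits ^ 0x4330000000000000ull;       /* xor away the bits of 2^52 itself */
}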

__m128d FastExpSseDouble(__m128d x) {

    #define S 52
    #define C ((1llu << S) / log(2))

    __m128d a = _mm_set1_pd(C); /* (1 << 52) / log(2) */
    /* note: 127 is the single-precision exponent bias (double's is 1023),
       and << binds less tightly than -, so the whole difference gets
       shifted left by 29 here */
    __m128i b = _mm_set1_epi64x(127 * (1llu << S) - 298765llu << 29);

    auto y = double_to_uint64(_mm_mul_pd(a, x));

    __m128i t = _mm_add_epi64(y, b);
    return _mm_castsi128_pd(t);
}

Of course this returns garbage as I don't know what I'm doing...

edit:

About the 50% factor: it is a very rough estimate, comparing the speedup (with respect to std::exp) obtained when converting a vector of single-precision numbers (great) with the speedup obtained on a vector of double-precision numbers (not so great).

Here is the code I used:

// computes the result in place
void FastExpSseVector(std::vector<double> & v) { // vector with several million elements

    const auto I = v.size();

    const auto N = (I / 4) * 4;

    for (size_t n = 0; n < N; n += 4) {

        alignas(16) float a[4] = { float(v[n]), float(v[n + 1]), float(v[n + 2]), float(v[n + 3]) };

        __m128 x = _mm_load_ps(a); /* alignas(16) makes the aligned load safe */

        auto r = FastExpSse(x);

        _mm_store_ps(a, r);

        v[n]     = a[0];
        v[n + 1] = a[1];
        v[n + 2] = a[2];
        v[n + 3] = a[3];
    }

    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]); /* scalar fallback for the tail */
    }

}

And here is what I would do if I had this "double precision" version:

void FastExpSseVectorDouble(std::vector<double> & v) {

    const auto I = v.size();

    const auto N = (I / 2) * 2;

    for (size_t n = 0; n < N; n += 2) {
        __m128d x = _mm_load_pd(&v[n]); /* assumes v.data() is 16-byte aligned */
        auto r = FastExpSseDouble(x);

        _mm_store_pd(&v[n], r);
    }

    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }
}
Bloodstream asked 5/6, 2018 at 10:20. Comments (10):
Pashalik: How did you measure that double->float and float->double is taking 50% of your CPU time? You aren't doing those with separate load/store/convert loops, are you? So did you use a profiler and find that cvtpd2ps + cvtps2pd instructions had 50% of the clock-cycle event samples for FastExpSse(__m128d) with on-the-fly conversion? (Not that that would be efficient, though! Unlike pack/unpack, you'd get a vector of only 2 floats, so it would be pure overhead.)
Pashalik: Anyway, 127 is the exponent bias in en.wikipedia.org/wiki/Single-precision_floating-point_format, which uses an 8-bit exponent. I'm not sure what the 298765 is from. You should leave a comment on Nic's answer on the single-precision question with a link to this question, if you haven't already. (The guy who wrote the paper is an SO user :)
Bloodstream: @PeterCordes thanks! About the 50% factor, please see my edit.
Pashalik: No wonder it's slow if you convert like that! gcc 7.3 -O3 godbolt.org/g/G1MGs9 does manage to vectorize the double->float and avoid actually going through a[] in memory for that step. But the float->double step uses 4 separate scalar float->double conversion instructions. (It does manage to avoid bouncing through memory, though.) The obvious way would be to use _mm_cvtpd_ps (felixcloutier.com/x86/CVTPD2PS.html) twice and feed FastExpSse a vector with garbage in the upper 2 elements. Or convert x2 / unpcklpd / FastExpSse / convert / unpckhpd / convert; see the sketch after this comment thread.
Pashalik: Probably a working FastExpSseDouble would be even faster though, if double_to_uint64 isn't too slow without AVX-512's packed double <-> 64-bit integer instructions. (BTW, the float version is using float -> signed int, epi32 not epu32. I think you need to handle both positive and negative numbers.)
Bloodstream: @PeterCordes Oh, I see. And then, would the double precision version bring any speedup? edit: looks like you anticipated my question :)
Pashalik: BTW, if you _mm_load_pd(&v[n]), gcc keeps reloading v.data because it doesn't know that v[n] doesn't alias the control block; unlike with your loop, where type-based aliasing works for vectors of non-pointers. Getting &v[0] into a local solves the problem. godbolt.org/g/d3G9P5 does 2 values per iteration the simple way, but cvtpd2ps and the inverse each cost a shuffle + an FMA uop, and of course you only get half the work done per vector, so it's a serious bottleneck. But it might be 2x as fast as your loop.
Bloodstream: Just tested it (with fastexp from chtz's answer); unfortunately this changed almost nothing (maybe just slightly slower). I'm testing with MSVC though, I'll try it with icc 17.
Pashalik: Are you building for AVX2+FMA? If so, IDK if MSVC will automatically contract addpd(mulpd(x,y), z) into an FMA, so you might need to do that manually. (And you could do 256 bits at a time.) You are enabling optimization, right?
Bloodstream: My bad, I actually was using icc 17 with full optimization. I am building for SSE2.
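
To make the on-the-fly conversion from the comments concrete, here is a sketch of the convert / movlhps / FastExpSse / convert-back loop Pashalik describes (my addition, untested against the original benchmark); hoisting v.data() into a local also sidesteps the aliasing reloads mentioned above:

#include <emmintrin.h>
#include <cmath>
#include <vector>

void FastExpSseVectorViaFloat(std::vector<double> & v)
{
    double * p = v.data();               /* hoist: avoids reloading the vector's data pointer */
    const size_t I = v.size();
    const size_t N = (I / 4) * 4;

    for (size_t n = 0; n < N; n += 4) {
        __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(p + n));     /* 2 doubles -> 2 floats */
        __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(p + n + 2));
        __m128 r  = FastExpSse(_mm_movelh_ps(lo, hi));     /* 4 lanes in one call */
        _mm_storeu_pd(p + n,     _mm_cvtps_pd(r));                   /* low 2 lanes */
        _mm_storeu_pd(p + n + 2, _mm_cvtps_pd(_mm_movehl_ps(r, r))); /* high 2 lanes */
    }

    for (size_t n = N; n < I; ++n)
        p[n] = std::exp(p[n]);           /* scalar tail */
}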

Something like this should do the job. You need to tune the 1.05 constant to get a lower maximal error -- I'm too lazy to do that:

__m128d fastexp(const __m128d &x)
{
    /* x/ln(2) + (3*1024 - 1.05): adding 3*1024 keeps `scaled` inside
       [2048, 4096), where its exponent field is a constant even 1034;
       the -1.05 is the tunable error-centring term. */
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0))),
                                _mm_set1_pd(3*1024.0 - 1.05));

    /* Shift the integer part of `scaled` into the exponent field of the
       result, leaving the fractional bits to interpolate the mantissa. */
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
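
Why the shift by 11 works (a sketch of the derivation, my addition, not from the answer): ignore the tuning part of the constant and take scaled = x/ln(2) + 3*1024 - 1. For x/ln(2) in [-1023, 1025) we have scaled in [2048, 4096), i.e. scaled = 2^11 * (1 + f) with f = (x/ln(2) + 1023)/2048, and its bit pattern read as an integer is 1034*2^52 + f*2^52. Shifting left by 11 multiplies by 2^11: the constant even exponent field 1034 drops off the top (1034*2^63 is a multiple of 2^64), leaving f*2^63 = (x/ln(2) + 1023)*2^52 -- exactly Schraudolph's construction in double precision, with the integer part of x/ln(2) landing in the exponent field and the fractional part linearly interpolating the mantissa. The leftover -0.05 of the 1.05 plays the same error-centring role as the -298765 in the float version.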

This only gets about 2.5% relative error -- for better precision you may need to add a second term.

Also, for values which overflow or underflow, this will produce unspecified results; you can avoid that by clamping the scaled value to a valid range.
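
A minimal sketch of that clamping (my addition; the exact bounds are illustrative and would need the same tuning as the constant above): keeping scaled in roughly [2048, 4095) makes underflow saturate to 0.0 and overflow to a huge finite value instead of garbage bits.

__m128d fastexp_clamped(__m128d x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0))),
                                _mm_set1_pd(3*1024.0 - 1.05));
    scaled = _mm_max_pd(scaled, _mm_set1_pd(2048.0));  /* underflow: exponent field 0 -> ~0.0 */
    scaled = _mm_min_pd(scaled, _mm_set1_pd(4094.99)); /* overflow: large finite, not Inf/NaN */
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}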

Frailty answered 5/6, 2018 at 12:59. Comments (12):
Bloodstream: Wow, this is great, thanks! Benchmarking this vs the original FastExp function using a union, it's "only" 20% faster. Am I doing something wrong?
Bloodstream: (I'm testing with MSVC though, I'll try it with icc 17.)
Frailty: You may get significantly faster by unrolling your loop. _mm_add_pd and _mm_mul_pd have relatively high latency. But perhaps your CPU already does this internally (out-of-order execution).
Frailty: I tried the version above with MSVC on godbolt, and it failed to compile-time compute _mm_set1_pd(1.0/std::log(2.0)). Replacing this with the hand-computed _mm_set1_pd(1.4426950408889634) gave much better code (one more magic number, of course). And I guess I just don't know the correct options to give MSVC.
Bloodstream: I actually was using icc 17 with full optimization, my bad :(
Bloodstream: For the record, I am feeding the functions a vector of 4e6 double-precision elements, and I get these results: std::exp: 103 ms, FastExp: 67 ms, SSE FastExp (single prec): 127 ms, SSE FastExp (double prec): 59 ms.
Frailty: Generally, I don't think a low-precision fastExpDouble will have many use cases. If you don't care about precision, you should work in single precision as soon and as long as possible. And doing just a single operation on a std::vector<double> may make memory bandwidth the limiting factor (try doing more operations without storing intermediate results to memory).
Bloodstream: Yeah, but I'm just doing a Gaussian blur on a structure which stores doubles, so I can't switch to floats. Anyway, thanks a lot!
Frailty: Doing a Gaussian blur is much more computation than computing exp, and the exp can be folded into that computation. But that would be a different question.
Bloodstream: Yes, you're right: for each discrete point that I want to "blur", in each box, I'm computing a squared distance (dx²+dy²+dz²), two multiplications, and an addition (plus the exponential, obviously). Maybe I could convert a few doubles to floats, do everything else in floats, and convert the sum back to double at the end. But that's maybe too much work and obfuscation for the benefit...
Pashalik: @ThreeStarProgrammer57: That would probably be worth it: converting two vectors of double and packing them into one vector of float, unlike for just exp. The conversion results land in the low 2 float elements of a __m128 (felixcloutier.com/x86/CVTPD2PS.html), so like I said you can shuffle them together with shufpd, unpcklpd, or movlhps. That would add 2x convert + 1x shuffle + 2x convert per 4 elements, but cut the rest of the work in half by having twice as many elements per vector. (Or better, if FastExpSse can be done more efficiently with float.)
Frailty: @PeterCordes True, but the thing is, for a Gaussian blur you hardly need many exp calculations. You could just precalculate a 1-D look-up table and use the exp(a+b) == exp(a)*exp(b) identity -- assuming dx, dy, dz are discrete. Or, if the convolution kernel is relatively big, consider doing a 3-D fast Fourier transform (but this is getting away from the original question).
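
As an aside, a minimal sketch of the look-up-table idea from the last comment (my addition; inv_sigma2 is a hypothetical kernel parameter), assuming discrete integer offsets:

#include <cmath>
#include <cstdlib>
#include <vector>

std::vector<double> make_exp_table(int radius, double inv_sigma2)
{
    std::vector<double> t(radius + 1);
    for (int dx = 0; dx <= radius; ++dx)
        t[dx] = std::exp(-dx * dx * inv_sigma2);   /* 1-D Gaussian factor */
    return t;
}

/* exp(-(dx^2+dy^2+dz^2)*s) == exp(-dx^2*s) * exp(-dy^2*s) * exp(-dz^2*s) */
double gauss_weight(const std::vector<double> & t, int dx, int dy, int dz)
{
    return t[std::abs(dx)] * t[std::abs(dy)] * t[std::abs(dz)];
}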
