Fast SSE low precision exponential using double precision operations
I am looking for a fast SSE low-precision (~1e-3) exponential function.

I came across this great answer:

/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
    __m128 a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
    return _mm_castsi128_ps (t);
}

It is based on the work of Nicol N. Schraudolph: N. N. Schraudolph, "A fast, compact approximation of the exponential function," Neural Computation 11(4), May 1999, pp. 853-862.
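
For readers unfamiliar with the trick, here is a scalar sketch of what FastExpSse does per lane (my addition, not part of that answer): multiplying by 2^23/ln(2) puts the integer part of x/ln(2) into the exponent field of an IEEE-754 float, b adds the exponent bias (minus a small correction that centres the relative error), and the leftover fractional bits linearly interpolate the mantissa between adjacent powers of two.

#include <cmath>
#include <cstdint>
#include <cstring>

float FastExpScalar(float x)
{
    int32_t i = (int32_t)std::lrintf(12102203.0f * x)  /* round(x * 2^23 / ln(2)) */
              + (127 * (1 << 23) - 298765);            /* bias * 2^23, minus error-centring fudge */
    float f;
    std::memcpy(&f, &i, sizeof f);                     /* reinterpret the bits as a float */
    return f;
}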

Now I would need a "double precision" version: __m128d FastExpSSE (__m128d x). This is because I don't control the input and output precision, which happen to be double precision, and the two conversions (double -> float, then float -> double) are eating 50% of the CPU resources.

What changes would be needed?

I naively tried this:

// Packed double -> uint64 via the 2^52 magic-number trick.
// Only valid for inputs in [0, 2^52); the add rounds x to the nearest
// integer and leaves it verbatim in the mantissa bits.
__m128i double_to_uint64(__m128d x) {
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));   /* + 2^52 */
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000)) /* strip 2^52's exponent bits */
    );
}
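
For what it is worth, the scalar equivalent of this helper (my addition) makes the magic number visible; the trick only holds for x in [0, 2^52), and the addition rounds x to the nearest integer:

#include <cstdint>
#include <cstring>

uint64_t double_to_uint64_scalar(double x)
{
    const double magic = 4503599627370496.0;   /* 2^52, i.e. 0x0010000000000000 above */
    double shifted = x + magic;                /* x now sits in the mantissa bits */
    uint64_t bits;
    std::memcpy(&bits, &shifted, sizeof bits);
    return bits ^ 0x4330000000000000ull;       /* xor away the bits of 2^52 itself */
}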

__m128d FastExpSseDouble(__m128d x) {

    #define S 52
    #define C ((1llu << S) / log(2))

    __m128d a = _mm_set1_pd(C); /* (1 << 52) / log(2) */
    /* note: 127 is the single-precision exponent bias (double's is 1023),
       and << binds less tightly than -, so the whole difference gets
       shifted left by 29 here */
    __m128i b = _mm_set1_epi64x(127 * (1llu << S) - 298765llu << 29);

    auto y = double_to_uint64(_mm_mul_pd(a, x));

    __m128i t = _mm_add_epi64(y, b);
    return _mm_castsi128_pd(t);
}

Of course this returns garbage as I don't know what I'm doing...

edit:

About the 50% factor: it is a very rough estimate, comparing the speedup (with respect to std::exp) obtained when converting a vector of single-precision numbers (great) with the speedup obtained on a vector of double-precision numbers (not so great).

Here is the code I used:

// computes the result in place
void FastExpSseVector(std::vector<double> & v) { // vector with several million elements

    const auto I = v.size();

    const auto N = (I / 4) * 4;

    for (size_t n = 0; n < N; n += 4) {

        alignas(16) float a[4] = { float(v[n]), float(v[n + 1]), float(v[n + 2]), float(v[n + 3]) };

        __m128 x = _mm_load_ps(a); /* alignas(16) makes the aligned load safe */

        auto r = FastExpSse(x);

        _mm_store_ps(a, r);

        v[n]     = a[0];
        v[n + 1] = a[1];
        v[n + 2] = a[2];
        v[n + 3] = a[3];
    }

    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]); /* scalar fallback for the tail */
    }

}

And here is what I would do if I had this "double precision" version:

void FastExpSseVectorDouble(std::vector<double> & v) {

    const auto I = v.size();

    const auto N = (I / 2) * 2;

    for (size_t n = 0; n < N; n += 2) {
        __m128d x = _mm_load_pd(&v[n]); /* assumes v.data() is 16-byte aligned */
        auto r = FastExpSseDouble(x);

        _mm_store_pd(&v[n], r);
    }

    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }
}
Bloodstream asked 5/6, 2018 at 10:20. Comments (10):
Pashalik: How did you measure that double->float and float->double is taking 50% of your CPU time? You aren't doing those with separate load/store/convert loops, are you? So did you use a profiler and find that cvtpd2ps + cvtps2pd instructions had 50% of the clock-cycle event samples for FastExpSse(__m128d) with on-the-fly conversion? (Not that that would be efficient, though! Unlike pack/unpack, you'd get a vector of only 2 floats, so it would be pure overhead.)
Pashalik: Anyway, 127 is the exponent bias in en.wikipedia.org/wiki/Single-precision_floating-point_format, which uses an 8-bit exponent. I'm not sure what the 298765 is from. You should leave a comment on Nic's answer on the single-precision question with a link to this question, if you haven't already. (The guy who wrote the paper is an SO user :)
Bloodstream: @PeterCordes thanks! About the 50% factor, please see my edit.
Pashalik: No wonder it's slow if you convert like that! gcc 7.3 -O3 godbolt.org/g/G1MGs9 does manage to vectorize the double->float and avoid actually going through a[] in memory for that step. But the float->double step uses 4 separate scalar float->double conversion instructions. (It does manage to avoid bouncing through memory, though.) The obvious way would be to use _mm_cvtpd_ps (felixcloutier.com/x86/CVTPD2PS.html) twice and feed FastExpSse a vector with garbage in the upper 2 elements. Or convert x2 / unpcklpd / FastExpSse / convert / unpckhpd / convert; see the sketch after this comment thread.
Pashalik: Probably a working FastExpSseDouble would be even faster though, if double_to_uint64 isn't too slow without AVX-512's packed double <-> 64-bit integer instructions. (BTW, the float version is using float -> signed int, epi32 not epu32. I think you need to handle both positive and negative numbers.)
Bloodstream: @PeterCordes Oh, I see. And then, would the double precision version bring any speedup? edit: looks like you anticipated my question :)
Pashalik: BTW, if you _mm_load_pd(&v[n]), gcc keeps reloading v.data because it doesn't know that v[n] doesn't alias the control block; unlike with your loop, where type-based aliasing works for vectors of non-pointers. Getting &v[0] into a local solves the problem. godbolt.org/g/d3G9P5 does 2 values per iteration the simple way, but cvtpd2ps and the inverse each cost a shuffle + an FMA uop, and of course you only get half the work done per vector, so it's a serious bottleneck. But it might be 2x as fast as your loop.
Bloodstream: Just tested it (with fastexp from chtz's answer); unfortunately this changed almost nothing (maybe just slightly slower). I'm testing with MSVC though, I'll try it with icc 17.
Pashalik: Are you building for AVX2+FMA? If so, IDK if MSVC will automatically contract addpd(mulpd(x,y), z) into an FMA, so you might need to do that manually. (And you could do 256 bits at a time.) You are enabling optimization, right?
Bloodstream: My bad, I actually was using icc 17 with full optimization. I am building for SSE2.
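
To make the on-the-fly conversion from the comments concrete, here is a sketch of the convert / movlhps / FastExpSse / convert-back loop Pashalik describes (my addition, untested against the original benchmark); hoisting v.data() into a local also sidesteps the aliasing reloads mentioned above:

#include <emmintrin.h>
#include <cmath>
#include <vector>

void FastExpSseVectorViaFloat(std::vector<double> & v)
{
    double * p = v.data();               /* hoist: avoids reloading the vector's data pointer */
    const size_t I = v.size();
    const size_t N = (I / 4) * 4;

    for (size_t n = 0; n < N; n += 4) {
        __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(p + n));     /* 2 doubles -> 2 floats */
        __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(p + n + 2));
        __m128 r  = FastExpSse(_mm_movelh_ps(lo, hi));     /* 4 lanes in one call */
        _mm_storeu_pd(p + n,     _mm_cvtps_pd(r));                   /* low 2 lanes */
        _mm_storeu_pd(p + n + 2, _mm_cvtps_pd(_mm_movehl_ps(r, r))); /* high 2 lanes */
    }

    for (size_t n = N; n < I; ++n)
        p[n] = std::exp(p[n]);           /* scalar tail */
}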

Something like this should do the job. You need to tune the 1.05 constant to get a lower maximal error -- I'm too lazy to do that:

__m128d fastexp(const __m128d &x)
{
    /* x/ln(2) + (3*1024 - 1.05): adding 3*1024 keeps `scaled` inside
       [2048, 4096), where its exponent field is a constant even 1034;
       the -1.05 is the tunable error-centring term. */
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0))),
                                _mm_set1_pd(3*1024.0 - 1.05));

    /* Shift the integer part of `scaled` into the exponent field of the
       result, leaving the fractional bits to interpolate the mantissa. */
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}
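
Why the shift by 11 works (a sketch of the derivation, my addition, not from the answer): ignore the tuning part of the constant and take scaled = x/ln(2) + 3*1024 - 1. For x/ln(2) in [-1023, 1025) we have scaled in [2048, 4096), i.e. scaled = 2^11 * (1 + f) with f = (x/ln(2) + 1023)/2048, and its bit pattern read as an integer is 1034*2^52 + f*2^52. Shifting left by 11 multiplies by 2^11: the constant even exponent field 1034 drops off the top (1034*2^63 is a multiple of 2^64), leaving f*2^63 = (x/ln(2) + 1023)*2^52 -- exactly Schraudolph's construction in double precision, with the integer part of x/ln(2) landing in the exponent field and the fractional part linearly interpolating the mantissa. The leftover -0.05 of the 1.05 plays the same error-centring role as the -298765 in the float version.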

This only gets about 2.5% relative error -- for better precision you may need to add a second term.

Also, for values which overflow or underflow, this will produce unspecified results; you can avoid that by clamping the scaled value to a valid range.
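
A minimal sketch of that clamping (my addition; the exact bounds are illustrative and would need the same tuning as the constant above): keeping scaled in roughly [2048, 4095) makes underflow saturate to 0.0 and overflow to a huge finite value instead of garbage bits.

__m128d fastexp_clamped(__m128d x)
{
    __m128d scaled = _mm_add_pd(_mm_mul_pd(x, _mm_set1_pd(1.0/std::log(2.0))),
                                _mm_set1_pd(3*1024.0 - 1.05));
    scaled = _mm_max_pd(scaled, _mm_set1_pd(2048.0));  /* underflow: exponent field 0 -> ~0.0 */
    scaled = _mm_min_pd(scaled, _mm_set1_pd(4094.99)); /* overflow: large finite, not Inf/NaN */
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(scaled), 11));
}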

Frailty answered 5/6, 2018 at 12:59. Comments (12):
Bloodstream: Wow, this is great, thanks! Benchmarking this vs the original FastExp function using a union, it's "only" 20% faster. Am I doing something wrong?
Bloodstream: (I'm testing with MSVC though, I'll try it with icc 17.)
Frailty: You may get significantly faster by unrolling your loop. _mm_add_pd and _mm_mul_pd have relatively high latency. But perhaps your CPU already does this internally (out-of-order execution).
Frailty: I tried the version above with MSVC on godbolt, and it failed to compile-time compute _mm_set1_pd(1.0/std::log(2.0)). Replacing this with the hand-computed _mm_set1_pd(1.4426950408889634) gave much better code (one more magic number, of course). And I guess I just don't know the correct options to give MSVC.
Bloodstream: I actually was using icc 17 with full optimization, my bad :(
Bloodstream: For the record, I am feeding the functions a vector of 4e6 double-precision elements, and I get these results: std::exp: 103 ms, FastExp: 67 ms, SSE FastExp (single prec): 127 ms, SSE FastExp (double prec): 59 ms.
Frailty: Generally, I don't think a low-precision fastExpDouble will have many use cases. If you don't care about precision, you should work in single precision as soon and as long as possible. And doing just a single operation on a std::vector<double> may make memory bandwidth the limiting factor (try doing more operations without storing intermediate results to memory).
Bloodstream: Yeah, but I'm just doing a Gaussian blur on a structure which stores doubles, so I can't switch to floats. Anyway, thanks a lot!
Frailty: Doing a Gaussian blur is much more computation than computing exp, and the exp can be folded into that computation. But that would be a different question.
Bloodstream: Yes, you're right: for each discrete point that I want to "blur", in each box, I'm computing a squared distance (dx²+dy²+dz²), two multiplications, and an addition (plus the exponential, obviously). Maybe I could convert a few doubles to floats, do everything else in floats, and convert the sum back to double at the end. But that's maybe too much work and obfuscation for the benefit...
Pashalik: @ThreeStarProgrammer57: That would probably be worth it: converting two vectors of double and packing them into one vector of float, unlike for just exp. The conversion results land in the low 2 float elements of a __m128 (felixcloutier.com/x86/CVTPD2PS.html), so like I said you can shuffle them together with shufpd, unpcklpd, or movlhps. That would add 2x convert + 1x shuffle + 2x convert per 4 elements, but cut the rest of the work in half by having twice as many elements per vector. (Or better, if FastExpSse can be done more efficiently with float.)
Frailty: @PeterCordes True, but the thing is, for a Gaussian blur you hardly need many exp calculations. You could just precalculate a 1-D look-up table and use the exp(a+b) == exp(a)*exp(b) identity -- assuming dx, dy, dz are discrete. Or, if the convolution kernel is relatively big, consider doing a 3-D fast Fourier transform (but this is getting away from the original question).
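
As an aside, a minimal sketch of the look-up-table idea from the last comment (my addition; inv_sigma2 is a hypothetical kernel parameter), assuming discrete integer offsets:

#include <cmath>
#include <cstdlib>
#include <vector>

std::vector<double> make_exp_table(int radius, double inv_sigma2)
{
    std::vector<double> t(radius + 1);
    for (int dx = 0; dx <= radius; ++dx)
        t[dx] = std::exp(-dx * dx * inv_sigma2);   /* 1-D Gaussian factor */
    return t;
}

/* exp(-(dx^2+dy^2+dz^2)*s) == exp(-dx^2*s) * exp(-dy^2*s) * exp(-dz^2*s) */
double gauss_weight(const std::vector<double> & t, int dx, int dy, int dz)
{
    return t[std::abs(dx)] * t[std::abs(dy)] * t[std::abs(dz)];
}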
