I am looking for a fast, low-precision (~1e-3) SSE exponential function.
I came across this great answer:
/* max. rel. error = 3.55959567e-2 on [-87.33654, 88.72283] */
__m128 FastExpSse (__m128 x)
{
    __m128  a = _mm_set1_ps (12102203.0f); /* (1 << 23) / log(2) */
    __m128i b = _mm_set1_epi32 (127 * (1 << 23) - 298765);
    __m128i t = _mm_add_epi32 (_mm_cvtps_epi32 (_mm_mul_ps (a, x)), b);
    return _mm_castsi128_ps (t);
}
Based on the work of Nicol N. Schraudolph: N. N. Schraudolph, "A fast, compact approximation of the exponential function," Neural Computation 11(4), May 1999, pp. 853-862.
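As I read the paper, the trick is: exp(x) = 2^(x/ln 2), and a float's bit pattern reinterpreted as an int32 is roughly (log2(value) + 127) * 2^23, so the function builds those bits directly. Here is a scalar restatement I wrote for my own intuition (the function name is mine, not from the answer):

#include <cstdint>
#include <cstring>

float fast_exp_scalar(float x)
{
    /* (2^23 / ln 2) * x  +  127 * 2^23  -  adjustment, written into float bits */
    int32_t i = (int32_t)(12102203.0f * x) + (127 * (1 << 23) - 298765);
    float r;
    std::memcpy(&r, &i, sizeof r); /* reinterpret the integer bits as a float */
    return r;
}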
Now I would need a "double precision" version: __m128d FastExpSseDouble (__m128d x).
This is because I don't control the input and output precision, which happen to be double precision, and the two conversions (double -> float, then float -> double) are eating 50% of the CPU time.
What changes would be needed?
I naively tried this:
__m128i double_to_uint64(__m128d x) {
    /* Add 2^52 so that, for 0 <= x < 2^52, the integer value of x lands in
       the mantissa bits; then XOR away the exponent bits of 2^52. */
    x = _mm_add_pd(x, _mm_set1_pd(0x0010000000000000));
    return _mm_xor_si128(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0010000000000000))
    );
}
__m128d FastExpSseDouble(__m128d x) {
#define S 52
#define C (1llu << S) / log(2)
    __m128d a = _mm_set1_pd(C); /* (1 << 52) / log(2) */
    __m128i b = _mm_set1_epi64x(127 * (1llu << S) - 298765llu << 29);
    auto y = double_to_uint64(_mm_mul_pd(a, x));
    __m128i t = _mm_add_epi64(y, b);
    return _mm_castsi128_pd(t);
}
Of course this returns garbage as I don't know what I'm doing...
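(For reference, here is the direction the comments below point in: a sketch using the double-precision exponent bias of 1023 instead of 127, and a signed variant of the conversion trick so that negative arguments work too. The tuning constant is just my guess at scaling the float version's 298765 by 2^29; untested.)

/* Signed variant of the conversion trick (the double_to_int64 mentioned in the
   comments below): the magic constant 2^52 + 2^51 makes it valid for
   |x| < 2^51, which covers negative products a*x as well. */
__m128i double_to_int64(__m128d x) {
    x = _mm_add_pd(x, _mm_set1_pd(0x0018000000000000));
    return _mm_sub_epi64(
        _mm_castpd_si128(x),
        _mm_castpd_si128(_mm_set1_pd(0x0018000000000000))
    );
}

__m128d FastExpSseDoubleSketch(__m128d x) { /* name chosen to mark it as a sketch */
    __m128d a = _mm_set1_pd(4503599627370496.0 / log(2.0)); /* (1 << 52) / log(2) */
    /* 1023 = double-precision exponent bias; the float version's 298765
       adjustment scaled by 2^(52-23) = 2^29 (my guess at the equivalent). */
    __m128i b = _mm_set1_epi64x(1023ll * (1ll << 52) - (298765ll << 29));
    __m128i t = _mm_add_epi64(double_to_int64(_mm_mul_pd(a, x)), b);
    return _mm_castsi128_pd(t);
}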
edit:
About the 50% factor: it is a very rough estimate, comparing the speedup (with respect to std::exp) obtained on a vector of single-precision numbers (great) to the speedup obtained on a vector of double-precision numbers (not so great).
Here is the code I used:
// gives the result in place
void FastExpSseVector(std::vector<double> & v) { // vector with several million elements
    const auto I = v.size();
    const auto N = (I / 4) * 4;
    for (size_t n = 0; n < N; n += 4) {
        // alignas(16) is needed: _mm_load_ps/_mm_store_ps require 16-byte alignment
        alignas(16) float a[4] = { float(v[n]), float(v[n + 1]), float(v[n + 2]), float(v[n + 3]) };
        __m128 x = _mm_load_ps(a);
        auto r = FastExpSse(x);
        _mm_store_ps(a, r);
        v[n] = a[0];
        v[n + 1] = a[1];
        v[n + 2] = a[2];
        v[n + 3] = a[3];
    }
    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]); // scalar fallback for the tail
    }
}
And here is what I would do if I had this "double precision" version:
void FastExpSseVectorDouble(std::vector<double> & v) {
    const auto I = v.size();
    const auto N = (I / 2) * 2;
    for (size_t n = 0; n < N; n += 2) {
        __m128d x = _mm_load_pd(&v[n]);
        auto r = FastExpSseDouble(x);
        _mm_store_pd(&v[n], r);
    }
    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }
}
Comments:

So cvtpd2ps + cvtps2pd instructions had 50% of the clock cycle event samples for FastExpSse(__m128d) with on-the-fly conversion? (Not that that would be efficient, though! Unlike pack/unpack, you'd get a vector of 2 floats so it would be pure overhead.) – Pashalik

127 is the exponent bias in en.wikipedia.org/wiki/Single-precision_floating-point_format, which uses an 8-bit exponent. I'm not sure what the 298765 is from. You should leave a comment on Nic's answer on the single-precision question with a link to this question, if you haven't already. (The guy who wrote the paper is an SO user :) – Pashalik

With -O3, godbolt.org/g/G1MGs9 does manage to vectorize the double -> float and avoid actually going through a[] in memory for that step. But the float -> double step uses 4 separate scalar float->double conversion instructions. (It does manage to avoid bouncing through memory, though.) The obvious way would be to use _mm_cvtpd_ps (felixcloutier.com/x86/CVTPD2PS.html) twice and feed FastExpSse a vector with garbage in the upper 2 elements. Or convert x2 / unpcklpd / FastExpSse / convert / unpckhpd / convert. – Pashalik

FastExpSseDouble would be even faster though, if double_to_int64 isn't too slow without AVX512 packed double <-> 64-bit integer instructions. (BTW, the float version is using float -> signed int, epi32 not epu32. I think you need to handle both positive and negative numbers.) – Pashalik

With _mm_load_pd(&v[n]), gcc keeps reloading v.data because it doesn't know that v[n] doesn't alias the control block; unlike with your first loop, where type-based aliasing works for vectors of non-pointers. Getting &v[0] into a local solves the problem. godbolt.org/g/d3G9P5 does 2 values per iteration the simple way, but cvtpd2ps and the inverse each cost a shuffle + an FMA uop, and of course you only get half the work done per vector, so it's a serious bottleneck. But it might be 2x as fast as your loop. – Pashalik
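(Following up on the convert/unpack suggestion from the comments, this is how I read it; the function name, the movelh/movehl shuffles, and the unaligned load/store choices are mine, untested:)

void FastExpSseVectorViaFloat(std::vector<double> & v)
{
    double* p = v.data();             // hoisting the pointer avoids the aliasing reloads
    const size_t I = v.size();
    const size_t N = (I / 4) * 4;
    for (size_t n = 0; n < N; n += 4) {
        __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(p + n));      // 2 doubles -> floats in lanes 0..1
        __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(p + n + 2));
        __m128 r  = _mm_movelh_ps(lo, hi);                  // combine into 4 floats
        r = FastExpSse(r);                                  // one call per 4 values
        _mm_storeu_pd(p + n,     _mm_cvtps_pd(r));                    // low 2 floats -> doubles
        _mm_storeu_pd(p + n + 2, _mm_cvtps_pd(_mm_movehl_ps(r, r))); // high 2 floats -> doubles
    }
    for (size_t n = N; n < I; ++n) {
        v[n] = FastExp(v[n]);
    }
}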