How to convert the scalar code of the double version of VDT's Pade exp approximation (fast_exp()) into SSE2?
Here's the code I'm trying to convert: the double version of VDT's Pade exp approximation, fast_exp() (from the old repo resource):

inline double fast_exp(double initial_x){
    double x = initial_x;
    double px=details::fpfloor(details::LOG2E * x +0.5);

    const int32_t n = int32_t(px);

    x -= px * 6.93145751953125E-1;
    x -= px * 1.42860682030941723212E-6;

    const double xx = x * x;

    // px = x * P(x**2).
    px = details::PX1exp;
    px *= xx;
    px += details::PX2exp;
    px *= xx;
    px += details::PX3exp;
    px *= x;

    // Evaluate Q(x**2).
    double qx = details::QX1exp;
    qx *= xx;
    qx += details::QX2exp;
    qx *= xx;
    qx += details::QX3exp;
    qx *= xx;
    qx += details::QX4exp;

    // e**x = 1 + 2x P(x**2)/( Q(x**2) - P(x**2) )
    x = px / (qx - px);
    x = 1.0 + 2.0 * x;

    // Build 2^n in double.
    x *= details::uint642dp(( ((uint64_t)n) +1023)<<52);

    if (initial_x > details::EXP_LIMIT)
      x = std::numeric_limits<double>::infinity();
    if (initial_x < -details::EXP_LIMIT)
      x = 0.;

    return x; 
}

I got this:

__m128d PExpSSE_dbl(__m128d x) {
    __m128d initial_x = x;

    __m128d half = _mm_set1_pd(0.5);
    __m128d one = _mm_set1_pd(1.0);
    __m128d log2e = _mm_set1_pd(1.4426950408889634073599);
    __m128d p1 = _mm_set1_pd(1.26177193074810590878E-4);
    __m128d p2 = _mm_set1_pd(3.02994407707441961300E-2);
    __m128d p3 = _mm_set1_pd(9.99999999999999999910E-1);
    __m128d q1 = _mm_set1_pd(3.00198505138664455042E-6);
    __m128d q2 = _mm_set1_pd(2.52448340349684104192E-3);
    __m128d q3 = _mm_set1_pd(2.27265548208155028766E-1);
    __m128d q4 = _mm_set1_pd(2.00000000000000000009E0);

    __m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half);
    __m128d t = _mm_cvtepi64_pd(_mm_cvttpd_epi64(px));  
    px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one));

    __m128i n = _mm_cvtpd_epi64(px);

    x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1)));
    x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(1.42860682030941723212E-6)));
    __m128d xx = _mm_mul_pd(x, x);

    px = _mm_mul_pd(xx, p1);
    px = _mm_add_pd(px, p2);
    px = _mm_mul_pd(px, xx);
    px = _mm_add_pd(px, p3);
    px = _mm_mul_pd(px, x);

    __m128d qx = _mm_mul_pd(xx, q1);
    qx = _mm_add_pd(qx, q2);
    qx = _mm_mul_pd(xx, qx);
    qx = _mm_add_pd(qx, q3);
    qx = _mm_mul_pd(xx, qx);
    qx = _mm_add_pd(qx, q4);

    x = _mm_div_pd(px, _mm_sub_pd(qx, px));
    x = _mm_add_pd(one, _mm_mul_pd(_mm_set1_pd(2.0), x));

    n = _mm_add_epi64(n, _mm_set1_epi64x(1023));
    n = _mm_slli_epi64(n, 52);

    // return?
}

But I'm not able to finish the last lines - i.e. this code:

    if (initial_x > details::EXP_LIMIT)
      x = std::numeric_limits<double>::infinity();
    if (initial_x < -details::EXP_LIMIT)
      x = 0.;

    return x; 

How would you convert this to SSE2?

Then of course I need to check the whole thing, since I'm not quite sure I've converted it correctly.

EDIT: I found an SSE conversion of the float version of exp, i.e. from this:

/* multiply by power of 2 */
z *= details::uint322sp((n + 0x7f) << 23);

if (initial_x > details::MAXLOGF) z = std::numeric_limits<float>::infinity();
if (initial_x < details::MINLOGF) z = 0.f;

return z;

to this:

n = _mm_add_epi32(n, _mm_set1_epi32(0x7f));
n = _mm_slli_epi32(n, 23);

return _mm_mul_ps(z, _mm_castsi128_ps(n));
Tricyclic asked 25/1, 2019 at 11:44. Comments (12):
The link points to the documentation of a moderately old release; the code is maintained in the vdt project: here – Chappell
@pseyfert: thanks! This specific code seems to be the same, though – Tricyclic
Updated my answer; the first version was a quick post that I didn't take time to go into detail with. – Allineallis
A 32-bit left-shift by 23 is obviously stuffing bits into the exponent field of a single-precision float, not a double, i.e. multiplying a float by 1<<n. It doesn't implement the range-check, just the z *= statement. You already have the n = _mm_add_epi64(n, _mm_set1_epi64x(1023)); / n = _mm_slli_epi64(n, 52); part implemented in your code, for double. I didn't notice you were missing the _mm_mul_pd(result, _mm_castsi128_pd(n)) part. Is that why you were quoting that single-precision code? – Allineallis
@PeterCordes Yeah, basically I just quoted how someone translated the float version to SIMD. The code is pretty similar between float and double, isn't it? – Tricyclic
Yes, of course it's similar. IEEE binary32 and binary64 are identical except for the field widths: en.wikipedia.org/wiki/Single-precision_floating-point_format vs. en.wikipedia.org/wiki/Double-precision_floating-point_format. I'm not sure if there's a new part of the question, or why you're quoting that. Is it just for the _mm_mul_ps / _pd line that you're missing, with the type-pun _mm_cast intrinsic? Or are you saying that someone's float version left out the range-check? That's very possible if they decided they didn't want the overhead of range-checking. – Allineallis
@PeterCordes: Oh no :) I mean that the float version seems to translate the whole last part of the code in one line, while it seems you are suggesting lots of lines. Or am I getting it wrong? – Tricyclic
@PeterCordes: Yes, it seems it ignores out-of-range inputs :) I would say I can ignore them as well... – Tricyclic
_mm_mul_ps(z, _mm_castsi128_ps(n)) just implements the *= part of z *= details::uint322sp((n + 0x7f) << 23);. I thought that was obvious. Of course it would take several intrinsics to faithfully implement the range checking, especially if you don't have SSE4 for blendv. It would be easy and cheaper to implement it by forcing the result to NaN, though. But still not as cheap as nothing at all. – Allineallis
@PeterCordes In my case the input comes from a lookup table, where I already ensure it won't be out of range (hopefully :P). – Tricyclic
I was trying to see whether you need double->double rounding to nearest integer, or if you only ever need it during the process of converting to integer (like How to floor/int in double using only SSE2?). I notice the result of x = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one)); is never used; you later do x = _mmstuff(px, ... px). So that's weird. – Allineallis
@PeterCordes: That's the price of my inexperience in translating scalar code into vectorized code, sorry. There was an error: it's x = _mm_sub_pd(x, _mm_mul_pd(px, _mm_set1_pd(6.93145751953125E-1))); ... I've fixed the code. – Tricyclic
Answer (score: 6)

Yup, dividing two polynomials can often give you a better tradeoff between speed and precision than one huge polynomial, as long as there's enough other work to hide the divpd throughput cost. (The latest x86 CPUs have pretty decent FP divide throughput. It's still bad vs. multiply, but it's only 1 uop, so it doesn't stall the pipeline if you use it rarely enough, i.e. mixed with lots of multiplies, including in the surrounding code that uses exp.)


However, _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); won't work with SSE2. Packed-conversion intrinsics to/from 64-bit integers require AVX512DQ.

To do packed rounding to the nearest integer, ideally you'd use SSE4.1 _mm_round_pd(x, _MM_FROUND_TO_NEAREST_INT |_MM_FROUND_NO_EXC) (or truncation towards zero, or floor/ceil towards -/+Inf).

But we don't actually need that.

The scalar code ends up with int n and double px both representing the same numeric value. It uses the bad/buggy floor(val+0.5) idiom instead of rint(val) or nearbyint(val) to round to nearest, and then converts that already-integer double to an int (with C++'s truncation semantics, but that doesn't matter because the double value's already an exact integer.)

With SIMD intrinsics, it appears to be easiest to just convert to 32-bit integer and back.

__m128i n  = _mm_cvtpd_epi32( _mm_mul_pd(log2e, x) );   // round to nearest
__m128d px = _mm_cvtepi32_pd( n );

Rounding to int with the desired mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.)

The cvtpd2dq and cvtdq2pd instructions are 2 uops each, because they also shuffle the 32-bit integers to/from packed in the low 64 bits of a vector. So to set up for 64-bit integer shifts to stuff the bits into a double again, you'll need to shuffle. The top 64 bits of n will be zeros, so we can use that to create 64-bit integer n lined up with the doubles:

n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3,1,2,0));   // 64-bit integers
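
Putting those pieces together with the shift/multiply from your own code, a minimal sketch of the whole 2^n part could look like this (variable names follow your code; the argument reduction and polynomial evaluation in the middle are elided):

__m128i n  = _mm_cvtpd_epi32(_mm_mul_pd(log2e, x));   // round to nearest; 2 results in the low 64 bits
__m128d px = _mm_cvtepi32_pd(n);                      // the same integer value back as double

// ... argument reduction and the P/Q polynomial evaluation as in your code ...

n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3, 1, 2, 0));    // widen to 64-bit lanes (upper halves are zero)
n = _mm_add_epi64(n, _mm_set1_epi64x(1023));          // apply the exponent bias
n = _mm_slli_epi64(n, 52);                            // move it into the exponent field
x = _mm_mul_pd(x, _mm_castsi128_pd(n));               // x *= 2^n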

With just SSE2, there are workarounds. Converting to 32-bit integer and back is one option: you don't care about inputs too small or too large. Packed conversion between double and int costs at least 2 uops on Intel CPUs each way, so a total of 4. But only 2 of those uops need the FMA units (the other 2 are shuffles for port 5), and your code probably doesn't bottleneck on port 5 with all those multiplies and adds.

Or add a very large number and subtract it again: large enough that the doubles in that range are spaced exactly 1 apart, so normal FP rounding does what you want. (This works for inputs that won't fit in 32 bits, but not for doubles above 2^52, so for exp's input range either approach works.) Also see How to efficiently perform double/int64 conversions with SSE/AVX? which uses that trick; I couldn't find a standalone example on SO, though.
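
A minimal sketch of that trick, assuming the default round-to-nearest mode and |LOG2E * x| well below 2^51 (the magic constant is 2^52 + 2^51):

const __m128d round_magic = _mm_set1_pd(6755399441055744.0);   // 2^52 + 2^51
__m128d px = _mm_sub_pd(_mm_add_pd(_mm_mul_pd(log2e, x), round_magic), round_magic);  // integer-valued double

Note this rounds to nearest with ties-to-even rather than reproducing the floor(v + 0.5) tie-break exactly; the linked Q&A also shows how to pull the integer n back out of the bit-pattern of the intermediate sum.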



"Then of course I need to check the whole thing, since I'm not quite sure I've converted it correctly."

Iterating over all 2^64 double bit-patterns is impractical, unlike for float where there are only 4 billion, but maybe iterating over all doubles that have the low 32 bits of their mantissa all zero would be a good start, i.e. checking in a loop with

bitpatterns = _mm_add_epi64(bitpatterns, _mm_set1_epi64x( 1ULL << 32 ));
doubles = _mm_castsi128_pd(bitpatterns);

https://randomascii.wordpress.com/2014/01/27/theres-only-four-billion-floatsso-test-them-all/
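
A rough sketch of that kind of test harness (assuming fast_exp is your scalar reference and PExpSSE_dbl your finished vector version, both declared elsewhere; the relative tolerance is an arbitrary choice):

#include <emmintrin.h>
#include <cstdint>
#include <cstdio>
#include <cmath>

void compare_vs_scalar()
{
    for (uint64_t hi = 0; hi < (1ULL << 32); hi += 2) {
        // two bit-patterns per iteration, both with the low 32 mantissa bits zero
        __m128i bitpatterns = _mm_set_epi64x((long long)((hi + 1) << 32),
                                             (long long)(hi << 32));
        __m128d doubles = _mm_castsi128_pd(bitpatterns);

        double in[2], out[2];
        _mm_storeu_pd(in, doubles);
        _mm_storeu_pd(out, PExpSSE_dbl(doubles));

        for (int i = 0; i < 2; ++i) {
            double ref = fast_exp(in[i]);                     // scalar reference
            if (std::fabs(out[i] - ref) > 1e-14 * std::fabs(ref))
                std::printf("mismatch for %a: %a vs %a\n", in[i], out[i], ref);
        }
    }
}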


For those last few lines, correcting the output for out-of-range inputs:

The float version you quote just leaves out the range-check entirely. This is obviously the fastest way, if your inputs will always be in range or if you don't care about what happens for out-of-range inputs.

An alternative, cheaper range check (maybe only for debugging) would be to turn out-of-range results into NaN by ORing the packed-compare result into the result. (An all-ones bit-pattern is a NaN.)

__m128d out_of_bounds = _mm_cmplt_pd( limit, abs(initial_x) );  // abs = mask off the sign bit
result = _mm_or_pd(result, out_of_bounds);
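
Spelling out the abs and the limit, that could look like this (a sketch; EXP_LIMIT stands in for details::EXP_LIMIT):

__m128d abs_x         = _mm_andnot_pd(_mm_set1_pd(-0.0), initial_x);     // clear the sign bit
__m128d out_of_bounds = _mm_cmplt_pd(_mm_set1_pd(EXP_LIMIT), abs_x);     // all-ones where |x| > limit
result = _mm_or_pd(result, out_of_bounds);                               // all-ones bit-pattern = NaN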

In general, you can vectorize simple condition setting of a value using branchless compare + blend. Instead of if(x) y=0;, you have the SIMD equivalent of y = (condition) ? 0 : y;, on a per-element basis. SIMD compares produce a mask of all-zero / all-one elements so you can use it to blend.

e.g. in this case cmppd the input and blendvpd the output if you have SSE4.1. Or with just SSE2, and/andnot/or to blend. See SSE intrinsics for comparison (_mm_cmpeq_ps) and assignment operation for a _ps version of both; the _pd versions are identical.

In asm it will look like this:

; result in xmm0  (in need of fixups for out of range inputs)
; initial_x in xmm2
; constants:
;     xmm5 = limit
;     xmm6 = +Inf
cmpltpd  xmm2, xmm5    ; xmm2 = input_x < limit ? 0xffff... : 0
andpd    xmm0, xmm2    ; result = result or 0
andnpd   xmm2, xmm6    ; xmm2 =  0 or +Inf   (In that order because we used ANDN)
orpd     xmm0, xmm2    ; result |= 0 or +Inf
; xmm0 = (input < limit) ? result : +Inf

(In an earlier version of the answer, I thought I was maybe saving a movaps to copy a register, but this is just a bog-standard blend. It destroys initial_x, so the compiler needs to copy that register at some point while calculating result, though.)
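
The same blend written with intrinsics looks something like this (a sketch; limit and inf would be _mm_set1_pd constants set up outside the loop):

__m128d in_range = _mm_cmplt_pd(initial_x, limit);     // all-ones where input < limit
__m128d kept     = _mm_and_pd(result, in_range);       // result or 0
__m128d forced   = _mm_andnot_pd(in_range, inf);       // 0 or +Inf
result = _mm_or_pd(kept, forced);                      // (input < limit) ? result : +Inf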


Optimizations for this special condition

Or in this case, 0.0 is represented by an all-zero bit-pattern, so do a compare that will produce true if in-range, and AND the output with that (to leave it unchanged or force it to +0.0). This is better than _mm_blendv_pd, which costs 2 uops on most Intel CPUs (and the AVX 128-bit version always costs 2 uops on Intel), and it's not worse on AMD or Skylake.
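
For the lower bound, that could look like this (a sketch; EXP_LIMIT stands in for details::EXP_LIMIT):

__m128d not_too_small = _mm_cmpge_pd(initial_x, _mm_set1_pd(-EXP_LIMIT));  // all-ones where in range
result = _mm_and_pd(result, not_too_small);                                // force +0.0 where initial_x < -EXP_LIMIT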

+-Inf is represented by a bit-pattern of significand=0, exponent=all-ones. (Any other value in the significand represents +-NaN.) Since too-large inputs will presumably still leave non-zero significands, we can't just AND the compare result and OR that into the final result. I think we need to do a regular blend, or something as expensive (3 uops and a vector constant).

It adds 2 cycles of latency to the final result: both the ANDPD and ORPD are on the critical path. The CMPPD and ANDNPD aren't; they can run in parallel with whatever you do to compute the result.

Hopefully your compiler will actually use ANDPS and so on, not PD, for everything except the CMP: it's 1 byte shorter but identical, since they're all just bitwise ops. I wrote ANDPD only so I didn't have to explain this in comments.


You might be able to shorten the critical path latency by combining both fixups before applying to the result, so you only have one blend. But then I think you also need to combine the compare results.

Or since your upper and lower bounds are the same magnitude, maybe you can compare the absolute value? (mask off the sign bit of initial_x and do _mm_cmplt_pd(abs_initial_x, _mm_set1_pd(details::EXP_LIMIT))). But then you have to sort out whether to zero or set to +Inf.

If you had SSE4.1 for _mm_blendv_pd, you could use initial_x itself as the blend control for the fixup that might need applying, because blendv only cares about the sign bit of the blend control (unlike with the AND/ANDN/OR version where all bits need to match.)

__m128d  fixup = _mm_blendv_pd( _mm_set1_pd(INFINITY), _mm_setzero_pd(), initial_x );  // fixup = (initial_x signbit set) ? 0 : +Inf
 // see below for generating fixup with an SSE2 integer arithmetic-shift

const __m128d  signbit_mask = _mm_castsi128_pd(_mm_set1_epi64x(0x7fffffffffffffff));  // ~ set1(-0.0)
__m128d  abs_init_x = _mm_and_pd( initial_x, signbit_mask );

__m128d out_of_range = _mm_cmpgt_pd(abs_init_x, _mm_set1_pd(details::EXP_LIMIT));

// Conditionally apply the fixup to result
result = _mm_blendv_pd(result, fixup, out_of_range);

Possibly use cmplt instead of cmpgt and rearrange if you care what happens for initial_x being a NaN. Choosing the compare so false applies the fixup instead of true will mean that an unordered comparison results in either 0 or +Inf for an input of -NaN or +NaN. This still doesn't do NaN propagation. You could _mm_cmpunord_pd(initial_x, initial_x) and OR that into fixup, if you want to make that happen.
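
For example (a sketch reusing abs_init_x and fixup from above, with the compare chosen so that false, including the unordered NaN case, selects the fixup; blendv reads the sign bit, which is set for an all-ones true result):

__m128d in_range = _mm_cmple_pd(abs_init_x, _mm_set1_pd(details::EXP_LIMIT));  // false for out-of-range and for NaN
fixup  = _mm_or_pd(fixup, _mm_cmpunord_pd(initial_x, initial_x));              // make fixup a NaN where the input is NaN
result = _mm_blendv_pd(fixup, result, in_range);                               // true (all-ones) keeps result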

Especially on Skylake and AMD Bulldozer/Ryzen, where the legacy-SSE encoding of blendvpd is only 1 uop, this should be pretty nice. (The VEX encoding, vblendvpd, is 2 uops because it has 3 inputs and a separate output.)

You might still be able to use some of this idea with only SSE2, maybe creating fixup by doing a compare against zero and then _mm_and_pd or _mm_andnot_pd with the compare result and +Infinity.
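
For example (a sketch), a compare against zero and an AND produce the same 0-or-+Inf fixup without needing SSE4.1; you'd then apply it with the and/andnot/or blend from above:

__m128d nonneg = _mm_cmpge_pd(initial_x, _mm_setzero_pd());     // all-ones where x >= 0
__m128d fixup  = _mm_and_pd(nonneg, _mm_set1_pd(INFINITY));     // +Inf for non-negative x, 0 for negative x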


Using an integer arithmetic shift to broadcast the sign bit to every position in the double isn't efficient: psraq doesn't exist, only psraw/d. Only logical shifts come in 64-bit element size.

But you could create fixup with just one integer shift and mask, and a bitwise invert:

__m128i ix     = _mm_castpd_si128(initial_x);
__m128i ifixup = _mm_srai_epi32(ix, 11);              // broadcast the sign bit over the 11-bit exponent field (high dword)
ifixup = _mm_xor_si128(ifixup, _mm_set1_epi32(-1));   // bitwise invert
ifixup = _mm_and_si128(ifixup, _mm_set1_epi64x(0x7FF0000000000000ULL));  // keep only the exponent field
// ifixup = the bit pattern for +Inf (non-negative x) or 0 (negative x)

__m128d fixup = _mm_castsi128_pd(ifixup);

Then blend fixup into result for out-of-range inputs as normal.


Cheaply checking abs(initial_x) > details::EXP_LIMIT

If the exp algorithm were already squaring initial_x, you could compare against EXP_LIMIT squared. But it's not: xx = x*x only happens after some computation to create x.


If you have AVX512F/VL, VFIXUPIMMPD might be handy here. It's designed for functions where the special case outputs are from "special" inputs like NaN and +-Inf, negative, positive, or zero, saving a compare for those cases. (e.g. for after a Newton-Raphson reciprocal(x) for x=0.)

But both of your special cases need compares. Or do they?

If you square your input and subtract, it only costs one FMA to do initial_x * initial_x - details::EXP_LIMIT * details::EXP_LIMIT to create a result that's negative for abs(initial_x) < details::EXP_LIMIT, and non-negative otherwise.
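
A sketch of that (this needs FMA, not just SSE2; EXP_LIMIT stands in for details::EXP_LIMIT, and the sign of the result can drive blendv directly, since blendv only reads the sign bit):

__m128d range_check = _mm_fmsub_pd(initial_x, initial_x,
                                   _mm_set1_pd(EXP_LIMIT * EXP_LIMIT));   // negative means in range
result = _mm_blendv_pd(fixup, result, range_check);                       // sign set (in range) keeps result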

Agner Fog reports that vfixupimmpd is only 1 uop on Skylake-X.

Allineallis answered 26/1, 2019 at 11:15. Comments (31):
I feel like there should be something you can do with an integer compare on a bit pattern you already have at some point, to check the magnitude of initial_x. (The integer value of the bit-pattern increases monotonically as the double increases, apart from sign/magnitude vs. two's complement, so nextafter can be done with an integer increment. This is why the exponent field is biased.) – Allineallis
Like maybe compare a right-shifted ifixup with a right-shifted details::EXP_LIMIT. But with srai/and, you're not stripping the sign bit until you also throw away all the bits other than the copies of the sign bit in the exponent field. – Allineallis
Please check the last edit I've made to this question: could it help? I'm not quite sure you are suggesting I do this, but before digging into the awesome reply you gave me, I'd like to be sure I can "quickly" do something like that :) – Tricyclic
YES! You were right: _mm_cvtepi64_pd(_mm_cvttpd_epi64(px)); crashes. How can I fix it? Should I open a new question? – Tricyclic
@markzzz: Oh right, you're using MSVC, which lets you use intrinsics for instruction sets you haven't enabled. (GCC wouldn't let it compile without -march=skylake-avx512.) Anyway, yes, open a new question. I searched but didn't find an exact duplicate for how to truncate a __m128d to the next smallest-magnitude integer without SSE4.1 for roundpd. In your case only small-magnitude inputs are relevant, so adding a huge number and subtracting it again will work, like my answer says, so you can probably post an answer for the new question if you can get the right constant. – Allineallis
Considering the original double function will work in the range of +708/-708, probably _mm_cvtpd_epi32 is enough, without any kind of precision loss... – Tricyclic
I meant that in the original code too (github.com/dpiparo/vdt/blob/master/include/exp.h#L75) it casts double to int32, so probably _mm_cvtpd_epi32 is enough (since _mm_cvtpd_epi64 doesn't exist in SSE2). Likewise _mm_cvtepi32_pd(_mm_cvttpd_epi32(px)) should be enough, right? – Tricyclic
@markzzz: yes, but you should probably be using the same conversion for both cases, not truncation for one and nearest for the other. Isn't the algorithm trying to split the integer and fractional parts? In the scalar version, it's casting to int (i.e. with truncation) on a value that's already an integer from floor, so it doesn't matter what rounding mode is used. So you should probably do n = _mm_cvtpd_epi32(something); (round to nearest) / px = _mm_cvtepi32_pd(n) (convert back). Thus a total of 2 conversions. – Allineallis
It would be nice to avoid the extra shuffling that converting to 32-bit int does, but I don't think we can. Unless maybe Mysticial's IEEE tricks for double->int are more efficient overall. But you do still need the fractional part as a double, too, so you either have to convert back or expand the tricky stuff. – Allineallis
Not sure what you mean. I think it uses the floor(d + 0.5) approach instead of round() because the latter depends on the system's rounding mode. Instead, with floor(d + 0.5) you ensure that the rounding will round to nearest every time. Thus, _mm_cvtepi32_pd(_mm_cvttpd_epi32(px)) is needed (and not only _mm_cvtepi32_pd). If you only do _mm_cvtepi32_pd you are not sure the rounding is to-nearest (because, as said, it depends on fegetround()). Am I wrong? – Tricyclic
@markzzz: you can assume that the rounding mode is nearest unless you changed it. Normal FP code always assumes the rounding mode is the default. Compiler optimizations and compile-time evaluation / constant-propagation assume this unless you use #pragma STDC FENV_ACCESS ON and/or compile with special options. To put it another way, nobody expects you can set the rounding mode and then call other functions and expect them to work. Equally importantly, truncation towards 0 is not the same as floor toward -Infinity for negative numbers: -1.3 floors to -2, but truncates to -1. – Allineallis
@markzzz: and most importantly of all, px = floor(...); n = int32_t(px); guarantees that px and n represent the same integer value, not off-by-1 from each other if they rounded differently. (Like if you truncated the LOG2E * x +0.5 to get n, but floored it to get px.) Therefore, if your replacement for floor involves converting to integer and back, you should keep that integer value and use it as n. The only questions are what convert-to-int method you use, and what input do you feed it. floor(foo + 0.5) is a sloppy round-to-nearest, so use _mm_cvtpd_epi32, not the truncating tt version. – Allineallis
So you suggest simply converting double px = details::fpfloor(details::LOG2E * x + 0.5); to __m128d px = _mm_cvtepi32_pd(_mm_cvttpd_epi32(_mm_mul_pd(log2e, x)));? But this will give different results from the library, which uses floor(x + 0.5) at github.com/dpiparo/vdt/blob/master/include/exp.h#L73. I get that your suggestion will round "better" using native round-to-nearest, but what I'm looking for is to translate the library to SIMD :) – Tricyclic
No, read carefully. n = _mm_cvtpd_epi32(_mm_mul_pd(log2e, x)); using nearest, not _mm_cvttpd_epi32 truncation, so it is the same as what floor(stuff + 0.5) is doing, except without being wrong for a couple of corner cases like 0.49999999999999994. See round() for float in C++. Anyway, then px = _mm_cvtepi32_pd(n); gives you that as packed-double again. (int)floor(val + 0.5) is a common but bad and slightly buggy idiom for round-to-nearest. I'm suggesting you emulate that without the bugs. – Allineallis
But I also need to calculate px as floor(stuff + 0.5) :) Note that github.com/dpiparo/vdt/blob/master/include/exp.h#L77 will use px, which is calculated with floor(x + 0.5). That's why I also need to calculate it: I can't ignore it!! And yeah, I know that (int)floor(val + 0.5) is a common but bad and slightly buggy idiom for round-to-nearest. – Tricyclic
@markzzz: Right, so you do px = _mm_cvtepi32_pd(n), like I've said at least 3 times now. Rounding to int with the correct mode, then converting back to double, is equivalent to double->double rounding and then grabbing an int version of that like the scalar version does. (Because you don't care what happens for doubles too large to fit in an int.) – Allineallis
:) But n = _mm_cvtpd_epi32(_mm_mul_pd(log2e, x)); doesn't seem to me to be "rounding to int with the correct mode". Try with value = 0.499999999999999944488848768742172978818416595458984375: floor(value + 0.5) != round(value). I'm going to be comparing scalar vs vector. I know that's a corner case, but as I said, I'd like to obtain exactly the same results :) Sorry for being so pedantic... – Tricyclic
@markzzz: 0.49999999999999994 is less than 0.5, so the correct round-to-nearest result is 0, which you get from _mm_cvtpd_epi32, right? I already linked round() for float in C++ which explains that floor(value + 0.5) is not the correctly rounded result for that input. They're different because it's the floor emulation of round() that sucks, not the actual round-to-nearest version. (And BTW, floor(val+0.5) doesn't really emulate the away-from-zero tiebreak behaviour of round(); it tie-breaks towards +Inf, I think.) – Allineallis
Of course ;) But the original Pade double function doesn't use round-to-nearest, so my conversion to SIMD code would output different results. I know it will be better, but I'm not after improving the results, I want to get the same values in SIMD code, given the same inputs... – Tricyclic
@markzzz: you really want to insist on being bug-compatible / bit-exact with the scalar version? Why? If you want to have both in your code, why not fix the scalar version and use rint( val ) in it? – Allineallis
I don't want to "touch" the original code; I'm making some benchmarks based on different libraries, and I need to calibrate performance against accuracy. I won't "fake" the benchmark by replacing code, that's all (also because I would use it as a library, etc.). So, for the range I can have as input, is __m128d px = _mm_add_pd(_mm_mul_pd(log2e, x), half); __m128d t = _mm_cvtepi32_pd(_mm_cvttpd_epi32(px)); px = _mm_sub_pd(t, _mm_and_pd(_mm_cmplt_pd(px, t), one)); equivalent, in your opinion? – Tricyclic
@markzzz: yeah, if you insist on emulating the stupid floor exactly without SSE4 _mm_floor_pd, then I think that emulation based on truncation looks right. And that's probably better than actually changing the MXCSR rounding mode. But for benchmarking, why not use SSE4.1 _mm_floor_pd? Or are you testing on ancient CPUs like first-gen Core 2? – Allineallis
@markzzz: Is that cmplt_pd supposed to correct for floor (-Inf) vs. trunc (0) for negative inputs? Why is it comparing px and t, instead of px against 0? I guess it needs to avoid changing the value if it was already a negative exact integer? Otherwise you could do the -1 / -0 fixup in integer, with _mm_cmpgt_epi32, before converting back to _pd. Then you already have your __m128i n instead of having to convert back to integer yet again. Plus, _mm_cmpgt_epi32's all-ones / all-zeroes result is 0 or -1, so you can just add it. – Allineallis
Because I'll release an "audio plugin" which needs to work on "old" PCs too. In this field, SSE2 seems to be the right requirement. Anyway, the code doesn't seem to work. Or at least: it returns the same values for the even positions within the __m128d (check my updated answer with the last version I have). Maybe those _mm_add_epi64 / _mm_set1_epi64x at the end? I've changed it to the epi32 version, still doesn't work. Hmm... – Tricyclic
Yeah... debugging, it seems that n becomes 0 after n = _mm_slli_epi32(n, 52); – Tricyclic
@markzzz: right, but for now you're just benchmarking vs. the library. When you actually release a plugin, surely you want to use round-to-nearest because it's faster and (presumably) more accurate. Although I guess you'd maybe want to see if the min/max error shifted any because of sometimes having different fraction bits for some inputs. – Allineallis
@markzzz: I updated this answer a while ago with the shuffle you need to turn a _mm_cvt[t]pd_epi32 result into 64-bit integers lined up with your doubles. And yes, SSE shifts saturate the shift count, so a 52-bit shift on 32-bit elements zeroes them. This shuffling to the bottom is why cvtpd_epi32 is slower than cvtps_epi32: the convert instruction decodes to an extra shuffle uop. – Allineallis
Oh I see. I missed that, sorry. I need to learn what it actually does; I tried on the fly n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3, 1, 2, 0)); n = _mm_add_epi32(n, _mm_set1_epi32(1023)); n = _mm_slli_epi32(n, 52); but it fails :) – Tricyclic
@markzzz: epi32 is 32-bit element size. You want epi64. Like I already said, a 52-bit shift on 32-bit elements zeroes them. – Allineallis
Oh wait, keeping epi64 seems to work :) n = _mm_shuffle_epi32(n, _MM_SHUFFLE(3, 1, 2, 0)); n = _mm_add_epi64(n, _mm_set1_epi64x(1023)); n = _mm_slli_epi64(n, 52); – Tricyclic
@markzzz: yes, setting up for _epi64 is the entire point of the shuffle, obviously. – Allineallis
