Comparing two values in the form (a + sqrt(b)) as fast as possible?
As part of a program that I'm writing, I need to compare two values in the form a + sqrt(b) where a and b are unsigned integers. As this is part of a tight loop, I'd like this comparison to run as fast as possible. (If it matters, I'm running the code on x86-64 machines, and the unsigned integers are no larger than 10^6. Also, I know for a fact that a1<a2.)

As a stand-alone function, this is what I'm trying to optimize. My numbers are small enough integers that double (or even float) can exactly represent them, but rounding error in sqrt results must not change the outcome.

// known pre-condition: a1 < a2  in case that helps
bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1+sqrt(b1) < a2+sqrt(b2);  // computed mathematically exactly
}

Test case: is_smaller(900000, 1000000, 900001, 998002) should return true, but as shown in comments by @wim, computing it with sqrtf() would return false. So would using (int)sqrt() to truncate back to integer.

a1+sqrt(b1) = 901000 and a2+sqrt(b2) = 901000.00050050037512481206. The nearest float to the latter is exactly 901000.
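For reference, the effect is easy to reproduce; a minimal check (the helper names are made up for illustration):

```cpp
#include <cassert>
#include <cmath>

// Same comparison at two precisions. For the test case above, the exact
// difference is about 0.0005, but the float spacing (ulp) near 901000 is
// 0.0625, so the float version rounds both sides to exactly 901000.0f.
bool is_smaller_float(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1 + std::sqrt((float)b1) < a2 + std::sqrt((float)b2);
}

bool is_smaller_double(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1 + std::sqrt((double)b1) < a2 + std::sqrt((double)b2);
}
```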


As the sqrt() function is generally quite expensive even on modern x86-64 when fully inlined as a sqrtsd instruction, I'm trying to avoid calling sqrt() as far as possible.

Removing sqrt by squaring potentially also avoids any danger of rounding errors by making all computation exact.

If instead the function was something like this ...

bool is_smaller(unsigned a1, unsigned b1, unsigned x) {
    return a1+sqrt(b1) < x;
}

... then I could just do return x>a1 && static_cast<uint64_t>(x-a1)*(x-a1)>b1; (note that a plain x-a1>=0 test is always true for unsigned operands, so the guard has to be written as x>a1).
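As a compilable sketch of that squared form (the function name is made up):

```cpp
#include <cassert>
#include <cstdint>

// a1 + sqrt(b1) < x  <=>  x > a1  &&  (x - a1)^2 > b1, all in exact integer math.
// The x > a1 guard is required because x - a1 is unsigned and would wrap.
bool is_smaller_one_sqrt(unsigned a1, unsigned b1, unsigned x) {
    return x > a1 && (uint64_t)(x - a1) * (x - a1) > b1;
}
```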

But now since there are two sqrt(...) terms, I cannot do the same algebraic manipulation.

I could square the values twice, by using this formula:

      a1 + sqrt(b1) = a2 + sqrt(b2)
<==>  a1 - a2 = sqrt(b2) - sqrt(b1)
<==>  (a1 - a2) * (a1 - a2) = b1 + b2 - 2 * sqrt(b1) * sqrt(b2)
<==>  (a1 - a2) * (a1 - a2) = b1 + b2 - 2 * sqrt(b1 * b2)
<==>  (a1 - a2) * (a1 - a2) - (b1 + b2) = - 2 * sqrt(b1 * b2)
<==>  ((b1 + b2) - (a1 - a2) * (a1 - a2)) / 2 = sqrt(b1 * b2)
<==>  ((b1 + b2) - (a1 - a2) * (a1 - a2)) * ((b1 + b2) - (a1 - a2) * (a1 - a2)) / 4 = b1 * b2

Unsigned division by 4 is cheap because it is just a bitshift, but since I square the numbers twice I will need to use 128-bit integers and I will need to introduce a few >=0 checks (because I'm comparing inequality instead of equality).
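Spelled out for the strict inequality, with the sign checks made explicit, that manipulation becomes something like this (a sketch; the name is made up, and with my 10^6 bound the intermediates happen to fit in 64 bits, so 128-bit integers turn out not to be needed):

```cpp
#include <cassert>
#include <cstdint>

// Pre-conditions: a1 < a2 and all inputs <= 10^6.
// a1 + sqrt(b1) < a2 + sqrt(b2)        with d = a2 - a1 > 0
//   <=> sqrt(b1) - sqrt(b2) < d
// If b1 <= b2 the left side is <= 0 < d, so the answer is true.
// Otherwise both sides are positive and squaring is safe:
//   b1 + b2 - 2*sqrt(b1*b2) < d*d
// Let M = b1 + b2 - d*d.  If M < 0 the answer is true; otherwise square again:
//   M*M < 4*b1*b2
// With the 10^6 bound, M <= 2*10^6 and b1*b2 <= 10^12, so int64_t suffices.
bool is_smaller_exact(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int64_t d = (int64_t)a2 - (int64_t)a1;   // > 0 by pre-condition
    if (b1 <= b2) return true;
    int64_t M = (int64_t)b1 + (int64_t)b2 - d * d;
    if (M < 0) return true;
    return M * M < 4 * (int64_t)b1 * (int64_t)b2;
}
```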

It feels like there might be a way to do this faster, by applying better algebra to this problem. Is there a way to do this faster?

Prefatory answered 8/5, 2019 at 8:58 Comment(23)
Just an observation: if a1+sqrt(b1)<a2 is true then you can skip the calculation of sqrt(b2).Rollback
assemblyrequired.crashworks.org/timing-square-rootNeron
you can also observe the fact that max(sqrt(b)) = 1000 if b <= 10^6, so you only need to investigate further if abs(a1-a2) <= 1000; otherwise there is always inequalityAnglofrench
Clang actually does a pretty decent job of optimizing this with FP shuffles, to do both sqrt operations in parallel. (Use signed int so it can do SIMD packed int->FP conversion cheaply) godbolt.org/z/GvNe2B. (includes versions with float and double; clang vectorizes better with double, but sqrtpd is slower than sqrtps) But yeah, multiplying twice with scalar 128-bit math might be good, especially if we don't care about Intel before Broadwell (where adc was 2 uops).Mercurochrome
I'm afraid you are going too fast into the code: there's another question "#52807571", where I've given a way to reduce floating point arithmetic, just by interpreting the use case. Can you explain us the use case, maybe we might come up with a better solution? (What's the meaning of "a1+sqrt(b1)<a2+sqrt(b2)"?)Margheritamargi
another brute-force approach would be a lookup array: x2->x: [0-10^6] -> [0-1000]. you'd have to measure this ... more space needed (2-byte integer x 10^6 integers) -> 2 * 10^6 bytes ..Anglofrench
@MarekR: That's assuming your number already started as a float or double, not integer. cvtsi2ss isn't free. And it's on a Core 2, which has much slower hardware sqrt than Skylake (6 to 29 cycles per instruction on Core2, with the fastest numbers only for trivial cases like 0). Skylake has fixed one per 3-cycle throughput for single-precision FP sqrt sqrtps (but still 12 cycle latency). agner.org/optimize But yes, there might be something we can gain from sqrt(x) ~= x * rsqrtps(x) if we exclude the x==0 case. That might be enough precision to round to nearest integer.Mercurochrome
@Anglofrench That would work for my simpler example (with 3 parameters only). But when I have square roots on both sides on the inequality, I don't think I can use a lookup array... (what if I'm trying to do is_smaller(1, 7, 2, 3) ?)Prefatory
@StPiere: Using a LUT for sqrt is going to be horrible on modern x86 if inputs are fairly uniformly distributed. A 4MiB cache footprint is way bigger than L2 cache size (typically 256kiB), so you'll mostly be getting L3 hits at best, like 45 cycle latency on Skylake. But even on a really old Core 2, single-precision sqrt has worst-case 29 cycle latency. (With a couple more cycles to convert to FP in the first place). On Skylake, FP sqrt latency ~= L2 cache hit latency and is pipelined at throughput = latency/4. Not to mention the impact of cache pollution on other code.Mercurochrome
@Anglofrench Since a1 < a2, the abs() can be removed too.Overdress
Since a1 < a2, you can already exclude directly all cases where b1 < b2Blister
@Peter Cordes: yes, you're probably right. But I would still measure it first. We know nothing about how uniform the data distribution is or about the problem itself. For the LUT values 2 bytes suffice (<=1000), so I come to 2MB needed, but this doesn't change much. Measure first, then decide. It is possible (although with small probability) that for this problem a LUT outperforms the sqrt calculation if the data distribution is nonuniform. Looking at the problem from multiple sides is always goodAnglofrench
@StPiere: An early-out fast path using integer checks is what I'd investigate first. Only if it was branch-mispredicting too much would I even consider a giant LUT. (But yes, 2MB, good point that we can use a narrow type for LUT values.) Cache misses lead to occasional big stalls too big for OoO exec to hide. But depending how sqrt latency fits into the critical path, out-of-order exec may be able to mostly hide it. (But sqrt throughput is definitely a concern).Mercurochrome
@500-InternalServerError you can actually avoid the sqrt here. Set t=a2-a1, which is positive. You know that you want to test a1+sqrt(b1) < a2+sqrt(b2) which is equivalent to testing sqrt(b1) - sqrt(b2) < t. So the quick way is to test sqrt(b1) < t, which is equivalent to testing b1 < t*t, but t*t might be too big, so you could just test with integer division if (b1/t)/t == 0Blister
Very important: How is the distribution of those values? Are they likely clustered or uniformly distributed?Ennead
@DanielJour They are reasonably uniformly distributed in range [0, 10^6].Prefatory
Note that the results with single precision: return a1+sqrtf(b1) < a2+sqrtf(b2);, differ from the double precision results (return a1+sqrt(b1) < a2+sqrt(b2);), for a1 = 900000; b1 = 1000000; a2 = 900001; b2 = 998002; which is true in double precision and false in single precision. This suggests that you shouldn't use fast approximate sqrt methods.Bronder
@wim: there should be enough precision; maybe we need to roundps to the nearest integer, with nearest or floor. Or convert back to integer with truncation. Maybe I made a mistake when I moved the OP's pseudocode-comment in the question into actual C, because I assumed computing in FP would be equivalent to integer-sqrt (i.e. floor), in case that's what they meant.Mercurochrome
@wim: the exact results there are a1+sqrt(b1) = 90100 and a2+sqrt(b2) = 901000.00050050037512481206. The nearest float to that is 90100, so they're equal not less. h-schmidt.net/FloatConverter/IEEE754.html. So IDK whether the OP wants a1 + (int)sqrt(b1) or a1 + lrint(sqrt(b1)), or exact (for which double-precision may do the job), or what.Mercurochrome
@PeterCordes; (replaces an earlier comment which was wrongly formatted). With an integer sqrt we don't have, for example, 10 + sqrt(10) < 10 + sqrt(11), because both sides are equal. That would be undesirable I think. Because, after some algebraic integer manipulations you may find that actually 10 + sqrt(10) < 10 + sqrt(11). Therefore a double precision a1+sqrt(b1) < a2+sqrt(b2); seems most reasonable to me.Bronder
@PeterCordes I want an exact comparison, so is_smaller(900000, 1000000, 900001, 998002) should return true.Prefatory
I updated your question with that test case, that's extremely helpful and definitely rules out that way of using float like @Bronder proved. Any comment on expected distribution of inputs? If we assume uniform, then @EricTowers calculated that even the simple a2-a1 < 1000 early-out check rejects 99.8% of all inputs, and thus might be better than geza's check that requires a multiply to reject 99.946%. But most numbers in software aren't uniform.Mercurochrome
@PeterCordes I'm not too sure about the expected distribution of inputs. I'll need to do some testing/profiling of different solutions.Prefatory

Here's a version without sqrt, though I'm not sure whether it is faster than a version which has only one sqrt (it may depend on the distribution of values).

Here's the math (how to remove both sqrts):

ad = a2-a1
bd = b2-b1

a1+sqrt(b1) < a2+sqrt(b2)              // subtract a1
   sqrt(b1) < ad+sqrt(b2)              // square it
        b1  < ad^2+2*ad*sqrt(b2)+b2    // arrange
   ad^2+bd  > -2*ad*sqrt(b2)

Here, the right side is always negative. If the left side is positive, then we have to return true.

If the left side is negative, then we can square the inequality:

ad^4+bd^2+2*bd*ad^2 < 4*ad^2*b2

The key thing to notice here is that if a2>=a1+1000, then is_smaller always returns true (because the maximum value of sqrt(b1) is 1000). If a2<=a1+1000, then ad is a small number, so ad^4 will always fit into 64 bit (there is no need for 128-bit arithmetic). Here's the code:

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int ad = a2 - a1;
    if (ad>1000) {
        return true;
    }

    int bd = b2 - b1;
    if (ad*ad+bd>0) {
        return true;
    }

    int ad2 = ad*ad;

    return (long long int)ad2*ad2 + (long long int)bd*bd + 2ll*bd*ad2 < 4ll*ad2*b2;
}

EDIT: As Peter Cordes noticed, the first if is not necessary, as the second if handles it, so the code becomes smaller and faster:

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int ad = a2 - a1;
    int bd = b2 - b1;
    if ((long long int)ad*ad+bd>0) {
        return true;
    }

    int ad2 = ad*ad;
    return (long long int)ad2*ad2 + (long long int)bd*bd + 2ll*bd*ad2 < 4ll*ad2*b2;
}
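A quick spot-check of this version against hand-computed exact results (the function is repeated here so the snippet compiles on its own):

```cpp
#include <cassert>

// Copy of the final version above, for a self-contained check.
bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int ad = a2 - a1;
    int bd = b2 - b1;
    if ((long long int)ad*ad+bd>0) {
        return true;
    }

    int ad2 = ad*ad;
    return (long long int)ad2*ad2 + (long long int)bd*bd + 2ll*bd*ad2 < 4ll*ad2*b2;
}
```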
Trauma answered 8/5, 2019 at 11:34 Comment(11)
It might be better to omit the ad>1000 branch; I think the ad*ad+bd>0 branch covers everything. One branch that's true most of the time is better than 2 branches that are each taken some of the time, for branch prediction. Unless the ad>1000 check catches most of the inputs, it's going to be worth doing 1 extra sub and imul. (And probably a movzx to 64-bit)Mercurochrome
@PeterCordes: nice observation (as usual :) ), thanks!Trauma
Oh, were you doing that so ad*ad is known to not overflow an int32_t? After inlining, an x86-64 compiler can optimize away zero-extension to 64-bit (because writing a 32-bit register does that implicitly), so we could promote the unsigned inputs to uint64_t and then subtract to get int64_t. (32-bit subtract would require a movsxd sign-extension of a possibly-negative result, so avoid that.)Mercurochrome
@PeterCordes: almost :) I added it, so ad^4 surely doesn't overflow 64-bit. But yes, as you say, doing the ad*ad multiplication as 64-bit easily handles the case.Trauma
Your fast version could avoid some sign-extension instructions by zero-extending a2-a1 to 64-bit with int64_t ad = a2 - a1. (a2-a1 has type unsigned, so it zero-extends. And that's free on x86-64.) bd needs to zero-extend the inputs to the subtraction, because we don't know if it's positive or negative. Or it needs to sign-extend the 32-bit result, but that's worse after inlining into a caller where the unsigned inputs are already known to be zero-extended in registers. (That's always the case unless they came in as function args or return values and haven't been touched yet.)Mercurochrome
But yes, nice, transforming the fallback to avoid ever needing sqrt is great, even better than just branching to sometimes avoid a sqrt.Mercurochrome
I was wondering if we can do anything with the fact(?) that sqrt( b2-b1 ) <= sqrt(b2) - sqrt(b1) (for b2 >= b1, and they're both integers >= 1.0, or they're zero. sqrt makes non-zero numbers closer to 1.0. So we could say that sqrt makes numbers closer to 0). wolframalpha.com/input/?i=sqrt(x-y)+-+(sqrt(x)+-+sqrt(y)) shows (I think) that the inequality is true: it finds global minima of 0 for x=y or y=0. So IDK if we can do an even better pre-check in terms of that which excludes more values, or if that would just give us a 2nd pre-check for some definitely-false inputs.Mercurochrome
Probably you made a mistake somewhere. In my tests, for a1 = 1000000; b1 = 100000; a2 = 999100; b2 = 700000; the results differ from the original return a1+sqrt(b1) < a2+sqrt(b2);. See Godbolt link.Bronder
My example above is wrong. I overlooked that we know that a1<a2.Bronder
@wim: I've verified that my solution is OK. But, I've found that there is some problem with Brendan's second version (and maybe with the first as well) :)Trauma
@Geza: You are right. I focused too much on a1+sqrt(b1) < a2+sqrt(b2), without paying attention to a1 < a2.Bronder

I'm tired and probably made a mistake; but I'm sure if I did, someone will point it out..

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    int a_diff = (int)a1 - (int)a2;   // May be negative

    if(a_diff < 0) {
        if(b1 < b2) {
            return true;
        }
        double temp = a_diff+sqrt(b1);
        if(temp < 0) {
            return true;
        }
        return temp*temp < b2;
    } else {
        if(b1 >= b2) {
            return false;
        }
    }
//  return a_diff+sqrt(b1) < sqrt(b2);

    double temp = a_diff+sqrt(b1);
    return temp*temp < b2;
}

If you know a1 < a2 then it could become:

bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    long long a_diff = a2-a1;    // Will be positive

    if(b1 > b2) {
        return false;
    }
    if(b1 >= a_diff*a_diff) {
        return false;
    }
    double temp = a_diff+sqrt(b2);
    return b1 < temp*temp;
}
Inconsonant answered 8/5, 2019 at 10:13 Comment(8)
We know a1 < a2, so no need to test for a_diff < 0. But maybe it is worth testing it against > 1000 (reversed).Overdress
You need a signed difference for int a_diff, and if you want to square it to check the condition without doing an actual sqrt you want int64_t.Mercurochrome
Heh - I did make a mistake (if a_diff+sqrt(b1); is negative then you can't square it without borking the sign) - fixed. Also added the "if you know a1 < a2".Inconsonant
Just define a_diff = a2 - a1 for the second case. The first check should be b1 <= b2. The first square root you can delay if you just check if b1/a_diff < a_diffBlister
I'm also tired. Half-finished attempt: godbolt.org/z/U758t6. I think we can use sqrt( abs(b1-b2) ) <= sqrt(b1) - sqrt(b2) (which is true for numbers >= 1.0, and for 0). (ping @kvantour). But maybe better to just check b1 < b2 first / instead, and abs(a1-a2) <= 1000Mercurochrome
@kvantour: You're right (and doing a_diff = a2 - a1 cleaned it up a lot). Didn't quite follow the b1/a_diff < a_diff so I did it differently and probably got that wrong. Will check back in about 10 hours.Inconsonant
@Inconsonant Have a look at this comment. Also you could first do the test on b and then compute a_diffBlister
In the first algorithm, the if (b1 < b2) should be if (b1 <= b2)Regeniaregensburg

There is also Newton's method for calculating integer square roots, as described here. Another approach would be to not calculate the square root at all, but to search for floor(sqrt(n)) via binary search ... there are "only" 1000 perfect squares less than 10^6. This probably has bad performance, but it would be an interesting approach. I haven't measured any of these, but here are examples:

#include <iostream>
#include <array>
#include <algorithm>        // std::lower_bound
#include <cassert>
#include <cmath>            // sqrt


bool is_smaller_sqrt(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    return a1 + sqrt(b1) < a2 + sqrt(b2);
}

static std::array<int, 1001> squares;

template <typename C>
void squares_init(C& c)
{
    for (int i = 0; i < c.size(); ++i)
        c[i] = i*i;
}

inline bool greater(const int& l, const int& r)
{
    return r < l;
}

inline bool is_smaller_bsearch(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    // return a1 + sqrt(b1) < a2 + sqrt(b2)

    // find floor(sqrt(b1)) - binary search within 1000 elems
    auto it_b1 = std::lower_bound(crbegin(squares), crend(squares), b1, greater).base();

    // find floor(sqrt(b2)) - binary search within 1000 elems
    auto it_b2 = std::lower_bound(crbegin(squares), crend(squares), b2, greater).base();

    return (a2 - a1) > (it_b1 - it_b2);
}

unsigned int sqrt32(unsigned long n)
{
    unsigned int c = 0x8000;
    unsigned int g = 0x8000;

    for (;;) {
        if (g*g > n) {
            g ^= c;
        }

        c >>= 1;

        if (c == 0) {
            return g;
        }

        g |= c;
    }
}

bool is_smaller_sqrt32(unsigned a1, unsigned b1, unsigned a2, unsigned b2)
{
    return a1 + sqrt32(b1) < a2 + sqrt32(b2);
}

int main()
{
    squares_init(squares);

    // now can use is_smaller
    assert(is_smaller_sqrt(1, 4, 3, 1) == is_smaller_sqrt32(1, 4, 3, 1));
    assert(is_smaller_sqrt(1, 2, 3, 3) == is_smaller_sqrt32(1, 2, 3, 3));
    assert(is_smaller_sqrt(1000, 4, 1001, 1) == is_smaller_sqrt32(1000, 4, 1001, 1));
    assert(is_smaller_sqrt(1, 300, 3, 200) == is_smaller_sqrt32(1, 300, 3, 200));
}
Anglofrench answered 8/5, 2019 at 12:22 Comment(2)
Your integer sqrt32 runs a fixed 16 iterations of that loop, I think. You could maybe start it from a smaller position based on a bit-scan to find the highest set bit in n and divide by 2. Or just from a lower fixed start point since the known max for n is 1 mil, not ~4 billion. So that's about 12 / 2 = 6 iterations we could save. But this will probably still be slower than converting to single-precision float for sqrtss and back. Maybe if you did two square roots in parallel in the integer loop, so the c updates and loop overhead would be amortized, and there'd be 2 dep chainsMercurochrome
The binary search inverting the table is an interesting idea, but still probably terrible on modern x86-64 where hardware sqrt is not very slow, but branch mispredicts are very costly relative to designs with shorter / simpler pipelines. Maybe some of this answer will be useful to someone with the same problem but on a microcontroller.Mercurochrome

I'm not sure if algebraic manipulations, in combination with integer arithmetic, necessarily lead to the fastest solution. You'll need many scalar multiplies in that case (which aren't very fast), and/or branch prediction may fail, which may degrade performance. Obviously you'll have to benchmark to see which solution is fastest in your particular case.

One method to make the sqrt a bit faster is to add the -fno-math-errno option to gcc or clang. In that case the compiler doesn't have to check for negative inputs. With icc this is the default setting.

More performance improvement is possible by using the vectorized sqrt instruction sqrtpd, instead of the scalar sqrt instruction sqrtsd. Peter Cordes has shown that clang is able to auto vectorize this code, such that it generates this sqrtpd.

However, the success of auto-vectorization depends quite heavily on the right compiler settings and the compiler that is used (clang, gcc, icc etc.). With -march=nehalem, or older, clang doesn't vectorize.

More reliable vectorization results are possible with the following intrinsics code, see below. For portability we only assume SSE2 support, which is the x86-64 baseline.

/* gcc -m64 -O3 -fno-math-errno smaller.c                      */
/* Adding e.g. -march=nehalem or -march=skylake might further  */
/* improve the generated code                                  */
/* Note that SSE2 is guaranteed to exist with x86-64           */
#include<immintrin.h>
#include<math.h>
#include<stdio.h>
#include<stdint.h>

int is_smaller_v5(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    uint64_t a64    =  (((uint64_t)a2)<<32) | ((uint64_t)a1); /* Avoid too much port 5 pressure by combining 2 32 bit integers in one 64 bit integer */
    uint64_t b64    =  (((uint64_t)b2)<<32) | ((uint64_t)b1); 
    __m128i ax      = _mm_cvtsi64_si128(a64);         /* Move integer from gpr to xmm register                  */
    __m128i bx      = _mm_cvtsi64_si128(b64);         
    __m128d a       = _mm_cvtepi32_pd(ax);            /* Convert 2 integers to double                           */
    __m128d b       = _mm_cvtepi32_pd(bx);            /* We don't need _mm_cvtepu32_pd since a,b < 1e6          */
    __m128d sqrt_b  = _mm_sqrt_pd(b);                 /* Vectorized sqrt: compute 2 sqrt-s with 1 instruction   */
    __m128d sum     = _mm_add_pd(a, sqrt_b);
    __m128d sum_lo  = sum;                            /* a1 + sqrt(b1) in the lower 64 bits                     */
    __m128d sum_hi  =  _mm_unpackhi_pd(sum, sum);     /* a2 + sqrt(b2) in the lower 64 bits                     */
    return _mm_comilt_sd(sum_lo, sum_hi);
}


int is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
    return a1+sqrt(b1) < a2+sqrt(b2);
}


int main(){
    unsigned a1; unsigned b1; unsigned a2; unsigned b2;
    a1 = 11; b1 = 10; a2 = 10; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 11; a2 = 10; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 10; a2 = 11; b2 = 10;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));
    a1 = 10; b1 = 10; a2 = 10; b2 = 11;
    printf("smaller?  %i  %i \n",is_smaller(a1,b1,a2,b2), is_smaller_v5(a1,b1,a2,b2));

    return 0;
}


See this Godbolt link for the generated assembly.

In a simple throughput test on Intel Skylake, with compiler options gcc -m64 -O3 -fno-math-errno -march=nehalem, I found a throughput of is_smaller_v5() which was 2.6 times better than the original is_smaller(): 6.8 cpu cycles vs 18 cpu cycles, with loop overhead included. However, in a (too?) simple latency test, where the inputs a1, a2, b1, b2 depended on the result of the previous is_smaller(_v5), I didn't see any improvement. (39.7 cycles vs 39 cycles).

Bronder answered 8/5, 2019 at 20:2 Comment(5)
clang already auto-vectorizes like that :P godbolt.org/z/GvNe2B see the double and signed int version. But only for double, not float. For throughput you should definitely use float with this strategy, because packed conversion is only 1 uop, and because sqrtps has better throughput. The OP's numbers are all 1million or less, so can be exactly represented by float, and so can their square roots. And BTW, looks like you forgot to set -mtune=haswell, so your gcc picked a store/reload strategy for _mm_set_epi32 instead of ALU movd.Mercurochrome
@PeterCordes: Single precision isn't accurate enough, see my comment here. We only know that the target is x86-64. Somehow clang doesn't vectorize in that case. Even with -march=nehalem. Clang produces better asm with 4 movds indeed.Bronder
@PeterCordes: Note that in a throughput test the auto vectorized function may easily bottleneck on port 5. Clang generates 9 p5 micro-ops (Skylake), if I counted right.Bronder
I didn't look that closely. Not surprised it wasn't optimal. :P Interesting, I hadn't realized there was an intrinsic for (u)comisd. Makes sense that there is, of course, but I'd never noticed it before. You should be able to save a movaps if you use movhlps into a "dead" variable instead of unpckhpd (if you compile without AVX). But that requires a bunch of casting because intrinsics make it inconvenient to help the compiler optimize its shuffles that way.Mercurochrome
@PeterCordes Actually, I have never used a (u)comisd intrinsic before, but here it seem useful.Bronder

Possibly not better than other answers, but uses a different idea (and a mass of pre-analysis).

// Compute approximate integer square root of input in the range [0,10^6].
// Uses a piecewise linear approximation to sqrt() with bounded error in each piece:
//   0 <= x <= 784 : x/28
//   784 < x <= 7056 : 21 + x/112
//   7056 < x <= 28224 : 56 + x/252
//   28224 < x <= 78400 : 105 + x/448
//   78400 < x <= 176400 : 168 + x/700
//   176400 < x <= 345744 : 245 + x/1008
//   345744 < x <= 614656 : 336 + x/1372
//   614656 < x <= 1000000 : (784000+x)/1784
// It is the case that sqrt(x) - 7.9992711366390365897... <= pseudosqrt(x) <= sqrt(x).
unsigned pseudosqrt(unsigned x) {
    return 
        x <= 78400 ? 
            x <= 7056 ?
                x <= 784 ? x/28 : 21 + x/112
              : x <= 28224 ? 56 + x/252 : 105 + x/448
          : x <= 345744 ?
                x <= 176400 ? 168 + x/700 : 245 + x/1008
              : x <= 614656 ? 336 + x/1372 : (x+784000)/1784 ;
}

// known pre-conditions: a1 < a2, 
//                  0 <= b1 <= 1000000
//                  0 <= b2 <= 1000000
bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
// Try three refinements:
// 1: a1 + sqrt(b1) <= a1 + 1000, 
//    so is a1 + 1000 < a2 ?  
//    Convert to a2 - a1 > 1000 .
// 2: a1 + sqrt(b1) <= a1 + pseudosqrt(b1) + 8 and
//    a2 + pseudosqrt(b2) <= a2 + sqrt(b2), 
//    so is  a1 + pseudosqrt(b1) + 8 < a2 + pseudosqrt(b2) ?
//    Convert to a2 - a1 > pseudosqrt(b1) - pseudosqrt(b2) + 8 .
// 3: Actually do the work.
//    Convert to a2 - a1 > sqrt(b1) - sqrt(b2)
// Use short circuit evaluation to stop when resolved.
    unsigned ad = a2 - a1;
    return (ad > 1000)
           || (ad > pseudosqrt(b1) - pseudosqrt(b2) + 8)
           || ((int) ad > (int)(sqrt(b1) - sqrt(b2)));
}

(I don't have a compiler handy, so this probably contains a typo or two.)
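The claimed error bound can be checked exhaustively in a few milliseconds (a standalone sketch that repeats pseudosqrt, using the 784 breakpoint from the comment, so it compiles on its own):

```cpp
#include <cassert>
#include <cmath>

// Copy of pseudosqrt above (first breakpoint at 784 = 28^2, matching the comment).
unsigned pseudosqrt(unsigned x) {
    return
        x <= 78400 ?
            x <= 7056 ?
                x <= 784 ? x/28 : 21 + x/112
              : x <= 28224 ? 56 + x/252 : 105 + x/448
          : x <= 345744 ?
                x <= 176400 ? 168 + x/700 : 245 + x/1008
              : x <= 614656 ? 336 + x/1372 : (x+784000)/1784 ;
}

// Returns true iff sqrt(x) - 8 < pseudosqrt(x) <= sqrt(x) for all x in [0, 10^6],
// which is what the comment's stated bound of ~7.9993 implies.
bool bound_holds() {
    for (unsigned x = 0; x <= 1000000; ++x) {
        double s = std::sqrt((double)x);
        double p = (double)pseudosqrt(x);
        if (!(p <= s && s - p < 8.0)) return false;
    }
    return true;
}
```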

Clerissa answered 8/5, 2019 at 19:8 Comment(14)
It does compile with #include <math.h> (godbolt.org/z/VH4I3g), but I don't think it will do very well. @Geza's answer has a much faster integer early-out, and does less integer math than this (with less branching, too) to fully avoid sqrt by squaring while only requiring 64-bit integers.Mercurochrome
@PeterCordes : I don't have the means to do a timing comparison. Geza's early out handles 99.946% of uniformly distributed inputs and follows two long subtracts and a long long multiply, add, and compare. My early out is an unsigned subtract and compare. I don't see the basis of "much faster integer early-out". My early out handles 99.8% of uniformly distributed inputs. The pseudosqrt() case raises the handled set of inputs to 99.9725% costing additional 7 unsigned compares and one each of unsigned add and subtract. (continued)Clerissa
@PeterCordes : Unless long long operations are as fast as unsigned operations now, your "much faster" claim would be surprising. This seems to not be the case. (#48780119)Clerissa
compare-and-branch is expensive unless branch prediction works perfectly. much more expensive than a long long multiply (3c latency, 1 cycle throughput on modern x86-64, like AMD Zen or Intel since Nehalem). Only division costs more for wider types on modern x86-64, other operations are not data-dependent or type-width dependent. A few older x86-64 CPUs like Bulldozer-family or Silvermont have slower 64-bit multiply. agner.org/optimize. (We're of course talking about scalar; auto-vectorizing with SIMD makes narrow types valuable because you can do more per vector)Mercurochrome
@PeterCordes : Then Geza's early out is 4 arithmetic operations and a compare and mine is 1 arithmetic operation and a compare. As far as I can tell, 1 is still much less than 4.Clerissa
That's a fair point, and I hadn't looked at the math for how much of the problem space each one covers. If 99.8% happens in practice, that would probably justify the simpler early-out check. But we don't know if the OP's inputs will be uniformly distributed or not, though. Geza's first version does have the ad > 1000 early out, but I suggested removing it for better branch prediction.Mercurochrome
@PeterCordes : I've been using Mathematica to test correctness and determine the fractions of the input space handled by various collections of conditions. What I can't do is any meaningful timing.Clerissa
Superscalar out-of-order execution CPUs don't always bottleneck on instruction throughput; the extra instructions might cost essentially nothing if they can execute in the shadow of a latency dependency chain. (i.e. if the surrounding code doesn't max out instructions-per-cycle, there is room for a burst of extra instructions to execute without slowing anything else down.) Counting instructions doesn't always predict performance. Fewer uops executed is almost always better, but smaller static code size is nice too.Mercurochrome
@PeterCordes : Then my code may still be in good shape. I expect two long long constants in Geza's code and 23 unsigneds in mine (since I expect any vaguely optimizing compiler to inline pseudosqrt()).Clerissa
So anyway, it would be an interesting experiment to test Geza's final version with/without adding in an ad > 1000 early out before (or more likely instead of) the current early out. But nobody can do that without knowing how the OP's actual inputs are distributed. Numbers in computer programs are rarely uniform, and this is the critical factor in which early out will filter best and how well branch prediction will do on a real CPU. (Past branch history is part of the input to predicting a branch, so if different numbers are a reasult of different branching earlier, it might predict...)Mercurochrome
Huh, your comment about number of constants makes no sense. Geza's 2ll * ... and 4ll * ... don't need to be stored as a 64-bit 0x000000000002. It's a shift by 1, and multiply by 4 is a shift by 2. And x86-64 can do a shift-and-add in one instruction with lea rax, [rdi + 2*rcx] for example, so the *2 and + combine into one cheap instruction. Code-size is mostly instructions, although yours will need a lot of 32-bit immediate constants to compare against. The Godbolt link in my first comment shows that it's a lot of total code size, even if most of it rarely executes.Mercurochrome
@PeterCordes: I'm not entirely convinced that Geza's math is entirely right however. Probably it is easy to repair. I didn't look into the details.Bronder
@wim: yeah, whatever the OP does, a fairly exhaustive unit test is in order! (1M ^4 is too expensive, though, so some pruning of the search space to look at some large values, and some values that make the two sides of the inequality nearly equal, is needed.)Mercurochrome
@EricTowers: It would be interesting to edit your answer to include your early-out analysis based on a uniform distribution. I think that's a valuable piece of analysis (the results, and the idea / method of even doing that in the first place instead of just guessing how good different early-outs will be). But still with the huge caveat that most code doesn't deal with uniformly-distributed numbers; small numbers are very common most of the time.Mercurochrome
