As part of a program that I'm writing, I need to compare two values of the form a + sqrt(b), where a and b are unsigned integers. As this is part of a tight loop, I'd like this comparison to run as fast as possible. (If it matters, I'm running the code on x86-64 machines, and the unsigned integers are no larger than 10^6. Also, I know for a fact that a1 < a2.)
As a stand-alone function, this is what I'm trying to optimize. My numbers are small enough integers that double (or even float) can exactly represent them, but rounding error in sqrt results must not change the outcome.
// known pre-condition: a1 < a2 in case that helps
bool is_smaller(unsigned a1, unsigned b1, unsigned a2, unsigned b2) {
return a1+sqrt(b1) < a2+sqrt(b2); // computed mathematically exactly
}
Test case: is_smaller(900000, 1000000, 900001, 998002) should return true, but as shown in comments by @wim, computing it with sqrtf() would return false. So would using (int)sqrt() to truncate back to integer. Here a1+sqrt(b1) = 901000 and a2+sqrt(b2) = 901000.00050050037512481206, and the nearest float to the latter is exactly 901000, so the two sides compare equal in single precision.
As the sqrt() function is generally quite expensive even on modern x86-64, where it fully inlines to a sqrtsd instruction, I'm trying to avoid calling sqrt() as far as possible.
Removing sqrt by squaring potentially also avoids any danger of rounding errors by making all computation exact.
If instead the function was something like this ...
bool is_smaller(unsigned a1, unsigned b1, unsigned x) {
return a1+sqrt(b1) < x;
}
... then I could just do return x > a1 && static_cast<uint64_t>(x-a1)*(x-a1) > b1; (note that with unsigned types x-a1 >= 0 is always true, so the guard has to be written x > a1).
But now, since there are two sqrt(...) terms, I cannot do the same algebraic manipulation.
I could square the values twice, by using this formula:
a1 + sqrt(b1) = a2 + sqrt(b2)
<==> a1 - a2 = sqrt(b2) - sqrt(b1)
<==> (a1 - a2) * (a1 - a2) = b1 + b2 - 2 * sqrt(b1) * sqrt(b2)
<==> (a1 - a2) * (a1 - a2) = b1 + b2 - 2 * sqrt(b1 * b2)
<==> (a1 - a2) * (a1 - a2) - (b1 + b2) = - 2 * sqrt(b1 * b2)
<==> ((b1 + b2) - (a1 - a2) * (a1 - a2)) / 2 = sqrt(b1 * b2)
<==> ((b1 + b2) - (a1 - a2) * (a1 - a2)) * ((b1 + b2) - (a1 - a2) * (a1 - a2)) / 4 = b1 * b2
Unsigned division by 4 is cheap because it is just a bit shift, but since I square the numbers twice I will need to use 128-bit integers, and I will need to introduce a few >= 0 checks (because I'm comparing an inequality instead of an equality).
It feels like there might be a way to do this faster, by applying better algebra to this problem. Is there a way to do this faster?
If a1+sqrt(b1) < a2 is true then you can skip the calculation of sqrt(b2). – Rollback

…int so it can do SIMD packed int->FP conversion cheaply) godbolt.org/z/GvNe2B (includes versions with float and double; clang vectorizes better with double, but sqrtpd is slower than sqrtps). But yeah, multiplying twice with scalar 128-bit math might be good, especially if we don't care about Intel before Broadwell (where adc was 2 uops). – Mercurochrome

…float or double, not integer. cvtsi2ss isn't free. And it's on a Core 2, which has much slower hardware sqrt than Skylake (6 to 29 cycles per instruction on Core 2, with the fastest numbers only for trivial cases like 0). Skylake has a fixed one-per-3-cycle throughput for single-precision FP sqrt (sqrtps), but still 12 cycle latency. agner.org/optimize But yes, there might be something we can gain from sqrt(x) ~= x * rsqrtps(x) if we exclude the x==0 case. That might be enough precision to round to the nearest integer. – Mercurochrome

(…is_smaller(1, 7, 2, 3)?) – Prefatory

Since a1 < a2, the abs() can be removed too. – Overdress

Since a1 < a2, you can already directly exclude all cases where b1 < b2. – Blister

…if the sqrt latency fits into the critical path, out-of-order exec may be able to mostly hide it. (But sqrt throughput is definitely a concern.) – Mercurochrome

You might not need sqrt here. Set t = a2-a1, which is positive. You know that you want to test a1+sqrt(b1) < a2+sqrt(b2), which is equivalent to testing sqrt(b1) - sqrt(b2) < t. So the quick way is to test sqrt(b1) < t, which is equivalent to testing b1 < t*t; but t*t might be too big, so you could just test with integer division if (b1/t)/t == 0. – Blister

The single-precision results, return a1+sqrtf(b1) < a2+sqrtf(b2);, differ from the double-precision results (return a1+sqrt(b1) < a2+sqrt(b2);) for a1 = 900000; b1 = 1000000; a2 = 900001; b2 = 998002;, which is true in double precision and false in single precision. This suggests that you shouldn't use fast approximate sqrt methods. – Bronder

…roundps to the nearest integer, with nearest or floor. Or convert back to integer with truncation. Maybe I made a mistake when I moved the OP's pseudocode comment in the question into actual C, because I assumed computing in FP would be equivalent to integer sqrt (i.e. floor), in case that's what they meant. – Mercurochrome

a1+sqrt(b1) = 901000 and a2+sqrt(b2) = 901000.00050050037512481206. The nearest float to that is 901000, so they're equal, not less. h-schmidt.net/FloatConverter/IEEE754.html. So IDK whether the OP wants a1 + (int)sqrt(b1) or a1 + lrint(sqrt(b1)), or exact (for which double precision may do the job), or what. – Mercurochrome

With an integer sqrt we don't have, for example, 10 + sqrt(10) < 10 + sqrt(11), because both sides are equal. That would be undesirable, I think, because after some algebraic integer manipulations you may find that actually 10 + sqrt(10) < 10 + sqrt(11). Therefore a double-precision a1+sqrt(b1) < a2+sqrt(b2); seems most reasonable to me. – Bronder

is_smaller(900000, 1000000, 900001, 998002) should return true. – Prefatory

…float, like @Bronder proved. Any comment on expected distribution of inputs? If we assume uniform, then @EricTowers calculated that even the simple a2-a1 < 1000 early-out check rejects 99.8% of all inputs, and thus might be better than geza's check that requires a multiply to reject 99.946%. But most numbers in software aren't uniform. – Mercurochrome