Related: unsigned saturation is much easier, and efficiently possible in pure ISO C: How to do unsigned saturating addition in C?
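For reference, here's a minimal sketch of that unsigned case (my illustration, not code from the linked question): unsigned wraparound is well-defined in ISO C, so detecting it is trivial and compilers turn this into branchless code. The signed case below is the hard one.

```c
#include <stdint.h>

// Unsigned saturating add: if the wrapping sum is smaller than an input,
// the add wrapped, so force the result to all-ones (UINT64_MAX).
uint64_t unsigned_sat_add64(uint64_t a, uint64_t b) {
    uint64_t sum = a + b;
    return sum | -(uint64_t)(sum < a);   // -(0 or 1) = 0 or an all-ones mask
}
```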
Compilers are terrible at all of the pure C options proposed so far. They don't see that they can use the signed-overflow flag result from an `add` instruction to detect that saturation to INT64_MIN/MAX is necessary. AFAIK there's no pure C pattern that compilers recognize as reading the OF flag result of an `add`.
Inline asm is not a bad idea here, but we can avoid that with GCC's builtins that expose UB-safe wrapping signed addition with a boolean overflow result. https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html
(If you were going to use GNU C inline asm, that would limit you just as much as these GNU C builtins. And these builtins aren't arch-specific. They do require gcc5 or newer, but gcc4.9 and older are basically obsolete. https://gcc.gnu.org/wiki/DontUseInlineAsm - it defeats constant propagation and is hard to maintain.)
This version uses the fact that `INT64_MIN = INT64_MAX + 1ULL` (for 2's complement) to select INT64_MIN/MAX based on the sign of `b`. Signed-overflow UB is avoided by using `uint64_t` for that add, and GNU C defines the behaviour of converting an unsigned integer to a signed type that can't represent its value (the bit-pattern is used unchanged). Current gcc/clang benefit from this hand-holding because they don't figure out this trick from a ternary like `(b<0) ? INT64_MIN : INT64_MAX`. (See below for the alternate version using that.) I haven't checked the asm on 32-bit architectures.
GCC only supports 2's complement integer types, so a function using `__builtin_add_overflow` doesn't have to care about portability to C implementations that use 1's complement (where the same identity holds) or sign/magnitude (where it doesn't), even if you made a version for `long` or `int` instead of `int64_t`.
```c
#include <stdint.h>
#ifndef __cplusplus
#include <stdbool.h>
#endif

// static inline
int64_t signed_sat_add64_gnuc_v2(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    if (overflow) {
        // overflow is only possible in one direction depending on the sign bit
        return ((uint64_t)b >> 63) + INT64_MAX;
        // INT64_MIN = INT64_MAX + 1  wraparound, done with unsigned
    }
    return res;
}
```
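A quick sanity check of the function above (a hypothetical test harness, not part of the original answer; compile it together with the code above):

```c
#include <stdio.h>

int main(void) {
    // saturates high, saturates low, and a normal non-overflowing case
    printf("%lld\n", (long long)signed_sat_add64_gnuc_v2(INT64_MAX, 1));   //  9223372036854775807
    printf("%lld\n", (long long)signed_sat_add64_gnuc_v2(INT64_MIN, -1));  // -9223372036854775808
    printf("%lld\n", (long long)signed_sat_add64_gnuc_v2(40, 2));          //  42
    return 0;
}
```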
Another option is `(b>>63) ^ INT64_MAX`, which might be useful if manually vectorizing where SIMD XOR can run on more ports than SIMD ADD, like on Intel before Skylake. (But x86 doesn't have SIMD 64-bit arithmetic right shift, only logical, so this would only help for an `int32_t` version, and you'd need to efficiently detect overflow in the first place. Or you might use a variable blend on the sign bit, like `blendvpd`.) See Add saturate 32-bit signed ints intrinsics? with x86 SIMD intrinsics (SSE2/SSE4).
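As an illustration of that idea, here's a rough SSE2 sketch for the `int32_t` case (my own sketch, not taken from the linked Q&A): it detects signed overflow with the classic same-sign-inputs / different-sign-result test, and builds the saturation value with the logical-shift + MAX trick.

```c
#include <emmintrin.h>   // SSE2
#include <stdint.h>

// 4x int32_t saturating add: the sum wraps, then lanes that overflowed are
// replaced with INT32_MAX or INT32_MIN depending on the sign of b.
static __m128i sat_add_epi32_sse2(__m128i a, __m128i b) {
    __m128i sum = _mm_add_epi32(a, b);                  // wrapping add
    // overflow iff a and b have the same sign but sum's sign differs from a's
    __m128i ov  = _mm_andnot_si128(_mm_xor_si128(a, b), _mm_xor_si128(a, sum));
    ov = _mm_srai_epi32(ov, 31);                        // broadcast the overflow bit to a lane mask
    // saturation value: (unsigned)b >> 31 is 0 or 1; adding INT32_MAX gives MAX or MIN
    __m128i sat = _mm_add_epi32(_mm_srli_epi32(b, 31), _mm_set1_epi32(INT32_MAX));
    // blend: sat where ov is all-ones, sum elsewhere
    return _mm_or_si128(_mm_and_si128(ov, sat), _mm_andnot_si128(ov, sum));
}
```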
On Godbolt with gcc9 and clang8 (along with the other implementations from other answers):
```asm
# gcc9.1 -O3   (clang chooses branchless with cmov)
signed_sat_add64_gnuc_v2:
        add     rdi, rsi                    # the actual add
        jo      .L3                         # jump on signed overflow
        mov     rax, rdi                    # retval = the non-overflowing add result
        ret
.L3:
        movabs  rax, 9223372036854775807    # INT64_MAX
        shr     rsi, 63                     # b is still available after the ADD
        add     rax, rsi
        ret
```
When inlining into a loop, the `mov imm64` can be hoisted. If `b` is needed afterwards then we might need an extra `mov`, otherwise `shr`/`add` can destroy `b`, leaving the `INT64_MAX` constant in a register undamaged. Or if the compiler wants to use `cmov` (like clang does), it has to `mov`/`shr` because it has to get the saturation constant ready before the add, preserving both operands.
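For example (a hypothetical caller, assuming the function above is visible and `static inline`), a reduction loop where the compiler can hoist the `movabs`:

```c
#include <stddef.h>

// Sum an array with saturation; after inlining, the INT64_MAX constant can
// stay in a register across iterations instead of being rematerialized.
int64_t sat_sum(const int64_t *arr, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total = signed_sat_add64_gnuc_v2(total, arr[i]);
    return total;
}
```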
Notice that the critical path for the non-overflowing case only includes an `add` and a not-taken `jo`. These can't macro-fuse into a single uop even on Sandybridge-family, but the `jo` only costs throughput, not latency, thanks to branch prediction + speculative execution. (When inlining, the `mov` will go away.)
If saturation is actually not rare and branch prediction is a problem, compile with profile-guided optimization and gcc will hopefully do if-conversion and use a `cmovno` instead of `jo`, like clang does. This puts the MIN/MAX selection on the critical path, as well as the CMOV itself. The MIN/MAX selection can run in parallel with the `add`.
You could use `a<0` instead. I used `b` because I think most people would write `x = sadd(x, 123)` instead of `x = sadd(123, x)`, and having a compile-time constant allows the `b<0` to optimize away. For maximal optimization opportunity, you could use `if (__builtin_constant_p(a))` to ask the compiler whether `a` was a compile-time constant. That works for gcc, but clang evaluates the const-ness way too early, before inlining, so it's useless except in macros with clang. (Related: ICC19 doesn't do constant propagation through `__builtin_saddll_overflow`: it puts both inputs in registers and still does the add. GCC and Clang just return a constant.)
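A sketch of that (hypothetical; the original answer doesn't include it), which moves whichever operand is a compile-time constant into `b` so the `(uint64_t)b >> 63` can fold away with gcc. It assumes the same includes as the code above.

```c
static inline int64_t signed_sat_add64_ct(int64_t a, int64_t b) {
    // If only `a` is a compile-time constant, swap so the constant ends up in `b`.
    if (__builtin_constant_p(a) && !__builtin_constant_p(b)) {
        int64_t tmp = a;
        a = b;
        b = tmp;
    }
    long long res;
    if (__builtin_saddll_overflow(a, b, &res))
        return ((uint64_t)b >> 63) + INT64_MAX;   // same saturation trick as above
    return res;
}
```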
This optimization is especially valuable inside a loop with the MIN/MAX selection hoisted, leaving only `add` + `cmovo`. (Or `add` + `jo` to a `mov`.)
`cmov` is a 2-uop instruction on Intel P6-family and SnB-family before Broadwell because it has 3 inputs. On other x86 CPUs (Broadwell / Skylake, and AMD), it's a single-uop instruction. On most such CPUs it has 1-cycle latency. It's a simple ALU select operation; the only complication is 3 inputs (2 regs + FLAGS). But on KNL it's still 2-cycle latency.
Unfortunately gcc for AArch64 fails to use `adds` to set flags and check the V (overflow) flag result, so it spends several instructions deciding whether to branch.

Clang does a great job, and AArch64's immediate encodings can represent `INT64_MAX` as an operand to `eor` or `add`.
```asm
// clang8.0 -O3 -target aarch64
signed_sat_add64_gnuc:
        orr     x9, xzr, #0x7fffffffffffffff    // mov constant = OR with zero reg
        adds    x8, x0, x1                      // add and set flags
        add     x9, x9, x1, lsr #63             // sat = (b shr 63) + MAX
        csel    x0, x9, x8, vs                  // conditional-select, condition = VS = oVerflow flag Set
        ret
```
## Optimizing `MIN` vs. `MAX` selection
As noted above, `return (b<0) ? INT64_MIN : INT64_MAX;` doesn't compile optimally with most versions of gcc/clang; they generate both constants in registers and use a cmov to select, or something similar on other ISAs.
We can assume 2's complement because GCC only supports 2's complement integer types, and because the ISO C optional `int64_t` type is guaranteed to be 2's complement if it exists. (Signed overflow of `int64_t` is still UB; this allows it to be a simple `typedef` of `long` or `long long`.)
(On a sign/magnitude C implementation that supported some equivalent of `__builtin_add_overflow`, a version of this function for `long long` or `int` couldn't use the SHR / ADD trick. For extreme portability you'd probably just use the simple ternary, or for sign/magnitude specifically you could `return (b&0x800...) | 0x7FFF...` to OR the sign bit of `b` into a max-magnitude number.)
For two's complement, the bit-patterns for MIN and MAX are `0x8000...` (just the high bit set) and `0x7FFF...` (all other bits set). They have a couple of interesting properties: `MIN = MAX + 1` (if computed with unsigned math on the bit-pattern), and `MIN = ~MAX`: their bit-patterns are bitwise inverses, aka one's complement of each other.
The `MIN = ~MAX` property follows from `~x = -x - 1` (a re-arrangement of the standard `-x = ~x + 1` 2's complement identity) and the fact that `MIN = -MAX - 1`. The `+1` property is unrelated, and follows from simple rollover from most-positive to most-negative; it applies to the one's complement encoding of signed integers as well. (But not to sign/magnitude: there, incrementing the MAX bit-pattern gives `-0`, the sign bit set with zero magnitude, not the most-negative value.)
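Both identities can be checked at compile time (a trivial sketch, assuming a 2's complement `int64_t` as discussed above):

```c
#include <stdint.h>

_Static_assert((uint64_t)INT64_MIN == (uint64_t)INT64_MAX + 1,
               "MIN = MAX + 1 with unsigned wraparound");
_Static_assert(INT64_MIN == ~INT64_MAX,
               "MIN = ~MAX: bitwise inverses");
```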
The above function uses the `MIN = MAX + 1` trick. The `MIN = ~MAX` trick is also usable, by broadcasting the sign bit to all positions with an arithmetic right shift (creating `0` or `0xFF...`) and XORing with that.
GNU C guarantees that signed right shifts are arithmetic (sign-extension), so `(b>>63) ^ INT64_MAX` is equivalent to `(b<0) ? INT64_MIN : INT64_MAX` in GNU C.
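As a complete function that would look like this (a sketch; the original answer doesn't spell out this variant as a full function, and it relies on GNU C's arithmetic-shift guarantee plus the includes from the first listing):

```c
int64_t signed_sat_add64_gnuc_v3(int64_t a, int64_t b) {
    long long res;
    if (__builtin_saddll_overflow(a, b, &res))
        return (b >> 63) ^ INT64_MAX;   // all-zeros or all-ones mask, XOR with MAX -> MAX or MIN
    return res;
}
```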
ISO C leaves signed right shifts implementation-defined, but we could use a ternary of `b<0 ? ~0ULL : 0ULL`. Compilers will optimize the following to `sar` / `xor`, or equivalent instruction(s), but it has no implementation-defined behaviour. AArch64 can use a shifted input operand for `eor` just as well as it can for `add`.
```c
    // an earlier version of this answer used this
    int64_t mask = (b<0) ? ~0ULL : 0;   // compiles to sar with good compilers, but is not implementation-defined
    return mask ^ INT64_MAX;
```
Fun fact: AArch64 has a `csinv` instruction: conditional-select inverse. And it can put `INT64_MIN` into a register with a single 32-bit `mov` instruction, thanks to its powerful immediate encodings for simple bit-patterns. AArch64 GCC was already using `csinv` for the `MIN = ~MAX` trick for the original `return (b<0) ? INT64_MIN : INT64_MAX;` version.
clang 6.0 and earlier on Godbolt were using `shr`/`add` for the plain `(b<0) ? INT64_MIN : INT64_MAX;` version. It looks more efficient than what clang7/8 do, so that's a regression / missed-optimization bug, I think. (And it's the whole point of this section and why I wrote a 2nd version.)
I chose the `MIN = MAX + 1` version because it could possibly auto-vectorize better: x86 has 64-bit SIMD logical right shifts but only 16- and 32-bit SIMD arithmetic right shifts until AVX512F. Of course, signed-overflow detection with SIMD probably makes it not worth it until AVX512 for 64-bit integers. Well, maybe AVX2. And if it's part of some larger calculation that can otherwise vectorize efficiently, then unpacking to scalar and back sucks.
For scalar it's truly a wash; neither way compiles any better, and `sar`/`shr` perform identically, and so do `add`/`xor`, on all CPUs that Agner Fog has tested (https://agner.org/optimize/).
But `+` can sometimes optimize into other things, so you could imagine gcc folding a later `+` or `-` of a constant into the overflow branch, or possibly using `LEA` for that add instead of `ADD` to copy-and-add. The difference in power between a simpler ALU execution unit for XOR vs. ADD is going to be lost in the noise from the cost of all the power it takes to do out-of-order execution and other stuff; all x86 CPUs have single-cycle scalar ADD and XOR, even for 64-bit integers, even on P4 Prescott/Nocona with its exotic adders.
Also, @chqrlie suggested a compact readable way to write it in C without UB that looks nicer than the super-portable `int mask = ternary` thing.
## The earlier "simpler" version of this function

This doesn't depend on any special property of MIN/MAX, so it may be useful for saturating to other boundaries with other overflow-detection conditions. Or in case a compiler does something better with this version.
```c
int64_t signed_sat_add64_gnuc(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    if (overflow) {
        // overflow is only possible in one direction for a given `b`
        return (b<0) ? INT64_MIN : INT64_MAX;
    }
    return res;
}
```
which compiles as follows:
```asm
# gcc9.1 -O3   (clang chooses branchless)
signed_sat_add64_gnuc:
        add     rdi, rsi                        # the actual add
        jo      .L3                             # jump on signed overflow
        mov     rax, rdi                        # retval = the non-overflowing add result
        ret
.L3:
        movabs  rdx, 9223372036854775807
        test    rsi, rsi                        # one of the addends is still available after the add
        movabs  rax, -9223372036854775808       # missed optimization: lea rdx, [rax+1]
        cmovns  rax, rdx                        # branchless selection of which saturation limit
        ret
```
This is basically what @drwowe's inline asm does, but with a `test` replacing one cmov. (And of course different conditions on the cmov.)
Another downside to this vs. the `_v2` with shr/add is that this needs 2 constants. In a loop, this would tie up an extra register. (Again, unless `b` is a compile-time constant.)
clang uses `cmov` instead of a branch, and does spot the `lea rax, [rcx + 1]` trick to avoid a 2nd 10-byte `mov r64, imm64` instruction. (Or clang6.0 and earlier use the `shr 63`/`add` trick instead of that cmov.)
The first version of this answer put `int64_t sat = (b<0) ? MIN : MAX` outside the `if()`, but gcc missed the optimization of moving that inside the branch so it's not run at all for the non-overflow case. That's even better than running it off the critical path. (And it doesn't matter if the compiler decides to go branchless.)
But when I put it outside the `if` and after the `__builtin_saddll_overflow`, gcc was really dumb and saved the `bool` result in an integer register, then did the test/cmov, then used `test` on the `saddll_overflow` result again to put it back in FLAGS. Reordering the source fixed that.
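For illustration, a hypothetical reconstruction of that problematic source order (simplified, not the exact code from the earlier revision; assumes the same includes as the first listing):

```c
int64_t signed_sat_add64_reordered(int64_t a, int64_t b) {
    long long res;
    bool overflow = __builtin_saddll_overflow(a, b, &res);
    // Computing sat here, between the builtin and the if(), is the ordering
    // that made gcc materialize `overflow` as an integer and re-test it later.
    int64_t sat = (b < 0) ? INT64_MIN : INT64_MAX;
    if (overflow)
        return sat;
    return res;
}
```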