Saturating subtract/add for unsigned bytes
Asked Answered
C

11

88

Imagine I have two unsigned bytes b and x. I need to calculate bsub as b - x and badd as b + x. However, I don't want underflow/overflow occur during these operations. For example (pseudo-code):

b = 3; x = 5;
bsub = b - x; // bsub must be 0, not 254

and

b = 250; x = 10;
badd = b + x; // badd must be 255, not 4

The obvious way to do this includes branching:

bsub = b - min(b, x);
badd = b + min(255 - b, x);

I just wonder if there are any better ways to do this, i.e. by some hacky bit manipulations?

Curhan answered 2/11, 2015 at 15:35 Comment(13)
y ^ ((x ^ y) & -(x < y)) for int types evaluates min(x, y) without branching. This could form part of an eventual solution, based on what you have so far.Internist
y ^ ((x ^ y) & -(x < y)); is min(x,y) (usually) without a branch (where x < y might, depending on the machine still be a branch) but modern architecture has conditional move which is probably not much slower.Bi
Perhaps Clamped Increment Integer? is helpful.Napoleonnapoleonic
@ShafikYaghmour: I think that could be. And if they used a char type, then, from C++14 onwards that must be 2's complement.Internist
I just wonder if there are any better ways to do this, i.e. by some hacky bit manipulations? Unless the code is in a very performance critical part, the hacky optimizations only end up harder to read. I suggest a more specific question like ...are there any more efficient ways....Badge
Is this a C or a C++ question? Please choose one.Hecklau
@Internist indeed, I added an answer since the code generated for the uinit8_t case based on that approach looks reasonable.Napoleonnapoleonic
Just confirming: You'd prefer a wrong answer, just to stay within a byte? 254 + 254 = 255, and 1 - 254 = 0 ?Equiprobable
@AlanCampbell it is called Saturating Arithmetic.Napoleonnapoleonic
Do you need it to be portable? Because if you're looking at a specific architecture, then there's probably a nice single instruction. I know ARM has saturating vector addition and subtraction for bytes. On X86 the _mm_adds_epi8 intrinsic will do a saturating add of 16 bytes in a single instruction.Burge
@porglezomp, yes, and _mm_subs_epi8 for subtraction.Kibosh
like @FUZxxl said, can you clarify why you tagged with both C and C++. On it's face it looks more like a C question and the vast majority of the answers are C focused.Napoleonnapoleonic
Obviously it means that I use C++ compiler, so it is ok to use anything from standard library (e.g. std::min), but pure C solutions are also ok.Curhan
N
90

The article Branchfree Saturating Arithmetic provides strategies for this:

Their addition solution is as follows:

u32b sat_addu32b(u32b x, u32b y)
{
    u32b res = x + y;
    res |= -(res < x);

    return res;
}

modified for uint8_t:

uint8_t  sat_addu8b(uint8_t x, uint8_t y)
{
    uint8_t res = x + y;
    res |= -(res < x);

    return res;
}

and their subtraction solution is:

u32b sat_subu32b(u32b x, u32b y)
{
    u32b res = x - y;
    res &= -(res <= x);

    return res;
}

modified for uint8_t:

uint8_t sat_subu8b(uint8_t x, uint8_t y)
{
    uint8_t res = x - y;
    res &= -(res <= x);

    return res;
}
Napoleonnapoleonic answered 2/11, 2015 at 16:17 Comment(8)
Is this solution portable? I think it assumes that -1 is represented in 2's complement form (all bits set to 1).Antitoxic
@Antitoxic that may be the case but as the comment in the article indicates, that is solved by casting to unsigned before applying unary minus. In practicality it is unlikely you will have to deal with anything else but two's complement.Napoleonnapoleonic
I realized that this may be portable because of using unsigned types. -1 should be the largest unsigned value similar to the overflow arithmetic. However I am not sure if the result of (res < x) is unsigned or requires typecast.Antitoxic
This may be a good C answer, but isn't a very good C++ answer.Lucic
@Yakk I meant to think about this from a C++ perspective but I did not get the chance yet.Napoleonnapoleonic
@Yakk What makes this a "bad" C++ answer? These are basic mathematical operations, and I don't see how it would be interpreted as C only, or bad C++.Contribution
@Contribution A better C++ answer might be template<class T>struct sat{T t;}; with overloaded operators that saturate? Proper use of namespaces. Mostly sugar.Lucic
@Yakk, Ah, ok. I just saw this as a minimal example that the OP could adapt as needed. I wouldn't expect to see that complete of an implementation. Thanks for clarifying.Contribution
A
40

A simple method is to detect overflow and reset the value accordingly as below

bsub = b - x;
if (bsub > b)
{
    bsub = 0;
}

badd = b + x;
if (badd < b)
{
    badd = 255;
}

GCC can optimize the overflow check into a conditional assignment when compiling with -O2.

I measured how much optimization comparing with other solutions. With 1000000000+ operations on my PC, this solution and that of @ShafikYaghmour averaged 4.2 seconds, and that of @chux averaged 4.8 seconds. This solution is more readable as well.

Antitoxic answered 2/11, 2015 at 15:50 Comment(4)
@Masculine It's not optimized away, it's optimized into a conditional assignment depending on the carry flag.Hecklau
Yes user694733 is right. It is optimized into a conditional assignment.Antitoxic
This would not work for all cases, for example badd: b = 155 x =201, than badd = 156, and that is bigger than b. You would need to compare the result to the min() or max() of the two variables, depending on the operationTokyo
@CristianF How do you calculate 155+201 = 156? I think it needs to be 155+201 = 356%256 = 100. I do not think min(), max() is needed in any combination of b, x values.Antitoxic
T
17

To saturate subtract/add for unsigned bytes:

For subtraction:

diff = (a - b)*(a >= b);

Addition:

sum = (a + b) | -(a > (255 - b))

Evolution:

// sum = (a + b)*(a <= (255-b)); this fails
// sum = (a + b) | -(a <= (255 - b)) fails too

Thanks to @R_Kapp

Thanks to @NathanOliver

This exercise shows the value of simply coding:

sum = b + min(255 - b, a);
Tertian answered 2/11, 2015 at 15:45 Comment(11)
For sum perhaps (a + b) | -(a <= (255 - b))?Midday
You could do sum = ((a + b) | (!!((a + b) & ~0xFF) * 0xFF)) & 0xFF, assuming sizeof(int) > sizeof(unsigned char), but this looks so complex that I don't know if you would gain anything with it (other than headache).Masculine
@Masculine Yes and maybe even (a+b+1)*(a <= (255-b)) - 1.Tertian
@NathanOliver Thanks for the oversight - the telling aspect of this is that the sub was easy as the limit was 0. But other limits pose complications and follow user2079303 comment.Tertian
This solution generates much more ASM code than the solution by me with the -O2 flag to gcc.Antitoxic
@Antitoxic OP was not clear on "better" (code space vs. speed performance) nor target platform nor compiler. Speed assessment makes most sense in the context of the un-posted larger problem.Tertian
@chux I understand. Just out of curiosity, I checked on an intel processor on 64-bit ubuntu machine. I measured it and posted results in my solution.Antitoxic
Seems to me that multiplying by bools obscures the intent; probably would be better for future users to be more explicit with conditionals.Belligerency
@Kyle Kanos Detail: in C, the results of relational operators like >= is int, not bool. "better", unfortunately is unclear in OP's post concerning execution efficiency, code size or source code clarity. I assume OP was looking for efficiency - something highly machine/compiler dependent - and so offered code that may perform faster. YMMV.Tertian
@chux: I use C++ more than C nowadays & OP used both tags, hence int vs bool comment. My usage of 'better' is for future users' understanding of the code, not for the undefined improvements wanted by OP.Belligerency
@Kyle Kanos, Yes - I too now the post is dual language tagged. This is one of those cases where C and C++ slightly diverge.Tertian
F
14

If you are using a recent enough version of gcc or clang (maybe also some others) you could use built-ins to detect overflow.

if (__builtin_add_overflow(a,b,&c))
{
  c = UINT_MAX;
}
Favus answered 3/11, 2015 at 6:40 Comment(8)
This is the best answer. Using compiler built-ins instead of bit magic is not only faster, it is also clearer and makes the code more maintainable.Brachypterous
Thank you, @erebos. I'll definitely try this on platforms where it is available.Curhan
I can not get gcc to generate brachless code with this one, which is a bit disappointing. The especially unfortunate thing here is that clang uses different names for these.Napoleonnapoleonic
@Brachypterous And it's completely non-crossplatform, heck most likely doesn't even work on another compiler. Not a good solution for the 21st century.Prather
@Prather It's exactly the other way around: built-ins are not a good solution for the 20th century. Welcome to the future!Brachypterous
@ShafikYaghmour I made an answer using builtin/intrinsic functions without branching https://mcmap.net/q/236994/-saturating-subtract-add-for-unsigned-bytesInamorata
@Brachypterous I don't see how using non-standard, compiler-specific things can be a good thing. If the built-ins were standardized I'd agree with you, but I highly doubt they are.Prather
10 years later and this is the only way to get a Clang to actually use Neon saturation intrinsics...Roderick
N
3

For addition:

unsigned temp = a+b;  // temp>>8 will be 1 if overflow else 0
unsigned char c = temp | -(temp >> 8);

For subtraction:

unsigned temp = a-b;  // temp>>8 will be 0xFF if neg-overflow else 0
unsigned char c = temp & ~(temp >> 8);

No comparison operators or multiplies required.

Naamana answered 2/11, 2015 at 20:33 Comment(0)
B
2

You could also use the safe numerics library at Boost Library Incubator. It provides drop-in replacements for int, long, etc... which guarantee that you'll never get an undetected overflow, underflow, etc.

Baku answered 2/11, 2015 at 16:1 Comment(3)
Providing an example of how to use the library would make this a better answer. In addition, do they provide a brachless guarantee?Napoleonnapoleonic
The library has extensive documentation and examples. But the end of the day it's as easy as including the appropriate header and substituting safe<int> for int.Baku
branchless? I guess you man branchless. The library uses template metaprogramming to include run time checks only when necessary. For example unsigned char times unsigned char will result in unsigned int. This can never overflow so no checking at all need be done. On the other hand, unsigned times unsigned can overflow so it has to be checked at runtime.Baku
K
2

All can be done in unsigned byte arithmetic

// Addition without overflow
return (b > 255 - a) ? 255 : a + b

// Subtraction without underflow
return (b > a) ? 0 : a - b;
Kenay answered 2/11, 2015 at 16:44 Comment(1)
This is actually one of the best solutions. All the others making the substraction or addition before are actually creating an undefined behavior in C++, resulting in the compiler being able to do whatever it wants. In practice you can mostly predict what will happen, but still.Interplead
O
2

If you want to do this with two bytes, use the simplest code possible.

If you want to do this with twenty billion bytes, check what vector instructions are available on your processor and whether they can be used. You might find that your processor can do 32 of these operations with a single instruction.

Octavie answered 3/11, 2015 at 1:44 Comment(0)
I
2

If you are willing to use assembly or intrinsics, I think I have an optimal solution.

For subtraction:

We can use the sbb instruction

In MSVC we can use the intrinsic function _subborrow_u64 (also available in other bit sizes).

Here is how it is used:

// *c = a - (b + borrow)
// borrow_flag is set to 1 if (a < (b + borrow))
borrow_flag = _subborrow_u64(borrow_flag, a, b, c);

Here is how we could apply it to your situation

uint64_t sub_no_underflow(uint64_t a, uint64_t b){
    uint64_t result;
    borrow_flag = _subborrow_u64(0, a, b, &result);
    return result * !borrow_flag;
}

For addition:

We can use the adcx instruction

In MSVC we can use the intrinsic function _addcarry_u64 (also available in other bit sizes).

Here is how it is used:

// *c = a + b + carry
// carry_flag is set to 1 if there is a carry bit
carry_flag = _addcarry_u64(carry_flag, a, b, c);

Here is how we could apply it to your situation

uint64_t add_no_overflow(uint64_t a, uint64_t b){
    uint64_t result;
    carry_flag = _addcarry_u64(0, a, b, &result);
    return !carry_flag * result - carry_flag;
}

I don't like this one as much as the subtraction one, but I think it is pretty nifty.

If the add overflows, carry_flag = 1. Not-ing carry_flag yields 0, so !carry_flag * result = 0 when there is overflow. And since 0 - 1 will set the unsigned integral value to its max, the function will return the result of the addition if there is no carry and return the max of the chosen integral value if there is carry.

Inamorata answered 4/11, 2015 at 16:57 Comment(1)
You might want to mention that this answer is for a specific instruction-set architecture (x86?) and will require reimplementing for each target architecture (SPARC, MIPS, ARM, etc)Tasset
L
1

what about this:

bsum = a + b;
bsum = (bsum < a || bsum < b) ? 255 : bsum;

bsub = a - b;
bsub = (bsub > a || bsub > b) ? 0 : bsub;
Liebknecht answered 2/11, 2015 at 15:45 Comment(6)
I fixed the (obvious?) typo, but I still don't think this is correct.Internist
This also includes branching.Hecklau
I will delete this answer just a quick question in assembly without optimization what is difference between ternary operator and if/else statement?Liebknecht
@GRC There is no difference.Hecklau
@GRC FUZxxl is right, but, as always, try yourself. Even if you don't know assembly (you could make a question here on SO if something's not clear to you), just by checking the length / instructions you'll know.Mercy
Guys I did it, there is difference and unlike if/else version ternary operation does not include a single jump statement.Liebknecht
G
1

If you will call those methods a lot, the fastest way would be not bit manipulation but probably a look-up table. Define an array of length 511 for each operation. Example for minus (subtraction)

static unsigned char   maxTable[511];
memset(maxTable, 0, 255);           // If smaller, emulates cutoff at zero
maxTable[255]=0;                    // If equal     - return zero
for (int i=0; i<256; i++)
    maxTable[255+i] = i;            // If greater   - return the difference

The array is static and initialized only once. Now your subtraction can be defined as inline method or using pre-compiler:

#define MINUS(A,B)    maxTable[A-B+255];

How it works? Well you want to pre-calculate all possible subtractions for unsigned chars. The results vary from -255 to +255, total of 511 different result. We define an array of all possible results but because in C we cannot access it from negative indices we use +255 (in [A-B+255]). You can remove this action by defining a pointer to the center of the array.

const unsigned char *result = maxTable+255;
#define MINUS(A,B)    result[A-B];

use it like:

bsub  = MINUS(13,15); // i.e 13-15 with zero cutoff as requested

Note that the execution is extremely fast. Only one subtraction and one pointer deference to get the result. No branching. The static arrays are very short so they will be fully loaded into CPU's cache to further speed up the calculation

Same would work for addition but with a bit different table (first 256 elements will be the indices and last 255 elements will be equal to 255 to emulate the cutoff beyond 255.

If you insist on bits operation, the answers that use (a>b) are wrong. This still might be implemented as branching. Use the sign-bit technique

// (num1>num2) ? 1 : 0
#define        is_int_biggerNotEqual( num1,num2) ((((__int32)((num2)-(num1)))&0x80000000)>>31)

Now you can use it for calculation of subtraction and addition.

If you want to emulate the functions max(), min() without branching use:

inline __int32 MIN_INT(__int32 x, __int32 y){   __int32 d=x-y; return y+(d&(d>>31)); }              

inline __int32 MAX_INT(__int32 x, __int32 y){   __int32 d=x-y; return x-(d&(d>>31)); }

My examples above use 32 bits integers. You can change it to 64, though I believe that 32 bits calculations run a bit faster. Up to you

Greff answered 2/11, 2015 at 17:31 Comment(8)
It likely will not, actually: first, of course, loading up the table is slow. Bit operations take 1 cycle, loading from memory takes approximately 80 ns; even from the L1 cache we're in the range of 20 ns, which is almost 7 cycles on a 3GHz CPU.Mercy
You are not entirely correct. LUT method will take a few sycles but bit manipulation is not a single cycle as well. There are a few sequential actions. For example, only calculating the MAX() requires 2 subtractions, and logical operation and one shift right. And don't forget the integer promotion/demotionGreff
I meant to say that single bitwise operations take 1 cycle, naturally assuming register operands. With the code that Shafik showed, clang outputs 4 elementary instructions. Also (x > y), is branchless.Mercy
First, (x > y) might use branching. You don't know on which architecture you are running. I tend to agree that it is possibly branchless on Intel architecture. Most smartphones are not Intel. That is also the reason that you cannot know how many assembly instructions there will be. Try my solution on your PC. I am interested in hearing the results.Greff
How can't it be branchless? Branching takes in when you have to jump depending on the result of the comparison (as in if-else/?); in that case you just take the result of the operation (i.e. the flags affected).Mercy
L1 cache is much much faster than 20 ns, it's in the order of maybe 4 processor cycles. And will likely be using an otherwise unused execution unit, and be fully pipelined anyway. Measure it. And 20ns is 60 cycles in a 3 GHz CPU.Octavie
On some assembly languages (x>y) is implemented as c statement (x>y) ? 1 : 0; It has branching. Regarding the processing time - My measurements were indecisive. I tested the LUT (look up table) against 'Shafiks' code and on some hardware the LUT wins, on other 'Shafiks' code. The up side of the LUT that its performance is less depended on specific assembly languages and compiler optimizations flags (and is also branch-less guaranteed on every architecture).Greff
@DanielHsH: For processors which can't do a branchless x>y, what do you think of my approach? If the original values are in registers and aren't needed afterward, and if the result needn't be masked, they'd probably be about 4 instructions: add, shift-and-move, negate, or, using just the two original registers. If the result needs to be masked, that would likely add one instruction.Naamana

© 2022 - 2024 — McMap. All rights reserved.