Which is the fastest way to get the absolute value of a number

Which is the fastest way to implement an operation that returns the absolute value of a number?

x=root(x²)

or

if !isPositive(x):
    x=x*(-1)
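
For concreteness, here is how the two candidates might look in C (a sketch; the function names are mine, and note that x*x can overflow for large x, as a comment below points out):

#include <math.h>

double abs_sqrt(double x) { return sqrt(x * x); }   /* x = root(x²) */

double abs_branch(double x) {
    if (x < 0)
        x = x * (-1);                               /* negate if not positive */
    return x;
}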

Actually, this question can be translated as: how fast is an if (and why, please)?

My college programming professors always told me to avoid ifs because they are extremely slow, but I always forgot to ask how slow and why. Does anybody here know?

Phosphor answered 20/3, 2009 at 3:11 Comment(6)
This is absolute value, not modulus....Sailboat
At least here in Romania we use the English equivalent for "modulus" / "module" for "absolute value". I presume this phenomenon is spread to other languages as well.Johnsiejohnson
Ah, in American English, absolute value is the distance from 0 on a number line. That is, the absolute value of -4 is 4; the absolute value of 12 is 12.Whoop
Although it seems Wikipedia mentions the use of "modulus" when meaning "absolute value": en.wikipedia.org/wiki/Absolute_valueJohnsiejohnson
I think these English-speaking purists can't tell modulus from modulo. Modulus is a valid English term for referring to an absolute value of a real or complex number.Glick
The square/squareroot method is also prone to overflow.Cranial
80

Conditionals are slower than plain arithmetic operations, but much, much faster than something as silly as calculating the square root.

Rules of thumb from my assembly days:

  • Integer or bitwise op: 1 cycle
  • Floating-point add/sub/mul: 4 cycles
  • Floating-point div: ~30 cycles
  • Floating-point exponentiation: ~200 cycles
  • Floating-point sqrt: ~60 cycles depending on implementation
  • Conditional branch: avg. 10 cycles, better if well-predicted, much worse if mispredicted
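
For floating point, a good compiler can remove the branch entirely; as the comments below explain, fabs() typically compiles to a plain sign-bit clear (a sketch):

#include <math.h>

/* fabs() usually becomes a single bitwise AND that clears the sign bit
   (e.g. andps on x86/SSE). The hand-written ternary must also preserve
   negative zero (-0.0 compares == 0.0), so it doesn't always reduce to
   the same single instruction. */
double abs_ternary(double x) { return x < 0 ? -x : x; }
double abs_std(double x)     { return fabs(x); }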
Sailboat answered 20/3, 2009 at 3:18 Comment(3)
For fp add/sub/mul, those are latencies. Throughputs are still at least 1 per clock if you don't bottleneck on latency. Also, integer multiply is 3 cycle latency on modern x86. See Agner Fog's optimization guides to learn more about the difference between throughput and latency for pipelined CPUs (and out-of-order execution).Thay
Also note that any decent compiler will see what this specific if is doing and compile it to just a bitwise operation that clears the sign bit of the float or double (modern FPUs like x86 with SSE), or a dedicated instruction like legacy x87 fabs that does the same thing on the x87 FPU that doesn't support arbitrary bitwise stuff on floats.Thay
Or at least you'd hope so; practice is more complicated: godbolt.org/z/4K5W61. That's why you should actually use fabs(x) in C, which does compile as efficiently as possible, without burdening the compiler with signed-zero and NaN special-casing. E.g. if (x<0) x = -x; and x = (x<0) ? -x : x; both need to leave negative zero alone (because it compares == 0.0). But anyway, (-1)*x can optimize to just xorps to flip the sign bit.Thay
99

There is a great trick to calculate the absolute value of a 2s-complement integer without using an if statement. The theory goes: if the value is negative you want to toggle the bits and add one; otherwise you want to pass the bits through as is. A XOR 1 happens to toggle A, and A XOR 0 happens to leave A intact. So you want to do something like this:

  uint32_t temp = value >> 31;     // make a mask of the sign bit
  value ^= temp;                   // toggle the bits if value is negative
  value += temp & 1;               // add one if value was negative

In principle, you can do it in as few as three assembly instructions (without a branch). And you'd like to think that the abs() function that you get with math.h does it optimally.

No branches == better performance. Contrary to @paxdiablo's response above, this really matters on deeply pipelined processors: the more branches in your code, the more likely the branch predictor is to get one wrong and force a roll-back. Avoid branching where possible and things will keep moving along at full throttle in your core :).
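
A self-contained version of the trick, for reference (a sketch: it assumes two's complement and an arithmetic right shift, both implementation details, and like abs() it cannot represent |INT32_MIN|):

#include <stdint.h>

int32_t abs_branchless(int32_t value)
{
    int32_t mask = value >> 31;    /* 0 if non-negative, -1 (all ones) if negative */
    return (value ^ mask) - mask;  /* flips the bits and adds one only when negative */
}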

Romanticism answered 15/1, 2010 at 19:59 Comment(10)
by the way, this assumes value is an int32_t (i.e. signed), if it is not, you have to cast it as such before shifting itRomanticism
Instead of value += temp & 1, I suggest the simpler value -= temp, and there is no reason to use an unsigned type for temp.Ault
Also, note that some compilers on some processors can eliminate the branch implied by (x < 0 ? -x : x) using a conditional move instruction; in that case this trick is no faster than the standard Abs.Ault
I'm guessing this solution would fail on Big Endian architectures (e.g. Xbox 360). Am I right?Infirm
none of it makes sense to me where do i start digging !?!?! :(Claudy
Exactly what I came here looking for! So if your situation permits an error of one, you can just mask out the sign bit! Why didn't I think of that? lol.Chant
Nice one. This is better than branching; to name one reason: modern CPU's will vectorize this when used in a loop. See graphics.stanford.edu/~seander/bithacks.html#IntegerAbs for a bit more generic approach - and you might want to notice the patent if you're in the US.Collocation
pff why so much effort? Is there any reason why ((value >> 31) | 1) * value wouldn't suffice? multiplication isn't expensive.Lally
A XOR 1 inverts A if you read 1 as 1111...1. The right shift (>> 31) is assumed to fill the left side with copies of the leftmost bit. This is called an arithmetic shift. Nice answer; this small point confused me.Armidaarmiger
Works assuming two's complement, except for the most negative value -2147483648.Andante
26

Calculating the square root is probably one of the worst things you could do because it is really slow. Usually there is a library function for this, something like Math.Abs(). Multiplying by -1 is also unnecessary; just return -x. So a good solution would be the following:

(x >= 0) ? x : -x

The compiler will probably optimize this to a single instruction. Conditions may be quite expensive on modern processors because of their long execution pipelines: the calculations must be thrown away if a branch was mispredicted and the processor has started executing instructions from the wrong code path. But because of the compiler optimization mentioned, you need not care in this case.

Proterozoic answered 20/3, 2009 at 3:32 Comment(1)
Why doesn't this answer have more upvotes?! This compiles to mov eax, edi; neg eax; cmovl eax, edi; ret and it doesn't require any comments to explain all the bit twiddling.Aldis
8

For completeness, here's a way to do it for IEEE floats on x86 systems in C++:

*(reinterpret_cast<uint32_t*>(&foo)) &= 0xffffffff >> 1;
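
Note that the cast violates strict-aliasing rules; a memcpy-based version of the same bit-clear is well-defined in both C and C++ and typically compiles to the same code (a sketch, assuming a 32-bit IEEE 754 float; the function name is mine):

#include <stdint.h>
#include <string.h>

float abs_bits(float foo)
{
    uint32_t bits;
    memcpy(&bits, &foo, sizeof bits);  /* well-defined type punning */
    bits &= 0x7FFFFFFFu;               /* clear the sign (MSB) bit */
    memcpy(&foo, &bits, sizeof foo);
    return foo;
}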
Piper answered 26/7, 2012 at 16:8 Comment(4)
@Stefnotch take the address of a 32-bit floating-point variable foo, cast to a 32-bit unsigned integer pointer, dereference that and apply a bitmask that preserves all bits except for the (MSB) sign bitPiper
This answer is wrong. If you remove the bit sign of -1 you won't get 1 but a very large value instead. Lookup 2's complement to understand why.Schubert
@Schubert I think you are misunderstanding what's going on here. we are manipulating the raw bits of a floating-point number - the resulting bit pattern is not used as a signed integer but as a floating-point numberPiper
@MartinKällman, oops, you're right. My mistake. I was manipulating integers at the time and missed the "float" part of the answer.Schubert
6

Which is the fastest way to get the absolute value of a number

I think the "right" answer isn't here actually. The fastest way to get the absolute value is probably to use the Intel intrinsics. See https://software.intel.com/sites/landingpage/IntrinsicsGuide/ and look for 'vpabs' (or another intrinsic that does the job for your CPU). I'm pretty sure it'll beat all the other solutions here.

If you don't like intrinsics (or cannot use them, or ...), you might want to check whether the compiler is smart enough to figure out that a call to the 'native' absolute value (std::abs in C++ or Math.Abs(x) in C#) will change automagically into the intrinsic; basically that involves looking at the disassembled (compiled) code. If you're in a JIT, be sure that JIT optimizations aren't disabled.

If that also doesn't give you the optimized instructions, you can use the method described here: https://graphics.stanford.edu/~seander/bithacks.html#IntegerAbs .
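
For illustration, a sketch using the SSSE3 PABSD intrinsic (_mm_abs_epi32, from the intrinsics guide linked above; compile with -mssse3 or similar):

#include <immintrin.h>
#include <stdint.h>

/* Absolute value of four int32s at once. Note that, like PABSD itself,
   this maps INT32_MIN to INT32_MIN (no saturation). */
void abs4(const int32_t in[4], int32_t out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, _mm_abs_epi32(v));
}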

Collocation answered 13/7, 2015 at 20:13 Comment(1)
pabsd is great if you have an array of values, or otherwise can keep your data in vector register only, but neg/cmov is more efficient than copying from integer registers to XMM and back. You should almost always use std::abs and let the compiler auto-vectorize if it wants to, otherwise inline it efficiently.Thay
4

The if variant will almost certainly be blindingly fast compared to the square root, since it normally translates to a conditional jump instruction at the machine code level (following the evaluation of the expression, which may be complex, but not in this case since it's a simple check for less than 0).

Taking the square root of a number is likely to be much slower (Newton's method, for example, would use many, many if statements at the machine code level).

The likely source of confusion is the fact that an if can lead to changing the instruction pointer in a non-sequential manner. This can slow down processors that pre-fetch instructions into a pipeline, since they have to re-populate the pipeline when the address changes unexpectedly.

However, the cost of that would be minuscule compared to performing a square root operation as opposed to a simple check-and-negate.

Blastocoel answered 20/3, 2009 at 3:19 Comment(0)
3

The modulo operation is used to find a remainder; you mean absolute value. I modified the question because it should be if !pos(x) then x = x*-1 (the not was missing).

I wouldn't worry about the efficiency of an if statement. Instead focus on the readability of your code. If you identify that there is an efficiency problem, then focus on profiling your code to find real bottlenecks.

If you want to keep an eye out for efficiency while you code, you should only worry about the big-O complexity of your algorithms.

If statements are very efficient: the expression is evaluated, and then the program counter simply changes based on the condition. The program counter stores the address of the next instruction to be executed.

Multiplication by -1 and checking whether a value is greater than 0 can each be reduced to a single assembly instruction.

Finding the root of a number and squaring that number first is definitely more operations than the if with a negation.

Narva answered 20/3, 2009 at 3:17 Comment(2)
I'm guessing the professor is thinking of if statements stuffing up the pipeline, which I'm pretty sure doesn't happen in modern processors any more.Lazor
That professor is an idiot - calls to a root() function would also stuff up the pipeline.Blastocoel
2

The time taken to do a square root is much greater than the time taken to do a conditional. If you have been taught to avoid conditionals because they are slow, then you have been misinformed. They are a good deal slower than trivial operations like adding or subtracting integers or bit shifting, which is why unrolling loops can be of benefit only when you are doing such trivial operations. But in the grand scheme of things, conditionals are good and fast, not bad and slow. Doing something as complicated as calling a function or calculating a square root just to avoid a conditional statement is crazy.

Also, instead of (x = x * -1) why not do (x = 0 - x)? Maybe the compiler will optimize them the same, but isn't the second one simpler anyway?

Conveyancer answered 20/3, 2009 at 3:35 Comment(1)
"Also, instead of (x = x * -1) why not do (x = 0 - x)? Maybe the compiler will optimize them the same, but isn't the second one simpler anyway?" Sure it is I just never thought like that...Phosphor
2

Are you using 8086 assembly? ;-)

                ; abs value of AX
   cwd          ; replicate the high bit into DX
   xor  ax, dx  ; take 1's complement if negative; no change if positive
   sub  ax, dx  ; AX is 2's complement if it was negative

The standard absolute-value method works on any register but is much slower:

   or   bx, bx  ; see if number is negative
   jge  notneg  ; if it is negative...
   neg  bx      ; ...make it positive
notneg:         ; jump to here if positive

(flagrantly stolen)

Doble answered 20/3, 2009 at 4:23 Comment(1)
Use test same,same, not or same,same for efficiency (Test whether a register is zero with CMP reg,0 vs OR reg,reg?). And unless you're programming for an actual ancient CPU, use cmov instead of a conditional branch.Thay
1

You can try to use a single AND operator as a mask.

Here is some pseudocode as an example:

i8 num = 10001101 = -13
u8 mask = 01111111 = 127;
i8 res = num & mask = 00001101 = 13


I believe this is the fastest way to calculate the absolute value on a computer. Correct me if I'm wrong.

Paleozoic answered 12/7, 2023 at 9:13 Comment(2)
This seems to assume sign-magnitude integers. Somewhat common is two's complement: -13₁₀ = 11110011₂.Dispenser
Well, this should work on floats too, since the MSB of a float or double is a sign bit. In fact, for any signed representation whose MSB is a sign bit, a single AND operation can be used to calculate the absolute value.Paleozoic
0

If you are simply comparing the absolute values of two numbers (e.g. you don't need the absolute value of either after the comparison), then just square both values; squaring removes the sign, and the larger square will belong to the number with the larger absolute value.
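
A sketch of the idea in C (fine for floats in range; for integers beware that squaring can overflow, as a comment under the question notes):

/* Compare |a| < |b| without computing either absolute value:
   squaring drops the sign. */
int abs_less(double a, double b)
{
    return a * a < b * b;
}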

Cankerous answered 21/10, 2014 at 15:25 Comment(0)
0

What is faster depends heavily on what compiler and what CPU you're targeting. On most CPUs, and with most compilers, x = (x >= 0) ? x : -x; is the fastest way to get the absolute value, and often the standard functions already offer this solution (e.g. fabs()). It is compiled into a compare followed by a conditional-assignment instruction (CMOV), not into a conditional jump. Some platforms lack that instruction, though. The Intel compiler (but not Microsoft's or GCC) will automatically convert an if() into a conditional assignment, and will even try to optimize loops where possible.

Branching code is in general slower than conditional assignment when the CPU uses statistical branch prediction. An if() can be slower on average if the operation is repeated many times and the result of the condition keeps changing. CPUs like Intel's will speculatively execute down the predicted path and throw the work away on a mispredict; with large if() bodies or many iterations, that can become critical.

sqr() and sqrt() on modern Intel CPUs each map to a single built-in instruction and aren't terribly slow, though sqrt has much higher latency than a multiply, and getting values into the right registers takes time as well.

Related question: Why is a CPU branch instruction slow?

Most likely, the professor wanted the student to do some research on the matter; it's a semi-provocative question/task that does good only if the student learns to think independently and to look for additional sources.

Loyalist answered 18/3, 2015 at 16:29 Comment(2)
gcc does do if-conversion into branchless CMOV. See gcc optimization flag -O3 makes code slower than -O2 for a case where it backfires with sorted data. sqrt is a single instruction on x86 but it is slow-ish, and only available for float/double/long double, not integer. The throughput / latency numbers are similar to (but slower than) FP division: Floating point division vs floating point multiplication.Thay
Integer multiply is nice and fast, though. Not that it's relevant here, since it's not a useful building block for abs. It just takes a mov / neg / cmov to do it in 3 uops with 2 cycle latency.Thay
0

I'm doing some retro graphics programming in C for the 8088/8086, and calling abs() is time-consuming, so I've replaced it with:

/* assuming 'i' is int; this WILL NOT WORK on floating point */
if (i < 0) {
    i = ~i + 1;
}

The reason this is faster is that it essentially trades a CALL in assembly for a JNE. Calling a method changes a couple of registers, pushes several more, pushes arguments onto the stack, and can flush the prefetch queue. Plus these actions need to be reversed at the end of the function, and all of this is very expensive to the CPU.

Toandfro answered 21/11, 2016 at 23:8 Comment(1)
Any modern compiler can inline abs to code that compiles at least as efficiently as that. (e.g. neg/cmov on modern x86). Doing the 2's complement bithack yourself isn't useful; you might as well just use i = -i, because x86 has a neg instruction that's faster than NOT / INC (in case you have a naive compiler that doesn't recognize the 2's complement identity and optimize it back to neg or sub).Thay
0

For completeness, if you are dealing with floating point numbers, you can always do something like n * sign(n), where sign is a function that returns +1 if the number is positive, -1 if negative. In C this would be something like copysign(1.0, n) or (n > 0) - (n < 0).
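
A sketch of that approach: copysign(1.0, n) yields ±1.0 carrying the sign of n, so the product is |n| (and -0.0 comes out as +0.0):

#include <math.h>

double abs_sign(double n)
{
    return n * copysign(1.0, n);   /* |n| as n * sign(n) */
}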

Most machines use IEEE 754 as their floating point format these days, so you can clear the sign bit directly:

float my_fabs(float x) {                      /* renamed: fabs is taken by math.h */
    unsigned char *c = (unsigned char *)&x;   /* byte access never breaks aliasing */
    c[3] &= 0x7F;    /* clear the sign bit: top bit of the highest byte (little-endian) */
    return x;
}

Given that the abs function likely does this exact thing, your best bet is to use it when available. If you are lucky the function will be a couple of instructions, and will be inlined.

Erichericha answered 30/7, 2020 at 23:43 Comment(0)
0

I wonder if something is wrong with this solution. There is

  • no branching
  • no bitwidth dependent shifting
  • no bit twiddling
  • no architecture dependency
  • no compiler dependency
  • optionally: no undefined behaviour for INT_MIN

Maybe too many instructions?

My solution

xabs = (x < 0)*(-x) + (x >=0)*x
  • 2 integer comparisons
  • 2 multiplications

Old solution

xtest = (x < 0)*x;           // xtest = x if is negative, otherwise zero
xabs = (x - xtest) - xtest;  // Order of instructions taken into account

Undefined behaviour of negating INT_MIN

A check against undefined behaviour (negation of INT_MIN) can be added if your value is not already limited somewhere earlier in the algorithm, but that makes it a little more complicated; maybe someone can find simpler logic. (Strictly speaking, -x in the expression below is still evaluated even when its weight is zero, so x == INT_MIN can still overflow; only the branching version at the end truly avoids it.)

xabs =   (x < -INT_MAX)*INT_MAX            //  x < -INT_MAX < 0  --> xabs = INT_MAX
       + ((x >= -INT_MAX)&&(x < 0))*(-x)   // -INT_MAX =< x < 0  --> xabs = -x
       + (x >= 0)*x                        // 0 <= x             --> xabs = +x
  • 5 integer comparisons
  • 3 integer multiplications

Unfortunately, I never did a speed comparison. So I don't know if it is really faster than

if ( x < 0 )
{
  if ( x >= -INT_MAX )
  {
    x = -x;
  }
  else
  {
    x = INT_MAX;
  }
}
     
Advert answered 22/4, 2021 at 11:36 Comment(0)
-1

For a list of negative numbers:

if you have zero stored in memory, simply use 0 - x, where x is the negative number.

Or if you do not have zero stored in memory:

x-x-x, where x is the negative number.

Or, with brackets for clarity:

(x) - (x) - (x) => (-n) - (-n) - (-n), where x = -n

i.e. subtract the negative number from itself to get zero, then subtract it from zero.

Batchelder answered 8/9, 2019 at 16:37 Comment(0)
