Is multiplication and division using shift operators in C actually faster?
Asked Answered
P

19

331

Multiplication and division can be achieved using bit operators, for example

i*2 = i<<1
i*3 = (i<<1) + i;
i*10 = (i<<3) + (i<<1)

and so on.

Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly? Is there any sort of input that can't be multiplied or divided in this way?

Posology answered 15/6, 2011 at 11:31 Comment(12)
Actually, cheap division by a constant other than a power of two is possible, but a tricky subjet to which you are not doing justice with "/Division … /divided" in your question. See for instance hackersdelight.org/divcMore.pdf (or get the book "Hacker's delight" if you can).Mneme
It sound like something that could easily be tested.Arron
As usual - it depends. Once upon a time I tried this in assembler on an Intel 8088 (IBM PC/XT) where a multiplication took a bazillion clocks. Shifts and adds executed a lot faster, so it seemed like a good idea. However, while multiplying the bus unit was free to fill the instruction queue and the next instruction could then start immediately. After a series of shifts and adds the instruction queue would be empty and the CPU would have to wait for the next instruction to be fetched from memory (one byte at a time!). Measure, measure, measure!Glittery
@Bo so it was more trouble implementing the shifted version then it was worth or did you end up using it?Nicholasnichole
Also, beware that right-shifting is only well-defined for unsigned integers. If you have a signed integer, it's not defined whether 0 or the highest bit are padded from the left. (And don't forget the time it takes for someone else (even yourself) to read the code a year later!)Carleecarleen
@Rex - No, I ended up using the multiply, because that was the fastest for the whole routine. The 8088 was limited by the 8-bit bus, so the code size was often more important than the number of clocks for each instruction.Glittery
Actually, a good optimizing compiler will implement multiplication and division with shifts when they are faster.Devindevina
@Peter G. In the [s|b]ad old days before good optimizing compilers and fast processors, I used my own "times 10" routine (shift once & save, then shift twice more and add to saved value). Nowadays it's not worth bothering, but back then it made the difference between users getting a report out immediately, or going for a coffee break while they waited.Alainaalaine
It should be mentioned that the optimization is called strength reduction.Distracted
There is no such thing as "better". On an 8-pin microcontroller, you might optimize for fewer instructions. If the processor is a vector processor you might worry that you can do 16 multiplications in one instruction and need to shoehorn your algorithm into a wide vector engine. Like everyone says, make your code say what to do, not how to do it. If you happen to know that a specific code path takes 90% of your CPU time, then do this low level stuff IF measurements say it helps. Anything else is wasting time that could be spent actually optimizing things.Cordellcorder
@PeterG.: I'm not sure I've ever seen a compiler where dividing a signed number by a power of 2 was not slower than doing a right shift. One could argue that if dividend will never be negative one should cast to unsigned and do the division, but that can cause quirks of its own.Risteau
I'm a little late to this discussion but a recent test out of curiosity showed that int64 division was about 8 times slower than bitshifting but int64 multiplication was the same. Interestingly, int32 division produced about the same results as int32 bitshifting. I ran this test very impromptu in debug mode so these results might not be representative of the subject in application.Christianize
C
553

Short answer: Not likely.

Long answer: Your compiler has an optimizer in it that knows how to multiply as quickly as your target processor architecture is capable. Your best bet is to tell the compiler your intent clearly (i.e. i*2 rather than i << 1) and let it decide what the fastest assembly/machine code sequence is. It's even possible that the processor itself has implemented the multiply instruction as a sequence of shifts & adds in microcode.

Bottom line--don't spend a lot of time worrying about this. If you mean to shift, shift. If you mean to multiply, multiply. Do what is semantically clearest--your coworkers will thank you later. Or, more likely, curse you later if you do otherwise.

Calci answered 15/6, 2011 at 11:38 Comment(8)
Yep, as said the possible gains for almost every application will totally outweigh the obscurity introduced. Don't worry about this kind of optimisation prematurely. Build what is sematically clear, identify bottlenecks and optimise from there...Ornis
Agreed, optimizing for readability and maintainability will probably net you more time to spend actually optimizing things that the profiler says are hot code paths.Cordellcorder
These comments make it sound like you're giving up on potential performance from telling the compiler how to do its job. This is not the case. You actually get better code from gcc -O3 on x86 with return i*10 than from the shift version. As someone who looks at compiler output a lot (see many of my asm / optimization answers), I'm not suprised. There are times when it can help to hand-hold the compiler into one way of doing things, but this isn't one of them. gcc is good at integer math, because it's important.Beyer
Just downloaded an arduino sketch that has millis() >> 2; Would it have been too much to ask to just divide?Invest
@PaulWieland Embedded processors are very different from x86 compatibles. MSP430 has neither division nor multiplication instruction. Some of them have a separate multiplication peripheral which is not lightning fast, 4 cycles for 2 16 bit values and the result is broken into 4 8-bit registers.. When you're working at 16MHz and have no fancy pipeline optimizations or predictive execution, this kind of thing starts to matter. Yes, you can leave it up to the compiler but again, you're two orders of magnitude slower than what's discussed here.Mailand
To be fair the readability argument is subjective. It does make perfect sense when the target is novices but to a person very experienced with binary arithmetic it might be more readable to be that way. An example might be a project that already uses a lot of binary arithmetic so it will fit better in that case - and hence become more readable - to have the few odd divisions or multiplications to be in bit arithmetic too.Ghastly
+1 for the last comment. But I think it used to give some sort of overhead at a time when compilers were not as advanced as today.Espinosa
I tested i / 32 vs i >> 5 and i / 4 vs i >> 2 on gcc for cortex-a9 (which has no hardware division) with optimisation -O3 and the resulting assembly was exactly the same. I didn't like using divisions first but it describes my intention and the output is the same.Tumble
T
105

Just a concrete point of measure: many years back, I benchmarked two versions of my hashing algorithm:

unsigned
hash( char const* s )
{
    unsigned h = 0;
    while ( *s != '\0' ) {
        h = 127 * h + (unsigned char)*s;
        ++ s;
    }
    return h;
}

and

unsigned
hash( char const* s )
{
    unsigned h = 0;
    while ( *s != '\0' ) {
        h = (h << 7) - h + (unsigned char)*s;
        ++ s;
    }
    return h;
}

On every machine I benchmarked it on, the first was at least as fast as the second. Somewhat surprisingly, it was sometimes faster (e.g. on a Sun Sparc). When the hardware didn't support fast multiplication (and most didn't back then), the compiler would convert the multiplication into the appropriate combinations of shifts and add/sub. And because it knew the final goal, it could sometimes do so in less instructions than when you explicitly wrote the shifts and the add/subs.

Note that this was something like 15 years ago. Hopefully, compilers have only gotten better since then, so you can pretty much count on the compiler doing the right thing, probably better than you could. (Also, the reason the code looks so C'ish is because it was over 15 years ago. I'd obviously use std::string and iterators today.)

Theca answered 15/6, 2011 at 12:35 Comment(2)
You may be interested in the following blog post, in which the author notes that modern optimizing compilers seem to reverse-engineer common patterns that programmers might use thinking them more efficient into their mathematical forms so as to really generate the most efficient instruction sequence for them. shape-of-code.coding-guidelines.com/2009/06/30/…Mneme
@PascalCuoq Nothing really new about this. I discovered pretty much the same thing for Sun CC close to 20 years ago.Theca
S
75

In addition to all the other good answers here, let me point out another reason to not use shift when you mean divide or multiply. I have never once seen someone introduce a bug by forgetting the relative precedence of multiplication and addition. I have seen bugs introduced when maintenance programmers forgot that "multiplying" via a shift is logically a multiplication but not syntactically of the same precedence as multiplication. x * 2 + z and x << 1 + z are very different!

If you're working on numbers then use arithmetic operators like + - * / %. If you're working on arrays of bits, use bit twiddling operators like & ^ | >> . Don't mix them; an expression that has both bit twiddling and arithmetic is a bug waiting to happen.

Sophey answered 15/6, 2011 at 14:13 Comment(9)
Avoidable with simple parenthesis?Horseshoes
@Joel: Sure. If you remember that you need them. My point is that it is easy to forget that you do. People who get in the mental habit of reading "x<<1" as though it were "x*2" get in the mental habit of thinking that << is the same precedence as multiplication, which it is not.Sophey
Well, I find expression "(hi << 8) + lo" more intent-revealing than "hi*256 + lo". Probably it is a matter of taste, but sometimes it is more clear to write bit-twiddling. In most cases though I totally agree with your point.Capitular
@Ivan: And "(hi << 8) | lo" is even more clear. Setting the low bits of a bit array is not addition of integers. It is setting bits, so write the code that sets bits.Sophey
What would you consider the most idiomatic way of e.g. dividing a signed number (which isn't near Int32.MaxValue) by 256 with rounding? (X+128)>>8 is fast and pretty clear compared with anything I can figure using the "/" operator. Do you know any formulations using "/" which are not both harder to read and slower to execute?Risteau
@supercat: Slower to execute is irrelevant. Unacceptably slow is relevant, and that's a question that can only be answered by knowing what acceptable speed is. If either technique has acceptable speed then pick the one that is easier to read; if neither is acceptable then choosing the harder-to-read one does not solve the problem. The middle ground, where the hard-to-read one is acceptable but the easy-to-read one is not, is rare; if you're in that rare situation then the code is extraordinarily perf sensitive and should be well-commented.Sophey
@EricLippert: Even just focusing on ease of reading, do you know any nice way to compute a rounded division by 256 other than by using shift? Something like x >= 0 ? (x + 128)/256 : (x-127)/256 would seem ugly even if it performed just as well as (x+128)>>8.Risteau
@supercat: I must confess that the operation "divide a signed integer by 256 with rounding" I have not once had to do in my career, so I've spent no time considering how to optimize it for either readability or speed.Sophey
@EricLippert: Fair enough. Such things arise pretty often in the embedded-systems world (things like digital thermometers, etc.) where floating-point is expensive, and also used to be very common in graphics programming (though floating-point hardware has by now become somewhat ubiquitous). If you think 256 is an unusual value, pick any other. For example, using integer math, is there any nicer way than (a+b+c+d+2)>>2 to compute a value that is within 0.5 of the average value? That would seem a pretty normal thing to do.Risteau
G
50

This depends on the processor and the compiler. Some compilers already optimize code this way, others don't. So you need to check each time your code needs to be optimized this way.

Unless you desperately need to optimize, I would not scramble my source code just to save an assembly instruction or processor cycle.

Gathers answered 15/6, 2011 at 11:34 Comment(3)
Just to add a rough estimation: On a typical 16-Bit processor (80C166) adding two ints comes at 1-2 cycles, a multiplication at 10 cycles and a division at 20 cycles. Plus some move-operations if you optimize i*10 into multiple ops (each mov another +1 cycle). The most common compilers (Keil/Tasking) do not optimize unless for multiplications/divisions by a power of 2.Gathers
And in general, the compiler optimizes code better than you do.Polynuclear
I agree that when multiplying "quantities", the multiplication operator is generally better, but when dividing signed values by powers of 2, the >> operator is faster than / and, if the signed values can be negative, it's often semantically superior as well. If one needs the value which x>>4 would produce, that's a lot clearer than x < 0 ? -((-1-x)/16)-1 : x/16;, and I can't imagine how a compiler could optimize that latter expression to something nice.Risteau
T
43

Is it actually faster to use say (i<<3)+(i<<1) to multiply with 10 than using i*10 directly?

It might or might not be on your machine - if you care, measure in your real-world usage.

A case study - from 486 to core i7

Benchmarking is very difficult to do meaningfully, but we can look at a few facts. From http://www.penguin.cz/~literakl/intel/s.html#SAL and http://www.penguin.cz/~literakl/intel/i.html#IMUL we get an idea of x86 clock cycles needed for arithmetic shift and multiplication. Say we stick to "486" (the newest one listed), 32 bit registers and immediates, IMUL takes 13-42 cycles and IDIV 44. Each SAL takes 2, and adding 1, so even with a few of those together shifting superficially looks like a winner.

These days, with the core i7:

(from http://software.intel.com/en-us/forums/showthread.php?t=61481)

The latency is 1 cycle for an integer addition and 3 cycles for an integer multiplication. You can find the latencies and thoughput in Appendix C of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", which is located on http://www.intel.com/products/processor/manuals/.

(from some Intel blurb)

Using SSE, the Core i7 can issue simultaneous add and multiply instructions, resulting in a peak rate of 8 floating-point operations (FLOP) per clock cycle

That gives you an idea of how far things have come. The optimisation trivia - like bit shifting versus * - that was been taken seriously even into the 90s is just obsolete now. Bit-shifting is still faster, but for non-power-of-two mul/div by the time you do all your shifts and add the results it's slower again. Then, more instructions means more cache faults, more potential issues in pipelining, more use of temporary registers may mean more saving and restoring of register content from the stack... it quickly gets too complicated to quantify all the impacts definitively but they're predominantly negative.

functionality in source code vs implementation

More generally, your question is tagged C and C++. As 3rd generation languages, they're specifically designed to hide the details of the underlying CPU instruction set. To satisfy their language Standards, they must support multiplication and shifting operations (and many others) even if the underlying hardware doesn't. In such cases, they must synthesize the required result using many other instructions. Similarly, they must provide software support for floating point operations if the CPU lacks it and there's no FPU. Modern CPUs all support * and <<, so this might seem absurdly theoretical and historical, but the significance thing is that the freedom to choose implementation goes both ways: even if the CPU has an instruction that implements the operation requested in the source code in the general case, the compiler's free to choose something else that it prefers because it's better for the specific case the compiler's faced with.

Examples (with a hypothetical assembly language)

source           literal approach         optimised approach
#define N 0
int x;           .word x                xor registerA, registerA
x *= N;          move x -> registerA
                 move x -> registerB
                 A = B * immediate(0)
                 store registerA -> x
  ...............do something more with x...............

Instructions like exclusive or (xor) have no relationship to the source code, but xor-ing anything with itself clears all the bits, so it can be used to set something to 0. Source code that implies memory addresses may not entail any being used.

These kind of hacks have been used for as long as computers have been around. In the early days of 3GLs, to secure developer uptake the compiler output had to satisfy the existing hardcore hand-optimising assembly-language dev. community that the produced code wasn't slower, more verbose or otherwise worse. Compilers quickly adopted lots of great optimisations - they became a better centralised store of it than any individual assembly language programmer could possibly be, though there's always the chance that they miss a specific optimisation that happens to be crucial in a specific case - humans can sometimes nut it out and grope for something better while compilers just do as they've been told until someone feeds that experience back into them.

So, even if shifting and adding is still faster on some particular hardware, then the compiler writer's likely to have worked out exactly when it's both safe and beneficial.

Maintainability

If your hardware changes you can recompile and it'll look at the target CPU and make another best choice, whereas you're unlikely to ever want to revisit your "optimisations" or list which compilation environments should use multiplication and which should shift. Think of all the non-power-of-two bit-shifted "optimisations" written 10+ years ago that are now slowing down the code they're in as it runs on modern processors...!

Thankfully, good compilers like GCC can typically replace a series of bitshifts and arithmetic with a direct multiplication when any optimisation is enabled (i.e. ...main(...) { return (argc << 4) + (argc << 2) + argc; } -> imull $21, 8(%ebp), %eax) so a recompilation may help even without fixing the code, but that's not guaranteed.

Strange bitshifting code implementing multiplication or division is far less expressive of what you were conceptually trying to achieve, so other developers will be confused by that, and a confused programmer's more likely to introduce bugs or remove something essential in an effort to restore seeming sanity. If you only do non-obvious things when they're really tangibly beneficial, and then document them well (but don't document other stuff that's intuitive anyway), everyone will be happier.

General solutions versus partial solutions

If you have some extra knowledge, such as that your int will really only be storing values x, y and z, then you may be able to work out some instructions that work for those values and get you your result more quickly than when the compiler's doesn't have that insight and needs an implementation that works for all int values. For example, consider your question:

Multiplication and division can be achieved using bit operators...

You illustrate multiplication, but how about division?

int x;
x >> 1;   // divide by 2?

According to the C++ Standard 5.8:

-3- The value of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity 2 raised to the power E2. If E1 has a signed type and a negative value, the resulting value is implementation-defined.

So, your bit shift has an implementation defined result when x is negative: it may not work the same way on different machines. But, / works far more predictably. (It may not be perfectly consistent either, as different machines may have different representations of negative numbers, and hence different ranges even when there are the same number of bits making up the representation.)

You may say "I don't care... that int is storing the age of the employee, it can never be negative". If you have that kind of special insight, then yes - your >> safe optimisation might be passed over by the compiler unless you explicitly do it in your code. But, it's risky and rarely useful as much of the time you won't have this kind of insight, and other programmers working on the same code won't know that you've bet the house on some unusual expectations of the data you'll be handling... what seems a totally safe change to them might backfire because of your "optimisation".

Is there any sort of input that can't be multiplied or divided in this way?

Yes... as mentioned above, negative numbers have implementation defined behaviour when "divided" by bit-shifting.

Therapist answered 17/6, 2011 at 10:28 Comment(2)
Very nice answer. Core i7 vs. 486 comparison is enlightening!Calci
On all commonplace architectures, intVal>>1 will have the same semantics which differ from those of intVal/2 in a way that is sometimes useful. If one needs to compute in portable fashion the value that commonplace architectures would yield for intVal >> 1, the expression would need to be rather more complicated and harder to read, and would be likely to generate substantially inferior code to that produced for intVal >> 1.Risteau
P
36

Just tried on my machine compiling this :

int a = ...;
int b = a * 10;

When disassembling it produces output :

MOV EAX,DWORD PTR SS:[ESP+1C] ; Move a into EAX
LEA EAX,DWORD PTR DS:[EAX+EAX*4] ; Multiply by 5 without shift !
SHL EAX, 1 ; Multiply by 2 using shift

This version is faster than your hand-optimized code with pure shifting and addition.

You really never know what the compiler is going to come up with, so it's better to simply write a normal multiplication and let him optimize the way he wants to, except in very precise cases where you know the compiler cannot optimize.

Polynuclear answered 15/6, 2011 at 14:53 Comment(4)
You would have gotten a big upvote for this if you had skipped the part about the vector. If the compiler can fix the multiply it can also see that the vector doesn't change.Glittery
How can a compiler know a vector size won't change without making some really dangerous assumptions? Or have you never heard of concurrency...Exist
Ok, so you loop over a global vector with no locks? And I loop over a local vector who's address hasn't been taken, and call only const member functions. At least my compiler realizes that the vector size will not change. (and soon someone will probably flag us for chatting :-).Glittery
@BoPersson Finally, after all this time, I removed my statement about the compiler not being able to optimize away vector<T>::size(). My compiler was quite ancient! :)Polynuclear
H
22

Shifting is generally a lot faster than multiplying at an instruction level but you may well be wasting your time doing premature optimisations. The compiler may well perform these optimisations at compiletime. Doing it yourself will affect readability and possibly have no effect on performance. It's probably only worth it to do things like this if you have profiled and found this to be a bottleneck.

Actually the division trick, known as 'magic division' can actually yield huge payoffs. Again you should profile first to see if it's needed. But if you do use it there are useful programs around to help you figure out what instructions are needed for the same division semantics. Here is an example : http://www.masm32.com/board/index.php?topic=12421.0

An example which I have lifted from the OP's thread on MASM32:

include ConstDiv.inc
...
mov eax,9999999
; divide eax by 100000
cdiv 100000
; edx = quotient

Would generate:

mov eax,9999999
mov edx,0A7C5AC47h
add eax,1
.if !CARRY?
    mul edx
.endif
shr edx,16
Hopfinger answered 15/6, 2011 at 11:35 Comment(2)
There are no random forum threads about liking math. Anyone who likes math knows how hard it is to generate a true "random" forum thread.Horseshoes
It's probably only worth it to do things like this if you have profiled and found this to be a bottleneck and implemented the alternatives and profile again and get at least 10 times performance advantage.Doubloon
C
15

Shift and integer multiply instructions have similar performance on most modern CPUs - integer multiply instructions were relatively slow back in the 1980s but in general this is no longer true. Integer multiply instructions may have higher latency, so there may still be cases where a shift is preferable. Ditto for cases where you can keep more execution units busy (although this can cut both ways).

Integer division is still relatively slow though, so using a shift instead of division by a power of 2 is still a win, and most compilers will implement this as an optimisation. Note however that for this optimisation to be valid the dividend needs to be either unsigned or must be known to be positive. For a negative dividend the shift and divide are not equivalent!

#include <stdio.h>

int main(void)
{
    int i;

    for (i = 5; i >= -5; --i)
    {
        printf("%d / 2 = %d, %d >> 1 = %d\n", i, i / 2, i, i >> 1);
    }
    return 0;
}

Output:

5 / 2 = 2, 5 >> 1 = 2
4 / 2 = 2, 4 >> 1 = 2
3 / 2 = 1, 3 >> 1 = 1
2 / 2 = 1, 2 >> 1 = 1
1 / 2 = 0, 1 >> 1 = 0
0 / 2 = 0, 0 >> 1 = 0
-1 / 2 = 0, -1 >> 1 = -1
-2 / 2 = -1, -2 >> 1 = -1
-3 / 2 = -1, -3 >> 1 = -2
-4 / 2 = -2, -4 >> 1 = -2
-5 / 2 = -2, -5 >> 1 = -3

So if you want to help the compiler then make sure the variable or expression in the dividend is explicitly unsigned.

Canna answered 15/6, 2011 at 11:42 Comment(3)
Integer multiplies are microcoded for example on PlayStation 3's PPU, and stall the whole pipeline. It's recommended to avoid integer multiplies on some platforms still :)Intellectual
Many unsigned divisions are - assuming the compiler knows how - implemented using unsigned multiplies. One or two multiplies @ a few clock cycles each can do the same work as a division @ 40 cycles each and up.Discography
@Olof: true, but only valid for division by a compile-time constant of courseCanna
S
4

It completely depends on target device, language, purpose, etc.

Pixel crunching in a video card driver? Very likely, yes!

.NET business application for your department? Absolutely no reason to even look into it.

For a high performance game for a mobile device it might be worth looking into, but only after easier optimizations have been performed.

Sir answered 15/6, 2011 at 16:13 Comment(0)
G
2

Don't do unless you absolutely need to and your code intent requires shifting rather than multiplication/division.

In typical day - you could potentialy save few machine cycles (or loose, since compiler knows better what to optimize), but the cost doesn't worth it - you spend time on minor details rather than actual job, maintaining the code becomes harder and your co-workers will curse you.

You might need to do it for high-load computations, where each saved cycle means minutes of runtime. But, you should optimize one place at a time and do performance tests each time to see if you really made it faster or broke compilers logic.

Glyoxaline answered 15/6, 2011 at 13:48 Comment(0)
W
1

As far as I know in some machines multiplication can need upto 16 to 32 machine cycle. So Yes, depending on the machine type, bitshift operators are faster than multiplication / division.

However certain machine do have their math processor, which contains special instructions for multiplication/division.

Wist answered 15/6, 2011 at 11:35 Comment(1)
The people writing compilers for those machines have also likely read the Hackers Delight and optimize accordingly.Glittery
L
1

In the case of signed integers and right shift vs division, it can make a difference. For negative numbers, the shift rounds rounds towards negative infinity whereas division rounds towards zero. Of course the compiler will change the division to something cheaper, but it will usually change it to something that has the same rounding behavior as division, because it is either unable to prove that the variable won't be negative or it simply doesn't care. So if you can prove that a number won't be negative or if you don't care which way it will round, you can do that optimization in a way that is more likely to make a difference.

Laevo answered 15/6, 2011 at 16:29 Comment(4)
or cast the number to unsignedDoubloon
Are you sure that the shifting behaviour is standardized? I was under the impression that right-shift on negative ints is implementation-defined.Carleecarleen
While you should perhaps mention that code which relies upon any particular behavior for right-shifting negative numbers should document that requirement, the advantage to right-shifting is huge in cases where it naturally yields the right value and the division operator would generate code to waste time computing an unwanted value which user code would then have to waste additional time adjusting to yield what the shift would have given in the first place. Actually, if I had my druthers, compilers would have an option to squawk at attempts to perform signed division, since...Risteau
...code which knows the operands are positive could improve optimization if it cast to unsigned before division (possibly casting back to signed afterward), and code which knows that operands might be negative should generally deal with that case explicitly anyhow (in which case one may as well assume them to be positive).Risteau
G
1

I agree with the marked answer by Drew Hall. The answer could use some additional notes though.

For the vast majority of software developers the processor and compiler are no longer relevant to the question. Most of us are far beyond the 8088 and MS-DOS. It is perhaps only relevant for those who are still developing for embedded processors...

At my software company Math (add/sub/mul/div) should be used for all mathematics. While Shift should be used when converting between data types eg. ushort to byte as n>>8 and not n/256.

Germicide answered 3/12, 2012 at 19:24 Comment(1)
I agree with you, too. I follow the same guideline subconsciously, though I haven't ever had a formal requirement to do so.Calci
L
0

Python test performing same multiplication 100 million times against the same random numbers.

>>> from timeit import timeit
>>> setup_str = 'import scipy; from scipy import random; scipy.random.seed(0)'
>>> N = 10*1000*1000
>>> timeit('x=random.randint(65536);', setup=setup_str, number=N)
1.894096851348877 # Time from generating the random #s and no opperati

>>> timeit('x=random.randint(65536); x*2', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); x << 1', setup=setup_str, number=N)
2.2616429328918457

>>> timeit('x=random.randint(65536); x*10', setup=setup_str, number=N)
2.2799630165100098
>>> timeit('x=random.randint(65536); (x << 3) + (x<<1)', setup=setup_str, number=N)
2.9485139846801758

>>> timeit('x=random.randint(65536); x // 2', setup=setup_str, number=N)
2.490908145904541
>>> timeit('x=random.randint(65536); x / 2', setup=setup_str, number=N)
2.4757170677185059
>>> timeit('x=random.randint(65536); x >> 1', setup=setup_str, number=N)
2.2316000461578369

So in doing a shift rather than multiplication/division by a power of two in python, there's a slight improvement (~10% for division; ~1% for multiplication). If its a non-power of two, there's likely a considerable slowdown.

Again these #s will change depending on your processor, your compiler (or interpreter -- did in python for simplicity).

As with everyone else, don't prematurely optimize. Write very readable code, profile if its not fast enough, and then try to optimize the slow parts. Remember, your compiler is much better at optimization than you are.

Leggat answered 15/6, 2011 at 18:23 Comment(0)
G
0

There are optimizations the compiler can't do because they only work for a reduced set of inputs.

Below there is c++ sample code that can do a faster division doing a 64bits "Multiplication by the reciprocal". Both numerator and denominator must be below certain threshold. Note that it must be compiled to use 64 bits instructions to be actually faster than normal division.

#include <stdio.h>
#include <chrono>

static const unsigned s_bc = 32;
static const unsigned long long s_p = 1ULL << s_bc;
static const unsigned long long s_hp = s_p / 2;

static unsigned long long s_f;
static unsigned long long s_fr;

static void fastDivInitialize(const unsigned d)
{
    s_f = s_p / d;
    s_fr = s_f * (s_p - (s_f * d));
}

static unsigned fastDiv(const unsigned n)
{
    return (s_f * n + ((s_fr * n + s_hp) >> s_bc)) >> s_bc;
}

static bool fastDivCheck(const unsigned n, const unsigned d)
{
    // 32 to 64 cycles latency on modern cpus
    const unsigned expected = n / d;

    // At least 10 cycles latency on modern cpus
    const unsigned result = fastDiv(n);

    if (result != expected)
    {
        printf("Failed for: %u/%u != %u\n", n, d, expected);
        return false;
    }

    return true;
}

int main()
{
    unsigned result = 0;

    // Make sure to verify it works for your expected set of inputs
    const unsigned MAX_N = 65535;
    const unsigned MAX_D = 40000;

    const double ONE_SECOND_COUNT = 1000000000.0;

    auto t0 = std::chrono::steady_clock::now();
    unsigned count = 0;
    printf("Verifying...\n");
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        fastDivInitialize(d);
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            count += !fastDivCheck(n, d);
        }
    }
    auto t1 = std::chrono::steady_clock::now();
    printf("Errors: %u / %u (%.4fs)\n", count, MAX_D * (MAX_N + 1), (t1 - t0).count() / ONE_SECOND_COUNT);

    t0 = t1;
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        fastDivInitialize(d);
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            result += fastDiv(n);
        }
    }
    t1 = std::chrono::steady_clock::now();
    printf("Fast division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);

    t0 = t1;
    count = 0;
    for (unsigned d = 1; d <= MAX_D; ++d)
    {
        for (unsigned n = 0; n <= MAX_N; ++n)
        {
            result += n / d;
        }
    }
    t1 = std::chrono::steady_clock::now();
    printf("Normal division time: %.4fs\n", (t1 - t0).count() / ONE_SECOND_COUNT);

    getchar();
    return result;
}
Gunslinger answered 3/6, 2017 at 2:15 Comment(0)
C
0

I think in the one case that you want to multiply or divide by a power of two, you can't go wrong with using bitshift operators, even if the compiler converts them to a MUL/DIV, because some processors microcode (really, a macro) them anyway, so for those cases you will achieve an improvement, especially if the shift is more than 1. Or more explicitly, if the CPU has no bitshift operators, it will be a MUL/DIV anyway, but if the CPU has bitshift operators, you avoid a microcode branch and this is a few instructions less.

I am writing some code right now that requires a lot of doubling/halving operations because it is working on a dense binary tree, and there is one more operation that I suspect might be more optimal than an addition - a left (power of two multiply) shift with an addition. This can be replaced with a left shift and an xor if the shift is wider than the number of bits you want to add, easy example is (i<<1)^1, which adds one to a doubled value. This does not of course apply to a right shift (power of two divide) because only a left (little endian) shift fills the gap with zeros.

In my code, these multiply/divide by two and powers of two operations are very intensively used and because the formulae are quite short already, each instruction that can be eliminated can be a substantial gain. If the processor does not support these bitshift operators, no gain will happen but neither will there be a loss.

Also, in the algorithms I am writing, they visually represent the movements that occur so in that sense they are in fact more clear. The left hand side of a binary tree is bigger, and the right is smaller. As well as that, in my code, odd and even numbers have a special significance, and all left-hand children in the tree are odd and all right hand children, and the root, are even. In some cases, which I haven't encountered yet, but may, oh, actually, I didn't even think of this, x&1 may be a more optimal operation compared to x%2. x&1 on an even number will produce zero, but will produce 1 for an odd number.

Going a bit further than just odd/even identification, if I get zero for x&3 I know that 4 is a factor of our number, and same for x%7 for 8, and so on. I know that these cases have probably got limited utility but it's nice to know that you can avoid a modulus operation and use a bitwise logic operation instead, because bitwise operations are almost always the fastest, and least likely to be ambiguous to the compiler.

I am pretty much inventing the field of dense binary trees so I expect that people may not grasp the value of this comment, as very rarely do people want to only perform factorisations on only powers of two, or only multiply/divide powers of two.

Cupbearer answered 6/4, 2018 at 11:8 Comment(0)
A
0

Whether it is actually faster depends on the hardware and compiler actually used.

Ammadis answered 28/7, 2019 at 10:22 Comment(0)
I
0

If you compare output for x+x , x*2 and x<<1 syntax on a gcc compiler, then you would get the same result in x86 assembly : https://godbolt.org/z/JLpp0j

        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        mov     eax, DWORD PTR [rbp-4]
        add     eax, eax
        pop     rbp
        ret

So you can consider gcc as smart enought to determine his own best solution independently from what you typed.

Inebriety answered 5/8, 2019 at 10:11 Comment(0)
P
0

I too wanted to see if I could Beat the House. this is a more general bitwise for any-number by any number multiplication. the macros I made are about 25% more to twice as slower than normal * multiplication. as said by others if it's close to a multiple of 2 or made up of few multiples of 2 you might win. like X*23 made up of (X<<4)+(X<<2)+(X<<1)+X is going to be slower then X*65 made up of (X<<6)+X.

#include <stdio.h>
#include <time.h>

#define MULTIPLYINTBYMINUS(X,Y) (-((X >> 30) & 1)&(Y<<30))+(-((X >> 29) & 1)&(Y<<29))+(-((X >> 28) & 1)&(Y<<28))+(-((X >> 27) & 1)&(Y<<27))+(-((X >> 26) & 1)&(Y<<26))+(-((X >> 25) & 1)&(Y<<25))+(-((X >> 24) & 1)&(Y<<24))+(-((X >> 23) & 1)&(Y<<23))+(-((X >> 22) & 1)&(Y<<22))+(-((X >> 21) & 1)&(Y<<21))+(-((X >> 20) & 1)&(Y<<20))+(-((X >> 19) & 1)&(Y<<19))+(-((X >> 18) & 1)&(Y<<18))+(-((X >> 17) & 1)&(Y<<17))+(-((X >> 16) & 1)&(Y<<16))+(-((X >> 15) & 1)&(Y<<15))+(-((X >> 14) & 1)&(Y<<14))+(-((X >> 13) & 1)&(Y<<13))+(-((X >> 12) & 1)&(Y<<12))+(-((X >> 11) & 1)&(Y<<11))+(-((X >> 10) & 1)&(Y<<10))+(-((X >> 9) & 1)&(Y<<9))+(-((X >> 8) & 1)&(Y<<8))+(-((X >> 7) & 1)&(Y<<7))+(-((X >> 6) & 1)&(Y<<6))+(-((X >> 5) & 1)&(Y<<5))+(-((X >> 4) & 1)&(Y<<4))+(-((X >> 3) & 1)&(Y<<3))+(-((X >> 2) & 1)&(Y<<2))+(-((X >> 1) & 1)&(Y<<1))+(-((X >> 0) & 1)&(Y<<0))
#define MULTIPLYINTBYSHIFT(X,Y) (((((X >> 30) & 1)<<31)>>31)&(Y<<30))+(((((X >> 29) & 1)<<31)>>31)&(Y<<29))+(((((X >> 28) & 1)<<31)>>31)&(Y<<28))+(((((X >> 27) & 1)<<31)>>31)&(Y<<27))+(((((X >> 26) & 1)<<31)>>31)&(Y<<26))+(((((X >> 25) & 1)<<31)>>31)&(Y<<25))+(((((X >> 24) & 1)<<31)>>31)&(Y<<24))+(((((X >> 23) & 1)<<31)>>31)&(Y<<23))+(((((X >> 22) & 1)<<31)>>31)&(Y<<22))+(((((X >> 21) & 1)<<31)>>31)&(Y<<21))+(((((X >> 20) & 1)<<31)>>31)&(Y<<20))+(((((X >> 19) & 1)<<31)>>31)&(Y<<19))+(((((X >> 18) & 1)<<31)>>31)&(Y<<18))+(((((X >> 17) & 1)<<31)>>31)&(Y<<17))+(((((X >> 16) & 1)<<31)>>31)&(Y<<16))+(((((X >> 15) & 1)<<31)>>31)&(Y<<15))+(((((X >> 14) & 1)<<31)>>31)&(Y<<14))+(((((X >> 13) & 1)<<31)>>31)&(Y<<13))+(((((X >> 12) & 1)<<31)>>31)&(Y<<12))+(((((X >> 11) & 1)<<31)>>31)&(Y<<11))+(((((X >> 10) & 1)<<31)>>31)&(Y<<10))+(((((X >> 9) & 1)<<31)>>31)&(Y<<9))+(((((X >> 8) & 1)<<31)>>31)&(Y<<8))+(((((X >> 7) & 1)<<31)>>31)&(Y<<7))+(((((X >> 6) & 1)<<31)>>31)&(Y<<6))+(((((X >> 5) & 1)<<31)>>31)&(Y<<5))+(((((X >> 4) & 1)<<31)>>31)&(Y<<4))+(((((X >> 3) & 1)<<31)>>31)&(Y<<3))+(((((X >> 2) & 1)<<31)>>31)&(Y<<2))+(((((X >> 1) & 1)<<31)>>31)&(Y<<1))+(((((X >> 0) & 1)<<31)>>31)&(Y<<0))
int main()
{
    int randomnumber=23;
    int randomnumber2=23;
    int checknum=23;
    clock_t start, diff;
    srand(time(0));
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum=MULTIPLYINTBYMINUS(randomnumber,randomnumber2);
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    int msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("MULTIPLYINTBYMINUS Time %d milliseconds", msec);
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum=MULTIPLYINTBYSHIFT(randomnumber,randomnumber2);
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("MULTIPLYINTBYSHIFT Time %d milliseconds", msec);
    start = clock();
    for(int i=0;i<1000000;i++)
    {
        randomnumber = rand() % 10000;
        randomnumber2 = rand() % 10000;
        checknum= randomnumber*randomnumber2;
        if (checknum!=randomnumber*randomnumber2)
        {
            printf("s %i and %i and %i",checknum,randomnumber,randomnumber2);
        }
    }
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("normal * Time %d milliseconds", msec);
    return 0;
}
Plagio answered 12/1, 2020 at 7:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.