Fast multiplication/division by 2 for floats and doubles (C/C++)

In the software I'm writing, I'm doing millions of multiplications or divisions by 2 (or powers of 2) on my values. I would really like these values to be ints so that I could use the bitshift operators:

int a = 1;
int b = a<<24;

However, I cannot, and I have to stick with doubles.

My question is: since there is a standard representation of doubles (sign, exponent, mantissa), is there a way to play with the exponent to get fast multiplications/divisions by a power of 2?

I can even assume that the number of bits is fixed (the software will run on machines that always have 64-bit doubles).

P.S.: And yes, the algorithm mostly does these operations only. This is the bottleneck (it's already multithreaded).

Edit: Or am I completely mistaken, and do clever compilers already optimize this for me?


Temporary results (with Qt to measure time, overkill, but I don't care):

#include <QtCore/QCoreApplication>
#include <QtCore/QElapsedTimer>
#include <QtCore/QDebug>

#include <iostream>
#include <math.h>

using namespace std;

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    while(true)
    {
        QElapsedTimer timer;
        timer.start();

        unsigned int n = 100000000;
        volatile double d = 12.4;
        volatile double D;
        for(unsigned int i = 0; i < n; ++i)
        {
            //D = d*32;      // 200 ms
            //D = d*(1<<5);  // 200 ms
            D = ldexp(d, 5); // 6000 ms
        }

        qDebug() << "The operation took" << timer.elapsed() << "milliseconds";
    }

    return a.exec();
}

Runs suggest that D = d*(1<<5); and D = d*32; run in the same time (200 ms) whereas D = ldexp(d, 5); is much slower (6000 ms). I know that this is a micro-benchmark, and that suddenly my RAM has exploded because Chrome has decided to compute Pi behind my back every single time I run ldexp(), so this benchmark is worth nothing. But I'll keep it nevertheless.

On the other hand, I'm having trouble doing reinterpret_cast<uint64_t *> because of a qualifier violation (the volatile keyword seems to interfere).
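
A memcpy on a non-volatile copy seems to be the standards-friendly way around both the aliasing and the volatile issues (a sketch, with no zero/Inf/NaN checks):

#include <cstdint>
#include <cstring>

double d2 = 12.4;                       // non-volatile copy
uint64_t bits;
std::memcpy(&bits, &d2, sizeof bits);   // grab the representation
bits += 5ull << 52;                     // bump the exponent by 5
std::memcpy(&d2, &bits, sizeof d2);     // d2 is now 12.4 * 32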

Celery answered 11/10, 2011 at 2:2 Comment(7)
Don't assume that it's the bottleneck just because it's multithreaded. We had a multithreaded app that we found out was bottlenecked in many places different from what we expected. How accurately have you profiled?Lilas
As always, the application is never profiled enough. I mean, I used Cachegrind, and it seems I spend most of my time in a function that does mostly multiplications. It seems. But I wrote that it was the bottleneck because I'm more interested in the theoretical ideas behind the multiplication by 2 than in "petty considerations" (sure, I could optimize my SQL requests, but honestly, I'm pretty sure it'll be meaningless when compared to the multiplication stuff, and, mostly, I don't care ^^)Celery
Remember, 1<<5 is a constant, so the compiler will optimize it to d*32.Croatia
Yeah, I "know" ^^ But there's been talk about how you may obfuscate the compiler with weird stuff (basically, some operations boil down to *32 but the compiler doesn't "see" it). And it was just a line =)Celery
In general, memory access and SQL requests are much, much slower than things like multiplications in C++. I'm sorry to say this, but "petty considerations" may be precisely what is causing your bottleneck.Lilas
What else can I say other than: I used Cachegrind, and it seems I spend most of my time in a function that does mostly multiplications? Yeah, of course, the function also does additions and allocates data on the stack, but I think I have a good estimate of the situation.Celery
@Fezvez: 1) Your loop needs unrolling. 2) Cachegrind says you're mostly in some math routine? That's an alarm! Single-step the code at the assembly language level and make sure it's doing nothing more than you expect. It shouldn't be calling anything. 3) Multi-threading doesn't make code faster. It just spreads it out on more processors, at best. 4) If performance is what you actually care about, learn this technique.Slouch
8

You can pretty safely assume IEEE 754 formatting, the details of which can get pretty gnarly (esp. when you get into subnormals). In the common cases, however, this should work:

const int DOUBLE_EXP_SHIFT = 52;
const unsigned long long DOUBLE_MANT_MASK = (1ull << DOUBLE_EXP_SHIFT) - 1ull;
const unsigned long long DOUBLE_EXP_MASK = ((1ull << 63) - 1) & ~DOUBLE_MANT_MASK;
void unsafe_shl(double* d, int shift) {
    unsigned long long* i = (unsigned long long*)d;
    // Fast path: the exponent field is neither all zeroes (zero/denormal)
    // nor all ones (Inf/NaN), so the shift can be added straight into it.
    if ((*i & DOUBLE_EXP_MASK) && ((*i & DOUBLE_EXP_MASK) != DOUBLE_EXP_MASK)) {
        *i += (unsigned long long)shift << DOUBLE_EXP_SHIFT;
    } else if (*i) {
        // Denormals, Inf and NaN fall back to an ordinary multiply.
        *d *= (1 << shift);
    }
}
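
A hypothetical call site, just to show the intended use (names from the snippet above):

double x = 12.4;
unsafe_shl(&x, 5);  // x is now 12.4 * 32 == 396.8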

EDIT: After doing some timing, this method is oddly slower than the double method on my compiler and machine, even stripped to the minimum executed code:

    double ds[0x1000];
    for (int i = 0; i != 0x1000; i++)
        ds[i] = 1.2;

    clock_t t = clock();

    for (int j = 0; j != 1000000; j++)
        for (int i = 0; i != 0x1000; i++)
#if DOUBLE_SHIFT
            ds[i] *= 1 << 4;
#else
            ((unsigned int*)&ds[i])[1] += 4 << 20;
#endif

    clock_t e = clock();

    printf("%g\n", (float)(e - t) / CLOCKS_PER_SEC);

The DOUBLE_SHIFT case completes in 1.6 seconds, with an inner loop of:

movupd xmm0,xmmword ptr [ecx]  
lea    ecx,[ecx+10h]  
mulpd  xmm0,xmm1  
movupd xmmword ptr [ecx-10h],xmm0

Versus 2.4 seconds otherwise, with an inner loop of:

add dword ptr [ecx],400000h
lea ecx, [ecx+8]  

Truly unexpected!

EDIT 2: Mystery solved! One of the changes in VC11 is that it now always vectorizes floating-point loops, effectively forcing /arch:SSE2. VC10, even with /arch:SSE2, is still worse, at 3.0 seconds, with an inner loop of:

movsd xmm1,mmword ptr [esp+eax*8+38h]  
mulsd xmm1,xmm0  
movsd mmword ptr [esp+eax*8+38h],xmm1  
inc   eax

VC10 without /arch:SSE2 (even with /arch:SSE) takes 5.3 seconds... with 1/100th of the iterations! Inner loop:

fld         qword ptr [esp+eax*8+38h]  
inc         eax  
fmul        st,st(1)  
fstp        qword ptr [esp+eax*8+30h]

I knew the x87 FP stack was awful, but 500 times worse is kinda ridiculous. You probably won't see these kinds of speedups converting, e.g., matrix ops to SSE or int hacks, since this is the worst case: loading into the FP stack, doing one op, and storing from it. But it's a good example of why x87 is not the way to go for anything perf-related.

Shote answered 11/10, 2011 at 2:19 Comment(10)
I'll try to see if it's efficient!Celery
Somehow, I don't think a conditional branch is going to be faster than just doing an FP multiply.Disseminule
I tend to agree, but you can always be surprised (well, I'd like to be surprised!)Celery
@Nicol: My understanding is that floating point is pretty slow on x86 - at least getting values into the FP stack and out again is. Regardless, this is how to do it if you want to avoid the FP stack in the (very) common case.Shote
Note that using the plain old multiply has allowed the compiler to vectorise the loop - it's processing two doubles at a time. That's likely why it runs faster - and let that be a lesson to all who pass this way! ;)Cumuliform
Dear sir, you carried out the test that proves what I wanted to know (microbenchmark, I know, but that's good enough for me). Thanks!Celery
@caf: Still, I'm surprised that plain old add is 50% slower than mulpd, I even tried 64bit add in case the 32bit read, 64bit step was confusing the prefetcher, it's still roughly the same speed!Shote
@SimonBuchan: It's two addls that's 50% slower than one mulpd, though - and remember that you are using an integer execution unit for the loop variable increment too, so some of the benefit comes from using a separate execution unit.Cumuliform
@caf: I guess I'm just used to FP being 20x slower than integral :)Shote
x87 FP shouldn't be that slow. Maybe there's an MMX instruction in there somewhere, without an emms? Re: the speed of int add vs. vector SSE2: The loop is going to be bottlenecked by store-port throughput. This is why operating on 16B at a time wins. You could do an int add with SSE. paddq or paddd (can run on port1/port5, 1 cycle latency, 0.5c recip throughput)Snowber
21

This is one of those highly application-specific things. It may help in some cases and not in others. (In the vast majority of cases, a straightforward multiplication is still best.)

The "intuitive" way of doing this is just to extract the bits into a 64-bit integer and add the shift value directly into the exponent. (This will work as long as you don't hit NaN or Inf.)

So something like this:

#include <stdint.h>

union {
    uint64_t i;
    double f;
};

f = 123.;
i += 0x0010000000000000ull;  // add 1 to the exponent field: f *= 2

//  Check for zero. And if it matters, denormals as well.

Note that this code is not C-compliant in any way, and is shown just to illustrate the idea. Any attempt to implement this should be done directly in assembly or SSE intrinsics.
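
For illustration only, a minimal SSE2-intrinsics sketch of the same idea might look like this (the helper name mul_pow2 is invented here, and it deliberately skips the zero/Inf/NaN/denormal checks):

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>

static inline __m128d mul_pow2(__m128d x, int shift) {
    __m128i bits = _mm_castpd_si128(x);                    // reinterpret; no data movement
    __m128i bump = _mm_set1_epi64x((int64_t)shift << 52);  // shift lands in the exponent field
    return _mm_castsi128_pd(_mm_add_epi64(bits, bump));    // integer add on the FP bit patterns
}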

However, in most cases the overhead of moving the data from the FP unit to the integer unit (and back) will cost much more than just doing the multiplication outright. This was especially the case in the pre-SSE era, where the value needed to be stored from the x87 FPU into memory and then read back into the integer registers.

In the SSE era, integer SSE and FP SSE use the same ISA registers (though they still have separate register files). According to Agner Fog, there's a 1 to 2 cycle penalty for moving data between the integer SSE and FP SSE execution units. So the cost is much better than in the x87 era, but it's still there.

All in all, it will depend on what else you have in your pipeline. But in most cases, multiplying will still be faster. I've run into this exact same problem before, so I'm speaking from first-hand experience.

Now, with 256-bit AVX instructions that only support FP operations, there's even less of an incentive to play tricks like this.

Maratha answered 11/10, 2011 at 2:11 Comment(13)
Also, check for 0s (and subnormals, more generally). I should mention, if you're batch processing stuff, it's likely to be in main memory from the start, so it's already stupid slow.Shote
@Simon Buchan: I was just about to add that. Good catch for pointing that out.Maratha
Re "this will work as long as ...": it's not guaranteed to work at all. It may work for a given implementation but the standard explicitly states that this (setting one type of a union and using a different type to read it back) is not mandated. No downvote for this one since we're already into optimisation/non-standard-safe behaviour by the looks of it.Anaya
@paxdiablo: Correct. We're already way beyond the standard. I threw this example out just to demonstrate the idea. It's more practically done using SSE registers.Maratha
@paxdiablo: It works when you know the machine's floating point representation. So long as you know this won't be run on VAXs (maybe some older IBM mainframes too), you know that will be IEEE 754.Shote
@Simon, you actually have to know the representations of the double and the integer (there are trap representations and other such things allowed for by the standard). You also have to know that your compiler won't assume you're following the rules about writing an item as one type and reading it back as another. However, as already stated (eloquently by Mystical), we're probably well beyond the point of compliance :-) See stackoverflow.com/questions/7687082/… for the standards sections detailing this.Anaya
Wow, thanks! Though, ermmm, SSE? Streaming SIMD Extensions? ISA? International Society of Arboriculture? FP? These are the three acronyms I feel I need the most to understand your post =)Celery
SSE = Streaming SIMD Extensions, ISA = Instruction Set Architecture, FP = Floating-Point, AVX = Advanced Vector ExtensionsMaratha
@paxdiablo: Sure, aliasing rules in particular are a pain - I'm just against the somewhat common idea that it's somehow immoral to use non-portable code (obviously where it makes sense).Shote
Agreed, @Simon, and it probably makes sense here, hence why I didn't downvote. Personally, though I have neither the hard evidence to back this up nor the energy to look into it at the moment :-), I suspect any optimisation here will be a waste of time since the compiler would probably do a better job. But, provided you measure the suggestions, this one's as good as any other and at least it gives some caveats and options. So I'll +1 it for that.Anaya
Why is this code not C compliant? I thought unions are one of the safe ways to manipulate values which alias the same memory address?Abstemious
I guess you mean that double is not guaranteed to be 64-bit IEEE 754 in C?Abstemious
@Zboson It was actually more about giving into peer pressure to people who had no idea that union type-punning is allowed in C99. But the IEEE 754 part still applies.Maratha
M
11

How about ldexp?

Any half-decent compiler will generate optimal code on your platform.
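
For reference, the call itself is trivial (a quick sketch; ldexp(x, n) computes x * 2^n):

#include <cmath>

double x = 12.4;
double y = std::ldexp(x, 5);   // y == x * 2^5 == x * 32
double z = std::ldexp(x, -3);  // z == x / 8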

But as @Clinton points out, simply writing it in the "obvious" way should do just as well. Multiplying and dividing by powers of two is child's play for a modern compiler.

Directly munging the floating point representation, besides being non-portable, will almost certainly be no faster (and might well be slower).

And of course, you should not waste time even thinking about this question unless your profiling tool tells you to. But the kind of people who listen to this advice will never need it, and the ones who need it will never listen.

[update]

OK, so I just tried ldexp with g++ 4.5.2. The cmath header inlines it as a call to __builtin_ldexp, which in turn...

...emits a call to the libm ldexp function. I would have thought this builtin would be trivial to optimize, but I guess the GCC developers never got around to it.

So, multiplying by 1 << p is probably your best bet, as you have discovered.

Monies answered 11/10, 2011 at 2:28 Comment(2)
VC does the same for most floating point operations - I believe it's so it can respect precision control (_control87(), _controlfp(), etc...). Try fiddling with the floating-point precision compiler switches...Shote
ldexp is 6 times slower than the regular x*pow(2,exp). Source: benchmarked on an Intel Xeon.Cosignatory
6

The fastest way to do this is probably:

x *= (1 << p);

This sort of thing may simply be done by calling a machine instruction to add p to the exponent. Telling the compiler to instead extract some bits with a mask and manipulate them manually will probably make things slower, not faster.

Remember, C/C++ is not assembly language. Using a bitshift operator does not necessarily compile to a bitshift assembly operation, nor does using multiplication necessarily compile to multiplication. There are all sorts of weird and wonderful things going on, like what registers are being used and what instructions can be run simultaneously, which I'm not smart enough to understand. But your compiler, with many man-years of knowledge and experience and lots of computational power, is much better at making these judgements.

P.S. Keep in mind that if your doubles are in an array or some other flat data structure, your compiler might be really smart and use SSE to multiply 2, or even 4, doubles at the same time. However, doing a lot of bit shifting is probably going to confuse your compiler and prevent this optimisation.
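
For instance, a plain power-of-two multiply in a loop (a sketch; the function and parameter names are invented here) is exactly the shape the vectorizer likes:

void scale_by_32(double* a, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
        a[i] *= 32.0;  // with SSE2, typically becomes mulpd on 2 doubles per iteration
}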

Belton answered 11/10, 2011 at 2:19 Comment(3)
I am unaware of any architecture with a "machine instruction to add p to the exponent". But maybe there should be one.Redding
@masterxilo: x87 fscale does exactly that, but only for a double input, not integer. It does x * (1<<trunc(y)), or x_exponent += trunc(y). It's not fast though: much slower than fmul, like 20 to 32 clock cycles on classic P5 Pentium vs. 3 cycles for a fmul, and it's not much better on modern x86. (agner.org/optimize). So fmul with a constant of 0.5 is much better.Snowber
But as this and Mysticial's answer points out, modern SIMD ISAs typically use the same registers for FP and vector-integer. It is possible to use an instruction like paddd xmm0, xmm1 to do an integer add on FP bit patterns between instructions like mulps xmm0, xmm2. Of course this doesn't handle the case where the exponent goes out of range, where FP multiply would give you infinity, or correct underflow to a subnormal. (Or wrapping the biased-exponent from 0 (subnormal) to all-ones (NaN or infinity depending on significand.)Snowber
2

Since C++17 you can also use hexadecimal floating-point literals. That way you can multiply by higher powers of 2. For instance:

d *= 0x1p64;

will multiply d by 2^64. I use it in my fast integer arithmetic implementation when converting to double.
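
A quick sketch of the literal in both directions (values picked arbitrarily):

double d = 3.0;
d *= 0x1p64;   // multiply by 2^64
d *= 0x1p-70;  // divide by 2^70: negative binary exponents work too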

Kirbie answered 29/4, 2020 at 9:17 Comment(0)
1

What other operations does this algorithm require? You might be able to break your floats into int pairs (sign/mantissa and magnitude), do your processing, and reconstitute them at the end.
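
One portable way to do the split and reassembly is frexp/ldexp; a sketch of that idea (not necessarily what this answer had in mind):

#include <cmath>

int e;
double m = std::frexp(12.4, &e);  // 12.4 == m * 2^e, with m in [0.5, 1)
double r = std::ldexp(m, e + 5);  // reassemble with the exponent bumped: 12.4 * 32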

Nepos answered 11/10, 2011 at 2:11 Comment(1)
Erm, well, I do a bit of stuff here and there (matrix multiplication, etc...) I guess that might be a nice idea, but I think that'll be a load of work (redefining +, -, *, ...)Celery
1

Multiplying by 2 can be replaced by an addition: x *= 2 is equivalent to x += x.

Division by 2 can be replaced by multiplication by 0.5. Multiplication is usually significantly faster than division.
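
In code (a trivial sketch; every power of two has an exactly representable reciprocal, so these multiplies introduce no rounding error):

x += x;        // x *= 2
x *= 0.5;      // x /= 2
x *= 0.03125;  // x /= 32, since 0.03125 == 2^-5 exactly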

Beckman answered 11/10, 2011 at 2:48 Comment(5)
That is totally true, but it becomes intractable as soon as I want to do something like x *= 33554432Celery
@Fezvez, that multiplication is likely to be done by the floating point unit faster than any optimization you can come up with.Beckman
Well, it's just adding 25 to the exponent, so I guess there's some sense behind my questions =)Celery
@Fezvez, don't underestimate the speed of the modern multiply instruction. If you doubt me, measure and see.Beckman
I mean, the whole point of this post is to check whether there's a faster way to do some very specific kind of multiplications. I'm not saying there exists a faster way, but I find it legitimate to ask.Celery
1

Although there is little or no practical benefit to treating powers of two specially for float or double types, there is a case for this with double-double types. Double-double multiplication and division are complicated in general, but are trivial for multiplying and dividing by a power of two.

E.g. for

typedef struct {double hi; double lo;} doubledouble;
doubledouble x;
x.hi*=2, x.lo*=2; //multiply x by 2
x.hi/=2, x.lo/=2; //divide x by 2

In fact I have overloaded << and >> for doubledouble so that it's analogous to integers.

//x is a doubledouble type
x << 2 // multiply x by four;
x >> 3 // divide x by eight.
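
A sketch of what such an overload might look like (the implementation isn't shown in this answer; this version assumes p < 53 so that 1LL << p converts to double exactly):

doubledouble operator<<(doubledouble x, int p) {
    const double f = (double)(1LL << p);
    x.hi *= f;
    x.lo *= f;
    return x;
}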
Abstemious answered 26/5, 2015 at 8:50 Comment(0)
0

Depending on what you're multiplying, if you have data that recurs enough, a lookup table might provide better performance, at the expense of memory.

Daves answered 21/3, 2016 at 22:18 Comment(1)
I am not sure a lookup can actually be faster than a multiplication. I understand the use of lookup tables for trigonometric functions, for example, but for multiplication?Celery
