Most accurate way to do a combined multiply-and-divide operation in 64-bit?

Asked 4/1, 2012 at 19:59 Answered 11/6, 2022 at 20:5

What is the most accurate way I can do a multiply-and-divide operation for 64-bit integers that works in both 32-bit and 64-bit programs (in Visual C++)? (In case of overflow, I need the result mod 2⁶⁴.)

(I'm looking for something like MulDiv64, except that this one uses inline assembly, which only works in 32-bit programs.)

Obviously, casting to double and back is possible, but I'm wondering if there's a more accurate way that isn't too complicated. (i.e. I'm not looking for arbitrary-precision arithmetic libraries here!)

Bouncy answered 4/1, 2012 at 19:59 Comment(25)

Have you considered the long long type? – Feverfew 4/1, 2012 at 20:12

@MikeNakis, the intermediate result of the multiply requires 128 bits so a long long doesn't work. – Archaean 4/1, 2012 at 20:18

wh00ps, I did not realize that long long is only 64 bits! I thought they would have made it 128 bits! I just looked it up. Pity. Well then your next option might be wonderful inline assembly! – Feverfew 4/1, 2012 at 20:23

Doesn't MSVC have a uint128_t in its stdint.h? – Blague 4/1, 2012 at 21:6

Then write this function and build it with a better compiler (off the top of my head, GCC and Clang have uint128_t), and link the resulting binary to your program. – Blague 4/1, 2012 at 21:21

Are you dividing by a variable? Or are you dividing by a compile-time constant? – Communicant 4/1, 2012 at 21:24

Then that makes it a LOT harder... :( The only idea I have left is to basically implement a small arbitrary precision library... Another idea to bounce around is to use the floating-point method to get 53-bits of accuracy. Then run one iteration of Newton's Method (which uses only multiplications and additions) on 64/128-bit arithmetic. – Communicant 4/1, 2012 at 21:35

Curious, have you tried the code in straight c++ and looking at the generated assembly with optimizations turned on? – Lebna 4/1, 2012 at 21:38

Related: #8453646 – Shifrah 4/1, 2012 at 22:30

Related: https://mcmap.net/q/21197/-representing-128-bit-numbers-in-c/293511 – Overdress 5/1, 2012 at 0:30

I don't know what the answer is but I know it is in Knuth's Seminumerical Algorithms. – Iselaisenberg 5/1, 2012 at 1:33

Related: eggheadcafe.com/microsoft/VC-Language/30133426/… – Waterless 5/1, 2012 at 6:47

@CAFxX: On Microsoft compilers, the type is called __int128. – Waterless 5/1, 2012 at 6:51

@BenVoigt It isn't implemented though - even on x64: error C4235: nonstandard extension used : '__int128' keyword not supported on this architecture – Communicant 5/1, 2012 at 6:56

@Mystical: What version of VC++? I saw some references saying that newer versions have it. – Waterless 5/1, 2012 at 8:19

Some information about __int128 - not sure if it's still up to date, though: connect.microsoft.com/VisualStudio/feedback/details/529171/… – Craggy 5/1, 2012 at 8:28

@BenVoigt I didn't get your response cause you left out the i in my UN. I'm using VS2010 SP1. – Communicant 5/1, 2012 at 9:2

@MichaelBurr: I tested it and VS2010 SP1 still doesn't support it in either x86 or x64 toolchains. Voted for the connect issue. – Waterless 5/1, 2012 at 15:44

Just wondering, how important is the overflow case of where the result should be mod 2^64? I think my answer meets all your criteria (with perfect integer truncation) except for the overflow case. As of right now, I can't find a clean way to deal with the overflow case. (And actually the 64-bit div and idiv instructions will throw an exception if the quotient is >64-bits.) So there may not be a clean way to do this even with access to inline assembly. – Communicant 6/1, 2012 at 5:25

@Mysticial: It's actually not too important for my program what happens in that case, as long as the result is otherwise predictable/sensible (e.g. saturation would be fine too)... though I would guess that other people who want this would probably want the answer to be mod 64. However, working on a 32-bit program definitely is important, because the numbers are all 64-bit. So while your answer is great for a 64-bit program, I can't use it because I need to use this for a 32-bit system as well. – Bouncy 6/1, 2012 at 5:35

I see, it would be much more complicated in the 32-bit case since those intrinsics are all x64. Most of them are easy to implement manually. The only tricky one is _umul128(), which would have to be done using the long-multiplication suggestions here. I suppose a work-around would be to combine it with the implementation you linked to. – Communicant 6/1, 2012 at 5:41

@Mysticial: Oh wow, good point about combining them... funny, I just tried out MulDiv64(0x1203785013274012, 0x1cef95b58432904f, 0x53) on both implementations, and neither gave back Wolfram Alpha's answer of 0x05DC4FD990C5C443 ... and Microsoft's Power Calculator gave back 0x647AC843C8E79298B21B29FB808EF.347 for the same thing, but without the mod... which should result in 0x298B21B29FB808EF after truncation. I feel like I'm missing something. Ideas? – Bouncy 6/1, 2012 at 6:7

For that one, the quotient overflows 64-bits. My current answer does something unpredictable in that case. I can fix that fairly easily though. EDIT: Did you flip the order of the parameters? The answer to that is 32607500402250344521899981391005935 EDIT2: That answer is the IEEE double-precision representation of the result, not an integer. – Communicant 6/1, 2012 at 6:9

@Mysticial: Oooooh yeah I probably flipped them. I assumed first * second / third... probably wasn't correct. Still kinda weird that Alpha and Power Calculator give different answers, though... EDIT: Yeah the IEEE representation kinda baffled me, not sure why they gave me back a floating-point number in its hexadecimal representation. I'll use PowerCalc then. – Bouncy 6/1, 2012 at 6:13

first * second / third is correct. a * b / c in my answer. – Communicant 6/1, 2012 at 6:17

Since this is tagged Visual C++ I'll give a solution that abuses MSVC-specific intrinsics.

This example is fairly complicated. It's a highly simplified version of the same algorithm that is used by GMP and java.math.BigInteger for large division.

Although I have a simpler algorithm in mind, it's probably about 30x slower.

This solution has the following constraints/behavior:

It requires x64. It will not compile on x86.
The quotient is not zero.
The quotient saturates if it overflows 64-bits.

Note that this is for the unsigned integer case. It's trivial to build a wrapper around this to make it work for signed cases as well. This example should also produce correctly truncated results.

This code is not fully tested. However, it has passed all the tests cases that I've thrown at it.
(Even cases that I've intentionally constructed to try to break the algorithm.)

#include <intrin.h>

uint64_t muldiv2(uint64_t a, uint64_t b, uint64_t c){
    //  Normalize divisor
    unsigned long shift;
    _BitScanReverse64(&shift,c);
    shift = 63 - shift;

    c <<= shift;

    //  Multiply
    a = _umul128(a,b,&b);
    if (((b << shift) >> shift) != b){
        cout << "Overflow" << endl;
        return 0xffffffffffffffff;
    }
    b = __shiftleft128(a,b,shift);
    a <<= shift;


    uint32_t div;
    uint32_t q0,q1;
    uint64_t t0,t1;

    //  1st Reduction
    div = (uint32_t)(c >> 32);
    t0 = b / div;
    if (t0 > 0xffffffff)
        t0 = 0xffffffff;
    q1 = (uint32_t)t0;
    while (1){
        t0 = _umul128(c,(uint64_t)q1 << 32,&t1);
        if (t1 < b || (t1 == b && t0 <= a))
            break;
        q1--;
//        cout << "correction 0" << endl;
    }
    b -= t1;
    if (t0 > a) b--;
    a -= t0;

    if (b > 0xffffffff){
        cout << "Overflow" << endl;
        return 0xffffffffffffffff;
    }

    //  2nd reduction
    t0 = ((b << 32) | (a >> 32)) / div;
    if (t0 > 0xffffffff)
        t0 = 0xffffffff;
    q0 = (uint32_t)t0;

    while (1){
        t0 = _umul128(c,q0,&t1);
        if (t1 < b || (t1 == b && t0 <= a))
            break;
        q0--;
//        cout << "correction 1" << endl;
    }

//    //  (a - t0) gives the modulus.
//    a -= t0;

    return ((uint64_t)q1 << 32) | q0;
}

Note that if you don't need a perfectly truncated result, you can remove the last loop completely. If you do this, the answer will be no more than 2 larger than the correct quotient.

Test Cases:

cout << muldiv2(4984198405165151231,6132198419878046132,9156498145135109843) << endl;
cout << muldiv2(11540173641653250113, 10150593219136339683, 13592284235543989460) << endl;
cout << muldiv2(449033535071450778, 3155170653582908051, 4945421831474875872) << endl;
cout << muldiv2(303601908757, 829267376026, 659820219978) << endl;
cout << muldiv2(449033535071450778, 829267376026, 659820219978) << endl;
cout << muldiv2(1234568, 829267376026, 1) << endl;
cout << muldiv2(6991754535226557229, 7798003721120799096, 4923601287520449332) << endl;
cout << muldiv2(9223372036854775808, 2147483648, 18446744073709551615) << endl;
cout << muldiv2(9223372032559808512, 9223372036854775807, 9223372036854775807) << endl;
cout << muldiv2(9223372032559808512, 9223372036854775807, 12) << endl;
cout << muldiv2(18446744073709551615, 18446744073709551615, 9223372036854775808) << endl;

Output:

3337967539561099935
8618095846487663363
286482625873293138
381569328444
564348969767547451
1023786965885666768
11073546515850664288
1073741824
9223372032559808512
Overflow
18446744073709551615
Overflow
18446744073709551615

Communicant answered 5/1, 2012 at 8:51 Comment(2)

Modified to saturate on overflow. Getting the full mod by 2^64 behavior on the quotient is not trivial at all since it requires 128-bit x 64-bit multiplies. – Communicant 6/1, 2012 at 6:36

Seems to work pretty darn well, and indeed it's better than using double: muldiv2(0x1203785013274012, 0x1234, 0x1238) == 0x11ff83d891999ab0; according to the calculator, 0x1203785013274012 * 0x1234 / 0x1238 == 0x11FF83D891999AB0.F112... so it's completely correct. Using double yields: (unsigned long long)(0x1203785013274012 * (double)0x1234 / 0x1238) == 0x11ff83d891999b08 which is inaccurate. Nice; thanks a lot! :) (For 32-bit I used version from the CodeProject site.) – Bouncy 7/1, 2012 at 16:18

You just need 64 bits integers. There are some redundant operations but that allows to use 10 as base and step in the debugger.

uint64_t const base = 1ULL<<32;
uint64_t const maxdiv = (base-1)*base + (base-1);

uint64_t multdiv(uint64_t a, uint64_t b, uint64_t c)
{
    // First get the easy thing
    uint64_t res = (a/c) * b + (a%c) * (b/c);
    a %= c;
    b %= c;
    // Are we done?
    if (a == 0 || b == 0)
        return res;
    // Is it easy to compute what remain to be added?
    if (c < base)
        return res + (a*b/c);
    // Now 0 < a < c, 0 < b < c, c >= 1ULL
    // Normalize
    uint64_t norm = maxdiv/c;
    c *= norm;
    a *= norm;
    // split into 2 digits
    uint64_t ah = a / base, al = a % base;
    uint64_t bh = b / base, bl = b % base;
    uint64_t ch = c / base, cl = c % base;
    // compute the product
    uint64_t p0 = al*bl;
    uint64_t p1 = p0 / base + al*bh;
    p0 %= base;
    uint64_t p2 = p1 / base + ah*bh;
    p1 = (p1 % base) + ah * bl;
    p2 += p1 / base;
    p1 %= base;
    // p2 holds 2 digits, p1 and p0 one

    // first digit is easy, not null only in case of overflow
    uint64_t q2 = p2 / c;
    p2 = p2 % c;

    // second digit, estimate
    uint64_t q1 = p2 / ch;
    // and now adjust
    uint64_t rhat = p2 % ch;
    // the loop can be unrolled, it will be executed at most twice for
    // even bases -- three times for odd one -- due to the normalisation above
    while (q1 >= base || (rhat < base && q1*cl > rhat*base+p1)) {
        q1--;
        rhat += ch;
    }
    // subtract 
    p1 = ((p2 % base) * base + p1) - q1 * cl;
    p2 = (p2 / base * base + p1 / base) - q1 * ch;
    p1 = p1 % base + (p2 % base) * base;

    // now p1 hold 2 digits, p0 one and p2 is to be ignored
    uint64_t q0 = p1 / ch;
    rhat = p1 % ch;
    while (q0 >= base || (rhat < base && q0*cl > rhat*base+p0)) {
        q0--;
        rhat += ch;
    }
    // we don't need to do the subtraction (needed only to get the remainder,
    // in which case we have to divide it by norm)
    return res + q0 + q1 * base; // + q2 *base*base
}

Obadiah answered 6/1, 2012 at 11:52 Comment(1)

Looking for similar approach for signed types (no unsigned types in java unfortunately). Any ideas? – Utmost 28/9, 2018 at 15:15

This is a community wiki answer, since it's really just a bunch of pointers to other papers/references (I'm unable to post relevant code).

The multiplication of two 64-bit ints to a 128 bit result is pretty easy using a straightforward application of pencil & paper technique everyone learns in grade school.

GregS's comment is correct: Knuth covers division in "The Art of Computer Programming, Second Edition, Volume 2/Seminumerical Algorithms" at the end of Section 4.3.1 Multiple Precision Arithmetic/The Classical Algorithms (pages 255 - 265 in my copy). It's not an easy read, at least not for someone like me who's forgotten most mathematics beyond 7th grade Algebra. Just prior, Knuth covers the multiplication side of things, too.

Some other options for ideas (these notes are for division algorithms, but most also discuss multiplication):

Jack Crenshaw covers the Knuth division algorithms in a more readable manner in a series of articles from Embedded System Programming magazine 1997 (unfortunately, my notes don't have the exact issues). Sadly, articles from old ESP issues are not easy to find online. If you have access to a University library, maybe some back issues or a copy of the ESP CD-ROM Library is available to you.
Thomas Rodeheffer of Microsoft research has a paper on Software Integer Division: http://research.microsoft.com/pubs/70645/tr-2008-141.pdf
Karl Hasselström's paper on "Fast Division of Large Integers": http://www.treskal.com/kalle/exjobb/original-report.pdf or http://treskal.squarespace.com/s/masters-thesis.pdf
Randall Hyde's "Art of Assembly Language" (http://webster.cs.ucr.edu/AoA/Windows/HTML/AoATOC.html), specifically Volume Four Section 4.2.5 (Extended Precision Division): http://webster.cs.ucr.edu/AoA/Windows/HTML/AdvancedArithmetica2.html#998729 this is in Hyde's variant of x86 assembly language, but there's also some pseudocode and enough explanation to port the algorithm to C. It's slow, too - performing the division bit-by-bit...

Craggy answered 4/1, 2012 at 19:59 Comment(0)

Since this is tagged Visual C++ you can use the newly available intrinsics:

Mul: _mul128(), _umul128()
Div: _div128(), _udiv128()

uint64_t muldiv_u64(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t highProduct;
    uint64_t lowProduct = _umul128(a, b, &highProduct);
    uint64_t remainder;
    return _udiv128(highProduct, lowProduct, c, &remainder);
}

If you need signed mul-div then just use the version without u

See also

Sandman answered 28/10, 2020 at 9:58 Comment(1)

However, this does not match the question's requirement "In case of overflow, I need the result mod 2⁶⁴" – Archimedes 9/1, 2021 at 21:19

You don't need arbitrary precision arithmetic for that. You only need 128-bit arithmetic. I.e. you need 64*64=128 multiplication and 128/64=64 division (with proper overflow behavior). This is not that difficult to implement manually.

Thanksgiving answered 5/1, 2012 at 1:19 Comment(7)

Is the margin here too small to write down an implementation? (jk) – Bouncy 5/1, 2012 at 2:52

The problem is that there is 128/64=64 primitive on amd64 architecture, but there is no corresponding compiler intrinsic for it. As such, without assembly, you have to build it using 64/64->64 and the result is not as efficient. – Shifrah 5/1, 2012 at 10:0

you don't need arbitrary precision arithmetic, but it is pretty simple and error proof to use a library for arbitrary precision, which is very likely to perform better than any ad-hoc code you may write too, so much that I would avoid any custom code which is longer than 5-6 lines and fall back to the library. – Sofar 8/8, 2014 at 10:3

@pqnet: Well, there are different levels of ad-hoc... The difference between fixed-precision code and arbitrary-precision code is qualitative, meaning that just a regular implementation of 128-bit arithmetic ("regular" meaning "straightforward", "no ultra-clever fast math tricks") will noticeably outperform any arbitrary-precision library. In my code I use self-implemented and quite simple 128-bit and 256-bit arithmetic support and it does easily outperform any "big-number" library we tried so far. – Thanksgiving 8/8, 2014 at 16:42

@AndreyT there is no "regular" implementation of division, only complex tricks like the ones shown in many of the answer here. Also, the arbitrary precision library will likely have a good performance if there is a native support for 128 bits operations – Sofar 8/8, 2014 at 16:56

@pqnet: It depends on what you see as "regular". In my implementation I simply use classic long division we all know from school (of course, I do it in binary). And yes, in my implementation I do rely on 128-bit math support from the compiler (if available) as well as on compiler built-ins (if available). – Thanksgiving 8/8, 2014 at 17:13

@AndreyT but your implementation is not so simple that you could write it down in an answer here, so it may be worth thinking about using a library instead. – Sofar 8/8, 2014 at 17:17

Do you have the COMP type (x87-based 64-bit integer type) at your disposal in VC++? I've used it occasionally in Delphi in the past when I needed 64-bit integer math. For years, it was way faster than the library-based 64-bit integer math - certainly when division was involved.

In Delphi 2007 (the latest I have installed - 32-bits), I would implement MulDiv64 like this:

function MulDiv64(const a1, a2, a3: int64): int64;
var
  c1: comp absolute a1;
  c2: comp absolute a2;
  c3: comp absolute a3;
  r: comp absolute result;
begin
  r := c1*c2/c3;
end;

(Those weird absolute statements layer the comp variables on top of their 64-bit integer counter parts. I would have used simple type casts except the Delphi compiler gets confused about that - probably because the Delphi language (or whatever they call it now) has no clear syntactic distinction between type casting (reinterpret) and value type conversion.)

Anyway, Delphi 2007 renders the above as follows:

0046129C 55               push ebp
0046129D 8BEC             mov ebp,esp
0046129F 83C4F8           add esp,-$08

004612A2 DF6D18           fild qword ptr [ebp+$18]
004612A5 DF6D10           fild qword ptr [ebp+$10]
004612A8 DEC9             fmulp st(1)
004612AA DF6D08           fild qword ptr [ebp+$08]
004612AD DEF9             fdivp st(1)
004612AF DF7DF8           fistp qword ptr [ebp-$08]
004612B2 9B               wait 

004612B3 8B45F8           mov eax,[ebp-$08]
004612B6 8B55FC           mov edx,[ebp-$04]
004612B9 59               pop ecx
004612BA 59               pop ecx
004612BB 5D               pop ebp
004612BC C21800           ret $0018

The following statement yields 256204778801521550, which appears to be correct.

writeln(MulDiv64($aaaaaaaaaaaaaaa, $555555555555555, $1000000000000000));

If you want to implement this as VC++ inline-assembly, it's possible that you would need to do some tweaking of the default rounding flags to accomplish the same thing, I don't know - I haven't had the need to find out - yet :)

Strict answered 5/1, 2012 at 2:23 Comment(13)

X87 support has generally been poor in a lot of C compilers, probably because a lot of printf-related code assumes double and long double are synonymous and will break if they aren't. It really irks me that the extended-precision FP type has gotten a bad reputation, when it's a faster type to work with than a 64-bit double even on machines with no FPU. – Tuneful 5/3, 2015 at 5:25

@Tuneful Are you sure 80 bit long doubles are faster than 64 bit doubles when using the SSE instructions for 64 bit? Because I'm quite sure they're not. – Haywood 12/4, 2019 at 15:8

@PaulGroke: This answer (above) is 7+ years old by now - things have no doubt changed significantly since then. – Strict 12/4, 2019 at 19:59

@PaulGroke: The bad reputation of the extended-precision type goes back to a time when many machines either had an FPU that could handle it as fast as a 64-bit double or no FPU at all (which meant that they could handle it faster than a 64-bit double). Hardware support for the type was effectively dropped as a consequence of that. If extended-precision types hadn't become unpopular before SSE was invented, SSE could have been designed to support them efficiently, bearing in mind that few applications would require loading or storing large numbers of extended-precision values... – Tuneful 15/4, 2019 at 15:16

...so the performance of vector load/store operations on such values wouldn't be overly important. What would be useful would be to have registers that could contain e.g. either four 80-bit values and eight 40-bit values, along with instructions that would load and store in either packed/rounded form 64/32-bit forms, or padded 128/64 bit forms, with the latter being used mainly for context save/restore/spill code. – Tuneful 15/4, 2019 at 15:21

When loading or storing large numbers of values, having the type be a small power-of-two size is important. When loading or storing a smaller number, doubling the size isn't really a problem. And when keeping them in registers, there's nothing magical about powers of two. – Tuneful 15/4, 2019 at 15:25

@Tuneful Doesn't really matter why, but it's a fact that the 64 bit type is faster on modern CPUs (and that includes the ones that were current in 2012). Simply because they have more computation units for the 64 bit type than for the 80 bit type. Also it's not true that the 80 bit type was just as fast on the older CPU generations, at least not if you also count division and actually want an accurate result. (The division/sqrt precision of the x87 unit can be configured, and it's faster if you limit the precision for what's needed for the 64 bit type.) – Haywood 21/4, 2019 at 22:48

@500-InternalServerError Things actually haven't changed that much since 2012 in that regard. Also it doesn't even matter if it was true then. Even if it were, it's not true now, and hence it should be corrected. – Haywood 21/4, 2019 at 22:52

@PaulGroke: Floating-point division has always involved trade-offs between precision and performance, but I think that on a platform with fast extended-precision multiplication and addition along with an "approximate reciprocal" function, using extended precision multiplies and additions to refine an approximate reciprocal for the divisor and then multiply by the dividend could yield a result that's within double-precision ulp faster than would be practical using double-precision math alone. – Tuneful 22/4, 2019 at 14:57

@Tuneful Yes, for constant divisors that would probably work. But doesn't mean that extended precision is (or better: was, on pre-SSE2 machines) in general as fast as double precision. – Haywood 23/4, 2019 at 18:4

@PaulGroke: For non-constant divisors, given an approximation of the reciprocal, each iteration of Newton's method can double the number of significant figures. With fast extended-precision addition and multiplication hardware and a decent initial guess, computation of a reciprocal that's accurate to better than one part in 2**53 will be much faster than a double-precision division. – Tuneful 23/4, 2019 at 18:40

@Tuneful OK, now I understand what you meant :) Anyway, that doesn't really have to do with double precision vs. extended precision, has it?. Also there is the problem of the decent initial guess. If this was a good, exact general purpose solution, I bet someone at AMD/Intel would know how to do it in hardware/firmware. Or maybe they have already done it. Rough estimation: we may need up to 8 rounds and each round will probably be 3-5 cycles (latency, not throughput!). Which means up to 24~40 cycles. Which AFAIK is about what a double precision division takes on modern CPUs. – Haywood 23/4, 2019 at 19:22

@PaulGroke: From what I understand, a significant number of floating-point architectures include an "approximate reciprocal" function. Outside of weird scenarios involving denormals, a 12-bit approximation can be produced easily with a 4096x12 lookup table. The benefit of extended-precision types is that it's often much easier to design an algorithm that produces a result that's within 256ulp of of the correct value in the type used, than one which is always within 1ulp, and rounding an extended precision value that's within 256ulp of being correct will yield a `double within 9/16ulp. – Tuneful 23/4, 2019 at 20:15

If you only need to support Windows 7 and newer, one good way is this:

#include <mfapi.h>
#include <assert.h>
#pragma comment( lib, "mfplat.lib" )

uint64_t mulDiv64( uint64_t a, uint64_t b, uint64_t c )
{
    assert( a <= LLONG_MAX && b <= LLONG_MAX && c <= LLONG_MAX );
    // https://learn.microsoft.com/en-us/windows/desktop/api/Mfapi/nf-mfapi-mfllmuldiv
    return (uint64_t)MFllMulDiv( (__int64)a, (__int64)b, (__int64)c, (__int64)c / 2 );
}

This method is much simpler than what’s in other answers here, it rounds the result instead of truncating, and it works on all Windows platforms including ARM.

Hendecahedron answered 13/9, 2018 at 21:21 Comment(1)

Interesting but I doubt this matches the question's requirement "In case of overflow, I need the result mod 2⁶⁴" (which can occur for low c). – Archimedes 9/1, 2021 at 21:34

Below presented muldiv64.asm source for Windows-x86-64 Visual Studio. More details about compiling/testing/exceptions available in https://github.com/LF994/MulDiv64

include listing.inc

_TEXT   SEGMENT
  PUBLIC asmMulDiv64

asmMulDiv64 PROC
  mov  rax,rdx    ; rax <- b
  mul  rcx        ; 128 bits of rax*rcx placed into rdx+rax (a was in rcx) 
  div  r8         ; [rdx,rax] / r8: result->rax, reminder->rdx (c was in r8)
; Quotient already in rax, place here "mov rax,rdx" if you want to return remainder instead.
  ret  0          ; 0: do not remove parameters from stack
asmMulDiv64 ENDP

_TEXT   ENDS
END

Lasley answered 11/6, 2022 at 20:5 Comment(0)

Well you can chop up your 64 bit operands into 32 bit chunks (low and high part). Than do the operation on them that you want. All intermediate results will be less than 64 bit and can therefore be stored in the data types you have.

Flageolet answered 5/1, 2012 at 1:12 Comment(1)

Chunking a multiply into parts is easy, but I've never seen a clean way to do divide. – Archaean 5/1, 2012 at 1:20

For 64-bit mode code you can implement 64*64=128 multiplication similarly to the implementation of 128/64=64:64 division here.

For 32-bit code it'll be more complex because there's no CPU instruction that would do multiplication or division of such long operands in 32-bit mode and you'll have to combine several smaller multiplications into a bigger one and reimplement long division.

You may use the code from this answer as a framework for building long division.

Of course, if your dividers are always less than 2³² (or better yet 2¹⁶), you can do much faster division in 32-bit mode by chaining several divisions (divide the most significant 64 (or 32) bits of the dividend by the 32-bit (or 16-bit) divisor, then combine the remainder from this division with the next 64 (32) bits of the dividend and divide that by the divisor and keep doing that until you use up the entire dividend). Also, if the divisor is big but can be factored into sufficiently small numbers, dividing by its factors using this chained division will be better than the classical looping solution.

Hilaria answered 5/1, 2012 at 1:57 Comment(0)

Consider that you want to multiply a by b and then divide by d:

uint64_t LossyMulDiv64(uint64_t a, uint64_t b, uint64_t d)
{
    long double f = long double(b)/d;
    uint64_t highPart = uint64_t((a & ~0xffffffff) * f + 0.5);
    uint64_t lowPart = uint64_t((a & 0xffffffff) * f + 0.5);
    return highPart + lowPart;
}

This code splits the value of a into higher and lower 32-bit parts, then multiplies the 32-bit parts separately by 52-bit precise ratio of b to d, rounds the partial multiplications, and sums them up back to an integer. Some precision still gets lost, but the result is more precise than just plain return a * double(b) / d; .

Interpellation answered 12/11, 2015 at 16:43 Comment(0)

-1

Here's an approximation method you can use: (full precision unless a > 0x7FFFFFFF or b > 0x7FFFFFFF and c is larger than a or b)

constexpr int64_t muldiv(int64_t a, int64_t b, int64_t c, unsigned n = 0) {
  return (a < 0x7FFFFFFF && b < 0x7FFFFFFF) ? (a * b) / c : (n != 2) ? (c <= a) ? ((a / c) * b + muldiv(b, a % c, c, n + 1)) : muldiv(a, b, c / 2) / 2 : 0;
}

The modulo is used to find the loss of precision, which is then plugged back into the function. This is similar to the classic division algorithm.

2 was chosen because (x % x) % x = x % x.

Defalcation answered 5/1, 2012 at 6:42 Comment(4)

Does this work? The first case I tested is muldiv(4984198405165151231,6132198419878046132,9156498145135109843) and it gives 0. The correct answer is 3337967539561099935. – Communicant 5/1, 2012 at 6:54

@Communicant Seems to choke when c is larger than a or b. I'll see if I can fix it. – Defalcation 5/1, 2012 at 6:58

Now it gives 3269073876237821221, which is better but still not right. I assume it's still a work in progress... Though I'm having trouble deciphering how it's supposed to work. If you can get this to work I'll be very impressed that you can do it without any 64 x 64 -> 128 bit multiplies) – Communicant 5/1, 2012 at 7:11

@Communicant Yeah, doesn't seem possible to get full precision without more bits. – Defalcation 5/1, 2012 at 7:32

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags