Reasonably portable way to get top 64-bits from 64x64 bit multiply? [duplicate]
Asked Answered
S

3

10

Is there a reasonably portable way in C/C++ to multiply two 64-bit integers for a 128-bit result and get the top 64-bits of the result, rather than the bottom 64-bits? I need this for distributing a hash function over an arbitrary sized table.

Sloatman answered 10/11, 2014 at 20:29 Comment(9)
Yes, there is. If you do (let's say) a 2-digit decimal multiply on paper, you can see how it works. Then code something with two 16-bit operands so you can easily check the stages as you go. But be careful with your signed/ unsigned vars!Swarth
maybe that helps you?Matter
Calculate the product sign first. Work with unsigned. Put sign back after.Swarth
@Mark I'm looking for unsigned actually.Sloatman
@WeatherVane I was hoping for something hardware accelerated. I know I can do the long-multiplication formula but I'm competing with the % operator so I don't think I'll win that way.Sloatman
If the language does not support it (some high level languages have always supported ints bigger than the native int), then do it in assembler. The 64 high bits of product go into rdx, the 64 low bits go into rax.Swarth
If this is for a hash function, does it matter if the top 64-bits are the correct value of the multiplication or if they are just consistent? You can quickly compute an incorrect but consistent value for the top 64 bits if you are worried about speed. Otherwise if you want the correct value you need to worry about carries. What I mean is you can do what Lee Daniel Crocker said in his answer even though his answer might be off by 1.Spoonerism
@Spoonerism You're right I actually don't care if they're the right value of the multiplication as long as the result is uniformly distributed across the hash table.Sloatman
@Sloatman So in that case, do the "quick" multiplication where you just take a*c + ((b*c)>>32) + ((a*d)>>32) as others have said, without worrying about the other parts that contribute to a possible carry.Spoonerism
D
15

This answer shows how to get the (exact) top 64-bits from a 64x64 bit multiply on a system that doesn't support 128-bit integers. The answer by @amdn will give better performance on systems that do support 128-bit integers.

The diagram below shows one method for computing a 128-bit product from two 64-bit numbers. Each black rectangle represents a 64-bit number. The 64-bit inputs to the method, X and Y, are divided into 32-bits chunks labeled a, b, c, and d. Then four 32x32 bit multiplications are performed, giving four 64-bit products labeled a*c, b*c, a*d, and b*d. The four products must be shifted and added to compute the final answer.

enter image description here

Note that the lower 32-bits of the 128-bit product are solely determined by the lower 32-bits of partial product b*d. The next 32-bits are determined by the lower 32-bits of the following

mid34 = ((b*c) & 0xffffffff) + ((a*d) & 0xffffffff) + ((b*d) >> 32);

Note that mid34 is the sum of three 32-bit numbers and therefore is in fact a 34-bit sum. The upper two bits of mid34 act as a carry into the top 64-bits of the 64x64 bit multiply.

Which brings us to the demo code. The top64 function computes the upper 64-bits of a 64x64 multiply. It's a little verbose to allow for the calculation of the lower 64-bits to be shown in a comment. The main function takes advantage of 128-bit integers to verify the results with a simple test case. Further testing is left as an exercise for the reader.

#include <stdio.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t;

uint64_t top64( uint64_t x, uint64_t y )
{
    uint64_t a = x >> 32, b = x & 0xffffffff;
    uint64_t c = y >> 32, d = y & 0xffffffff;

    uint64_t ac = a * c;
    uint64_t bc = b * c;
    uint64_t ad = a * d;
    uint64_t bd = b * d;

    uint64_t mid34 = (bd >> 32) + (bc & 0xffffffff) + (ad & 0xffffffff);

    uint64_t upper64 = ac + (bc >> 32) + (ad >> 32) + (mid34 >> 32);
//  uint64_t lower64 = (mid34 << 32) | (bd & 0xffffffff);

    return upper64;
}

int main( void )
{
    uint64_t x = 0x0000000100000003;
    uint64_t y = 0x55555555ffffffff;

    uint128_t m = x, n = y;
    uint128_t p = m * n;

    uint64_t top = p >> 64;
    printf( "%016llx %016llx\n", top, top64( x, y ) );
}
Dicephalous answered 11/11, 2014 at 0:20 Comment(0)
P
1

A little algebra never hurt:

#include <stdint.h>

uint64_t top64(uint64_t x, uint64_t y) {
    uint64_t a = x >> 32, b = x & 0xFFFFFFFF;
    uint64_t c = y >> 32, d = y & 0xFFFFFFFF;

    return a * c + ((b * d >> 32) + (a * d) + (b * c)) >> 32 +
    ((((a * d) & 0xFFFFFFFF) + ((b *c) & 0xFFFFFFFF) + ((b * d) >> 32)) >> 32);
}
Portia answered 10/11, 2014 at 21:3 Comment(16)
This isn't quite right because the lower 32 bits are computed by adding 3 numbers, and that sum could overflow 1 or 2 bits into the result.Dicephalous
Only if the original 64x64 would have overflowed also.Portia
Oh, wait, I see what you mean; but no, that's not happening here. I'm specifically only adding into the result the bits that do in fact wind up in those upper 64. I'm not calculating the lower 64 at all.Portia
I'm looking for an example, but haven't found one yet. It seems to me that the calculation of the lower 64 bits results in a 66 bit sum. Hence the 2 carry bits should affect the calculation of the upper 64 bits. But if I can't find an example, then I must be wrong :)Dicephalous
You're correct that it matters. You are incorrect that I haven't accounted for that. That's precisely what the ad and bc terms are doing in my equation.Portia
Good answer. I was hoping to grab the bits straight out of hardware if possible but this is pretty fast too. I think you need to sum AD and BC with (BD >> 32) before you shift to handle the case where you get a carry from the bottom 32-bits though.Sloatman
@Dicephalous is right; you need to account for carries out of the low-order product bits in order to get the correct result for a multiply high. For a simple example, take x = y = 2**64-1. Your method gives 18446744073709551613, whereas the correct result is 18446744073709551614.Shakespeare
@gct: even summing ad + bc before shifting isn't enough. You can't get the right result without including the high-order 32 bits of bd.Shakespeare
I think this answer could be off by one carry. This is the possible carry: (((a*d)&0xffffffff) + ((b*c)&0xffffffff) + ((b*d)>>32)) >> 32. Maybe there is a faster way of computing the carry than this.Spoonerism
I am off by one, but the omitted bd can't be it: 0xFFFFFFFF * 0xFFFFFFFF = 0xfffffffe00000001L. Think about it: 2 ** 32 - 1 squared can't possibly be larger than 2 ** 32 squared.Portia
Added JS1's fix, but yeah, that's ugly. I'll have to sleep on it and see what I can come up with.Portia
@LeeDanielCrocker: It's the carry from adding the high 32 bits of bd to the low 32 bits of ad and bc: 1 + 1 + 0xfffffffe = 0x100000000; you seem to have addressed this in your latest edit.Shakespeare
Ah, I see. Hmm. Time to drag out the pencil again...Portia
I'm a little late to the party, but here's an example where the carry from the lower 64-bits is equal to 2. x=0x0000000100000003 y=0x55555555ffffffffDicephalous
Try a*c + ((a*d + b*c + (b*d >> 32)) >> 32). The bottom 32 bits of b*d can't affect the results since all the other terms have 0 in those bits. Edit: oops that partial sum can overflow, so maybe it can't really be simplified.Jesusa
@MarkRansom: as you seem to have noticed, you can't group it that way because you miss some overflows out of the middle terms.Shakespeare
B
1

Both gcc and clang support 128-bit integers as an extension.

Here's one way to do it demo

#include <iostream>
#include <cstdint>

//https://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/_005f_005fint128.html#_005f_005fint128
using u128 = unsigned __int128;
using u64  = uint64_t;

void mul64x64( u64 a, u64 b, u64 & hi, u64 & lo ) {
    u128 product = u128(a) * b;
    lo = product;
    hi = product >> 64;
}

int main()
{
    u64 hi, lo;
    mul64x64( 40282220, u64{1} << 63, hi, lo );
    std::cout << hi << std::endl;
}

Output is

set -x ; clang++ -std=c++11 -O0 -Wall -Werror main.cpp && ./a.out
+ clang++ -std=c++11 -O0 -Wall -Werror main.cpp
+ ./a.out
20141110
Burgh answered 10/11, 2014 at 23:30 Comment(1)
I probably have to support back further than GCC 4.8 though, which makes the extensions not as reliable to be there. ICC support is also a bonus.Sloatman

© 2022 - 2024 — McMap. All rights reserved.