An efficient way to do basic 128 bit integer calculations in C++?

Asked 2/12, 2014 at 23:59 Answered 23/1, 2015 at 20:5

Solved c++assembly x86 intel-edison int128

Some years ago I needed a way to do some basic 128 bit integer math with Cuda: 128 bit integer on cuda?. Now I am having the same problem, but this time I need to run some basic 128 bit arithmetic (sums, bit shifts and multiplications) on a 32 bit embedded system (Intel Edison) that does not support 128 bit integers of any kind. 64 bit integers, however, are supported directly (unsigned long long int).

I naively tried to reuse the asm code from the answer I got last time, this time on the CPU, but I got a bunch of errors. I am really not experienced with asm, so: given 64 bit integers, what is the most efficient way to implement 128 bit addition, multiplication and bit shifting?

Grimy answered 2/12, 2014 at 23:59 Comment(7)

This might be worth a look: gmplib.org check: "8 Low-level Functions": gmplib.org/manual/Low_002dlevel-Functions.html – Wig 3/12, 2014 at 0:7

You can easily synthesize those operations in C using the available 64 bit support, and let the compiler take care of the details. Only if the performance isn't good enough should you start to optimize. – Mooncalf 3/12, 2014 at 0:14

Doesn't Edison support SSE ? You could use standard intrinsics macros. – Yoke 3/12, 2014 at 0:16

@Yoke SSE doesn't do 128 bit integer arithmetic AFAIK. – Mooncalf 3/12, 2014 at 0:17

Consider arbitrary precision math libraries like GMP or MPIR. – Dittany 3/12, 2014 at 0:37

@SevaAlekseyev: We're talking about a very-low-performance CPU here (Intel Edison) and you're suggesting arbitrary precision? GMP is optimized for 100+ digits, not 20. This may easily be 20 times slower than a straightforward 64+64 solution. – Heffernan 3/12, 2014 at 10:4

@Yoke SSE is a SIMD instruction set which works on multiple values at a time, not on one big 128-bit value. Doing 128-bit arithmetic with it is tricky and very inefficient. – Steatite 23/1, 2015 at 11:8

Update: Since the OP hasn't accepted an answer yet <hint><hint>, I've attached a bit more code.

Using the libraries discussed above is probably a good idea. While you might only need a few functions today, eventually you may find that you need one more. Then one more after that. Until eventually you end up writing, debugging and maintaining your own 128bit math library. Which is a waste of your time and effort.

That said. If you are determined to roll your own:

1) The cuda question you asked previously already has c code for multiplication. Was there some problem with it?

2) The shift probably won't benefit from using asm, so a c solution makes sense to me here as well. ~~Although if performance is really an issue here, I'd see if the Edison supports SHLD/SHRD, which might make this a bit faster. Otherwise, m~~ Maybe an approach like this?

my_uint128_t lshift_uint128 (const my_uint128_t a, int b)
{
   my_uint128_t res;
   if (b == 0) {
      res = a;   // avoids a.x >> 32 below, which is undefined behavior
   } else if (b < 32) {
      res.x = a.x << b;
      res.y = (a.y << b) | (a.x >> (32 - b));
      res.z = (a.z << b) | (a.y >> (32 - b));
      res.w = (a.w << b) | (a.z >> (32 - b));
   } else if (b < 64) {
      ...
   }

   return res;
}

Update: Since it appears that the Edison may support SHLD/SHRD, here's an alternative which might be more performant than the 'c' code above. As with all code purporting to be faster, you should test it.

inline
unsigned int __shld(unsigned int into, unsigned int from, unsigned int c)
{
   unsigned int res;

   if (__builtin_constant_p(into) &&
       __builtin_constant_p(from) &&
       __builtin_constant_p(c))
   {
      res = (into << c) | (from >> (32 - c));
   }
   else
   {
      asm("shld %b3, %2, %0"
          : "=rm" (res)
          : "0" (into), "r" (from), "ic" (c)
          : "cc");
   }

   return res;
}

inline
unsigned int __shrd(unsigned int into, unsigned int from, unsigned int c)
{
   unsigned int res;

   if (__builtin_constant_p(into) && 
       __builtin_constant_p(from) && 
       __builtin_constant_p(c))
   {
      res = (into >> c) | (from << (32 - c));
   }
   else
   {
      asm("shrd %b3, %2, %0"
          : "=rm" (res)
          : "0" (into), "r" (from), "ic" (c)
          : "cc");
   }

   return res;
}

my_uint128_t lshift_uint128 (const my_uint128_t a, unsigned int b)
{
   my_uint128_t res;

   if (b < 32) {
      res.x = a.x << b;
      res.y = __shld(a.y, a.x, b);
      res.z = __shld(a.z, a.y, b);
      res.w = __shld(a.w, a.z, b);
   } else if (b < 64) {
      res.x = 0;
      res.y = a.x << (b - 32);
      res.z = __shld(a.y, a.x, b - 32);
      res.w = __shld(a.z, a.y, b - 32);
   } else if (b < 96) {
      res.x = 0;
      res.y = 0;
      res.z = a.x << (b - 64);
      res.w = __shld(a.y, a.x, b - 64);
   } else if (b < 128) {
      res.x = 0;
      res.y = 0;
      res.z = 0;
      res.w = a.x << (b - 96);
   } else {
      memset(&res, 0, sizeof(res));
   }

   return res;
}

my_uint128_t rshift_uint128 (const my_uint128_t a, unsigned int b)
{
   my_uint128_t res;

   if (b < 32) {
      res.x = __shrd(a.x, a.y, b);
      res.y = __shrd(a.y, a.z, b);
      res.z = __shrd(a.z, a.w, b);
      res.w = a.w >> b;
   } else if (b < 64) {
      res.x = __shrd(a.y, a.z, b - 32);
      res.y = __shrd(a.z, a.w, b - 32);
      res.z = a.w >> (b - 32);
      res.w = 0;
   } else if (b < 96) {
      res.x = __shrd(a.z, a.w, b - 64);
      res.y = a.w >> (b - 64);
      res.z = 0;
      res.w = 0;
   } else if (b < 128) {
      res.x = a.w >> (b - 96);
      res.y = 0;
      res.z = 0;
      res.w = 0;
   } else {
      memset(&res, 0, sizeof(res));
   }

   return res;
}

3) The addition may benefit from asm. You could try this:

struct my_uint128_t
{
   unsigned int x;
   unsigned int y;
   unsigned int z;
   unsigned int w;
};

my_uint128_t add_uint128 (const my_uint128_t a, const my_uint128_t b)
{
   my_uint128_t res;

    asm ("addl %5, %[resx]\n\t"
         "adcl %7, %[resy]\n\t"
         "adcl %9, %[resz]\n\t"
         "adcl %11, %[resw]\n\t"
         : [resx] "=&r" (res.x), [resy] "=&r" (res.y), 
           [resz] "=&r" (res.z), [resw] "=&r" (res.w)
         : "%0"(a.x), "irm"(b.x), 
           "%1"(a.y), "irm"(b.y), 
           "%2"(a.z), "irm"(b.z), 
           "%3"(a.w), "irm"(b.w)
         : "cc");

   return res;
}

I just dashed this off, so use at your own risk. I don't have an Edison, but this works with x86.

Update: If you are just doing accumulation (think to += from instead of the code above which is c = a + b), this code might serve you better:

inline
void addto_uint128 (my_uint128_t *to, const my_uint128_t from)
{
   asm ("addl %[fromx], %[tox]\n\t"
        "adcl %[fromy], %[toy]\n\t"
        "adcl %[fromz], %[toz]\n\t"
        "adcl %[fromw], %[tow]\n\t"
        : [tox] "+&r"(to->x), [toy] "+&r"(to->y), 
          [toz] "+&r"(to->z), [tow] "+&r"(to->w)
        : [fromx] "irm"(from.x), [fromy] "irm"(from.y), 
          [fromz] "irm"(from.z), [fromw] "irm"(from.w)
        : "cc");
}

Turkic answered 3/12, 2014 at 7:17 Comment(7)

This is nice! However, I noticed that there is an unsigned long long int 64 bit type defined here. Do you think it could be somehow faster using directly that type? Or would that just translate in this kind of code, as g++ would emulate 64 bit ints just the same way you did? – Grimy 3/12, 2014 at 8:44

I'm not familiar with Edison, but if it only has 32bit registers, I can't see how it could be faster. As you say, I would expect it to just simulate. Still, any question that starts with "Would it be faster if I..." should normally be answered by "try it and see." – Turkic 3/12, 2014 at 9:34

Given that this is C++, it would be prettier to define a class uint128_t and overload all the integer operators so that in most cases 128 bit expressions would look like any other integer arithmetic expression. – Complexity 3/12, 2014 at 9:43

If you want to, feel free. However, if you are going to put that much work into it, why not use one of the libraries people mentioned above? I'm sure some of them have already done all this. – Turkic 3/12, 2014 at 9:48

Edison is basically a Pentium P54C with SSE2. Since the Pentium part is all 32 bits, you need the SSE2 for 64 bit performance. – Heffernan 3/12, 2014 at 10:9

Well, it looks like the P54C does indeed support SHLD. Such being the case, it might be possible to squeeze a tiny bit more performance out of asm than out of the shift code currently in this answer. You'd probably have to try it and see if it really is faster on your specific hw though. How desperate are you for perf? And before I write this, is the 'add' code working? – Turkic 3/12, 2014 at 10:56

@DavidWohlferd: According to Agner Fog's instruction tables (agner.org/optimize), SHLD r, imm8 takes 4 cycles on the in-order P54C, and is non-pairable. adc r, r/imm is 1 cycle and is pairable (U-pipe only). Inline asm prevents the compiler from scheduling any independent instructions interleaved with the add/adc chain to pair with the middle 2 instructions, but at least the whole chain is only 4 cycles total and the first and last instructions might pair. (Being in-order, there's no real distinction between throughput and latency). – Teplica 15/10, 2018 at 17:57

If using an external library is an option then have a look at this question. You can use TTMath, which is a very simple header-only library for big precision math. On 32-bit architectures ttmath::UInt<4> will create a 128-bit int type with four 32-bit limbs. Some other alternatives are (u)int128_t in Boost.Multiprecision or calccrypto/uint128_t.

If you must write it yourself then there are already a lot of solutions on SO, and I'll summarize them here.

Addition and subtraction are easy and straightforward: simply add/subtract the words (which big int libraries often call limbs) from the least significant to the most significant, with carry of course.

typedef struct INT128 {
    uint64_t H, L;
} my_uint128_t;

inline my_uint128_t add(my_uint128_t a, my_uint128_t b)
{
    my_uint128_t c;
    c.L = a.L + b.L;
    c.H = a.H + b.H + (c.L < a.L);  // c = a + b
    return c;
}
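
Subtraction follows the same pattern with a borrow instead of a carry. The block below is only a sketch: the names u128 and sub are made up for illustration and the struct simply mirrors the H/L layout used above.

```cpp
#include <cstdint>

// Hypothetical type mirroring the struct above: H is the high 64 bits, L the low 64 bits.
struct u128 {
    std::uint64_t H, L;
};

// c = a - b (mod 2^128). A borrow out of the low half happens exactly
// when a.L < b.L, mirroring the carry test (c.L < a.L) used for addition.
inline u128 sub(u128 a, u128 b)
{
    u128 c;
    c.L = a.L - b.L;
    c.H = a.H - b.H - (a.L < b.L);
    return c;
}
```

For example, subtracting 1 from 2^64 (H = 1, L = 0) borrows from the high half and leaves H = 0 with every bit of L set.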

The assembly output can be checked with Compiler Explorer

Compilers can already generate efficient code for double-word operations, but many aren't smart enough to use "add with carry" when compiling multi-word operations from high level languages, as you can see in the question efficient 128-bit addition using carry flag. Using two long longs as above is therefore not only more readable, it also makes it easier for the compiler to emit slightly more efficient code.

If that still doesn't meet your performance requirements, you must use an intrinsic or write it in assembly. To add the 128-bit value stored in bignum to the 128-bit value in {eax, ebx, ecx, edx} you can use the following code

add edx, [bignum]
adc ecx, [bignum+4]
adc ebx, [bignum+8]
adc eax, [bignum+12]

The equivalent intrinsic will be like this for Clang

unsigned *x, *y, *z, carryin=0, carryout;  // x, y and z must each point to four 32-bit limbs, least significant first
z[0] = __builtin_addc(x[0], y[0], carryin, &carryout);
carryin = carryout;
z[1] = __builtin_addc(x[1], y[1], carryin, &carryout);
carryin = carryout;
z[2] = __builtin_addc(x[2], y[2], carryin, &carryout);
carryin = carryout;
z[3] = __builtin_addc(x[3], y[3], carryin, &carryout);

You need to change the intrinsic to the one supported by your compiler, for example __builtin_uadd_overflow in GCC, or _addcarry_u32 in MSVC and ICC.
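
As a rough sketch of what the GCC flavour could look like: __builtin_uadd_overflow has no carry-in parameter, so each limb needs two calls, one for the limbs themselves and one for the incoming carry. The function name add128_limbs is made up for illustration, and the code assumes GCC 5 or later (or Clang) and a 32-bit unsigned int.

```cpp
// Hypothetical helper: z = x + y over four 32-bit limbs, least significant first.
// Returns the carry out of the most significant limb.
// Assumption: unsigned int is 32 bits wide and the compiler is GCC >= 5 or Clang.
inline bool add128_limbs(const unsigned x[4], const unsigned y[4], unsigned z[4])
{
    bool carry = false;
    for (int i = 0; i < 4; ++i) {
        unsigned partial;
        // First add the two limbs, then add the carry from the previous limb.
        bool c1 = __builtin_uadd_overflow(x[i], y[i], &partial);
        bool c2 = __builtin_uadd_overflow(partial, (unsigned)carry, &z[i]);
        // At most one of the two additions can overflow for a given limb.
        carry = c1 || c2;
    }
    return carry;
}
```

Whether the compiler turns this into an add/adc chain depends on the compiler version and flags, so check the generated assembly before relying on it.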

For more information read these

For bit shifts you can find a C solution in the question Bitwise shift operation on a 128-bit number. This is a simple left shift that handles shifts of 32 bits or more by moving whole words and recursing; you can unroll the recursive call for more performance

void shiftl128 (
    unsigned int& a,
    unsigned int& b,
    unsigned int& c,
    unsigned int& d,
    size_t k)
{
    assert (k <= 128);
    if (k == 0) // nothing to do; also avoids the undefined shift by 32 in the else branch
        return;
    if (k >= 32) // shifting a 32-bit integer by more than 31 bits is "undefined"
    {
        a=b;
        b=c;
        c=d;
        d=0;
        shiftl128(a,b,c,d,k-32);
    }
    else
    {
        a = (a << k) | (b >> (32-k));
        b = (b << k) | (c >> (32-k));
        c = (c << k) | (d >> (32-k));
        d = (d << k);
    }
}
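
One possible way to get rid of the recursion is to split the shift count into a whole-word part and a remaining bit count, and then build each result limb from at most two source limbs. This is only a sketch: the name shl128 and the array layout (least significant limb first, unlike the a, b, c, d order above) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: shifts a 128-bit value left by k bits, in place.
// v[0] is the least significant 32-bit limb, v[3] the most significant.
inline void shl128(std::uint32_t v[4], unsigned k)
{
    assert(k < 128);
    const int words = static_cast<int>(k / 32);  // whole-limb part of the shift
    const unsigned bits = k % 32;                // remaining shift, always 0..31

    // Walk from the top limb down so every source limb is read before it is overwritten.
    for (int i = 3; i >= 0; --i) {
        std::uint32_t hi = 0, lo = 0;
        if (i - words >= 0)
            hi = v[i - words];
        if (i - words - 1 >= 0)
            lo = v[i - words - 1];
        // When bits == 0 the (32 - bits) shift would be undefined, so skip it.
        v[i] = bits ? (hi << bits) | (lo >> (32 - bits)) : hi;
    }
}
```

The loop has a fixed trip count of four, so a compiler can unroll it fully; whether that beats the recursive version on a given CPU needs to be measured.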

The assembly for less-than-32-bit shifts can be found in the question 128-bit shifts using assembly language?

shld    edx, ecx, cl
shld    ecx, ebx, cl
shld    ebx, eax, cl
shl     eax, cl

Right shifts can be implemented similarly, or just copy from the above linked question
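
For reference, a logical (unsigned) right shift in plain C++ could look roughly like the sketch below. The name shr128 and the array layout (least significant limb first) are assumptions made for illustration, not taken from the linked question.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: logical right shift of a 128-bit value by k bits, in place.
// v[0] is the least significant 32-bit limb, v[3] the most significant.
inline void shr128(std::uint32_t v[4], unsigned k)
{
    assert(k < 128);
    const int words = static_cast<int>(k / 32);  // whole-limb part of the shift
    const unsigned bits = k % 32;                // remaining shift, always 0..31

    // Walk from the bottom limb up so every source limb is read before it is overwritten.
    for (int i = 0; i < 4; ++i) {
        std::uint32_t lo = 0, hi = 0;
        if (i + words < 4)
            lo = v[i + words];
        if (i + words + 1 < 4)
            hi = v[i + words + 1];
        // When bits == 0 the (32 - bits) shift would be undefined, so skip it.
        v[i] = bits ? (lo >> bits) | (hi << (32 - bits)) : lo;
    }
}
```

An arithmetic (sign-extending) shift would need to fill the vacated limbs with copies of the sign bit instead of zeros.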

Multiplication and division are a lot more complex, and you can refer to the solution in the question Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit):

class int128_t
{
    uint32_t dw3, dw2, dw1, dw0;

    // Various constructors, operators, etc...

    int128_t& operator*=(const int128_t&  rhs) __attribute__((always_inline))
    {
        int128_t Urhs(rhs);
        uint32_t lhs_xor_mask = (int32_t(dw3) >> 31);
        uint32_t rhs_xor_mask = (int32_t(Urhs.dw3) >> 31);
        uint32_t result_xor_mask = (lhs_xor_mask ^ rhs_xor_mask);
        dw0 ^= lhs_xor_mask;
        dw1 ^= lhs_xor_mask;
        dw2 ^= lhs_xor_mask;
        dw3 ^= lhs_xor_mask;
        Urhs.dw0 ^= rhs_xor_mask;
        Urhs.dw1 ^= rhs_xor_mask;
        Urhs.dw2 ^= rhs_xor_mask;
        Urhs.dw3 ^= rhs_xor_mask;
        *this += (lhs_xor_mask & 1);
        Urhs += (rhs_xor_mask & 1);

        struct mul128_t
        {
            int128_t dqw1, dqw0;
            mul128_t(const int128_t& dqw1, const int128_t& dqw0): dqw1(dqw1), dqw0(dqw0){}
        };

        mul128_t data(Urhs,*this);
        asm volatile(
        "push      %%ebp                            \n\
        movl       %%eax,   %%ebp                   \n\
        movl       $0x00,   %%ebx                   \n\
        movl       $0x00,   %%ecx                   \n\
        movl       $0x00,   %%esi                   \n\
        movl       $0x00,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ebx                   \n\
        adcl       %%edx,   %%ecx                   \n\
        adcl       $0x00,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   16(%%ebp),   %%eax #Calc: (dw3*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw2)  \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   24(%%ebp),  %%eax #Calc: (dw1*dw2)   \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw3)  \n\
        mull               (%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        pop        %%ebp                            \n"
        :"=b"(this->dw0),"=c"(this->dw1),"=S"(this->dw2),"=D"(this->dw3)
        :"a"(&data):"%ebp");

        dw0 ^= result_xor_mask;
        dw1 ^= result_xor_mask;
        dw2 ^= result_xor_mask;
        dw3 ^= result_xor_mask;
        return (*this += (result_xor_mask & 1));
    }
};
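
If inline assembly is not an option, the same schoolbook scheme can be sketched in portable C++ using 64-bit intermediate products. This is not the code from the linked question: the name mul128_lo and the array layout (least significant limb first) are assumptions for illustration, and only the low 128 bits of the product are kept. In two's complement those low bits are the same for signed and unsigned operands, so the sign handling above is not repeated here.

```cpp
#include <cstdint>

// Hypothetical helper: r = (a * b) mod 2^128.
// Limbs are 32 bits wide, least significant first. r must not alias a or b.
inline void mul128_lo(const std::uint32_t a[4], const std::uint32_t b[4],
                      std::uint32_t r[4])
{
    r[0] = r[1] = r[2] = r[3] = 0;
    for (int i = 0; i < 4; ++i) {
        std::uint64_t carry = 0;
        // Partial products landing at limb 4 or above only affect bits >= 128; skip them.
        for (int j = 0; i + j < 4; ++j) {
            // Cannot overflow: (2^32-1)^2 + 2*(2^32-1) == 2^64 - 1.
            std::uint64_t t = static_cast<std::uint64_t>(a[i]) * b[j] + r[i + j] + carry;
            r[i + j] = static_cast<std::uint32_t>(t);  // low 32 bits stay in this limb
            carry = t >> 32;                           // high 32 bits move to the next limb
        }
    }
}
```

This performs the same ten 32x32 multiplications as the asm version. On a 32-bit target the compiler should be able to map each 64-bit product onto a single mull, but that is worth verifying in the generated assembly.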

You can also find a lot of related questions with the 128bit tag

Steatite answered 23/1, 2015 at 20:5 Comment(4)

Your asm volatile doesn't actually clobber EBP, it saves/restores it. If you remove that clobber, it should be possible to compile this with -fno-omit-frame-pointer. But it's not safe to ask for only a pointer in a register, and then dereference it without using a "memory" clobber or a dummy memory source operand like "m"(data) as well. – Teplica 15/10, 2018 at 18:2

@PeterCordes I took it from the other answer. It wasn't written by me – Steatite 16/10, 2018 at 1:20

Then you should fix both, or at least the version that's in your answer. – Teplica 16/10, 2018 at 1:44

unfortunately I don't really understand how gcc extended assembly works – Steatite 16/10, 2018 at 6:33
