128-bit integer on CUDA?

I just managed to install the CUDA SDK under Linux Ubuntu 10.04. My graphics card is an NVIDIA GeForce GT 425M, and I'd like to use it for some heavy computational problem. What I wonder is: is there any way to use an unsigned 128-bit integer variable? When compiling my program with gcc for the CPU, I used the __uint128_t type, but using it with CUDA doesn't seem to work. Is there anything I can do to have 128-bit integers on CUDA?

Milford answered 28/5, 2011 at 14:10 Comment(0)

For best performance, one would want to map the 128-bit type on top of a suitable CUDA vector type, such as uint4, and implement the functionality using PTX inline assembly. The addition would look something like this:

typedef uint4 my_uint128_t;
__device__ my_uint128_t add_uint128 (my_uint128_t addend, my_uint128_t augend)
{
    my_uint128_t res;
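    // add with the carry propagating from the least significant word (.x)
    // up to the most significant word (.w)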
    asm ("add.cc.u32      %0, %4, %8;\n\t"
         "addc.cc.u32     %1, %5, %9;\n\t"
         "addc.cc.u32     %2, %6, %10;\n\t"
         "addc.u32        %3, %7, %11;\n\t"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(addend.x), "r"(addend.y), "r"(addend.z), "r"(addend.w),
           "r"(augend.x), "r"(augend.y), "r"(augend.z), "r"(augend.w));
    return res;
}
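
For illustration, a kernel that applies this addition elementwise could look like the sketch below (the kernel name and indexing are made up; recall that .x holds the least significant 32 bits and .w the most significant):

__global__ void add_uint128_kernel (const my_uint128_t *a,
                                    const my_uint128_t *b,
                                    my_uint128_t *sum, int n)
{
    // one 128-bit addition per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        sum[i] = add_uint128 (a[i], b[i]);
    }
}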

The multiplication can similarly be constructed using PTX inline assembly, by breaking the 128-bit numbers into 32-bit chunks, computing the 64-bit partial products, and adding them appropriately. Obviously this takes a bit of work. One might get reasonable performance at the C level by breaking the numbers into 64-bit chunks and using __umul64hi() in conjunction with regular 64-bit multiplication and some additions. This would result in the following:

__device__ my_uint128_t mul_uint128 (my_uint128_t multiplicand, 
                                     my_uint128_t multiplier)
{
    my_uint128_t res;
    unsigned long long ahi, alo, bhi, blo, phi, plo;
    alo = ((unsigned long long)multiplicand.y << 32) | multiplicand.x;
    ahi = ((unsigned long long)multiplicand.w << 32) | multiplicand.z;
    blo = ((unsigned long long)multiplier.y << 32) | multiplier.x;
    bhi = ((unsigned long long)multiplier.w << 32) | multiplier.z;
    plo = alo * blo;
    phi = __umul64hi (alo, blo) + alo * bhi + ahi * blo;
    res.x = (unsigned int)(plo & 0xffffffff);
    res.y = (unsigned int)(plo >> 32);
    res.z = (unsigned int)(phi & 0xffffffff);
    res.w = (unsigned int)(phi >> 32);
    return res;
}

Below is a version of the 128-bit multiplication that uses PTX inline assembly. It requires PTX 3.0, which shipped with CUDA 4.2, and the code requires a GPU with at least compute capability 2.0, i.e., a Fermi- or Kepler-class device. The code uses the minimal number of instructions, as sixteen 32-bit multiplies are needed to implement a 128-bit multiplication. By comparison, the variant above using CUDA intrinsics compiles to 23 instructions for an sm_20 target.

__device__ my_uint128_t mul_uint128 (my_uint128_t a, my_uint128_t b)
{
    my_uint128_t res;
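    // multiply the four 32-bit halves of a and b pairwise and accumulate,
    // keeping only the low 128 bits of the full 256-bit product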
    asm ("{\n\t"
         "mul.lo.u32      %0, %4, %8;    \n\t"
         "mul.hi.u32      %1, %4, %8;    \n\t"
         "mad.lo.cc.u32   %1, %4, %9, %1;\n\t"
         "madc.hi.u32     %2, %4, %9,  0;\n\t"
         "mad.lo.cc.u32   %1, %5, %8, %1;\n\t"
         "madc.hi.cc.u32  %2, %5, %8, %2;\n\t"
         "madc.hi.u32     %3, %4,%10,  0;\n\t"
         "mad.lo.cc.u32   %2, %4,%10, %2;\n\t"
         "madc.hi.u32     %3, %5, %9, %3;\n\t"
         "mad.lo.cc.u32   %2, %5, %9, %2;\n\t"
         "madc.hi.u32     %3, %6, %8, %3;\n\t"
         "mad.lo.cc.u32   %2, %6, %8, %2;\n\t"
         "madc.lo.u32     %3, %4,%11, %3;\n\t"
         "mad.lo.u32      %3, %5,%10, %3;\n\t"
         "mad.lo.u32      %3, %6, %9, %3;\n\t"
         "mad.lo.u32      %3, %7, %8, %3;\n\t"
         "}"
         : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
         : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
           "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
    return res;
}
Pellet answered 2/6, 2011 at 21:16 Comment(6)
@Pellet - I assume today you would suggest a solution based on 2 64-bit values?Whatley
@Whatley Unlikely, since 64-bit integer operations are emulated and it is usually best to build emulations on top of native instructions rather than on other emulations. Because 32-bit integer multiply and multiply-add are themselves emulated on the Maxwell and Pascal architectures, it would possibly be best to use native 16-bit multiplies there, which map to the machine instruction XMAD (a 16x16+32-bit multiply-add operation). I read that native 32-bit integer multiplies were restored with the Volta architecture, but I have no hands-on experience with Volta yet.Pellet
How does performance compare to 32-bit integers? 1/16 or similar?Lumbar
@huseyintugrulbuyukisik Based on instruction count it would be around 1/16 of a native 32-bit multiplication. The actual performance impact could vary a bit with code context, depending on the load on the functional units and on register usage.Pellet
Can we also do uint128 adds atomically?Griceldagrid
@Griceldagrid As best I know, the GPU hardware only supports atomic operations up to a size of 64 bits. I have not researched whether one could construct atomics for larger types via clever software constructs.Pellet

CUDA doesn't support 128-bit integers natively. You can emulate the operations yourself using two 64-bit integers.

Look at this post:

typedef struct {
  unsigned long long int lo;
  unsigned long long int hi;
} my_uint128;

my_uint128 add_uint128 (my_uint128 a, my_uint128 b)
{
  my_uint128 res;
  res.lo = a.lo + b.lo;
  res.hi = a.hi + b.hi + (res.lo < a.lo);
  return res;
} 
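
A subtraction built the same way (a sketch added here for illustration, not taken from the linked post) propagates a borrow instead of a carry:

my_uint128 sub_uint128 (my_uint128 a, my_uint128 b)
{
  my_uint128 res;
  res.lo = a.lo - b.lo;                  // wraps around on underflow
  res.hi = a.hi - b.hi - (a.lo < b.lo);  // borrow from the low half
  return res;
}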
Pacheco answered 28/5, 2011 at 15:28 Comment(4)
Thank you very much! Just one more question: from an efficiency point of view, is this going to be fast enough?Milford
I tested that code on my CPU. It actually works, but it's 6 times slower than using the __uint128_t type... isn't there any way to make it faster?Milford
You compared built-in 128-bit integers against this my_uint128, both on the CPU? Of course the native support will be faster. The hope is that this 128-bit type on the GPU will be faster than built-in 128-bit integers on the CPU.Pacheco
Is the link broken?Boiled

For posterity, note that as of CUDA 11.5, nvcc supports __int128_t in device code when the host compiler supports it (e.g., clang/gcc, but not MSVC). CUDA 11.6 added debug-tool support for __int128_t.
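
A minimal sketch, assuming CUDA 11.5 or newer with a gcc or clang host compiler (the kernel and variable names here are illustrative, not from the answer):

__global__ void square128 (const __int128_t *in, __int128_t *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * in[i];   // 128-bit arithmetic emulated by the compiler
    }
}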

Sailcloth answered 6/4, 2022 at 18:23 Comment(1)
This should be the accepted answer nowadays. Who cares about M$VC and its shitty OS? :PSnapper

A much-belated answer, but you could consider using this library:

https://github.com/curtisseizert/CUDA-uint128

which defines a 128-bit structure, with methods and freestanding utility functions that allow it to be used like a regular integer. Mostly.

Whatley answered 30/5, 2018 at 13:0 Comment(1)
This is really cool, and a much better answer than the others :) After looking at the source code, I saw that there's a __mul64hi intrinsic that makes 64 * 64-bit multiplication efficient.Famulus
