Efficient 128-bit addition using carry flag

I'm using a 128 bit integer counter in the very inner loops of my C++ code. (Irrelevant background: The actual application is evaluating finite difference equations on a regular grid, which involves repetitively incrementing large integers, and even 64 bits isn't enough precision because small rounding accumulates enough to affect the answers.)

I've represented the integer as two 64 bit unsigned longs. I now need to increment those values by a 128 bit constant. This isn't hard, but you have to manually catch the carry from the low word to the high word.

I have working code something like this:

inline void increment128(unsigned long &hiWord, unsigned long &loWord)
  {
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry
    hiWord += hiAdd;
  }

This is tight and simple code. It works.

Unfortunately this is about 20% of my runtime. The killer line is that loWord test. If I remove it, I obviously get the wrong answers but the runtime overhead drops from 20% to 4%! So that carry test is especially expensive!

My question: Does C++ expose the hardware carry flag, even as an extension to GCC? It seems like the additions could be done without the test-and-add-carry line above if the compiled instructions used an add-with-carry instruction for the hiWord addition. Is there a way to rewrite the test-and-add-carry line to get the compiler to use that opcode?

Araliaceous asked 12/7, 2011 at 4:12 · Comments (2)
If this function is the showstopper, write it in inline assembly and you can deal with the carry efficiently: it's quite short, so it should be manageable. (Even better, you could use SSE/MMX extensions and calculate directly in 128 bits.) - Tantivy
@Tantivy: It's probably the branch instruction rather than the function per se. - Archer

Actually gcc will use the carry automatically if you write your code carefully...

Current GCC can optimize hiWord += (loWord < loAdd); into add/adc (x86's add-with-carry). This optimization was introduced in GCC 5.3.

(editor's note: Of course the hard part is writing a correct full-adder with carry in and carry out; that's hard in C and GCC doesn't know how to optimize any that I've seen.)

Also related: the integer overflow builtins (https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html) can give you the carry-out from an unsigned add, or signed-overflow detection.
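
(editor's note: a minimal sketch of that approach, assuming GCC 5 or later; the function name here is ours, not from the question:)

inline void increment128_builtin(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    // The bool result of __builtin_add_overflow is the carry-out of the
    // 64-bit add; the wrapped sum is stored back into loWord.
    bool carry = __builtin_add_overflow(loWord, loAdd, &loWord);
    hiWord += hiAdd + carry;
}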


Older GCC, like GCC 4.5, will branch or setc on the carry-out from an add instead of using adc, and only uses adc (add-with-carry) on the flag result from an add if you use __int128 (or uint64_t on a 32-bit target). See Is there a 128 bit integer in gcc? (__int128 is only available on 64-bit targets, supported since GCC 4.1).

I compiled this code with g++ -O2 -Wall -Werror -S:

void increment128_1(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    if (loWord < loAdd) ++hiWord; // test_and_add_carry                                                                                                             
    hiWord += hiAdd;
}

void increment128_2(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    loWord += loAdd;
    hiWord += hiAdd;
    hiWord += (loWord < loAdd); // test_and_add_carry                                                                                                               
}

This is the assembly for increment128_1:

.cfi_startproc
        movabsq     $-8801131483544218438, %rax
        addq        (%rsi), %rax
        movabsq     $-8801131483544218439, %rdx
        cmpq        %rdx, %rax
        movq        %rax, (%rsi)
        ja  .L5
        movq        (%rdi), %rax
        addq        $1, %rax
.L3:
        movabsq     $6794178679361, %rdx
        addq        %rdx, %rax
        movq        %rax, (%rdi)
        ret

...and this is the assembly for increment128_2:

        movabsq     $-8801131483544218438, %rax
        addq        %rax, (%rsi)
        movabsq     $6794178679361, %rax
        addq        (%rdi), %rax
        movabsq     $-8801131483544218439, %rdx
        movq        %rax, (%rdi)
        cmpq        %rdx, (%rsi)
        setbe       %dl
        movzbl      %dl, %edx
        leaq        (%rdx,%rax), %rax
        movq        %rax, (%rdi)
        ret

Note the lack of conditional branches in the second version.

[edit]

Also, references are often bad for performance, because GCC has to worry about aliasing... It is often better to just pass things by value. Consider:

struct my_uint128_t {
    unsigned long hi;
    unsigned long lo;
};

my_uint128_t increment128_3(my_uint128_t x)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    x.lo += loAdd;
    x.hi += hiAdd + (x.lo < loAdd);
    return x;
}

Assembly:

        .cfi_startproc
        movabsq     $-8801131483544218438, %rdx
        movabsq     $-8801131483544218439, %rax
        movabsq     $6794178679362, %rcx
        addq        %rsi, %rdx
        cmpq        %rdx, %rax
        sbbq        %rax, %rax
        addq        %rcx, %rax
        addq        %rdi, %rax
        ret

This is actually the tightest code of the three.

...OK so none of them actually used the carry automatically :-). But they do avoid the conditional branch, which I bet is the slow part (since the branch prediction logic will get it wrong half the time).

[edit 2]

And one more, which I stumbled across doing a little searching. Did you know GCC has built-in support for 128-bit integers?

typedef unsigned long my_uint128_t __attribute__ ((mode(TI)));

my_uint128_t increment128_4(my_uint128_t x)
{
    const my_uint128_t hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    return x + (hiAdd << 64) + loAdd;
}

The assembly for this one is about as good as it gets:

        .cfi_startproc
        movabsq     $-8801131483544218438, %rax
        movabsq     $6794178679361, %rdx
        pushq       %rbx
        .cfi_def_cfa_offset 16
        addq        %rdi, %rax
        adcq        %rsi, %rdx
        popq        %rbx
        .cfi_offset 3, -16
        .cfi_def_cfa_offset 8
        ret

(Not sure where the push/pop of rbx came from, but this is still not bad.)

All of these are with GCC 4.5.2, by the way.
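
(editor's note: newer GCC, 4.6 and later, accepts the plainer spelling unsigned __int128 on 64-bit targets, so an equivalent of increment128_4 can be written as below; the function name is ours:)

typedef unsigned __int128 u128;

u128 increment128_5(u128 x)
{
    const u128 hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    // Same arithmetic as increment128_4: build the 128-bit constant
    // and let the compiler emit the add/adc pair.
    return x + (hiAdd << 64) + loAdd;
}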

Inverness answered 12/7, 2011 at 4:21 · Comments (9)
+1, the very last one has the adc instruction you're looking for. - Libido
@Gordy: Indeed. I am a little surprised I could not coax GCC into generating it otherwise... I could have sworn I had seen it do so at some point (and not with the "TI" mode type). - Inverness
This is a terrific analysis! I'll be marking this as the correct answer, but bruce brings up a good point in his comment: why not just use an ASM include? I made a second answer below just to document it, even though this answer here by Nemo will be the best reference. - Araliaceous
@Randall: The asm solution works, but it is specific to x86_64. The 128-bit GCC integers will work on any platform that has GCC, and the other suggestions will work anywhere you have a C++ compiler. You might think you can ignore such concerns, but years of experience suggest you will live to regret it if you do :-). Only go the full asm route if you simply cannot get adequate performance any other way; and if you do use it, be sure to limit it to a single module if possible. - Inverness
@Nemo: Does gcc support __int128_t on 32-bit platforms now? It didn't at some point in the past. - Marje
@Stephen: Sorry, but I do not know. I do not even have a 32-bit system anymore... Very possible you are right. - Inverness
@StephenCanon, GCC's __int128 requires 64-bit. It's easy to check by compiling for 32-bit (e.g. with -m32). MSVC's _umul128 also requires 64-bit. - Randi
@Zboson: it's easy to check if you use (or have) an up-to-date GCC. - Marje
re: previous comments: it's easy to check on godbolt.org, which has up-to-date gcc, clang, ICC, and MSVC for a variety of architectures, including x86-64 (and you can use -m32). I'm pretty sure more recent GCC will recognize the hi += lo<x pattern and use adc on the carry-out from add. And clang definitely can, even for the carry-out from __int128: Fastest way to sum dot product of vector of unsigned 64 bit integers using 192/256 bit integer? - Sturgill

The best answer, of course, is to use the built-in __int128_t support.

Alternatively, use inline asm. I prefer the named-argument form:

__asm("add %[src_lo], %[dst_lo]\n"
      "adc %[src_hi], %[dst_hi]"
      : [dst_lo] "+&r" (loWord), [dst_hi] "+r" (hiWord)
      : [src_lo] "erm" (loAdd), [src_hi] "erm" (hiAdd)
      : );

loWord is flagged as an early-clobber operand, because it's written before some of the other operands are read. This avoids wrong code for the hiAdd = loWord case, because it stops gcc from using the same register to hold both. It does also stop the compiler from using the same register in the loAdd = loWord case, though, where that would be safe.

As that early-clobber question points out, inline asm is really easy to get wrong (in hard-to-debug ways that only cause trouble after some change to the code it's inlined into).

x86 and x86-64 inline asm is assumed to clobber the flags, so an explicit "cc" clobber isn't needed.
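
(editor's note: for reference, a sketch of the snippet above wrapped into a complete function; assumptions: x86-64 only, and the function name is ours:)

inline void increment128_asm(unsigned long &hiWord, unsigned long &loWord)
{
    const unsigned long hiAdd=0x0000062DE49B5241;
    const unsigned long loAdd=0x85DC198BCDD714BA;

    // add sets the carry flag; adc folds it into the high word.
    // The "+&r" early clobber on dst_lo is discussed above.
    __asm__("add %[src_lo], %[dst_lo]\n\t"
            "adc %[src_hi], %[dst_hi]"
            : [dst_lo] "+&r" (loWord), [dst_hi] "+r" (hiWord)
            : [src_lo] "erm" (loAdd), [src_hi] "erm" (hiAdd));
}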

Marje answered 12/7, 2011 at 21:44 · Comments (8)
Should you use an early-clobber flag on the outputs (at least for hiWord)? For 256-bit addition (one add, three adcs) I get the wrong result without using early-clobber flags. - Randi
Probably, but you'd need it on loWord, I think. - Marje
Using add + adc is definitely the way to go. It's unfortunate that C's abstraction hides the fact that it's easy to do a 128-bit addition. - Lues
The constraints on the source operands (loAdd and hiAdd) could be emr. (e means a 32-bit signed constant; 64-bit add/adc can take sign-extended 32-bit immediate operands.) There should also be a clobber on cc, since this modifies the flags. In this case, broadening the constraints doesn't really matter, since mov r64, imm64 is the best way to set things up anyway. But see goo.gl/MmdKgo for a case where an early clobber on loWord will generate different code (but worse, with a useless mov). I still haven't figured out if there's a need at all. - Sturgill
update: hiAdd is read after loWord is written, so with hiAdd = loWord, you get wrong code without an early clobber (code which would add the new value of loWord, not the old value). - Sturgill
@PeterCordes Isn't the cc clobber implicit for inline asm on x86? (Which is actually a pain in cases where you don't clobber it...) - Armory
@MarcGlisse: Oh, I didn't realize that! Yes, it's that way for sure on x86-32 (i386), and it looks like that discussion about what to do for x86-64 picked the sane choice of working the same as i386. Thanks. - Sturgill
How would you generalize this to wider integers? (Maybe with a loop?) - Hydride
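
(editor's note: one portable way to generalize, as the last comment asks, is a word-at-a-time loop over the overflow builtins from GCC 5+; all names here are ours, and GCC is not guaranteed to turn the loop into an adc chain:)

#include <cstddef>

// dst += src for N-word integers, least-significant word first.
template <std::size_t N>
inline void addWide(unsigned long (&dst)[N], const unsigned long (&src)[N])
{
    bool carry = false;
    for (std::size_t i = 0; i < N; ++i) {
        unsigned long tmp;
        bool c1 = __builtin_add_overflow(dst[i], src[i], &tmp);
        bool c2 = __builtin_add_overflow(tmp, (unsigned long)carry, &dst[i]);
        carry = c1 | c2;  // at most one of the two adds can carry out
    }
}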
