Efficient Multiply/Divide of two 128-bit Integers on x86 (no 64-bit)
Compiler: MinGW/GCC
Issues: No GPL/LGPL code allowed (GMP, or any bignum library for that matter, is overkill for this problem, as I already have the class implemented).

I have constructed my own 128-bit fixed-size big integer class (intended for use in a game engine, but generalizable to any use case) and I find the performance of the current multiply and divide operations quite abysmal (yes, I have timed them; see below). I'd like to improve on (or change) the algorithms that do the low-level number crunching.


When it comes to the multiply and divide operators, they are unbearably slow compared to just about everything else in the class.

These are the approximate measurements relative to my own computer:

Raw times as defined by QueryPerformanceFrequency:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~546u
Division:           ~4760u (with maximum bit count)

As you can see, just doing the multiplication is many, many times slower than add or subtract. Division is about 10 times slower than multiplication.

I'd like to improve the speed of these two operators because there may be a very high number of computations done per frame (dot products, various collision detection methods, etc).


The structure (methods omitted) looks somewhat like:

class uint128_t
{
    public:
        unsigned long int dw3, dw2, dw1, dw0;
  //...
};

Multiplication is currently done using the typical long-multiplication method (in assembly, so that I can catch the EDX output) while ignoring the words that go out of range (that is, I'm only doing 10 mull instructions compared to 16).
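For reference, the same schoolbook scheme can be sketched portably: each `(uint64_t)a * b` compiles to a single 32x32 mull on x86, and only the 10 partial products that land inside the low 128 bits are computed. The `u128` type and `w[]` layout (w[0] = least significant word) are illustrative, not the class from the question.

```cpp
#include <cstdint>

struct u128 { uint32_t w[4]; };  // w[0] = least significant word

inline u128 mul128(const u128& a, const u128& b)
{
    u128 r = {{0, 0, 0, 0}};
    for (int i = 0; i < 4; ++i) {
        uint64_t carry = 0;
        for (int j = 0; i + j < 4; ++j) {   // skip products past 128 bits
            uint64_t t = (uint64_t)a.w[i] * b.w[j] + r.w[i + j] + carry;
            r.w[i + j] = (uint32_t)t;       // low word stays in place
            carry = t >> 32;                // high word feeds the next column
        }
    }
    return r;
}
```

This is essentially what the inline assembly shown later does by hand, with the carry chain held in registers.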

Division uses the shift-subtract algorithm (speed depends on bit counts of the operands). However, it is not done in assembly. I found that a little too difficult to muster and decided to let the compiler optimize it.
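A minimal sketch of the shift-subtract algorithm described above, shown on 64-bit operands for brevity; the 128-bit version is the same loop using the class's own shift, compare and subtract operators. `den` must be nonzero, and the runtime scales with the operand bit counts, matching the behaviour described above.

```cpp
#include <cstdint>

inline uint64_t div_shift_sub(uint64_t num, uint64_t den, uint64_t* rem)
{
    uint64_t quot = 0, r = 0;
    for (int i = 63; i >= 0; --i) {
        r = (r << 1) | ((num >> i) & 1);    // bring down the next bit
        if (r >= den) {                     // divisor fits: subtract, set bit
            r -= den;
            quot |= (uint64_t)1 << i;
        }
    }
    if (rem) *rem = r;
    return quot;
}
```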


I have Googled around for several days looking at pages describing algorithms such as Karatsuba multiplication, high-radix division, and Newton-Raphson division, but the math symbols are a little too far over my head. I'd like to use some of these advanced methods to speed up my code, but I'd have to translate the "Greek" into something comprehensible first.
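Karatsuba's trick is less Greek than it looks. Here is an illustrative sketch on a 32x32 -> 64 product built from 16-bit halves: writing x = x1*2^16 + x0 and y = y1*2^16 + y0, the product is x*y = z2*2^32 + z1*2^16 + z0, and the trick is computing the middle term from the outer two, z1 = (x1+x0)*(y1+y0) - z2 - z0, i.e. three multiplies instead of four, paid for with extra additions. (At 128 bits those extra additions usually cost more than the saved multiply, which is why the answers advise against it here.)

```cpp
#include <cstdint>

inline uint64_t karatsuba32(uint32_t x, uint32_t y)
{
    uint64_t x1 = x >> 16, x0 = x & 0xFFFFu;
    uint64_t y1 = y >> 16, y0 = y & 0xFFFFu;
    uint64_t z2 = x1 * y1;                            // high halves
    uint64_t z0 = x0 * y0;                            // low halves
    uint64_t z1 = (x1 + x0) * (y1 + y0) - z2 - z0;    // the trick: one multiply saved
    return (z2 << 32) + (z1 << 16) + z0;
}
```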

For those who may deem my efforts "premature optimization": I consider this code a bottleneck because the elementary math operations themselves are slow. I can ignore this type of optimization in higher-level code, but this code will be called/used enough for it to matter.

I'd like suggestions on which algorithm I should use to improve multiply and divide (if possible), and a basic (hopefully easy to understand) explanation on how the suggested algorithm works would be highly appreciated.


EDIT: Multiply Improvements

I was able to improve the multiply operation by inlining code into operator*= and it seems as fast as possible.

Updated raw times:
1/60sec          31080833u
Addition:              ~8u
Subtraction:           ~8u
Multiplication:      ~100u (lowest ~86u, highest around ~256u)
Division:           ~4760u (with maximum bit count)

Here's some bare-bones code for you to examine (note that my type names are actually different, this was edited for simplicity):

//File: "int128_t.h"
class int128_t
{
    uint32_t dw3, dw2, dw1, dw0;

public:
    // Various constructors, operators, etc...

    int128_t& operator*=(const int128_t&  rhs) __attribute__((always_inline))
    {
        int128_t Urhs(rhs);
        uint32_t lhs_xor_mask = (int32_t(dw3) >> 31);
        uint32_t rhs_xor_mask = (int32_t(Urhs.dw3) >> 31);
        uint32_t result_xor_mask = (lhs_xor_mask ^ rhs_xor_mask);
        dw0 ^= lhs_xor_mask;
        dw1 ^= lhs_xor_mask;
        dw2 ^= lhs_xor_mask;
        dw3 ^= lhs_xor_mask;
        Urhs.dw0 ^= rhs_xor_mask;
        Urhs.dw1 ^= rhs_xor_mask;
        Urhs.dw2 ^= rhs_xor_mask;
        Urhs.dw3 ^= rhs_xor_mask;
        *this += (lhs_xor_mask & 1);
        Urhs += (rhs_xor_mask & 1);

        struct mul128_t
        {
            int128_t dqw1, dqw0;
            mul128_t(const int128_t& dqw1, const int128_t& dqw0): dqw1(dqw1), dqw0(dqw0){}
        };

        mul128_t data(Urhs,*this);
        asm volatile(
        "push      %%ebp                            \n\
        movl       %%eax,   %%ebp                   \n\
        movl       $0x00,   %%ebx                   \n\
        movl       $0x00,   %%ecx                   \n\
        movl       $0x00,   %%esi                   \n\
        movl       $0x00,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ebx                   \n\
        adcl       %%edx,   %%ecx                   \n\
        adcl       $0x00,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   16(%%ebp),   %%eax #Calc: (dw3*dw0)  \n\
        mull             12(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%ecx                   \n\
        adcl       %%edx,   %%esi                   \n\
        adcl       $0x00,   %%edi                   \n\
        movl   24(%%ebp),   %%eax #Calc: (dw1*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   20(%%ebp),   %%eax #Calc: (dw2*dw1)  \n\
        mull              8(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw2)  \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%esi                   \n\
        adcl       %%edx,   %%edi                   \n\
        movl   24(%%ebp),  %%eax #Calc: (dw1*dw2)   \n\
        mull              4(%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        movl   28(%%ebp),   %%eax #Calc: (dw0*dw3)  \n\
        mull               (%%ebp)                  \n\
        addl       %%eax,   %%edi                   \n\
        pop        %%ebp                            \n"
        :"=b"(this->dw0),"=c"(this->dw1),"=S"(this->dw2),"=D"(this->dw3)
        :"a"(&data):"%ebp");

        dw0 ^= result_xor_mask;
        dw1 ^= result_xor_mask;
        dw2 ^= result_xor_mask;
        dw3 ^= result_xor_mask;
        return (*this += (result_xor_mask & 1));
    }
};

As for division, examining the code is rather pointless, as I will need to change the mathematical algorithm to see any substantial benefits. The only feasible choice seems to be high-radix division, but I have yet to iron out (in my mind) just how it will work.
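The core high-radix idea, in its simplest form, is to consume a whole word per step instead of a single bit. A sketch for the special case of a 32-bit divisor (radix-2^32 short division): 4 steps for 128 bits, versus up to 128 steps for shift-subtract. The `u128w` layout (w[0] = least significant) is illustrative, and `den` must be nonzero.

```cpp
#include <cstdint>

struct u128w { uint32_t w[4]; };  // w[0] = least significant word

inline uint32_t div_small(u128w* num, uint32_t den)   // returns remainder
{
    uint64_t r = 0;
    for (int i = 3; i >= 0; --i) {                    // most significant word first
        uint64_t cur = (r << 32) | num->w[i];         // remainder so far + next word
        num->w[i] = (uint32_t)(cur / den);            // one 64/32 divide per word
        r = cur % den;
    }
    return (uint32_t)r;
}
```

A full 128/128 high-radix division (Knuth's Algorithm D) applies the same word-at-a-time idea, estimating one quotient digit per step from the leading words of the operands.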

Aurochs answered 8/1, 2012 at 7:32 Comment(13)
Why do you need 128-bit integers?Pains
I'm using them in fixed-point math, and I need the large size in order to preserve accuracy through calculations like line-line intersections, etc.Aurochs
Wouldn't floating point arithmetic be (a) sufficiently accurate and (b) a lot faster?Pains
I have in fact tested floating-point code and it runs faster, however I want fixed-point as an option, at the very least (to my estimation, extremely large levels will cause the double floating point to lose significant precision). In any case, I would still like to improve the class as I am also learning how this works and I'd like to make the class freely available once it is ready.Aurochs
Why not use a native 128-bit type?Unbind
Impossible, as this is 32bit-only code, and multiply/divide cannot be easily optimized with SSE2 instructions because the integer-based instructions that work with the entire %xmm register are quite lacking.Aurochs
Actually, there seem to be some useful 32bit-packed-multiply instructions, but the users I'm targeting might not have SSE2. The minimum requirement is MMX.Aurochs
SSE2 has been around since 2001. :/Unbind
I will be utilizing SSE2 but those optimizations are only optional in terms of absolute requirements to run the program (in some cases I can substitute with MMX).Aurochs
Knuth's TeX and METAFONT manage fine with mostly 16 bit fixed point numbers, to precision better than a wavelength of light...Timoshenko
Are you completely sure that (a) this precision is required, (b) it has to be fixed point, (c) no existing library does the job, (d) this is performance critical enough that it warrants losing sleep over? And (e) it will run on 32-bit machines, ever?Timoshenko
(a) To stop the math from falling apart at the maximum limits of the engine. (b) It's fixed point because it's a technical requirement of my engine (involved elsewhere in emulating behaviors that are inefficient to do with doubles). (c) There are libraries but they are all overkill for this - and in practice would be much slower than my inlined code, due to function call overhead. (d) Sure - it will be used enough to matter. The faster the better. (e) Yes. (f) I'm sticking with the current solution. Only the divide is slower than its corresponding FP operation. I can live with that.Aurochs
@Timoshenko whether or not it's appropriate for Simion32's usecase, it's still an interesting question for others that may wish to do 128-bit arithmetic. For example, they may wish to write a function to do efficient A*B+C of 64-bit numbers without overflow.Harrietharriett

I wouldn't worry much about multiplication. What you're doing seems quite efficient. I didn't really follow the Greek on the Karatsuba Multiplication, but my feeling is that it would be more efficient only with much larger numbers than you're dealing with.

One suggestion I do have is to try to use the smallest blocks of inline assembly, rather than coding your logic in assembly. You could write a function:

struct mul_result { unsigned int x[2]; };
static inline void mul_add(unsigned int a, unsigned int b, struct mul_result *res);

The function would be implemented in inline assembly, and you'll call it from C++ code. It should be as efficient as pure assembly, and much easier to code.
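One possible shape for such a helper, sketched here in plain C++ with a 64-bit intermediate rather than inline assembly (the `mul_acc` type and names are illustrative, not from the answer): a single 32x32 -> 64 multiply-accumulate step, small enough for the compiler to inline as one mull plus adds.

```cpp
#include <cstdint>

struct mul_acc { uint32_t lo, hi; };

static inline void mul_add(uint32_t a, uint32_t b, mul_acc* res)
{
    uint64_t t = (uint64_t)a * b + res->lo;   // one mull, then add in place
    res->lo = (uint32_t)t;
    res->hi += (uint32_t)(t >> 32);           // carry into the next word
}
```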

About division, I don't know. Most algorithms I saw talk about asymptotic efficiency, which may mean they're efficient only for very high numbers of bits.

Ennoble answered 8/1, 2012 at 8:1 Comment(4)
On multiplication, I suppose that taking the assembly and inlining it into the C++ operator*= function body would avoid a function call, but I don't think much else could be done unless there's some other obscure algorithm I haven't read about.Aurochs
I know C much more than C++, and there inlining a function removes the function call overhead completely. Overriding *= shouldn't change things much - it may make the code nicer maybe. And I doubt if you'll find an algorithm that's better for 128bit numbers.Ennoble
Depending on the speed of native multiplication Karatsuba Multiplication can be a significant win even for 128-bit multiplication.Architect
I inlined the assembly code for operator*=, and got an over 2x speedup! (But what to do about division, hmm...)Aurochs

Do I understand your data correctly that you are running your test on a 1.8 GHz machine and the "u" in your timings is processor cycles?

If so, 546 cycles for 10 32x32 bit MULs seem a bit slow to me. I have my own brand of bignums here on a 2GHz Core2 Duo and a 128x128=256 bit MUL runs in about 150 cycles (I do all 16 small MULs), i.e. about 6 times faster. But that could be simply a faster CPU.

Make sure you unroll the loops to save that overhead. Do as little register saving as is needed. Maybe it helps if you post the ASM code here, so we can review it.

Karatsuba will not help you, since it starts to be efficient only from some 20-40 32-bit words on.

Division is always much more expensive than multiplication. If you divide by a constant or by the same value many times, it might help to pre-compute the reciprocal and then multiply by it.
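A sketch of that reciprocal trick on 32-bit values (the names and the fix-up step are illustrative; production versions, as in compilers, pick the reciprocal and shift more carefully to avoid the fix-up entirely): precompute m ~ 2^32/d once, then each "division" by d becomes one multiply, a shift, and a small correction. The estimate never overshoots, so the fix-up loop runs at most a couple of times. Requires d >= 2.

```cpp
#include <cstdint>

struct recip32 { uint32_t m, d; };

inline recip32 make_recip(uint32_t d)                 // d >= 2
{
    recip32 r;
    r.d = d;
    r.m = (uint32_t)((1ull << 32) / d);               // floor(2^32 / d)
    return r;
}

inline uint32_t div_by_recip(uint32_t n, const recip32& r)
{
    uint32_t q = (uint32_t)(((uint64_t)n * r.m) >> 32);  // estimate, never too high
    while ((uint64_t)(q + 1) * r.d <= n) ++q;            // nudge up to floor(n/d)
    return q;
}
```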

Doggone answered 16/1, 2012 at 21:17 Comment(2)
The "u" are units as defined by QueryPerformanceFrequency(), which are (roughly) CPU cycles on my CPU. But see my comment on Ennoble's answer; I was able to speed it up over 2 times by inlining the code. It now takes ~86u with a max delay of ~256u (fluctuates wildly between these). The inline asm volatile block passes params by pointer and even pushes/pops %ebp (%ecx:%edx:%edi:%esi = result accumulator; %ebp = data pointer; %eax:%edx = multiply intermediate results). This is as fast as it will go without extra instruction overhead! For division I may have to tackle high-radix methods...Aurochs
I've edited in the code as you suggested, feel free to scrutinize... ;)Aurochs
