x86-64 Big Integer Representation?
Asked Answered
F

2

4

How do hig-performance native big-integer libraries on x86-64 represent a big integer in memory? (or does it vary? Is there a most common way?)

Naively I was thinking about storing them as 0-terminated strings of numbers in base 264.

For example suppose X is in memory as:

[8 bytes] Dn
.
.
[8 bytes] D2
[8 bytes] D1
[8 bytes] D0
[8 bytes] 0

Let B = 264

Then

X = Dn * Bn + ... + D2 * B2 + D1 * B1 + D0

The empty string (i.e. 8 bytes of zero) means zero.

Is this a reasonable way? What are the pros and cons of this way? Is there a better way?

How would you handle signedness? Does 2's complement work with this variable length value?

(Found this: http://gmplib.org/manual/Integer-Internals.html Whats a limb?)

Flessel answered 18/7, 2012 at 18:34 Comment(1)
A "limb" is the word size. (or rather base) It's usually 32 or 64-bits. So the number is represented in base 2^32 or 2^64.Solve
E
2

I would think it would be as an array lowest value to highest. I implemented addition of arbitrary sized numbers in assembler. The CPU provides the carry flag that allows you to easily perform these sorts of operations. You write a loop that performs the operation in byte size chunks. The carry flag is included in the next operation using the "Add with carry" instruction (ADC opcode).

Entellus answered 18/7, 2012 at 18:42 Comment(3)
is your asm bignum solution open source?Alcalde
No, sorry. I wrote it 25 years ago in college. Adding bignums is trivial to implement and there are already a couple of good open source libraries.Entellus
@JanusTroelsen: GMP is the go-to open-source implementation, LGPL license. gmplib.org/repo/gmp/file/tip/mpn/x86_64/aors_n.asm is the hand-written AT&T syntax asm implementation of adding (or subtracting: aors_...) two bigints as arrays of 64-bit chunks aka limbs. (Speed per limb is independent of limb size, other than memory bandwidth, so you always want to use the full integer register width. Extended precision is something 64-bit ISAs really shine at.) GMP sources use some preprocessor macro stuff, but are decently readable.Larena
S
1

Here I have some examples of processing Big Integers.

Addition

Principle is pretty simple. You need to use CF (carry-flag) for any bigger overflow, with adc (add with carry) propagating that carry between chunks. Let's think about two 128-bit number addition.

num1_lo: dq 1<<63
num1_hi: dq 1<<63
num2_lo: dq 1<<63
num2_hi: dq 1<<62
;Result of addition should be 0xC0000000 0x000000001 0x00000000 0x00000000
mov eax, dword [num1_lo]
mov ebx, dword [num1_lo+4]
mov ecx, dword [num1_hi]
mov edx, dword [num1_hi+4]

add eax, dword [num2_lo]
adc ebx, dword [num2_lo+4]
adc ecx, dword [num2_hi]
adc edx, dword [num2_hi+4]
; 128-bit integer sum in EDX:ECX:EBX:EAX
jc .overflow                   ; detect wrapping if you want

You don't need all of it in registers at once; you could store a 32-bit chunk before loading the next, because mov doesn't affect FLAGS. (Looping is trickier, although dec/jnz is usable on modern CPUs which don't have partial-flag stalls for ADC reading CF after dec writes other FLAGS. See Problems with ADC/SBB and INC/DEC in tight loops on some CPUs)

Subtraction

Very similar to addition, although you CF is now called borrow.

mov eax, dword [num1_lo]
mov ebx, dword [num1_lo+4]
mov ecx, dword [num1_hi]
mov edx, dword [num1_hi+4]

sub eax, dword [num2_lo]
sbb ebx, dword [num2_lo+4]
sbb ecx, dword [num2_hi]
sbb edx, dword [num2_hi+4]
jb .overflow    ;or jc

Multiplication

Is much more difficult. You need to multiply each part of first number with each part of second number and add the results. You don't have to multiply only two highest parts that will surely overflow. Pseudocode:

long long int /*128-bit*/ result = 0;
long long int n1 = ;
long long int n2 = ;
#define PART_WIDTH 32 //to be able to manipulate with numbers in 32-bit registers

int i_1 = 0; /*iteration index*/
for(each n-bit wide part of first number : n1_part) {
    int i_2 = 0;
    for(each n-bit wide part of second number : n2_part) {
        result += (n1_part << (i_1*PART_WIDTH))*(n2_part << (i_2*PART_WIDTH));
        i_2++;
    }
    i++;
}

Division

is even more complicated. User Brendan on OsDev.org forum posted example pseudocode for division of n-bit integers. I'm pasting it here because principle is the same.

result = 0;
count = 0;
remainder = numerator;

while(highest_bit_of_divisor_not_set) {
    divisor = divisor << 1;
    count++;
}
while(remainder != 0) {
    if(remainder >= divisor) {
        remainder = remainder - divisor;
        result = result | (1 << count);
    }
    if(count == 0) {
        break;
    }
    divisor = divisor >> 1;
    count--;
}

Dividing a wide number by a 1-chunk (32 or 64-bit number) can use a sequence of div instructions, using the remainder of the high element as the high half of the dividend for the next lower chunk. See Why should EDX be 0 before using the DIV instruction? for an example of when div is useful with non-zero EDX.

But this doesn't generalize to N-chunk / N-chunk division, hence the above manual shift / subtract algorithm.

Shockheaded answered 13/8, 2013 at 7:11 Comment(1)
the OP is asking about x86_64, not x86. And in x86_64 you only need 2 registers to represent a 128-bit number. Adding rdx:rax to r11:r10 only needs 2 instructions: add r10, rax; adc r11, rdxCohlier

© 2022 - 2024 — McMap. All rights reserved.