Fastest way to convert binary to decimal?
Asked Answered
W

6

8

I've got four unsigned 32-bit integers representing an unsigned 128-bit integer, in little endian order:

typedef struct {
    unsigned int part[4];
} bigint_t;

I'd like to convert this number into its decimal string representation and output it to a file.

Right now, I'm using a bigint_divmod10 function to divide the number by 10, keeping track of the remainder. I call this function repeatedly, outputting the remainder as a digit, until the number is zero. It's pretty slow. Is this the fastest way to do it? If so, is there a clever way to implement this function that I'm not seeing? I've tried looking at GMP's get_str.c, but I find it pretty impenetrable.

EDIT: here's the fastest code I was able to come up with for the divmod10 function:

static unsigned uint128_divmod10(uint128 *value)
{
    unsigned int a = value->word[3];
    unsigned int b = value->word[2];
    unsigned int c = value->word[1];
    unsigned int d = value->word[0];

    unsigned int diva = a / 5;
    unsigned int divb = b / 5;
    unsigned int divc = c / 5;
    unsigned int divd = d / 5;

    value->word[3] = diva;
    value->word[2] = divb;
    value->word[1] = divc;
    value->word[0] = divd;

    unsigned int moda = a - diva*5;
    unsigned int modb = b - divb*5;
    unsigned int modc = c - divc*5;
    unsigned int modd = d - divd*5;

    unsigned int mod = 0;
    mod += moda;
    unsigned int carryb = mod*858993459;
    mod += modb;
    if (mod >= 5) {
        mod -= 5;
        carryb++;
    }
    unsigned int carryc = mod*858993459;
    mod += modc;
    if (mod >= 5) {
        mod -= 5;
        carryc++;
    }
    unsigned int carryd = mod*858993459;
    mod += modd;
    if (mod >= 5) {
        mod -= 5;
        carryd++;
    }

    uint128_add(value, carryd, 0);
    uint128_add(value, carryc, 1);
    uint128_add(value, carryb, 2);

    if (value->word[0] & 1) {
        mod += 5;
    }
    uint128_shift(value, -1);
    return mod;
}

where the add function is defined as:

static void uint128_add(uint128 *value, unsigned int k, unsigned int pos)
{
    unsigned int a = value->word[pos];
    value->word[pos] += k;
    if (value->word[pos] < a) {
        // overflow
        for (int i=pos+1; i<4; i++) {
            value->word[i]++;
            if (value->word[i]) {
                break;
            }
        }
    }
}
Wulfila answered 6/11, 2009 at 7:33 Comment(2)
why not hexadecimal but decimal string representation? to hex is faster.Dental
Are you 100% sure that this is performance critical to your program, and thus warrants you spending time tweaking it, and your successors trying to understand what this mess is all about?Cockfight
W
5

It depends what else you're doing with the numbers. You can trade off a slight loss in space efficiency and a modest loss in efficiency of multiprecision arithmetic in return for very efficient conversion to and from decimal. The key is to do multiprecision arithmetic with a base that is a power of 10 rather than a power of 2.

For example, you might use base 10,000, where you pack one digit into a 16-bit word and you do your arithmetic on digits in 32-bit integers. (If you're on a 64-bit machine you can double that and do base 1,000,000,000.) This kind of code is relatively efficient timewise, although not quite as fast as using the native power of two because you can't take advantage of the carry bit on the hardware. And you can't represent as many integers in the same number of bits. But it's a whiz at converting to and from decimal, because you get to convert the individual digits without any long division.

If you need to represent the full range of numbers from zero to ((1 << 128) - 1), you can still do this, but add an extra digit, so your numbers will be bigger.

If it turns out you really need the extra space/speed (maybe you're doing a lot of cryptographic 128-bit calculations) then the method of simultanous div/mod by 10 is the fastest method I know. The only other trick is that if small integers are common, you can handle them specially. (That is, if the three most significant 32-bit words are all zero, just use the native division to convert.)

Is there a clever way to implement this function that I'm not seeing?

Dave Hanson's C Interfaces and Implementations has a lengthy chapter on multiprecision arithmetic. Dividing a large number by a single digit is a special case that has this efficient implementation:

int XP_quotient(int n, T z, T x, int y) {
    int i;
    unsigned carry = 0;
    for (i = n - 1; i >= 0; i--) {
        carry = carry*BASE + x[i];
        z[i] = carry/y;
        carry %= y;
    }
    return carry;
}

For full understanding, it really helps to have the book, but the source code is still a lot easier to understand than the GNU source code. And you could easily adapt it to use base 10,000 (it currently uses base 256).

Summary: if your performance bottleneck is conversion to decimal, implement multiprecision arithmetic with a base that is a power of 10. If your machine's native word size is 32 and you are using C code, use 10,000 in a 16-bit word.

Wist answered 7/11, 2009 at 22:10 Comment(0)
I
3

If your values are mostly less than ULLONG_MAX (18446744073709551615) I'd try to use for them sprintf(buf,"%llu",ullong_val). I bet this is rather well optimized in standard library, but parsing of format will take some cycles though.

Otherwise I'd create a bigint_divmod1000000000 (or better name mod10to9) function and use that. It would need 9 times less divides than bigint_divmod10.

Inartistic answered 6/11, 2009 at 8:0 Comment(1)
Huge divmod functions like that are actually slower (I tried).Wulfila
I
2

Lookup table of 8 bits. You can have 4 lookup tables of 256 numbers. First is from 0-256 for LSB bytes, Second table is first table multiplied by 256 and so on.

SO when you need your number sum up numbers from lookup table. When you adding you can add as bunary and go later one pass over each byte to fix owerflows.

Example number 0x12345678 In first lookup table there is under addres (0x78 = 120) so 0x010200 is first number in second table under(0x56=87) is 0x0202000106 (0x56 in dec is 22016) in third table you hou would have 0x03040007080702 and under last lable at 0x12 you have 0x030001090809080808 (this does not fit in 32 bit arithmetic, but that you allredy know)

Then sum up this numbers (as binary bumbers) and go one pass, byte by byte for overflow code in for loop is something like

s=carry+val[i];
val[i]=val[i]&10
carry=s/10; 
//you can put last two operations in table

If we count operations needed for this.

1.(looking in tables and adding) 4 lookup tables. 16 additions (keep in mind that when you do not need to carry about owerflow, becuase they can not ocur)
2. one pass in each step 3 operatins 16 steps to pass.

passimistic upper bound 6*16 = 100 operations.

EDIT:

Here is c++ code, and is 30% faster than naive implementation.

#include <iostream>
#include <stdint.h>
#include <array>

static uint64_t lu[4][256];

constexpr uint64_t lookup_value(uint64_t n) {
  uint64_t r = 0;
  uint64_t t = 1;
  while (n) {
    uint64_t rem = n % 10;
    n /= 10;
    r += rem * t;
    t *= 256;
  }
  return r;
}

void make_lu() {
  uint64_t step = 1;
  for (int j = 0; j < 4; ++j) {
    uint64_t n = 0;
    for (int i = 0; i < 256; ++i) {
      lu[j][i] = lookup_value(n);
      n += step;
    }
    step *= 256;
  }
}

struct DivMod {
  uint8_t div;
  uint8_t rem;
};

static DivMod dm[256];

void make_dm() {
  for (int i = 0; i < 256; ++i) {
    dm[i].div = i / 10;
    dm[i].rem = i % 10;
  }
}

void init() {
  make_lu();
  make_dm();
}

uint64_t b2d(uint64_t n) {
  uint64_t r = 0;
  for (int i = 0; i < 4; ++i) {
    r += lu[i][(n >> (i * 8)) & 0xff];
  }
  uint64_t r2 = 0;
  uint64_t of = 0;
  for (int i = 0; i < 8; ++i) {
    uint64_t v = ((r >> (i * 8)) & 0xff) + of;
    DivMod &x = dm[v];
    of = x.div;
    r2 += uint64_t(x.rem) << (i * 8);
  }
  return r2;
}

int main() {
  init();
  uint64_t n;
  std::cin >> n;
  std::cout << std::hex << b2d(n) << "\n";
  return 0;
}
Inoculate answered 6/11, 2009 at 8:36 Comment(6)
The numbers in the question are 128-bit; correct me if I'm wrong, but your answer seems to assume 32-bit numbers.Wulfila
As far as i know i made correct assumptions. In first step you made 4 additions of 128 bit numbers using lookuptable ( 16 addions in total ) actualy little less becuase you know that LSB Byte does not exceed 32bits. So For LSB byte only one addition instead of 4. I found mistake and did changegs in explanation. insted of 0x120 there is 0x010200Inoculate
val[i] & 10 (bitwise AND with 0b1010) makes no sense. It's not a remainder. Shift/mask only works for bases that are a power of 2, like hex or octal. The least-significant decimal digit depends on all the bits, so a lookup-table doesn't work at all.Mulberry
@PeterCordes That is correct. Main point was that one need only additions and lookups in small tables.Inoculate
I'm not sure exactly how your algorithm works. Your answer doesn't explain it well. I don't understand why you're using hex numbers when the OP wants a decimal string. Are you building a binary number whose hex digits have the values of the decimal digits of the original number? I also don't understand what types your loop variables have. e.g. is s a BigInteger? (large enough to hold a number with as many hex digits as the output can have decimal digits?). What type is val[]?Mulberry
@PeterCordes 32bit numbers are used for vectorization purpose. Long binary number is first split in 8 bits chunks (bytes). Then each byte is depending on index and value presented as decimal number. This can be read from table. We add up this numbers together and finnaly fix overflows. Final solution uses only addition and lookup tables, but no branching (if statements). I am not sure how it compares with other optimized solutions.Inoculate
W
0

For future reference, instead of implementing a uint128 type, I just used the characters of the string directly. This turned out to be much faster than going from string to uint128 and back.

Wulfila answered 7/11, 2009 at 7:40 Comment(1)
That makes sense if you only do a couple operations before converting to a string. Using 32-bit chunks of base 10^9 can work well, and be a good tradeoff between space and performance of addition/subtraction vs. division by (powers of) 10. I used that to calculate the first 1000 digits of Fibonacci(10^9) in reasonable time (~80 seconds), keeping only the leading 1009 digits after every step using division by 10. Only 105 bytes of x86 code, using a compare to generate the carry-out and wrap at 10^9. :)Mulberry
C
-1

The most immediate speedup will come from inlining the conversion rather than calling functions; it could be as simple as marking bigint_divmod10() inline, or using profile-guided optimisation as offered by your compiler.

Complain answered 6/11, 2009 at 7:54 Comment(0)
S
-1

I know this question is old, but I want to contribute as none put a way avoiding the division cycle. This one uses pow2, I haven't tested the benchmark but in theory should be faster than any other, and also could be tweaked in the pow function as well.

#include <iostream>
#include <cmath>
using namespace std;

#define MathBintodec(arr,len)({int dec=0;int ci_;for(ci_=len;ci_--;)dec+=arr[ci_]*pow(2,len-ci_-1);dec;})

int main(){
    int r[]={1,0,0,1,0,0};
    cout<<MathBintodec(r,6)<<endl;
}

Output: 36

Sift answered 3/7, 2013 at 3:43 Comment(1)
You're converting a place-value array to a binary integer. (And very inefficiently at that). The OP wants to convert uint128_t to a string efficiently.Mulberry

© 2022 - 2024 — McMap. All rights reserved.