Is there a fast way to convert a string of 8 ASCII decimal digits into a binary number?
Consider a string of eight ASCII digit characters such as "12345678". It can be converted to a value in which every byte holds one digit, like this:

const char* const str = "12345678";
const char* const base = "00000000";
const uint64_t unpacked = *reinterpret_cast<const uint64_t*>(str)
    - *reinterpret_cast<const uint64_t*>(base);

Then unpacked will be 0x0807060504030201 on a little-endian system.
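(Side note: the reinterpret_cast dereference above is a strict-aliasing violation. A memcpy-based load is a sketch of a well-defined equivalent, which compilers fold into a single 64-bit mov; the helper name unpack_digits is illustrative:)

#include <cstdint>
#include <cstring>

uint64_t unpack_digits(const char* s) {
    uint64_t raw;
    std::memcpy(&raw, s, 8);             // one 64-bit load after optimization
    return raw - 0x3030303030303030ULL;  // subtract '0' (0x30) from every byte; no borrows occur
}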

What is the fastest way to convert the number into 12345678, perhaps by multiplying it by some magic number or using SIMD up to AVX2?

UPDATE: 12345678 has to be a number stored in a 32-bit or 64-bit integer, not a string.

Krug answered 22/3, 2022 at 11:0 Comment(7)
Have a look at _mm_maddubs_epi16 and _mm_madd_epi16. – Elfland
First step: write a test (to check validity); second step: write a performance test (to check speed); this step is quite hard. Then do experiments. Writing code and assuming it is correct and fastest often leads off track. – Wendelina
"UPDATE: 12345678 has to be a number stored in a 32-bit or 64-bit integer, not a string." It is unclear what your input is; if your number is already stored as a uint64_t, then you don't need a conversion to get a uint64_t... – Stagg
@Jarod42, the input is a string; it's in the code snippet in the question. I'm simply looking for an optimization of parsing a string into an integer. – Krug
So you are looking for std::from_chars? – Stagg
See How to implement atoi using SIMD? for some ideas about using pmaddwd (and possibly pmaddubsw) with a vector of [1,10,1,10,...] for pairs of digits. That Q&A considers variable-length digit strings, though, IIRC, so it does more work. – Polypropylene
Your title is really misleading here. You're not packing (like BCD), you're converting an ASCII string of decimal digits to a binary integer. That's not just packing. – Polypropylene
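For reference, the std::from_chars mentioned above (C++17, header <charconv>) is the simple portable baseline against which the tricks below can be measured; a minimal sketch:

#include <charconv>
#include <cstdint>

uint32_t parse_from_chars(const char* s) {
    uint32_t value = 0;
    std::from_chars(s, s + 8, value);  // no locale, no allocation
    return value;
}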
Multiplication in binary is just a series of shifts and adds, so a SWAR approach is not too hard to understand. For a detailed walk-through, see the link in the code below:

#include <stdint.h>
#include <string.h>
#include <tmmintrin.h>  // SSSE3 intrinsics, for parse_8digits_simd_ssse3 below

// detailed walk-through: http://govnokod.ru/13461
static inline
uint32_t parse_8digits_swar_classic (const char* str) {
    uint64_t v;

    memcpy(&v, str, 8);  // eight ASCII digit bytes, leading digit in the low byte
    // each step merges adjacent fields; the multiplier is base*2^width + 1
    v = (v & 0x0F0F0F0F0F0F0F0F) * 2561 >> 8;             // 10*256 + 1: digit pairs
    v = (v & 0x00FF00FF00FF00FF) * 6553601 >> 16;         // 100*65536 + 1: 4-digit groups
    v = (v & 0x0000FFFF0000FFFF) * 42949672960001 >> 32;  // 10000*2^32 + 1: all 8 digits
    return v;
}

// attempt to improve the latency
static inline
uint32_t parse_8digits_swar_aqrit (const char* str) {
    const uint64_t mask = 0x000000FF000000FF;
    uint64_t v, t;

    memcpy(&v, str, 8);
    v = (v * 10) + (v >> 8);                      // merge digit pairs (ASCII bias still present)
    t = (v & mask) * 0x000F424000000064;          // weight pairs d0d1 and d4d5 by 10^6 and 10^2
    v = ((v >> 16) & mask) * 0x0000271000000001;  // weight pairs d2d3 and d6d7 by 10^4 and 1
    v = (v + t + 0xFF0915C600000000ULL) >> 32;    // constant cancels the accumulated '0' bias
    return v;
}

// SSSE3 needs fewer `shift & mask` operations...
static inline
uint32_t parse_8digits_simd_ssse3 (const char* str) {
    // byte weights for pmaddubsw: 200,20,200,20,100,10,10,1
    const __m128i mul1 = _mm_set_epi32(0, 0, 0x010A0A64, 0x14C814C8);
    // word weights for pmaddwd: 25000,250,10,1
    const __m128i mul2 = _mm_set_epi32(0, 0, 0x0001000A, 0x00FA61A8);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i v;

    v = _mm_loadl_epi64((const __m128i*)(const void*)str);  // movq: 8 ASCII bytes
    v = _mm_and_si128(v, mask);                             // strip the '0' bias
    v = _mm_madd_epi16(_mm_maddubs_epi16(mul1, v), mul2);
    // lane 0 holds digits 0-3 at half weight; double it and add lane 1 (digits 4-7)
    v = _mm_add_epi32(_mm_add_epi32(v, v), _mm_shuffle_epi32(v, 1));
    return (uint32_t)_mm_cvtsi128_si32(v);
}
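A quick sanity check for all three versions (a minimal driver; assumes the definitions above are in one translation unit and the file is compiled with -mssse3):

#include <stdio.h>

int main(void) {
    char buf[9] = "12345678";
    printf("%u\n", parse_8digits_swar_classic(buf));  // expect 12345678
    printf("%u\n", parse_8digits_swar_aqrit(buf));    // expect 12345678
    printf("%u\n", parse_8digits_simd_ssse3(buf));    // expect 12345678
    return 0;
}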
Coachwhip answered 23/3, 2022 at 22:21 Comment(1)
I thought this problem looked familiar; googling that 0x010A0A64 constant found SIMD string to unsigned int parsing in C# performance improvement, which you answered. – Polypropylene
On an older x86-64 system without AVX2, this simple version, which gathers the digits in tree fashion, is quite efficient, with performance on par with a simple SWAR-based implementation per my measurements. It does require a processor with a lot of instruction-level parallelism, however, as it comprises 50% more instructions than the SWAR-based code when compiled with full optimizations.

#include <stdint.h>

/* convert a string of exactly eight 'char' into a 32-bit unsigned integer */
uint32_t string_to_number (const char * s)
{
    /* gather digit pairs, then groups of four, then all eight */
    uint32_t t0 = s[0] * 10 + s[1];
    uint32_t t1 = s[2] * 10 + s[3];
    uint32_t t2 = s[4] * 10 + s[5];
    uint32_t t3 = s[6] * 10 + s[7];
    uint32_t s0 = t0 * 100 + t1;
    uint32_t s1 = t2 * 100 + t3;
    uint32_t num = s0 * 10000 + s1;
    /* compile-time constant: '0' * 11111111 = 533333328, the ASCII bias */
    uint32_t corr =
        '0' * 10000000 +
        '0' * 1000000 +
        '0' * 100000 +
        '0' * 10000 +
        '0' * 1000 +
        '0' * 100 +
        '0' * 10 +
        '0' * 1;
    return num - corr;
}
Besom answered 24/3, 2022 at 0:48 Comment(8)
Did you compare it against aqrit's SSSE3 version? For a single integer, it only needs SSSE3, and it can do 2 in parallel. (It's highly throughput-optimized, about 8 uops, not counting loading vector constants. Still not bad for latency either, but all the uops depend on each other, including two SIMD integer multiplies, except for a paddd / pshufd pair.) – Polypropylene
@PeterCordes I did not compare it because I did not see it until you pointed it out. I was mostly curious about the SWAR approach, because the simple tree approach needs only simple, high-throughput operations, except for three IMUL. – Besom
Fair enough. How did you test (I assume throughput, not latency, converting stuff in a loop)? And on what hardware? Zen / Ice Lake have wider pipelines than previous x86, and have more execution units that handle LEA, which I assume compilers will use a lot of, e.g. 2 for each x * 10 + y operation. – Polypropylene
AArch64 is interesting for the SWAR versions; it can encode immediates for bitwise ops like AND as repeating bit patterns, unlike x86-64, which needs clunky 10-byte instructions. godbolt.org/z/c1EPsdW5K. For your version, clang chooses to use a lot of madd instructions, some with a constant 10. But AArch64 can probably use SIMD efficiently for something similar to what we can do with SSSE3, probably even at baseline, unlike SSSE3. – Polypropylene
@PeterCordes (1) Throughput; (2) Ivy Bridge, I think; (3) Intel compiler. As Godbolt shows, Clang, GCC, and ICC produce pretty similar code when targeting generic x86-64, unsurprisingly dominated by movsx and 2-operand lea, both with a throughput of 2/cycle on all(?) Intel CPUs of the past ten years per Agner Fog's tables. The total number of instructions is 22 or 23. Compilers realize one can get by with one lea per movsx (e.g. godbolt). – Besom
LEA has 4/clock throughput on Ice Lake, 5/clock on Alder Lake P-cores (uops.info). Before that, yes, 2/clock for non-slow LEA (reg+reg*scale, but [reg+reg*scale+disp] is only port 1). MOVSX is handled purely in the load ports on Intel including IvB, but some CPUs have more efficient MOVZX than MOVSX, so unsigned char could be a good idea. And yeah, 1 LEA per MOV[ZS]X is 2 LEAs per x * 10 + y where both x and y are ASCII chars that need extension from byte to dword: x *= 5 and x * 2 + y, as in NASM Assembly convert input to integer? – Polypropylene
So you're right to do the correction at the end, instead of tempting compilers into using lea eax, [rax + 4*rax - 5*'0'] along the way. That would save instructions, but cost code size and limit it to 1 LEA per clock on Intel before ICL. – Polypropylene
@PeterCordes Sorry, still before first coffee. Yes, 1 lea per movsx <=> 2 lea per *10, so we are saying the same thing. – Besom
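A sketch of the movzx-friendly variant discussed in these comments (reading through unsigned char; otherwise identical to the answer's code, with the correction folded into one constant):

#include <stdint.h>

uint32_t string_to_number_u (const char * s_)
{
    const unsigned char * s = (const unsigned char *) s_;  /* movzx instead of movsx */
    uint32_t t0 = s[0] * 10 + s[1];
    uint32_t t1 = s[2] * 10 + s[3];
    uint32_t t2 = s[4] * 10 + s[5];
    uint32_t t3 = s[6] * 10 + s[7];
    uint32_t s0 = t0 * 100 + t1;
    uint32_t s1 = t2 * 100 + t3;
    return (s0 * 10000 + s1) - '0' * 11111111u;  /* 533333328: the ASCII bias */
}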
If you change your input format to a breadth-first (digit-interleaved) element order like this:

Sample: 9 numbers, interleaved
digit[]:
1 1 1 1 1 1 1 1 1 ... 2 2 2 2 2 2 2 2 2 ... 3 3 3 3 3 3 3 3 3 ...
(all first digits, then all second digits, then all third digits, ...)

for(int p=0; p<8; p++)           // digit positions, most significant first
{
    for(int i=0; i<9; i++)       // 9 numbers; no cross-iteration dependency
    {
        value[i] = value[i] * 10
                 + (digit[p*9 + i] - '0');
    }
}
// value vector copied to output, then cleared,
// before processing the next batch of 9 numbers

And if you convert more than just 9 values per batch, like 512 or 8192 (padded to a multiple of 32), the compiler should vectorize the inner loop.

To prepare the input, you can use 8 different channels, one per digit position of every parsed value, as in the sketch below.
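A sketch of that preparation step (the transpose_digits name is illustrative, not from the answer; src[k] points at the k-th 8-digit string):

// Transpose N 8-digit strings into the channel layout used above:
// chan[p*N + k] = digit p (most significant first) of number k.
void transpose_digits(const char** src, char* chan, int N)
{
    for (int p = 0; p < 8; p++)
        for (int k = 0; k < N; k++)
            chan[p*N + k] = src[k][p];
}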

Marbles answered 22/3, 2022 at 11:23 Comment(0)
I've implemented a small program to test some ideas. The AVX2 implementation is ~1.5 times faster than the naive one, with the table implementation in between:

AVX2: 12345678 in 3.42759
Naive: 12345678 in 5.12581
Table: 12345678 in 4.49478

Source code:

#include <cstdlib>
#include <cstdint>
#include <ctime>
#include <immintrin.h>
#include <iostream>
using namespace std;

const __m256i mul = _mm256_setr_epi32(10000000, 1000000, 100000, 10000, 1000, 100, 10, 1);

const volatile char* str = "12345678";
volatile uint32_t h;

const int64_t nIter = 1000LL * 1000LL * 1000LL;

inline void parse_avx2() {
  const char* const base = "00000000";
  const uint64_t unpacked = *reinterpret_cast<const volatile uint64_t*>(str)
    - *reinterpret_cast<const uint64_t*>(base);

  const __m128i a = _mm_set1_epi64x(unpacked);    // only the low 64 bits are used below
  const __m256i b = _mm256_cvtepu8_epi32(a);      // widen the 8 digit bytes to 8 dwords
  const __m256i d = _mm256_mullo_epi32(b, mul);   // scale each digit by its place value
  const __m128i e = _mm_add_epi32(_mm256_extractf128_si256(d, 0), _mm256_extractf128_si256(d, 1));
  const uint64_t f0 = _mm_extract_epi64(e, 0);
  const uint64_t f1 = _mm_extract_epi64(e, 1);
  const uint64_t g = f0 + f1;                     // horizontal reduction in scalar registers
  h = (g>>32) + (g&0xffffffff);
}

inline void parse_naive() {
  const char* const base = "00000000";
  const uint64_t unpacked = *reinterpret_cast<const volatile uint64_t*>(str)
    - *reinterpret_cast<const uint64_t*>(base);

  const uint8_t* a = reinterpret_cast<const uint8_t*>(&unpacked);
  h = a[7] + a[6]*10 + a[5]*100 + a[4]*1000 + a[3]*10000 + a[2]*100000 + a[1]*1000000 + a[0]*10000000;
}

uint32_t summands[8][10];
inline void parse_table() {
  const char* const base = "00000000";
  const uint64_t unpacked = *reinterpret_cast<const volatile uint64_t*>(str)
    - *reinterpret_cast<const uint64_t*>(base);

  const uint8_t* a = reinterpret_cast<const uint8_t*>(&unpacked);
  h = summands[7][a[0]] + summands[6][a[1]] + summands[5][a[2]] + summands[4][a[3]]
    + summands[3][a[4]] + summands[2][a[5]] + summands[1][a[6]] + summands[0][a[7]];
}

int main() {
  clock_t start = clock();
  for(int64_t i=0; i<nIter; i++) {
    parse_avx2();
  }
  clock_t end = clock();
  cout << "AVX2: " << h << " in " << double(end-start)/CLOCKS_PER_SEC << endl;

  start = clock();
  for(int64_t i=0; i<nIter; i++) {
    parse_naive();
  }
  end = clock();
  cout << "Naive: " << h << " in " << double(end-start)/CLOCKS_PER_SEC << endl;

  uint32_t mul=1;
  for(int i=0; i<8; i++, mul*=10) {
    for(int j=0; j<10; j++) {  // all ten digit values 0..9
      summands[i][j] = j*mul;
    }
  }

  start = clock();
  for(int64_t i=0; i<nIter; i++) {
    parse_table();
  }
  end = clock();
  cout << "Table: " << h << " in " << double(end-start)/CLOCKS_PER_SEC << endl;
  return 0;
}
Krug answered 22/3, 2022 at 13:44 Comment(1)
Unpacking to 32-bit right away is quite slow; vpmulld is 2 uops on Intel CPUs from Haswell onward (the ones which support AVX2). It's also unnecessary to broadcast unpacked before using only the low 64 bits of it anyway. And *reinterpret_cast<const volatile uint64_t*> is a pretty inefficient way to maybe work around the strict-aliasing undefined behaviour (if it works at all), instead of using a movq intrinsic like _mm_loadl_epi64 and SIMD _mm_and_si128 or _mm_sub_epi8 to get 0..9 integer digits out of the ASCII characters. – Polypropylene
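A sketch of the load sequence that comment suggests (SSE2 intrinsics; the load_digits name is illustrative):

#include <emmintrin.h>

// movq load of the 8 ASCII bytes, then a byte-wise subtract of '0':
// no strict-aliasing problems and no early widening to 32-bit lanes.
static inline __m128i load_digits(const char* s) {
    const __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(s));
    return _mm_sub_epi8(v, _mm_set1_epi8('0'));  // bytes now hold 0..9
}

The result can then feed a pmaddubsw/pmaddwd chain as in the SSSE3 answer above.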
