Implementing SSE 4.2's CRC32C in software

So I have a design which incorporates CRC32C checksums to ensure data hasn't been damaged. I decided to use CRC32C because I can have both a software version and a hardware-accelerated version if the computer the software runs on supports SSE 4.2.

I'm going by Intel's developer manual (vol 2A), which seems to provide the algorithm behind the crc32 instruction. However, I'm having little luck. Intel's developer guide says the following:

BIT_REFLECT32: DEST[31-0] = SRC[0-31]
MOD2: Remainder from Polynomial division modulus 2

TEMP1[31-0] <- BIT_REFLECT(SRC[31-0])
TEMP2[31-0] <- BIT_REFLECT(DEST[31-0])
TEMP3[63-0] <- TEMP1[31-0] << 32
TEMP4[63-0] <- TEMP2[31-0] << 32
TEMP5[63-0] <- TEMP3[63-0] XOR TEMP4[63-0]
TEMP6[31-0] <- TEMP5[63-0] MOD2 0x11EDC6F41
DEST[31-0]  <- BIT_REFLECT(TEMP6[31-0])

Now, as far as I can tell, I've done everything up to the line starting TEMP6 correctly, but I think I may be either misunderstanding the polynomial division or implementing it incorrectly. If my understanding is correct, 1 / 1 mod 2 = 1, 0 / 1 mod 2 = 0, and division by zero is undefined in both cases.

What I don't understand is how binary division with 64-bit and 33-bit operands will work. If SRC is 0x00000000, and DEST is 0xFFFFFFFF, TEMP5[63-32] will be all set bits, while TEMP5[31-0] will be all unset bits.

If I were to use the bits from TEMP5 as the numerator, there would be 31 divisions by zero, as the polynomial 11EDC6F41 is only 33 bits long (so converting it to a 64-bit unsigned integer leaves the top 31 bits unset), and the denominator is therefore unset for those 31 bit positions.

However, if I were to use the polynomial as the numerator, the bottom 32 bits of TEMP5 are unset, resulting in divisions by zero there, and the top 31 bits of the result would be zero, as the top 31 bits of the numerator would be zero, since 0 / 1 mod 2 = 0.

Am I misunderstanding how this works? Just plain missing something? Or has Intel left out some crucial step in their documentation?

The reason I went to Intel's developer guide for what appeared to be the algorithm they used is because they used a 33-bit polynomial, and I wanted to make the outputs identical, which didn't happen when I used the 32-bit polynomial 1EDC6F41 (shown below).

uint32_t poly = 0x1EDC6F41, sres, crcTable[256], data = 0x00000000;
unsigned n, k;

for (n = 0; n < 256; n++) {
    sres = n;
    for (k = 0; k < 8; k++)
        sres = (sres & 1) == 1 ? poly ^ (sres >> 1) : (sres >> 1);
    crcTable[n] = sres;
}
sres = 0xFFFFFFFF;

for (n = 0; n < 4; n++) {
    sres = crcTable[(sres ^ data) & 0xFF] ^ (sres >> 8);
}

The above code produces 4138093821 as an output, and the crc32 opcode produces 2346497208 using the input 0x00000000.

Sorry if this is badly written or incomprehensible in places, it is rather late for me.

Airdrie answered 15/7, 2013 at 0:27 Comment(2)
For those using Delphi, I've written some Open Source code using the new crc32 hardware instruction if available, and fast x86 asm or pure pascal code (using pre-computed tables) if SSE 4.2 is not available. Naive rolled version runs at 330 MB/s, optimized unrolled x86 asm performs at 1.7 GB/s, and SSE 4.2 hardware gives an amazing 3.7 GB/s speed (on both Win32 and Win64 platforms).Pawn
If it's legal for you to read LGPL code, see code.woboq.org/qt5/qtbase/src/corelib/tools/qhash.cpp.html#95Beisel

Here are both software and hardware versions of CRC-32C. The software version is optimized to process eight bytes at a time. The hardware version is optimized to run three crc32q instructions effectively in parallel on a single core, since the throughput of that instruction is one cycle, but the latency is three cycles.

crc32c.c:

/* crc32c.c -- compute CRC-32C using the Intel crc32 instruction
 * Copyright (C) 2013, 2021, 2023 Mark Adler
 * Version 1.3  7 Dec 2023  Mark Adler
 */

/*
  This software is provided 'as-is', without any express or implied
  warranty. In no event will the author be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  [email protected]
 */

/* Version History:
 1.0  10 Feb 2013  First version
 1.1  31 May 2021  Correct register constraints on assembly instructions
                   Include pre-computed tables to avoid use of pthreads
                   Return zero for the CRC when buf is NULL, as initial value
 1.2   5 Jun 2021  Make tables constant
 1.3   7 Dec 2023  Improve register constraints for optimization
 */

// Use hardware CRC instruction on Intel SSE 4.2 processors. This computes a
// CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc. A software
// version is provided as a fall-back, as well as for speed comparisons.

#include <stddef.h>
#include <stdint.h>

// Tables for CRC word-wise calculation, definitions of LONG and SHORT, and CRC
// shifts by LONG and SHORT bytes.
#include "crc32c.h"

// Table-driven software version as a fall-back. This is about 15 times slower
// than using the hardware instructions. This assumes little-endian integers,
// as is the case on Intel processors that the assembler code here is for.
static uint32_t crc32c_sw(uint32_t crc, void const *buf, size_t len) {
    if (buf == NULL)
        return 0;
    unsigned char const *data = buf;
    while (len && ((uintptr_t)data & 7) != 0) {
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
        len--;
    }
    size_t n = len >> 3;
    for (size_t i = 0; i < n; i++) {
        uint64_t word = crc ^ ((uint64_t const *)data)[i];
        crc = crc32c_table[7][word & 0xff] ^
              crc32c_table[6][(word >> 8) & 0xff] ^
              crc32c_table[5][(word >> 16) & 0xff] ^
              crc32c_table[4][(word >> 24) & 0xff] ^
              crc32c_table[3][(word >> 32) & 0xff] ^
              crc32c_table[2][(word >> 40) & 0xff] ^
              crc32c_table[1][(word >> 48) & 0xff] ^
              crc32c_table[0][word >> 56];
    }
    data += n << 3;
    len &= 7;
    while (len) {
        len--;
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
    }
    return crc;
}

// Apply the zeros operator table to crc.
static uint32_t crc32c_shift(uint32_t const zeros[][256], uint32_t crc) {
    return zeros[0][crc & 0xff] ^ zeros[1][(crc >> 8) & 0xff] ^
           zeros[2][(crc >> 16) & 0xff] ^ zeros[3][crc >> 24];
}

// Compute CRC-32C using the Intel hardware instruction. Three crc32q
// instructions are run in parallel on a single core. This gives a
// factor-of-three speedup over a single crc32q instruction, since the
// throughput of that instruction is one cycle, but the latency is three
// cycles.
static uint32_t crc32c_hw(uint32_t crc, void const *buf, size_t len) {
    if (buf == NULL)
        return 0;

    // Pre-process the crc.
    uint64_t crc0 = crc ^ 0xffffffff;

    // Compute the crc for up to seven leading bytes, bringing the data pointer
    // to an eight-byte boundary.
    unsigned char const *next = buf;
    while (len && ((uintptr_t)next & 7) != 0) {
        __asm__("crc32b\t%1, %0"
                : "+r"(crc0)
                : "m"(*next));
        next++;
        len--;
    }

    // Compute the crc on sets of LONG*3 bytes, making use of the three-cycle
    // latency and one-cycle throughput of a single core.
    while (len >= LONG*3) {
        uint64_t crc1 = 0;
        uint64_t crc2 = 0;
        unsigned char const *end = next + LONG;
        do {
            __asm__("crc32q\t%3, %0\n\t"
                    "crc32q\t%4, %1\n\t"
                    "crc32q\t%5, %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "m"(*(uint64_t const *)next),
                      "m"(*(uint64_t const *)(next + LONG)),
                      "m"(*(uint64_t const *)(next + 2*LONG)));
            next += 8;
        } while (next < end);
        crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
        crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
        next += LONG*2;
        len -= LONG*3;
    }

    // Do the same thing, but now on SHORT*3 blocks for the remaining data less
    // than a LONG*3 block.
    while (len >= SHORT*3) {
        uint64_t crc1 = 0;
        uint64_t crc2 = 0;
        unsigned char const *end = next + SHORT;
        do {
            __asm__("crc32q\t%3, %0\n\t"
                    "crc32q\t%4, %1\n\t"
                    "crc32q\t%5, %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "m"(*(uint64_t const *)next),
                      "m"(*(uint64_t const *)(next + SHORT)),
                      "m"(*(uint64_t const *)(next + 2*SHORT)));
            next += 8;
        } while (next < end);
        crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
        crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
        next += SHORT*2;
        len -= SHORT*3;
    }

    // Compute the crc on the remaining eight-byte units less than a SHORT*3
    // block.
    unsigned char const *end = next + (len - (len & 7));
    while (next < end) {
        __asm__("crc32q\t%1, %0"
                : "+r"(crc0)
                : "m"(*(uint64_t const *)next));
        next += 8;
    }
    len &= 7;

    // Compute the crc for up to seven trailing bytes.
    while (len) {
        __asm__("crc32b\t%1, %0"
                : "+r"(crc0)
                : "m"(*next));
        next++;
        len--;
    }

    // Return the crc, post-processed.
    return ~(uint32_t)crc0;
}

// Check for SSE 4.2. SSE 4.2 was first supported in Nehalem processors
// introduced in November, 2008. This does not check for the existence of the
// cpuid instruction itself, which was introduced on the 486SL in 1992, so this
// will fail on earlier x86 processors. cpuid works on all Pentium and later
// processors.
#define SSE42(have) \
    do { \
        uint32_t eax, ecx; \
        eax = 1; \
        __asm__("cpuid" \
                : "=c"(ecx) \
                : "a"(eax) \
                : "%ebx", "%edx"); \
        (have) = (ecx >> 20) & 1; \
    } while (0)

// Compute a CRC-32C. If the crc32 instruction is available, use the hardware
// version. Otherwise, use the software version.
uint32_t crc32c(uint32_t crc, void const *buf, size_t len) {
    int sse42;
    SSE42(sse42);
    return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len);
}

#ifdef TEST

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SIZE (262144*3)
#define CHUNK SIZE

int main(int argc, char **argv) {
    (void)argv;
    uint32_t crc = 0;
    char *buf = malloc(SIZE);
    if (buf == NULL) {
        fputs("out of memory", stderr);
        return 1;
    }
    ssize_t got;
    while ((got = read(0, buf, SIZE)) > 0) {
        size_t off = 0;
        do {
            size_t n = (size_t)got - off;
            if (n > CHUNK)
                n = CHUNK;
            crc = argc > 1 ? crc32c_sw(crc, buf + off, n) :
                             crc32c(crc, buf + off, n);
            off += n;
        } while (off < (size_t)got);
    }
    free(buf);
    if (got == -1) {
        fputs("read error\n", stderr);
        return 1;
    }
    printf("%08x\n", crc);
    return 0;
}

#endif /* TEST */

Code to generate crc32c.h (stackoverflow won't let me post the tables themselves, due to a 30,000 character limit in an answer):

// Generate crc32c.h for crc32c.c.

#include <stdio.h>
#include <stdint.h>

#define LONG 8192
#define SHORT 256

// Print a 2-D table of four-byte constants in hex.
static void print_table(uint32_t *tab, size_t rows, size_t cols, char *name) {
    printf("static uint32_t const %s[][%zu] = {\n", name, cols);
    size_t end = rows * cols;
    size_t k = 0;
    for (;;) {
        fputs("   {", stdout);
        size_t n = 0, j = 0;
        for (;;) {
            printf("0x%08x", tab[k + n]);
            if (++n == cols)
                break;
            putchar(',');
            if (++j == 6) {
                fputs("\n   ", stdout);
                j = 0;
            }
            putchar(' ');
        }
        k += cols;
        if (k == end)
            break;
        puts("},");
    }
    puts("}\n};");
}

/* CRC-32C (iSCSI) polynomial in reversed bit order. */
#define POLY 0x82f63b78

static void crc32c_word_table(void) {
    uint32_t table[8][256];

    // Generate byte-wise table.
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = ~n;
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
        table[0][n] = ~crc;
    }

    // Use byte-wise table to generate word-wise table.
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = ~table[0][n];
        for (unsigned k = 1; k < 8; k++) {
            crc = table[0][crc & 0xff] ^ (crc >> 8);
            table[k][n] = ~crc;
        }
    }

    // Print table.
    print_table(table[0], 8, 256, "crc32c_table");
}

// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial. For speed, this requires that a not be zero.
static uint32_t multmodp(uint32_t a, uint32_t b) {
    uint32_t prod = 0;
    for (;;) {
        if (a & 0x80000000) {
            prod ^= b;
            if ((a & 0x7fffffff) == 0)
                break;
        }
        a <<= 1;
        b = b & 1 ? (b >> 1) ^ POLY : b >> 1;
    }
    return prod;
}

/* Take a length and build four lookup tables for applying the zeros operator
   for that length, byte-by-byte, on the operand. */
static void crc32c_zero_table(size_t len, char *name) {
    // Generate operator for len zeros.
    uint32_t op = 0x80000000;               // 1 (x^0)
    uint32_t sq = op >> 4;                  // x^4
    while (len) {
        sq = multmodp(sq, sq);              // x^2^(k+3), k == len bit position
        if (len & 1)
            op = multmodp(sq, op);
        len >>= 1;
    }

    // Generate table to update each byte of a CRC using op.
    uint32_t table[4][256];
    for (unsigned n = 0; n < 256; n++) {
        table[0][n] = multmodp(op, n);
        table[1][n] = multmodp(op, n << 8);
        table[2][n] = multmodp(op, n << 16);
        table[3][n] = multmodp(op, n << 24);
    }

    // Print the table to stdout.
    print_table(table[0], 4, 256, name);
}

int main(void) {
    puts(
"// crc32c.h\n"
"// Tables and constants for crc32c.c software and hardware calculations.\n"
"\n"
"// Table for a 64-bits-at-a-time software CRC-32C calculation. This table\n"
"// has built into it the pre and post bit inversion of the CRC."
    );
    crc32c_word_table();
    puts(
"\n// Block sizes for three-way parallel crc computation.  LONG and SHORT\n"
"// must both be powers of two."
        );
    printf("#define LONG %d\n", LONG);
    printf("#define SHORT %d\n", SHORT);
    puts(
"\n// Table to shift a CRC-32C by LONG bytes."
    );
    crc32c_zero_table(8192, "crc32c_long");
    puts(
"\n// Table to shift a CRC-32C by SHORT bytes."
    );
    crc32c_zero_table(256, "crc32c_short");
    return 0;
}
Carinthia answered 15/7, 2013 at 4:28 Comment(36)
I can't seem to find mention of a crc32q instruction anywhere (just references to aviation), nor a crc32b instruction, which you've used in your inline assembly, and MSVC doesn't accept it in its inline assembly.Airdrie
That was written for the GNU compiler (gcc), which uses the AT&T syntax for assembler instructions, as opposed to the Intel syntax. The AT&T syntax is much more clear about what instruction is generated, since it doesn't depend on argument typing for that (e.g. dword ptr, etc.). Your assembler probably uses the Intel syntax, where the crc32 "instruction" can actually generate one of six different instructions. Which one has to be determined by the assembler, as well as by a human attempting to read the code, based on the nature of the arguments.Carinthia
crc32q operates on a "quadword", 64 bits, at a time. crc32b operates on a byte at a time.Carinthia
Also note when translating from AT&T to Intel, that AT&T uses source followed by destination, whereas Intel uses destination followed by source.Carinthia
Thank you. I'd known about the Source-Destination thing, but not that GCC/AT&T used different opcode names to specify different operand sizes.Airdrie
The reason to process 3 buffers in parallel is that the CRC32C instruction is pipelined and has a 3 cycle latency with 1 cycle throughput - you can get one CRC32C instruction retiring every clock cycle provided the result isn't used as input to another CRC32C instruction for 3 cycles... there is only one ALU capable of executing CRC32C - instructions dispatch to it through port 1, this ALU does "complex/slow" integer instructions. The other ALUs cannot handle CRC32C. intel.com/content/dam/www/public/us/en/documents/manuals/…Cubical
Thanks! I misconstrued why doing four CRC instructions in parallel doesn't help. I will fix the comments.Carinthia
You are welcome - and thanks for posting your source code. I predict Silvermont (new Atom micro-architecture), with only 2 integer ALUs will still benefit from your implementation... only 1 ALU computes CRC32 but in your implementation that ALU has 3 different CRC32 instructions in the pipeline, each at a different stage of execution.Cubical
I have wrapped the code in a library for Windows and added .NET wrapper and NuGet packages. I've also speeded up the software fallback by 50%.Crop
Mark, Valgrind is confused with SSE42() macro in crc32c_init(): "Conditional jump or move depends on uninitialised value(s)".Lipid
@Lipid Of course it is. I wouldn't expect Valgrind to know how to interpret assembler instructions.Carinthia
Nice answer, but note that C++ constexpr initialization of lookup tables may be faster than this C version since you may pay a little cost on every call because of that pthread_once_tHeroine
Using more accumulators than the bare minimum to hide latency can help a bit, especially when the inputs are coming from memory (so cache misses are possible). See the experimental results with FMA on Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?. I didn't try benchmarks with CRC32, though. And BTW, Agner Fog's testing shows Silvermont / Knight's Landing has 6-cycle latency for crc32 r64 (but still 1c throughput), so more would help there. But small buffers could spend more time in prolog/epilogue :/Beatitude
Your asm constraints aren't strictly safe. A pointer in a "r" constraint does not imply that the pointed-to memory is also an input. You could just use memory operands (or the _mm_crc32_u64 intrinsic), but if you want to force the addressing mode you can use a dummy memory operand like asm("crc32 (%1),%0 " : "+r"(crc0) : "r"(next), "m" (*(const char (*)[24]) pStr) );. Inlining or LTO could break this. See at&t asm inline c++ problem (which needs an update: pointer-to-array is cleaner than a struct with a flexible array member).Beatitude
@PeterCordes What exactly is the anticipated danger?Carinthia
foo = bar; tmp1 = crc(&foo, 4); foo = baz; tmp2 = crc(&foo, 4); could see the foo=bar assignment as a dead store, not realizing that crc(&foo, 4); cares about the value of foo instead of just the address. A non-inline function can't have a problem, but if it inlines (e.g. with link-time optimization) then the compiler could reorder things so the asm statements are ahead of the stores that leave foo with its final value of baz. It doesn't see a dependency. (And foo=bar never happens). For this case, a "rm" input would let the compiler optimize the operand into a register.Beatitude
BTW as possible speed optimization in C++17 you could populate your tables at compile time using constexpr and std::array and then skip use of pthread_once_t. Unfortunately without benchmarking it is hard to know if it would make any noticable perf diff.Heroine
@NoSenseEtAl: You can do that in normal C++ with a class at global scope, with a constructor. (A static const crctable_t LUT; inside the function would still check a static flag to see if the object was already constructed or not, if the compiler doesn't compute the table at compile time and store it in the file. At least the check would be inlined to ~2 instructions on the fast path on x86, though; much cheaper than a presumably non-inline call to a pthread function.) But anyway, I think Mark is aiming for a C implementation, given the comments at the top of the code block.Beatitude
You can always get the constexpr-like behavior in C with a big brace-enclosed initializer list. Storing big tables in the .data (or .rodata) section versus creating them dynamically at startup (e.g., constructor as @PeterCordes suggested), versus creating them lazily is an interesting question in its own right. Putting them in .data moves all the calculation overhead outside of runtime, but bloats the binary, prevents hugepage use and can be the slowest if cold (disk reads). At construction means you pay the price even if you don't use the table (bad if you have many ...Ridgeling
... such tables), and lazy adds some instructions every time you access the table. However, the lazy method works great when you are using the table inside a method that does many table lookups, since you only pay the "is it initialized?" check once. Also, in practice, this lazy check can be very cheap on x86: a single fused cmp; jz pair.Ridgeling
my point was that constexpr makes it easier since you can do computations at compile time. Aka no need to manually unroll the logic into huge initializer list, you can do something like this: wandbox.org/permlink/CouhNWmaNHEkUUmGHeroine
also I agree with everything you said, I just think that constexpr may still be faster, but like I said without realistic benchmark IDK.Heroine
@Heroine - I'm not sure what you mean by "manually unroll", but yes, constexpr is much nicer. For C you'll probably have as part of your build process a small program that runs the table building logic and prints it to a file which is subsequently compiled into the main binary. constexpr lets you skip that whole step. You are saying that constexpr is faster than what? Than lazy initialized tables? Yes, by a bit. Compared to using C++ constructors to calculate it at runtime? Probably not at the point of use, but definitely yes at startup!Ridgeling
by unrolling I meant that instead of a for loop you manually (or by codegen) unroll that loop into an initializer list.... not in the sense that compilers unroll loops, but as a manual source transformation.Heroine
According to Intel, the three-way algorithm is faster only with sequences longer than ~250 bytes. In case you want compute the crc for a short string, just use the sequential version of it. See Figures 1 & 2 here: intel.com/content/dam/www/public/us/en/documents/white-papers/…Amberlyamberoid
@Nybble If you look at the code, you will see that it does just that for less than 768 bytes of data.Carinthia
@rcgldr Because that is the definition of that CRC. If you're asking why it's common for CRCs to be both initialized with and then exclusive-or'ed with –1, I suppose because it is aesthetically pleasing for the CRC of zero bytes (the empty set) to be zero.Carinthia
@MarkAdler - Update on benchmarks. I modified your code to use Visual Studio intrincsics and confirmed the generated code was a tight loop with 3 crc32q's in a row. Throughput is ~18.150 GB/sec. I only get a bit more with 8 mov instructions in a tight loop reading from two chunks (two cache lines), ~18.175 GB/sec. If the 8 movs are sequential, one chunk (one cache line), it's slower at ~14.9 GB/sec, so crc32q's from 3 chunks (3 cache lines) is helping.Henrique
@MarkAdler - (pre + post compement), what I was getting at is that your function could work like crc32q, don't modify the input crc, don't post complement the crc, then do this from either a higher level helper function or from the main code.Henrique
Ah. Well, then it's a good thing it's open source. Trivial change if desired.Carinthia
Note that this code was adopted for CRC computation in Julia, and updated to have precomputed tables, runtime cpuid detection, ARMv8 support, and some other portability improvements: github.com/JuliaLang/julia/blob/master/src/crc32c.cDiscriminator
@Discriminator I have updated the code in this answer. The register constraints in the assembly code, which was inherited in your code, had a problem which is now fixed. As well as some other improvements.Carinthia
@MarkAdler, thanks, the Julia version has been updated to fix the register constraints.Discriminator
@Discriminator I have updated the assembler code constraints yet again, to improve the ability of the compiler to optimize the addressing.Carinthia
@MarkAdler, we actually already did something similar in our version.Discriminator
@Discriminator Not quite. See my note there. You need to get rid of the r in order to set the compiler loose.Carinthia

Mark Adler's answer is correct and complete, but those seeking a quick and easy way to integrate CRC-32C into their application might find it a little difficult to adapt the code, especially if they are using Windows and .NET.

I've created a library that implements CRC-32C using either the hardware or the software method, depending on available hardware. It's available as a NuGet package for C++ and .NET. It's open source, of course.

Besides packaging Mark Adler's code above, I've found a simple way to improve throughput of the software fallback by 50%. On my computer, the library now achieves 2 GB/s in software and over 20 GB/s in hardware. For those curious, here's the optimized software implementation:

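/* Note: 'buffer' is this library's byte-pointer type (presumably something
   like typedef const uint8_t *buffer;), and 'table' is presumably its widened
   lookup table: 16 rows of 256 uint32_t entries for the 64-bit path below,
   of which the 32-bit path uses 12. */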
static uint32_t append_table(uint32_t crci, buffer input, size_t length)
{
    buffer next = input;
#ifdef _M_X64
    uint64_t crc;
#else
    uint32_t crc;
#endif

    crc = crci ^ 0xffffffff;
#ifdef _M_X64
    while (length && ((uintptr_t)next & 7) != 0)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    while (length >= 16)
    {
        crc ^= *(uint64_t *)next;
        uint64_t high = *(uint64_t *)(next + 8);
        crc = table[15][crc & 0xff]
            ^ table[14][(crc >> 8) & 0xff]
            ^ table[13][(crc >> 16) & 0xff]
            ^ table[12][(crc >> 24) & 0xff]
            ^ table[11][(crc >> 32) & 0xff]
            ^ table[10][(crc >> 40) & 0xff]
            ^ table[9][(crc >> 48) & 0xff]
            ^ table[8][crc >> 56]
            ^ table[7][high & 0xff]
            ^ table[6][(high >> 8) & 0xff]
            ^ table[5][(high >> 16) & 0xff]
            ^ table[4][(high >> 24) & 0xff]
            ^ table[3][(high >> 32) & 0xff]
            ^ table[2][(high >> 40) & 0xff]
            ^ table[1][(high >> 48) & 0xff]
            ^ table[0][high >> 56];
        next += 16;
        length -= 16;
    }
#else
    while (length && ((uintptr_t)next & 3) != 0)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    while (length >= 12)
    {
        crc ^= *(uint32_t *)next;
        uint32_t high = *(uint32_t *)(next + 4);
        uint32_t high2 = *(uint32_t *)(next + 8);
        crc = table[11][crc & 0xff]
            ^ table[10][(crc >> 8) & 0xff]
            ^ table[9][(crc >> 16) & 0xff]
            ^ table[8][crc >> 24]
            ^ table[7][high & 0xff]
            ^ table[6][(high >> 8) & 0xff]
            ^ table[5][(high >> 16) & 0xff]
            ^ table[4][high >> 24]
            ^ table[3][high2 & 0xff]
            ^ table[2][(high2 >> 8) & 0xff]
            ^ table[1][(high2 >> 16) & 0xff]
            ^ table[0][high2 >> 24];
        next += 12;
        length -= 12;
    }
#endif
    while (length)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    return (uint32_t)crc ^ 0xffffffff;
}

As you can see, it merely crunches a larger block at a time. It needs a larger lookup table, but it's still cache-friendly. The table is generated the same way, only with more rows.
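
For reference, here is a sketch of how such a wider table can be generated. It is the standard slicing-by-N construction with the reflected CRC-32C polynomial; unlike the tables in crc32c.h from the answer above, it does not bake in the pre/post inversion, because append_table applies those explicitly. The names and the choice of 16 rows are illustrative, not the library's actual code:

#include <stdint.h>

#define POLY 0x82f63b78          /* CRC-32C polynomial, reflected (iSCSI) */
#define ROWS 16                  /* 16 rows -> 16 bytes consumed per iteration */

static uint32_t table[ROWS][256];

static void make_table(void)
{
    /* Row 0 is the ordinary byte-wise (reflected) CRC-32C table. */
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = n;
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
        table[0][n] = crc;
    }
    /* Row k is row k-1 advanced by one more zero byte. */
    for (unsigned n = 0; n < 256; n++)
        for (unsigned k = 1; k < ROWS; k++)
            table[k][n] = (table[k - 1][n] >> 8) ^ table[0][table[k - 1][n] & 0xff];
}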

One extra thing I explored is the use of the PCLMULQDQ instruction to get hardware acceleration on AMD processors. I've managed to port Intel's CRC patch for zlib (also available on GitHub) to the CRC-32C polynomial, except for the magic constant 0x9db42487. If anyone is able to decipher that one, please let me know. Update: after supersaw7's excellent explanation on reddit, I have now ported the elusive 0x9db42487 constant as well; I just need to find some time to polish and test it.

Streeto answered 20/2, 2014 at 17:28 Comment(7)
+1 Thanks for sharing your code. It helps me a lot when porting it to Delphi.Pawn
I fixed the link to the patch and added some additional links. Did you progress on this issue, Robert?Fungiform
it seems that cloudflare's zlib with PCLMULQDQ support does not use the constant... maybe that is useful to you?Fungiform
PCLMULQDQ is no longer a mystery. See updated answer.Crop
@RobertVažan - maybe too late, but I have working versions using pclmulqdq converted to work with Visual Studio assembler (ML64.EXE), for both left and right shifting CRCs and two polynomials each. On my system, Intel 3770K 3.5 ghz, speed is about 3.3 GB/sec.Henrique
@Henrique If you can publish it as a library, I will be happy to link to it. I am currently doing mostly non-Windows development and my Windows projects don't get that much attention. You can also adopt my crc32c project and add your code there. Email me if you are interested.Crop
@RobertVažan - the pclmulqdq stuff is all assembly code (over 500 lines). You also need separate programs to generate the constants for specific CRCs, and the code for left shifting CRC is a bit different than right shifting CRC. There is example code on github, but the constants are not all commented.Henrique

First of all, Intel's crc32 instruction calculates CRC-32C (that is, it uses a different polynomial than regular CRC-32; see the Wikipedia CRC-32 entry).

To use Intel's hardware acceleration for CRC32C using gcc you can:

  1. Inline assembly language in C code via the asm statement
  2. Use the intrinsics _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32 or _mm_crc32_u64. See the Intel Intrinsics Guide for a description of them; they are documented for Intel's compiler icc, but gcc implements them as well.

This is how you would do it with _mm_crc32_u8, which takes one byte at a time; using _mm_crc32_u64 would give a further performance improvement since it takes 8 bytes at a time.

#include <nmmintrin.h>  /* SSE 4.2 intrinsics: _mm_crc32_u8 etc. */
#include <stdint.h>
#include <stddef.h>

uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
  uint32_t hash = 0;
  size_t i = 0;
  for (i=0;i<len;i++) {
    hash = _mm_crc32_u8(hash, bytes[i]);
  }

  return hash;
}

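And here is a sketch of the 8-bytes-at-a-time variant mentioned above, using _mm_crc32_u64 (x86-64 only). The function name and tail handling are just illustrative; memcpy is used to avoid unaligned dereferences, and on little-endian x86 the byte order works out the same as the byte-at-a-time loop:

#include <nmmintrin.h>   /* _mm_crc32_u8 / _mm_crc32_u64, needs -msse4.2 */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

uint32_t sse42_crc32_wide(const uint8_t *bytes, size_t len)
{
  uint64_t hash = 0;
  size_t i = 0;
  for (; i + 8 <= len; i += 8) {       /* 8 bytes per instruction */
    uint64_t word;
    memcpy(&word, bytes + i, 8);       /* avoid unaligned dereference */
    hash = _mm_crc32_u64(hash, word);
  }
  for (; i < len; i++)                 /* remaining tail bytes */
    hash = _mm_crc32_u8((uint32_t)hash, bytes[i]);
  return (uint32_t)hash;
}
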
To compile this you need to pass -msse4.2 in CFLAGS, e.g. gcc -g -msse4.2 test.c; otherwise the build will fail with an error about _mm_crc32_u8.

If you want to fall back to a plain C implementation when the instruction is not available on the platform where the executable is running, you can use GCC's ifunc attribute:

uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
  /* use _mm_crc32_u* here */
}

uint32_t default_crc32(const uint8_t *bytes, size_t len)
{
  /* pure C implementation */
}

/* this will be called at load time to decide which function really use */
/* sse42_crc32 if SSE 4.2 is supported */
/* default_crc32 if not */
static void * resolve_crc32(void) {
  __builtin_cpu_init();
  if (__builtin_cpu_supports("sse4.2")) return sse42_crc32;

  return default_crc32;
}

/* crc32() implementation will be resolved at load time to either */
/* sse42_crc32() or default_crc32() */
uint32_t crc32(const uint8_t *bytes, size_t len) __attribute__ ((ifunc ("resolve_crc32")));
Strident answered 2/7, 2015 at 12:5 Comment(2)
is there a method to get the checksum if I am processing, let's say, a 1 MB block with the method mentioned above?Bookshelf
You can create a version of this function where the initial hash value is passed as a parameter. That would allow you to process block by block.Strident
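
A minimal sketch of that incremental interface (hypothetical name, same byte-at-a-time loop as the answer's function):

#include <nmmintrin.h>   /* _mm_crc32_u8, needs -msse4.2 */
#include <stdint.h>
#include <stddef.h>

/* Pass the running CRC in and out so a large input can be fed chunk by chunk. */
uint32_t sse42_crc32_update(uint32_t crc, const uint8_t *bytes, size_t len)
{
  for (size_t i = 0; i < len; i++)
    crc = _mm_crc32_u8(crc, bytes[i]);
  return crc;
}

/* Usage: start with crc = 0, then for each chunk:
   crc = sse42_crc32_update(crc, chunk, chunk_len); */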

I compare various algorithms here: https://github.com/htot/crc32c

The fastest algorithm has been taken from Intel's crc_iscsi_v_pcl.asm assembly code (which is available in a modified form in the Linux kernel), using a C wrapper (crcintelasm.cc) included in this project.

To be able to run this code on 32-bit platforms, it has first been ported to C (crc32intelc) where possible; a small amount of inline assembly is still required. Certain parts of the code depend on the bitness: crc32q is not available on 32 bits and neither is movq, so these are put in macros (crc32intel.h) with alternative code for 32-bit platforms.
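
Purely as an illustration of that bitness split (these are not the actual macros from crc32intel.h): on x86-64 one crc32q covers 8 bytes, while on 32-bit x86 the same result comes from two 32-bit steps, because the instruction consumes its operand byte by byte from the least significant end.

#include <nmmintrin.h>   /* _mm_crc32_u32 / _mm_crc32_u64, needs -msse4.2 */
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
static inline uint64_t crc32c_8bytes(uint64_t crc, const void *p)
{
    uint64_t w;
    memcpy(&w, p, 8);                    /* one crc32q on 64-bit targets */
    return _mm_crc32_u64(crc, w);
}
#else
static inline uint32_t crc32c_8bytes(uint32_t crc, const void *p)
{
    uint32_t lo, hi;
    memcpy(&lo, p, 4);
    memcpy(&hi, (const uint8_t *)p + 4, 4);
    crc = _mm_crc32_u32(crc, lo);        /* low dword first... */
    return _mm_crc32_u32(crc, hi);       /* ...then high dword, same result */
}
#endif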

Navigation answered 4/10, 2016 at 22:35 Comment(0)
