The hunt for the fastest Hamming Distance C implementation [duplicate]

Asked 13/4, 2013 at 12:7 Answered 13/4, 2013 at 15:27

c performance assembly optimization hamming-distance

I want to find how many different characters two strings of equal length have. I have found that xoring algorithms are considered to be the fastest, but they return distance expressed in bits. I want the results expressed in characters. Suppose that "pet" and "pit" have distance 1 expressed in characters but 'e' and 'i' might have two different bits, so xoring returns 2.

The function i wrote is:

// na = length of both strings
unsigned int HammingDistance(const char* a, unsigned int na, const char* b) {

    unsigned int num_mismatches = 0;
    while (na) {
        if (*a != *b)
            ++num_mismatches;

        --na;
        ++a;
        ++b;
    }

    return num_mismatches;
}

Could it become any faster? Maybe using some lower level commands or implementing a different algorithm?

System: Gcc 4.7.2 on Intel Xeon X5650

Thank you

Marplot answered 13/4, 2013 at 12:7 Comment(3)

Do you have any problems with performance so that you try to optimize it? – Aixlachapelle 13/4, 2013 at 12:12

xoring at the individual caracters could work, since you want to detect if there is a difference between 2 chars, you xor them and check that the result is different of 0. However it might not optimal, if a shorter branchless instruction sequence can be devised - look up branchless condtionals or branchless code. – Doubling 13/4, 2013 at 14:44

@Aixlachapelle 2 It's a vital part of a very intensive process. It runs 5 billion times in a program of mine. – Marplot 13/4, 2013 at 14:50

If the strings are padded with zero to always be 32 bytes and their addresses are 16-aligned, you could do something like this: (code neither tested nor profiled)

movdqa xmm0, [a]
movdqa xmm1, [a + 16]
pcmpeqb xmm0, [b]
pcmpeqb xmm1, [b + 16]
pxor xmm2, xmm2
psadbw xmm0, xmm2
psadbw xmm1, xmm2
pextrw ax, xmm0, 0
pextrw dx, xmm1, 0
add ax, dx
movsx eax, ax
neg eax

But if the strings are usually tiny, it does a lot of unnecessary work and it may not be any faster. It should be faster if the strings are usually (nearly) 32 bytes though.

edit: I wrote this answer before I saw your updated comment - if the strings are usually that tiny, this is probably not very good. A 16-byte version could (maybe) be useful though (run the second iteration conditionally, the branch for that should be well-predicted because it'll be rarely taken). But with such short strings, the normal code is hard to beat.

movdqa xmm0, [a]
pxor xmm1, xmm1
pcmpeqb xmm0, [b]
psadbw xmm0, xmm1
pextrw ax, xmm0, 0
movsx eax, ax
neg eax

Escapism answered 13/4, 2013 at 15:27 Comment(3)

The whole padding phase must be too expensive for my case because it involves zeroing an array of 32 bytes and then copy the char* till na. I know it's just 2 x memset + 2 x memcpy but i don't think traversing and compairing 4-31 bytes will be costier. – Marplot 22/4, 2013 at 23:0

@PetrosDrakoulis that's a shame. I had expected the hamming weight to be taken significantly more often than once per array, making it (probably) worth it. – Escapism 23/4, 2013 at 7:24

hm... hamming will be calculated many times (thousands) for the same String but i don't have access to the caller. Your solution would be very good if i could make caller store strings padded at 32 bytes and provide them that way. In current implementation caller provides two char* and a length up to which char* is guaranteed to be valid. After that it's probably trash, and even if not, i can not risk it. Good though though... – Marplot 23/4, 2013 at 13:5

You can make your comparison compare more bytes at a time by doing a bitwise operator on the native integer size.

In your code, you're comparing equality of a byte at a time, but your CPU can compare at least a word in a single cycle, and 8 bytes if it's x86-64. The exact performance capabilities depend on the CPU architecture, of course.

But if you would advance through the two pointers with a stride the size of 8, it could sure be faster in some scenarios. When it has to read from the strings from main memory, the memory load time will actually dominate the performance. But if the strings are in the CPU cache, You might be able to do an XOR, and interpret the results by testing where in the 64bit value the bits are changed.

Counting the buckets that aren't 0 can be done with a variant of the SWAR algorithm starting from 0x33333333 instead of 0x55555555.

The algorithm will be harder to work with, because it will require using uint64_t pointers that have proper memory alignment. You'll need a preamble and postscript that covers the leftover bytes. Maybe you should read the assembly the compiler outputs and see if it's not doing something more clever before you try something more complicated in code.

Finis answered 13/4, 2013 at 12:30 Comment(3)

I thought about it. While it has better best case (stings are equal), the average will not be significantly lower. Especially considering that this implementation will be significantly more complex (esp. alignment). – Aixlachapelle 13/4, 2013 at 12:44

@Aixlachapelle Yeah, I can't say I would recommend trying it. – Finis 13/4, 2013 at 14:25

I did not try this due to complexity and the relatively small size of strings involved. Thank you anyway though – Marplot 22/4, 2013 at 22:52

Instead of

if (*a != *b)
    ++num_mismatches;

this would be faster on some architectures (with 8 bit bytes) because it avoids the branch:

int bits = *a ^ *b;
bits |= bits >> 4;
bits |= bits >> 2;
bits |= bits >> 1;
num_mismatches += bits & 1;

Estienne answered 13/4, 2013 at 12:56 Comment(2)

I just tried this on an Intel Xeon and runs slower than original. Thank you for trying to help me though. – Marplot 13/4, 2013 at 14:51

On x86, the sequence num_mismatches += *a != *b can be compiled into something like test %areg,%breg; setnz %al; movzbl %al,%eax; add %eax,%mismatches_reg; which is branch-free. – Jetliner 4/5, 2015 at 11:52

How about loop unrolling:

while (na >= 8){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  num_mismatches += (a[2] != b[2]);
  num_mismatches += (a[3] != b[3]);
  num_mismatches += (a[4] != b[4]);
  num_mismatches += (a[5] != b[5]);
  num_mismatches += (a[6] != b[6]);
  num_mismatches += (a[7] != b[7]);
  a += 8; b += 8; na -= 8;
}
if (na >= 4){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  num_mismatches += (a[2] != b[2]);
  num_mismatches += (a[3] != b[3]);
  a += 4; b += 4; na -= 4;
}
if (na >= 2){
  num_mismatches += (a[0] != b[0]);
  num_mismatches += (a[1] != b[1]);
  a += 2; b += 2; na -= 2;
}
if (na >= 1){
  num_mismatches += (a[0] != b[0]);
  a += 1; b += 1; na -= 1;
}

Also, if you know there are long stretches of equal characters, you could cast the pointers to long* and compare them 4 at a time, and only if not equal look at the individual characters. This code is based on memset and memcpy being fast. It copies the strings into long arrays to 1) eliminate alignment issues, and 2) pad the strings with zeros out to an integer number of longs. As it compares each pair of longs, if they are not equal, it casts the pointers to char* and counts up the unequal characters. The main loop could also be unrolled, similar to the above.

long la[BIG_ENOUGH];
long lb[BIG_ENOUGH];
memset(la, 0, sizeof(la));
memset(lb, 0, sizeof(lb));
memcpy(la, a, na);
memcpy(lb, b, nb);
int nla = (na + 3) & ~3; // assuming sizeof(long) = 4
long *pa = la, *pb = lb;
while(nla >= 1){
  if (pa[0] != pb[0]){
    num_mismatches += (((char*)pa[0])[0] != ((char*)pb[0])[0])
                    + (((char*)pa[0])[1] != ((char*)pb[0])[1])
                    + (((char*)pa[0])[2] != ((char*)pb[0])[2])
                    + (((char*)pa[0])[3] != ((char*)pb[0])[3])
                    ;
  }
  pa += 1;pb += 1; nla -= 1;
}

Poppy answered 13/4, 2013 at 13:54 Comment(8)

I just tried your loop unrolling. Its slower than original on an Intel Xeon. I have trouble implementing your second solution. Thanks for helping me anyway. – Marplot 13/4, 2013 at 15:1

@Petros: Hmm... The first unrolling solution shouldn't be slower than your basic character loop, unless maybe na is typically pretty short. – Poppy 13/4, 2013 at 15:6

Newer Intel's don't really like unrolling - at least, not always. The loop overhead is typically smaller than the instruction redecoding time (due to the loop not fitting in the loop buffer). – Escapism 13/4, 2013 at 15:8

@harold: Maybe that's it. I guess it really depends on the processor. – Poppy 13/4, 2013 at 15:9

@MikeDunlavey na is between 4 and 32 at most, between 6 and 9 on average case. – Marplot 13/4, 2013 at 15:19

@Petros: In that case, it's going to be hard to beat the naive loop, especially if it is inlined. I've done some extreme performance tuning. Here's one example. Basically, opportunities to speed up the program lurk in places you would never guess, and when you find them, they may be asking you to re-architect the program. What's more, you don't just do it once. The real payoff comes when you repeat it as many times as you can. – Poppy 13/4, 2013 at 15:31

@Petros: I forgot to ask: How often are the strings different? If they are almost always equal, a simple memcmp(a, b, na)==0 will be hard to beat, and only when they are unequal call your function. – Poppy 13/4, 2013 at 15:45

@MikeDunlavey Quite often. I tried checking memcmp() first and is slower than standard – Marplot 13/4, 2013 at 17:5

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags