Fast calculate hamming distance in C
Asked Answered
B

2

11

I read the Wikipedia article on Hamming Weight and noticed something interesting:

It is thus equivalent to the Hamming distance from the all-zero string of the same length. For the most typical case, a string of bits, this is the number of 1's in the string. In this binary case, it is also called the population count, popcount or sideways sum.

[emphasis mine]

So something occurred to me. Could I calculate the Hamming Distance between two strings by XORing them and then taking the Hamming Weight (POPCOUNT) of the resulting string?

Something along the lines of this (using gcc intrinsics):

#include <stdint.h>

int hammingDistance (uint64_t x, uint64_t y) {
        uint64_t res = x ^ y;
        return __builtin_popcountll (res);
}

Now, as for why I would want to do this, well, on some platforms, yes, this would just translate to gcc emitting a call to a function that calculates popcount. For instance, on x64 without popcnt, gcc spits out (Godbolt's GCC Online):

hammingDistance:
    sub rsp, 8
    xor rdi, rsi
    call    __popcountdi2
    add rsp, 8
    ret

OTOH, if you have a platform that supports POPCOUNT, like x64 models including nehalem and after (which have POPCNT), you get (Godbolt's GCC Online):

hammingDistance:
    xor rdi, rsi
    popcnt  rax, rdi
    ret

which should be waaay faster, especially once inlined.


But back to the original question. Can you take the Hamming Weight of the XOR of two strings to find their Hamming Distance? ie:

HD = HW (x xor y)
Bolyard answered 2/8, 2014 at 20:13 Comment(3)
Are you asking if the Hamming weight of the xor of two bitstrings is equal to their Hamming distance? (Answer : Yes, it trivially follows from the definition.) Or are you asking for a generalization of this efficient method to general strings?Excrescency
I'm asking for both the first and for whether my implementation works too.Bolyard
It is interesting to note that popcnt is not always the fastest solution. On the Intel Haswell processor, an AVX2 in-register lookup table method is faster. A utility that can test various population count methods is here: notabs.org/blcutil.Comely
E
6

Hamming distance between two equal length strings, x and y, is defined to be the number of positions where they differ. In the case of x and y being bitstrings, x^y is a string with 1s in exactly the positions they differ. Thus, HammingDistance(x,y) = Number of 1s in x^y, for bitstrings. Also, HammingWeight(x) = number of 1s in x for a bitstring x. Thus, your first claim, HammingDistance(x,y) = HammingWeight(x^y) is true for bitstrings. Having established that, it is clear that your implementation is correct.

Excrescency answered 2/8, 2014 at 20:23 Comment(1)
Both answers are equally good; that being said, I'm marking this one as correct because it was first.Bolyard
S
3

Yes, that works. For each bit, the bit is 1 if and only if the input bits are different. Hence, applied to a whole bit vector, the result has as many one bits (HW) as the inputs have differing bits (HD). And your code seems to exploit that relationship perfectly well. In fact, this shortcut is even mentioned further into the Hamming weight article you link to (Efficient implementation):

The Hamming distance of two words A and B can be calculated as the Hamming weight of A xor B.

Swen answered 2/8, 2014 at 20:21 Comment(1)
Wow. I guess skimming doesn't work too well all the time then. I'll try and spend more time actually reading in the future.Bolyard

© 2022 - 2024 — McMap. All rights reserved.