Why is XOR the default way to combine hashes?

Say you have two hashes H(A) and H(B) and you want to combine them. I've read that a good way to combine two hashes is to XOR them, e.g. XOR( H(A), H(B) ).

The best explanation I've found is touched briefly here on these hash function guidelines:

XORing two numbers with roughly random distribution results in another number still with roughly random distribution*, but which now depends on the two values.
...
* At each bit of the two numbers to combine, a 0 is output if the two bits are equal, else a 1. In other words, in 50% of the combinations, a 1 will be output. So if the two input bits each have a roughly 50-50 chance of being 0 or 1, then so too will the output bit.

Can you explain the intuition and/or mathematics behind why XOR should be the default operation for combining hash functions (rather than OR or AND etc.)?

Latchet answered 4/5, 2011 at 20:7 Comment(7)
I think you just did ;)Presuppose
note that XOR may or may not be a "good" way to "combine" hashes, depending on what you want in a "combination". XOR is commutative: XOR(H(A),H(B)) is equal to XOR(H(B),H(A)). This means that XOR is not a proper way to create a kind of hash of an ordered sequence of values, since it does not capture the order.Plastometer
Besides the issue with order (comment above), there is a problem with equal values. XOR(H(1), H(1))=0 (for any function H), XOR(H(2),H(2))=0, and so on. For any N: XOR(H(N),H(N))=0. Equal values happen quite often in real apps, which means the result of XOR will be 0 too often for it to be considered a good hash.Wagshul
What do you use for an ordered sequence of values? Let's say I'd like to create a hash of a timestamp or index (MSB less important than LSB). Sorry if this thread is 1 year old.Ganny
Related: What is the best algorithm for an overridden System.Object.GetHashCode?Daybreak
A word of warning: don't use XOR to combine CRC values because CRC is a linear function in the sense that CRC(a) ^ CRC(b) = CRC(a ^ b). Additionally, two equal elements will cancel out. I think summing CRC values (with addition) is okay if you want a hash of an unordered list, but I'm not 100% on that.Missie
why not concatenate the hash digests and hash again? H(concat(H(1), H(2)))Politics

Assuming uniformly random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.

The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.

This can be seen by writing out truth tables:

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

 a | b | a OR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    1

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0

Exercise: How many logical functions of two 1-bit inputs a and b have this uniform output distribution? Why is XOR the most suitable for the purpose stated in your question?
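
A quick empirical check (added here for illustration, not part of the original answer) that counts set bits in the AND, OR, and XOR of uniformly random 64-bit words; it needs C++20 for std::popcount:

#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <iostream>
#include <random>

int main() {
    std::mt19937_64 rng(42);
    const int trials = 100000;
    long long and_bits = 0, or_bits = 0, xor_bits = 0;

    for (int i = 0; i < trials; ++i) {
        std::uint64_t a = rng(), b = rng();
        and_bits += std::popcount(a & b);
        or_bits  += std::popcount(a | b);
        xor_bits += std::popcount(a ^ b);
    }

    const double total = 64.0 * trials;
    std::cout << "fraction of 1 bits:"
              << " AND " << and_bits / total    // ~0.25
              << " OR "  << or_bits  / total    // ~0.75
              << " XOR " << xor_bits / total    // ~0.50
              << '\n';
}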

Scantling answered 4/5, 2011 at 20:9 Comment(11)
Answering the exercise: of the 16 possible different a XXX b operations (0, a & b, a > b, a, a < b, b, a % b, a | b, !a & !b, a == b, !b, a >= b, !a, a <= b, !a | !b, 1), the following have 50%-50% distributions of 0s and 1s, assuming a and b have 50%-50% distributions of 0s and 1s: a, b, !a, !b, a % b, a == b; i.e., the opposite of XOR (EQUIV) could have been used as well...Presuppose
Greg, this is an awesome answer. The light bulb went on for me after I saw your original answer and wrote out my own truth tables. I considered @Massa's answer about how there are 6 suitable operations for maintaining the distribution. And while a, b, !a, !b will have the same distribution as their respective inputs, you lose the entropy of the other input. That is, XOR is most suitable for the purpose of combining hashes because we want to capture entropy from both a and b.Latchet
Here is a paper that explains that combining hashes securely, where each function is called only once, is not possible without outputting fewer bits than the sum of the number of bits in each hash value. This suggests that this answer is not correct.Illuviation
@fish: That paper describes building secure hashes from a secure/possibly-insecure pair. I saw nothing about combining two secure hashes. In any event, I think this discussion has more to do with the use of hashes in randomised algorithms (where there are numerous good tricks that will do the job) than in cryptography, where a huge amount of care must be taken to thwart cryptanalysis.Machos
@Presuppose I've never seen % used for XOR or not equal.Carr
@GregHewgill, I know this thread is old; trying my luck. Will XOR(A,B) generate a unique bit sequence if A and B are unique and have the same length?Aerial
@tpk: No, the result is not unique. There are many different ways to generate a given result R from R = A XOR B. For example, consider 0010 XOR 1100, and 1111 XOR 0001. Both give the result 1110.Scantling
As Yakk points out, XOR can be dangerous as it produces zero for identical values. This means (a,a) and (b,b) both produce zero, which in many (most?) cases greatly increases the likelihood of collisions in hash-based data structures.Friede
@2943 consider XORing two bytes has 256*256 possible input values, and only 256 output values. It's not possible to come up with a unique output given two inputs, assuming all three values have the same options.Friede
This is not really a very good answer. It addresses the matter probabilistically, without considering cross probabilities. The question was "Why is XOR the default way to combine hashes", and XOR shouldn't be the default, because there will probably be a relation between the two values (two small integers, two letters, etc). And it gets a lot worse if more than two hashes are being combined.Sapindaceous
Another way to think about this: XOR is reversible: it doesn't destroy information. You can XOR the same thing again to flip the bits back to what they were. AND and OR aren't reversible.Wichita

xor is a dangerous default function to use when hashing. It is better than and and or, but that doesn't say much.

xor is symmetric, so the order of the elements is lost. So "bad" will hash combine the same as "dab".

xor maps pairwise identical values to zero, and you should avoid mapping "common" values to zero:

So (a,a) gets mapped to 0, and (b,b) also gets mapped to 0. As such pairs are almost always more common than randomness might imply, you end up with far more collisions at zero than you should.

With these two problems, xor ends up being a hash combiner that looks half decent on the surface, but not after further inspection.
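
Both problems are easy to demonstrate with a naive per-character XOR combiner (a sketch added for illustration; xor_combine is an illustrative name, not part of the original answer):

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Naive combiner: XOR together the hash of every character.
std::size_t xor_combine(const std::string& s) {
    std::size_t h = 0;
    for (char c : s)
        h ^= std::hash<char>{}(c);
    return h;
}

int main() {
    // Symmetry: any permutation of the same characters collides.
    std::cout << (xor_combine("bad") == xor_combine("dab")) << '\n';    // prints 1
    // Pairwise identical values cancel out: both strings hash to 0.
    std::cout << xor_combine("aa") << ' ' << xor_combine("bb") << '\n'; // prints 0 0
}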

On modern hardware, adding is usually about as fast as xor (it probably uses more power to pull this off, admittedly). Addition's truth table is similar to xor's on the bit in question, but it also carries a bit into the next position when both input bits are 1. This means it erases less information.

So hash(a) + hash(b) is better than hash(a) xor hash(b) in that if a==b, the result is hash(a)<<1 instead of 0.

This remains symmetric, so "bad" and "dab" getting the same result remains a problem. We can break this symmetry for a modest cost:

(hash(a)<<1) + hash(a) + hash(b)

aka hash(a)*3 + hash(b). (The parentheses around the shift matter, since << binds more loosely than + in C++; calculating hash(a) once and storing it is advised if you use the shift form.) Any odd constant instead of 3 will bijectively map a k-bit unsigned integer to itself, because arithmetic on unsigned integers is modulo 2^k for some k, and any odd constant is relatively prime to 2^k.
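
A minimal sketch of that combiner as a compilable function (the name combine is mine, added for illustration):

#include <cstddef>

// 3*a + b, written with an explicit shift. The parentheses matter because
// << binds more loosely than + in C++. Any odd multiplier is invertible
// modulo 2^k, so no information from hash(a) is lost.
std::size_t combine(std::size_t a, std::size_t b) {
    return (a << 1) + a + b;
}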

For an even fancier version, we can examine boost::hash_combine, which is effectively:

size_t hash_combine( size_t lhs, size_t rhs ) {
  // Mix rhs, a fixed constant, and two shifted copies of lhs, then fold into lhs.
  lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  return lhs;
}

Here we add together some shifted versions of lhs, a constant (which is basically random 0s and 1s; in particular, it is the inverse of the golden ratio as a 32-bit fixed-point fraction), and rhs, then fold the sum into lhs with an xor. This breaks symmetry and introduces some "noise" if the incoming hash values are poor (i.e., imagine every component hashes to 0: the above handles it well, generating a smear of 1s and 0s after each combine, whereas my naive 3*hash(a)+hash(b) simply outputs 0 in that case).

Extending this to 64 bits (using the reciprocal of pi as our constant, since its 64-bit fixed-point expansion is odd):

size_t hash_combine( size_t lhs, size_t rhs ) {
  if constexpr (sizeof(size_t) >= 8) {
    // 64-bit size_t: the constant is the reciprocal of pi as a 64-bit fixed-point fraction (odd).
    lhs ^= rhs + 0x517cc1b727220a95 + (lhs << 6) + (lhs >> 2);
  } else {
    // 32-bit size_t: the constant is the inverse golden ratio as a 32-bit fixed-point fraction.
    lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  }
  return lhs;
}

(For those not familiar with C/C++, a size_t is an unsigned integer value which is big enough to describe the size of any object in memory. On a 64 bit system, it is usually a 64 bit unsigned integer. On a 32 bit system, a 32 bit unsigned integer.)
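
For completeness, a usage sketch (not part of the original answer; PairHash is a made-up name) showing how such a combiner can back a hasher for std::unordered_map, here using the simple 32-bit-constant version from above:

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

std::size_t hash_combine(std::size_t lhs, std::size_t rhs) {
    lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
    return lhs;
}

// Hypothetical hasher for std::pair, built on hash_combine above.
struct PairHash {
    std::size_t operator()(const std::pair<std::string, std::string>& p) const {
        std::size_t h = std::hash<std::string>{}(p.first);
        return hash_combine(h, std::hash<std::string>{}(p.second));
    }
};

int main() {
    std::unordered_map<std::pair<std::string, std::string>, int, PairHash> seen;
    seen[{"foo", "bar"}] = 1;
    std::cout << seen[{"foo", "bar"}] << '\n';                                        // prints 1
    std::cout << (PairHash{}({"foo", "bar"}) == PairHash{}({"bar", "foo"})) << '\n';  // usually 0: order matters
}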

Anthiathia answered 14/1, 2015 at 21:21 Comment(19)
Nice answer Yakk. Does this algorithm work equally well on both 32bit and 64bit systems? Thanks.Bilection
@dave add more bits to 0x9e3779b9.Anthiathia
@Yakk Thanks. For anyone else listening, I doubled the binary bits of the 32bit case ( 0x9e3779b9 ) for a 64bit value of ( 0x9e3779b99e377800 ) and switch which to use by testing cpp macros i386 (32 bit intel) and x86_64 (64 bit intel)Bilection
@dave use a base 2 fractional irrational value for max entropy.Anthiathia
@Yakk Oh! Thank you, I'd forgotten the number wasn't just any constant, which you explained so well above. :) Using your inverse of the golden ratio as a 64 bit fixed point number, I come up with this, which I'll use instead for the 64bit case: 0x9e3779b97f492000. Does it matter that this constant is even? Would it be better to add a one to the end of it?Bilection
@Bilection Not sure; but it ending with 000 is suspicious; that value probably has double bits of precision, not 64.Anthiathia
@Yakk I used a couple online converters to come up with the numbers (probably written in javascript), so hmm... you're right, I doubt anyone is trying to be more precise than double. I'll re-examine. Also, oops stack overflow formatting prints the macros wrong in my earlier comment. They are __i386__ and __x86_64__ (with leading and trailing double-underlines)Bilection
OK, to be complete... here is the full precision 64bit constant (calculated with long doubles, and unsigned long longs): 0x9e3779b97f4a7c16. Interestingly it is still even. Re-doing the same calculation using PI instead of the Golden Ratio produces: 0x517cc1b727220a95 which is odd, instead of even, thus probably "more prime" than the other constant. I used: std::cout << std::hex << (unsigned long long) ((1.0L/3.14159265358979323846264338327950288419716939937510L)*(powl(2.0L,64.0L))) << std::endl; with cout.precision( numeric_limits<long double>::max_digits10 ); Thanks again Yakk.Bilection
@Bilection the inverse golden ratio rule for these cases is the first odd number equal to or larger than the calculation you are doing. So just add 1. It is an important number because the sequence of N * the ratio, mod the max size (2^64 here) places the next value in the sequence exactly at that ratio in the middle of the largest 'gap' in numbers. Search the web for "Fibonacci hashing" for more info.Pique
@Bilection the right number would be 0.9E3779B97F4A7C15F39... See link. You could be suffering from the round-to-even rule (which is good for accountants), or, more simply, if you start with a literal sqrt(5) constant, then when you subtract 1 you remove the high-order bit, so a bit of precision must have been lost.Sapindaceous
Also good, but a lot more expensive, would be hash(hash(a)) + hash(b).Sapindaceous
But, wait, is it 0x0.9e377... or 0x9e377...? Sorry, getting confused since the 32-bit version in the main answer uses 0x9e377...Bilection
@Bilection The hash constant is a fixed-point hexadecimal value. The radix point isn't part of the encoding, as it sits implicitly before the most significant digit of the value. This is a bit confusing as C++ has (recently?) added hex floating-point literals, but prior to that a radix point in a hex value wasn't legal C++. In short, omit the decimal point.Anthiathia
@Dave, just reading your comments after some years... I think, instead of testing the macros, it would be better to just have two overloads for uint32_t and uint64_t.Spheroidal
In the last paragraph seed should probably be changed with lhs. Great answer!Goddart
"this means it erases less information" - no. There is the same amount of information when you add two random numbers and truncate or when you xor them. Both results have maximum entropy. The rest is still true though.Untangle
Except, sometimes we want our hash to be order agnostic, e.g., when trying to hash an unordered collection.Rosana
In 2022, it might make sense to present the 64 bit variant by default, as this is the canonical answer for how to combine hashes in C++ on SO.Peti
we need some way to retrieve the "magic" constants so we won't need the if constexpr. Also, the choice of inverse pi on 64-bit systems seems arbitrary to me. Why not just stick to the inverse golden ratio or, alternatively, switch to the inverse pi everywhere?Toodleoo

In spite of its handy bit-mixing properties, XOR is not a good way to combine hashes due to its commutativity. Consider what would happen if you stored the permutations of {1, 2, …, 10} in a hash table of 10-tuples: because XOR is commutative (and associative), every permutation combines to exactly the same value, so all of them collide.

A much better choice is m * H(A) + H(B), where m is a large odd number.

Credit: The above combiner was a tip from Bob Jenkins.
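
A short C++ sketch of that combiner (my translation, not Bob Jenkins' code; the particular constant is just a convenient large odd number, and the sketch assumes a 64-bit size_t):

#include <cstddef>

// m*H(A) + H(B): m is an arbitrary large odd constant.
std::size_t combine(std::size_t ha, std::size_t hb) {
    const std::size_t m = 0x100000001b3;   // the 64-bit FNV prime, one convenient large odd number
    return m * ha + hb;
}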

Machos answered 24/4, 2013 at 21:37 Comment(8)
Sometimes commutativity is a good thing, but xor is a lousy choice even then because all pairs of matching items will get hashed to zero. An arithmetic sum is better; the hash of a pair of matching items will retain only 31 bits of useful data rather than 32, but that's a lot better than retaining zero. Another option may be to compute the arithmetic sum as a long and then munge the upper portion back in with the lower portion.Crusade
m = 3 is actually a good choice and very fast on many systems. Note that for any odd m integer multiplication is modulo 2^32 or 2^64 and is therefore invertible so you're not losing any bits.Aldosterone
What happens when you go beyond MaxInt?Braden
instead of any odd number one should choose a primeLayby
XOR is fine, if you are combining two different quality hash functions. For example SHA1(A) XOR SipHash(B) (mashed together at equal length, of course)Pique
@Infinum that's not necessary when combining hashes.Machos
Why not do H(H(A) || H(B)) where || is concatenate?Extensor
@CaseyRodarmor you could, but stringifying and concatenating two hashes and then computing a third hash is far more expensive than a multiplication and an addition for no improvement in the quality of the hash.Machos

Xor may be the "default" way to combine hashes, but Greg Hewgill's answer also shows why it has its pitfalls: the xor of two identical hash values is zero. In real life, identical hashes are more common than one might expect. You might then find that in these (not so infrequent) corner cases, the resulting combined hashes are always the same (zero). Hash collisions would be much, much more frequent than you expect.

In a contrived example, you might be combining hashed passwords of users from different websites you manage. Unfortunately, a large number of users reuse their passwords, and a surprising proportion of the resulting hashes are zero!

Sankaran answered 19/8, 2011 at 0:9 Comment(1)
I hope the contrived example never happens; passwords should be salted.Exterminatory

There's something I want to point out explicitly for others who find this page. AND and OR restrict the output, as BlueRaja - Danny Pflughoe is trying to point out, but this can be stated more precisely:

First I want to define two simple functions I'll use to explain this: Min() and Max().

Min(A, B) will return the value that is smaller between A and B, for example: Min(1, 5) returns 1.

Max(A, B) will return the value that is larger between A and B, for example: Max(1, 5) returns 5.

If you are given: C = A AND B

Then you can find that C <= Min(A, B). We know this because there is nothing you can AND with the 0 bits of A or B to make them 1s. So every zero bit stays a zero bit, and every one bit has a chance to become a zero bit (and thus produce a smaller value).

With: C = A OR B

The opposite is true: C >= Max(A, B). With this, we see the corollary of the AND case. Any bit that is already a one cannot be ORed into being a zero, so it stays a one, but every zero bit has a chance to become a one, and thus produce a larger number.

This implies that the state of one input places restrictions on the output. If you AND anything with 90, you know the output will be less than or equal to 90, regardless of what the other value is.

For XOR, there is no such restriction implied by the inputs. There are special cases: for example, if you XOR a byte with 255 you get its bitwise complement. But any possible byte can be output. Every bit has a chance to change state depending on the same bit in the other operand.
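
A small sketch (not part of the original answer) that spot-checks these bounds on random byte values:

#include <algorithm>
#include <cassert>
#include <random>

int main() {
    std::mt19937 rng(123);
    std::uniform_int_distribution<int> byte(0, 255);

    for (int i = 0; i < 100000; ++i) {
        int a = byte(rng), b = byte(rng);
        assert((a & b) <= std::min(a, b));   // AND can only clear bits
        assert((a | b) >= std::max(a, b));   // OR can only set bits
        // No analogous bound holds for a ^ b: it can land above or below either input.
    }
}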

Guillermoguilloche answered 20/5, 2011 at 21:11 Comment(3)
One could say that OR is bitwise max, and AND is bitwise min.Pavlodar
Very well stated Paulo Ebermann. Nice to see you here as well as Crypto.SE!Guillermoguilloche
I created a filter which shows me everything tagged cryptography, including changes to old questions. That's how I found your answer here.Pavlodar

If you XOR a random input with a biased input, the output is random. The same is not true for AND or OR. Example:

00101001 XOR 00000000 = 00101001
00101001 AND 00000000 = 00000000
00101001 OR  11111111 = 11111111

As @Greg Hewgill mentions, even if both inputs are random, using AND or OR will result in biased output.

The reason we use XOR over something more complex is that, well, there's no need: XOR works perfectly, and it's blazingly stupid-fast.
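
Another way to see this (a sketch added for illustration, not from the original answer): XOR with any fixed value is a bijection on byte values, while AND and OR collapse many inputs onto few outputs:

#include <iostream>
#include <set>

int main() {
    const int mask = 0b00101001;            // the biased operand from the example above
    std::set<int> xor_out, and_out, or_out;

    for (int v = 0; v < 256; ++v) {
        xor_out.insert(v ^ mask);
        and_out.insert(v & mask);
        or_out.insert(v | mask);
    }
    // XOR with a fixed value reaches all 256 byte values;
    // AND and OR reach only a small subset, so they destroy uniformity.
    std::cout << xor_out.size() << ' ' << and_out.size() << ' '
              << or_out.size() << '\n';     // prints 256 8 32
}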

Cestar answered 4/5, 2011 at 20:13 Comment(0)

Cover the left 2 columns and try to work out what the inputs are using just the output.

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

When you saw a 1-bit you should have worked out that both inputs were 1.

Now do the same for XOR

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0

XOR gives away nothing about its inputs.

Incoordination answered 23/5, 2017 at 10:53 Comment(0)

Unlike OR and AND, XOR never ignores one of its inputs.

If you take AND(X, Y), for example, and feed input X with false, then input Y does not matter... and one probably wants every input to matter when combining hashes.

If you take XOR(X, Y) then BOTH inputs ALWAYS matter. There is no value of X for which Y does not matter. If either X or Y is changed, the output will reflect that.
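
A tiny sketch (added for illustration) of that point: with AND, setting X to zero makes Y irrelevant, while with XOR every change to Y shows up in the output.

#include <iostream>

int main() {
    unsigned x = 0;                          // X "stuck" at false/zero
    for (unsigned y : {0x12u, 0x34u, 0xffu}) {
        std::cout << (x & y) << ' '          // always 0: Y is ignored entirely
                  << (x ^ y) << '\n';        // tracks Y exactly: both inputs matter
    }
}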

Griddle answered 24/7, 2019 at 1:45 Comment(0)

The source code for the various versions of hashCode() in java.util.Arrays is a great reference for solid, general-purpose hashing algorithms. They are easily understood and translated into other programming languages.

Roughly speaking, most multi-attribute hashCode() implementations follow this pattern:

public static int hashCode(Object a[]) {
    if (a == null)
        return 0;

    int result = 1;

    // Accumulate: multiply the running result by 31, then add the next element's hash.
    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());

    return result;
}

You can search other StackOverflow Q&As for more information about the magic behind 31, and why Java code uses it so frequently. It is imperfect, but has very good general performance characteristics.

Keratitis answered 12/5, 2015 at 15:43 Comment(1)
Java's default "multiply by 31 and add / accumulate" hash is loaded with collisions (e.g. for any string, string + "Aa" collides with string + "BB" IIRC), and they long ago wished they had not baked that algorithm into the spec. That said, using a larger odd number with more bits set, and adding shifts or rotations, fixes that problem. MurmurHash3's 'mix' does this.Pique
