Why is XOR the default way to combine hashes?

Say you have two hashes H(A) and H(B) and you want to combine them. I've read that a good way to combine two hashes is to XOR them, e.g. XOR( H(A), H(B) ).

The best explanation I've found is touched briefly here on these hash function guidelines:

XORing two numbers with roughly random distribution results in another number still with roughly random distribution*, but which now depends on the two values.
...
* At each bit of the two numbers to combine, a 0 is output if the two bits are equal, else a 1. In other words, in 50% of the combinations, a 1 will be output. So if the two input bits each have a roughly 50-50 chance of being 0 or 1, then so too will the output bit.

Can you explain the intuition and/or mathematics behind why XOR should be the default operation for combining hash functions (rather than OR or AND etc.)?

Latchet answered 4/5, 2011 at 20:7 Comment(7)
I think you just did ;)Presuppose
note that XOR may or may not be a "good" way to "combine" hashes, depending on what you want in a "combination". XOR is commutative: XOR(H(A),H(B)) is equal to XOR(H(B),H(A)). This means that XOR is not a proper way to create a kind of hash of an ordered sequence of values, since it does not capture the order.Plastometer
Besides the issue with order (comment above), there is a problem with equal values. XOR(H(1), H(1))=0 (for any function H), XOR(H(2),H(2))=0, and so on. For any N: XOR(H(N),H(N))=0. Equal values happen quite often in real apps, which means the result of XOR will be 0 too often for it to be considered a good hash.Wagshul
What do you use for an ordered sequence of values? Let's say I'd like to create a hash of a timestamp or index (MSB less important than LSB). Sorry if this thread is 1 year old.Ganny
Related: What is the best algorithm for an overridden System.Object.GetHashCode?Daybreak
A word of warning: don't use XOR to combine CRC values because CRC is a linear function in the sense that CRC(a) ^ CRC(b) = CRC(a ^ b). Additionally, two equal elements will cancel out. I think summing CRC values (with addition) is okay if you want a hash of an unordered list, but I'm not 100% on that.Missie
why not concatenate the hash digests and hash again? H(concat(H(1), H(2)))Politics

Assuming uniformly random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.

The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.

This can be seen by writing out truth tables:

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

 a | b | a OR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    1

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0

Exercise: How many logical functions of two 1-bit inputs a and b have this uniform output distribution? Why is XOR the most suitable for the purpose stated in your question?
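
A quick empirical check (added here for illustration, not part of the original answer) that counts set bits in the AND, OR, and XOR of uniformly random 64-bit words; it needs C++20 for std::popcount:

#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <iostream>
#include <random>

int main() {
    std::mt19937_64 rng(42);
    const int trials = 100000;
    long long and_bits = 0, or_bits = 0, xor_bits = 0;

    for (int i = 0; i < trials; ++i) {
        std::uint64_t a = rng(), b = rng();
        and_bits += std::popcount(a & b);
        or_bits  += std::popcount(a | b);
        xor_bits += std::popcount(a ^ b);
    }

    const double total = 64.0 * trials;
    std::cout << "fraction of 1 bits:"
              << " AND " << and_bits / total    // ~0.25
              << " OR "  << or_bits  / total    // ~0.75
              << " XOR " << xor_bits / total    // ~0.50
              << '\n';
}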

Scantling answered 4/5, 2011 at 20:9 Comment(11)
Answering the exercise: of the 16 possible different a XXX b operations (0, a & b, a > b, a, a < b, b, a % b, a | b, !a & !b, a == b, !b, a >= b, !a, a <= b, !a | !b, 1), the following have 50%-50% distributions of 0s and 1s, assuming a and b have 50%-50% distributions of 0s and 1s: a, b, !a, !b, a % b, a == b; i.e., the opposite of XOR (EQUIV) could have been used as well...Presuppose
Greg, this is an awesome answer. The light bulb went on for me after I saw your original answer and wrote out my own truth tables. I considered @Massa's answer about how there are 6 suitable operations for maintaining the distribution. And while a, b, !a, !b will have the same distribution as their respective inputs, you lose the entropy of the other input. That is, XOR is most suitable for the purpose of combining hashes because we want to capture entropy from both a and b.Latchet
Here is a paper that explains that combining hashes securely, where each function is called only once, is not possible without outputting fewer bits than the sum of the number of bits in each hash value. This suggests that this answer is not correct.Illuviation
@fish: That paper describes building secure hashes from a secure/possibly-insecure pair. I saw nothing about combining two secure hashes. In any event, I think this discussion has more to do with the use of hashes in randomised algorithms (where there are numerous good tricks that will do the job) than in cryptography, where a huge amount of care must be taken to thwart cryptanalysis.Machos
@Presuppose I've never seen % used for XOR or not equal.Carr
@GregHewgill, I know this thread is old; trying my luck. Will XOR(A,B) generate a unique bit sequence if A and B are unique and have the same length?Aerial
@tpk: No, the result is not unique. There are many different ways to generate a given result R from R = A XOR B. For example, consider 0010 XOR 1100, and 1111 XOR 0001. Both give the result 1110.Scantling
As Yakk points out, XOR can be dangerous as it produces zero for identical values. This means (a,a) and (b,b) both produce zero, which in many (most?) cases greatly increases the likelihood of collisions in hash-based data structures.Friede
@2943 consider XORing two bytes has 256*256 possible input values, and only 256 output values. It's not possible to come up with a unique output given two inputs, assuming all three values have the same options.Friede
This is not really a very good answer. It addresses the matter probabilistically, without considering cross probabilities. The question was "Why is XOR the default way to combine hashes", and XOR shouldn't be the default, because there will probably be a relation between the two values (two small integers, two letters, etc). And it gets a lot worse if more than two hashes are being combined.Sapindaceous
Another way to think about this: XOR is reversible: it doesn't destroy information. You can XOR the same thing again to flip the bits back to what they were. AND and OR aren't reversible.Wichita

xor is a dangerous default function to use when hashing. It is better than and and or, but that doesn't say much.

xor is symmetric, so the order of the elements is lost. So "bad" will hash combine the same as "dab".

xor maps pairwise identical values to zero, and you should avoid mapping "common" values to zero:

So (a,a) gets mapped to 0, and (b,b) also gets mapped to 0. As such pairs are almost always more common than randomness might imply, you end up with far more collisions at zero than you should.

With these two problems, xor ends up being a hash combiner that looks half decent on the surface, but not after further inspection.
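
Both problems are easy to demonstrate with a naive per-character XOR combiner (a sketch added for illustration; xor_combine is an illustrative name, not part of the original answer):

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Naive combiner: XOR together the hash of every character.
std::size_t xor_combine(const std::string& s) {
    std::size_t h = 0;
    for (char c : s)
        h ^= std::hash<char>{}(c);
    return h;
}

int main() {
    // Symmetry: any permutation of the same characters collides.
    std::cout << (xor_combine("bad") == xor_combine("dab")) << '\n';    // prints 1
    // Pairwise identical values cancel out: both strings hash to 0.
    std::cout << xor_combine("aa") << ' ' << xor_combine("bb") << '\n'; // prints 0 0
}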

On modern hardware, adding is usually about as fast as xor (it probably uses more power to pull this off, admittedly). Addition's truth table is similar to xor's on the bit in question, but it also carries a bit into the next position when both input bits are 1. This means it erases less information.

So hash(a) + hash(b) is better than hash(a) xor hash(b) in that if a==b, the result is hash(a)<<1 instead of 0.

This remains symmetric, so "bad" and "dab" getting the same result remains a problem. We can break this symmetry for a modest cost:

(hash(a)<<1) + hash(a) + hash(b)

aka hash(a)*3 + hash(b). (The parentheses around the shift matter, since << binds more loosely than + in C++; calculating hash(a) once and storing it is advised if you use the shift form.) Any odd constant instead of 3 will bijectively map a k-bit unsigned integer to itself, because arithmetic on unsigned integers is modulo 2^k for some k, and any odd constant is relatively prime to 2^k.
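
A minimal sketch of that combiner as a compilable function (the name combine is mine, added for illustration):

#include <cstddef>

// 3*a + b, written with an explicit shift. The parentheses matter because
// << binds more loosely than + in C++. Any odd multiplier is invertible
// modulo 2^k, so no information from hash(a) is lost.
std::size_t combine(std::size_t a, std::size_t b) {
    return (a << 1) + a + b;
}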

For an even fancier version, we can examine boost::hash_combine, which is effectively:

size_t hash_combine( size_t lhs, size_t rhs ) {
  // Mix rhs, a fixed constant, and two shifted copies of lhs, then fold into lhs.
  lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  return lhs;
}

Here we add together some shifted versions of lhs, a constant (which is basically random 0s and 1s; in particular, it is the inverse of the golden ratio as a 32-bit fixed-point fraction), and rhs, then fold the sum into lhs with an xor. This breaks symmetry and introduces some "noise" if the incoming hash values are poor (i.e., imagine every component hashes to 0: the above handles it well, generating a smear of 1s and 0s after each combine, whereas my naive 3*hash(a)+hash(b) simply outputs 0 in that case).

Extending this to 64 bits (using the reciprocal of pi as our constant, since its 64-bit fixed-point expansion is odd):

size_t hash_combine( size_t lhs, size_t rhs ) {
  if constexpr (sizeof(size_t) >= 8) {
    // 64-bit size_t: the constant is the reciprocal of pi as a 64-bit fixed-point fraction (odd).
    lhs ^= rhs + 0x517cc1b727220a95 + (lhs << 6) + (lhs >> 2);
  } else {
    // 32-bit size_t: the constant is the inverse golden ratio as a 32-bit fixed-point fraction.
    lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  }
  return lhs;
}

(For those not familiar with C/C++, a size_t is an unsigned integer value which is big enough to describe the size of any object in memory. On a 64 bit system, it is usually a 64 bit unsigned integer. On a 32 bit system, a 32 bit unsigned integer.)
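
For completeness, a usage sketch (not part of the original answer; PairHash is a made-up name) showing how such a combiner can back a hasher for std::unordered_map, here using the simple 32-bit-constant version from above:

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

std::size_t hash_combine(std::size_t lhs, std::size_t rhs) {
    lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
    return lhs;
}

// Hypothetical hasher for std::pair, built on hash_combine above.
struct PairHash {
    std::size_t operator()(const std::pair<std::string, std::string>& p) const {
        std::size_t h = std::hash<std::string>{}(p.first);
        return hash_combine(h, std::hash<std::string>{}(p.second));
    }
};

int main() {
    std::unordered_map<std::pair<std::string, std::string>, int, PairHash> seen;
    seen[{"foo", "bar"}] = 1;
    std::cout << seen[{"foo", "bar"}] << '\n';                                        // prints 1
    std::cout << (PairHash{}({"foo", "bar"}) == PairHash{}({"bar", "foo"})) << '\n';  // usually 0: order matters
}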

Anthiathia answered 14/1, 2015 at 21:21 Comment(19)
Nice answer Yakk. Does this algorithm work equally well on both 32bit and 64bit systems? Thanks.Bilection
@dave add more bits to 0x9e3779b9.Anthiathia
@Yakk Thanks. For anyone else listening, I doubled the binary bits of the 32bit case ( 0x9e3779b9 ) for a 64bit value of ( 0x9e3779b99e377800 ) and switch which to use by testing cpp macros i386 (32 bit intel) and x86_64 (64 bit intel)Bilection
@dave use a base 2 fractional irrational value for max entropy.Anthiathia
@Yakk Oh! Thank you, I'd forgotten the number wasn't just any constant, which you explained so well above. :) Using your inverse of the golden ratio as a 64 bit fixed point number, I come up with this, which I'll use instead for the 64bit case: 0x9e3779b97f492000. Does it matter that this constant is even? Would it be better to add a one to the end of it?Bilection
@Bilection Not sure; but it ending with 000 is suspicious; that value probably has double bits of precision, not 64.Anthiathia
@Yakk I used a couple online converters to come up with the numbers (probably written in javascript), so hmm... you're right, I doubt anyone is trying to be more precise than double. I'll re-examine. Also, oops stack overflow formatting prints the macros wrong in my earlier comment. They are __i386__ and __x86_64__ (with leading and trailing double-underlines)Bilection
OK, to be complete... here is the full precision 64bit constant (calculated with long doubles, and unsigned long longs): 0x9e3779b97f4a7c16. Interestingly it is still even. Re-doing the same calculation using PI instead of the Golden Ratio produces: 0x517cc1b727220a95 which is odd, instead of even, thus probably "more prime" than the other constant. I used: std::cout << std::hex << (unsigned long long) ((1.0L/3.14159265358979323846264338327950288419716939937510L)*(powl(2.0L,64.0L))) << std::endl; with cout.precision( numeric_limits<long double>::max_digits10 ); Thanks again Yakk.Bilection
@Bilection the inverse golden ratio rule for these cases is the first odd number equal to or larger than the calculation you are doing. So just add 1. It is an important number because the sequence of N * the ratio, mod the max size (2^64 here) places the next value in the sequence exactly at that ratio in the middle of the largest 'gap' in numbers. Search the web for "Fibonacci hashing" for more info.Pique
@Bilection the right number would be 0.9E3779B97F4A7C15F39... See link. You could be suffering from the round-to-even rule (which is good for accountants), or, more simply, if you start with a literal sqrt(5) constant, then when you subtract 1 you remove the high-order bit, so a bit of precision must have been lost.Sapindaceous
Also good, but a lot more expensive, would be hash(hash(a)) + hash(b).Sapindaceous
But, wait, is it 0x0.9e377... or 0x9e377...? Sorry, getting confused since the 32-bit version in the main answer uses 0x9e377...Bilection
@Bilection The hash constant is a fixed-point hexadecimal value. The radix point isn't part of the encoding, as it sits implicitly before the most significant digit of the value. This is a bit confusing as C++ has (recently?) added hex floating-point literals, but prior to that a radix point in a hex value wasn't legal C++. In short, omit the decimal point.Anthiathia
@Dave, just reading your comments after some years... I think, instead of testing the macros, it would be better to just have two overloads for uint32_t and uint64_t.Spheroidal
In the last paragraph seed should probably be changed with lhs. Great answer!Goddart
"this means it erases less information" - no. There is the same amount of information when you add two random numbers and truncate or when you xor them. Both results have maximum entropy. The rest is still true though.Untangle
Except, sometimes we want our hash to be order agnostic, e.g., when trying to hash an unordered collection.Rosana
In 2022, it might make sense to present the 64 bit variant by default, as this is the canonical answer for how to combine hashes in C++ on SO.Peti
we need some way to retrieve the "magic" constants so we won't need the if constexpr. Also, the choice of inverse pi on 64-bit systems seems arbitrary to me. Why not just stick to the inverse golden ratio or, alternatively, switch to the inverse pi everywhere?Toodleoo

In spite of its handy bit-mixing properties, XOR is not a good way to combine hashes due to its commutativity. Consider what would happen if you stored the permutations of {1, 2, …, 10} in a hash table of 10-tuples: because XOR is commutative (and associative), every permutation combines to exactly the same value, so all of them collide.

A much better choice is m * H(A) + H(B), where m is a large odd number.

Credit: The above combiner was a tip from Bob Jenkins.
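
A short C++ sketch of that combiner (my translation, not Bob Jenkins' code; the particular constant is just a convenient large odd number, and the sketch assumes a 64-bit size_t):

#include <cstddef>

// m*H(A) + H(B): m is an arbitrary large odd constant.
std::size_t combine(std::size_t ha, std::size_t hb) {
    const std::size_t m = 0x100000001b3;   // the 64-bit FNV prime, one convenient large odd number
    return m * ha + hb;
}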

Machos answered 24/4, 2013 at 21:37 Comment(8)
Sometimes commutativity is a good thing, but xor is a lousy choice even then because all pairs of matching items will get hashed to zero. An arithmetic sum is better; the hash of a pair of matching items will retain only 31 bits of useful data rather than 32, but that's a lot better than retaining zero. Another option may be to compute the arithmetic sum as a long and then munge the upper portion back in with the lower portion.Crusade
m = 3 is actually a good choice and very fast on many systems. Note that for any odd m integer multiplication is modulo 2^32 or 2^64 and is therefore invertible so you're not losing any bits.Aldosterone
What happens when you go beyond MaxInt?Braden
instead of any odd number one should choose a primeLayby
XOR is fine, if you are combining two different quality hash functions. For example SHA1(A) XOR SipHash(B) (mashed together at equal length, of course)Pique
@Infinum that's not necessary when combining hashes.Machos
Why not do H(H(A) || H(B)) where || is concatenate?Extensor
@CaseyRodarmor you could, but stringifying and concatenating two hashes and then computing a third hash is far more expensive than a multiplication and an addition for no improvement in the quality of the hash.Machos

Xor may be the "default" way to combine hashes, but Greg Hewgill's answer also shows why it has its pitfalls: the xor of two identical hash values is zero. In real life, identical hashes are more common than one might expect. You might then find that in these (not so infrequent) corner cases, the resulting combined hashes are always the same (zero). Hash collisions would be much, much more frequent than you expect.

In a contrived example, you might be combining hashed passwords of users from different websites you manage. Unfortunately, a large number of users reuse their passwords, and a surprising proportion of the resulting hashes are zero!

Sankaran answered 19/8, 2011 at 0:9 Comment(1)
I hope the contrived example never happens; passwords should be salted.Exterminatory

There's something I want to point out explicitly for others who find this page. AND and OR restrict the output, as BlueRaja - Danny Pflughoe is trying to point out, but this can be stated more precisely:

First I want to define two simple functions I'll use to explain this: Min() and Max().

Min(A, B) will return the value that is smaller between A and B, for example: Min(1, 5) returns 1.

Max(A, B) will return the value that is larger between A and B, for example: Max(1, 5) returns 5.

If you are given: C = A AND B

Then you can find that C <= Min(A, B). We know this because there is nothing you can AND with the 0 bits of A or B to make them 1s. So every zero bit stays a zero bit, and every one bit has a chance to become a zero bit (and thus produce a smaller value).

With: C = A OR B

The opposite is true: C >= Max(A, B). With this, we see the corollary of the AND case. Any bit that is already a one cannot be ORed into being a zero, so it stays a one, but every zero bit has a chance to become a one, and thus produce a larger number.

This implies that the state of one input places restrictions on the output. If you AND anything with 90, you know the output will be less than or equal to 90, regardless of what the other value is.

For XOR, there is no such restriction implied by the inputs. There are special cases: for example, if you XOR a byte with 255 you get its bitwise complement. But any possible byte can be output. Every bit has a chance to change state depending on the same bit in the other operand.
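
A small sketch (not part of the original answer) that spot-checks these bounds on random byte values:

#include <algorithm>
#include <cassert>
#include <random>

int main() {
    std::mt19937 rng(123);
    std::uniform_int_distribution<int> byte(0, 255);

    for (int i = 0; i < 100000; ++i) {
        int a = byte(rng), b = byte(rng);
        assert((a & b) <= std::min(a, b));   // AND can only clear bits
        assert((a | b) >= std::max(a, b));   // OR can only set bits
        // No analogous bound holds for a ^ b: it can land above or below either input.
    }
}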

Guillermoguilloche answered 20/5, 2011 at 21:11 Comment(3)
One could say that OR is bitwise max, and AND is bitwise min.Pavlodar
Very well stated Paulo Ebermann. Nice to see you here as well as Crypto.SE!Guillermoguilloche
I created a filter which shows me everything tagged cryptography, including changes to old questions. That's how I found your answer here.Pavlodar

If you XOR a random input with a biased input, the output is random. The same is not true for AND or OR. Example:

00101001 XOR 00000000 = 00101001
00101001 AND 00000000 = 00000000
00101001 OR  11111111 = 11111111

As @Greg Hewgill mentions, even if both inputs are random, using AND or OR will result in biased output.

The reason we use XOR over something more complex is that, well, there's no need: XOR works perfectly, and it's blazingly stupid-fast.
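
Another way to see this (a sketch added for illustration, not from the original answer): XOR with any fixed value is a bijection on byte values, while AND and OR collapse many inputs onto few outputs:

#include <iostream>
#include <set>

int main() {
    const int mask = 0b00101001;            // the biased operand from the example above
    std::set<int> xor_out, and_out, or_out;

    for (int v = 0; v < 256; ++v) {
        xor_out.insert(v ^ mask);
        and_out.insert(v & mask);
        or_out.insert(v | mask);
    }
    // XOR with a fixed value reaches all 256 byte values;
    // AND and OR reach only a small subset, so they destroy uniformity.
    std::cout << xor_out.size() << ' ' << and_out.size() << ' '
              << or_out.size() << '\n';     // prints 256 8 32
}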

Cestar answered 4/5, 2011 at 20:13 Comment(0)

Cover the left 2 columns and try to work out what the inputs are using just the output.

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

When you saw a 1-bit you should have worked out that both inputs were 1.

Now do the same for XOR

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0

XOR gives away nothing about its inputs.

Incoordination answered 23/5, 2017 at 10:53 Comment(0)

Unlike OR and AND, XOR never ignores one of its inputs.

If you take AND(X, Y), for example, and feed input X with false, then input Y does not matter... and one probably wants every input to matter when combining hashes.

If you take XOR(X, Y) then BOTH inputs ALWAYS matter. There is no value of X for which Y does not matter. If either X or Y is changed, the output will reflect that.
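
A tiny sketch (added for illustration) of that point: with AND, setting X to zero makes Y irrelevant, while with XOR every change to Y shows up in the output.

#include <iostream>

int main() {
    unsigned x = 0;                          // X "stuck" at false/zero
    for (unsigned y : {0x12u, 0x34u, 0xffu}) {
        std::cout << (x & y) << ' '          // always 0: Y is ignored entirely
                  << (x ^ y) << '\n';        // tracks Y exactly: both inputs matter
    }
}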

Griddle answered 24/7, 2019 at 1:45 Comment(0)

The source code for the various versions of hashCode() in java.util.Arrays is a great reference for solid, general-purpose hashing algorithms. They are easily understood and translated into other programming languages.

Roughly speaking, most multi-attribute hashCode() implementations follow this pattern:

public static int hashCode(Object a[]) {
    if (a == null)
        return 0;

    int result = 1;

    // Accumulate: multiply the running result by 31, then add the next element's hash.
    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());

    return result;
}

You can search other StackOverflow Q&As for more information about the magic behind 31, and why Java code uses it so frequently. It is imperfect, but has very good general performance characteristics.

Keratitis answered 12/5, 2015 at 15:43 Comment(1)
Java's default "multiply by 31 and add / accumulate" hash is loaded with collisions (e.g. for any string, string + "Aa" collides with string + "BB" IIRC), and they long ago wished they had not baked that algorithm into the spec. That said, using a larger odd number with more bits set, and adding shifts or rotations, fixes that problem. MurmurHash3's 'mix' does this.Pique
