Order insensitive hash function for an array

Asked 12/12, 2012 at 20:19 Answered 14/4 at 1:38

I'm looking for a hash function that produces the same result for unordered sequences containing the same elements.

For example:

Array_1: [a, b, c]
Array_2: [b, a, c]
Array_3: [c, b, a]

The hash-function should return the same result for each of these arrays.

How can I achieve this?

The most popular answer is to sort the elements by some rule, concatenate them, and then hash the result.

Is there any other method?

Electrocorticogram answered 12/12, 2012 at 20:19 Comment(5)

Can you explain what you dislike about the most popular option? That might help us answer. – Gendarmerie 13/12, 2012 at 10:28

Any commutative function will combine the given elements to give an order-insensitive result. You can then do any further processing you want on the output. – Upwards 13/12, 2012 at 13:37

What properties should the hash have? Collision resistance against an attacker? Or just no accidental collisions? – Asphalt 15/12, 2012 at 14:37

@DuncanJones the most popular option requires sorting and such heavy operations are not supposed to be inside a hashing function. – Carr 15/12, 2012 at 17:1

What are a, b and c? How many different a, b, c's do you have? Is the number of elements in an array constant? Can a, b, c etc. be known at compile time? If you could provide this, a much better answer would be possible. – Carr 15/12, 2012 at 17:44

If a, b, c are numbers, you could sum them and then build a hash on the sum. You could multiply them too, but take care with zeros: a single zero makes the whole product zero. XORing the numbers is also an approach.
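A minimal Python sketch of these three combiners, assuming the elements are non-negative integers and a 64-bit result is wanted. The function names are made up for illustration, and mapping each value to the odd number 2*v + 1 before multiplying is just one possible way to sidestep the zero problem, not something taken from this answer:

```python
from functools import reduce
from operator import xor

MASK64 = (1 << 64) - 1  # keep every result within 64 bits


def sum_hash(values):
    """Order-insensitive: addition is commutative and associative."""
    return sum(values) & MASK64


def xor_hash(values):
    """Order-insensitive, but equal elements cancel each other out."""
    return reduce(xor, values, 0)


def product_hash(values):
    """Multiply odd factors 2*v + 1 so a zero element cannot wipe out the product."""
    result = 1
    for v in values:
        result = (result * (2 * v + 1)) & MASK64
    return result


print(sum_hash([1, 2, 3]) == sum_hash([3, 1, 2]))
print(xor_hash([5, 5]) == xor_hash([]))  # a duplicated element vanishes under XOR
```

Any of these values can then be fed into an ordinary hash function if a better bit distribution is needed.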

For very small numbers you could set the bit indexed by each number. Building a 64-bit long as the input for the hash this way only allows element values in the range 0-63.
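The bit-index idea could look roughly like this in Python. The name `bitmask_key` is hypothetical, and the range check is an added assumption so that out-of-range values fail loudly instead of silently growing the integer:

```python
def bitmask_key(values):
    """Set bit v for every element v; the order of the elements does not matter."""
    mask = 0
    for v in values:
        if not 0 <= v < 64:
            raise ValueError("element outside the range 0-63")
        mask |= 1 << v
    return mask


print(bitmask_key([0, 1, 2]) == bitmask_key([2, 0, 1]))
print(bitmask_key([1, 1]) == bitmask_key([1]))  # duplicates collapse into one bit
```

Note that this treats the array as a set: repeated elements are not distinguished from single occurrences.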

The more elements you have, the more collisions you will get. In the end you are mapping n elements of m bits each (a range of 2^(m*n) possible inputs) to a hash value of k bits. Usually m and k are constants, but n varies.

Please be aware that any lookup by hash requires a check that you actually got the correct element. In general a hash is NOT unique.

Otherwise, sort the elements and then hash them, as proposed.
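For reference, the sort-then-hash fallback might be sketched as follows, assuming the elements are strings. SHA-256 and the 8-byte length prefix are illustrative choices, not something specified in this thread; the prefix keeps ["ab", "c"] from hashing the same as ["a", "bc"] once the pieces are concatenated:

```python
import hashlib


def sorted_hash(items):
    """Hash the elements in sorted order so the input order is irrelevant."""
    h = hashlib.sha256()
    for item in sorted(items):
        data = item.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # length prefix marks element boundaries
        h.update(data)
    return h.hexdigest()


print(sorted_hash(["a", "b", "c"]) == sorted_hash(["c", "b", "a"]))
```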

Regarding the comment from CodesInChaos:

To be able to omit that test, the number of bits in the hash should be much greater than the total number of bits in the elements, say by at least 64 bits. In general this is not the case.

One common case of a secure hash/unique ID is a GUID, which is effectively 128 bits. A random sequence of text characters reaches this number of bits within 20-25 characters. Longer texts are very likely to produce collisions. Whether this is still acceptable depends on the use case.

Corriveau answered 12/12, 2012 at 20:30 Comment(3)

Too many collisions with XOR, I think, especially if a, b, c are tiny integers like 0, 1, 2, etc. Of course in this case sorting is a good idea. But I'm trying to find some universal hashing method for this case. – Electrocorticogram 12/12, 2012 at 20:58

the requirements will imply collisions. – Corriveau 15/12, 2012 at 14:11

"Please aware any access as by a hash requires a test whether to get the correct element." if it's a secure cryptographic hash you don't need that test, since finding a collision is too hard/the chance of it happening is negligible. – Asphalt 15/12, 2012 at 14:40

XOR | Sum | Sum of squares | ...

where | denotes concatenation.

XOR of hash of elements
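One way to read the concatenation is sketched below in Python, assuming non-negative integer elements and 64-bit components. The function name is hypothetical, and a tuple stands in for the concatenated value:

```python
MASK64 = (1 << 64) - 1  # each component is kept to 64 bits


def combined_hash(values):
    """Return (XOR, sum, sum of squares); each part is order-insensitive."""
    x = s = sq = 0
    for v in values:
        x ^= v
        s = (s + v) & MASK64
        sq = (sq + v * v) & MASK64
    return (x, s, sq)


print(combined_hash([1, 2, 3]) == combined_hash([3, 1, 2]))
print(combined_hash([0, 1, 2]) == combined_hash([1, 2]))  # a zero element adds nothing
```

Collisions remain possible on the raw values: a zero contributes nothing to any of the three components, so [0, 1, 2] and [1, 2] produce the same triple. Hashing each element first, as in the XOR-of-hashes variant, avoids that particular case.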

Bradawl answered 15/12, 2012 at 16:17 Comment(1)

Intuitively it looks good, but do we have a proof for this? Is collision possible? – Portal 9/6, 2013 at 22:51

In order to have order independence, you need a combining operation that is both commutative and associative. XOR and ADD are decent choices, but may lead to too many collisions.

If your values are uniformly distributed and your output hash is no larger than your input values, there's not much more you can (or need to) do -- both XOR and ADD will be close to optimum in this case. If, however, your input values are NOT uniformly distributed, or your hash output is larger than your input values, you can improve things by first using a high-diffusion transformation on each input number (independently) before combining them with XOR or ADD.

Basically, this is just a mapping from n-bit numbers (where n is the size of your input values) to k-bit numbers (where k is the output hash size), such that for any given value, flipping one bit will flip about half of the k output bits. There are many ways to do this, but pretty much any good cryptographic transformation will do. For example, if you want to produce a 128-bit hash, simply use AES128 on each input (padded with 0s and with a fixed key), and then XOR together all the ciphertext blocks. If you don't require full cryptographic strength in your hash, a reduced-round AES variant may be good enough (and may be supported in hardware on your CPU with just a couple of instructions).
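A rough Python sketch of the diffuse-then-combine idea follows. Since AES is not in the Python standard library, it swaps in BLAKE2b with a 16-byte digest as the per-element transformation; that substitution, the function names, and the assumption that inputs are integers in the range 0 to 2^64 - 1 are all illustrative choices rather than part of the answer:

```python
import hashlib

MASK128 = (1 << 128) - 1


def diffuse(value: int) -> int:
    """Map a 64-bit integer to a well-mixed 128-bit integer (BLAKE2b stands in for AES128)."""
    data = value.to_bytes(8, "big")
    digest = hashlib.blake2b(data, digest_size=16).digest()
    return int.from_bytes(digest, "big")


def multiset_xor_hash(values):
    """XOR the diffused elements; order-insensitive, but equal pairs cancel."""
    acc = 0
    for v in values:
        acc ^= diffuse(v)
    return acc


def multiset_add_hash(values):
    """Add the diffused elements modulo 2**128; equal pairs do not simply cancel."""
    acc = 0
    for v in values:
        acc = (acc + diffuse(v)) & MASK128
    return acc


print(multiset_xor_hash([0, 1, 2]) == multiset_xor_hash([2, 1, 0]))
print(multiset_xor_hash([5, 5]))  # duplicates cancel under XOR
```

With XOR, an element that appears twice cancels itself, so ADD is likely the better combiner when the arrays can contain repeated values.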

If you DO need a cryptographically secure hash, you'll need to superencrypt the final result of the XOR as well in order to protect it from attacks.

Ashleaashlee answered 14/4 at 1:38 Comment(0)
