Hash Function For Sequence of Unique Ids (UUID)
Asked Answered
S

4

12

I am storing message sequences in the database each sequence can have up to N number of messages. I want to create a hash function which will represent the message sequence and enable to check faster if message sequence exists.

Each message has a case-sensitive alphanumeric universal unique id (UUID). Consider following messages (M1, M2, M3) with ids-

M1 - a3RA0000000e0taBB M2 - a3RA00033000e0taC M3 - a3RA0787600e0taBB

Message sequences can be

Sequence-1 : (M1,M2,M3) Sequence-2 : (M1,M3,M2) Sequence-3 : (M2,M1,M3) Sequence-4 : (M1,M2) Sequence-5 : (M2,M3) ...etc...

Following is the database structure example for storing message sequence

enter image description here

Given the message sequence, we need to check whether that message sequence exists in the database. For example, check if message sequence M1 -> M2 -> M3 i.e. with UIDs (a3RA0000000e0taBB -> a3RA00033000e0taC -> a3RA0787600e0taBB) exists in the database.

Instead of scanning the rows in the table, I want to create a hash function which represents the message sequence with a hash value. Using the hash value lookup in the table supposedly faster.

My simple hash function is- enter image description here

I am wondering what would be an optimal hash function for storing the message sequence hash for faster is exists check.

Secundine answered 20/8, 2018 at 22:2 Comment(0)
Q
6

You don't need a full-blown cryptographic hash, just a fast one, so how about having a look at FastHash: https://github.com/ZilongTan/Coding/tree/master/fast-hash. If you believe 32 or 64 bit hashes are not enough (i.e. produce too many collisions) then you could use the longer MurmurHash: https://en.wikipedia.org/wiki/MurmurHash (actually, the author of FastHash recommends this approach)

There's a list of more algorithms on Wikipedia: https://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions

In any case, hashes using bit operations (SHIFT, XOR ...) should be faster than the multiplication in your approach, even on modern machines.

Questionnaire answered 23/8, 2018 at 15:13 Comment(1)
He didn't ask for the fastest hash function -- he asked for a 'faster is exists check'. I'd be surprised if the built-in hash function in whatever language was slower than the database lookup.Supranatural
S
2

How about using MD5 algorithm to generate the hash for a concatenated string of messageUIDs.

For instance- consider messages

M1 - a3RA0000000e0taBB M2 - a3RA00033000e0taC M3 - a3RA0787600e0taBB

For message sequence M1->M2->M3 string would be a3RA0000000e0taBB;a3RA00033000e0taC;a3RA0787600e0taBB which will have MD5 hash as 176B1CDE75EDFE1554888DAA863671C4.

According to this answer MD5 is robust against collisions. In the given scenario there is no need for security so MD5 may suffice.

Secundine answered 23/8, 2018 at 19:10 Comment(3)
MD5 will be a lot slower than the hashes in @memo's answer, and will probably not do be any better at avoiding collisions than any other 128-bit hash function, and probably not much or any better than your simple hash function.Lipscomb
@James Okay. I see. Is there any specific reason that you think- "MD5 will be a lot slower than the hashes in memo's answer (Fast-Hash and MurMurHash3)". Thanks!Secundine
Just my intuition based on my understanding of the number of steps in each hash functions.Lipscomb
S
0

Premature optimisation is the root of all evil. Start with the hashing function that is built into your language of choice, and then hash the lists (M1, M2), etc.. Then profile it and see if that's the bottleneck before you start using third-party hash libraries.

My guess is that database lookup will be slower than the hash computation, so it won't matter which hash you use.

In Python you can just call hash([m1, m2, m3])

In Java call the hashCode method on your ArrayList.

Supranatural answered 26/8, 2018 at 5:1 Comment(1)
In Python - TypeError: unhashable type: 'list'Santoyo
S
0

Any regular string hash algorithm (say, your language of choice base library string hash) applied to the concatenation of messages UUIDs would suffice as long as you select all messages by that hash and check that they are indeed your messages in correct order. That may or may not be efficient depending on how many messages are in a sequence usually (also think about the worst case). There is no way to guarantee collision-free hash calculation in general so you should think what you are going to do in case of a collision. Now, if you want to optimize this to make sure your hash is unique, it could be possible in some circumstances. You will know about collision once you try to insert the data, so you can do something about it (say, apply a salt or a dummy message to the sequence, or something like that to modify the hash and keep doing it until you get an empty spot), but it will require sufficiently large hashes and potentially other app-specific modifications.

Sarajane answered 29/8, 2018 at 19:8 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.