I have a java applciation in which I want to generate long
ids for strings (in order to store those strings in neo4j). In order to avoid data duplication, I would like to generate an id for each string stored in a long
integer, which should be unique for each string. How can I do that ?
long
has 64 bits. A String
of length 9 has 72 bits. from pigeon hole principle - you cannot get a unique hashing for 9 chars long strings to a long
.
If you still want a long
hash: You can just take two standard [different!] hash functions for String->int
, hash1()
and hash2()
and calculate: hash(s) = 2^32* hash1(s) + hash2(s)
String
of length 9 is 144 bits. Also, using two hash functions make so sense. What you want is simply a hashing algorithm that can operate on data of variable length such as MD5 for instance. For sure there will be collisions, but it will remain close to as minimal as it can be with 64 bits of cardinality. –
Orelee This code will calculate pretty good hash:
String s = "some string";
long hash = UUID.nameUUIDFromBytes(s.getBytes()).getMostSignificantBits();
Why don't you have a look a the hashcode()
function of String, and just adopt it to using long values instead?
Btw. if there was a way to create a unique ID for each String, then you would have found a compression algorithm that would be able to pack every String into 8 bytes (not possible by definition).
long
has 64 bits. A String
of length 9 has 72 bits. from pigeon hole principle - you cannot get a unique hashing for 9 chars long strings to a long
.
If you still want a long
hash: You can just take two standard [different!] hash functions for String->int
, hash1()
and hash2()
and calculate: hash(s) = 2^32* hash1(s) + hash2(s)
String
of length 9 is 144 bits. Also, using two hash functions make so sense. What you want is simply a hashing algorithm that can operate on data of variable length such as MD5 for instance. For sure there will be collisions, but it will remain close to as minimal as it can be with 64 bits of cardinality. –
Orelee A simple 64 bits hash can be implemented by combining CRC32 with Adler32, although they are not made for hashing. Of course the combination is not as strong as modern hash techniques, but it is portable for languages that natively provide a library for CRC.
Example in Java:
package com.example;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
public class MySimpleHash {
/**
* Calculate a 64 bits hash by combining CRC32 with Adler32.
*
* @param bytes a byte array
* @return a hash number
*/
public static long getHash(byte[] bytes) {
CRC32 crc32 = new CRC32();
Adler32 adl32 = new Adler32();
crc32.update(bytes);
adl32.update(bytes);
long crc = crc32.getValue();
long adl = adl32.getValue();
return (crc << 32) | adl;
}
public static void main(String[] args) {
String string = "This is a test string";
long hash = getHash(string.getBytes());
System.out.println("output: " + hash);
}
}
output: 7732385261082445741
Example in Python:
#!/usr/bin/python3
import zlib
def get_hash(bytes):
return zlib.crc32(bytes) << 32 | zlib.adler32(bytes)
string = "This is a test string"
hash = get_hash(string.encode())
print("output:", hash)
output: 7732385261082445741
This Gist compares some hash methods: https://gist.github.com/fabiolimace/507eac3d35900050eeb9772e5b1871ba
There are many answers, try the following:
https://mcmap.net/q/53050/-how-can-i-generate-an-md5-hash-in-javaEDIT: removed, I've missed thelong
requirement. Mea culpa.- http://en.wikipedia.org/wiki/Perfect_hash_function
Or, as suggested before, check out the sources.
PS. One more technique is to maintain a dictionary of strings: since you're unlikely to get 264 strings any time soon, you can have perfect mapping. Note though that that mapping may as well become a major bottleneck.
With a 64-bit (long) hash you can have zero collisions if you hash no more than ten billion strings. A great 64-bit hash function is mzHash64():
public static long mzHash64(byte[] data, int start, int length, long seed) {
long hash = 0xD45E69F901E72147L ^ seed;
for(int i = 0; i < length; i++)
hash = 0x3631754B22FF2D5CL * (i + data[start + i]) ^ (hash << 2) ^ (hash >>> 2);
return hash;
}
public static long mzHash64(String str) {
byte[] data = str.getBytes();
return mzHash64(data, 0, data.length, 0);
}
Sources: https://github.com/matteo65/mzHash64
© 2022 - 2024 — McMap. All rights reserved.