How can i generate a long hash of a String?
Asked Answered
M

6

14

I have a java applciation in which I want to generate long ids for strings (in order to store those strings in neo4j). In order to avoid data duplication, I would like to generate an id for each string stored in a long integer, which should be unique for each string. How can I do that ?

Mitis answered 16/2, 2012 at 10:34 Comment(3)
Couldn't you just get the hash of the Strings and cast them to long before storing in neo?Straightway
You cannot achieve "unique for all strings" - long has 64 bits, a string of length 9 has 72 bits, there got to be some strings which will be hashed to the same longTrelu
You can't get uniqueness, since there are infintitely many strings and only finitely many longs. Can you describe more specifically what you're looking for?Tala
T
5

long has 64 bits. A String of length 9 has 72 bits. from pigeon hole principle - you cannot get a unique hashing for 9 chars long strings to a long.

If you still want a long hash: You can just take two standard [different!] hash functions for String->int, hash1() and hash2() and calculate: hash(s) = 2^32* hash1(s) + hash2(s)

Trelu answered 16/2, 2012 at 10:40 Comment(1)
A character in java is UTC-16 thus a String of length 9 is 144 bits. Also, using two hash functions make so sense. What you want is simply a hashing algorithm that can operate on data of variable length such as MD5 for instance. For sure there will be collisions, but it will remain close to as minimal as it can be with 64 bits of cardinality.Orelee
C
26

This code will calculate pretty good hash:

String s = "some string";
long hash = UUID.nameUUIDFromBytes(s.getBytes()).getMostSignificantBits();
Carrolcarroll answered 7/9, 2017 at 11:30 Comment(3)
What could be the percentage of collision?Exhibition
it use md5, I am not sure about the percentageCarrolcarroll
Logged in just to upvote this. Maybe not the most correct, but practicalPhenanthrene
W
8

Why don't you have a look a the hashcode() function of String, and just adopt it to using long values instead?

Btw. if there was a way to create a unique ID for each String, then you would have found a compression algorithm that would be able to pack every String into 8 bytes (not possible by definition).

Wintergreen answered 16/2, 2012 at 10:38 Comment(0)
T
5

long has 64 bits. A String of length 9 has 72 bits. from pigeon hole principle - you cannot get a unique hashing for 9 chars long strings to a long.

If you still want a long hash: You can just take two standard [different!] hash functions for String->int, hash1() and hash2() and calculate: hash(s) = 2^32* hash1(s) + hash2(s)

Trelu answered 16/2, 2012 at 10:40 Comment(1)
A character in java is UTC-16 thus a String of length 9 is 144 bits. Also, using two hash functions make so sense. What you want is simply a hashing algorithm that can operate on data of variable length such as MD5 for instance. For sure there will be collisions, but it will remain close to as minimal as it can be with 64 bits of cardinality.Orelee
J
2

A simple 64 bits hash can be implemented by combining CRC32 with Adler32, although they are not made for hashing. Of course the combination is not as strong as modern hash techniques, but it is portable for languages that natively provide a library for CRC.

Example in Java:

package com.example;

import java.util.zip.Adler32;
import java.util.zip.CRC32;

public class MySimpleHash {

    /**
     * Calculate a 64 bits hash by combining CRC32 with Adler32.
     * 
     * @param bytes a byte array
     * @return a hash number
     */
    public static long getHash(byte[] bytes) {

        CRC32 crc32 = new CRC32();
        Adler32 adl32 = new Adler32();

        crc32.update(bytes);
        adl32.update(bytes);

        long crc = crc32.getValue();
        long adl = adl32.getValue();

        return (crc << 32) | adl;
    }

    public static void main(String[] args) {
        String string = "This is a test string";
        long hash = getHash(string.getBytes());
        System.out.println("output: " + hash);
    }
}
output: 7732385261082445741

Example in Python:

#!/usr/bin/python3

import zlib

def get_hash(bytes):
    return zlib.crc32(bytes) << 32 | zlib.adler32(bytes)

string = "This is a test string"
hash = get_hash(string.encode())
print("output:", hash)
output: 7732385261082445741

This Gist compares some hash methods: https://gist.github.com/fabiolimace/507eac3d35900050eeb9772e5b1871ba

Jaggy answered 31/10, 2021 at 4:2 Comment(0)
G
1

There are many answers, try the following:

Or, as suggested before, check out the sources.

PS. One more technique is to maintain a dictionary of strings: since you're unlikely to get 264 strings any time soon, you can have perfect mapping. Note though that that mapping may as well become a major bottleneck.

Goodkin answered 16/2, 2012 at 10:40 Comment(0)
S
0

With a 64-bit (long) hash you can have zero collisions if you hash no more than ten billion strings. A great 64-bit hash function is mzHash64():

public static long mzHash64(byte[] data, int start, int length, long seed) {    
    long hash = 0xD45E69F901E72147L ^ seed;

    for(int i = 0; i < length; i++)
        hash = 0x3631754B22FF2D5CL * (i + data[start + i]) ^ (hash << 2) ^ (hash >>> 2);

    return hash;
}

public static long mzHash64(String str) {
    byte[] data = str.getBytes();
    return mzHash64(data, 0, data.length, 0);
}

Sources: https://github.com/matteo65/mzHash64

Satire answered 19/3 at 15:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.