Generate same unique hash code for all anagrams
Asked Answered
T

8

18

Recently, I attended an interview and faced a good question regarding hash collisions.

Question : Given a list of strings, print out the anagrams together.

Example :

i/p :              {act, god, animal, dog, cat}

o/p :                 act, cat, dog, god


I want to create hashmap and put the word as key and value as list of anagrams

To avoid collision, I want to generate unique hash code for anagrams instead of sorting and using the sorted word as key.

I am looking for hash algorithm which take care of collision other than using chaining. I want algorithm to generate same hash code for both act and cat... so that it will add next word to the value list

Can anyone suggest a good algorithm ?

Tradesfolk answered 13/9, 2013 at 7:55 Comment(7)
As far as I know, hash algorithms SHOULD have collisions to group 'equal' items. Otherwise you end up with the sorted word as keys.Diaphoretic
You should treat the words as a bag of characters, rather than a list. The most trivial thing I can come up with is multiply all character values modulo some prime. But of course, that has undesired collisions really quickly.Lowenstern
Sort every word and hash the sort value is what I would do. Hashing has a side effect that collisions can always occur. There are ways to get around it but they wont guarantee you O(1) access any more iirc.Subdual
I am looking for hash algorithm which take care of collision other than using chaining. I want algorithm to generate same hash code for both act and cat... so that it will treat as values.Tradesfolk
@user1554241 You may well be able to generate the same hash code for "act" and "cat". Hashing after sorting the letters in each word is a good suggestion. You cannot ensure no collisions, because there are more strings than fixed width integers.Pitchford
@Patricia, I am not sure whether ideal solution exists. I don't want to sort and use the resulting string as key. I am looking for a solution which generated same hash code for anagrams. Assume given input is with in the fixed width rangeTradesfolk
This question is baslicly your question: #6691684 . The conclusion is you will have to sort in worst case.Subdual
G
32

Hashing with the sorted string is pretty nice, i'd have done that probably, but it could indeed be slow and cumbersome. Here's another thought, not sure if it works - pick a set of prime numbers, as small as you like, the same size as your character set, and build a fast mapping function from your chars to that. Then for a given word, map each character into the matching prime, and multiply. finally, hash using the result.

This is very similar to what Heuster suggested, only with less collisions (actually, I believe there will be no false collisions, given the uniqueness of the prime decomposition of any number).

simple e.g. -

int primes[] = {2, 3, 5, 7, ...} // can be auto generated with a simple code

inline int prime_map(char c) {
    // check c is in legal char set bounds
    return primes[c - first_char];
}

...
char* word = get_next_word();
char* ptr = word;
int key = 1;
while (*ptr != NULL) {
    key *= prime_map(*ptr);
    ptr++;
}
hash[key].add_to_list(word); 

[edit]

A few words about the uniqueness - any integer number has a single breakdown to multiplications of primes, so given an integer key in the hash you can actually reconstruct all possible strings that would hash to it, and only these words. Just break into primes, p1^n1*p2^n2*... and convert each prime to the matching char. the char for p1 would appear n1 times, and so on. You can't get any new prime you didn't explicitly used, being prime means you can't get it by any multiplication of other primes.

This brings another possible improvement - if you can construct the string, you just need to mark the permutations you saw when populating the hash. since the permutations can be ordered by lexicographic order, you can replace each one with a number. This would save the space of storing the actual strings in the hash, but would require more computations so it's not necessarily a good design choice. Still, it makes a nice complication of the original question for interviews :)

Gypsum answered 13/9, 2013 at 11:45 Comment(6)
I think this is the right solution. I googled and found similar solution. Can you give data points for selecting prime numbers as for multiplying? I will be happy if you can point me to any documentTradesfolk
Well, I'm not familiar with the "official" solution so I don't know if there are any pitfalls here, but I think any prime would do fine, so just pick from here - en.wikipedia.org/wiki/Prime_number , or use the sieve method to create them.Gypsum
Keep in mind you will very fast have an overflow if you use regular 32 or 64 bit numbers. You might want to consider another way to store your numbers to avoid this problem.Subdual
good point, since you have 26 letters the highest prime you need is 101, so if you're not limited to 9 letter words you'll probably need a big number libraryGypsum
A 64bit integer can only represent 2^64 different values (and here you use a very small set of them), but the space of anagrams is infinite.Scrapbook
@FeiJiang, yes, see my comment above yours. If we're talking about conceptual algorithms, we don't have to limit ourselves to 64bit implementations.Gypsum
P
6

Hash function : Assign primary numbers to each character. While calculating hash code, get the prime number assigned to that character and multiply with to existing value.Now all anagrams produce same hash value.

ex : a - 2, c - 3 t - 7

hash value of cat = 3*2*7 = 42 hash value of act = 2*3*7 = 42 Print all strings which are having same hash value(anagrams will have same hash value)

Potiche answered 13/9, 2013 at 7:55 Comment(2)
This would break in case the string is 'nc' because the hash value would still result to 42 but this time it would 14('n') * 3('c').Damage
@deepkataria you can have sum of digits and prod of digits ,here in case 7*3*2 =42 and 7+3+2 = 12 .These both values then needs to be matched against another set of characters of same length..But this approach will break if length of string is too large as product of these digits will cause overflow.Quagga
D
3

The other posters suggested converting characters into prime numbers and multiplying them together. If you do this modulo a large prime, you get a good hash function that won't overflow. I tested the following Ruby code against the Unix word list of most English words and found no hash collisions between words that are not anagrams of one another. (On MAC OS X, this file is located here: /usr/share/dict/words.)

My word_hash function takes the ordinal value of each character mod 32. This will make sure that uppercase and lowercase letters have the same code. The large prime I use is 2^58 - 27. Any large prime will do so long as it is less than 2^64 / A where A is my alphabet size. I am using 32 as my alphabet size, so this means I can't use a number larger than about 2^59 - 1. Since ruby uses one bit for sign and a second bit to indicate if the value is a number or an object, I lose a bit over other languages.

def word_hash(w)
  # 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same hash value, 'B' matches 'b', etc for all the upper and lower cased characters.
  # Punctuation gets assigned values that overlap the letters, but we don't care about that much.
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
  # Use a large prime number as modulus. It must be small enough so that it will not overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
  prime_modulus = (1 << 58) - 27
  h = w.chars.reduce(1) { |memo,letter| memo * primes[letter.ord % 32] % prime_modulus; }
end

words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count

whash = {}
inverse_hash = {}
words.each do |w|
  h = word_hash(w)
  whash[w] = h
  x = inverse_hash[h]
  if x && x.each_char.sort.join != w.each_char.sort.join
    puts "Collision between #{w} and #{x}"
  else
    inverse_hash[h] = w
  end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."
Doty answered 30/3, 2016 at 13:48 Comment(3)
Can someone explain the prime_modulus? I don't understand why we care about overflow when multiplying by 32 specifically either.Evade
The prime_modulus serves two purposes: (1) it is small enough to prevent overflow of a 64 bit integer (2) it has to be relatively prime with the other primes used so that when we take the modulus we do not yield different hash codes for words that are anagrams of each other. That means that no matter what order the letters appear, they give the same hash code. Thus READ and DARE and DEAR all end up with the same code.Doty
Thanks. Do you know any good resources to learn more about this? I'm a idiot.Evade
P
3

Small practical Optimization , I would suggest for the above hash method is :

Assign least prime number to vowels and then most frequently occurring consonants. Ex : e : 2 a : 3 i : 5 o : 7 u : 11 t : 13 and so on...

Also, average word length for english is : ~ 6

Also, top 26 prime numbers are less than 100 [2,3,5,7, .. , 97]

Hence, on average your hash would generate value around 100^6 = 10^12.

So there are very less chances of collision if you take prime number for modulo bigger than 10^12.

Pretermit answered 15/10, 2016 at 16:51 Comment(1)
The first 26 prime numbers are not less than 100 - the 26th prime number is 101. en.wikipedia.org/wiki/…Tenebrae
C
2

The complexity above seems very misplaced! You don't need prime numbers or hashes. It's just three simple ops:

  1. Map each OriginalWord to a Tuple of (SortedWord, OriginalWord). Example: "cat" becomes ("act", "cat"); "dog" becomes ("dgo", "dog"). This is a simple sort on the chars of each OriginalWord.
  2. Sort the Tuples by their first element. Example: ("dgo", "dog"), ("act, "cat") sorts to ("act", "cat"), ("dgo", "dog"). This is a simple sort on the entire collection.
  3. Iterate through the tuples (in order), emitting the OriginalWord. Example: ("act", "cat"), ("dgo", "dog") emits "cat" "dog". This is a simple iteration.

Two iterations and two sorts are all it takes!

In Scala, it's exactly one line of code:

val words = List("act", "animal", "dog", "cat", "elvis", "lead", "deal", "lives", "flea", "silent", "leaf", "listen")

words.map(w => (w.toList.sorted.mkString, w)).sorted.map(_._2)
# Returns: List(animal, act, cat, deal, lead, flea, leaf, dog, listen, silent, elvis, lives)

Or, as the original question implies, you only want cases where the count > 1, it's just a bit more:

scala> words.map(w => (w.toList.sorted.mkString, w)).groupBy(_._1).filter({case (k,v) => v.size > 1}).mapValues(_.map(_._2)).values.toList.sortBy(_.head)
res64: List[List[String]] = List(List(act, cat), List(elvis, lives), List(flea, leaf), List(lead, deal), List(silent, listen))
Costmary answered 13/11, 2017 at 14:50 Comment(3)
I like the simplicity here as a practical solution, but OP mentioned this was for an interview - this is straight out of CTCI, so mostly likely they're looking at this from Big-O perspective. Isn't sorting S words with max W length per word an O(S * WlogW) operation? And then we need O(SlogS) to sort the tuples? Vs hashing all the words into a dictionary and then just printing out from the dictionary, which would be O(S * hash cost), and hash cost shouldn't be more than W? Please let me know if I've missed something here - ty!Kenrick
@SlugFrisco No. The question requires the answers be sorted, so you'll need the final O(SlogS) sort regardless. As for the WlogW sorts, if the words are natural words (less than a few dozen chars each), the sort is trivial (the constant factor will eclipse any O). And, if the strings are very long (more than a few dozen chars), the multiplication of thousands of primes will be far far far more complex: even a 100 char string would require a product much much much greater than 1 googol.Costmary
Ah fair enough - the question says "print out the anagrams together" - I didn't assume they had to be in sorted order, although I see OP's output is indeed in alpha order so that's a reasonable assumption. Fair point on WlogW being generally trivial for most words as well. Thanks for replying!Kenrick
V
0

The solution using product of primes is brilliant and here's a Java implementation incase anyone needs one.

class HashUtility {
    private int n;
    private Map<Character, Integer> primeMap;

    public HashUtility(int n) {
        this.n = n;
        this.primeMap = new HashMap<>();
        constructPrimeMap();
    }

    /**
     * Utility to check if the passed {@code number} is a prime.
     *
     * @param number The number which is checked to be prime.
     * @return {@link boolean} value representing the prime nature of the number.
     */
    private boolean isPrime(int number) {
        if (number <= 2)
            return number == 2;
        else
            return (number % 2) != 0
                    &&
                    IntStream.rangeClosed(3, (int) Math.sqrt(number))
                            .filter(n -> n % 2 != 0)
                            .noneMatch(n -> (number % n == 0));
    }

    /**
     * Maps all first {@code n} primes to the letters of the given language.
     */
    private void constructPrimeMap() {
        List<Integer> primes = IntStream.range(2, Integer.MAX_VALUE)
                .filter(this::isPrime)
                .limit(this.n)      //Limit the number of primes here
                .boxed()
                .collect(Collectors.toList());

        int curAlphabet = 0;
        for (int i : primes) {
            this.primeMap.put((char) ('a' + curAlphabet++), i);
        }
    }

    /**
     * We calculate the hashcode of a word by calculating the product of each character mapping prime. This works since
     * the product of 2 primes is unique from the products of any other primes.
     * <p>
     * Since the hashcode can be huge, we return it modulo a large prime.
     *
     * @param word The {@link String} to be hashed.
     * @return {@link int} representing the prime hashcode associated with the {@code word}
     */
    public int hashCode(String word) {
        long primeProduct = 1;
        long mod = 100000007;
        for (char currentCharacter : word.toCharArray()) {
            primeProduct *= this.primeMap.get(currentCharacter) % mod;
        }

        return (int) primeProduct;
    }
}

Please let me know if/how I can improve this.

Villalba answered 2/11, 2020 at 9:12 Comment(0)
M
0

We can use the binary value representation of array. This code snippet is assuming all characters are small latin characters.

public int hashCode() {
    //TODO: so that each set of anagram generates same hashCode
    int sLen = s.length();
    int [] ref = new int[26];
    for(int i=0; i< sLen; i++) {
      ref[s.charAt(i) - 'a'] +=1;
    }
    int hashCode = 0;
    for(int i= 0; i < ref.length; i++) {
      hashCode += new Double(Math.pow(2, i)).intValue() * ref[i];
    }
    return hashCode;
  }
Mushro answered 19/10, 2021 at 20:9 Comment(0)
S
0

create the hascode in following way

   String hash(String s){
   char[] hashValue = new char[26];
   for(char c: s.toCharArray()){
                hash[c-'a']++;
            }
            return new String(hashValue);
   }

here the hash will be initialized with the default value of char u0000 and an increment will make the value to the next Unicode. since its a char array we can convert it to string and use it as the key

Station answered 8/6, 2022 at 6:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.