Hash function for a string
B

6

31

We are currently dealing with hash functions in my class. Our instructor asked us to find a hash function on the internet to compare to the two we have used in our code.

The first one:

int HashTable::hash (string word)   
// POST: the index of entry is returned
{       int sum = 0;
        for (int k = 0; k < word.length(); k++)
            sum = sum + int(word[k]);
        return  sum % SIZE; 
}

Second:

int HashTable::hash (string word)
{
   int seed = 131; 
   unsigned long hash = 0;
   for(int i = 0; i < word.length(); i++)
   {
      hash = (hash * seed) + word[i];
   }
   return hash % SIZE;
}

Where SIZE is 501 (The size of the hash table) and the input is coming from a text file of 20,000+ words.

I saw this question with a few code examples but wasn't exactly sure what to be looking for in a hash function. If I understand correctly, in my case, a hash takes an input (string) and does a math calculation to assign the string a number and inserts it in a table. This process is done to increase the speed of searching the list?

If my logic is sound, does anyone have a good example or a resource showing a different hash function that involves a string? Or even the process of writing my own efficient hash function.

Breakfast answered 29/11, 2011 at 20:47 Comment(4)
You just provided 2 answers to your question. - Polis
How can your instructor ask you to analyse two hash functions when he hasn't taught you anything about hash tables/functions? - Hyperbole
"Does anyone have a good example or a resource?" Yes. - Litter
See also softwareengineering.stackexchange.com/questions/49550/… - Humo
W
64

First, it usually does not matter that much in practice. Most hash functions are "good enough".

But if you really care, you should know that it is a research subject by itself. There are thousands of papers about that. You can still get a PhD today by studying & designing hashing algorithms.

Your second hash function is probably slightly better, because it depends on the order of the characters and so should separate the string "ab" from the string "ba", which a plain sum of character codes cannot do. On the other hand, it is probably a little slower than the first hash function. That may or may not be relevant for your application.
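To make the "ab" versus "ba" point concrete, here is a minimal sketch of both functions from the question written as free functions. The names additive_hash and polynomial_hash and the table_size parameter are made up for this illustration; the arithmetic follows the question's code, except that characters are read as unsigned char.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Additive hash, as in the question's first function.
// The sum does not depend on character order, so anagrams collide.
std::size_t additive_hash(const std::string& word, std::size_t table_size) {
    assert(table_size > 0);
    std::size_t sum = 0;
    for (unsigned char c : word) {
        sum += c;
    }
    return sum % table_size;
}

// Polynomial hash, as in the question's second function (seed 131).
// Each step multiplies the running value, so character order matters.
std::size_t polynomial_hash(const std::string& word, std::size_t table_size) {
    assert(table_size > 0);
    unsigned long hash = 0;
    for (unsigned char c : word) {
        hash = hash * 131 + c;
    }
    return hash % table_size;
}
```

With a table size of 501, additive_hash gives 195 for both "ab" and "ba", while polynomial_hash gives 280 for "ab" and 410 for "ba".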

I'll guess that hash functions used for genome strings are quite different than those used to hash family names in telephone databases. Perhaps even some string hash functions are better suited for German, than for English or French words.

Many software libraries give you good enough hash functions: Qt has qHash, C++11 has std::hash in <functional>, Glib has several hash functions in C, and POCO has some hash functions.

I quite often write hashing functions involving primes (see Bézout's identity) and xor, for example:

#define A 54059 /* a prime */
#define B 76963 /* another prime */
#define C 86969 /* yet another prime */
#define FIRSTH 37 /* also prime */
unsigned hash_str(const char* s)
{
   unsigned h = FIRSTH;
   while (*s) {
     h = (h * A) ^ (s[0] * B);
     s++;
   }
   return h; // or return h % C;
}

But I don't claim to be a hash expert. Of course, the values of A, B, C, FIRSTH should preferably be primes, but you could have chosen other prime numbers.

Look at some MD5 implementation to get a feeling of what hash functions can be.

Most good books on algorithmics have at least a whole chapter dedicated to hashing. Start with wikipages on hash function & hash table.

Wilderness answered 29/11, 2011 at 20:56 Comment(0)
I
12

-- The way to go these days --

Use SipHash. For your own protection.

-- Old and Dangerous --

unsigned int RSHash(const std::string& str)
{
    unsigned int b    = 378551;
    unsigned int a    = 63689;
    unsigned int hash = 0;

    for(std::size_t i = 0; i < str.length(); i++)
    {
        hash = hash * a + str[i];
        a    = a * b;
    }

    return (hash & 0x7FFFFFFF);
}

unsigned int JSHash(const std::string& str)
{
    unsigned int hash = 1315423911;

    for(std::size_t i = 0; i < str.length(); i++)
    {
        hash ^= ((hash << 5) + str[i] + (hash >> 2));
    }

    return (hash & 0x7FFFFFFF);
}

Ask google for "general purpose hash function"

Inflection answered 29/11, 2011 at 21:0 Comment(0)
Q
3

Hash functions for algorithmic use usually have two goals: first, they have to be fast; second, they have to distribute the values evenly across the possible numbers. A hash function is also required to always give the same number for the same input value.

If your values are strings, here are some examples of bad hash functions:

  1. string[0] - the ASCII letters a-z and A-Z are far more frequent than other characters
  2. string.length() - most strings share a handful of small lengths
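A small sketch of why these two are bad: counting how many distinct hash values each produces for a list of words. The names first_char_hash, length_hash and distinct_values are hypothetical helpers written for this illustration only.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Bad hash 1: only the first character contributes.
std::size_t first_char_hash(const std::string& s) {
    return s.empty() ? 0 : static_cast<unsigned char>(s[0]);
}

// Bad hash 2: only the length contributes.
std::size_t length_hash(const std::string& s) {
    return s.length();
}

// Counts how many distinct hash values `hash` produces over `words`.
// Fewer distinct values means more words forced into the same bucket.
template <typename Hash>
std::size_t distinct_values(const std::vector<std::string>& words, Hash hash) {
    std::set<std::size_t> seen;
    for (const std::string& w : words) {
        seen.insert(hash(w));
    }
    return seen.size();
}
```

For the five words "cat", "car", "cab", "can" and "dog", first_char_hash yields only 2 distinct values and length_hash only 1, so nearly everything collides.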

Good hash functions try to use every bit of the input while keeping the calculation time minimal. If you only need some hash code, try multiplying the bytes by prime numbers and summing them.
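One way to read the "multiply by primes and sum" advice is a running sum that is multiplied by a prime before each byte is added. The sketch below assumes that reading; the name prime_mix_hash and the choice of 31 as the prime are illustrative, not something prescribed above.

```cpp
#include <cstdint>
#include <string>

// Running sum multiplied by a prime at every step, so each byte ends up
// weighted by a different power of the prime and every byte affects the result.
std::uint32_t prime_mix_hash(const std::string& s) {
    const std::uint32_t prime = 31;  // assumption: any small odd prime would do here
    std::uint32_t h = 0;
    for (unsigned char byte : s) {
        h = h * prime + byte;  // unsigned arithmetic wraps on overflow
    }
    return h;
}
```

Reduce the result modulo the table size to get a bucket index. Because every byte is weighted differently, "ab" and "ba" hash to 3105 and 3135 respectively.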

Quickly answered 29/11, 2011 at 21:30 Comment(0)
P
2

C++ has an already implemented hash for std::string:

std::hash<std::string>

#include <iostream> // not actually required for the hash
#include <string>

auto main() ->int
{
    const std::string input = "Hello World!";
    const std::hash<std::string> hasher{};
    const auto hashResult = hasher(input);
    
    std::cout << "Hash for the input is: " << hashResult << std::endl;
}

Run this code here: https://onlinegdb.com/33KLb91ku
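To use this with a fixed-size table like the one in the question, the hash value still has to be reduced to a bucket index. A minimal sketch, where bucket_index and table_size are names invented for this example:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Maps a word to a bucket in a table with `table_size` slots.
// The value of std::hash is implementation-defined, so the index can differ
// between standard libraries, but it is stable within one program run.
std::size_t bucket_index(const std::string& word, std::size_t table_size) {
    return std::hash<std::string>{}(word) % table_size;
}
```

Calling bucket_index("Hello World!", 501) always returns a value in the range 0 to 500.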

Precede answered 8/7, 2021 at 8:3 Comment(0)
P
1

Use boost::hash

#include <boost\functional\hash.hpp>

...

std::string a = "ABCDE";
size_t b = boost::hash_value(a);
Pentecost answered 28/3, 2016 at 21:16 Comment(2)
On Linux, backslashes in #include directives are unlikely to work, so your code is probably Windows-specific (or you should change the backslashes to slashes). - Wilderness
This was an academic question about the hash concept, so this is of no use. - Breakfast
S
0

Java's String implements hashCode like this:

public int hashCode()

Returns a hash code for this string. The hash code for a String object is computed as

     s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.) 

So something like this:

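int HashTable::hash (string word) {
    // Horner's rule: equals s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    // without calling pow(); unsigned arithmetic wraps instead of overflowing.
    unsigned int result = 0;
    for(size_t i = 0; i < word.length(); ++i) {
        result = 31 * result + static_cast<unsigned char>(word[i]);
    }
    return static_cast<int>(result);
}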
Selhorst answered 29/11, 2011 at 20:55 Comment(1)
I think Java uses clever shifts to calculate that value, rather than computing the expression directly. 31 = 32 - 1, so 31^k = (32 - 1)^k = (-1)^k + k*32*(-1)^(k-1) + ... + 32^k; since 32 = 2^5, 32^7 = 2^35 no longer fits in a 32-bit int, so you only have to calculate the first few terms of the sum, and even that can be done with shifts. It's way faster than using pow(), so don't do it unless you're willing to optimize some calculations. - Quickly
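The shift idea in the comment above can be sketched as follows: since 31 = 32 - 1, multiplying by 31 is the same as shifting left by 5 and subtracting the original value. Whether any particular Java runtime actually computes it this way is not confirmed here, and the names times31_shift and java_style_hash are made up for this illustration.

```cpp
#include <cstdint>
#include <string>

// 31 * h computed as (h << 5) - h; identical modulo 2^32 for unsigned values.
std::uint32_t times31_shift(std::uint32_t h) {
    return (h << 5) - h;
}

// Same polynomial as s[0]*31^(n-1) + ... + s[n-1], evaluated left to right.
// Assumption: input is plain ASCII, so each char matches Java's char value.
std::uint32_t java_style_hash(const std::string& s) {
    std::uint32_t h = 0;
    for (unsigned char c : s) {
        h = times31_shift(h) + c;
    }
    return h;
}
```

For example, java_style_hash("ab") is 97*31 + 98 = 3105. Modern compilers usually perform this kind of strength reduction themselves, so writing 31 * h directly tends to be just as fast.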
