How can I hash a string to an int using c++?
Asked Answered
H

10

17

I have to write my own hash function. If I wanted to just make the simple hash function that maps each letter in the string to a numerical value (i.e. a=1, b=2, c=3, ...), is there a way I can perform this hash on a string without having to first convert it to a c-string to look at each individual char? Is there a more efficient way of hashing strings?

Hege answered 29/3, 2010 at 1:4 Comment(0)
C
7

Re the first question, sure, e.g, something like:

int hash = 0;
int offset = 'a' - 1;
for(string::const_iterator it=s.begin(); it!=s.end(); ++it) {
  hash = hash << 1 | (*it - offset);
}

regarding the second, there are many better ways to hash strings. E.g., see here for a few C examples (easily translatable to C++ along the lines of the snippet above).

Common answered 29/3, 2010 at 1:12 Comment(3)
i see. how about if i wanted to do case-insensitive hashing. where A=a=1?Hege
+1, if only for use of *2 and | to create a comedicaly poor hash ;-)Difficile
-1 for creating a comically poor hash. Use '^', never '|'! Even with '^', this will create a poor distribution (lots more collisions than you need) with short strings.Truehearted
T
9

From personal experience I know that this works and produces good distributions. (Plagiarised from http://www.cse.yorku.ca/~oz/hash.html):

djb2

this algorithm (k=33) was first reported by dan bernstein many years ago in comp.lang.c. another version of this algorithm (now favored by bernstein) uses xor: hash(i) = hash(i - 1) * 33 ^ str[i]; the magic of number 33 (why it works better than many other constants, prime or not) has never been adequately explained.

unsigned long hash(unsigned char *str) {
    unsigned long hash = 5381;
    int c;

    while (c = *str++) {
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    }

    return hash;
}
Truehearted answered 21/11, 2012 at 5:52 Comment(0)
C
7

Re the first question, sure, e.g, something like:

int hash = 0;
int offset = 'a' - 1;
for(string::const_iterator it=s.begin(); it!=s.end(); ++it) {
  hash = hash << 1 | (*it - offset);
}

regarding the second, there are many better ways to hash strings. E.g., see here for a few C examples (easily translatable to C++ along the lines of the snippet above).

Common answered 29/3, 2010 at 1:12 Comment(3)
i see. how about if i wanted to do case-insensitive hashing. where A=a=1?Hege
+1, if only for use of *2 and | to create a comedicaly poor hash ;-)Difficile
-1 for creating a comically poor hash. Use '^', never '|'! Even with '^', this will create a poor distribution (lots more collisions than you need) with short strings.Truehearted
M
5

You can examine each individual char from a std::string using the [] operator. However, you can look at Boost::Functional/Hash for guidance on a better hashing scheme. There is also a list of hashing functions in c located here.

Mclaughlin answered 29/3, 2010 at 1:9 Comment(2)
so, my understanding is that hash functions map a string to an int, but usually these ints are mapped using a compression map to table addresses so that the hashtable is a more manageable size. is this applicable to the hash functions you recommended in the link?Hege
You mean buckets? There are a number of "usual" functions which are trade-offs in terms of size of the hash table produced and performance criteria. The biggest concern you should have is how many repeated values, that is, how uniformly distributed your results are. Poor hashing will invariably leave you with a small collection of linked lists rather than a constant amortized time look up table. I have not examined the later while I've seen Boost. Did I answer that?Mclaughlin
C
5

Here's a C (++) hash function that I found in Stroustrup's book:

int hash(const char *str)
{
    int h = 0;
    while (*str)
       h = h << 1 ^ *str++;
    return h;
}

If you're using it for a hash table (which Stroustrup does) then you can instead return the abs of the hash modulo a prime number. So instead

    return (h > 0 ? h : -h) % N_BUCKETS;

for the last line.

Christianly answered 5/8, 2012 at 19:38 Comment(1)
If h is INT_MIN, evaluating -h results in undefined behavior. Better use unsigned numbers for hashing.Marxismleninism
G
1

C++11 ships with a standard hashing function for strings.

https://en.cppreference.com/w/cpp/string/basic_string/hash

#include <string>
#include<functional> // hash
int main(){
    std::string s = "Hello";
    std::size_t hash = std::hash<std::string>{}(s);
}
Gastroenterology answered 8/7, 2018 at 22:4 Comment(0)
N
1

Just posting an improvement to Arnestig's djb2 algorithm to be constexpr-friendly. I had to remove the unsigned qualifier of the argument so it can work with literal strings.

constexpr unsigned long hash(const char *str) {
    unsigned long hash = 5381;

    while (int c = *str++) {
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
    }

    return hash;
}
Northcutt answered 28/6, 2020 at 21:49 Comment(0)
K
0

xor the characters together, four at a time.

Kuth answered 29/3, 2010 at 1:10 Comment(5)
i don't really understand what xor is/does. could you explain?Hege
xor is a bitwise operator meaning "one-but-not-both", the '^' operator in c++. e.g. 0 ^ 1 => 1 1 ^ 1 => 0 3 ^ 1 => 2 (11 ^ 01 => 10) It'll give you a randomish integer value. Either way, you'll need to traverse the string in a way similar to Alex Martelli's solution. So go with that and you don't need to worry about word size. :)Kuth
That's not a great hash function. For example, on ASCII data it won't touch the 8th, 16th, 24th or 32nd bits of the word at all. As a practical effect, if your hashtable has 512 buckets, then half of them would never be used by ASCII strings. You want to introduce some co-prime numbers somewhere along the line, and restricting the bucket count to compensate for a weakness in the hash just isn't necessary given the availability of better hashes that aren't much slower.Difficile
Fair point. I hadn't intended this to be a good hash function, just a simple hash function. There are plenty of better hashing algorithms described by the links in other answers. I had assumed (perhaps mistakenly) that hash<string> wasn't available and the question didn't really ask for performance or hash quality. I should have stated that explicitly.Kuth
This hash function will collide on e.g. "abcd1234" and "1234abcd". More seriously, it'll produce bad distributions.Truehearted
C
0
#include <iostream>
#include <string>
#include <algorithm>

using namespace std;

// a variation on dan bernstein's algorithm
// [http://www.cse.yorku.ca/~oz/hash.html]
template<typename Int>
struct hash {
    hash() : acc(5381) { }
    template<typename Ch>
    void operator()(Ch ch) { acc = ((acc << 5) + acc) ^ ch; }
    operator Int() const { return acc; }
    Int acc;
};

int main(int argc, char* argv[])
{
    string s("Hellp, world");
    cout << hex << showbase
        << for_each(s.begin(), s.end(), hash<unsigned long long>()) << '\n';
    return 0;
}
Cabby answered 14/4, 2010 at 8:30 Comment(0)
T
0

Another way for small strings:

int hash(const char* str) {
    int hash = 0;
    int c = 0;

    while (c < std::strlen(str)) {
        hash += (int)str[c] << (int)str[c+1];
        c++;
    }
    return hash;
}
Tashinatashkent answered 7/12, 2015 at 8:4 Comment(0)
H
-2

You can make use of the member functions operator[] or at of the string class or iterators to access individual char of a string object without converting it to c-style char array.

To hash a string object to an integer you'll have to access each individual char of the string object which you can do as:

for (i=0; i < str.length(); i++) {
    // use str[i] or str.at(i) to access ith element.
}
Hasin answered 29/3, 2010 at 1:10 Comment(1)
Don't call str.length() on each for iteration, especially for hashing strings that don't change during the loop. Also, consider working directly on the str.c_str() to avoid any function call in this. Strings do end in NULL character.Johan

© 2022 - 2024 — McMap. All rights reserved.