c++11 speed comparison/cost std::hash<std::string> equal versus std::string equal directly on 2 large strings
Asked Answered
M

1

6

Hi I have a question on std::hash if I have 2 large strings for comparison and I am willing to accept that std::hash will compare equal in most cases is it more performance compliant to do something like the following instead of a direct string comparison? Also consider this will be in a loop reading a file so it will be executed several times which is the concern for large files.

std::string largeString1;  // large but not huge meaning a line of text like up to lets say 500 chars 
std::string largeString2;

// is this better than then next block in terms of performance and if so by how much?
if ( std::hash<std::string>(largeString1) == std::hash<std::string>(largeString2) )
{
// true logic
}

// is this a lot slower than the previous
if ( largeString1 == largeString2 )
{
// true logic
}
Martial answered 26/7, 2013 at 17:7 Comment(2)
And what if there is a hash collision between 2 different strings? The two comparisons will produce different results in that case. Also, most (all?) implementations will implement string comparison as an eager comparison that quits at the first mismatch found; this is not the case with computing a hash.Turne
yes on second thought this probably wasn't a great question I thought by hashing the string I was somehow gaining performance by not having to do a char by char comparison but as you and Mooing Duck show that I should trust the lib implementation to be fast by default.Martial
H
12
std::hash<std::string>(largeString1) == std::hash<std::string>(largeString2)

Will be far slower than

largeString1 == largeString2

Hashing a string involve iterating over the entire length of it. So the hash comparison requires the code to iterate the full length of both strings one at a time and run them through complex equations. The straight equality code simply iterates them at the same time and immediately quits the instant it finds a difference. Trust the library. If == could be done faster, they would have done it faster.

If you're going to be comparing each string many times, then hashing ahead of time once and comparing just the hashes may be faster, but you would still have to confirm matches since comparing hashes can give false positives. It only makes the "do not match" case faster.

Halve answered 26/7, 2013 at 17:16 Comment(3)
You are absolutely correct in the vast majority of cases. However, consider a scenario where you have to repeatedly compare the same strings again and again. Having computed hashes will allow you to perform inequality checks much much faster (Obviously, equality will have to be computed the normal way as well once the hashes match)Bauhaus
@Vadim: Edited that into the answerHalve
If you aren't expecting strings of the same size often, it will be faster to first check the lengths for equality (operator== does this first before doing the character by character comparison). We are getting down to only pathelogical cases where doing the precomputed hash comparison first will be faster (mainly same length strings and same initial substring), but even then I can come up with a better algorithm (in this case such as doing the equality comparison of characters from right to left) than checking hashes first.Knish

© 2022 - 2024 — McMap. All rights reserved.