Why might a System.String object not cache its hash code?
A glance at the source code for string.GetHashCode using Reflector reveals the following (for mscorlib.dll version 4.0):

public override unsafe int GetHashCode()
{
    fixed (char* str = ((char*) this))
    {
        char* chPtr = str;
        int num = 0x15051505;
        int num2 = num;
        int* numPtr = (int*) chPtr;
        for (int i = this.Length; i > 0; i -= 4)
        {
            num = (((num << 5) + num) + (num >> 0x1b)) ^ numPtr[0];
            if (i <= 2)
            {
                break;
            }
            num2 = (((num2 << 5) + num2) + (num2 >> 0x1b)) ^ numPtr[1];
            numPtr += 2;
        }
        return (num + (num2 * 0x5d588b65));
    }
}

Now, I realize that the implementation of GetHashCode is not specified and is implementation-dependent, so the question "is GetHashCode implemented in the form of X or Y?" is not really answerable. I'm just curious about a few things:

  1. If Reflector has disassembled the DLL correctly and this is the implementation of GetHashCode (in my environment), am I correct in interpreting this code to indicate that a string object, based on this particular implementation, would not cache its hash code?
  2. Assuming the answer is yes, why would this be? It seems to me that the memory cost would be minimal (one more 32-bit integer, a drop in the pond compared to the size of the string itself) whereas the savings would be significant, especially in cases where, e.g., strings are used as keys in a hashtable-based collection like a Dictionary<string, [...]>. And since the string class is immutable, it isn't like the value returned by GetHashCode will ever even change.

What could I be missing?


UPDATE: In response to Andras Zoltan's closing remark:

There's also the point made in Tim's answer(+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction, therefore to cache the result would be wrong.

Whoa, whoa there! This is an interesting point to make (and yes it's very true), but I really doubt that this was taken into consideration in the implementation of GetHashCode. The statement "therefore to cache the result would be wrong" implies to me that the framework's attitude regarding strings is "Well, they're supposed to be immutable, but really if developers want to get sneaky they're mutable so we'll treat them as such." This is definitely not how the framework views strings. It fully relies on their immutability in so many ways (interning of string literals, assignment of all zero-length strings to string.Empty, etc.) that, basically, if you mutate a string, you're writing code whose behavior is entirely undefined and unpredictable.

I guess my point is that for the author(s) of this implementation to worry, "What if this string instance is modified between calls, even though the class as it is publicly exposed is immutable?" would be like for someone planning a casual outdoor BBQ to think to him-/herself, "What if someone brings an atomic bomb to the party?" Look, if someone brings an atom bomb, party's over.

Advance answered 16/6, 2010 at 13:37 Comment(7)
you can always wrap string into something that caches hash code. but this doesn't answer your question – Norvun
@Andrey: Right, really I'm just wondering why this choice would have been made. Because it seems obvious to me that this would at least have crossed someone's mind. I am pretty confident there is either a very good reason this isn't done that I haven't thought of, or I'm simply missing something. – Advance
don't know offhand, but it may not be necessary due to the string interning subsystem of the framework. – Armington
@Will: But it's unclear to me how a class like Dictionary<TKey, TValue>, for example, could leverage string interning. It does need to call GetHashCode to figure out which bucket to put a string in, right? And if it needs to call GetHashCode, whether the string is interned or not, it seems to me this calculation needs to be performed. But, as I have said, I feel like I'm missing something here. – Advance
@Will: I wondered about interning as well, but that's only used for string literals and strings which are explicitly interned at runtime - Dictionary et al do not appear to do any of that (the memory overhead would be huge in many cases, too). And as Dan points out, you'd still need to recalculate the hash code to fetch the correct interned string, except in the edge case of compile-time constant strings. – Demandant
Might be interesting to point out that Java does cache the hash code for strings. – Porett
This is such a good question with even better answers. – Agglomeration

Obvious potential answer: because that will cost memory.

There's a cost/benefit analysis here:

Cost: 4 bytes for every string (and a quick test on each call to GetHashCode). It also makes the string object mutable, which obviously means you'd need to be careful about the implementation - unless you always compute the hash code up-front, which costs computing it once for every string, regardless of whether you ever hash it at all.

Benefit: Avoid recomputing the hash for string values hashed more than once

I would suggest that in many cases, there are many, many string objects and very few of them are hashed more than once - leading to a net cost. For some cases, obviously that won't be the case.

I don't think I'm in a good position to judge which comes up more often... I would hope that MS has instrumented various real apps. (I'd also hope that Sun did the same for Java, which does cache the hash...)
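To make the trade-off concrete, here's a small hypothetical sketch (CountingKey is an illustrative type, not anything in the BCL) showing that, absent caching, a dictionary recomputes a key's hash on every operation:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type that counts how often its hash is computed.
sealed class CountingKey
{
    public static int HashCalls;          // total GetHashCode invocations
    private readonly string value;

    public CountingKey(string value) => this.value = value;

    public override int GetHashCode()
    {
        HashCalls++;                      // count each (re)computation
        return value.GetHashCode();
    }

    public override bool Equals(object obj) =>
        obj is CountingKey other && other.value == value;
}

class Program
{
    static void Main()
    {
        var dict = new Dictionary<CountingKey, int>();
        var key = new CountingKey("hello");

        dict[key] = 1;                    // one hash computation to insert
        dict.TryGetValue(key, out _);     // one more per lookup
        dict.TryGetValue(key, out _);

        // No caching anywhere on our side: three operations, three hashes.
        Console.WriteLine(CountingKey.HashCalls); // prints 3
    }
}
```

A string key behaves the same way with the default comparer; caching inside the string would save all but the first of those computations, at the memory cost described above.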

EDIT: I've just spoken to Eric Lippert about this (NDC is awesome :) and basically it is about the extra memory hit vs the limited benefits.

Olaolaf answered 16/6, 2010 at 13:51 Comment(15)
I don't personally think that the memory is the issue here. I think that it's partly to do with the fact that string.GetHashCode is rarely called - Dictionary, HashSet et al will use a different implementation (see my answer) - that the concept of a de-facto hash code for a string is actually nonsensical unless you always see it as a byte array, plus the fact that via the code that Tim posted you can modify the underlying memory buffer of the string, thus rendering a stored hash code useless. – Demandant
@Andras: I don't see it as a byte array - I see it as a char array, basically. I think if someone mutates the memory, they're on dangerous ground to start with... so my guess (and that's all it is) is still on memory. Dictionary etc can use a different implementation, but I don't believe they will by default, which is what most people are likely to use. – Olaolaf
At this point it would be hard not to acknowledge this as the authoritative answer. I might make a few more remarks on this later, but for now I'm just going to go ahead and accept it. Thanks for the really helpful insight. – Advance
The string object wouldn't have to be mutable if you just generate the hash code eagerly. This can be partly justified because the hash code is an equality optimization for ordinal comparisons. If two hash codes are not equal, you immediately know the strings are not equal. – Cropdusting
@naasking: But it's a waste of time for any string which isn't compared for ordinal equality with other strings. So yes, it's definitely a third option - but not without its own downsides. Will edit that into the answer. – Olaolaf
True, but consider that this hash code can be constructed while the string is being copied from its source. The extra cost is those 4 bytes, which isn't too bad, and you immediately optimize string equality. Small strings take a hit in memory bloat, but you can do a low-level optimization here too, by merging the hash code and the first 4 bytes of the string for small strings. These hash codes are then equal to the bits making up the first 4 characters of the string. You need compiler support for strings, which is pretty typical these days. – Cropdusting
@naasking: You can do it while it's being copied, but that's still a CPU hit which may well be completely unnecessary for many, many strings in the system. I would want to see numbers about how many strings are compared for equality in various situations. – Olaolaf
I wonder what the cost would be of having two string constructors, one of which allocates an extra four bytes for a string object and stores the hash code there (and the other of which doesn't), and force the LSB of the "identity hash code" value to be set for strings which are pre-hashed and clear for those that aren't? The only extra work on "normal" string construction would be clearing the bit in the identity-hash field, and the only other added work would be that GetHashCode would check that bit to determine whether to fetch or compute the hash. – Shulman
@supercat: I'm not sure how feasible it is to override the way that the identity hash code is constructed - I think it's tied into aspects of the object's sync block, so can't be tweaked independently. I could be wrong, of course. – Olaolaf
@JonSkeet: Last I read, any object whose identity hash has been queried has the hash value stored, either in the "sync block index" word (if there's no sync block; the stored number will be something that couldn't be a valid sync block index) or in the object's sync block entry within the sync block table. For this approach to work, all strings would have to pick an identity-hash value eagerly rather than lazily, but given that it doesn't matter if objects occasionally share identity-hash values, I would think a 3-4 instruction sequence would suffice. – Shulman
@JonSkeet: BTW, I'm somewhat curious as to what it would have cost to have all objects' constructors choose an identity-hash value? If ESI holds the new address, something like MOV EAX,[scrambler] / ADD EAX,ESI / OR EAX,80000000h / MOV [ESI],EAX, where scrambler would be a word that sometimes gets loaded with arbitrary values? Compared to the amortized per-byte cost of GC, the cost of four instructions which read a value [scrambler] that could likely be cached would be pretty trivial. – Shulman
@supercat: Ah - I thought if there wasn't already a sync block, it was allocated on the first call. No idea about the per-constructor cost. – Olaolaf
@JonSkeet: In .NET 1.0, the first request for an identity hash code created a sync block to hold it. That was one of the changes in .NET 2.0. I'm not sure exactly how many bits of the sync-block-index words are available to hold the identity hash value (some other bits indicate other things, like whether the current GC cycle has examined an object yet), but I would expect that enough remain that losing one to indicate whether a string is pre-hashed would be a small loss. Actually, if one could use the type-information word rather than the sync block word, that could offer a bigger advantage. – Shulman
@supercat: Ah, thanks for the info. It's easy to miss that sort of thing :) – Olaolaf
@JonSkeet: for strings that would often be compared: if a "special string" included space for a reference to the "oldest" [by some definition] known instance that was known to match, one could achieve some of the speed advantages of string interning without actually having to intern any strings. The GC would have to know about that, though, and for it to work without locking it would be necessary for string objects to have some form of immutable ranking. I wonder if anything like that has been explored? – Shulman

Firstly - there's no knowing whether caching this result would actually improve Dictionary<string, ...> et al, because they don't necessarily use String.GetHashCode; they use an IEqualityComparer<string> to get the hash code for a string.

And if you follow the likely call chain for the StringComparer class, it ends up going through to the System.Globalization.CompareInfo class, which finally terminates at this method:

[SecurityCritical, SuppressUnmanagedCodeSecurity,
 DllImport("QCall", CharSet = CharSet.Unicode)]
private static extern int InternalGetGlobalizedHashCode(
    IntPtr handle, string localeName, string source, int length, int dwFlags);

There's no way of knowing whether that method - which appears to be a native call - uses some form of internal caching based on the underlying .NET object data structure, which we can't get at from inside the .NET runtime.

However, the important thing to note with this is that one string can have many different hash codes based on how you choose to interpret the characters. Granted, this implementation is culture-insensitive - which is why it's unsuitable for these comparers.

So, whilst the additional memory storage could be a factor, I actually think it's because storing a hash code along with an instance of the string would mislead the caller - and indeed the .NET internal dev team(!) - into thinking that the string has only one hash code, when in fact it entirely depends on how you're going to interpret it: as a series of bytes (which most of us do not), or as a series of printable characters.

From a performance point of view, then, if we also accept that the comparers used by Dictionary<,> etc. can't be using the internal implementation, not caching this result probably doesn't have much of an impact, because, frankly, how often will this method actually get called in the real world? Most of the time, the hash code of a string is most likely calculated via some other mechanism.
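The "many hash codes per string" point can be seen directly with the framework's own comparers; a quick sketch (the actual values depend on the runtime, so only the relationships matter):

```csharp
using System;

class Program
{
    static void Main()
    {
        string s = "Hello";

        // The hash a Dictionary<string, ...> uses comes from whichever
        // IEqualityComparer<string> it was constructed with:
        int ordinal    = StringComparer.Ordinal.GetHashCode(s);
        int ignoreCase = StringComparer.OrdinalIgnoreCase.GetHashCode(s);

        // A case-insensitive comparer must hash equal strings equally,
        // so "Hello" and "HELLO" agree under it...
        Console.WriteLine(ignoreCase ==
            StringComparer.OrdinalIgnoreCase.GetHashCode("HELLO")); // True

        // ...which means it generally cannot return the same value as the
        // ordinal comparer. A single cached hash couldn't serve both.
        Console.WriteLine(ordinal == ignoreCase);
    }
}
```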

EDIT

There's also the point made in Tim's answer(+1 there). If he's right, and I think he is, then there's no guarantee that a string is actually immutable after construction, therefore to cache the result would be wrong.

AN ADDITIONAL EDIT(!)

Dan makes the point that strings are meant to be immutable within the .NET sphere, and that string should therefore be free to cache its own hash code on that basis. The problem is that the .NET framework also provides a legitimate way to change a supposedly immutable string that doesn't involve privileged reflection or anything else. It's a fundamental problem with strings: a string is a pointer to a buffer that you cannot control. Never mind the C# world - what about C++, where vectoring over and modifying memory buffers is commonplace? Just because you ideally shouldn't do it doesn't mean that the framework should expect you not to.

.Net happens to provide this functionality, and therefore if this was a design decision by the .Net team in response to the kind of binary thuggery suggested by Tim, then they were very wise to have taken it into account. Whether they did, or whether it is by fluke, is another matter entirely! :)

Demandant answered 16/6, 2010 at 14:7 Comment(8)
I knew there was something I just wasn't thinking of, and I think you nailed it here: the hashtable-based classes in .NET don't necessarily use TKey.GetHashCode; they use the IEqualityComparer<TKey>.GetHashCode of their internal comparers. That makes sense! – Advance
In response to your edit: Take a look at my update at the bottom of the question. (Not saying it isn't an interesting point, of course! I just don't think that could be part of the reason for the implementation of GetHashCode.) – Advance
@Dan Tao: Have updated again - while I think you have a point about 'shouldn't modify it', the point is that the .Net framework gives the developer the ability to do so within legal bounds - and therefore the .Net team would have been right to take that into account. – Demandant
@Andras: You make your point well, but I still have to disagree. What is the purpose of having a hash code in the first place? The obvious answer that immediately comes to mind is: to facilitate using an object as a key in a hashtable. But working as a key also requires a valid Equals method; and if we're mutating our keys, suddenly all of our hashtable operations aren't going to work anymore. – Advance
@Dan Tao: Yes, but I don't see the problem here (apart from the fact that we both agree modifying the string is a bad move!). If a developer is dumb enough to modify a string in this way and then try to use it as a key to retrieve something originally added by that string (the hash table won't keep the string reference, only the resulting hash), he/she won't get a hit and will be left scratching their head. – Demandant
@Andras: I feel like you're being a little inconsistent, though. First you say the designers were right to consider the potentiality for string mutation in their GetHashCode implementation; but then you say that a developer who would mutate a string used in a hashtable deserves no mercy. So for whose benefit was it decided not to cache the hash code, then? Not for a "smart" developer who wouldn't ever mutate a string, nor for a "dumb" developer who would. Also, I'm pretty sure a hashtable does need to keep a reference to the string; otherwise, how could it check if a key exists? – Advance
@Dan Tao: Fair enough - thought I'd take a look at HashTable... Guess what, it also uses IEqualityComparer - so that's null and void again. If this decision was indeed taken at all (rather than being an oversight, which I highly doubt), then it's not for the developers' benefit at all. It's for the benefit of the framework, such that it behaves consistently in the face of potentially inconsistent usage by a developer - usage which, I say again, the framework also allows. The fact that we can't come up with a good reason why you would want to mutate a string is not important. – Demandant
Finally, we have an open-source CLR, so we can see that InternalGetGlobalizedHashCode doesn't cache anything. – Moniz

I may have made a wrong conclusion here, but isn't it true that while the string is immutable in the context of a .NET String object, it's still possible to change the value?

For instance, if you were so inclined to do this...

String example = "Hello World";

unsafe
{
    fixed (char* strPointer = example)
    {
        strPointer[1] = 'a';
    }
}

...wouldn't example still represent the same String object, but now with a value that would compute a different value for GetHashCode()? I may be off-base here, but since you could easily (if not pointlessly) do this, that would cause some issues as well.
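Extending the snippet above: if a hash were cached at construction, the mutation would leave it stale. A sketch (compile with /unsafe; the string is built at runtime so the interned "Hello World" literal isn't corrupted for the rest of the process):

```csharp
using System;

class Program
{
    static unsafe void Main()
    {
        // Copy so we don't mutate the interned "Hello World" literal.
        string example = new string("Hello World".ToCharArray());

        int before = example.GetHashCode();

        fixed (char* strPointer = example)
        {
            strPointer[1] = 'a';          // "Hello World" -> "Hallo World"
        }

        int after = example.GetHashCode();

        Console.WriteLine(example);       // Hallo World
        // Different value, (almost certainly) different hash: a hash
        // cached before the mutation would now be wrong.
        Console.WriteLine(before == after);
    }
}
```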

Burnaby answered 16/6, 2010 at 14:21 Comment(6)
+1 - a very good point. It might be hideously bad practise - but you could certainly do it! – Demandant
Yes, you could do this, and in fact I've written an entire blog post about how this is possible; but the fact is that the framework relies on you not doing this, for many, many reasons. As just one example, consider string interning of literals. If you type Console.WriteLine("Hello World") after the code you've posted, it will output Hallo World. Since the framework relies so heavily on immutable strings, I doubt the designers would've bothered themselves with this possibility. – Advance
@Dan: That's definitely a fair argument, given that what I wrote is certainly not something I'd ever expect anyone to do in actual .NET code. Like the other answers mention though, I feel like it's ultimately a case of the arguments against just stacking up, whether they're just theoretical issues or otherwise. As a side note, that was an enjoyable blog post. – Burnaby
Good point. I certainly agree with you about the arguments stacking up against. Ultimately, as I imagine generally happens whenever one questions decisions made by the .NET designers on low-level, widely-used code like this (at least when I do, anyway), it looks like - not surprisingly - there's a lot more to think about than I originally considered. – Advance
Sure, but you can also use tricks like this to assign the integer value 1 to bool1 and 2 to bool2. Then bool1 == true and bool2 == true, but bool1 && bool2 != true. The framework is not designed to defend against these kinds of things. That's for consenting adults only. – Porett
Worth noting that back then (before .NET 4.0) strings were mutable for the StringBuilder class (and potentially for a few more classes internally). It's not a strong argument, since it's within the framework, but perhaps they thought hash code maintenance wasn't worth the headache. – Agglomeration

One more potential reason for this is that interned strings (specifically those that are added as shared read-only data by the compiler) can have exactly the same format as any other string. The fact that these strings are loaded into read-only memory means that those data pages can be shared easily across processes, but it would not be possible for them to also cache a hash code.

But as others have mentioned, the primary reason for not caching the value is that the additional memory usage is likely to far outweigh the potential savings of hashcode caching. The execution time of GetHashCode is O(N) on the length of the string so the worst case scenario of repeated hashing is well bounded.

Parmentier answered 16/6, 2010 at 15:25 Comment(0)

Yes, it will cost memory, but what's more important, it will cost memory even if you don't use this feature.

Maybe it would be beneficial to have a hash-code-optimized string implementation in the framework.

Anyway, it should be trivial to implement your own:

using System;
using System.Runtime.CompilerServices;

public sealed class InternedString : IEquatable<InternedString>
{
    public InternedString(string s) => String = string.Intern(s);

    public string String { get; }

    public override bool Equals(object obj) =>
        obj is InternedString other && Equals(other);

    public bool Equals(InternedString other) => String.Equals(other?.String);

    public override int GetHashCode() => RuntimeHelpers.GetHashCode(String);

    public static bool operator ==(InternedString l, InternedString r) =>
        l?.String == r?.String;

    public static bool operator !=(InternedString l, InternedString r) => !(l == r);
}

The idea here is to make sure that each wrapped string is interned, so we can rely on the string references inside equal InternedStrings always being the same. This approach optimizes both GetHashCode and Equals calls, making this class a perfect candidate for a Dictionary key.

The drawback is the cost of interning. Using it everywhere would be overkill. The typical usage scenario is a Dictionary with a few, but very long, string keys.
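Why this works, as a standalone check: interning collapses equal strings into a single canonical reference, so the O(1) identity hash from RuntimeHelpers agrees for equal keys even when they were built separately. (The "config" strings below are just illustrative values.)

```csharp
using System;
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        // Two equal strings built at runtime: distinct instances...
        string a = string.Concat("config:", "db");
        string b = string.Concat("config:d", "b");

        // ...until interning maps both to one canonical instance.
        string ia = string.Intern(a);
        string ib = string.Intern(b);

        Console.WriteLine(ReferenceEquals(ia, ib));        // True

        // Hence the reference-based identity hash agrees too, without
        // ever re-reading the characters of a long key.
        Console.WriteLine(RuntimeHelpers.GetHashCode(ia) ==
                          RuntimeHelpers.GetHashCode(ib)); // True
    }
}
```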

UPD:

BTW, I have packaged it, and added a benchmark, check it out.

Dominations answered 25/10, 2019 at 17:25 Comment(0)

Any int value is a valid HashCode. This means there is no default int value like -1 or 0 that we can use to indicate that we haven't computed the HashCode yet. So if a string were to cache its HashCode, it would need to do one of the following:

  • Have an int field for the HashCode, plus a bool field to serve as a flag for whether the HashCode has been computed yet, and then only compute the HashCode the first time it's requested (lazy evaluation), or
  • Have an int field for the HashCode, and always compute the HashCode when the string is constructed.

Both choices have a drawback; the first requires yet more additional memory, and the second has the performance cost of computing HashCodes that may never be needed.
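The first option above can be made cheaper than a separate bool flag by using a sentinel, which is essentially what Java's String.hashCode does. A sketch (HashCachingString is a hypothetical wrapper, not a BCL type):

```csharp
using System;

// Hypothetical wrapper sketching lazy hash caching with a sentinel:
// 0 means "not computed yet", so no separate bool flag is needed.
sealed class HashCachingString
{
    private readonly string value;
    private int cachedHash;               // 0 = not yet computed

    public HashCachingString(string value) => this.value = value;

    public override int GetHashCode()
    {
        int h = cachedHash;
        if (h == 0)                       // first request
        {
            h = value.GetHashCode();
            if (h == 0) h = 1;            // remap a real hash of 0 so the
                                          // sentinel stays unambiguous
                                          // (Java instead just recomputes)
            cachedHash = h;               // cache for subsequent calls
        }
        return h;
    }

    public override bool Equals(object obj) =>
        obj is HashCachingString other && other.value == value;
}
```

Equal wrapped values still hash equally, since both go through the same remapping; the trade-off is exactly the extra int per instance discussed in the other answers.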

Now consider the case of Dictionary<TKey,TValue>. The HashCode used by Dictionary depends upon which comparer is being used. The default comparer will use the object's normal GetHashCode() method. But you could create a Dictionary that uses a case insensitive comparer for example, and the HashCode used by Dictionary will be produced by that comparer, which is likely to produce an entirely different HashCode than String.GetHashCode(). So which HashCode does the string cache? A string might be in two Dictionaries, with each using a different comparer, neither of which uses the normal string GetHashCode. So the string could be caching a HashCode none of the Dictionaries even use.

In the case of Dictionary<TKey,TValue>, there is an even more important reason that having strings cache their HashCodes will likely provide no performance benefit. The internal implementation of Dictionary does the following when a new entry is added:

  • Computes the HashCode of the key using the GetHashCode() method of the equality comparer provided at construction, or the default comparer if none was specified.
  • Strips the sign bit off the HashCode
  • Stores the new entry, which consists of the modified HashCode from above, the key, the value, and the index of the next entry in the list of entries that map to the same bucket.

When the Dictionary does a Key lookup, it computes the modified (i.e. positive) HashCode of the key being searched for, gets the bucket that HashCode maps to, then looks through the list of entries in that bucket. To check if an entry is a match, it first checks if the modified HashCodes match (if the keys are equal, the HashCodes must be equal too), and if they are equal, checks if the two keys are equal as well. In the case of strings, this algorithm achieves two things; first, it avoids many string comparisons by using a simple integer compare first to see if it's worth doing a string compare, and second, it caches the HashCodes of every key in the Dictionary. The HashCode of each key in the Dictionary is computed only once, when the key/value pair are added to the Dictionary.

(If you're wondering why Dictionary strips the sign bit from the HashCode, it's because it uses a -1 as a marker flag value in the hashCode field for entry slots that are currently empty.)
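The lookup described above can be sketched in miniature; the field and type names here are illustrative, not the actual BCL internals:

```csharp
using System;

// Simplified sketch of the bucket scan described above: each stored entry
// keeps the (sign-stripped) hash it was inserted with, so a lookup can
// reject non-matches with an int compare before paying for a full string
// comparison. No growth/resize: capacity must be large enough up-front.
sealed class TinyStringMap
{
    private struct Entry
    {
        public int HashCode;  // cached at insert time, sign bit stripped
        public string Key;
        public int Value;
        public int Next;      // next entry in the same bucket, or -1
    }

    private readonly int[] buckets;
    private readonly Entry[] entries;
    private int count;

    public TinyStringMap(int capacity)
    {
        buckets = new int[capacity];
        for (int i = 0; i < capacity; i++) buckets[i] = -1;
        entries = new Entry[capacity];
    }

    public void Add(string key, int value)
    {
        int hash = key.GetHashCode() & 0x7FFFFFFF; // strip the sign bit
        int bucket = hash % buckets.Length;
        entries[count] = new Entry
        {
            HashCode = hash, Key = key, Value = value, Next = buckets[bucket]
        };
        buckets[bucket] = count++;
    }

    public bool TryGetValue(string key, out int value)
    {
        int hash = key.GetHashCode() & 0x7FFFFFFF; // hashed once per lookup
        for (int i = buckets[hash % buckets.Length]; i >= 0; i = entries[i].Next)
        {
            // Cheap int compare first; string compare only on a hash match.
            if (entries[i].HashCode == hash && entries[i].Key == key)
            {
                value = entries[i].Value;
                return true;
            }
        }
        value = 0;
        return false;
    }
}
```

The point for this question: HashCode is stored per entry at insert time, so keys already in the dictionary are never re-hashed; only the probe key passed to each lookup is.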

Domesday answered 22/6, 2012 at 23:11 Comment(6)
Thanks for the thorough and very helpful answer! That's some great info. The one thing I'd just push back on slightly is that it seems you're saying it would not help for a string to cache its hash code since the dictionary caches it anyway. But that's not really true, right? I mean, if I call ContainsKey(str) it isn't like the dictionary magically knows what the hash code for str is just because it happens to be cached in the dictionary. It needs to get it from str before it can look for it internally. Great points otherwise, though. – Advance
You're correct on your push back point. I fudged a little on that part by throwing in the qualifier "likely" for just that reason. It's the keys in the Dictionary whose HashCode we know will be needed multiple times (during capacity grows, and during key lookups) and will benefit most from caching. The string "str" in ContainsKey(str), we know less about, and it may very well be that the one GetHashCode() call used by ContainsKey is the only GetHashCode() in the life of str. – Domesday
An implementation of String.GetHashCode() could easily be designed so that it would never return zero (or perhaps only return zero for an empty string). If nothing else, an implementation could compute a hash code which could return any possible integer value and then, if the computed value was zero, substitute the character code for the first character. Since only one in 2^32 strings would require such a substitution, it would have almost no effect upon hash distribution. – Shulman
Your 2 options paint too bleak a picture, because you miss a simple third one, which has actually been implemented in Java's String.hashCode() method, probably since day 1: only calculate if the cached/uninitialized value is 0. As @Shulman noted, you could make the hash algo avoid or reduce the likelihood of that value. But even if you don't, "recalculate if hash is 0" will in any event require much less calculation than "always recalculate hash". – Putscher
@EugeneBeresovsky: IMHO, the right approach for dealing with string hash codes (and actually strings in general) would have been to have String be an abstract class for which derivatives could only be defined in the system, and have different classes for strings with different characteristics. For strings containing entirely ASCII characters [including the vast majority of "machine-readable strings"], using 16 bits per character is very wasteful. For strings which are short, reserving space for a hash code would be very wasteful. For long strings, however, caching 64-bit hash values... – Shulman
...could be useful, especially if the hash computation were done in a way that code assembling strings from constituent parts whose hash values were known could make use of that knowledge. If long strings kept 64-bit hash values internally, most comparisons between unequal strings could be resolved very quickly without having to search through the entire string for a mismatch [the scenario of two equal-length strings differing only in a few characters is quite common]. – Shulman