Why do string hash codes change for each execution in .NET?

41

Consider the following code:

Console.WriteLine("Hello, World!".GetHashCode());

First run:

139068974

Second run:

-263623806

Now consider the same thing written in Kotlin:

println("Hello, World!".hashCode())

First run:

1498789909

Second run:

1498789909

Why do hash codes for string change for every execution in .NET, but not on other runtimes like the JVM?

Kensell answered 11/10, 2022 at 5:29 Comment(10)
What version of .NET are you using? A debug or release build? Debugger attached?Nuthatch
Docs: "The hash code itself is not guaranteed to be stable. Hash codes for identical strings can differ across .NET implementations, across .NET versions, and across .NET platforms (such as 32-bit and 64-bit) for a single version of .NET. In some cases, they can even differ by application domain. This implies that two subsequent runs of the same program may return different hash codes." (I know that doesn't answer your question as to WHY though)Pennate
Cannot reproduce with .NET framework 4.8. I get the same hash code (243930825) in every run (also the same in Debug and Release)Phylloxera
I can repro in .NET 6. I found the reason: String.GetHashCode uses a random seed value.Nuthatch
@Phylloxera You're probably using .NET Core (see: andrewlock.net/…)Pennate
@Pennate I actually use .NET framework 4.8 (updated my comment).Phylloxera
I meant not using. My bad.Pennate
I've not been able to find a definitive reason for WHY it's like this. I've only found the speculation in the article I linked previously.Pennate
On a side note. I wonder what the use-case for an unstable hash function is. Isn't that effectively just a random generator at that point?Whelm
@Whelm - it's stable within the process whilst that process is running, which is all they ever wanted to guarantee.Nougat
34

Why do hash codes for string change for every execution in .NET

In short: to prevent hash collision attacks. The reasoning is outlined in the docs for the <UseRandomizedStringHashAlgorithm> configuration element:

The string lookup in a hash table is typically an O(1) operation. However, when a large number of collisions occur, the lookup can become an O(n²) operation. You can use the configuration element to generate a random hashing algorithm per application domain, which in turn limits the number of potential collisions, particularly when the keys from which the hash codes are calculated are based on data input by users.
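To see what such an attack relies on, here is a minimal sketch in Java, where String.hashCode is deterministic. It assumes nothing beyond the fact that "Aa" and "BB" share a hash code, so every string built from the same number of those two blocks collides as well. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CollidingKeys {
    // Builds 2^blocks distinct strings by concatenating "Aa" or "BB" at each position.
    // "Aa" and "BB" both hash to 2112, so all strings of the same length collide.
    static List<String> collidingKeys(int blocks) {
        List<String> keys = new ArrayList<>();
        keys.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>(keys.size() * 2);
            for (String prefix : keys) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            keys = next;
        }
        return keys;
    }

    // Counts how many different hash codes the given keys produce.
    static int distinctHashCodes(List<String> keys) {
        Set<Integer> hashes = new HashSet<>();
        for (String key : keys) {
            hashes.add(key.hashCode());
        }
        return hashes.size();
    }

    public static void main(String[] args) {
        List<String> keys = collidingKeys(10);
        // prints "1024 distinct keys, 1 distinct hash code(s)"
        System.out.println(keys.size() + " distinct keys, "
                + distinctHashCodes(keys) + " distinct hash code(s)");
    }
}
```

Anyone who can predict the hash function can prepare keys like these ahead of time and force them all into one bucket. A per-process random seed makes that preparation impractical.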

but not on other runtimes like the JVM?

Not exactly. For example, Python's string hashing is also randomized per process by default. C# also produces identical hash codes across runs in .NET Framework, .NET Core 1.0 and .NET Core 2.0 when <UseRandomizedStringHashAlgorithm> is not enabled.

For Java, it may be a historical issue: the hashing arithmetic is publicly documented, and it is not a very good hash function; read this.
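For reference, the documented arithmetic is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], computed with wrapping int arithmetic. A minimal sketch that reimplements it and compares the result against the built-in method (the class name PolynomialHash is only illustrative):

```java
class PolynomialHash {
    // Reimplements the formula documented for java.lang.String.hashCode()
    // using Horner's rule; int overflow wraps around, as in the original.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "Hello, World!";
        // prints "true": there is no seed, so the value is the same on every run
        System.out.println(hash(s) == s.hashCode());
    }
}
```

Because no per-process seed is involved, the value depends only on the characters, which is why the Kotlin output in the question never changes.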

Calamondin answered 11/10, 2022 at 7:12 Comment(10)
The question is about string hash codes across program executions. This answer quotes documentation about string hash codes across application domains. Hence my downvote.Evangelize
Each program execution would create one or more AppDomain, so the two terms can be used interchangeably in this case.Clemmer
@TheodorZoulias I need to explain first. The main reason I quote this document is to explain the harm caused by hash collision. My understanding is that the two clauses of the whole question should be parallel. So I think what OP wants to ask is not why the hash code changes, but why it changes in C#, but not in Java. And although the docs says "per application domain", I've tested (in .net framework 4.7.1) it changes per execution when this flag is set. So I think it means "per application domain per execution".Calamondin
The OP hasn't specified the target .NET platform, but it's a safe bet that their observations were on the currently evolving platform (.NET Core and .NET 5+), and not on the stagnant .NET Framework. .NET Core does not support application domains, so I don't think that the quoted text from the docs is directly relevant to the question asked. It might be insightful, but that's it.Evangelize
@shingo: An appdomain is an isolation region within a process. So of course if you start two processes, you are dealing with two distinct groups of appdomains.Ditchwater
@TheodorZoulias: It might be more correct to say that Core doesn't support multiple appdomains. You get just the one, there are no appdomain boundaries to manage, so no management functions. But anything documented as "per-appdomain" still applies in .NET Core, with one appdomain per process.Ditchwater
@BenVoigt what is the effect of the setting <UseRandomizedStringHashAlgorithm enabled=0|1 /> in a .NET Core application? Does it make any difference if you configure it with 0 or 1? My point is that, assuming that it makes no difference, the quoted text does not apply to the .NET platform that the OP has experimented with. So this answer does not explain directly the results of these experiments.Evangelize
@BenVoigt I didn't catch Jeremy's comment, but now I understand, thanks for your explanation.Calamondin
@TheodorZoulias you don't have to fixate on the effect of this configuration. I found this docs because I remember that randomized string hash was already implemented in .Net Framework, and now it becomes the default behaviour.Calamondin
Shingo your latest edit improved the answer enough to revoke my downvote. I think that it could be improved even further by explaining where is this quoted text coming from, where it applies, and how it relates with the currently evolving .NET platform. Adding this context is needed IMHO, because the text as is can (1) mislead people into believing that an option to prevent the randomization exists (it doesn't), and (2) confuse people who are not familiar with the concept of application domains in .NET.Evangelize
14

Why do hash codes change for every execution in .NET?

Because changing the hash code of strings (and other objects!) on each run is a very strong hint to developers that hash codes do not have any meaning outside of the process that generated the hash.

Specifically, the documentation says:

Furthermore, .NET does not guarantee the default implementation of the GetHashCode method, and the value this method returns may differ between .NET implementations, such as different versions of .NET Framework and .NET Core, and platforms, such as 32-bit and 64-bit platforms. For these reasons, do not use the default implementation of this method as a unique object identifier for hashing purposes. Two consequences follow from this:

  • You should not assume that equal hash codes imply object equality.
  • You should never persist or use a hash code outside the application domain in which it was created, because the same object may hash differently across application domains, processes, and platforms.

By changing the hash code of a given object from one run to the next, the runtime is telling the developer not to use the hash code for anything that crosses a process/app-domain boundary. That will help to insulate developers from bugs stemming from changes to the GetHashCode algorithms used by standard classes.

Having hash codes change from one run to the next also discourages things like persisting the hash code for use as a "did this thing change" short-cut. This both prevents bugs from changes to the underlying algorithms and bugs from assuming that two objects of the same type with the same hash code are equal, when no such guarantee is made (in fact, no such guarantee can be made for any data structure which requires or allows more than 32 bits, due to the pigeonhole principle).
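If a value that stays the same across runs is genuinely needed, for example as a persisted fingerprint, it has to come from an algorithm with a fixed specification rather than from GetHashCode or hashCode. The sketch below uses SHA-256 through the standard java.security.MessageDigest API; it is written in Java rather than C#, and the class name StableDigest and the choice of UTF-8 are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class StableDigest {
    // Returns the SHA-256 digest of the UTF-8 bytes of text as lowercase hex.
    // The result depends only on the input, not on the process or platform.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha256Hex("Hello, World!"));
    }
}
```

Equal digests still do not prove equal inputs, but with a 256-bit output an accidental collision is far less likely than with a 32-bit hash code.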

Why do other languages generate stable hash codes?

Without a thorough language-by-language review, I can only speculate, but the major reasons are likely to be some combination of:

  • historical inertia (read: "backwards compatibility")
  • the disadvantages of stable hash codes were insufficiently understood when the language spec was defined
  • adding instability to hash codes was too computationally expensive when the language spec was defined
  • hash codes were less visible to developers
Goldcrest answered 11/10, 2022 at 22:57 Comment(5)
I don't know about .NET, but Perl and Python both generate random hash codes as a defense against denial-of-service attacks. (If the hash codes are static, then providing carefully-selected strings for storage changes the complexity of a hash table from O(1) to O(n) or worse.)Lipchitz
@Mark: it seems that that would rely on the app using hash codes generated out-of-process (eg., using hash codes to see if a data store has the same version of an object that the app does). Which, .NET's documentation explicitly says shouldn't be done, so it seems that defending against DoS attacks is a side-benefit or special case of "don't let hash codes out of the app domain".Goldcrest
It relies on the attacker being able to know in advance what hash codes will be generated inside the process, not that the program accept hash code generated externally.Nougat
Fair point, but that still relies on the attacker knowing other internal details of the code (eg., that a hash table is in use), and that the strings be hashed as-is (eg., not appended with the attacker's username). However: since the docs focus on incorrect assumptions about hash codes and equality, I think it's safe to focus on that aspect.Goldcrest
If stable hash codes are required, e.g. to store passwords, use the classes derived from the System.Security.Cryptography.HashAlgorithm class.Marelda
