Hashing SSNs and other limited-domain information

3

17

I'm currently working on an application where we receive private health information. One of the biggest concerns is with the SSN. Currently, we don't use the SSN for anything, but in the future we'd like to be able to use it to uniquely identify a patient across multiple facilities. The only way I can see to do that reliably is through the SSN. However, we (in addition to our customers) REALLY don't want to store the SSN.

So naturally, I thought of just SHA hashing it, since we're only using it for identification. The problem is that an attacker who knows the problem domain (an SSN) can focus on that domain: it's much easier to hash all one billion possible SSNs than to search a virtually unlimited space of passwords. I know I should use a site salt and a per-patient salt, but is there anything else I can do to prevent an attacker from revealing the SSN? Instead of SHA, I was planning on using BCrypt, since Ruby has a good library and it handles scalable complexity and salting automagically.

It's not going to be used as a password. Essentially, we get messages from many facilities, and each describes a patient. The only thing close to a globally unique identifier for a patient is the SSN. We are going to use the hash to identify the same patient at multiple facilities.

Gilbertson answered 23/7, 2010 at 3:28 Comment(4)
You may not wish to use SSN in this fashion: people may mis-write it on forms, or it may change over time. - Caras
I second sarnold. I have seen this use of SSN information just fail horribly; another scenario is when no SSN is (immediately) available. - Navarro
That's a good point, but this is intended to be more of a fuzzy solution; it doesn't have to be correct 100% of the time. Also, when no SSN is available, the feature just won't work for that patient. The only other proposed option is to use their insurance information, which has its own issues with accuracy and availability. - Gilbertson
Pet peeve: A 'site salt' is a secret key, not a salt. If you plan on using one, use HMAC, as naive prepend or append strategies for hashing with a private key have vulnerabilities. - Aitch
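
A minimal sketch of the HMAC suggestion in the last comment, in Python for illustration (the thread itself mentions Ruby). The function names, the normalization rule, and the example key are assumptions made for this sketch and are not from the thread; the only library calls used are the standard `hmac.new` and `hashlib.sha256`.

```python
import hashlib
import hmac
import re


def normalize_ssn(raw: str) -> str:
    """Drop everything except ASCII digits and require exactly nine of them."""
    digits = re.sub(r"[^0-9]", "", raw)
    if len(digits) != 9:
        raise ValueError("expected a 9-digit SSN")
    return digits


def ssn_token(raw_ssn: str, secret_key: bytes) -> str:
    """Return a keyed, deterministic identifier for an SSN (HMAC-SHA256)."""
    message = normalize_ssn(raw_ssn).encode("ascii")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Hypothetical key for illustration only. The scheme is only as good as the
# secrecy of this key, so it would have to live outside the database.
EXAMPLE_KEY = b"example-key-not-for-production"

# The same SSN written two different ways maps to the same token.
assert ssn_token("078-05-1120", EXAMPLE_KEY) == ssn_token("078051120", EXAMPLE_KEY)
```

Because the token is deterministic for a given key, every facility feed that uses the same key produces the same identifier for the same patient. Anyone who obtains both the key and the tokens can still enumerate the SSN space.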
O
11

The algorithm for generating Social Security Numbers was created before the concept of a modern hacker, and as a consequence they are extremely predictable. Using an SSN for authentication is a very bad idea; it really doesn't matter what cryptographic primitive you use or how large your salt value is. At the end of the day, the "secret" that you are trying to protect doesn't have much entropy.

If you never need to know the plain text then you should use SHA-256. SHA-256 is a very good function to use for passwords.
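
To illustrate the low-entropy point, here is a small Python sketch of an offline guess-and-check attack on an unsalted SHA-256 hash. It assumes the attacker already knows or has guessed the first five digits, leaving only the 10,000 possible four-digit serials; the function names and the sample number are hypothetical.

```python
import hashlib
from typing import Optional


def sha256_hex(text: str) -> str:
    """Plain, unsalted SHA-256 of an ASCII string, as hex."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def recover_ssn(target_hash: str, known_prefix: str) -> Optional[str]:
    """Try all 10,000 four-digit serials that can follow a known 5-digit prefix."""
    for serial in range(10_000):
        candidate = f"{known_prefix}{serial:04d}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None


# Pretend this hash came from a stolen table.
leaked = sha256_hex("078051120")
assert recover_ssn(leaked, "07805") == "078051120"
```

The loop performs at most 10,000 hash evaluations. Searching all nine digits costs at most one billion evaluations, which is still a small job for a fast hash.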

Oyster answered 23/7, 2010 at 7:17 Comment(12)
SHA-256 is a good hashing function, but it is much too fast for this case. I need something where I can control the complexity, such as bcrypt or PBKDF2. My concern is that even with all of the salting features, it will still not be enough to prevent reversal. - Gilbertson
@Preston Marshall If you desire a slow message digest function then you are confused about why they are so useful. NIST will never approve a slow message digest function. The whole point is that the function is very fast in one direction, but very computationally complex to reverse. The matter of brute force should be addressed using a salt. - Oyster
@Preston Marshall It doesn't matter if you use an encryption function when the real value is less than 9999 guesses away. - Oyster
@Preston Marshall bcrypt is using blowfish (a very old block cipher) as a message digest function. The properties are the same, and speed is still a desirable attribute. - Oyster
@The Rook: See this article for why SHA256 is bad and bcrypt is good: chargen.matasano.com/chargen/2007/9/7/… - Gilbertson
@Preston Marshall This guy is not a cryptographer. He is suggesting the use of a deprecated block cipher; twofish is the replacement for blowfish. He is also in direct conflict with NIST, which is the authority for cryptographic functions. - Oyster
@Preston Marshall A slow message digest is a terrible trade-off: you are burning your own resources needlessly and not gaining any appreciable security. A far better trade-off would be to hide the salt from the attacker, since the attacker must obtain the salt before the hash can be brute forced. John the Ripper works well for salted hashes using any primitive. - Oyster
@Preston Marshall But you're ignoring the most important part: it's a matter of the weakest link in the chain. You're trying to protect 4 digits from an attacker, and it doesn't matter how many times you call sha-512 or blowfish or any combination of the two. Four digits cannot hold enough entropy to be a reasonable secret. - Oyster
But wouldn't a slow algorithm, especially one where the slowness can be increased, help thwart offline attacks? - Gilbertson
@Preston Marshall I don't see how this helps at all, especially when the attacker only has to make 1k guesses and you have to make this calculation every time someone logs in. - Oyster
@Preston Marshall Thanks for the check, but upvote? Does that mean you know this is the right answer but you don't like it? Haha! - Oyster
Something like that. You pretty much nailed it with the fact that SSNs don't have enough entropy to begin with; the rest is irrelevant. - Gilbertson
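
The disagreement above comes down to arithmetic: a slow function multiplies the attacker's cost per guess, but the number of guesses is capped by the size of the SSN domain. The Python sketch below uses the standard `hashlib.pbkdf2_hmac` as a stand-in for "a hash with a tunable cost" (the thread mentions bcrypt and PBKDF2). The function names are hypothetical, and passing a secret key in the salt position is an assumption made so that the token stays deterministic for matching across facilities; with a random per-record salt, a lookup would need a recomputation against each stored record.

```python
import hashlib


def slow_ssn_token(ssn_digits: str, secret_key: bytes, iterations: int) -> str:
    """Deterministic, deliberately slow token via PBKDF2-HMAC-SHA256.

    Assumption for this sketch: secret_key is used as the (fixed) salt.
    """
    derived = hashlib.pbkdf2_hmac(
        "sha256", ssn_digits.encode("ascii"), secret_key, iterations
    )
    return derived.hex()


def exhaustive_search_seconds(domain_size: int, seconds_per_guess: float) -> float:
    """Worst-case time for a single worker to try every value in the domain."""
    return domain_size * seconds_per_guess


# If one guess costs 0.1 s, the full nine-digit space costs 1e8 s on one core,
# while a known five-digit prefix leaves only 10,000 guesses (1,000 s).
full_space = exhaustive_search_seconds(10**9, 0.1)
known_prefix = exhaustive_search_seconds(10**4, 0.1)
```

At 0.1 s per guess, 1e8 s is roughly three years for a single worker and the search parallelizes across machines, while 1,000 s is about 17 minutes. The cost parameter raises the price of an attack without changing the size of the search space.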
H
6

If you seriously want to hash a social security number in a secure way, do this:

  1. Find out how much entropy is in an SSN (hint: there is very little. Far less than a randomly chosen 9 digit number).
  2. Use any hashing algorithm.
  3. Keep fewer (half?) bits than there is entropy in an SSN.

Result:

  • Pro: Secure hash of an SSN because of a large number of hash collisions.
  • Pro: Your hashes are short and easy to store.
  • Con: Hash collisions.
  • Con: You can't use it for a unique identifier because of Con#1.
  • Pro: That's good because you really really need to not be using SSNs as identifiers unless you are the Social Security Administration.
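
The steps above can be sketched in Python as follows. The choice of SHA-256, the function names, and the 8-bit example width are assumptions for illustration; the answer only says to keep fewer bits than an SSN has entropy.

```python
import hashlib
from collections import Counter
from typing import Iterable


def truncated_token(ssn_digits: str, bits: int) -> int:
    """Keep only the top `bits` bits of SHA-256(ssn) as an integer."""
    if not 1 <= bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    digest = hashlib.sha256(ssn_digits.encode("ascii")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def bucket_sizes(ssns: Iterable[str], bits: int) -> Counter:
    """Count how many SSNs land on each truncated token."""
    return Counter(truncated_token(s, bits) for s in ssns)


# 10,000 SSNs sharing a hypothetical prefix, squeezed into at most 256 tokens.
sample = [f"07805{serial:04d}" for serial in range(10_000)]
sizes = bucket_sizes(sample, bits=8)

# By the pigeonhole principle at least one token is shared by 40 or more SSNs.
assert max(sizes.values()) >= 40
```

A stored token then identifies a bucket of candidate SSNs rather than a single one, which is the source of both the privacy gain and the collisions listed above.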
Huddleston answered 30/7, 2010 at 4:1 Comment(1)
You said collisions are a con, but imagine you stored the full hash and several records share the same sensitive value: the hashes match even though the rest of the data doesn't, so an attacker can identify those records as related and look for patterns. Storing a sub-hash as you suggest creates collisions on purpose, so an attacker cannot tell whether matching records are actually related. The cost is a slight overhead: a lookup returns a subset rather than only your target record(s), and you then have to decrypt that subset to narrow it down to your targets. Decide per system whether it's worth it. - Darra
S
1

First, much applause and praise for storing a hash of the SSN.

It appears as if you're reserving the SSNs as a sort of 'backup username.' In this case, you need another form of authentication besides the username - a password, a driver's license number, a passport number, proof of residence, etcetera.

Additionally, if you're concerned that an attacker is going to predict the top 10,000 SSNs for a patient born in 1984 in Arizona, and attempt each of them, then you can put in an exponentially increasing rate limiter in your application.* For additional defense, build in a notification system that alerts a sys-admin when it appears that there is an unusually high number of failed login attempts.**

*Example exponentially increasing rate limiter: After each failed request, delay the next request by (1.1^N) seconds, where N is the number of failed requests from that IP. Track IP and failed login attempts in a DB table; shouldn't add too much load, depending on the audience of your application (do you work for Google?).
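
A minimal Python sketch of the limiter described in this footnote. The class name is hypothetical and an in-memory dictionary stands in for the DB table; returning no delay before the first failure is also an assumption, since the footnote only defines the delay after a failed request.

```python
from typing import Dict


class FailureBackoff:
    """Tracks failed requests per IP and computes a base**N second delay."""

    def __init__(self, base: float = 1.1) -> None:
        self.base = base
        self.failures: Dict[str, int] = {}  # stand-in for the DB table

    def delay_for(self, ip: str) -> float:
        """Seconds the next request from this IP should wait."""
        count = self.failures.get(ip, 0)
        return 0.0 if count == 0 else self.base ** count

    def record_failure(self, ip: str) -> float:
        """Count one failed request and return the new delay."""
        self.failures[ip] = self.failures.get(ip, 0) + 1
        return self.delay_for(ip)

    def reset(self, ip: str) -> None:
        """Clear the counter, for example after a successful request."""
        self.failures.pop(ip, None)


limiter = FailureBackoff()
for _ in range(100):
    delay = limiter.record_failure("203.0.113.7")  # documentation-range IP
```

With a base of 1.1 the delay grows slowly at first and then steeply: after 100 failures from one IP it is roughly 13,780 seconds.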

**In the case where an attacker has access to multiple IPs, the notification will alert a sys-admin who can use his or her judgment to see if you have an influx of stupid users or it's a malicious attempt.

Severe answered 23/7, 2010 at 17:27 Comment(4)
It's not going to be used as a password. Essentially, we get messages from many facilities, and each describes a patient. The only thing close to a globally unique identifier for a patient is the SSN. We are going to use the hash to identify the same patient at multiple facilities. - Gilbertson
No points / applause / praise for storing a long hash of an SSN. It is functionally equivalent to storing the SSN. - Huddleston
@slartibarfast: Functionally equivalent, yes. But if their database gets compromised or stolen, he just prevented his company from leaking identifying information. - Severe
@BenWalther Prevented? No. Delayed? Yes, by about 15 seconds. - Perez
