How unique are the first 8-12 characters of SHA256 hashes?

Take this hash for example:

ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad

It's too long for my purposes so I intend to use a small chunk from it, such as:

ba7816bf8f01
ba7816bf

Or similar. My intended use case:

  • Video gallery on a website, represented by thumbnails. They are in random order.
  • They play in the lightbox. They don't have a unique ID, only their URL is unique.
  • While the lightbox is open I add something to the end of the page URL with JS History API.

//example.com/video-gallery/lightbox/ba7816bf8f01

  • The suffix needs to be short and simple, definitely not a URL.
  • People share the URL.
  • The server can make sense of the lightbox/ba7816bf8f01 in relation to /video-gallery.
  • Visiting the URL, the lightbox needs to find which video the suffix belongs to and play it.

I thought I'd SHA-256 the URL of the video and use the first few characters of the hex digest as an ad-hoc ID. How many characters of the generated hash should I use to keep the chance of a collision acceptably low?
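
A minimal Python sketch of that idea, assuming the server has the full list of video URLs at hand when it needs to resolve a suffix. The helper names `short_id` and `build_index`, the default length of 12, and the example URLs are hypothetical, chosen only for illustration.

```python
import hashlib


def short_id(video_url: str, length: int = 12) -> str:
    """Return the first `length` hex characters of the URL's SHA-256 digest."""
    return hashlib.sha256(video_url.encode("utf-8")).hexdigest()[:length]


def build_index(video_urls, length: int = 12) -> dict:
    """Map each short ID to its video URL, raising if two URLs share an ID."""
    index = {}
    for url in video_urls:
        key = short_id(url, length)
        if key in index and index[key] != url:
            raise ValueError(f"collision on {key}: {index[key]} vs {url}")
        index[key] = url
    return index


# Hypothetical gallery contents, for illustration only.
urls = [
    "https://example.com/videos/a.mp4",
    "https://example.com/videos/b.mp4",
]
index = build_index(urls)

# Resolving a shared suffix back to a video is then a dictionary lookup.
suffix = short_id(urls[0])
print(suffix, "->", index[suffix])
```

The example hash above is the SHA-256 of the string `abc`, so `short_id("abc")` gives `ba7816bf8f01`. Checking for duplicates while building the index means a collision would be caught up front instead of silently playing the wrong video.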

I got the idea from URLs and Hashing by Google.

Collaborationist answered 11/3, 2018 at 19:4 Comment(6)
You could use a function that produces a shorter hash, like MD5, or you should be fine just truncating your SHA-256. You increase the chance of collision from almost impossible to slightly less almost impossible. - Briticism
Maybe also consider Base64 encoding to increase the number of bits encoded in your ID. I see this done all the time; look at image links on Imgur. - Briticism
I've thought about that. Why is base_convert("ba7816bf8f01", 16, 36); shorter (when it uses fewer possible characters) than base64_encode("ba7816bf8f01");? - Collaborationist
That is the string representation of those bytes in hex. A hex character is 4 bits, but the string representation of that character is 16 bits. You need to Base64-encode the bytes, not the string representing those bytes. - Briticism
Well... it depends on what you are using. In Java a char is 16 bits, in C it's 8, but in any case it's more than the underlying bytes you are representing. - Briticism
Consider Base58 encoding instead of Base64. It has a number of advantages when generating human-readable identifiers. - Chabot
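
The difference discussed in these comments can be checked with a short Python sketch (Python is used here instead of the PHP calls quoted above). The variable names and the `to_base36` helper are hypothetical; it assumes the 12-character prefix from the question and the URL-safe Base64 alphabet.

```python
import base64

hex_id = "ba7816bf8f01"        # 12 hex characters = 48 bits
raw = bytes.fromhex(hex_id)    # the 6 bytes those characters stand for

# Encoding the hex *text* treats each of the 12 characters as one byte,
# so the result is longer than the input.
from_text = base64.b64encode(hex_id.encode("ascii")).decode("ascii")

# Encoding the underlying bytes is what actually shortens the ID.
from_bytes = base64.urlsafe_b64encode(raw).decode("ascii")


def to_base36(n: int) -> str:
    """Render a non-negative integer using the digits 0-9 and a-z."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = digits[r] + out
    return out or "0"


# Base 36 converts the 48-bit number itself, like base_convert(..., 16, 36).
as_base36 = to_base36(int(hex_id, 16))

print(len(from_text), len(as_base36), len(from_bytes))  # 16 10 8
```

Base 36 comes out shorter than Base64 of the hex text because it re-encodes the number, while Base64 of the text re-encodes twelve ASCII bytes. Base64 of the six raw bytes is shorter than both.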

The Wikipedia page on birthday attacks has a table showing how many entries it takes to reach a given chance of collision for a random identifier of a given number of bits. If you expect to store a million documents and want at most a one-in-a-million chance of any collision, for example, you need about 59 bits, so 64 bits (16 hex characters) is more than enough.
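
The figures in that table follow from the usual birthday approximation, p ≈ 1 − exp(−n(n−1) / 2^(b+1)) for n items and b bits. A rough Python sketch, where the function names and the gallery size of 10,000 videos are assumptions made for illustration:

```python
import math


def collision_probability(items: int, bits: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-items * (items - 1) / 2 ** (bits + 1))


def bits_needed(items: int, max_probability: float) -> int:
    """Smallest bit count whose approximate collision chance is within the limit."""
    bits = 1
    while collision_probability(items, bits) > max_probability:
        bits += 1
    return bits


# Each hex character carries 4 bits; 10,000 videos is a hypothetical size.
for hex_chars in (8, 12, 16):
    p = collision_probability(10_000, 4 * hex_chars)
    print(f"{hex_chars} hex chars, 10,000 videos: {p:.3g}")

# One million documents at a one-in-a-million collision chance.
print(bits_needed(1_000_000, 1e-6))  # 59
```

Under this approximation, 8 hex characters give roughly a 1.2% chance of at least one collision among 10,000 videos, while 12 hex characters bring it down to about 2 in 10 million.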

Base64 is a good way to fit more bits into the same length of string compared to hex, too, taking 1⅓ characters per byte instead of 2.

Perth answered 11/3, 2018 at 19:18 Comment(2)
Would I Base64 the hash or the source URL? - Collaborationist
@Firsh: The hash. In PHP, that looks like base64_encode(hash('sha256', $input, true)). Note the true, which gives raw instead of hex-encoded hash output. - Perth
