What's the best way to create a short hash, similar to what TinyURL does?
M

14

54

I'm currently using MD5 hashes but I would like to find something that will create a shorter hash that uses just [a-z][A-Z][0-9]. It only needs to be around 5-10 characters long.

Is there something out there that already does this?

Update 1:

I like the CRC32 hash. Is there a clean way of calculating it in .NET?

Update 2:

I'm using the CRC32 function from the link Joe provided. How can I convert the uint into the characters defined above?

Megavolt answered 12/7, 2009 at 20:37 Comment(2)
I think you shouldn't use any short hash, so no truncated CRC32 either...Laboured
TinyURL does not use hashes. What are you using your "hash" for? Are you trying to create a hash or a URL shortener; the two are different.Quadrature
Z
68

The .NET string object has a GetHashCode() method that returns a 32-bit integer. Format it as hexadecimal to get an 8-character string.

Like so:

string hashCode = String.Format("{0:X8}", sourceString.GetHashCode());

More on that: http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx

UPDATE: Added the remarks from the link above to this answer:

The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.

Notes to Callers

The value returned by GetHashCode is platform-dependent. It differs on the 32-bit and 64-bit versions of the .NET Framework.

Zoroastrian answered 11/1, 2012 at 18:52 Comment(5)
Short and sweet. Like .NET intended.Beheld
The only problem with String.GetHashCode is that it will generate different values on different platforms (32-bit vs. 64-bit). If you're expecting the hash code to be produced and consumed by different applications, you'll need to be careful.Balky
As Brenda stated, GetHashCode() is different on 32 and 64 systems. And, is even different between .net 1.1 and 2.0 CLRs. But most importantly, GetHashCode() is not guaranteed unique! You can get the same hash from two different strings (I know, it happened to me in a production environment).Compensatory
GetHashCode() is not suitable for such tasks. It's not guaranteed to have the same value in the next .NET version.Enfold
This is a very bad idea, as the exact algorithm by which hash codes are generated for a given class is an implementation detail which should never be persisted, because it can change between .NET versions. In fact, it HAS changed between .NET versions.Vudimir
C
40

Is your goal to create a URL shortener or to create a hash function?

If your goal is to create a URL shortener, then you don't need a hash function. In that case, you just want to pre-generate a sequence of cryptographically secure random numbers, and then assign each URL to be encoded a unique number from the sequence.

You can do this using code like:

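using System.Collections.Generic;
using System.Security.Cryptography;

const int numberOfNumbersNeeded = 100;
const int numberOfBytesNeeded = 8;
var randomNumbers = new List<byte[]>();
var randomGen = RandomNumberGenerator.Create();
for (int i = 0; i < numberOfNumbersNeeded; ++i)
{
     var bytes = new byte[numberOfBytesNeeded];
     randomGen.GetBytes(bytes); // fill with cryptographically secure random bytes
     randomNumbers.Add(bytes);  // keep each 8-byte number so it can be assigned to a URL later
}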

Using the cryptographic number generator will make it very difficult for people to predict the strings you generate, which I assume is important to you.

You can then convert the 8 byte random number into a string using the chars in your alphabet. This is basically a change of base calculation (from base 256 to base 62).
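A minimal sketch of that change-of-base step, assuming the 8 random bytes are first read as an unsigned 64-bit integer. The names `Base62`, `Encode` and `NewRandomKey`, and the ordering of the alphabet, are illustrative choices and not part of the answer above.

```csharp
using System;
using System.Security.Cryptography;

public static class Base62
{
    // Assumed alphabet ordering: digits, then upper case, then lower case.
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Converts an unsigned 64-bit value to base 62 by repeated division.
    public static string Encode(ulong value)
    {
        if (value == 0) return Alphabet[0].ToString();

        var buffer = new char[11]; // 62^11 > 2^64, so 11 digits always suffice
        int pos = buffer.Length;
        while (value > 0)
        {
            buffer[--pos] = Alphabet[(int)(value % 62)];
            value /= 62;
        }
        return new string(buffer, pos, buffer.Length - pos);
    }

    // Draws 8 cryptographically secure random bytes and encodes them.
    public static string NewRandomKey()
    {
        var bytes = new byte[8];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        return Encode(BitConverter.ToUInt64(bytes, 0));
    }
}
```

The result is at most 11 characters long. Drawing fewer random bytes would give a shorter key at the cost of a smaller key space, and random keys can still repeat, so each one should be checked for uniqueness before it is assigned.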

Chondrule answered 12/7, 2009 at 22:24 Comment(1)
"difficult for people to predict the strings you generate, which I assume is important to you" -- aha, that might be true, given Arron's "It only needs to be around 5-10 characters long". This would not be like TinyURL.com then, so it's about time Arron gives us some more details!Laboured
I
17

I don't think URL shortening services use hashes. I think they just keep a running alphanumeric string that is incremented with every new URL and stored in a database. If you really need to use a hash function, have a look at this link: some hash functions. Also, a bit off-topic, but depending on what you are working on this might be interesting: Coding Horror article.

Inscribe answered 12/7, 2009 at 20:45 Comment(0)
A
13

Just take a Base36 (case-insensitive) or Base64 of the ID of the entry.

So, let's say I wanted to use Base36:

(ID - Base36)
1 - 1
2 - 2
3 - 3
10 - A
11 - B
12 - C
...
10000 - 7PS
22000 - GZ4
34000 - Q8G
...
1000000 - LFLS
2345000 - 1E9EW
6000000 - 3KLMO

You could make these even shorter with Base64, but then the URLs would be case-sensitive. As you can see, you still get a nice, neat alphanumeric key, with a guarantee that there will be no collisions!
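As a rough sketch of how the ID-to-Base36 conversion could look in .NET: the class name `Base36` and its two methods are hypothetical helpers written for illustration, not a built-in API, and the sketch assumes non-negative IDs.

```csharp
using System;

public static class Base36
{
    private const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Encodes a non-negative ID as an upper-case Base36 string.
    public static string Encode(long id)
    {
        if (id < 0) throw new ArgumentOutOfRangeException(nameof(id));
        if (id == 0) return "0";

        string result = string.Empty;
        while (id > 0)
        {
            result = Digits[(int)(id % 36)] + result; // prepend the next digit
            id /= 36;
        }
        return result;
    }

    // Decodes a Base36 string back to the ID, ignoring case.
    public static long Decode(string key)
    {
        long id = 0;
        foreach (char c in key.ToUpperInvariant())
        {
            int digit = Digits.IndexOf(c);
            if (digit < 0) throw new FormatException("Invalid Base36 character: " + c);
            id = id * 36 + digit;
        }
        return id;
    }
}
```

For example, `Base36.Encode(10000)` yields `7PS`, and `Base36.Decode("7ps")` maps it back to 10000, so the key is reversible without any lookup table.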

Agronomics answered 13/7, 2009 at 1:4 Comment(3)
I like this. :) +1 but how do we do it in .net- quickly?Beheld
Thank you for mention about Base36Gullah
@PiotrKula this is how you do it in .NET #924271Exempt
L
7

You cannot use a short hash as you need a one-to-one mapping from the short version to the actual value. For a short hash the chance for a collision would be far too high. Normal, long hashes, would not be very user-friendly (and even though the chance for a collision would probably be small enough then, it still wouldn't feel "right" to me).

TinyURL.com seems to use an incremented number that is converted to Base 36 (0-9, A-Z).

Laboured answered 12/7, 2009 at 21:6 Comment(2)
Of course you can. Maybe you shouldn't, but it's perfectly possible.Merlon
You're very right indeed. :-) One surely shouldn't use a short hash in this situation though. I'll edit my answer and rewrite my "You cannot create a short hash".Laboured
L
5

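First I generate a list of random numbers (repeats are possible). Then I use each number to select a char from the base string, append it, and return the result. I'm selecting 5 chars from a 62-character alphabet, which gives 62^5 = 916,132,832 possible strings (6,471,002 would be the number of unordered selections of 5 distinct characters). The second part is to check against the db to see if the key already exists; if not, save the short URL.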

 const string BaseUrlChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

 private static string ShortUrl
 {
     get
     {
         const int numberOfCharsToSelect = 5;
         int maxNumber = BaseUrlChars.Length;

         var rnd = new Random();
         var numList = new List<int>();

         for (int i = 0; i < numberOfCharsToSelect; i++)
             numList.Add(rnd.Next(maxNumber));

         return numList.Aggregate(string.Empty, (current, num) => current + BaseUrlChars.Substring(num, 1));
     }
 }
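The second part (checking for an existing key before saving) could be sketched roughly as follows. The `ShortUrlStore` class and its in-memory dictionary are hypothetical stand-ins for the database; a real implementation would more likely rely on a unique constraint on the key column.

```csharp
using System;
using System.Collections.Generic;

public class ShortUrlStore
{
    private const string BaseUrlChars =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private const int KeyLength = 5;

    private readonly Random _rnd = new Random();

    // Hypothetical stand-in for the database table: key -> long URL.
    private readonly Dictionary<string, string> _urlsByKey =
        new Dictionary<string, string>();

    // Builds one random key; characters may repeat.
    private string NewKey()
    {
        var chars = new char[KeyLength];
        for (int i = 0; i < KeyLength; i++)
            chars[i] = BaseUrlChars[_rnd.Next(BaseUrlChars.Length)];
        return new string(chars);
    }

    // Generates keys until an unused one is found, then saves the mapping.
    public string Add(string longUrl)
    {
        string key;
        do
        {
            key = NewKey();
        } while (_urlsByKey.ContainsKey(key)); // retry on collision

        _urlsByKey[key] = longUrl;
        return key;
    }

    public bool TryResolve(string key, out string longUrl)
    {
        return _urlsByKey.TryGetValue(key, out longUrl);
    }
}
```

The retry loop is cheap while the key space is mostly empty, but it slows down as the store fills up, which is one reason to prefer a longer key or a counter-based scheme for large volumes.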
Lucie answered 23/9, 2012 at 16:47 Comment(1)
I like how this gives you easy control over the characters, allowing you to exclude characters that are visually ambiguous, like 0, O, l, I, 1, etc.Elderly
C
3

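You can decrease the number of characters from the MD5 hash by encoding them as alphanumerics. An MD5 hash is usually written as 32 hex digits, each with 16 possible values. [a-zA-Z0-9] offers 62 possible values per character, so you could map each group of 4 hex digits to one alphanumeric character, giving an 8-character string. Note that this mapping is lossy: 4 hex digits hold 65536 values, far more than 62.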

EDIT:

Here's a function that takes a number (4 hex digits long) and returns a character in [0-9a-zA-Z]. This should give you an idea of how to implement it. Note that there may be some issues with the types; I didn't test this code.

char num2char( unsigned int x ){
    x %= 64; /* assumption: fold a 4-hex-digit value (0..65535) into 0..63; this is lossy */
    if( x < 26 ) return (char)('a' + (int)x);
    if( x < 52 ) return (char)('A' + (int)x - 26);
    if( x < 62 ) return (char)('0' + (int)x - 52);
    if( x == 62 ) return '0'; /* duplicates the output for 52 */
    return '1';               /* x == 63; duplicates the output for 53 */
}
Centrifuge answered 12/7, 2009 at 20:45 Comment(5)
See codymanix's answer, #1117360Laboured
Hmmm, wouldn't the variable length make it hard to reverse the encoding? When invoking num2char multiple times for longer numbers, the result would need some separator between each encoded value, to tell them apart while decoding again. That makes the result much longer than when using a fixed-length encoding. If one doesn't mind using the + and / characters, then Base 64 encoding is easier I guess.Laboured
According to the question, he's looking for some hash that's shorter than the MD5 that he's currently using, and that uses alphanumerics. So, the current hash is irreversible; I think that's a requirement, or at the least not a problem. And this doesn't have 'variable length' - you take 4 hex digits from the MD5 hash, then pass it to num2char. Then take the next 4, pass that number to num2char, etc. The MD5 hash has 32 hex digits. The string you get out of my algorithm uses 32/4=8 alphanumeric characters.Centrifuge
Of course, the MD5 is irreversible, but isn't the idea that your mapping should be able to decode back to that MD5 value? As for variable length: I was wrong indeed. (I thought 0 would yield "a0", while 25 would yield "a25", but that's obviously "a" and "z" -- don't know how I could be so confused.) However, returning "0" and "1" for 62 and 63 will yield duplicates from the 3rd if(..), right? Base 64 needs the + and / characters for a reason... ;-) (And I guess the 3rd if reads (int)x - 52 instead?)Laboured
hmmm. I didn't consider that my mapping should be decodable back to MD5... I do realize that returning "0" and "1" for 62 and 63 create possible duplicates, which could be a problem, but I was just outlining an idea here. If I can think of a better way, that's easy to interpret and/or elegant, I'll edit my post. Thanks for pointing out my error on the third if statement btw :)Centrifuge
C
2

You can use CRC32. Its output is 32 bits (4 bytes, or 8 hex characters) long, and it is used much like MD5. Uniqueness can be supported by adding a timestamp to the actual value.

So it will look like http://foo.bar/abcdefg12.
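For reference, a small self-contained sketch of the standard CRC-32 (reflected polynomial 0xEDB88320) in C#. The class name `Crc32` and the helper `ComputeHex` are illustrative; this is a plain bitwise version rather than the table-driven implementation from the linked article, and it assumes the input text is encoded as UTF-8.

```csharp
using System;
using System.Text;

public static class Crc32
{
    // Computes the standard CRC-32 checksum one bit at a time.
    public static uint Compute(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int bit = 0; bit < 8; bit++)
            {
                // Shift right; apply the polynomial when the low bit was set.
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
            }
        }
        return ~crc;
    }

    // Convenience helper: CRC-32 of a UTF-8 string as 8 upper-case hex digits.
    public static string ComputeHex(string text)
    {
        return Compute(Encoding.UTF8.GetBytes(text)).ToString("X8");
    }
}
```

`Crc32.ComputeHex("123456789")` returns `CBF43926`, the usual check value for this CRC variant. With only 32 bits of output, collisions between different inputs are to be expected once many values are hashed.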

Circumscription answered 12/7, 2009 at 20:41 Comment(5)
or, from another way, you can use alphabetical increment. The keys gonna be like this: /a, /b, ... /z, /a0, /aa, /ab, /ac, ... /az, /aba, /abb, /abc, ...Circumscription
check this article - damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_netCircumscription
When prefixing or suffixing a timestamp to the hashed value, then what is the use of the hash?Laboured
@Simeon Pilgrim: yes, he can use CRC32 with a timestamp if his collision expectations are low. A timestamp that includes microseconds alone may be enough to guarantee uniqueness. Ideally, a fast hash like MD5 would be better than CRC.Elderly
@VictorStoddard if the collision expectations are low, he can use the last decimal digit, or the last bit. The point is you want zero collision. Because "expectations" and "will not happen" are not equal.Phytopathology
H
2

If you're looking for a library that generates tiny unique hashes from integers, I can highly recommend http://hashids.org/net/. I use it in many projects and it works fantastically. You can also specify your own character set for custom hashes.

Hexangular answered 11/11, 2015 at 10:44 Comment(0)
C
0

If you don't care about cryptographic strength, any of the CRC functions will do.

Wikipedia lists a bunch of different hash functions, including length of output. Converting their output to [a-z][A-Z][0-9] is trivial.

Clino answered 12/7, 2009 at 20:43 Comment(6)
-1: a CRC only provides error checking, not unique collision avoidance.Phytopathology
If you don't need cryptographic guarantees, they do a pretty good job for damn cheap in terms of CPU.Clino
but two urls will make the same CRC, and therefore have the same short-url, which is useless for a shorting service.Phytopathology
Two urls could also conceivably make the same md5, or sha1, or sha256. In practice these are rare occurrences, but are possible for all hashing schemes given the pigeon-hole principle. More likely with a non-cryptographic hash than with one certainly, but its a case you have to handle regardless of hash function.Clino
That's why tinyUrl etc dont hash or crc the url, they just assign the next number. Really depends on what actually trying to be solved here.Phytopathology
True, URL shortening services almost certainly publish a counter not a hash; but the question is for a hash function with a short output.Clino
A
0

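You could encode your MD5 hash with Base64 instead of hexadecimal. This way you get a shorter URL (22 characters instead of 32, once the padding is dropped) using the characters [a-z][A-Z][0-9] plus '+' and '/', which need to be replaced or escaped in a URL.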
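A hedged sketch of this idea in C#, assuming UTF-8 input. Plain Base64 also emits '+', '/' and '=' padding, so the sketch swaps in the URL-safe variant ('-' and '_', padding removed). The class name `Md5Base64` and the method `Hash` are made up for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Md5Base64
{
    // Hashes the input with MD5 and returns the 16-byte digest as URL-safe Base64.
    public static string Hash(string input)
    {
        using (var md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(input));
            return Convert.ToBase64String(digest)
                .TrimEnd('=')       // drop the padding
                .Replace('+', '-')  // URL-safe substitutes
                .Replace('/', '_');
        }
    }
}
```

The result is always 22 characters long. It is shorter than the 32-character hex form but still well above the 5-10 characters asked for, and truncating it further raises the collision risk.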

Allopatric answered 12/7, 2009 at 21:42 Comment(1)
Though we do not know what Arron wants to use this for: if the URLs are to be entered by humans, then I would make them case insensitive without special characters. Base 36 does this (if the script on the server treats them as such). Unfortunately, Base 36 encoding yields longer URLs than Base 64, but they are less prone to errors. Again: if humans would need to type them.Laboured
W
0

There's a wonderful but ancient program called btoa which converts binary to ASCII using upper- and lower-case letters, digits, and two additional characters. There's also the MIME base64 encoding; most Linux systems probably have a program called base64 or base64encode. Either one would give you a short, readable string from a 32-bit CRC.

Wilheminawilhide answered 13/7, 2009 at 0:28 Comment(0)
E
-1

You could take the first alphanumeric 5-10 characters of the MD5 hash.

Enamelware answered 12/7, 2009 at 20:42 Comment(3)
That's not very unique. The following snippet of code shows that the sequence of numbers from 1 - 1000 has 30 collisions in the first 5 characters: `for f in $(seq 0 10000) ; do md5 -s $f ; done | awk '{print substr($4, 0, 5)}' | sort | uniq -c | sort -n`Insist
Since he's looking for hash with a length of only 5 characters, I thought that uniqueness is not a strong requirement.Enamelware
Well, referring to TinyURL.com suggest a 100% uniqueness requirement to me. So: no short hashes (or any hash if I'd program it).Laboured
K
-2

If you need the hash to change on every call, you can do something like:

string hash = String.Format("{0:X}", DateTime.Now.GetHashCode());
Kimmi answered 27/10, 2021 at 19:3 Comment(0)
