How Does The Google URL Shortener Generate A 5 Digit Hash Without Collisions
Asked Answered
S

3

12

How can the Google URL shortener generate a unique hash with five characters without collisions. Seems like there are bound to be collisions, where different urls generate the same hash.

stackoverflow.com => http://goo.gl/LQysz

What's also interesting, is the same URL, generates a completely different hash each time:

stackoverflow.com => http://goo.gl/Dl7sz

So, doing some math, using lower-case characters, upper-case characters, and digits, the total number of combinations are 62^5 = 916,132,832 clearly collisions bound to happen.

How does Google do this?

Synthiasyntonic answered 3/11, 2011 at 1:56 Comment(1)
It keeps track of previously-generated paths, presumably.Vitus
B
9

They have a database which tracks all previously generated URLs and the longer URL that each of those maps to. Easy to make sure that newly generated URLs don't already exist in that table. A little tricky to scale out (they surely have multiple servers so each one needs to be assigned a bucket of values from which it can give out to users). If they ever reach the point of having generated 916,132,832 URLs, they'll just add another character.

Bolden answered 3/11, 2011 at 2:4 Comment(2)
So your saying the hash is basically just a random character/digits generator, and it checks the database if that hash has already been created. If so, simply tries to generate another random character/digit hash. Seems really inefficient.Synthiasyntonic
For those who are wondering. 62^5 = 916,132,832Eisenhower
A
1

They have a hash table with hash to url.

Count the number of rows in that table and encrypt it with a stream cipher then encode with base62.

Using a stream cipher instead of a hash will give you a short pseudo random output that doesn't collide with any previous output so you don't need to check the table.

Apogamy answered 24/10, 2019 at 15:35 Comment(1)
Can you please elaborate more this answer. Sounded good for what i want.Hydrophyte
F
-1
  1. It keeps track of previously used long URLs. This means that, when someone goes to create a short URL, if the place they are pointing to already has a short URL, it will just give them the pre-existing short URL.

  2. Actually, it would be inefficient to have a system dedicated to creating 'hashes' based on a given set of data. Rather, the short URL is simply a random set of characters which has already been identified as ten digits, plus 26 lowercase letters, plus 26 uppercase letters = 916132832 permutations (not combinations). Random short URLs is the most efficient way to make it work, and that is why they are always different (though I suppose there could be some other component in the algorithm like the time of day, but I don't think it's worth it....there's no point in making it that complex; spending all of that processing power just to make a silly 5 character string which any monkey could do by pressing a button the right way on a permutation calculator).

Frank answered 9/12, 2011 at 20:21 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.