How to generate Unique Integer Only IDs like Facebook Twitter
Asked Answered
W

4

18

After searching SO and other sites, I've failed to come up with conclusive evidence to how Facebook, Twitter and Pinterest generate their ID's. The reason this is needed is to avoid url collisions. Moving to an entirely different ID will prevent this because there wont be quadrillions of records.

  • Facebook.com/username/posts/362095193814294
  • Pinterest.com/pin/62487513549577588
  • Twitter.com/#!/username/status/17994686627061761

If you look at Pinterest as an example, the first few digits relate to the user id, and the last 6 or so digits represent the save id which possibly could be an auto increment.

To create a similar ID, but not unique I was able to use: base_convert(user_id.save_id, 16, 10). The problem here is that it's not unique, ex: base_convert(15.211, 16, 10) vs. base_convert(152.11, 16, 10). These two are the same. Simply just merging two unique sets of numbers will still produce duplicate results. Throwing uniqid() into the mix will essentially fix the duplicates, but this doesn't seem like a great practice.

Update: Twitter appears to use this: https://github.com/twitter/snowflake

Any suggestions on generating a unique ID like the above examples?

Whitefaced answered 15/2, 2012 at 21:38 Comment(4)
Here's how Flickr does this.Bechtold
How about just generating a random big integer? Then, if you get a conflict attempting to insert it into the database (which would be very rare), you'd generate a new one.Permissible
Sorry, but... "url collisions" between what? Your ID and a Facebook/Pinterest/Twitter ID?Archaeornis
@OhCaN Url Collisions between past and current links. For instance, at an link /2342/ could have different data than /6922/ if instead of using a new Unique ID I were to use another auto incremented integer starting at 1. This is the reason to move towards using a new type of ID.Whitefaced
D
3

The Flickr comment up above was very useful. We use sharding as well. We have an bigint (int64) locator field. It is generated by combining an int (int32) database id and an int (int32) identity field.

If you know you will have an int16 number of database max (quite likely), you could combine an int16 (smallint) database id and an int32 (int) user id and an int16 (smallint) action id. I don't know reasonable numbers for your application. But reserve some part for the database id, even if it's just tinyint, so you know you're future safe if you add more databases.

Dekker answered 15/2, 2012 at 23:9 Comment(1)
For anyone coming to this answer, it's worth checking out this article to further explain about reserving bits in your IDs to future-proof your ID structure. engineering.pinterest.com/blog/…Whitefaced
D
7

Suppose your IDs are all numeric. Delimit them by a character A (since it surely does not appear in the original IDs) and do a base conversion from base-11 to base-10.

For the example you did we now get different results:

echo base_convert("15A211", 11, 10); //247820
echo base_convert("152A11", 11, 10); //238140
Draff answered 15/2, 2012 at 22:40 Comment(3)
*"we now get different results", sorry :)Draff
Is there any chance for duplicate results here? I don't see a chance at all considering on both sides of the delimiter are integers only...Whitefaced
1. Those numbers are different, no matter what is the base used to encode them. So if they are different in base 11, they will be different in base 10, too. 2. There are no two different integers that share representation in the same base. All in all this means, the result is immune to duplicates.Draff
P
3

Actually, if you look at (for example) the IDs of users on your Friends (on Facebook), you'd notice that they are sequential among all users, exactly like an AUTO_INCREMENT database field. However, they probably don't start at 1. My friends list, for example, has some numbers in the millions, then suddenly jump to 1 trillion and something, so my guess is that the auto_increment value was bumped up - this may be done to "hide" exactly how many users there are.

Anyway, to generate unique IDs, just create them sequentially with that AUTO_INCREMENT field. Optionally, set the initial value to something high.

Peatroy answered 15/2, 2012 at 22:33 Comment(8)
thanks for your response. This is an option but not ideal as you mentioned. It allows for someone to find out how many users are on the site or how many posts are on the site.Whitefaced
That's why you set the initial AUTO_INCREMENT value. Provided nobody knows for certain what the "first" one was, it's no problem.Peatroy
To set the autoincrement: ALTER TABLE tbl AUTO_INCREMENT = 100. You can also set it in the CREATE TABLE.Schulman
It's somewhat obvious when you reset auto incremented values. Going from 1 to 10000000 or other large jumps are apparent that the increment was changed. Unless you start with values such as 11139239. Also, another indication is when you get thousands of values that only change by + 1. Ex: 25665312866, 25665312867, 25665312868, 25665312869, etc.Whitefaced
Thanks @MarcusAdams - I knew this was possible and had seen the command somewhere, but I've never used it myself and I forgot it :DPeatroy
@stwhite: On Facebook, how likely is it that one person (or group of people) will see every single ID, or even enough of them to deterimine much of a pattern?Peatroy
Also, don't forget about auto_increment_increment system variable.Schulman
The reason I'm not a fan of int IDs is the easiness of url hacking to change the user id, etcDekker
D
3

The Flickr comment up above was very useful. We use sharding as well. We have an bigint (int64) locator field. It is generated by combining an int (int32) database id and an int (int32) identity field.

If you know you will have an int16 number of database max (quite likely), you could combine an int16 (smallint) database id and an int32 (int) user id and an int16 (smallint) action id. I don't know reasonable numbers for your application. But reserve some part for the database id, even if it's just tinyint, so you know you're future safe if you add more databases.

Dekker answered 15/2, 2012 at 23:9 Comment(1)
For anyone coming to this answer, it's worth checking out this article to further explain about reserving bits in your IDs to future-proof your ID structure. engineering.pinterest.com/blog/…Whitefaced
S
2

Twitter uses snowflake IDs: https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake

They are generated by merging 41 bits of timestamp (calculated from a custom epoch) with 5 bits of data center ID, 5 bits of process / worker ID, 12 bits of sequence and 1 bit is constant, reserved, totaling for 64 bits. You can achieve 4096 unique IDs / ms / worker / data center.

Spiculate answered 4/10, 2023 at 13:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.