I recently botched a job interview by poorly answering a straightforward question: how do sites like LinkedIn efficiently show the relationship distance (1st/2nd/3rd) from you to every person displayed on a page (e.g. in people search results, list of people working in a company, etc.)?
<EDIT> I got the essential "trick" of the solution: finding "distance from me" is a common operation (e.g. 20x+ on a single page, 100's per login session), so you can do part of the "distance of me to X", cache it, and then re-use that cached partial result many times in order to make other operations much cheaper. I also guessed that the partial result was likely to be my second-level connections, because "cache all 3rd-level connections" would be too costly in RAM and CPU.</EDIT>
But when trying to convert this insight into a solution, I came up with a bumbling answer involving creating persistent caches of 2nd-level connections of everyone on the site (which would have been hugely epensive in perf and complex to maintain), and I took an inexplicable detour into using Bloom Filters in an way that made little technical sense. I wouldn't have hired myself after an answer like that!
Later, as I thought about the problem without the pressure of an interview hanging over my head, I came up a more reasonable answer.
Build a very fast way to get the first-level connections for each of batch of user IDs (batch size up to ~1000?). This probably means a dedicated cluster of lots-of-RAM servers which can cache the entire network's 1st-level connections in memory. Luckily, 50M members x avg. 100 connections per member x 4 bytes per member ID = <25GB to cache in RAM, which is doable with reasonably-priced hardware. And the number of changes per day is going to be under 1%, so keeping the cache up-to-date is not too hard. (Note that a relational database would probably be a bad choice to implement this cache because the "lots of random I/O" access pattern kills relational DB performance.)
when a user logs in, cache his or her 2nd-level connections by fetching 1st-level connections of every 1st-level connections, and stick in a hashtable (key = 2nd-level ID, value = array of 1st-level connections which connect you). Also cache your first-level connections too so you can pull back both 1st- and 2nd-level via a single call back to your remote cache server. User IDs are easily partitionable, so a distributed cache like memcached may work well for this.
for any user ID, to find whether it's in your "network" and what relationship it is to you (1st, 2nd, 3rd), do the following:
- if the ID is in your first-level connections, stop.
- try looking up the ID in your cached 2nd-level connections hashtable. If found, return the array of connections which connect you.
- fetch the ID's first level connections, and repeat step #2 for each of them. Aggregate all results into a single array and return them.
- <EDIT> refactor into a batch implementation ("look up distance from me to N different users") so you can get all the remote results from step #3 without having to make up to N remote calls.</EDIT>
But I'm sure there are better answers to this. What's yours? If you want extra challenge, try simulating an inteview situation (can't look up solutions on the Web).
Note that the question was about an optimal solution, regardless of how LinkedIn actually does it today, which I looked up after I wrote my own answer above.