I'm preparing for a system design interview, and I expect to be asked this kind of question, so I want to walk through my design process here. In addition, I'd like to hear what the best practices are for solving some of the difficulties along the way. I want to think in terms of scalability and how I would handle heavy read and write load on the database. Please correct me if I'm wrong in any of my thinking.
First, I want to build a subscribe/unsubscribe function, and for each user I want to support marking feeds as read/unread. How can I design a system like this? At first glance, the first problem I can see is that if I put all the data in the database, it could involve tons of read/write operations once thousands of users subscribe/unsubscribe to a given source, or a media source like CNN posts a feed every 5-10 minutes.
Obviously, the database would become a bottleneck once the user base grows past a certain point. How can I solve this? What is the usual line of thinking for this problem? Even though the database is a bottleneck from this point of view, we still need one, just with a better design, right? I have seen a lot of articles talking about denormalized data.
Question: what's the best way to store the subscribers for each source?
In the database, I can think of a table with "source_id" and "user_id" columns, where each row means that user_id subscribes to source_id (a rough sketch of this table is shown after the pros/cons list below). Is this a good design or a bad one? If tons of users subscribe to new sources, the database becomes a burden. The approach I can think of is to use Redis, which provides fast writes and fast reads. Advantages:
- Fast read and write operations.
- Provides multiple data structures rather than a simple key-value store.
Disadvantages:
- Data needs to fit in memory ⇒ solution: sharding. For sharding I can use twemproxy to manage the cluster.
- If a node goes down, we lose data ⇒ solution: replication with a "master-slave" setup. Write to the master, read from the slaves, and persist data to disk. Additionally, take a snapshot on an hourly basis.
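To make the relational option concrete, here is a minimal sketch of the subscription table I have in mind. The table and column names are my own illustration, and I am using sqlite3 only so the snippet is runnable; a real deployment would use a server database.

```python
# Minimal sketch of the relational subscription design described above.
# Names (subscriptions, source_id, user_id) are illustrative, not a fixed schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (
    source_id INTEGER NOT NULL,
    user_id   INTEGER NOT NULL,
    PRIMARY KEY (source_id, user_id)   -- one row per (source, user) pair, no duplicates
);
-- Secondary index for the reverse lookup: "which sources does this user follow?"
CREATE INDEX idx_subscriptions_user ON subscriptions (user_id);
""")

# Subscribe: user 42 follows source 7 (say, CNN).
conn.execute("INSERT OR IGNORE INTO subscriptions (source_id, user_id) VALUES (?, ?)", (7, 42))

# Fan-out read: all subscribers of source 7.
subscribers = [row[0] for row in
               conn.execute("SELECT user_id FROM subscriptions WHERE source_id = ?", (7,))]
print(subscribers)
```

The composite primary key keeps a user from subscribing to the same source twice, and the secondary index covers the "what does this user follow?" query.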
Now that I have listed the pros and cons of moving to a Redis cluster, how do I store the relation between source and subscriber in Redis? Is it a good design to have a hashmap where each source points to a list of subscribers?
For example,
CNN ⇒ (sub1, sub2, sub3, sub4, ...)
ESPN ⇒ (sub1, sub2, sub3, sub4, ...)
...
In terms of scalability, we can shard sources and users onto dedicated Redis nodes.
That is at least what I can come up with right now.
In addition, we can store user info (which sources a user subscribes to) in Redis as well and shard users across multiple nodes:
User1 ⇒ (source1, source2, source4, ...)
User2 ⇒ (source1, source2, source4, ...)
...
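Here is a minimal sketch of those two mappings as Redis sets, assuming redis-py and a Redis server reachable on localhost; the key naming convention is my own.

```python
# Sketch of the source->subscribers and user->sources mappings as Redis sets.
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Source side: cnn => {sub1, sub2, sub3, ...}
r.sadd("source:cnn:subscribers", "user1", "user2", "user3")

# User side: user1 => {source1, source2, ...}
r.sadd("user:user1:sources", "cnn", "espn")

# Fan-out lookup when CNN posts a new feed.
print(r.smembers("source:cnn:subscribers"))

# "What do I follow?" lookup when a user loads their home page.
print(r.smembers("user:user1:sources"))
```

I am using sets rather than lists so the same subscriber cannot appear twice under a source, and unsubscribing is a single SREM.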
For feeds and posts from a single source, I can have both a database table and a Redis data structure. Basically, my idea is to store everything in Redis and use the database as a backup. Is that a good design consideration in this case? (Maybe not everything: only active users and recent feeds in Redis.)
Database: I want to keep it as compact as possible, storing only one copy of each feed: feedID, sourceID, created_timestamp, data.
Redis: store feedID, source_id, and content, and find subscribers based on source_id.
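Here is a rough sketch of how a new feed could be written to Redis, assuming redis-py; the hash and sorted-set key names and the 10-feed cap are my own way of illustrating "only recent feeds in Redis".

```python
# Sketch: feed body as a hash, plus a per-source timeline trimmed to 10 entries.
import time
import redis

r = redis.Redis()

def store_feed(feed_id: str, source_id: str, content: str) -> None:
    # Feed body: feedID -> {source_id, content, created_ts}
    r.hset(f"feed:{feed_id}", mapping={
        "source_id": source_id,
        "content": content,
        "created_ts": int(time.time()),
    })
    # Per-source timeline as a sorted set scored by timestamp,
    # trimmed so only the 10 most recent feeds stay in memory.
    r.zadd(f"source:{source_id}:feeds", {feed_id: time.time()})
    r.zremrangebyrank(f"source:{source_id}:feeds", 0, -11)
```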
For the read/unread part, I don't have a clear idea how to design around these limitations. Every user has a join timestamp, and the server will push a feed (at most 10 feeds per source) if the user hasn't read it yet. What is a good design to tell whether a user has read a feed or not? My initial thought is to keep track of the read/unread status of every feed per user, but that table could grow linearly with the number of feeds. In Redis, I can design a similar structure:
userid, feedid, status
user1, 001, read
user1, 002, read
user1, 003, unread
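In Redis that could look like the sketch below, where I store only the "read" marks and treat anything not in the set as unread (key names are again my own convention).

```python
# Sketch of per-user read tracking with a Redis set, assuming redis-py.
import redis

r = redis.Redis()

def mark_read(user_id: str, feed_id: str) -> None:
    r.sadd(f"user:{user_id}:read", feed_id)

def unread_feeds(user_id: str, source_id: str) -> list:
    # Recent feeds of the source (at most 10, per the cap above)...
    recent = r.zrevrange(f"source:{source_id}:feeds", 0, 9)
    # ...minus the ones this user has already read.
    return [f for f in recent if not r.sismember(f"user:{user_id}:read", f)]
```

This still grows with what a user actually reads, but at least it avoids writing a row for every feed the user never opens.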
At this point, that is my initial idea of how to structure the data. Redis runs a "master-slave" setup and backs up to disk on an hourly basis.
Now I am going to think about how the subscribe/unsubscribe flow works. A user clicks the subscribe button on a media page, for instance CNN, and a web server receives a request saying "user X subscribes to source Y". At the application layer, we find the machine that holds user X's data; this can be achieved by installing a sharding map on every application server, which works like: user_id mod shard_count = machine_id.
Once the application has looked up the server that holds user X's data, the application server talks to that Redis node and updates the user structure with the new source_id. Unsubscribe works the same way (see the sketch below).
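A minimal sketch of that sharding map, assuming redis-py, a static node list known to every application server, and made-up host names:

```python
# Sketch of the application-layer shard map: user_id mod shard_count => node.
import redis

REDIS_NODES = [redis.Redis(host="redis-0"),
               redis.Redis(host="redis-1"),
               redis.Redis(host="redis-2")]

def shard_for_user(user_id: int) -> redis.Redis:
    # The node that owns this user's data.
    return REDIS_NODES[user_id % len(REDIS_NODES)]

def handle_subscribe(user_id: int, source_id: int) -> None:
    # Update the user structure on the user's shard; the source-side set from
    # the earlier sketch would be updated the same way on the source's shard.
    shard_for_user(user_id).sadd(f"user:{user_id}:sources", source_id)

def handle_unsubscribe(user_id: int, source_id: int) -> None:
    shard_for_user(user_id).srem(f"user:{user_id}:sources", source_id)
```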
For marking a specific feed read/unread for user X, the application looks up the Redis node and updates the structure there, and Redis asynchronously propagates the update to the database (here I embrace eventual consistency).
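Here is roughly what I mean by the asynchronous update path: the hot path only touches Redis and pushes an event onto a queue, and a separate worker drains the queue into the database later. The queue key and the worker are my own illustration, assuming redis-py.

```python
# Sketch of write-behind persistence: Redis first, database eventually.
import json
import redis

r = redis.Redis()

def persist_later(event: dict) -> None:
    r.rpush("db_write_queue", json.dumps(event))

def handle_mark_read(user_id: str, feed_id: str) -> None:
    r.sadd(f"user:{user_id}:read", feed_id)          # Redis updated immediately
    persist_later({"op": "mark_read", "user_id": user_id, "feed_id": feed_id})

def db_writer_loop(db_conn) -> None:
    # Runs in a separate worker process; drains the queue into the database.
    while True:
        _, raw = r.blpop("db_write_queue")
        event = json.loads(raw)
        # db_conn.execute(...)  # translate the event into a SQL write (schema as above)
```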
Let's think about how to design the push/pull model. For push notifications, once there is a new feed, I can store the most recent feed in Redis and update only the active users (the reason is to avoid as many write operations on the database as possible).
For the pull model, a user's feed is only updated when they reload their home feed page, which also avoids a lot of disk seek time.
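Putting the two together, here is a sketch of the push path (fan out on write, but only to active users) and the pull path (merge on read), reusing the keys from the earlier sketches; the "active_users" set and the per-user inbox key are assumptions of mine.

```python
# Sketch of push (fan-out to active subscribers) and pull (merge on page load).
import time
import redis

r = redis.Redis()

def push_new_feed(source_id: str, feed_id: str) -> None:
    ts = time.time()
    for member in r.smembers(f"source:{source_id}:subscribers"):
        user_id = member.decode()
        # Only fan out to users active in the last 30 days; inactive users
        # fall back to the pull path when they come back.
        if r.sismember("active_users", user_id):
            r.zadd(f"user:{user_id}:inbox", {feed_id: ts})

def pull_home_feed(user_id: str, limit: int = 10) -> list:
    # On page reload, merge the recent feeds of every subscribed source.
    feeds = []
    for source in r.smembers(f"user:{user_id}:sources"):
        feeds += r.zrevrange(f"source:{source.decode()}:feeds", 0, limit - 1,
                             withscores=True)
    feeds.sort(key=lambda pair: pair[1], reverse=True)   # newest first
    return [feed_id for feed_id, _ in feeds[:limit]]
```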
Some points:
- Only keep active users in Redis (logged in within the last 30 days).
- If a user has been inactive for 6 months and recently logs back in to check their feeds, another service reconstructs their data from the database, loads it into Redis, and serves the user from there.
- Store recent feeds in Redis and push notifications only to the subscribers who are active at that time. This avoids disk seeks on the database.
- To make feeds sortable, design the timestamp into the feedID. For example, the leading bits of the feedID can hold the timestamp, and another 10 bits can embed the sourceID (see the sketch after this list). This makes feeds naturally sortable by ID.
- Application servers scale horizontally and sit behind a load balancer.
- Application servers connect to the Redis cluster; the database is for backup and for reconstructing data when needed (as in the inactive-user case).
- Redis uses a "master-slave" setup: write to the master, read from the slaves, and replicate data asynchronously. Data is backed up to disk on a regular schedule, and the database is also updated asynchronously.
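For the sortable feedID point above, here is a sketch of the bit packing. The exact layout (a millisecond timestamp in the high bits, 10 bits for the sourceID, and a small sequence field) is my own assumption in the spirit of Snowflake-style IDs, not a fixed spec.

```python
# Sketch of a time-sortable feed ID with the sourceID embedded in it.
import time

SOURCE_BITS = 10
SEQ_BITS = 12

def make_feed_id(source_id: int, sequence: int) -> int:
    ts_ms = int(time.time() * 1000)
    # The timestamp occupies the highest bits, so sorting IDs numerically
    # sorts feeds chronologically.
    return (ts_ms << (SOURCE_BITS + SEQ_BITS)) | \
           ((source_id & ((1 << SOURCE_BITS) - 1)) << SEQ_BITS) | \
           (sequence & ((1 << SEQ_BITS) - 1))

def source_of(feed_id: int) -> int:
    return (feed_id >> SEQ_BITS) & ((1 << SOURCE_BITS) - 1)

def timestamp_of(feed_id: int) -> int:
    return feed_id >> (SOURCE_BITS + SEQ_BITS)
```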
Questions:
- Is updating the database asynchronously from Redis as new events come in a feasible solution, or is the replication alone enough?
I know it's a long post, and I want to hear back from the community. Please correct me if I'm wrong, or point out anything else so we can discuss the approach further.