What is the optimal way to model one-to-many relationships in Cassandra?

CREATE TABLE user (
   userId int,
   name varchar,
   userDetail1,
   userDetail2,
   ...,
   PRIMARY KEY(userId)
);

CREATE TABLE post (
   postId int,
   postDetail1,
   postDetail2,
   ...,
   userId int,
   PRIMARY KEY(postId)
);

CREATE TABLE user (
   userId int,
   name varchar,
   userDetail1,
   userDetail2,
   ...,
   PRIMARY KEY(userId)
);

CREATE TABLE post (
   postId int,
   postDetail1,
   postDetail2,
   ...,
   userId int,
   PRIMARY KEY(postId)
);

CREATE TABLE user_to_post (
   userId int,
   postId int,
   userDetail1,
   userDetail2,
   ...,
   postDetail1,
   postDetail2,
   ...,
   PRIMARY KEY(userId, postId)
);

It depends heavily on the requests you are trying to support. If I understand correctly, you want to be able to:

Get a specific user by its ID
Get the list of posts for a user

I will base most of my advice on the excellent page Basic Rules of Cassandra Data Modeling from DataStax. First, you have to understand that there is no definitive answer to this question. It depends heavily on the queries you are trying to run and on the tradeoffs you are ready to make. For example: do you expect the number of posts for a specific user to be really high (thousands, or millions)? What is the most frequent query (i.e. the one to model the data around)?

The first model seems to break rule 2: minimize the number of partitions read. The partition key of the post table is the post ID (which I will assume to be random, such as a UUID), so posts are spread across the cluster. Consequently, even supposing that you already have the list of post IDs for a specific user (obtaining it actually requires a very inefficient cluster scan), your request will have to hit every server in the cluster if the number of posts per user is sufficiently large. This is the worst case, and definitely not something you want.
The second model is inherently better, because every query can be answered with a single request. You are trading storage for read performance, which is usually a very good thing to do. I would also suggest looking at Materialized Views (Cassandra 3.0+), which help a lot by maintaining such a table for you, although doing exactly what you propose with MVs is complicated because a view can have only one source table (i.e. the posts).
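To make the problem concrete, here is a sketch of what listing a user's posts looks like against the first model. The user ID 42 is a made-up value for illustration.

```cql
-- First model: post is partitioned by postId, so userId is a regular column.
-- Cassandra rejects a plain "WHERE userId = 42" here; the query only runs
-- with ALLOW FILTERING (or a secondary index on userId).
SELECT * FROM post WHERE userId = 42 ALLOW FILTERING;
```

With ALLOW FILTERING, Cassandra has to read rows from partitions all over the cluster and discard the ones that do not match, which is the cluster scan mentioned above.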
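As a rough sketch of the Materialized View idea, the view below re-keys the post table from the first model by user. The view name post_by_user is an assumption chosen for illustration, and the view only carries post columns, not the user details that user_to_post duplicates.

```cql
-- Assumes the first model's post table (PRIMARY KEY(postId)) already exists.
-- post_by_user is a hypothetical view name.
CREATE MATERIALIZED VIEW post_by_user AS
   SELECT * FROM post
   WHERE userId IS NOT NULL AND postId IS NOT NULL
   PRIMARY KEY (userId, postId);

-- Listing the posts of user 42 (a made-up ID) is then a single-partition read.
SELECT * FROM post_by_user WHERE userId = 42;
```

Cassandra keeps the view in sync when post is written to, so the application does not have to write to two tables. It is worth checking how well Materialized Views are supported in the Cassandra version you run before relying on them.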

I can also suggest an alternative model that fixes the design flaw of the first proposal without the data duplication (which, again, is not a problem in itself). The key is to use the user ID as the partition key of the post table and the post ID as the clustering key. All the posts of a specific user are then stored in the same partition, and therefore on the same node, which gives good performance when requesting the posts of a specific user.

CREATE TABLE user (
   userId int,
   name varchar,
   userDetail1,
   userDetail2,
   ...,
   PRIMARY KEY(userId)
);

CREATE TABLE post (
   userId int,
   postId int,
   postDetail1,
   postDetail2,
   PRIMARY KEY(userId, postId)
);
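With this schema, listing the posts of a user becomes a single-partition read. A minimal sketch, where the user ID 42 is a made-up value:

```cql
-- All posts of user 42, returned in clustering order (ascending postId).
SELECT * FROM post WHERE userId = 42;

-- The 10 posts with the highest postId for the same user.
SELECT * FROM post WHERE userId = 42 ORDER BY postId DESC LIMIT 10;
```

Because postId is the clustering key, rows inside the partition are stored sorted by postId, so the ordered query needs no extra work on the server side.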

The main drawback of this solution is that it slightly complicates retrieving a single post: you have to know the user ID in addition to the post ID. This may not be a problem, as the two are inherently linked.
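In query terms, a single-post lookup against this model looks like the sketch below. Both IDs are made-up values.

```cql
-- Both parts of the primary key are given: one partition, one row.
SELECT * FROM post WHERE userId = 42 AND postId = 1001;

-- Giving only the clustering key is rejected unless ALLOW FILTERING is
-- added, which brings back the cluster scan:
-- SELECT * FROM post WHERE postId = 1001;
```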

Once again, remember that except for very simple cases, an optimal way of doing anything in computer science is very unlikely to exist. It depends on the set of metrics you are trying to maximize, the tradeoffs you are ready to make, and, more importantly for storage systems, the workload you will be running.
