What is the optimal way to model one-to-many relationships in Cassandra?

Say I want to design a system where users can create posts, where each post belongs to one user but a user may have multiple posts. Also assume I want to support finding all posts given a userID in addition to simply finding a post by postId. I also want to store user-specific account details like the date of account creation.

One way of modeling this would be as follows:

CREATE TABLE user (
   userId int,
   name varchar,
   userDetail1,
   userDetail2,
   ...,
   PRIMARY KEY(userId)
);

CREATE TABLE post (
   postId int,
   postDetail1,
   postDetail2,
   ...,
   userId int,
   PRIMARY KEY(postId)
);

From what I've read, this is supposedly not optimal as querying for posts made by a specific user becomes memory inefficient. Is this correct? And is the reason that Cassandra would not support indexing the post table on userId?

So would the ideal solution be as follows?

CREATE TABLE user (
   userId int,
   name varchar,
   userDetail1,
   userDetail2,
   ...,
   PRIMARY KEY(userId)
);

CREATE TABLE post (
   postId int,
   postDetail1,
   postDetail2,
   ...,
   userId int,
   PRIMARY KEY(postId)
);

CREATE TABLE user_to_post (
   userId int,
   postId int,
   userDetail1,
   userDetail2,
   ...,
   postDetail1,
   postDetail2,
   ...,
   PRIMARY KEY(userId, postId)
);

Using a composite key, querying for posts for a specific user is much more efficient. But with this design, would a separate table for posts be redundant? Again, in this design I want lookups for posts made by a specific user, and I would also like to get quickly from a post to the user who wrote it. I have done a lot of reading but am very confused about exactly how to design a one-to-many relationship in Cassandra.

Mallorie answered 10/7, 2017 at 20:26 Comment(0)

It depends heavily on the queries you need to support. If I understand correctly, you want to be able to:

  1. Get a specific user by ID
  2. Get the list of posts for a user

I will base most of my advice on the excellent DataStax page Basic Rules of Cassandra Data Modeling. You first have to understand that there is no definitive answer to this question. It depends heavily on the queries you need to run and on the tradeoffs you are willing to make. For example: do you expect the number of posts for a specific user to be very high (thousands, or millions)? What is the most frequent query (i.e. the one to model the data around)?

  • The first model breaks rule 2 of that page: minimize the number of partitions read. Because the partition key of the post table is the post ID (which I assume is effectively random, such as a UUID), posts are spread across the cluster. Consequently, even supposing you already have the list of post IDs for a specific user (and obtaining it requires a very inefficient cluster scan), fetching those posts may have to hit every server in the cluster if the number of posts per user is large enough. This is the worst case, and definitely not something you want.

  • The second model is inherently better, because every query can be answered by reading a single partition. You are trading storage for read performance, which is usually a very good trade. I would also suggest looking at Materialized Views (Cassandra 3.0+), which help a lot by maintaining such a table for you, although doing exactly what you propose with MVs is complicated because a view can have only one source table (here, the posts).
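
To make the two points above concrete, here is a rough sketch of what they look like in CQL. It assumes the post table from the first model with only its typed columns (postId, userId); the names post_by_user_idx and post_by_user and the value 42 are made up for illustration. Note that the view only carries post columns, not the user details from the proposed user_to_post table, and that some newer Cassandra releases may require materialized views to be enabled in the server configuration.

```cql
-- Model 1: post is keyed by postId only, so a lookup by userId cannot use
-- the primary key. Cassandra rejects this query unless ALLOW FILTERING is
-- appended, and with it the query scans partitions across the cluster.
SELECT * FROM post WHERE userId = 42 ALLOW FILTERING;

-- A secondary index on userId is supported and makes the query legal
-- without ALLOW FILTERING, but it typically still fans out to many nodes.
CREATE INDEX post_by_user_idx ON post (userId);
SELECT * FROM post WHERE userId = 42;

-- Materialized view (Cassandra 3.0+): a query table that Cassandra keeps
-- in sync with post. Every primary key column of the view must be
-- restricted with IS NOT NULL.
CREATE MATERIALIZED VIEW post_by_user AS
   SELECT * FROM post
   WHERE userId IS NOT NULL AND postId IS NOT NULL
   PRIMARY KEY (userId, postId);

-- Posts of one user, served from a single partition of the view.
SELECT * FROM post_by_user WHERE userId = 42;
```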

I can also suggest an alternative model, which fixes the design flaw of the first proposal without duplicating data (although, again, duplication is not a problem in itself). The key is to use the user ID as the partition key of the posts table and the post ID as the clustering key. All the posts of a given user are then stored in the same partition, and therefore on the same nodes, which gives good performance when requesting the posts of a specific user.

CREATE TABLE user (
   userId int,
   name varchar,
   userDetail1,
   userDetail2,
   ...,
   PRIMARY KEY(userId)
);

CREATE TABLE post (
   userId int,
   postId int,
   postDetail1,
   postDetail2,
   PRIMARY KEY(userId, postId)
);

The main drawback of this solution is that it slightly complicates retrieving a single post: you have to know the user ID in addition to the post ID. This may not be a problem, as the two are inherently linked.
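
As a rough illustration of that tradeoff, these are the queries the post table keyed by (userId, postId) supports. The literal values 42 and 1001 are placeholders.

```cql
-- All posts of one user: a single-partition read.
SELECT * FROM post WHERE userId = 42;

-- One specific post: both the partition key and the clustering key are given.
SELECT * FROM post WHERE userId = 42 AND postId = 1001;

-- By postId alone: the partition key is missing, so Cassandra rejects the
-- query unless ALLOW FILTERING is added, which turns it into a scan.
SELECT * FROM post WHERE postId = 1001 ALLOW FILTERING;
```

If lookups by post ID alone are frequent, that is a sign this model does not fit the workload and a separate table keyed by postId is worth its storage cost.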

Once again, remember that except for very simple cases, an optimal way of doing anything in computer science is very unlikely to exist. It depends on the set of metrics you are trying to maximize, the tradeoffs you are willing to make and, most importantly for storage systems, the workload you will be running.

Barbados answered 11/7, 2017 at 1:37 Comment(2)
Hydrazine: I would propose the following addition: add Year to the partition key of the POST table, so that it becomes ... PRIMARY KEY ((userId, Year), Time, PostID) WITH CLUSTERING ORDER BY (Time DESC, PostID). Why? This makes the partition bounded (limited by something). In this scenario it is bounded by Year, so partitions do not grow excessively large. It's just an example; something else could be used to "bound" the partition too, but generally you want partitions small and bounded.
Consideration: @Hydrazine Wouldn't Timeuuid with ((...), partId) as type for postId be preferred over ((...), time, postId)? As I understand it, your version wouldn't allow querying by postId without knowing the exact time of the post.
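
The two comments above could be sketched roughly as follows. The table names, the column names postYear and postTime, the column types, and the literal values are all assumptions for illustration. The second table also assumes that "partId" in the reply refers to postId.

```cql
-- Hydrazine's suggestion: bucket each user's posts by year so that a
-- partition holds at most one year of posts for one user.
CREATE TABLE post_by_user_year (
   userId int,
   postYear int,
   postTime timestamp,
   postId int,
   PRIMARY KEY ((userId, postYear), postTime, postId)
) WITH CLUSTERING ORDER BY (postTime DESC, postId ASC);

-- Latest posts of user 42 in 2017, newest first.
SELECT * FROM post_by_user_year
 WHERE userId = 42 AND postYear = 2017
 LIMIT 20;

-- Variant from the reply: a timeuuid postId embeds the creation time, so it
-- can serve as the only clustering column and still sort posts by time.
CREATE TABLE post_by_user_year_tu (
   userId int,
   postYear int,
   postId timeuuid,
   PRIMARY KEY ((userId, postYear), postId)
) WITH CLUSTERING ORDER BY (postId DESC);

-- A single post is found from userId, postYear and postId, with no separate
-- time value needed.
SELECT * FROM post_by_user_year_tu
 WHERE userId = 42 AND postYear = 2017
   AND postId = 50554d6e-29bb-11e5-b345-feff819cdc9f;
```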
