Recommended Cassandra Schema to store application logs
Asked Answered
N

1

2

I have been tasked to come up with a schema to store our application logs using Cassandra. I am quite new to Cassandra but from what I have read and learned so far, it could be the best approach for our use case.

Our application send thousands of SMS each day (provided by 3 local service providers) and we would love to keep a log each time an SMS is sent (for reconciliation purposes at each month's end among other things). We intend to store the information below:

id                  text,      // uuid
phone_number        text,      // recipient of the SMS
message             text,      // Message sent
status              boolean,   // if the SMS was sent or not
response            text,      // Request response
service_provider    text,      // e.g Twilio, Telnyx, Venmo etc
date                timestamp, // Time SMS is sent

We would like to query the following reports at any one time:

  1. Total number of SMS sent
  2. Total SMS sent for a given period of time (between 2 dates)
  3. Total SMS sent by a specific service provider (also within a given time period)
  4. Total SMS sent to a specific recipient phone number (also within a given time period)
  5. Total failed or successful SMS sent (also within a given period of time)

I have come up with the following tables (3) but I feel like I am over engineering or over thinking it? Perhaps it could be done simpler? I would appreciate any advice in getting this to work efficiently.

create table sms_logs_by_id
(
    id                  text,
    phone_number        text,
    message             text,
    status              boolean,
    response            text,
    provider            text,
    service_provider    text,
    date                timestamp,
    primary key (id, date)
) with clustering order by (date DESC);

create table sms_logs_by_service_provider
(
    id                  text,
    phone_number        text,
    message             text,
    status              boolean,
    response            text,
    provider            text,
    service_provider    text,
    date                timestamp,
    primary key (service_provider, date)
) with clustering order by (date DESC);

create table sms_logs_by_phone_number
(
    id                  text,
    phone_number        text,
    message             text,
    status              boolean,
    response            text,
    provider            text,
    service_provider    text,
    date                timestamp,
    primary key (phone_number, date)
) with clustering order by (date DESC);

create table sms_logs_by_status
(
    id                  text,
    phone_number        text,
    message             text,
    status              boolean,
    response            text,
    provider            text,
    service_provider    text,
    date                timestamp,
    primary key (status, date)
) with clustering order by (date DESC);

Queries run pretty well so far. I am not sure if this is the most optimum way of modelling the data. I would appreciate any advice on how I can improve this data model. Thank you!

Norikonorina answered 4/1, 2023 at 19:24 Comment(2)
what are your key service level objectives - throughput and latency requirements here for this use case? Also, I'd encourage your to practice data modeling via this & this courses as you had mentioned you're new to using C*.Dorey
Thanks @Madhavan, taking a look at these now. Key objective is latency. Being able to fetch thousands of records quickly (< 5s).Norikonorina
S
1

The only potential issue I see is that for the last 3 tables (logs by status, phone number and provider), is that the partitions will get larger over time. It's important to remember that Cassandra has a mathematical limit of 2 billion cells per partition (where a "cell" == a column value or key). But you want to model you data so that you don't get anywhere near that limit, because your table will start getting slow long before that.

For these three, I'd recommend a "bucketing" approach:

sms_logs_by_service_provider
...
primary key (service_provider, date)

For this one, my other concern is the fact that you're tracking 3 service providers. So in addition to the partitions growing with each message, there's only 3 partitions. So data isn't distributing very well. With thousands of messages sent per day, I'm thinking that your "bucket" is going to need to be fairly precise...probably by "day." Maybe you could get away with a "week_bucket," but I'll use day for this example:

id                  text,
provider            text,
service_provider    text,
day_bucket.         int,
date                timestamp,
PRIMARY KEY ((service_provider, day_bucket), date, id)

This way, you're creating a partition for each combination of service_provider and day. That will give you plenty of data distribution, plus your partitions won't grow beyond the activity which happens in a single day. I've kept date as a DESC clustering key (good idea) but added id as a "tie-breaker," just in case two messages have the exact same timestamp.

create table sms_logs_by_phone_number (
...
primary key (phone_number, date)

So for this one, I'd take a similar approach. But as we're talking about individual users, we can use a much larger bucket. Based on a quick Google search, the average person sends 85 text messages per day, which comes out to 31,025 per year. That's probably ok to store by year.

id                  text,
phone_number        text,
year_bucket         int,
date                timestamp,
PRIMARY KEY ((phone_number, year_bucket), date, id)

Partitioning by phone_number already gives you some good distribution. Adding year_bucket in there will ensure that the partition won't have unbound growth. Also, id for a tie-breaker.

create table sms_logs_by_status(
...
primary key (status, date)

Logs by status is going to have a similar problem to the "provider" table, in that you probably only have a few statuses, so the data distribution will be limited. For this one, you're probably also going to want to use a bucket that's small, like by day.

id                  text,
status              text,
day_bucket.         int,
date                timestamp,
PRIMARY KEY ((status, day_bucket), date, id)

The unfortunate part, is these changes this complicate your query patterns. But they're necessary to save you from problems down the road.

Snell answered 20/1, 2023 at 14:40 Comment(1)
Hey @Aaron! Thanks so much for the guidance. I have adopted my model to follow the bucket principle. Makes sense from the way you explained it. Especially about data distribution. Thanks so much for taking the time to explain these concepts. I appreciate it :)Norikonorina

© 2022 - 2024 — McMap. All rights reserved.