I have been tasked to come up with a schema to store our application logs using Cassandra. I am quite new to Cassandra but from what I have read and learned so far, it could be the best approach for our use case.
Our application send thousands of SMS each day (provided by 3 local service providers) and we would love to keep a log each time an SMS is sent (for reconciliation purposes at each month's end among other things). We intend to store the information below:
id text, // uuid
phone_number text, // recipient of the SMS
message text, // Message sent
status boolean, // if the SMS was sent or not
response text, // Request response
service_provider text, // e.g Twilio, Telnyx, Venmo etc
date timestamp, // Time SMS is sent
We would like to query the following reports at any one time:
I have come up with the following tables (3) but I feel like I am over engineering or over thinking it? Perhaps it could be done simpler? I would appreciate any advice in getting this to work efficiently.
create table sms_logs_by_id
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (id, date)
) with clustering order by (date DESC);
create table sms_logs_by_service_provider
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (service_provider, date)
) with clustering order by (date DESC);
create table sms_logs_by_phone_number
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (phone_number, date)
) with clustering order by (date DESC);
create table sms_logs_by_status
(
id text,
phone_number text,
message text,
status boolean,
response text,
provider text,
service_provider text,
date timestamp,
primary key (status, date)
) with clustering order by (date DESC);
Queries run pretty well so far. I am not sure if this is the most optimum way of modelling the data. I would appreciate any advice on how I can improve this data model. Thank you!
The only potential issue I see is that for the last 3 tables (logs by status, phone number and provider), is that the partitions will get larger over time. It's important to remember that Cassandra has a mathematical limit of 2 billion cells per partition (where a "cell" == a column value or key). But you want to model you data so that you don't get anywhere near that limit, because your table will start getting slow long before that.
For these three, I'd recommend a "bucketing" approach:
sms_logs_by_service_provider
...
primary key (service_provider, date)
For this one, my other concern is the fact that you're tracking 3 service providers. So in addition to the partitions growing with each message, there's only 3 partitions. So data isn't distributing very well. With thousands of messages sent per day, I'm thinking that your "bucket" is going to need to be fairly precise...probably by "day." Maybe you could get away with a "week_bucket," but I'll use day for this example:
id text,
provider text,
service_provider text,
day_bucket. int,
date timestamp,
PRIMARY KEY ((service_provider, day_bucket), date, id)
This way, you're creating a partition for each combination of service_provider
and day
. That will give you plenty of data distribution, plus your partitions won't grow beyond the activity which happens in a single day. I've kept date
as a DESC
clustering key (good idea) but added id
as a "tie-breaker," just in case two messages have the exact same timestamp.
create table sms_logs_by_phone_number (
...
primary key (phone_number, date)
So for this one, I'd take a similar approach. But as we're talking about individual users, we can use a much larger bucket. Based on a quick Google search, the average person sends 85 text messages per day, which comes out to 31,025 per year. That's probably ok to store by year.
id text,
phone_number text,
year_bucket int,
date timestamp,
PRIMARY KEY ((phone_number, year_bucket), date, id)
Partitioning by phone_number
already gives you some good distribution.
Adding year_bucket
in there will ensure that the partition won't have unbound growth. Also, id
for a tie-breaker.
create table sms_logs_by_status(
...
primary key (status, date)
Logs by status is going to have a similar problem to the "provider" table, in that you probably only have a few statuses, so the data distribution will be limited. For this one, you're probably also going to want to use a bucket that's small, like by day.
id text,
status text,
day_bucket. int,
date timestamp,
PRIMARY KEY ((status, day_bucket), date, id)
The unfortunate part, is these changes this complicate your query patterns. But they're necessary to save you from problems down the road.