My goal is to sum messages_sent and emails_sent for each distinct provider_id within a given time range (fromDate < stats_date_id < toDate), without specifying a provider_id. In other words, I need to find every provider that has rows in the specified time range and total its messages_sent and emails_sent.
I have a Cassandra table using an express-cassandra schema (in Node.js) as follows:
module.exports = {
    fields: {
        stats_provider_id: {
            type: 'uuid',
            default: {
                '$db_function': 'uuid()'
            }
        },
        stats_date_id: {
            type: 'timeuuid',
            default: {
                '$db_function': 'now()'
            }
        },
        provider_id: 'uuid',
        provider_name: 'text',
        messages_sent: 'int',
        emails_sent: 'int'
    },
    key: [
        [
            'stats_date_id'
        ],
        'created_at'
    ],
    table_name: 'stats_provider',
    options: {
        timestamps: {
            createdAt: 'created_at', // defaults to createdAt
            updatedAt: 'updated_at' // defaults to updatedAt
        }
    }
}
To get it working, I was hoping it'd be as simple as doing the following:
let query = {
    stats_date_id: {
        '$gt': db.models.minTimeuuid(fromDate),
        '$lt': db.models.maxTimeuuid(toDate)
    }
};
let selectQueries = [
    'provider_name',
    'provider_id',
    'count(direct_sent) as direct_sent',
    'count(messages_sent) as messages_sent',
    'count(emails_sent) as emails_sent',
];
// Query stats_provider table
let providerData = await db.models.instance.StatsProvider.findAsync(query, {select: selectQueries});
This, however, complains about needing to filter the results:
Error during find query on DB -> ResponseError: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.
I'm guessing you can't have a primary key and do date range searches on it? If so, what is the correct approach to this sort of query?
Although I haven't used Express-Cassandra, I can tell you that running a range query on your partition key is a hard "no." The reason is that Cassandra can't determine a single node for that query, so it has to poll every node. Because that's essentially a full scan of your table across multiple nodes, it throws that error to prevent you from running a bad query.
However, you can run a range query on a clustering key, provided that you are filtering on all of the keys prior to it. In your case, if I'm reading this right, your PRIMARY KEY looks like:
PRIMARY KEY (stats_date_id, created_at)
That primary key definition is going to be problematic for two reasons:
stats_date_id is a TimeUUID. This is great for data distribution. But it sucks for query flexibility. In fact, you will need to provide that exact TimeUUID value to return data for a specific partition. As a TimeUUID has millisecond precision, you'll need to know the exact time to query down to the millisecond. Maybe you have the ability to do that, but usually that doesn't make for a good partition key.
Any rows underneath that partition (created_at) will have to share that exact time, which usually leads to a lot of 1:1 cardinality ratios for partition:clustering keys.
My advice for fixing this is to partition on a date column that has a slightly lower level of cardinality. Think about how many provider messages are usually saved within a certain timeframe. Also pick something that won't store too many provider messages together, as you don't want unbounded partition growth (Cassandra has a hard limit of 2 billion cells per partition).
Maybe something like: PRIMARY KEY (week,created_at)
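As a rough sketch of what that could mean on the express-cassandra side: the schema would gain a `week` text field used as the partition key, and the application would compute the bucket value at write time. The `weekBucket` helper, the `week` field name, and the week-of-month rule (days 1-7 are w1, 8-14 are w2, and so on) are all assumptions made for illustration; pick whatever bucketing fits your write volume.

```javascript
// Sketch only: the relevant parts of a revised schema, assuming a new
// `week` text column is added and used as the partition key.
const statsProviderSchema = {
    fields: {
        week: 'text', // hypothetical bucket column, e.g. '201909w1'
        provider_id: 'uuid',
        provider_name: 'text',
        messages_sent: 'int',
        emails_sent: 'int'
    },
    // Partition on the bucket, cluster on created_at so it can be range-queried.
    key: [['week'], 'created_at'],
    table_name: 'stats_provider',
    options: {
        timestamps: { createdAt: 'created_at', updatedAt: 'updated_at' }
    }
};

// Hypothetical helper: builds a bucket string such as '201909w1'
// (year, zero-padded month, week of month) from a Date, using UTC.
function weekBucket(date) {
    const year = date.getUTCFullYear();
    const month = String(date.getUTCMonth() + 1).padStart(2, '0');
    const week = Math.ceil(date.getUTCDate() / 7); // assumed rule: days 1-7 -> 1, 8-14 -> 2, ...
    return `${year}${month}w${week}`;
}

console.log(weekBucket(new Date(Date.UTC(2019, 8, 3)))); // prints "201909w1"
```

With this rule a month has at most five buckets, so a reader knows exactly which partitions to ask for when given a date range.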
So then your CQL queries could look something like:
SELECT * FROM stats_provider
 WHERE week='201909w1'
   AND created_at > '2019-09-01'
   AND created_at < '2019-09-05';
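To get back to the original goal (totals per provider), one option is to sum the returned rows in the application, since provider_id is not part of the primary key here and so can't be grouped on in CQL. The sketch below assumes the rows have the column names from the question's schema; `sumByProvider` is a hypothetical helper, and the commented-out query assumes a `week` partition column exists.

```javascript
// Hypothetical helper: totals messages_sent and emails_sent per provider_id.
// `rows` is an array of plain objects shaped like the stats_provider columns.
function sumByProvider(rows) {
    const totals = new Map();
    for (const row of rows) {
        const key = String(row.provider_id); // uuid objects are compared by their string form
        const entry = totals.get(key) || {
            provider_id: key,
            provider_name: row.provider_name,
            messages_sent: 0,
            emails_sent: 0
        };
        entry.messages_sent += row.messages_sent || 0; // treat null/undefined as 0
        entry.emails_sent += row.emails_sent || 0;
        totals.set(key, entry);
    }
    return [...totals.values()];
}

// Possible usage with express-cassandra (not run here; assumes the `week` column):
// const rows = await db.models.instance.StatsProvider.findAsync({
//     week: '201909w1',
//     created_at: { '$gt': fromDate, '$lt': toDate }
// });
// const totals = sumByProvider(rows);

// Stand-in rows to show the shape of the result.
const sampleRows = [
    { provider_id: 'a', provider_name: 'Alpha', messages_sent: 2, emails_sent: 1 },
    { provider_id: 'b', provider_name: 'Beta', messages_sent: 4, emails_sent: 0 },
    { provider_id: 'a', provider_name: 'Alpha', messages_sent: 3, emails_sent: 5 }
];
console.log(sumByProvider(sampleRows));
```

If the time range spans more than one partition, the rows from each partition's query can be concatenated before summing.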
TL;DR;