database cassandra cql cassandra-3.0 cqlengine

Cassandra CQLEngine Allow Filtering

I'm using the Python Cassandra cqlengine extension. I created a many-to-many table, but I get an error when I filter a query on the user_applications model. I have read several resources about this problem, but I still don't fully understand it.

Sources: https://ohioedge.com/2017/07/05/cassandra-primary-key-partitioning-key-clustering-key-a-simple-explanation/

Cassandra Allow filtering

Is ALLOW FILTERING in Cassandra for following query efficient?

Database Model:

class UserApplications(BaseModel):
    __table_name__ = "user_applications"

    user_id = columns.UUID(required=True, primary_key=True, index=True)
    application_id = columns.UUID(required=True, primary_key=True, index=True)
    membership_id = columns.UUID(required=True, primary_key=True, index=True)

Error Message:

Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

Python CQLEngine Code:

q = UserApplications.filter(membership_id=r.membership_id,
                            user_id=r.user_id,
                            application_id=r.application_id)

CQL statement generated by CQLEngine:

SELECT "id", "status", "created_date", "update_date" FROM db.user_applications WHERE "membership_id" = %(0)s AND "user_id" = %(1)s AND "application_id" = %(2)s LIMIT 10000

Describe Table Result:

CREATE TABLE db.user_applications (
    id uuid,
    user_id uuid,
    application_id uuid,
    membership_id uuid,
    created_date timestamp,
    status int,
    update_date timestamp,
    PRIMARY KEY (id, user_id, application_id, membership_id)
) WITH CLUSTERING ORDER BY (user_id ASC, application_id ASC, membership_id ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';
CREATE INDEX user_applications_membership_id_idx ON db.user_applications (membership_id);

Thanks in advance for your help.

Solution

You get this error because your query does not include ALLOW FILTERING. If you append ALLOW FILTERING to the end of the query, it should work. With cqlengine you do not edit the CQL yourself; the equivalent is to call .allow_filtering() on the query set.

ALLOW FILTERING lets Cassandra discard rows after it has loaded them, possibly after loading every row in the table. Your query does not restrict the partition key (id), so Cassandra cannot go straight to the partition that holds the data. It has to retrieve rows from the user_applications table, potentially all of them, and then filter out the ones that do not have the requested value for each of the columns you are restricting.

ALLOW FILTERING can have unpredictable performance, and the actual cost depends on how the data is distributed in your table. If, for example, your table contains 1 million rows and 95% of them have the requested value for the columns you specify, the query is still relatively efficient and using ALLOW FILTERING is reasonable. On the other hand, if your table contains 1 million rows and only 2 of them contain the requested values, the query is extremely inefficient: Cassandra loads 999,998 rows for nothing. In general, if your queries require ALLOW FILTERING, you should probably rethink your schema or add secondary indexes on the columns you query often.
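To make the two scenarios concrete, here is a toy in-memory sketch in Python. It is only an illustration of the arithmetic, not of how Cassandra actually reads data: the rows are plain dictionaries, the table is scaled down to 10,000 rows, and the names build_rows and scan_with_filter are made up for this example.

```python
import uuid

WANTED = uuid.UUID(int=0)  # the membership_id value we are searching for


def build_rows(total, matching):
    """Build `total` rows; the first `matching` of them carry WANTED."""
    rows = []
    for i in range(total):
        # Non-matching rows get a distinct, deterministic UUID (int >= 1).
        membership_id = WANTED if i < matching else uuid.UUID(int=i + 1)
        rows.append({"id": uuid.UUID(int=total + i + 1), "membership_id": membership_id})
    return rows


def scan_with_filter(rows, wanted):
    """Examine every row and keep the matches, as a full filtering scan would."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row["membership_id"] == wanted:
            matches.append(row)
    return examined, matches


TOTAL = 10_000  # scaled down from the 1 million rows in the text

for matching in (9_500, 2):  # 95% of rows match vs. only 2 rows match
    examined, matches = scan_with_filter(build_rows(TOTAL, matching), WANTED)
    wasted = examined - len(matches)
    print(f"{len(matches)} matches: {examined} rows examined, {wasted} discarded")
```

The scan examines all 10,000 rows in both cases. What changes is how much of that work is useful: 500 rows are discarded when 95% match, and 9,998 when only 2 match.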

In your case I suggest making membership_id, user_id, and application_id a composite partition key. You will then no longer need to filter out any rows after loading, because all rows with the same values for the three columns reside in the same partition (on the same physical node). You must provide all three values in the query, which you are already doing in the query from your question. Here is how to define it:

CREATE TABLE db.user_applications (
    user_id uuid,
    application_id uuid,
    membership_id uuid,
    created_date timestamp,
    status int,
    update_date timestamp,
    PRIMARY KEY ((user_id, application_id, membership_id))
);