databasecassandracqlnon-relational-database

CONTAINS statement that accepts multiple data in Cassandra


I have a question on Cassandra as a newbie. I was wondering if it's possible somehow to make a query with a contains statement (i found out that an in statement might not be the best for lot's of data) that accepts a list of things.

Something like this:

SELECT * FROM table WHERE column CONTAINS('valueA', 'valueB');

Searching online i read that it shouldn't be possible but if i work with too many values I would have to write 60 CONTAINS and the query would be too heavy i guess.

So i was wondering if somehow that is possible in Cassandra or there is a workaround. Thanks!


Solution

  • SELECT * FROM table WHERE column CONTAINS('valueA', 'valueB');

    So while you did mention the IN statement, that is basically what you're asking it to do here. And that will certainly work. If I have a table named times with a single PRIMARY KEY of id, I can do this with the IN operator:

    SELECT * FROm times WHERE id IN (2,4);
    
     id | time_timestamp                  | time_uuid
    ----+---------------------------------+--------------------------------------
      2 | 2038-01-19 03:14:07.000000-0600 | 5d83408f-6944-11fe-7f7f-7f7f7f7f7f7f
      4 | 2038-01-18 21:14:07.000000-0600 | 12e8d08f-6912-11fe-7f7f-7f7f7f7f7f7f
    
    (2 rows)
    

    The CONTAINS operator is a little different. I'm pretty sure that it only accepts one parameter. But CONTAINS only works on collection columns; columns of type List, Set, or Map. So if I have a table playlists with a column tags LIST<TEXT> and an index on the tags column, then this works:

    SELECT * FROM playlists WHERE tags CONTAINS 'Progressive Rock';
    
     name         | song                | artist | tags
    --------------+---------------------+--------+------------------------------
     favorite 70s | La Villa Strangiato |   Rush | ['Rock', 'Progressive Rock']
     favorite 70s |              Xanadu |   Rush | ['Rock', 'Progressive Rock']
    
    
    (2 rows)
    

    As for the part about "not the best for lots of data," you're right. Neither approach really fits with Casandra. The first is what we call a "multi-key query," and the second is a "secondary index query." Cassandra works well in large row scenarios because it can figure out which nodes have the data by running a hash on the partition key.

    When Cassandra can't pinpoint the nodes which contain the data (as is the case with a both multi-key and secondary index queries). It picks a node to be something called a "coordinator node." That node is then responsible for contacting (in the case of the secondary index query) all of the other nodes in the cluster, assembling the result set, and returning it to the calling application. This introduces network time into the equation.

    If you're running a 6 node cluster, that may not be too bad. However, a 60 node cluster will not be as forgiving. Also, if the query pulls too many results back, it may use up all of the available heap memory on the coordinator node, causing it to crash.

    That's why running queries based on the PRIMARY KEY is the recommended way to go. Because results can be returned after contacting only 1 or 2 nodes, and not all nodes need to be polled. So stick with your PRIMARY KEY. If that doesn't work for your query, then create a new table with a different PRIMARY KEY as that will be A) faster and B) easier on the cluster.