apache-spark, pyspark, cassandra, etl, cassandra-driver

Which is more efficient: querying Cassandra through the driver library or through PySpark's Cassandra connector?


I am currently building an ETL pipeline in PySpark. In the transformation stage I need to validate records against data stored in a Cassandra table, but my approach is far too slow: it processes only 900 records in 30 minutes.

My approach was to write a function that uses the Session.execute() method, like this:

from cassandra.cluster import Cluster

def select_test_table():
    # Connects to the local node and runs a full table scan
    cluster = Cluster(['localhost'], port=9042)
    session = cluster.connect('test_keyspace')
    rows = session.execute('select * from test_keyspace.test_table')
    return rows

While researching, I found that I can do the same thing with Spark's own Cassandra connector:

spark.read.format("org.apache.spark.sql.cassandra").options(table="test_table", keyspace="test_keyspace").load().collect()
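For reference, the "org.apache.spark.sql.cassandra" format only works when the spark-cassandra-connector package is on the classpath and the connector knows where Cassandra is. A minimal sketch of a session configured for it; the connector version and contact point below are assumptions to adjust for your environment:

from pyspark.sql import SparkSession

# Connector coordinates and host are assumptions; match the version
# to your Spark/Scala build.
spark = (SparkSession.builder
         .appName("cassandra-etl")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
         .config("spark.cassandra.connection.host", "localhost")
         .getOrCreate())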

Solution

  • Your first query is just a plain full table scan through the driver. That is not going to perform well, and it is not something I would recommend. Instead, use Spark: it will break the query down into partition range queries, run them on multiple executors in parallel, and distribute the load across coordinators; see the sketch below.

    By the way, that full table scan may not even work; if the table is large enough, it will time out.
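To make that concrete, here is a minimal sketch of the validation done as a distributed join instead of collecting rows to the driver. The keyspace and table names come from the question; the input path, the join key "id", and the DataFrame names are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl-validation").getOrCreate()

# Load the Cassandra table through the connector; Spark splits the read
# into partition range queries that run in parallel on the executors.
cassandra_df = (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(table="test_table", keyspace="test_keyspace")
                .load())

# Hypothetical staged ETL data to validate (path and schema are assumptions).
etl_df = spark.read.parquet("/path/to/staged_data")

# Keep only records that have a matching row in Cassandra. The join runs
# on the executors, so nothing is pulled back to the driver.
validated_df = etl_df.join(cassandra_df, on="id", how="left_semi")

The key point is to avoid .collect() on the Cassandra DataFrame: collecting pulls the whole table into the driver and throws away the parallelism Spark gives you.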