pyspark, cassandra, spark-cassandra-connector

PySpark Cassandra connector generates tombstones when writing


I understand that inserting data can create tombstones when the DataFrame has null values in some of its columns. To minimize tombstones, the insert queries should leave out the columns whose values are null.

I'm currently using the spark-cassandra-connector from PySpark in a Jupyter notebook, and I've come across the "com.datastax.spark.connector.types.CassandraOption" trait for Scala. How can I use this trait, or any other method, to address the tombstone problem?


Solution

  • WriteConf has a parameter, ignoreNulls, which you can set to true so that null values are treated as unset and are not inserted when writing to Cassandra.

    You can also enable it through the SparkConf object by setting spark.cassandra.output.ignoreNulls to true.

    For details, see the Globally treating all nulls as Unset section and the Configuration Reference in the docs. Cheers!
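For a PySpark notebook, the SparkConf route above might look roughly like the sketch below. Only spark.cassandra.output.ignoreNulls comes from the answer. The host address, the app name, and the helper names build_session and write_to_cassandra are assumptions made for illustration, and the sketch also assumes the connector package is already on the Spark classpath.

```python
# Connector settings applied when the session is built.
CASSANDRA_CONF = {
    # Assumption: a Cassandra node reachable on localhost.
    "spark.cassandra.connection.host": "127.0.0.1",
    # The setting named in the answer: treat nulls as unset on write.
    "spark.cassandra.output.ignoreNulls": "true",
}


def build_session(app_name="ignore-nulls-demo", conf=CASSANDRA_CONF):
    """Hypothetical helper: create (or reuse) a SparkSession with the settings above."""
    # Imported lazily so the settings can be inspected without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def write_to_cassandra(df, keyspace, table):
    """Hypothetical helper: append a DataFrame to an existing Cassandra table."""
    (
        df.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .mode("append")
        .save()
    )
```

Calling build_session() once at the top of the notebook and then write_to_cassandra(df, "my_keyspace", "my_table") (both names are placeholders) should write the rows with null columns left unset. If the notebook already has a running session, the setting may need to be in place before that session is created, so restarting the kernel is the safer path.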