So, I have a 16 node cluster where every node has Spark and Cassandra installed with a replication factor of 3 and spark.sql.shuffle.partitions of 96. I am using the Spark-Cassandra Connector 3.0.0 for doing a repartitionByCassandraReplica.JoinWithCassandraTable
and then some SparkML analysis takes place. My question is what happens eventually with the spark partitions?
1st scenario
My PartitionsPerHost
parameter of repartitionByCassandraReplica
is numberofSelectedCassandraPartitionkeys which means if I choose 4 partition keys I get 4 partitions per Host. This gives me 64 spark partitions because I have 16 hosts.
2nd scenario
But, according to the Spark Cassandra connector documentation, information from system.size_estimates
table should be used in order to calculate the spark partitions. For example from my system.size_estimates
:
estimated_table_size = mean_partition_size x number_of_partitions
= (24416287.87/1000000) MB x 332
= 8106.2 MB
spark_partitions = estimated_table_size / input.split.size_in_mb
= 8106.2 MB / 64 MB
= 126.6593 partitions
so, when does the 1st scenario takes place and when the second? Am I calculating something wrong? Is there specific cases where the 1st scenario happens and other cases the 2nd?
Those are two completely different paths by which the number of Spark partitions are calculated.
If you're calling repartitionByCassandraReplica()
, the number of Spark partitions are determined by both partitionsPerHost
and the number of Cassandra nodes in the local DC.
Otherwise, the connector will use input.split.size_in_mb
to determine the number of Spark partitions based on the estimated table size. Cheers!