apache-spark cassandra spark-cassandra-connector replication-factor

What is the impact of replication factor using RepartitionByCassandraReplica?


I have 16 nodes at my disposal and I am using Spark, Cassandra, and the Spark-Cassandra Connector (SCC). I want to measure how long this cluster takes to run a specific statistical test on a specific dataset. In one of my scenarios, I kept the number of Spark nodes fixed at 16 and added nodes to the Cassandra ring one at a time. Every Cassandra node that is added already has a Spark installation, and I use RepartitionByCassandraReplica (RBCR) to make sure that data locality is achieved. Apart from the number of Cassandra nodes, the only thing I changed was the replication factor.

The times were as follows:

Spark nodes - Cassandra nodes | Replication factor | Time
16 - 1                        | 1                  | 1.883 min
16 - 2                        | 1                  | 2.333 min
16 - 3                        | 3                  | 0.933 min
16 - 4                        | 3                  | 0.9 min
...

My question is why the second case, with 2 Cassandra nodes, takes more time than the first case, with 1 node. I expected that more Cassandra nodes would allow more simultaneous reads. Does the replication factor play a role? If so, how?

I am using RBCR, which means that when I fetch data from Cassandra, the SCC requests each piece of data from the node where it is actually stored. Therefore, I cannot see how the replication factor would affect this.
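To make the replication-factor part of the question concrete, here is a toy Python model of which nodes hold a replica of each partition key. It is only a sketch under stated assumptions: one token per node (no vnodes), SimpleStrategy-style placement on the next nodes clockwise, and an MD5-based hash standing in for the real partitioner. The node names and helper functions are hypothetical, and the model says nothing about timing; it only shows how many hosts could serve a given key locally.

```python
import hashlib
from bisect import bisect_left


def _token(value):
    """Toy token: MD5 of the string, folded onto a 0..2**32 ring (assumption)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % 2**32


def build_ring(nodes):
    """Give every node a single token and sort the ring by token."""
    return sorted((_token(name), name) for name in nodes)


def replicas_for(key, ring, rf):
    """Return the nodes holding `key`: the token owner plus the next rf-1 nodes."""
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect_left(tokens, _token(key)) % len(ring)
    count = min(rf, len(ring))  # cannot have more replicas than nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def replica_placement(num_nodes, rf, keys):
    """Map each key to its replica nodes for a ring of hypothetical nodes."""
    ring = build_ring([f"cassandra-{i}" for i in range(num_nodes)])
    return {key: replicas_for(key, ring, rf) for key in keys}


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(1000)]
    # The (Cassandra nodes, replication factor) pairs from the table above.
    for num_nodes, rf in [(1, 1), (2, 1), (3, 3), (4, 3)]:
        placement = replica_placement(num_nodes, rf, keys)
        per_key = sorted({len(nodes) for nodes in placement.values()})
        print(f"{num_nodes} nodes, RF={rf}: replicas per key = {per_key}")
```

In this model, a replication factor of 1 leaves exactly one node able to serve each key, while a replication factor of 3 on 3 nodes puts every key on every node.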

EDIT

I think that with a replication factor of 2 for the 16 - 2 case I would get a lower time, something like 1.5 min, but I cannot test that right now.


Solution

  • It appears to me that your testing is flawed. You need a one-to-one mapping between Spark workers/executors and Cassandra nodes.

    As you already know, you can only achieve data locality when BOTH the Spark JVM and the Cassandra JVM are co-located in the same operating system instance (OSI). In your environment, there is no guarantee that a scheduled worker/executor is on the same OSI as a Cassandra node. Cheers!
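The co-location point can be illustrated with a small Python sketch that compares two host lists and counts how many Spark workers share a host with a Cassandra node. The host names and the `locality_report` helper are hypothetical, and the script does not query Spark or Cassandra; it assumes you have already collected both host lists yourself.

```python
def colocated_workers(spark_hosts, cassandra_hosts):
    """Return the Spark worker hosts that also run a Cassandra node."""
    cassandra = set(cassandra_hosts)
    return [host for host in spark_hosts if host in cassandra]


def locality_report(spark_hosts, cassandra_hosts):
    """Summarise how many workers can and cannot be co-located with Cassandra."""
    local = colocated_workers(spark_hosts, cassandra_hosts)
    spark = set(spark_hosts)
    return {
        "spark_workers": len(spark_hosts),
        "colocated": len(local),
        "not_colocated": len(spark_hosts) - len(local),
        # Cassandra nodes with no Spark worker on the same host.
        "cassandra_without_worker": sorted(set(cassandra_hosts) - spark),
    }


if __name__ == "__main__":
    # Hypothetical host names for the "16 - 2" scenario from the question.
    spark_hosts = [f"node-{i:02d}" for i in range(1, 17)]
    cassandra_hosts = ["node-01", "node-02"]
    print(locality_report(spark_hosts, cassandra_hosts))
```

For the 16 - 2 scenario, this reports 2 co-located workers and 14 that are not, so any task scheduled on one of those 14 has to read over the network.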