I have 16 nodes at my disposal and I am using Spark, Cassandra and the Spark-Cassandra Connector (SCC). I want to evaluate the performance of this cluster in terms of execution time when a specific statistical test is run on some specific data. So, in one of my scenarios I kept the number of Spark nodes at 16 and started adding nodes to the Cassandra ring. Every Cassandra node that is added already has a Spark installation, and with RepartitionByCassandraReplica (RBCR) I make sure that data locality is achieved. The only thing I changed was the replication factor.
The times were as follows:
| Spark nodes - Cassandra nodes | Replication factor | Time |
|---|---|---|
| 16 - 1 | 1 | 1.883 min |
| 16 - 2 | 1 | 2.333 min |
| 16 - 3 | 3 | 0.933 min |
| 16 - 4 | 3 | 0.9 min |
| ... | ... | ... |
My question is: why does the 2nd case, with 2 Cassandra nodes, take more time than the 1st case with only 1 node? I thought that the more Cassandra nodes there are, the more simultaneous reads are possible. So does the replication factor play a role? If so, how?
I am using RBCR, which means that when I fetch data from Cassandra, the SCC requests each piece of data from the node on which it is actually stored. Therefore, I cannot see how the replication factor affects that.
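Roughly, my read path looks like the sketch below (the keyspace, table and key names are placeholders, not my real schema, and `spark` is the existing SparkSession):

```scala
import com.datastax.spark.connector._

// Placeholder key class matching the table's partition key.
case class Key(id: Int)

val keysRdd = spark.sparkContext.parallelize((1 to 1000).map(Key(_)))

// Repartition the keys so each Spark partition is scheduled on an executor
// co-located with a Cassandra replica that owns those keys ...
val localKeys = keysRdd.repartitionByCassandraReplica("my_keyspace", "my_table")

// ... so that the join below can be served by the local Cassandra node.
val data = localKeys.joinWithCassandraTable("my_keyspace", "my_table")
```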
EDIT
I think that if I had replication factor 2 for the 16 - 2 case, I would get a lower time, something like 1.5 min, but that is something I cannot test right now.
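Changing the replication factor itself would just be something like the sketch below (the keyspace name is a placeholder), followed by a `nodetool repair` so the new replicas actually receive the data:

```scala
import com.datastax.spark.connector.cql.CassandraConnector

// "my_keyspace" is a placeholder; run `nodetool repair` afterwards so the
// additional replicas are actually populated.
CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  session.execute(
    "ALTER KEYSPACE my_keyspace WITH replication = " +
      "{'class': 'SimpleStrategy', 'replication_factor': 2}")
}
```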
It appears to me that your testing is flawed. You need a one-to-one mapping between Spark workers/executors and Cassandra nodes.
As you already know, you can only achieve data locality when BOTH the Spark JVM and the Cassandra JVM are co-located in the same operating system instance (OSI). In your environment, there is no guarantee that the scheduled workers/executors are on the same OSIs as the Cassandra nodes that own the data. Cheers!
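A quick way to sanity-check this is to compare the replica hosts each Spark partition prefers with the hosts your executors are actually registered on. A rough sketch, assuming `localKeys` is the RDD you get back from repartitionByCassandraReplica and `spark` is your SparkSession:

```scala
// Preferred Cassandra replica host(s) for each Spark partition:
localKeys.partitions.foreach { p =>
  println(s"partition ${p.index} -> ${localKeys.preferredLocations(p)}")
}

// Hosts on which executors are actually registered (keys are "host:port"):
spark.sparkContext.getExecutorMemoryStatus.keys.foreach(println)
```

If the hosts from the first list do not show up in the second, those tasks cannot run node-local and the connector falls back to remote reads.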