I'm using a Spark standalone cluster with 2 workers, streaming data from Kafka to Cassandra and HDFS:
val stream = KafkaUtils.createDirectStream...
stream.map(rec => (rec.offset, rec.value)).saveToCassandra(...)
stream.map(_.value).foreachRDD(rdd => { /* save each batch to HDFS */ })
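For context, here is a minimal self-contained sketch of the pipeline, assuming the spark-streaming-kafka-0-10 integration (where records expose .offset/.value, as in the snippet above) and the DataStax spark-cassandra-connector; the broker address, topic, keyspace/table names and HDFS path are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "startSpark")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Set("myTopic"), kafkaParams))

// Cassandra: tuples (or case classes) map onto the table's columns,
// so the target table is assumed to have (offset bigint, value text)
stream.map(rec => (rec.offset, rec.value))
  .saveToCassandra("my_keyspace", "my_table", SomeColumns("offset", "value"))

// HDFS: write each non-empty batch into its own directory
stream.map(_.value).foreachRDD { rdd =>
  if (!rdd.isEmpty) rdd.saveAsTextFile(s"hdfs:///data/batch-${System.currentTimeMillis}")
}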
I send approximately 40,000 messages per second to Kafka.
The first problem is that saveToCassandra is slow: if I comment out stream.saveToCassandra, everything runs fast.
In the Spark driver UI I can see that writing roughly 5 MB of output takes about 20 s.
I tried tuning the spark-cassandra-connector options, but the write still takes at least 14 s.
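For reference, the connector's write-path options look like this (the exact values below are illustrative, not a recommendation); they are set on the SparkConf before the context is created:

val conf = new org.apache.spark.SparkConf()
  .set("spark.cassandra.connection.host", "cassandra-host")
  .set("spark.cassandra.output.batch.size.bytes", "4096")        // bytes per unlogged batch
  .set("spark.cassandra.output.concurrent.writes", "8")          // parallel batches per task
  .set("spark.cassandra.output.batch.grouping.key", "partition") // how rows are grouped into batches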
The second problem is that, as I mentioned, one of my workers does nothing. In its log I only see lines like:
10:05:33 INFO remove RDD#
and so on. But if I stop the other worker, the idle one starts processing.
I don't use spark-submit. My whole application is just
object startSpark extends App {
with all the code inside, and I launch it with
scala -cp "spark libs:kafka:startSpark.jar" startSpark
To make the dependencies available to the workers, I call ssc.sparkContext.addJars(pathToNeedableJars).
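Roughly, the driver looks like the sketch below (the master URL, batch interval and jar list are placeholders, and pathToNeedableJars is assumed to be a Seq[String]; note that SparkContext.addJar takes one path at a time):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object startSpark extends App {
  val pathToNeedableJars: Seq[String] = Seq("startSpark.jar" /* , ... */)
  val conf = new SparkConf()
    .setAppName("startSpark")
    .setMaster("spark://master:7077")
  val ssc = new StreamingContext(conf, Seconds(1))
  pathToNeedableJars.foreach(ssc.sparkContext.addJar) // ship each dependency jar to the executors
  // ... the stream definition from above ...
  ssc.start()
  ssc.awaitTermination()
}

An alternative would be conf.setJars(pathToNeedableJars), which registers the jars before the context is even created.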
How can I speed up the writes to Cassandra, and how can I get both workers to share the work?
Update: I had read the official Spark Kafka integration guide carelessly. The problem is that my topic has only 1 partition, and the direct stream gives a 1:1 correspondence between Kafka partitions and Spark partitions, so every batch becomes a single task and only one worker ever gets work.
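Given that, one workaround without touching the topic is to repartition inside Spark, at the cost of a shuffle per batch; mapping to plain tuples first matters because ConsumerRecord is not serializable. The target of 4 partitions below is illustrative (roughly the total core count of the two workers):

val rebalanced = stream
  .map(rec => (rec.offset, rec.value)) // move to serializable tuples before the shuffle
  .repartition(4)                      // spread tasks over both workers
rebalanced.saveToCassandra("my_keyspace", "my_table", SomeColumns("offset", "value"))

The cleaner fix is to give the topic more partitions in Kafka itself, so the direct stream parallelizes across workers without any shuffle.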