apache-spark cassandra spark-cassandra-connector

Spark standalone application computes PCA, then hangs for 10-12 minutes before removing the RDD from memory


I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3, spark.sql.shuffle.partitions set to 96, and Spark-Cassandra-Connector 3.1.0. I do a Spark join (broadcastHashJoin) between a dataset and a Cassandra table and then run PCA from the Spark ML library. In between, I persist a dataset, and I unpersist it only after the PCA computations are finished. According to the Stages tab of the Spark UI, everything finishes in less than 10 minutes, and generally no executor is doing anything:

[Screenshot: Stages tab of the Spark UI]

However, the dataset remains persisted for another 10-12 minutes, as shown in the Storage tab of the Spark UI:

[Screenshot: Storage tab of the Spark UI]

These are the last lines of stderr from one of the nodes; note the gaps of roughly 10 minutes between the last lines:

22/09/15 11:41:09 INFO MemoryStore: Block taskresult_1436 stored as bytes in memory (estimated size 89.3 MiB, free 11.8 GiB)
22/09/15 11:41:09 INFO Executor: Finished task 3.0 in stage 33.0 (TID 1436). 93681153 bytes result sent via BlockManager)
22/09/15 11:51:49 INFO BlockManager: Removing RDD 20
22/09/15 12:00:24 INFO BlockManager: Removing RDD 20
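
To make the gaps concrete, here is a minimal Python sketch that computes the time elapsed between consecutive lines of the stderr excerpt above. The helper names (`parse_timestamp`, `gaps`) are only for illustration, the messages are shortened, and the sketch assumes the `yy/MM/dd HH:mm:ss` timestamp prefix seen in these logs:

```python
from datetime import datetime

# Timestamps copied from the executor stderr excerpt above (messages shortened).
LOG_LINES = [
    "22/09/15 11:41:09 INFO MemoryStore: Block taskresult_1436 stored as bytes in memory",
    "22/09/15 11:41:09 INFO Executor: Finished task 3.0 in stage 33.0 (TID 1436).",
    "22/09/15 11:51:49 INFO BlockManager: Removing RDD 20",
    "22/09/15 12:00:24 INFO BlockManager: Removing RDD 20",
]


def parse_timestamp(line):
    """Parse the leading 'yy/MM/dd HH:mm:ss' prefix (17 characters) of a log line."""
    return datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")


def gaps(lines):
    """Return the elapsed seconds between each pair of consecutive log lines."""
    stamps = [parse_timestamp(line) for line in lines]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


for seconds in gaps(LOG_LINES):
    print(f"{seconds // 60} min {seconds % 60} s")
```

For these four lines the gaps are 0 s, 10 min 40 s, and 8 min 35 s, so this executor logged nothing for about 19 minutes in total after its last task finished.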

In the main console where the application runs, I only get:

1806703 [dispatcher-BlockManagerMaster] INFO  org.apache.spark.storage.BlockManagerInfo  - Removed broadcast_1_piece0 on 192.168.100.237:46523 in memory (size: 243.7 KiB, free: 12.1 GiB)
1806737 [block-manager-storage-async-thread-pool-75] INFO  org.apache.spark.storage.BlockManager  - Removing RDD 20

If I try to print the dataset after the PCA is complete and before I unpersist it, it still takes ~20 minutes before it prints the dataset and then unpersists it. Why? Could this be related to the query and the Cassandra table?

I have not enabled MLlib Linear Algebra Acceleration because I am on Ubuntu 20.04, which has incompatibility issues with libgfortran5, etc., and I am also not sure it would help. I am not sure where to look, or what to look for, in order to reduce these 20 minutes to 10. Any ideas what might be happening? Let me know if you need any more information.


Solution

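  • It turns out that activating the Linear Algebra Acceleration libraries of Apache Spark ML does make a difference. It reduced the PCA calculation time by 10 minutes, so Spark no longer hangs. This suggests the apparent hang was the PCA computation still running rather than the unpersist itself.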