scala apache-spark rdd data-lineage spark-checkpoint

checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an rdd as detailed in 'learning spark' book

In learning Spark, I read the following:

In addition to pipelining, Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this case and just begin computing based on the persisted RDD. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persist()ed. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.

So, I decided to try to see this in action with a simple program (below):

val pairs = spark.sparkContext.parallelize(List((1,2)))
val x   = pairs.groupByKey()
x.toDebugString  // before collect
x.collect()
x.toDebugString  // after collect

spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
x.collect()
x.toDebugString  // after checkpoint

I did not see what I expected after reading the above paragraph from the Spark book. I saw the exact same output of toDebugString each time I invoked this method -- each time indicating two stages (where I would have expected only one stage after the checkpoint was supposed to have truncated the lineage.) like this:

scala>     x.toDebugString  // after collect
res5: String =
(8) ShuffledRDD[1] at groupByKey at <console>:25 []
 +-(8) ParallelCollectionRDD[0] at parallelize at <console>:23 []

I am wondering if the key thing that I overlooked might be the word "may", as in the "schedule MAY truncate the lineage". Is this truncation something that might happen given the same program that I wrote above, under other circumstances ? Or is the little program that I wrote not doing the right thing to force the lineage truncation ? Thanks in advance for any insight you can provide !

Solution

I think that you should do persist/checkpoint before you do first collect. From that code for me it looks correct what you get since when spark does first collect it does not know that it should persist or save anything.

Also probably you need to save result of x.persist and then use it... I propose - try it:

val pairs = spark.sparkContext.parallelize(List((1,2)))
val x   = pairs.groupByKey()

x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)

// **Also maybe do val xx = x.persist(...) and use xx later.**

x.toDebugString  // before collect
x.collect()
x.toDebugString  // after collect

spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.collect()
x.toDebugString  // after checkpoint