I'm trying to understand Spark's in-memory features. In the process I came across Tachyon, which is essentially an in-memory data layer that provides fault tolerance without replication by using lineage, and reduces recomputation by checkpointing datasets. Where I got confused is that all of these features also seem achievable with Spark's standard RDDs. So I wonder: do RDDs use Tachyon behind the curtains to implement these features? If not, then what is the use of Tachyon, when all of its work can be done by standard RDDs? Or am I making a mistake in relating the two? A detailed explanation, or a link to one, would be a great help. Thank you.
What is in the paper you linked does not reflect what is in Tachyon as a released open source project. Parts of that paper have only ever existed as research prototypes and were never fully integrated into Spark/Tachyon.
When you persist data at the OFF_HEAP storage level via rdd.persist(StorageLevel.OFF_HEAP), Spark uses Tachyon to write that data into Tachyon's memory space as a file. This removes it from the Java heap, giving Spark more heap memory to work with.
It does not currently write the lineage information, so if your data is too large to fit into your configured Tachyon cluster's memory, portions of the RDD will be lost and your Spark jobs can fail.
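To make that failure mode concrete, here is a small toy model in plain Python. It is not Spark or Tachyon code: BoundedStore, read_partition and build_partition are hypothetical names invented for illustration, and the "drop the write when full" behaviour is an assumption standing in for whatever the real store does when it runs out of memory. It only sketches the difference between losing a partition with and without lineage to rebuild it from.

```python
class LostBlockError(Exception):
    """Raised when a partition is gone and cannot be rebuilt."""


class BoundedStore:
    """Hypothetical stand-in for an off-heap store with a fixed capacity,
    measured in partitions. Not a real Tachyon or Spark class."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}

    def put(self, key, value):
        # Assumption for this sketch: when the store is full, the write is
        # silently dropped, so that partition is simply not there later.
        if len(self.blocks) < self.capacity:
            self.blocks[key] = value

    def get(self, key):
        return self.blocks.get(key)


def read_partition(store, key, recompute=None):
    """Return a partition from the store, rebuilding it if lineage is known."""
    value = store.get(key)
    if value is not None:
        return value
    if recompute is None:
        # No lineage was recorded, so there is nothing to rebuild from.
        raise LostBlockError(f"partition {key} lost and no lineage recorded")
    return recompute(key)


def build_partition(i):
    # The "lineage": how partition i is derived from its source data.
    return [x * x for x in range(i * 3, i * 3 + 3)]


# Three partitions, but the store only has room for two.
store = BoundedStore(capacity=2)
for i in range(3):
    store.put(i, build_partition(i))

# With lineage available, the missing partition is recomputed on demand.
with_lineage = [read_partition(store, i, recompute=build_partition) for i in range(3)]

# Without lineage, reading the missing partition fails the whole "job".
try:
    without_lineage = [read_partition(store, i) for i in range(3)]
except LostBlockError as err:
    without_lineage = None
    failure = str(err)
```

In this sketch the read with recompute=build_partition succeeds for all three partitions, while the read without it raises LostBlockError on the partition that never fit. That second path corresponds to the situation described above, where the data written off-heap has no lineage attached.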