apache-spark apache-spark-sql rdd alluxio

Difference between Alluxio (Tachyon) and Tungsten in Spark?


Tachyon is a distributed, in-memory storage system developed separately from Spark, which can be used as off-heap persistent storage for a Spark application.

Tungsten is a new Spark SQL component that makes Spark operations more efficient by working directly at the byte level. Since Tungsten no longer depends on Java objects, we can use either on-heap (in the JVM) or off-heap storage.

In off-heap mode, both reduce garbage collection overhead, since the data is not stored as Java objects.

So could I simply say that Tachyon benefits general RDDs, whereas Spark SQL benefits from Tungsten?

Suppose the following code:

import org.apache.spark.storage.StorageLevel

val df = spark.range(10)
val rdd = df.rdd

df.persist(StorageLevel.OFF_HEAP) // in Tungsten format (bytes)?
df.show()

rdd.persist(StorageLevel.OFF_HEAP) // in Tachyon storage?
rdd.count()
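One way to see what each persist call actually requests is to print the storage level Spark records for the DataFrame and for the RDD. The following is only a minimal sketch, assuming Spark 2.1 or later (where `Dataset.storageLevel` is available), a local master, and an arbitrary 512 MB off-heap allowance; the object name `StorageLevelCheck` and the app name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical standalone driver; not part of the original question.
object StorageLevelCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                               // assumption: local test run
      .appName("storage-level-check")                   // assumption: arbitrary name
      .config("spark.memory.offHeap.enabled", "true")   // let Spark allocate off-heap memory
      .config("spark.memory.offHeap.size", "512m")      // assumption: arbitrary size
      .getOrCreate()

    val df = spark.range(10)
    val rdd = df.rdd

    df.persist(StorageLevel.OFF_HEAP)
    rdd.persist(StorageLevel.OFF_HEAP)

    // Trigger both caches so the blocks are materialized.
    df.count()
    rdd.count()

    // Print the storage level Spark has recorded for each.
    println(s"DataFrame: ${df.storageLevel.description}")
    println(s"RDD:       ${rdd.getStorageLevel.description}")

    spark.stop()
  }
}
```

The Storage tab of the Spark UI shows the same information per cached dataset while the application is running.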

Solution

  • In short, both of your statements are incorrect: