scala, apache-spark, apache-spark-dataset

What is the difference between createOrReplaceTempView(viewName) and cache() on a Dataset?


Both of them seem to be meant for fast access to the Dataset. What is the difference between the two?


Solution

  • createOrReplaceTempView registers a DataFrame as a table that you can query using SQL. The view is bound to the lifecycle of the SparkSession that registers it (hence the Temp part of the name). Note, however, that this method does not give you any performance improvement by itself: it only registers a name for the DataFrame's query plan, and nothing is computed or stored.


  • cache (or persist) marks the DataFrame to be cached the next time an action runs on it, which makes subsequent actions faster. DataFrames, just like RDDs, represent the sequence of computations to perform on the underlying (distributed) data; this sequence is called the lineage. Whenever you apply a transformation (e.g. applying a function to each record via map), you get back a new DataFrame with an extended lineage, and nothing is computed yet. Whenever you perform an action on the DataFrame (e.g. count or collect), the lineage has to be executed to produce the result, and it is re-executed for every action unless the data has already been cached.

    This means that cache or persist helps in the cases where you need to access the content of the DataFrame more than once.
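
To make the distinction concrete, here is a minimal sketch that uses both methods together. It assumes a local SparkSession; the sample data, the view name `people`, and the object name `TempViewVsCache` are hypothetical and only there for illustration.

```scala
import org.apache.spark.sql.SparkSession

object TempViewVsCache {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("temp-view-vs-cache")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data.
    val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")

    // Registers a name that SQL queries can refer to.
    // Nothing is computed or stored at this point.
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name FROM people WHERE age > 30")

    // Marks the result for caching. This is lazy: no work happens yet.
    adults.cache()

    adults.count() // first action: executes the lineage and fills the cache
    adults.show()  // later action: can read the cached data instead of recomputing

    // Shows the storage level that cache() assigned to this DataFrame.
    println(adults.storageLevel)

    // Release the cached data once it is no longer needed.
    adults.unpersist()
    spark.stop()
  }
}
```

In this sketch the temp view only makes the DataFrame addressable from SQL, while cache is what avoids recomputing the query between the two actions. Without the call to cache, both count and show would execute the full lineage again.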