Both of them are meant for fast access to the DataSet. What is the difference between the two?
createOrReplaceTempView registers a DataFrame as a table that you can query using SQL (bound to the lifecycle of the SparkSession that registers it, hence the Temp part of the name). Note, however, that this method by itself does not give you any performance improvement.
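As a rough illustration, here is a minimal PySpark sketch of registering a DataFrame as a temporary view and querying it with SQL. The view name people, the sample rows, and the helper name temp_view_demo are made up for this example, and it assumes a local Spark installation is available.

```python
def temp_view_demo():
    """Register a DataFrame as a temp view and query it with SQL."""
    # Imported inside the function so this file still loads where PySpark is absent.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName("temp-view-demo").getOrCreate()
    )
    try:
        # Hypothetical sample data, only for illustration.
        df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

        # Gives the DataFrame a name that SQL can refer to.
        # Nothing is computed or cached at this point.
        df.createOrReplaceTempView("people")

        # The SQL query is planned against the same lineage as df.
        rows = spark.sql("SELECT name FROM people WHERE age > 40").collect()
        return [row["name"] for row in rows]
    finally:
        # Stopping the session also ends the lifetime of the temp view.
        spark.stop()
```

Calling temp_view_demo() returns the names that pass the filter. The view only adds a SQL-visible name; every query against it still runs the underlying computation.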
cache (or persist) marks the DataFrame to be cached when the next action runs, which makes subsequent actions on it faster. DataFrames, just like RDDs, represent the sequence of computations performed on the underlying (distributed) data structure (what is called its lineage). Whenever you perform a transformation (e.g. applying a function to each record via map), you are returned an updated lineage. Whenever you perform an action on the DataFrame, that is, a computation for which the lineage must be executed, the lineage is re-executed every time, unless the result has already been cached and is thus available.
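The idea of a lineage that is re-executed on every action can be sketched with a toy model in plain Python. This is not how Spark is implemented; the class LazyDataset and its runs counter are invented here purely to make the behaviour visible.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: it stores a lineage of functions, not data."""

    def __init__(self, source, lineage=()):
        self.source = source            # callable that produces the raw records
        self.lineage = tuple(lineage)   # transformations to apply, in order
        self.cache_requested = False
        self.cached = None
        self.runs = 0                   # how many times the lineage was executed

    def map(self, fn):
        # A transformation: returns a new dataset with an extended lineage.
        return LazyDataset(self.source, self.lineage + (fn,))

    def cache(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self.cache_requested = True
        return self

    def _compute(self):
        if self.cached is not None:
            return self.cached
        self.runs += 1
        records = list(self.source())
        for fn in self.lineage:
            records = [fn(r) for r in records]
        if self.cache_requested:
            self.cached = records
        return records

    def count(self):
        # An action: forces the lineage to run, or reads the cached result.
        return len(self._compute())

    def collect(self):
        # Another action, returning a copy of the records.
        return list(self._compute())


uncached = LazyDataset(lambda: range(5)).map(lambda x: x * 2)
uncached.count()
uncached.collect()
print(uncached.runs)  # 2: each action re-ran the lineage

cached = LazyDataset(lambda: range(5)).map(lambda x: x * 2).cache()
cached.count()
cached.collect()
print(cached.runs)  # 1: the second action read the cached records
```

Note that cache() itself does no work in this model; the first action after it pays the full cost and later actions reuse the stored result.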
This means that using cache or persist will help you optimize cases where you need to access the content of the DataFrame more than once.
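A minimal PySpark sketch of that pattern follows. The helper name cache_demo, the column name squared, and the row count are arbitrary choices for illustration, and a local Spark installation is assumed.

```python
def cache_demo():
    """Run two actions on one DataFrame, caching it in between."""
    # Imported inside the function so this file still loads where PySpark is absent.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("cache-demo").getOrCreate()
    try:
        # Hypothetical workload: a derived column on a generated range of ids.
        df = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))

        # Lazy: this only marks df to be cached.
        df.cache()

        # First action: executes the lineage and fills the cache.
        total = df.count()

        # Second action: can read the cached data instead of recomputing df.
        evens = df.filter(F.col("id") % 2 == 0).count()

        # Release the cached data once it is no longer needed.
        df.unpersist()
        return total, evens
    finally:
        spark.stop()
```

If df were used in only one action, the cache() call would add overhead without any benefit, since there would be no second access to speed up.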