apache-spark, caching, hive, persist

How to cache DataFrames so that they are retained after the Spark session terminates


I'm new to data engineering, so this might be a basic question, but I haven't been able to find a clear answer.

Context:
I have a Spark job executed by an Azure Data Factory pipeline every 10 minutes. On every execution, the Spark job connects to a database (Azure SQL Server) and loads 3 tables. All three are small tables (1.9M, 72K, and 72K rows). But sometimes it fails to connect to the database, and the whole pipeline fails.

What I want to do:
I want to cache the 3 tables (once a day, if possible) so that Spark doesn't have to connect to the database on every single execution.

At first I thought df.persist(StorageLevel.DISK_ONLY) could retain the DataFrames even after the Spark session is terminated, but I couldn't confirm that. Another option I thought of was saving the tables in Hive, but Hive is usually for big data, so I'm not sure it's the right choice in this situation.

Any advice is welcome.


Solution

  • When you call df.persist(StorageLevel.DISK_ONLY), Spark caches the data in the location defined by the spark.local.dir property. Caching is useful for iterative jobs, where Spark can avoid recomputing the entire DAG each time an action such as count() is called; instead, it resumes from the point at which the data was cached (see Figure 1 in the Spark research paper for further explanation). But once the application is stopped with spark.stop(), the cached data is lost and cannot be accessed by new applications (the first sketch below illustrates this).

    The issue you're facing is an unreliable connection to the database, not caching. I'd recommend adding a few retries in case the connection fails. Also, consider writing the data to external storage (e.g. HDFS or the local file system) so that it can be read when the connection to the DB is down; the second sketch below shows one way to combine both.
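
Here is a minimal sketch of that session-scoped behaviour, assuming a JDBC read from a hypothetical Azure SQL table (the URL, credentials, and table name are placeholders):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()

# Placeholder JDBC options -- substitute your Azure SQL Server details.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")
      .option("dbtable", "dbo.some_table")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())

# Materialize an on-disk cache (stored under spark.local.dir).
df.persist(StorageLevel.DISK_ONLY)
df.count()   # first action reads from the database and writes the cache
df.count()   # served from the on-disk cache, no DB round trip

spark.stop() # the cache is discarded here; a new application cannot reuse it
```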
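And here is a sketch of the retry-plus-snapshot idea. The snapshot path, table list, retry counts, and the helper names read_from_db / load_table are illustrative assumptions, not an established API: the job retries the JDBC read a few times, refreshes a Parquet snapshot when the read succeeds, and falls back to the most recent snapshot when the database is unreachable.

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-with-fallback").getOrCreate()

# Placeholder storage location and table names.
SNAPSHOT_BASE = "abfss://snapshots@<storage-account>.dfs.core.windows.net/sql-cache"
TABLES = ["dbo.table_a", "dbo.table_b", "dbo.table_c"]

def read_from_db(table, retries=3, wait_s=30):
    """Attempt the JDBC read a few times before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return (spark.read.format("jdbc")
                    .option("url", "jdbc:sqlserver://<server>.database.windows.net;databaseName=<db>")
                    .option("dbtable", table)
                    .option("user", "<user>")
                    .option("password", "<password>")
                    .load())
        except Exception:
            if attempt == retries:
                raise          # out of retries: let the caller decide what to do
            time.sleep(wait_s)  # wait before the next attempt

def load_table(table):
    """Prefer a fresh read (and refresh the snapshot); otherwise fall back
    to the last snapshot written to external storage."""
    path = f"{SNAPSHOT_BASE}/{table}"
    try:
        df = read_from_db(table)
        df.write.mode("overwrite").parquet(path)  # keep the snapshot up to date
        return df
    except Exception:
        # Database unreachable: serve the most recent snapshot instead of failing.
        return spark.read.parquet(path)

dfs = {t: load_table(t) for t in TABLES}
```

If you only want to hit the database once a day, the same snapshot layout works: schedule a separate daily job that refreshes the Parquet files, and have the 10-minute job read only from the snapshots.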