Tags: apache-spark, hive, hive-metastore

Cannot read persisted Spark warehouse databases in subsequent sessions


I am trying to create a locally persisted Spark warehouse database that will be present, loaded, and accessible to future Spark sessions created by the same application.

I have configured the Spark session conf with:

.config("spark.sql.warehouse.dir", "C:/path/to/my/long/lived/mock-hive")

When I create the databases, I see the mock-hive folder get created, and underneath it, folders for the two distinct databases I create: db1.db and db2.db.
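
The creation step is roughly the following (a sketch; the exact DDL isn't shown here, and the db1/db2 names come from the folder names above):

    // Create the two databases and confirm they are visible within this run.
    baseSparkSession.sql("CREATE DATABASE IF NOT EXISTS db1")
    baseSparkSession.sql("CREATE DATABASE IF NOT EXISTS db2")
    baseSparkSession.catalog.listDatabases().collect()  // default, db1, db2 all appear here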

However, these folders are EMPTY after the session completes, despite the databases being successfully created and subsequently queried in the run that stands them up.

On a subsequent run with the same configured Spark session, if I call baseSparkSession.catalog.listDatabases().collect(), I only see the default database. The two I created did not persist into the second Spark session.

What is the trick to make these locally persisted databases readable in subsequent executions?

I've noticed that the *.db folders under spark.sql.warehouse.dir are empty after creation, which might have something to do with it...

Spark Version: 3.0.1


Solution

  • It turns out spark.sql.warehouse.dir is not where the local database definitions are stored... they live in the Derby database under metastore_db. To relocate that, you need to set a system property:

    System.setProperty("derby.system.home", derbyPath)

    I didn't even have to set spark.sql.warehouse.dir; I just pointed derbyPath at a common location that all of the application's Spark sessions use (see the sketch after this answer).

    NOTE - You don't need to include the "metastore_db" portion in derbyPath; it is appended to the location automatically.
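
Putting it together, a minimal sketch of a working setup might look like the following. The derbyPath value, app name, and the enableHiveSupport() call are assumptions (the original builder isn't shown); the key point is that the property is set before the first SparkSession is created:

    import org.apache.spark.sql.SparkSession

    // Hypothetical shared location for the Derby metastore;
    // "metastore_db" is appended to this path automatically.
    val derbyPath = "C:/path/to/my/long/lived/derby-home"
    System.setProperty("derby.system.home", derbyPath)

    // Build the session only AFTER setting the property, so the embedded Derby
    // metastore is created under derbyPath rather than the working directory.
    val spark = SparkSession.builder()
      .appName("warehouse-persistence-test")
      .master("local[*]")
      .enableHiveSupport()  // assumption: Hive catalog so databases are recorded in the metastore
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS db1")
    spark.catalog.listDatabases().collect()  // db1 should still be listed on the next run

Any later session started by the same application with the same derby.system.home should then see db1 in its catalog.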