azure-databricks

Where does Databricks store managed tables?


I am new to Databricks and currently learning about managed tables. I have created a managed table in Databricks and, on inspection, it is being created in the following location:

dbfs:/user/hive/warehouse/demo.db/race_results_python
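
For reference, the location can be inspected from a notebook roughly like this (a minimal sketch, assuming the schema and table from the path above and a running cluster):

```python
# Minimal sketch: look up the storage path of the managed table.
# DESCRIBE TABLE EXTENDED returns a row whose col_name is "Location".
spark.sql("DESCRIBE TABLE EXTENDED demo.race_results_python") \
    .filter("col_name = 'Location'") \
    .show(truncate=False)
```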

The Microsoft documentation states that this is the DBFS root directory.

So I have two questions:

  1. Does this mean that Databricks stores tables in the default Storage Account created during the creation of the Databricks workspace?
  2. If the answer to the above question is yes, is it good practice to store tables here, or should we store them in a separate Storage Account?

For reference, I have looked at the documentation below: https://learn.microsoft.com/en-us/azure/databricks/dbfs/root-locations


Solution

Answering your two sub-questions individually below:

"Does this mean that Databricks is storing tables in the default Storage Account created during the creation of the Databricks workspace?"

"If the answer to the above question is yes, is it good practice to store tables here, or should we store them in a separate Storage Account?"

When you create a schema/database, there are two possibilities that determine where its managed tables are stored (a short sketch of both follows this list):

  1. You specify the location where all the tables within the schema/database should be stored (this can be an external location). In that case, every table created inside that schema/database without an explicit path is still created as a managed table, but its data is stored under the location specified at the schema/database level.
  2. You don't specify a location when creating the schema/database. The schema then points to the default storage location, and every table you create within that database without specifying a path is stored at that default location and created as a managed table.
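
A minimal PySpark sketch of the two options above; the ADLS path, storage account, and source table name are illustrative placeholders, not values from the question:

```python
# Option 1: schema created with an explicit (external) location.
# Tables created in it without a path are still managed tables,
# but their data files land under this location.
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS demo_ext
    LOCATION 'abfss://container@examplestorage.dfs.core.windows.net/demo_ext'
""")
spark.sql("CREATE TABLE demo_ext.race_results AS SELECT * FROM source_results")

# Option 2: schema created without a location.
# Managed tables created in it default to the DBFS root, e.g.
# dbfs:/user/hive/warehouse/demo.db/<table_name>.
spark.sql("CREATE SCHEMA IF NOT EXISTS demo")
spark.sql("CREATE TABLE demo.race_results AS SELECT * FROM source_results")
```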

To make sure the difference between managed tables and external tables is clear, here is a short description of each:

Managed tables are tables for which dropping the table deletes both the table's metadata in the hive_metastore and the actual data files. You don't need a separate process to delete the data files if that is what your use case requires.

External tables are tables for which dropping the table only deletes the table's metadata; it does not delete the actual data files. If your use case requires removing the data files of an external table, you need an external process to delete them.
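
The difference shows up when the tables are dropped. A minimal sketch, again with placeholder table names, a placeholder ADLS path, and an assumed existing source_results table:

```python
# Managed table: created without an explicit path.
spark.sql("CREATE TABLE demo.managed_results AS SELECT * FROM source_results")
spark.sql("DROP TABLE demo.managed_results")   # removes metadata AND the data files

# External table: created with an explicit LOCATION.
spark.sql("""
    CREATE TABLE demo.external_results
    LOCATION 'abfss://container@examplestorage.dfs.core.windows.net/external_results'
    AS SELECT * FROM source_results
""")
spark.sql("DROP TABLE demo.external_results")  # removes metadata only; data files remain
# Removing the leftover files would need a separate step, e.g.
# dbutils.fs.rm("abfss://container@examplestorage.dfs.core.windows.net/external_results", True)
```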