Tags: databricks, databricks-sql

Databricks: managed tables vs. external tables


Managed tables are fully managed by the Databricks workspace: Databricks handles both the storage and the metadata of the table, including the lifecycle of the data. When you create or drop a managed table, Databricks automatically creates or removes the underlying data files.
External tables, on the other hand, store their data outside of Databricks-managed storage, typically in cloud storage such as AWS S3 or Azure Blob Storage. Databricks manages only the metadata for external tables; the actual data remains in the specified external location, giving you flexibility and control over the data storage lifecycle. This setup lets you leverage existing storage infrastructure while still using Databricks' processing capabilities.
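As a rough sketch (table names, schema, and the storage path below are placeholders, not from the original post), the difference shows up at `CREATE TABLE` time: a managed table omits `LOCATION`, while an external table points at storage you control:

```sql
-- Managed table: Databricks decides where the data files live
-- and owns their lifecycle.
CREATE TABLE sales_managed (
  id     BIGINT,
  amount DECIMAL(10, 2)
);

-- External table: metadata lives in the metastore, but the data
-- files sit at a location you control (bucket/path is a placeholder).
CREATE TABLE sales_external (
  id     BIGINT,
  amount DECIMAL(10, 2)
)
LOCATION 's3://my-bucket/warehouse/sales_external/';
```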

I can't understand the use of managed tables. Why would anyone want to keep the tables in Databricks when dropping such a table means the data is lost? It seems much more logical to always keep your data in your own storage and build your tables from there, so that if the tables are dropped, the data always remains.

What use case could justify the use of managed tables?


Solution

  • The decision to use managed tables or external tables depends on your use case and on the existing setup of your Delta Lake, framework code, and workflows.

    Your understanding of managed tables is only partially correct. It is true that for managed tables Databricks handles the storage and metadata, including the entire lifecycle of the data, but that does not mean the data is not stored in cloud storage. You can create a database in Databricks with a specified cloud storage location, and every managed table created inside that database is stored under that directory.
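For example (the schema name and path here are illustrative), with Unity Catalog you can give a schema a managed location, and managed tables created in it are stored under that directory in your own cloud storage:

```sql
-- Managed tables created in this schema are stored under the
-- schema's managed location (path is a placeholder).
CREATE SCHEMA finance
MANAGED LOCATION 's3://my-bucket/managed/finance/';

-- This managed table's data files land under the schema's
-- managed location; you never specify a path for the table itself.
CREATE TABLE finance.invoices (
  invoice_id BIGINT,
  total      DECIMAL(12, 2)
);
```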

    Also, if you are exposing those Delta tables to internal users who own the data, have little exposure to Databricks, and have strong SQL knowledge, and you want to give them that familiar experience, then managed tables make more sense: running DROP TABLE deletes the metadata as well as the actual data from the storage location.
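A short sketch of the DROP TABLE semantics described above (table names carried over from the hypothetical examples):

```sql
-- Managed table: DROP removes both the metastore entry
-- and the underlying data files.
DROP TABLE finance.invoices;

-- External table: DROP removes only the metadata; the data
-- files at the table's LOCATION remain in cloud storage.
DROP TABLE sales_external;
```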

    Managed tables also reduce the cleanup you need to do: with external tables, dropping the table does not actually delete the table data, so users have to remember and learn that extra deletion step.

    Also, if you are using Databricks only, with no other external systems or tools modifying or querying your data, then managed tables are the way to go.

    Now, coming to external tables: yes, external tables are useful when your data is consumed through multiple tools, whether because of a wide user base, the complexity of the use case, or different end-user groups accessing your data with different frameworks. They do carry the additional overhead of deleting the data files explicitly; if you fail to do so, you will keep being charged for that storage in the cloud.
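The extra cleanup step for external tables might look like this (path is a placeholder matching the earlier hypothetical example):

```sql
-- Dropping an external table removes only the metadata.
DROP TABLE sales_external;

-- The data files remain at the LOCATION and keep accruing storage
-- cost until deleted explicitly outside of SQL, e.g. with the cloud
-- provider's tooling or from a notebook:
--   dbutils.fs.rm('s3://my-bucket/warehouse/sales_external/', True)
```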