Tags: apache-spark, hive, mapreduce, bigdata, hive-ddl

Any production scenario where an external table in Hive is definitely needed?


I know that when we drop an external table, Hive deletes only the metadata, not the data. But what is a concrete production use case in industry where managed (internal) tables are not enough and the requirement can only be met with external tables?

  1. If the argument is that an external table keeps the data safe from accidental deletion, the same can be achieved with a managed table by setting the NO_DROP option, which prevents the table from being dropped (sketched below).

  2. If the argument is that many tables can be created over the same underlying data, that too can easily be achieved with managed tables.

Is there any particular industry scenario where a managed table is not enough and an external table is required?
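For reference, enabling the NO_DROP protection mentioned above looks roughly like this (the table name is hypothetical; note that table protection was removed in Hive 2.0, so this only applies to older releases):

    -- protect a managed table from being dropped (pre-Hive-2.0 table protection)
    ALTER TABLE sales ENABLE NO_DROP;

    -- DROP TABLE sales;   -- fails while protection is enabled
    ALTER TABLE sales DISABLE NO_DROP;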


Solution

  • NO_DROP only prevents the table from being dropped, but there are scenarios where you need to drop and re-create the table while keeping the data.

    1. You may need to change the table definition and re-create the table without deleting the data. For example, JSON files are loaded by an upstream process, the upstream process adds one more attribute to the JSON, and you want to re-create your table definition to include the new column while keeping all the existing data as is. Or you add more columns that are parsed from semi-structured files (using RegexSerDe), or simply change the regex in the table DDL to improve extraction (see the DDL sketches at the end of this answer).

    2. You are loading data into partition folders daily. One downstream process needs only the latest partition and you cannot pass it as a parameter, while another process wants the whole table. In this case the upstream process can maintain both a partitioned table and a non-partitioned "latest" table created on top of the latest partition directory, dropping and re-creating it (or changing its location) on each load. The downstream process can use the latest table without knowing in advance which partition is the newest and without having to find it, while the full partitioned table remains available as well (sketched below).

    3. You may want to load data into a location outside the configured root directory for managed tables (see https://stackoverflow.com/a/67073849/2700344), and there may be policies in your warehouse that forbid loading raw data into the managed-tables directory. On the Cloudera distribution you cannot create a managed table with a location other than the default one. This is why you may need an EXTERNAL table (sketched below).

    4. Tables that use the JDBC, Kafka, or HBase storage handlers can be created as EXTERNAL only (sketched below).

    Of course, in some cases you can use ALTER TABLE instead of DROP/CREATE, but it is convenient to keep the table DDL as a single file, versioned in the repository, without a trail of ALTERs.

    And there may be other possible use cases as well.
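    A minimal sketch of scenario 1, assuming JSON files in a fixed landing location and the hcatalog JsonSerDe; all table, column, and path names are hypothetical:

        -- drop only removes the table definition; the JSON files in LOCATION stay
        DROP TABLE IF EXISTS raw_events;

        -- re-create the definition including the new attribute added upstream
        CREATE EXTERNAL TABLE raw_events (
          event_id   STRING,
          event_ts   STRING,
          user_agent STRING          -- new column already present in the files
        )
        ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
        LOCATION '/data/landing/raw_events';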
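    A sketch of scenario 2, with a hypothetical partition layout; the upstream job either re-creates the "latest" table or just switches its location after each daily load:

        -- non-partitioned table pointing at the newest partition directory
        DROP TABLE IF EXISTS events_latest;
        CREATE EXTERNAL TABLE events_latest (
          event_id STRING,
          payload  STRING
        )
        STORED AS PARQUET
        LOCATION '/warehouse/events/load_date=2021-05-14';

        -- in-place alternative (the new location usually needs to be a full URI):
        -- ALTER TABLE events_latest SET LOCATION 'hdfs://namenode/warehouse/events/load_date=2021-05-15';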
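    A sketch of scenario 3: the external table sits on a landing directory outside the managed-tables root (hive.metastore.warehouse.dir); the path is hypothetical:

        -- raw files stay wherever the ingestion pipeline drops them
        CREATE EXTERNAL TABLE raw_clicks (
          line STRING
        )
        STORED AS TEXTFILE
        LOCATION '/data/landing/clicks';    -- outside the managed warehouse dir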
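    A sketch of scenario 4 using the HBase storage handler (the same pattern applies to the JDBC and Kafka handlers); the HBase table name and column mapping are hypothetical, and recent Hive versions require such tables to be external:

        CREATE EXTERNAL TABLE hbase_clicks (
          row_key STRING,
          clicks  BIGINT
        )
        STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
        WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,stats:clicks')
        TBLPROPERTIES ('hbase.table.name' = 'clicks');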