databricksazure-databricksexternal-tables

Can I access a row's file information in a Databricks external table?


I have an external table defined within my Databricks database that's pointing to a particular folder within an S3 bucket. There are multiple csv files in the folder, the contents of which all appear as rows in my table. This all works perfectly as expected.

What I want to know is, for a row in my external table, is there any way to extract information about the particular csv file that it came from? Ideally, I'd like to be able to extract the write date of the file so I can distinguish which rows are the most recently written.


Solution

  • Thanks to a hint from @DileepRajNarayanThumula I have an answer to my own question:

    There's a hidden column in external tables called _metadata . This is a STRUCT containing several fields, including the file update date. So to get the update date for each row of my table, I can write something like this:

    SELECT *,
           _metadata.file_modification_time
    FROM my_external_table