azure databricks databricks-sql databricks-unity-catalog

Ingesting data from an ADLS directory containing .txt, .txt.parquet and .parquet files with Databricks Auto Loader


I have a location in ADLS and need to ingest data from it into Unity Catalog. The directory contains a mixture of .txt, .txt.parquet and .parquet files. I am using Auto Loader with the parquet format option to ingest this data.

CREATE STREAMING LIVE TABLE Example_raw
TBLPROPERTIES ("quality" = "bronze") 
AS SELECT * FROM cloud_files("/mnt/Example", "parquet");

But the presence of the .txt and .txt.parquet files is causing the ingestion to fail. Can Auto Loader handle multiple file types in a single ingestion?

Thanks


Solution

  • Answer to your specific question:

    Auto Loader can only work with one file format per stream.

    Possible workaround:

    To ingest multiple file formats into the same table using Auto Loader, you could try a UNION ALL with a corresponding glob filter (the pathGlobFilter option) for each file format. I haven't tested the code below, but hopefully it conveys the concept:

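    -- Untested sketch: one Auto Loader source per file format, combined into one table.
    -- UNION ALL requires both branches to return the same number of columns with
    -- compatible types. The text source returns each line as a single string column
    -- (value), so SELECT * may need to be replaced with explicit column lists.
    CREATE STREAMING LIVE TABLE Example_raw
    TBLPROPERTIES ("quality" = "bronze")
    AS
    -- Branch 1: files whose names end in .parquet, read as Parquet
    SELECT *
      FROM cloud_files(
        "/mnt/Example",
        "parquet",
        map("pathGlobFilter", "*.parquet"))
      UNION ALL
    -- Branch 2: files whose names end in .txt, read as plain text
    SELECT *
      FROM cloud_files(
        "/mnt/Example",
        "text",
        map("pathGlobFilter", "*.txt"))
    ;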
    

    Files matching the filter must match the defined format. Note that the *.parquet filter also matches *.txt.parquet files, so those files must actually be Parquet files or the ingestion will fail.
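    To see which branch each file would land in, here is a small Python sketch of the glob matching. It is only an illustration: the file names are made up, and Python's fnmatch stands in for the Hadoop glob matching that pathGlobFilter uses on file names (for simple patterns like these the two should agree, but that is an assumption, not something verified against a cluster).

```python
from fnmatch import fnmatchcase

# Hypothetical file names standing in for the contents of /mnt/Example.
FILE_NAMES = ["a.parquet", "b.txt", "c.txt.parquet", "d.csv"]

# Glob filters from the UNION ALL workaround, keyed by the format each branch reads.
GLOB_BY_FORMAT = {"parquet": "*.parquet", "text": "*.txt"}


def route_files(names, glob_by_format):
    """Return {format: [names whose file name matches that format's glob]}."""
    return {
        fmt: [name for name in names if fnmatchcase(name, pattern)]
        for fmt, pattern in glob_by_format.items()
    }


routed = route_files(FILE_NAMES, GLOB_BY_FORMAT)
for fmt, matched in routed.items():
    print(fmt, matched)
```

    In this sketch c.txt.parquet is picked up by the parquet branch only, because its name ends in .parquet and not in .txt, and d.csv is picked up by neither branch.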

    See AutoLoader syntax and pathGlobFilter under File Format Options for additional details.