apache-spark, aws-glue, delta-lake, aws-glue-data-catalog, data-lake

Can an AWS Glue crawler crawl Delta Lake files to create tables in the AWS Glue Data Catalog?


We have an existing infrastructure in which AWS Glue crawlers crawl our S3 directories. These directories are part of our AWS data lake and are written by a Spark job. To implement the delta feature, we are doing a POC on Delta Lake. When I write Delta Lake files to S3 through our Spark-Delta jobs, my crawlers are not able to create tables from these files.

Can we crawl Delta Lake files using AWS Glue crawlers?


Solution

  • As per this doc, you should not use a Glue crawler. Instead, use manifest files to integrate Delta tables with Athena.

    Warning

    Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
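
The manifest approach mentioned above can be sketched roughly as follows. This is a minimal sketch, not a definitive setup: it assumes the `delta-spark` package is available to the Spark job, and the helper names (`generate_manifest`, `athena_ddl`), the bucket, the table name, and the column list are all hypothetical. The first helper asks Delta Lake to write a symlink manifest under the table path; the second builds the Athena DDL that points an external table at that manifest directory instead of at the raw files.

```python
def generate_manifest(spark, table_path):
    """Write a symlink manifest under <table_path>/_symlink_format_manifest/.

    Assumes the delta-spark package is installed; the import is kept local
    so the DDL helper below can be used without Spark.
    """
    from delta.tables import DeltaTable

    DeltaTable.forPath(spark, table_path).generate("symlink_format_manifest")


def athena_ddl(table_name, columns, table_path):
    """Build a CREATE EXTERNAL TABLE statement that reads the manifest.

    columns is a list of (name, athena_type) pairs. table_path is the Delta
    table root as Athena sees it (an s3:// URI). Both are hypothetical inputs.
    """
    cols = ", ".join(f"`{name}` {dtype}" for name, dtype in columns)
    location = table_path.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE {table_name} ({cols})\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical table; run the printed statement in the Athena console.
    print(athena_ddl(
        "events",
        [("id", "bigint"), ("name", "string")],
        "s3://example-bucket/delta/events/",
    ))
```

In the Spark job, `generate_manifest(spark, path)` would be called after each write so the manifest keeps pointing at the current table version. For a partitioned table, the DDL would also need a `PARTITIONED BY` clause, and the partitions would have to be loaded in Athena (for example with `MSCK REPAIR TABLE`).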