Tags: amazon-web-services, amazon-s3, databricks, spark-structured-streaming

Databricks Auto Loader vs. input source file deletion detection


While continuously ingesting files from a source S3 bucket, I would like to be able to detect when files are deleted. As far as I can tell, Auto Loader cannot detect files deleted from the source folder, so this case can't be supported. I want to confirm that first, and, if it is indeed the case, ask about the approach or workaround people use to handle this scenario.


Solution

  • According to the Databricks documentation, no: at this time, Auto Loader triggers on object creation only (e.g., ObjectCreated events) and therefore does not detect deleted files. Neither do Databricks Workflows via a mechanism such as the file arrival trigger.

    The ideal solution depends on what you want to do with the deleted files. A generic workaround, however, is to create your own AWS Lambda function that is triggered by s3:ObjectRemoved:* events (you can invoke Lambda functions from S3 Event Notifications). Depending on what you need to do with the deleted file, you may prefer to do the processing entirely in this Lambda function. Alternatively, the Lambda function could copy the file to a different location (note that this requires bucket versioning, so the prior version is still retrievable after deletion) or simply write a record of the deletion there, which a Databricks workflow could then process using either Auto Loader or a file arrival trigger; a minimal sketch of this approach follows below.
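
    As an illustration, here is a minimal sketch of such a Lambda handler in Python. It assumes the function is subscribed to s3:ObjectRemoved:* events through S3 Event Notifications; the destination bucket and prefix are placeholder names, and writing one JSON "tombstone" record per deletion is just one possible design, not the only option.

    ```python
    import json
    import urllib.parse

    import boto3

    s3 = boto3.client("s3")

    # Placeholder destination for deletion records; replace with your own bucket/prefix.
    TOMBSTONE_BUCKET = "my-deletion-events-bucket"
    TOMBSTONE_PREFIX = "deleted-objects/"


    def lambda_handler(event, context):
        """Handle S3 Event Notifications for s3:ObjectRemoved:* events.

        For each deleted object, write a small JSON "tombstone" record to a
        separate location that a Databricks job (Auto Loader or a file
        arrival trigger) can ingest downstream.
        """
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

            tombstone = {
                "bucket": bucket,
                "key": key,
                "eventName": record["eventName"],  # e.g. "ObjectRemoved:Delete"
                "eventTime": record["eventTime"],
            }

            # One small file per deletion keeps the downstream stream simple:
            # each newly arriving file represents exactly one deletion event.
            s3.put_object(
                Bucket=TOMBSTONE_BUCKET,
                Key=f"{TOMBSTONE_PREFIX}{key}.json",
                Body=json.dumps(tombstone).encode("utf-8"),
            )

        return {"statusCode": 200}
    ```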
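
    Downstream, a Databricks job could then ingest those deletion records with Auto Loader. A sketch, assuming the same placeholder bucket/prefix as above and a `spark` session as available in a Databricks notebook:

    ```python
    # Read the deletion records written by the Lambda function with Auto Loader.
    deletions = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Auto Loader needs a schema location to track inferred schemas for JSON.
        .option("cloudFiles.schemaLocation", "s3://my-deletion-events-bucket/_schema/")
        .load("s3://my-deletion-events-bucket/deleted-objects/")
    )

    # From here, apply whatever handling the deletions require, e.g. merging
    # the tombstones into a Delta table to mark the corresponding rows removed.
    ```

    Routing deletions through a tombstone prefix like this has the advantage that the deletion events become ordinary "new file" arrivals, which is exactly the case Auto Loader and file arrival triggers are designed for.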