I am continuously receiving and storing multiple feeds of uncompressed JSON objects, partitioned by day, into different locations of an Amazon S3 bucket (hive-style: s3://bucket/object=<object>/year=<year>/month=<month>/day=<day>/object_001.json), and I was planning to incrementally batch and load this data into a Parquet data lake using AWS Glue.
This design pattern and architecture seemed to be quite a safe approach, as it is backed by many AWS blog posts, here and there.
I have a crawler configured like so:
{
    "Name": "my-json-crawler",
    "Targets": {
        "CatalogTargets": [
            {
                "DatabaseName": "my-json-db",
                "Tables": [
                    "some-partitionned-json-in-s3-1",
                    "some-partitionned-json-in-s3-2",
                    ...
                ]
            }
        ]
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG"
    },
    "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
}
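(For reference, the crawler itself is created by feeding that JSON to boto3's create_crawler along with the IAM role it needs. A rough sketch; the role ARN is a placeholder and crawler.json is a hypothetical file holding the configuration above:)

import json
import boto3

glue = boto3.client("glue")

# crawler.json is a hypothetical file holding the configuration shown above
# (with the full list of tables instead of the "..." placeholder)
with open("crawler.json") as f:
    crawler_config = json.load(f)

glue.create_crawler(
    Role="arn:aws:iam::123456789012:role/my-glue-crawler-role",  # placeholder role ARN
    **crawler_config,
)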
And each table was "manually" initialized like so:
{
    "Name": "some-partitionned-json-in-s3-1",
    "DatabaseName": "my-json-db",
    "TableType": "EXTERNAL_TABLE",
    "PartitionKeys": [
        {
            "Name": "year",
            "Type": "string"
        },
        {
            "Name": "month",
            "Type": "string"
        },
        {
            "Name": "day",
            "Type": "string"
        }
    ],
    "StorageDescriptor": {
        "Columns": [],  # I'd like the crawler to figure this out on its first crawl
        "Location": "s3://bucket/object=some-partitionned-json-in-s3-1/"
    }
}
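(For reference, this is roughly what that "manual" initialization looks like with boto3 — the definition above, minus DatabaseName, is passed as TableInput, and table_definition.json is a hypothetical file holding it:)

import json
import boto3

glue = boto3.client("glue")

# table_definition.json is a hypothetical file holding the definition shown above,
# without the DatabaseName key (it is passed as a separate parameter here)
with open("table_definition.json") as f:
    table_input = json.load(f)

glue.create_table(DatabaseName="my-json-db", TableInput=table_input)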
The first run of the crawler is, as expected, an hour-ish long, but it successfully figures out the table schema and existing partitions. Yet from that point onward, re-running the crawler takes the exact same amount of time as the first crawl, if not longer; which led me to believe that the crawler is not only crawling for new files / partitions, but re-crawling the entire S3 locations each time.
Note that the delta between two crawls is very small (only a few new files are expected each time).
The AWS documentation suggests running multiple crawlers, but I am not convinced that this would solve my problem in the long run. I also considered updating the crawler's exclude patterns after each run, but then I would see little advantage to using crawlers over manually updating table partitions through some Lambda boto3 magic.
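(By "Lambda boto3 magic" I mean something along these lines — an untested sketch in which a Lambda, triggered by the bucket's S3 ObjectCreated notifications, registers the new day partition by reusing the table's storage descriptor; the values in the example call at the bottom are made up:)

import boto3

glue = boto3.client("glue")

def add_partition(database, table, year, month, day):
    """Register a single day partition, reusing the table's storage descriptor."""
    table_sd = glue.get_table(DatabaseName=database, Name=table)["Table"]["StorageDescriptor"]
    # Table location already ends with a slash, e.g. s3://bucket/object=<object>/
    location = f"{table_sd['Location']}year={year}/month={month}/day={day}/"
    glue.batch_create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInputList=[{
            "Values": [year, month, day],
            "StorageDescriptor": {**table_sd, "Location": location},
        }],
    )

# e.g. called from a Lambda handler parsing an S3 ObjectCreated event (made-up values)
add_partition("my-json-db", "some-partitionned-json-in-s3-1", "2020", "01", "31")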
Am I missing something here? Maybe an option I have misunderstood regarding crawlers updating existing data catalogs rather than crawling data stores directly?
Any suggestions to improve my data cataloging? Note that indexing these JSON files in Glue tables is only necessary to me because I want my Glue Job to use bookmarking.
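(For context, the bookmark-dependent part of the job looks roughly like this — a sketch, with a placeholder Parquet output path; the transformation_ctx values are what job bookmarks key on, and bookmarks are enabled on the job via --job-bookmark-option job-bookmark-enable:)

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Bookmarks remember what was already read per transformation_ctx,
# so only new files/partitions are processed on each run.
source = glueContext.create_dynamic_frame.from_catalog(
    database="my-json-db",
    table_name="some-partitionned-json-in-s3-1",
    transformation_ctx="source",
)

glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://bucket/parquet/some-partitionned-json-in-s3-1/"},  # placeholder output path
    format="parquet",
    transformation_ctx="sink",
)

job.commit()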
Thanks!
AWS Glue crawlers now natively support Amazon S3 event notifications, which solves this exact problem.
See the blog post.
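A rough sketch of switching an existing catalog-target crawler to event mode with boto3 (the SQS queue ARN is a placeholder; the queue must already receive the bucket's S3 event notifications):

import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="my-json-crawler",
    # Only crawl what the S3 event notifications point at, instead of the whole prefix.
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    Targets={
        "CatalogTargets": [
            {
                "DatabaseName": "my-json-db",
                "Tables": ["some-partitionned-json-in-s3-1"],
                "EventQueueArn": "arn:aws:sqs:eu-west-1:123456789012:my-json-events",  # placeholder queue ARN
            }
        ]
    },
)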