I have a Glue job that reads from an S3 bucket, does some transformations, and uploads the result to another S3 bucket.
Here's what aws glue get-job-bookmark --job-name xx returns:
"JobBookmark": "{\"datasource0\":{\"jsonClass\":\"HadoopDataSourceJobBookmarkState\",\"timestamps\":{\"RUN\":\"4\",\"HIGH_BAND\":\"900000\",\"CURR_LATEST_PARTITION\":\"1618957000000\",\"CURR_LATEST_PARTITIONS\":\"s3://XXYY/2021/04/20/16/\",\"CURR_RUN_START_TIME\":\"2021-04-20T22:43:19.304Z\",\"INCLUDE_LIST\":\"\"}}}"
As you can see, my S3 bucket is structured as bucketname/yyyy/mm/dd/HH, and the output above shows that the bookmark is set at the prefix 2021/04/20/16.
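For anyone who wants to check the bookmark position programmatically, here is a minimal Python sketch. It assumes you have already pulled the JobBookmark value out of the CLI output and unescaped it into a plain JSON string; the helper name latest_partitions is made up for illustration.

```python
import json

# The JobBookmark value is itself a JSON document stored as a string.
# This sample is the value from the output above, with the escaping removed.
bookmark_json = (
    '{"datasource0":{"jsonClass":"HadoopDataSourceJobBookmarkState",'
    '"timestamps":{"RUN":"4","HIGH_BAND":"900000",'
    '"CURR_LATEST_PARTITION":"1618957000000",'
    '"CURR_LATEST_PARTITIONS":"s3://XXYY/2021/04/20/16/",'
    '"CURR_RUN_START_TIME":"2021-04-20T22:43:19.304Z",'
    '"INCLUDE_LIST":""}}}'
)


def latest_partitions(bookmark: str) -> dict:
    """Map each data source name to its CURR_LATEST_PARTITIONS value.

    Hypothetical helper, not part of any AWS SDK.
    """
    state = json.loads(bookmark)
    return {
        name: source["timestamps"]["CURR_LATEST_PARTITIONS"]
        for name, source in state.items()
    }


print(latest_partitions(bookmark_json))
# prints {'datasource0': 's3://XXYY/2021/04/20/16/'}
```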
Now if another file is added at the same exact prefix, it is processed.
However, if there's a newer partition, say 2021/04/20/17, with a file in it, that file doesn't get picked up by the bookmark.
My script is very straightforward; most of it is auto-generated, since I am only testing this feature.
The location of my table is specified as s3://xxyy, at the very top level.
Thanks for reading.
This happened because Glue is blissfully unaware of newer partitions until we add them in Athena. We can either repair the table, run the crawler again on the newer folders ($$), or alter the table and add a partition. Option 3 works best for a schema that doesn't change often.
alter table xxyy
add partition (partition_0='2021',partition_1='04',partition_2='21',partition_3='22')
location 's3://xxyy/2021/04/21/22/'
And the best part is that we can "pre-fill" the table with newer partitions even before the corresponding prefix exists in S3.
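To pre-fill several hours at once, the statements can be generated instead of typed by hand. The Python sketch below only builds the DDL strings; it does not submit them anywhere. It assumes the crawler-style column names partition_0 through partition_3 and that those columns are strings (hence the quoted values). The function name add_partition_statements and the separate table and bucket arguments are made up for illustration.

```python
from datetime import datetime, timedelta


def add_partition_statements(table, bucket, start, hours):
    """Build one ALTER TABLE ... ADD PARTITION statement per hour.

    Hypothetical helper: assumes string-typed partition columns named
    partition_0..partition_3 and a bucket/yyyy/mm/dd/HH layout.
    """
    statements = []
    for offset in range(hours):
        ts = start + timedelta(hours=offset)
        # Zero-padded year, month, day, and hour, matching the S3 prefixes.
        y, m, d, h = ts.strftime("%Y %m %d %H").split()
        statements.append(
            f"alter table {table} add if not exists "
            f"partition (partition_0='{y}',partition_1='{m}',"
            f"partition_2='{d}',partition_3='{h}') "
            f"location 's3://{bucket}/{y}/{m}/{d}/{h}/'"
        )
    return statements


# Pre-fill three hourly partitions starting at 2021/04/21/22.
for stmt in add_partition_statements("xxyy", "xxyy", datetime(2021, 4, 21, 22), 3):
    print(stmt)
```

Each statement can then be run in the Athena console or through whatever client you already use. The "if not exists" clause makes reruns harmless when a partition has already been added.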