amazon-web-services amazon-s3 aws-glue aws-glue-data-catalog aws-glue-spark

Glue - Bookmark doesn't recognize files in newer partitions


I have a Glue job that reads from an S3 bucket, applies transformations, and writes the result to another S3 bucket.

Here's what aws glue get-job-bookmark --job-name xx returns:

"JobBookmark": "{\"datasource0\":{\"jsonClass\":\"HadoopDataSourceJobBookmarkState\",\"timestamps\":{\"RUN\":\"4\",\"HIGH_BAND\":\"900000\",\"CURR_LATEST_PARTITION\":\"1618957000000\",\"CURR_LATEST_PARTITIONS\":\"s3://XXYY/2021/04/20/16/\",\"CURR_RUN_START_TIME\":\"2021-04-20T22:43:19.304Z\",\"INCLUDE_LIST\":\"\"}}}"
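The escaped bookmark string is hard to read as is. A minimal Python sketch for decoding it follows, assuming the value of the JobBookmark field is pasted in verbatim as a string literal; the variable names are only illustrative.

```python
import json

# The escaped value of the "JobBookmark" field, pasted as a Python string literal.
raw_bookmark = "{\"datasource0\":{\"jsonClass\":\"HadoopDataSourceJobBookmarkState\",\"timestamps\":{\"RUN\":\"4\",\"HIGH_BAND\":\"900000\",\"CURR_LATEST_PARTITION\":\"1618957000000\",\"CURR_LATEST_PARTITIONS\":\"s3://XXYY/2021/04/20/16/\",\"CURR_RUN_START_TIME\":\"2021-04-20T22:43:19.304Z\",\"INCLUDE_LIST\":\"\"}}}"

# The bookmark is itself a JSON document, keyed by data source name.
state = json.loads(raw_bookmark)
timestamps = state["datasource0"]["timestamps"]
latest_partitions = timestamps["CURR_LATEST_PARTITIONS"]

print("run number:", timestamps["RUN"])
print("latest partition prefix:", latest_partitions)
```

Decoded this way, the prefix the bookmark currently points at can be compared directly against the prefixes that exist in the bucket.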

As you can see, my bucket is laid out as bucketname/yyyy/mm/dd/HH, and the output above shows the bookmark sitting at the prefix 2021/04/20/16.

If another file is added under that exact prefix, it is processed.

However, if there's a newer partition, say 2021/04/20/17, with a file in it, the bookmark doesn't pick it up.

My script is very straightforward; most of it is auto-generated, since I am only testing this feature.

The table's location is set to s3://xxyy, the very top level of the bucket.

Thanks for reading.


Solution

  • This happens because Glue is unaware of newer partitions until we add them to the table in Athena. There are three options: repair the table, run the crawler again on the newer folders (which costs money), or alter the table and add the partition. The third option works best for a schema that doesn't change often.

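    -- Register the new hourly partition; values are quoted because crawler-created partition columns are typically strings
    alter table xxyy
    add partition (partition_0='2021', partition_1='04', partition_2='21', partition_3='22')
    location 's3://xxyy/2021/04/21/22/';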
    

    The best part is that we can "pre-fill" the table with newer partitions even before the corresponding prefixes exist in S3.
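The pre-filling idea can be sketched in Python by generating one ADD PARTITION statement per upcoming hour. This is only an illustration under stated assumptions: the helper name add_partition_statements is hypothetical, the table name, bucket, and partition_0 to partition_3 column names are taken from the statement above, and the values are quoted on the assumption that the partition columns are strings. The helper only builds SQL text; running it (for example in the Athena console) is left to the reader.

```python
from datetime import datetime, timedelta


def add_partition_statements(table, bucket, start, hours):
    """Build one ALTER TABLE ... ADD PARTITION statement per hour.

    Hypothetical helper: it only produces SQL text, it does not execute it.
    """
    statements = []
    for offset in range(hours):
        ts = start + timedelta(hours=offset)
        # Zero-padded components matching the yyyy/mm/dd/HH bucket layout.
        year, month, day, hour = ts.strftime("%Y %m %d %H").split()
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
            f"(partition_0='{year}', partition_1='{month}', "
            f"partition_2='{day}', partition_3='{hour}') "
            f"LOCATION 's3://{bucket}/{year}/{month}/{day}/{hour}/'"
        )
    return statements


if __name__ == "__main__":
    # Three hourly partitions starting at 2021/04/21/22, rolling over midnight.
    for stmt in add_partition_statements("xxyy", "xxyy", datetime(2021, 4, 21, 22), 3):
        print(stmt)
```

Using IF NOT EXISTS keeps the statements safe to re-run, so the same script could be scheduled ahead of the data without failing on partitions that were already added.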