amazon-web-services, aws-glue, aws-glue-data-catalog, aws-glue-spark

Manually setting AWS Glue ETL Bookmark


My project is undergoing a transition to a new AWS account, and we are trying to find a way to persist our AWS Glue ETL bookmarks. We have a vast amount of processed data that we are replicating to the new account, and would like to avoid reprocessing.

It is my understanding that Glue bookmarks are just timestamps on the backend, and ideally we'd be able to get the old bookmark(s), and then manually set the bookmarks for the matching jobs in the new AWS account.

It looks like I could get my existing bookmarks via the AWS CLI using:

aws glue get-job-bookmark --job-name <value>

(Source)
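In case it helps anyone scripting this, the same lookup is available through boto3's Glue client. A minimal sketch (the region and job name below are placeholders, not values from my setup):

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # assumed region

    # Returns the bookmark entry for the job: version, run/attempt IDs, and the
    # serialized JobBookmark state that Glue tracks per data source.
    response = glue.get_job_bookmark(JobName="my-etl-job")  # placeholder job name
    print(response["JobBookmarkEntry"])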

However, I have been unable to find any method of setting the bookmarks in the new account.

As far as workarounds, my best bets seem to be:

  1. Add exclude patterns for all of our S3 data sources on our Glue crawler(s), though this would mean we could no longer track any of our existing unprocessed data via the Glue catalog (which we currently use to track record and file counts). This is looking like the best bet so far (see the sketch after this list)...
  2. Attempt to run the Glue ETL jobs prior to crawling our old (replicated) data in the new account, setting the bookmark past the created-time of our replicated S3 objects. Then, once we crawl the replicated data, the ETL jobs will consider it older than the current bookmark time and not process it on the next run. However, this hack doesn't appear to work, as I ended up processing all of the data when I tested it.
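For reference, workaround 1 can be scripted against an existing crawler. A rough boto3 sketch (the crawler name, S3 path, and exclude pattern are placeholders, not my actual setup):

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")  # assumed region

    # Update an existing crawler so it skips the replicated prefixes.
    glue.update_crawler(
        Name="my-crawler",  # placeholder crawler name
        Targets={
            "S3Targets": [
                {
                    "Path": "s3://my-bucket/raw/",      # placeholder data source path
                    "Exclusions": ["replicated/**"],    # glob-style exclude pattern
                }
            ]
        },
    )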

I'm really at a loss here, and the AWS Glue forums are a ghost town and have not been helpful in the past.


Solution

  • I was not able to manually set a bookmark or get a bookmark to manually progress and skip data using the methods in the question above.

    However, I was able to get the Glue ETL job to skip data and progress its bookmark using the following steps:

    1. Ensure any Glue ETL schedule is disabled

    2. Add the files you'd like to skip to S3

    3. Crawl S3 data

    4. Comment out the processing steps of your Glue ETL job's Spark code. I just commented out all of the dynamic_frame steps after the initial dynamic frame creation, up until job.commit().

      import sys
      from awsglue.transforms import *
      from awsglue.utils import getResolvedOptions
      from awsglue.context import GlueContext
      from awsglue.job import Job
      from pyspark.context import SparkContext
      
      # Standard Glue job setup
      args = getResolvedOptions(sys.argv, ['JOB_NAME'])
      sc = SparkContext()
      glueContext = GlueContext(sc)
      spark = glueContext.spark_session
      job = Job(glueContext)
      job.init(args['JOB_NAME'], args)
      
      # Create dynamic frame from raw glue table
      # (GLUE_DATABASE_NAME and JOB_TABLE are placeholders for your catalog database and table)
      datasource0 = glueContext.create_dynamic_frame.from_catalog(
          database=GLUE_DATABASE_NAME,
          table_name=JOB_TABLE,
          transformation_ctx="datasource0")
      
      # ~~ COMMENT OUT ADDITIONAL STEPS ~~ #
      
      job.commit()
      
    5. Run the Glue ETL job with bookmarks enabled, as usual (see the boto3 sketch at the end of this answer)

    6. Revert Glue ETL Spark code back to normal

    Now the Glue ETL job's bookmark has advanced, and any data that would have been processed on that run in step 5 has been skipped. The next time a file is added to S3 and crawled, it will be processed normally by the Glue ETL job.

    This can be useful if you know you will be getting some data that you don't want processed, or if you are transitioning to a new AWS account and are replicating over all your old data like I did. It would be nice if there was a way to manually set bookmark times in Glue so this was not necessary.
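    For completeness, here is a rough boto3 sketch of driving steps 1, 3, and 5 outside the console (the trigger, crawler, and job names are placeholders; bookmarks are enabled explicitly via the --job-bookmark-option job argument):

      import boto3

      glue = boto3.client("glue", region_name="us-east-1")  # assumed region

      # Step 1: make sure the scheduled trigger won't fire while the job code is stubbed out
      glue.stop_trigger(Name="my-etl-schedule")  # placeholder trigger name

      # Step 3: crawl the replicated S3 data into the catalog
      # (start_crawler is asynchronous; wait for the crawl to finish before moving on)
      glue.start_crawler(Name="my-crawler")  # placeholder crawler name

      # Step 5: run the stubbed-out job with bookmarks enabled so the bookmark advances
      glue.start_job_run(
          JobName="my-etl-job",  # placeholder job name
          Arguments={"--job-bookmark-option": "job-bookmark-enable"},
      )

      # Afterwards, confirm the bookmark has moved past the replicated objects
      print(glue.get_job_bookmark(JobName="my-etl-job")["JobBookmarkEntry"])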