amazon-s3 yaml aws-cloudformation aws-glue

How to upload an ETL script to an S3 bucket using a YAML CloudFormation stack


I have been writing a CloudFormation stack in YAML and deploying it to AWS. (For legacy reasons, I unfortunately cannot switch to CDK ;))

The following YAML is part of the CloudFormation stack. It creates a Glue job that loads its ETL script (transform_json_to_parquet.py) from an S3 bucket (see the ScriptLocation line below).

A major limitation of this approach is:

It expects transform_json_to_parquet.py to already be present in S3-bucket-name-1, so I have to upload the file to that bucket manually. Is there any way to upload transform_json_to_parquet.py as part of deploying the CloudFormation stack to AWS?

  TransformJsonDataJob:
    Type: "AWS::Glue::Job"
    Properties:
      Role: !Ref AWSGlueETLJobRole
      Name: "TransformJsonToParquet"
      Description: "Transform JSON to Parquet"
      Timeout: 5
      WorkerType: G.1X
      NumberOfWorkers: 2
      MaxRetries: 0
      Command:
        Name: "glueetl"
        ScriptLocation: !Sub s3://<S3-bucket-name-1>/transform_json_to_parquet.py
      DefaultArguments:
        "--s3_json_path": !Sub s3://<S3-bucket-name-2>/
        "--s3_parquet_path": !Sub s3://<S3-bucket-name-3>/

Solution

  • There are two ways to achieve this:

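    1. Use the "aws cloudformation package" command from the AWS CLI. In your CloudFormation YAML file, set ScriptLocation to the local path of the Glue script; the command uploads the file to an S3 bucket you specify and writes an output template in which the local path is replaced by the S3 location. Doc: https://docs.aws.amazon.com/cli/latest/reference/cloudformation/package.html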

    2. Use a CloudFormation custom resource. This involves creating a Lambda function that backs the resource; you can embed the Glue script inline in the Lambda function code and have the function write it to the bucket during deployment. Doc: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/template-custom-resources-lambda.html

    I'd recommend trying option 1 first, since a custom resource adds more complexity.
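
    To make option 1 concrete, here is a minimal sketch rather than a drop-in template. It assumes the template is saved as template.yaml next to transform_json_to_parquet.py, that the output file is called packaged.yaml, and that the stack needs CAPABILITY_IAM because it creates the Glue role; those file names, the stack name placeholder, and the capability are assumptions, not something from the question. The CLI commands are shown as comments at the bottom.

```yaml
# template.yaml (hypothetical file name), kept alongside transform_json_to_parquet.py
Resources:
  TransformJsonDataJob:
    Type: AWS::Glue::Job
    Properties:
      Role: !Ref AWSGlueETLJobRole   # role assumed to be defined elsewhere in this template
      Name: TransformJsonToParquet
      Command:
        Name: glueetl
        # Local path instead of an s3:// URI. "aws cloudformation package"
        # uploads this file and substitutes the resulting S3 location
        # in the output template.
        ScriptLocation: ./transform_json_to_parquet.py
      DefaultArguments:
        "--s3_json_path": s3://<S3-bucket-name-2>/
        "--s3_parquet_path": s3://<S3-bucket-name-3>/

# Step 1: upload the script and write the rewritten template (bucket must already exist):
#   aws cloudformation package \
#     --template-file template.yaml \
#     --s3-bucket <S3-bucket-name-1> \
#     --output-template-file packaged.yaml
#
# Step 2: deploy the rewritten template (stack name and capability are assumptions):
#   aws cloudformation deploy \
#     --template-file packaged.yaml \
#     --stack-name <stack-name> \
#     --capabilities CAPABILITY_IAM
```

    Two things to keep in mind: the target bucket has to exist before you run the package command, so it cannot be created by the same stack, and the CLI chooses the uploaded object's key itself, so the script will most likely not be stored under the name transform_json_to_parquet.py. If you want the uploads grouped under a folder, the package command also accepts --s3-prefix.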