terraformaws-glueterraform-provider-awsamazon-athenaaws-glue-data-catalog

How to define the AWS Athena s3 output location using terraform when using aws_glue_catalog_database and aws_glue_catalog_table resources


Summary

The terraform config below creates aws_glue_catalog_database and aws_glue_catalog_table resources, but does not define an s3 bucket output location which is necessary to use these resources in the context of Athena. I can add the s3 output location manually through the AWS console, but need to do it programmatically using terraform.

Detail

Minimal example terraform config which creates an aws glue database and table:

resource "aws_glue_catalog_database" "GlueDB" {
  name = "gluedb"
}

resource "aws_glue_catalog_table" "GlueTable" {
  name          = "gluetable"
  database_name = aws_glue_catalog_database.gluedb.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    EXTERNAL = "TRUE"
  }

  storage_descriptor {
    location      = var.GLUE_SOURCE_S3_LOCATION
    input_format  = "org.apache.hadoop.mapred.TextInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"

    ser_de_info {
      name                  = "jsonserde"
      serialization_library = "org.openx.data.jsonserde.JsonSerDe"

      parameters = {
        "serialization.format" = "1"
      }
    }

    columns {
      name    = "messageId"
      type    = "string"
      comment = ""
    }
  }
}

The aim is to be able to access the table either via the Athena query editor (AWS console), or using the python boto3 library (boto3.client('athena')).

However, before Athena access works in either case I need to define an output location for the query. This is easy to do in the AWS console (Amazon Athena -> Query Editor -> Manage settings -> Location of query result), but I need to do this via terraform so that the entire aws infrastructure stack can be setup in one go.

There is a terraform resource called aws_athena_workgroup which has an output_location property, but it is unclear how a separate aws_athena_workgroup resource would be related to the aws_glue_catalog_database already defined (there doesn't seem to be any way to link these two resources).

This answer suggests importing the existing primary workgroup into terraform and modifying it. But what I need is a terraform implementation which sets everything up from scratch in one go.

Any suggestions on how to wire up an s3 output-location in terraform so the above glue resources can be used in the context of Athena would be greatly appreciated!


Solution

  • AWS Glue and Athena are two independent services. Glue doesn't have to know the Athena query output location configuration at all. It just stores the query results run in Athena.

    You can simply create a new resources for aws_athena_workgroup next to Glue resources and define the result configuration bucket.

    resource "aws_athena_workgroup" "example" {
      name = "example"
    
      configuration {
        enforce_workgroup_configuration    = true
        publish_cloudwatch_metrics_enabled = true
    
        result_configuration {
          output_location = "s3://${aws_s3_bucket.example.bucket}/output/"
    
          encryption_configuration {
            encryption_option = "SSE_KMS"
            kms_key_arn       = aws_kms_key.example.arn
          }
        }
      }
    }