apache-sparkamazon-redshiftaws-glue

Refresh Redshift materialized view using a Glue Job script


I'm trying to refresh a materialized view with a glue job, connecting to the Redshift cluster using boto3 authenticating with the database username. The execution timeouts have no errors in CloudWatch. I'm sure the problem is within the connection, but I can't find what I am doing wrong. Also, I configured a connection to the database in AWS Glue, which I use successfully with other Visual ETL jobs, as well as this one.

I am aware I could schedule a refresh using the code editor v2 query scheduler, but I had too many issues configuring the security, with no success, the schedule executes but I see no runs in the history, but this would be for another post.

Here's my script so far:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import boto3

# Initialize Glue context
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Redshift connection details
redshift_cluster_id = 'my-warehouse'
db_name = 'my-db'
db_user = 'my-user'

# SQL to refresh materialized views
refresh_sql_1 = "REFRESH MATERIALIZED VIEW mv_name1"
refresh_sql_2 = "REFRESH MATERIALIZED VIEW mv_name2"

def refresh_materialized_view(sql):
    client = boto3.client('redshift-data')
    response = client.execute_statement(
        ClusterIdentifier=redshift_cluster_id,
        Database=db_name,
        DbUser=db_user,
        Sql=sql,
        WithEvent=True
    )
    return response

# Refresh the materialized views and commit
refresh_materialized_view(refresh_sql_1)
refresh_materialized_view(refresh_sql_2)

job.commit()

During execution, the job loops running these commands until timeout (15 mins):


Solution

  • I did it with the Query Editor v2 scheduler, carefully following the documentation: https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-schedule-query.html

    The key, in my case, was to authenticate using Temporary credentials, allowing the IAM Role to access database credentials of a specific dbuser:

    {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "UseTemporaryCredentialsForAllDbUsers",
            "Effect": "Allow",
            "Action": "redshift:GetClusterCredentials",
            "Resource": [
                "arn:aws:redshift:*:*:dbuser:*/awsuser",
                "arn:aws:redshift:*:*:dbuser:*/myuser"
            ]
        }
      ]
    }