My DynamoDB table has both a partition key (PK) and a sort key (SK), and it holds a large data set (500 GB).
I'm using the syntax below to query data by PK in Glue, but it does a full table scan, which leads to the Glue job timing out. I have checked the RCU consumption, and it is also very high.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.region": region,
        "dynamodb.input.tableName": tablename,
        "dynamodb.query.filterExpression": "prt_key = :pv",
        "dynamodb.query.expressionAttributeValues": '{":pv": {"S": "' + id + '"}}',
    },
)
df = dyf.toDF()
Can someone please suggest a way to avoid the full table scan while still using create_dynamic_frame.from_options?
I already have a boto3-based solution for this; I'm looking specifically for one that uses create_dynamic_frame.from_options.
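The AWS Glue DynamoDB connector will always do a full table scan. A DynamoDB filter expression is applied only after the items have been read, so it does not reduce the RCU consumed. If you only want the data for a given partition key, use a boto3 Query to fetch just the items you need.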
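For reference, a minimal sketch of the boto3 Query approach might look like the following. It assumes the partition key attribute is named prt_key (taken from the question) and that it is a string. The helper names query_partition and fetch_partition are made up for illustration and are not part of boto3 or Glue. Query returns results in pages, so the loop follows LastEvaluatedKey until it is absent.

```python
from typing import Any, Dict, Iterator, List


def query_partition(table: Any, pk_name: str, pk_value: str) -> Iterator[Dict[str, Any]]:
    """Yield every item whose partition key equals pk_value.

    `table` is expected to behave like a boto3 DynamoDB Table resource,
    i.e. expose a query(**kwargs) method. Uses Query, not Scan.
    """
    kwargs: Dict[str, Any] = {
        # A key condition (unlike a filter expression) limits what is read.
        "KeyConditionExpression": "#pk = :pv",
        "ExpressionAttributeNames": {"#pk": pk_name},
        "ExpressionAttributeValues": {":pv": pk_value},
    }
    while True:
        page = table.query(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break  # no more pages
        # Resume the next page where the previous one stopped.
        kwargs["ExclusiveStartKey"] = last_key


def fetch_partition(region: str, tablename: str, pk_value: str) -> List[Dict[str, Any]]:
    """Hypothetical wrapper: build the Table resource and collect all items."""
    import boto3  # imported here so query_partition itself has no AWS dependency

    table = boto3.resource("dynamodb", region_name=region).Table(tablename)
    return list(query_partition(table, "prt_key", pk_value))
```

Inside a Glue job, the list returned by fetch_partition(region, tablename, id) could then be turned into a Spark DataFrame if one is needed. Note that the boto3 resource API returns numeric attributes as Decimal.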