I am pretty baffled and I don't know what is going on with this one.
I'm using DuckDB to query Parquet files in an S3 bucket.
import pandas as pd
import duckdb
query = """
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-west-2';
SET s3_access_key_id='key';
SET s3_secret_access_key='secret';
SELECT *
FROM read_parquet('s3://bucket/folder/file.parquet')
"""
cursor = duckdb.connect()
cursor.execute(query).df()
I have an IAM user with admin access, and I am able to query this Parquet file with its programmatic access keys. I also have a role that I want to use in an application; I have given it admin access too, just for testing purposes.
When I assume the role, create temporary credentials with the command below, and put those into the code above:
export $(printf "AWS_ACCESS_KEY_ID=%s AWS_SECRET_ACCESS_KEY=%s AWS_SESSION_TOKEN=%s" \
$(aws sts assume-role \
--role-arn arn:aws:iam::<account-id>:role/<role-name> \
--role-session-name test-session \
--query "Credentials.[AccessKeyId,SecretAccessKey,SessionToken]" \
--output text))
I get the error
duckdb.Error: Invalid Error: Unable to connect to URL "s3://bucket/folder/file.parquet": 403 (Forbidden)
However, when I use my IAM user, I am able to access this S3 object and query the data just fine. Is there something I am missing about the difference between roles and IAM users?
If it helps, what I am trying to do is create a role for a Lambda function and then read the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY with os.getenv() in the code above. I believe that if I can get the role working by writing in the temporary credentials, it should also work when I use os.getenv() in the Lambda function.
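I had a very similar issue. It worked after I also set the s3_session_token via SET s3_session_token='sessiontoken';. Temporary credentials from assume-role consist of three parts (access key ID, secret access key, and session token), and the access key pair is rejected unless the session token is sent with it, which is why the IAM user keys work but the role credentials return 403. Also, be aware that S3 is not a global service, which means that you need to make sure to set the correct s3_region.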
The code would be changed to
import pandas as pd
import duckdb
query = """
INSTALL httpfs;
LOAD httpfs;
SET s3_region='us-west-2';
SET s3_access_key_id='key';
SET s3_secret_access_key='secret';
SET s3_session_token='session-token';
SELECT *
FROM read_parquet('s3://bucket/folder/file.parquet')
"""
cursor = duckdb.connect()
cursor.execute(query).df()
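For the Lambda case mentioned in the question, a minimal sketch of reading the credentials from the environment instead of hard-coding them might look like the following. The helper names s3_settings_sql and query_parquet are hypothetical, as is the 'us-west-2' fallback region (taken from the code above). It assumes the runtime exposes AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, and AWS_REGION, as Lambda does for the function's execution role.

```python
import os


def s3_settings_sql(env=None, region="us-west-2"):
    """Build DuckDB SET statements from AWS-style environment variables.

    Hypothetical helper: `env` defaults to os.environ, and `region` is only
    a fallback used when AWS_REGION is not set.
    """
    env = os.environ if env is None else env
    settings = {
        "s3_region": env.get("AWS_REGION", region),
        "s3_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "s3_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
    # Temporary (assumed-role) credentials include a session token;
    # long-lived IAM user keys do not, so the setting is optional.
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        settings["s3_session_token"] = token
    # Double any single quotes so the values stay valid SQL string literals.
    return [
        "SET {}='{}';".format(name, value.replace("'", "''"))
        for name, value in settings.items()
    ]


def query_parquet(path, env=None):
    """Run SELECT * over one Parquet object in S3 and return a DataFrame."""
    import duckdb  # imported lazily so s3_settings_sql works without DuckDB

    con = duckdb.connect()
    con.execute("INSTALL httpfs;")
    con.execute("LOAD httpfs;")
    for statement in s3_settings_sql(env):
        con.execute(statement)
    # `path` is interpolated directly, so it should come from trusted input.
    return con.execute("SELECT * FROM read_parquet('{}')".format(path)).df()
```

Calling query_parquet('s3://bucket/folder/file.parquet') inside the function handler would then pick up whatever credentials the role provides. If the session token is missing from the environment, only the three basic settings are emitted, which matches the IAM user case.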