Goal: Use S3 Select to extract columns from a .parquet
on S3.
I've tried various queries. Including the key
in the query
makes no difference.
Code:
s3 = boto3.client('s3')
s3_uri = 's3://my-bucket/my-folder/'
bucket, prefix = s3_uri[5:].rstrip('/').split('/', 1)
key = f"{prefix}/{idx.split('-')[0]}.parquet"
cols = ['col1', 'col2']
query = f"SELECT {', '.join(cols) if isinstance(cols, list) else str(idx)} FROM s3object s WHERE s.key = '{key}'"
# 1 cols could be either one string, one-to-may members in list
print(bucket, prefix)
print(key)
print(query)
response = s3.select_object_content(
Bucket=bucket,
Key=key,
ExpressionType='SQL',
Expression=query,
InputSerialization={'Parquet': {}},
OutputSerialization={'JSON': {}}
)
data = ''.join(
event['Records']['Payload'].decode()
for event in response['Payload']
if 'Records' in event
)
df = pl.read_json(io.StringIO(data), lines=True)
Traceback:
(venv) me@laptop:~/BitBucket/project$ python extract_parquet_column.py
my-bucket my-folder
my-folder/01A450012.parquet
SELECT 01A450012-5-1767-474-256-256 FROM s3object s WHERE s.key = 'my-folder/01A450012.parquet'
Traceback (most recent call last):
File "/home/me/BitBucket/project/extract_parquet_column.py", line 48, in getitem
response = s3.select_object_content(
File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ParseSelectMissingFrom) when calling the SelectObjectContent operation: Missing FROM after SELECT list at line 1, column 17.
All other variables were correct.
However, using either single-quotes or double quotes changes the behaviours of what the SQL query returns :/
query = f"""SELECT "{', '.join(idx) if isinstance(idx, list) else idx}" FROM S3Object s"""
>>> ... {"01A450012-5-1767-474-256-256":215}
query = f'''SELECT '{", ".join(idx) if isinstance(idx, list) else idx}' FROM S3Object s'''
>>> ... {"_1":"01A450012-5-1767-474-256-256"}
Not sure if this is a feature or a bug.