sqlpython-3.xboto3pyarrowamazon-s3-select

botocore.exceptions.ClientError: An error occurred (ParseSelectMissingFrom) when calling the SelectObjectContent operation: Missing FROM after SELECT


Goal: Use S3 Select to extract columns from a .parquet on S3.

I've tried various queries. Including the key in the query makes no difference.

Code:

    s3 = boto3.client('s3')
    s3_uri = 's3://my-bucket/my-folder/'
    bucket, prefix = s3_uri[5:].rstrip('/').split('/', 1)

    key = f"{prefix}/{idx.split('-')[0]}.parquet"
    
    cols = ['col1', 'col2']
    query = f"SELECT {', '.join(cols) if isinstance(cols, list) else str(idx)} FROM s3object s WHERE s.key = '{key}'"
    # 1 cols could be either one string, one-to-may members in list

    print(bucket, prefix)
    print(key)
    print(query)

    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={'Parquet': {}},
        OutputSerialization={'JSON': {}}
    )

    data = ''.join(
        event['Records']['Payload'].decode()
        for event in response['Payload']
        if 'Records' in event
    )
    df = pl.read_json(io.StringIO(data), lines=True)

Traceback:

(venv) me@laptop:~/BitBucket/project$ python extract_parquet_column.py 
my-bucket my-folder
my-folder/01A450012.parquet
SELECT 01A450012-5-1767-474-256-256 FROM s3object s WHERE s.key = 'my-folder/01A450012.parquet'

Traceback (most recent call last):
  File "/home/me/BitBucket/project/extract_parquet_column.py", line 48, in getitem
    response = s3.select_object_content(
  File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/me/miniconda3/envs/venv/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ParseSelectMissingFrom) when calling the SelectObjectContent operation: Missing FROM after SELECT list at line 1, column 17.

Solution

  • All other variables were correct.

    However, using either single-quotes or double quotes changes the behaviours of what the SQL query returns :/

    query = f"""SELECT "{', '.join(idx) if isinstance(idx, list) else idx}" FROM S3Object s"""
    >>> ... {"01A450012-5-1767-474-256-256":215}
    
    query = f'''SELECT '{", ".join(idx) if isinstance(idx, list) else idx}' FROM S3Object s'''
    >>> ... {"_1":"01A450012-5-1767-474-256-256"}
    

    Not sure if this is a feature or a bug.