sqlamazon-s3boto3amazon-s3-select

S3 select query not recognizing data


I generate a dataframe, write the dataframe to S3 as CSV file, and perform a select query on the CSV in S3 bucket. Based on the query and data I expect to see '4' and '10' printed but I only see '4'. For some reason S3 is not seeing the '10'.

It works fine for filtering between date.

import pandas as pd
import s3fs
import boto3

# dataframe
d = {'date':['1990-1-1','1990-1-2','1990-1-3','1999-1-4'], 'speed':[0,10,3,4]}
df = pd.DataFrame(d)

# write csv to s3
bytes_to_write = df.to_csv(index=False).encode()
fs = s3fs.S3FileSystem()
with fs.open('app-storage/test.csv', 'wb') as f:
    f.write(bytes_to_write)

# query csv in s3 bucket
s3 = boto3.client('s3',region_name='us-east-1')

resp = s3.select_object_content(
    Bucket='app-storage',
    Key='test.csv',
    ExpressionType='SQL',
    Expression="SELECT s.\"speed\" FROM s3Object s WHERE s.\"speed\" > '3'",
    InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization = {'CSV': {}},
)

for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)

Solution

  • Just needed to cast the string to float in the SQL statement.

    "SELECT s.\"speed\" FROM s3Object s WHERE cast(s.\"speed\" as float) > 3"
    
    

    Not it works without a problem.