amazon-web-servicesamazon-s3amazon-s3-select

s3-select querying data on field name


I am trying to query data from json data in s3-select.

   {
    person = [
    {
            "Id": 1,
            "Name": "Anshu",
            "Address": "Templestowe",
            "Car": "Jeep"
    }
    {
            "Id": 2,
            "Name": "Ben Mostafa",
            "Address": "Las Vegas",
            "Car": "Mustang"
    }
    {
                    "Id": 3,
                    "Name": "Rohan Wood",
                    "Address": "Wooddon",
                    "Car": "VW"
    }
]
}

QUERY = "select * from S3Object s"
QUERY = "select s.person from S3Object s"
QUERY = "select s.person[0] from S3Object s"
QUERY = "select s.person[0].Name from S3Object s"

All these queries work fine and returns the respective object as desired but when i am trying to search data on name/Car, it doesn't work.

QUERY = "select * from S3Object s where s.person.Name = \"Anshu\" "

error: com.amazonaws.services.s3.model.AmazonS3Exception: The column index at line 1, column 32 is invalid.

There is not much related content available on s3-select online. Wondering whether we can query on field name or not! There are no examples of select query with where clause for s3-select given in the documentation


Solution

  • I can't find this in any of AWS documentations, but I was just playing around and discovered a working syntax:

    QUERY = "select * from S3Object s where 'Anshu' in s.person[*].Name"
    

    Based on some deductions:

    1. I know that syntax like WHERE ('blah' in s.tags) work when tags property is an array of string.
    2. AWS documentation also say that s.person[#] should work when # is a valid index/digit. Based on this, I discovered that using star (*) between square bracket, as in s.person[*].Name, also work. This is after failed testing of various syntax like s.Person[], s.Person[#], s.Person[?], etc...

    Proof with Python and Boto3:

    import boto3
    
    S3_BUCKET = 'your-bucket-name'
    
    s3 = boto3.client('s3')
    
    r = s3.select_object_content(
            Bucket=S3_BUCKET,
            Key='your-file-name.json',
            ExpressionType='SQL',
            Expression="select * from s3object s where 'Anshu' in s.person[*].Name",
            InputSerialization={'JSON': {"Type": "Lines"}},
            OutputSerialization={'JSON': {}}
    )
    
    for event in r['Payload']:
        if 'Records' in event:
            records = event['Records']['Payload'].decode('utf-8')
            print(records)
    

    Weird, I know. Remember to set [default] credentials in ~/.aws/credentials file.