amazon-s3 hive parquet python-polars

Can't get polars to read Hive-layout parquet files from S3: 404 Not Found


I'm struggling to read data from S3 via polars and just keep getting an unhelpful

Client error with status 404 Not Found

The data is laid out in S3 with what I believe to be Hive partitioning (although this is the first time we've used it, so it is possible we missed something). See Notes at the end.

Credentials are coming from boto3. I'm certain they are correct in boto3 since I can use boto3 for other actions on this same data:

from datetime import date

import boto3.session
import polars

session = boto3.session.Session()
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "region": session.region_name,
    "session_token": credentials.token,
}

url = "s3://my-example-bucket/staging/extract/contracts/*.parquet"

frame = polars.scan_parquet(url, storage_options=storage_options)

Neither of these works:

result = frame.filter(polars.col("record_date") == date(year=2024, month=1, day=1)).collect()
result = frame.collect()

The error is:

polars.exceptions.ComputeError: 'parquet scan' failed
The reason: Object at location staging/extract/contracts not found: Client error with status 404 Not Found: No Body:

An example key in the bucket is:

staging/extract/contracts/record_date=2024-01-01/contracts_0_0_2024-02-15T16:21:51.975005+00:00.parquet

Notes:

This is the first time we've worked with Hive partitioning, so we can't fully rule out a problem there. As far as we are aware, the only thing polars requires is for the parquet files to exist under the partition prefixes; i.e. there are no other metadata files present. Adding metadata files or changing keys are options if the problem is with how we've laid out the data.


Solution

  • As pointed out by jqurious, the glob pattern needs one star per partition level plus one for the file name, i.e. n+1 stars for n partition levels.

    So for a single partition level (record_date=...) there must be two stars: s3://foo/bar/*/*.parquet, not s3://foo/bar/*.parquet
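
To illustrate the n+1 rule, here is a minimal sketch in plain Python. The helper names hive_glob and glob_matches are hypothetical and not part of polars. The matcher assumes that a star does not cross a "/" boundary, which is consistent with the fix described above but is not taken from the polars source.

```python
import re


def hive_glob(base_url: str, partition_levels: int) -> str:
    """Build a glob with one '*' per partition level plus '*.parquet' for the file."""
    return base_url.rstrip("/") + "/*" * partition_levels + "/*.parquet"


def glob_matches(pattern: str, key: str) -> bool:
    """Match a key against a glob, assuming '*' never matches a '/' character."""
    regex = "[^/]*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, key) is not None


# Example key from the question: one partition level (record_date=...).
key = (
    "staging/extract/contracts/record_date=2024-01-01/"
    "contracts_0_0_2024-02-15T16:21:51.975005+00:00.parquet"
)

one_star = "staging/extract/contracts/*.parquet"
two_stars = "staging/extract/contracts/*/*.parquet"

print(glob_matches(one_star, key))   # the original pattern stops at the partition directory
print(glob_matches(two_stars, key))  # the corrected pattern reaches the file

url = hive_glob("s3://my-example-bucket/staging/extract/contracts", 1)
print(url)
```

The resulting url can then be passed to polars.scan_parquet(url, storage_options=storage_options) exactly as in the question. For a layout with two partition levels, hive_glob(base, 2) would produce three stars.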