pandas · dataframe · amazon-s3 · pyarrow · aws-data-wrangler

How do I use awswrangler to read only the first N rows of a parquet file stored in S3?


I am trying to use awswrangler to read an arbitrarily large parquet file stored in S3 into a pandas dataframe, but to limit my query to the first N rows because of the file's size (and my poor bandwidth).

I cannot see how to do it, or whether it is even possible without relocating.

Could I use chunked=INTEGER and abort after reading the first chunk, say, and if so how?
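
Something along these lines is what I mean (a rough sketch only - I'm assuming chunked=1000 makes read_parquet yield DataFrames of up to 1000 rows each, and I don't know whether abandoning the generator actually avoids downloading the rest):

    import awswrangler as wr

    # chunked=N turns read_parquet into a generator of DataFrames with up to N rows each
    chunks = wr.s3.read_parquet(
        path="s3://my-bucket/big-file.parquet",  # placeholder path
        chunked=1000,
    )
    df = next(iter(chunks))  # take the first chunk and abandon the rest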

I have come across this incomplete solution (for the last N rows ;)) using pyarrow - Read last N rows of S3 parquet table - but a time-based filter would not be ideal for me, and the accepted answer doesn't even get to the end of the story (helpful as it is).

Or is there another way without first downloading the file (which I could probably have done by now)?

Thanks!


Solution

  • You can do that with awswrangler using S3 Select. For example:

    import awswrangler as wr
    
    # S3 Select pushes the LIMIT down to S3, so only 5 rows come back over the network
    df = wr.s3.select_query(
        sql="SELECT * FROM s3object s LIMIT 5",
        path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
        input_serialization="Parquet",
        input_serialization_params={},
        use_threads=True,
    )
    

    would return only 5 rows from the S3 object.

    This is not possible with the other read methods, because the entire object must be pulled down locally before it can be read. With S3 Select, the filtering is done on the server side instead, so only the selected rows are transferred.
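
    You can also push other filters down to S3 with a WHERE clause in the same way. For example (just a sketch - star_rating is only an example column here, substitute one from your own schema):

    # Only the rows matching the predicate (and the LIMIT) are sent back
    df = wr.s3.select_query(
        sql="SELECT * FROM s3object s WHERE s.star_rating >= 4 LIMIT 5",
        path="s3://amazon-reviews-pds/parquet/product_category=Gift_Card/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet",
        input_serialization="Parquet",
        input_serialization_params={},
        use_threads=True,
    )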