python pandas amazon-s3 python-s3fs

Load a CSV file into pandas from S3 using chunksize


I'm trying to read a very large file from S3 using:

import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/filename', chunksize=100000)

But even with a chunksize specified, it takes forever. Does the chunksize option work when fetching a file from S3? If not, is there a better way to load big files from S3?


Solution

  • The read_csv documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) clearly says:

    filepath_or_buffer : str, path object or file-like object. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

    If you want to pass in a path object, pandas accepts any os.PathLike.

    By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

    When you pass chunksize, pandas returns an iterator (a TextFileReader) rather than a DataFrame, so you need to iterate over it. Something like:

    for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
        ...  # process each chunk (a DataFrame) here
    

    If you suspect the chunksize is too large, try reading only the first chunk with a small chunksize, like this:

    for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=1000):
        print(df.head())
        break
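
To make the iteration pattern concrete, here is a minimal sketch that runs without S3 access. It assumes an in-memory io.StringIO buffer as a stand-in for the S3 object (read_csv accepts file-like objects, as the documentation quoted above notes). The helper name count_rows_in_chunks and the sample data are hypothetical, chosen only for illustration.

```python
import io

import pandas as pd


def count_rows_in_chunks(source, chunksize):
    """Read a CSV in chunks; return (number_of_chunks, total_rows)."""
    n_chunks = 0
    total_rows = 0
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of a single DataFrame.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        n_chunks += 1
        total_rows += len(chunk)
    return n_chunks, total_rows


# Assumption: a small in-memory CSV stands in for the file on S3.
csv_text = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(10))

n_chunks, total_rows = count_rows_in_chunks(io.StringIO(csv_text), chunksize=4)
print(n_chunks, total_rows)  # 10 rows in chunks of 4 -> prints "3 10"
```

Passing an 's3://<<bucket-name>>/<<filename>>' string instead of the buffer should follow the same pattern, assuming s3fs is installed and credentials are configured. Only one chunk is held as a DataFrame at a time, so anything you need across chunks (counts, sums, filtered rows) has to be accumulated inside the loop.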