[SOLVED] Load CSV file into Pandas from s3 using chunksize

I'm trying to read a very big file from s3 using...

import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/filename', chunksize=100000)

But even with a chunk size set, it takes forever. Does the chunksize option work when fetching a file from S3? If not, is there a better way to load big files from S3?

Solution

The pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) describes the first argument of read_csv as follows:

filepath_or_buffer : str, path object or file-like object. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

If you want to pass in a path object, pandas accepts any os.PathLike.

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
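
As a minimal sketch of the "file-like object" case, the snippet below passes an in-memory io.StringIO buffer to read_csv. The CSV text and column names are made up for illustration; they are not from the original question.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = "id,value\n1,10\n2,20\n3,30\n"

# StringIO has a read() method, so read_csv accepts it as a file-like object
buffer = io.StringIO(csv_text)
df = pd.read_csv(buffer)
print(df)
```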

When you pass chunksize, pandas returns an iterator object (a TextFileReader) rather than a DataFrame, so you need to iterate through it. Something like:

# Each iteration yields one DataFrame of up to 100000 rows
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    print(df.shape)  # placeholder: process the chunk here
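
To see the iterator behaviour without touching S3, here is a small sketch that reads an in-memory CSV in chunks and keeps only a running total, so the whole file is never held in memory at once. The data, the "value" column, and the chunk size of 2 are illustrative assumptions.

```python
import io
import pandas as pd

# Hypothetical data standing in for the S3 object
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

total = 0
rows_seen = 0
# chunksize=2 yields DataFrames of at most 2 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += int(chunk["value"].sum())
    rows_seen += len(chunk)

print(rows_seen, total)
```

Aggregating inside the loop, rather than collecting every chunk into a list, is what keeps memory use bounded by the chunk size.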

If you suspect the chunksize is too large, try reading only the first chunk with a much smaller chunksize, like this:

for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=1000):
    print(df.head())
    break
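
One way to check where the time goes is to time the first chunk on its own. The sketch below does this against an in-memory buffer; the data and the helper name time_first_chunk are hypothetical, and for the real case you would pass the s3:// path in place of the buffer.

```python
import io
import time
import pandas as pd

def time_first_chunk(source, chunksize):
    """Return the first chunk and the seconds it took to fetch and parse it."""
    start = time.perf_counter()
    reader = pd.read_csv(source, chunksize=chunksize)
    first = next(reader)  # pulls only the first chunk from the iterator
    elapsed = time.perf_counter() - start
    reader.close()
    return first, elapsed

# Hypothetical 100-row CSV built in memory
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(100)) + "\n"
first, elapsed = time_first_chunk(io.StringIO(csv_text), chunksize=10)
print(len(first), f"{elapsed:.4f}s")
```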