I'm trying to read a very big file from s3 using...
import pandas as pd
import s3fs
df = pd.read_csv('s3://bucket-name/filename', chunksize=100000)
But even with the chunk size set, it takes forever. Does the chunksize
option work when fetching a file from S3? If not, is there a better way to load big files from S3?
The documentation at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html clearly says:
filepath_or_buffer : str, path object or file-like object. Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.
If you want to pass in a path object, pandas accepts any os.PathLike.
By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
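To illustrate the file-like case quoted above, here is a minimal sketch that reads a CSV in chunks from any object with a read() method. The helper name count_rows is hypothetical, and an in-memory io.StringIO stands in for a real handle. A handle opened with s3fs, e.g. s3fs.S3FileSystem().open('bucket-name/filename', 'rb'), should be usable the same way, but that is an assumption that is not exercised here.

```python
import io

import pandas as pd


def count_rows(handle, chunksize=1000):
    """Count the data rows of a CSV read from a file-like object, chunk by chunk."""
    total = 0
    # With chunksize set, read_csv returns an iterator of DataFrames
    for chunk in pd.read_csv(handle, chunksize=chunksize):
        total += len(chunk)
    return total


# In-memory stand-in for a real file handle (three data rows plus a header)
sample = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
row_count = count_rows(sample, chunksize=2)
print(row_count)
```

Only one chunk is held in memory at a time, so the same loop works for inputs that are too large to load at once.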
When reading in chunks, pandas returns an iterator object that you need to iterate through. Something like:
import pandas as pd
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    # process each chunk here; df is a DataFrame of up to 100000 rows
    print(len(df))
And if you think it's slow because the chunksize is large, you can try reading only the first chunk with a small chunksize, like this:
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=1000):
    print(df.head())
    break
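As a rough sketch of what "process df chunk" might look like in practice, the example below keeps a running total across chunks instead of building one large DataFrame. The column names id and value and the sample data are made up for illustration, and an in-memory buffer replaces the S3 path so the snippet runs without credentials.

```python
import io

import pandas as pd

# Hypothetical sample data: 10 rows with columns "id" and "value"
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# With chunksize set, read_csv returns an iterator rather than a DataFrame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_sizes = []
value_sum = 0
for chunk in reader:
    # Each chunk is a regular DataFrame with at most 4 rows
    chunk_sizes.append(len(chunk))
    value_sum += int(chunk["value"].sum())

print(chunk_sizes)
print(value_sum)
```

Swapping the StringIO buffer for the 's3://...' path should leave the loop unchanged, assuming s3fs is installed and credentials are configured.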