I have a Dask DataFrame with 100 million rows of data.
I am trying to iterate over this dataframe without loading the entire thing into RAM.
As an experiment, I am trying to access the row at index 1:
%time dask_df.loc[1].compute()
It took a whopping 8.88 s (wall time).
Why is it taking so long?
What can I do to make it faster?
Thanks in advance.
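To clarify what I mean by iterating without loading everything into RAM, this is roughly the pattern I have in mind (just a sketch, reusing the same `read_parquet` call as in the code further down):

```python
import dask.dataframe as dd

# Same read as in the code shown further down.
dask_df = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip", chunksize=10000000)

# Process the data one partition at a time, so only a single partition
# is materialized in RAM at any point.
for i in range(dask_df.npartitions):
    part = dask_df.partitions[i].compute()  # pandas DataFrame for partition i
    # ... do something with `part` here ...
```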
Per request, here is the actual code. It just reads the 100 million rows of data and tries to access a row.
```python
import dask.dataframe as dd

dask_df = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip", chunksize=10000000)
```
```
Dask DataFrame Structure:
               avg_user_prod_aff_score  internalItemID  internalUserID
npartitions=1                   float32           int16           int32
```
len(dask_df)
100,000,000
%time dask_df.loc[1].compute()
There are just 3 columns with datatypes of float32, int16 and int32.
The dataframe is indexed starting at 0.
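In case it is relevant: as far as I understand, `.loc` can only jump straight to the right partition when Dask knows the index divisions, so this is a small check I can run (a sketch; `npartitions`, `known_divisions` and `divisions` are standard Dask DataFrame attributes):

```python
# Check whether Dask knows the index boundaries of each partition.
# If known_divisions is False, .loc[1] cannot jump directly to the
# partition containing index 1 and has to do more work.
print(dask_df.npartitions)      # number of partitions produced by read_parquet
print(dask_df.known_divisions)  # True if the index boundaries are known
print(dask_df.divisions)        # the boundaries themselves, or (None, None, ...)
```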
Writing time was actually very good, at around 2 minutes.
I must be doing something wrong here.
It looks like there is a performance issue with Dask when trying to access 10 million rows. It took 2.28 s to access the first 10 rows.
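For reference, this is roughly how the 10-row timing can be reproduced (a sketch, not necessarily the exact call I ran):

```python
# Two ways to look at the first 10 rows.
%time dask_df.loc[0:9].compute()  # label-based slice of the first 10 index labels
%time dask_df.head(10)            # head() only needs data from the first partition
```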