Tags: python, dask, dask-dataframe

Randomly accessing a row of a Dask dataframe is taking a long time


I have a Dask dataframe with 100 million rows of data.

I am trying to iterate over this dataframe without loading the entire dataframe to RAM.
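Roughly, the pattern I have in mind is to materialise one partition at a time, something like this sketch (with `dask_df` as loaded below):

```python
# Illustrative sketch: only one partition is held in RAM at a time.
for i in range(dask_df.npartitions):
    pdf = dask_df.partitions[i].compute()   # pandas DataFrame for partition i
    for row in pdf.itertuples():            # iterate rows of the in-memory chunk
        ...                                 # process one row
```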

As an experiment, I am trying to access the row with index equal to 1:

%time dask_df.loc[1].compute()

It took a whopping 8.88 s (wall time).

Why is it taking so long?

What can I do to make it faster?

Thanks in advance.

Per request, here is the code. It just reads the 100 million rows of data and then accesses a single row.

`dask_df = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip", chunksize=10000000)`

Dask DataFrame Structure:

               avg_user_prod_aff_score  internalItemID  internalUserID
npartitions=1
               float32                  int16           int32

len(dask_df)

100,000,000

%time dask_df.loc[1].compute()

There are just 3 columns with datatypes of float32, int16 and int32.

The dataframe is indexed starting at 0.
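For diagnosis, the partition layout and index divisions can be inspected directly (outputs depend on how the file was written):

```python
print(dask_df.npartitions)      # how many partitions Dask created (here: 1)
print(dask_df.known_divisions)  # whether Dask knows the index boundaries per partition
print(dask_df.divisions)        # the boundaries themselves, or a tuple of None if unknown
```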

Writing the data, by contrast, is fast: around 2 minutes.

I must be doing something wrong here.


Solution

  • Accessing 10 million rows from Dask

    It looks like there is a performance issue with Dask when trying to access 10 million rows. It took 2.28 s to access the first 10 rows.

  • Accessing 100 million rows from Dask

    With 100 million rows, it takes a whopping 30 s.
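A likely root cause, rather than a Dask bug: the file was read into a single partition with unknown divisions (`npartitions=1`, `known_divisions == False`), so `.loc[1]` cannot jump to the partition containing that label; Dask has to load and scan the data to find it. Below is a minimal sketch of a workaround, assuming the parquet file was written with a sorted index and a dask version that supports `calculate_divisions` (older releases used `gather_statistics` instead):

```python
import dask.dataframe as dd

# Re-read the file so Dask learns the index divisions from the
# parquet metadata (requires the index to be sorted on disk).
dask_df = dd.read_parquet(
    "/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip",
    calculate_divisions=True,  # older dask versions: gather_statistics=True
)

print(dask_df.known_divisions)  # True -> .loc can route to a single partition

# With known divisions (and more than one partition), .loc[1] reads only
# the partition that contains index 1 instead of scanning everything.
dask_df.loc[1].compute()
```

If divisions cannot be recovered from the file metadata, repartitioning after a sorted `set_index`, or simply writing the data out as many smaller partitions, keeps any single lookup from materialising the whole dataset.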