Tags: python, dask, dask-dataframe

Randomly accessing a row of a Dask dataframe is taking a long time


I have a Dask dataframe with 100 million rows of data.

I am trying to iterate over this dataframe without loading the entire dataframe to RAM.
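Roughly, the pattern I have in mind is to materialise one partition at a time, something like this sketch (with `dask_df` as loaded below):

```python
# Illustrative sketch: only one partition is held in RAM at a time.
for i in range(dask_df.npartitions):
    pdf = dask_df.partitions[i].compute()   # pandas DataFrame for partition i
    for row in pdf.itertuples():            # iterate rows of the in-memory chunk
        ...                                 # process one row
```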

As an experiment, I am trying to access the row with index equal to 1:

%time dask_df.loc[1].compute()

It took a whopping 8.88 s (wall time).

Why is it taking so long?

What can I do to make it faster?

Thanks in advance.

Per request, here is the code. It just reads the 100 million rows of data and then accesses a single row.

`dask_df = dd.read_parquet("/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip", chunksize=10000000)`

Dask DataFrame Structure:

               avg_user_prod_aff_score  internalItemID  internalUserID
npartitions=1
               float32                  int16           int32

len(dask_df)

100,000,000

%time dask_df.loc[1].compute()

There are just 3 columns with datatypes of float32, int16 and int32.

The dataframe is indexed starting at 0.
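For diagnosis, the partition layout and index divisions can be inspected directly (outputs depend on how the file was written):

```python
print(dask_df.npartitions)      # how many partitions Dask created (here: 1)
print(dask_df.known_divisions)  # whether Dask knows the index boundaries per partition
print(dask_df.divisions)        # the boundaries themselves, or a tuple of None if unknown
```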

Writing the data, by contrast, is fast: around 2 minutes.

I must be doing something wrong here.


Solution

  • Accessing 10 million rows from Dask

    It looks like there is a performance issue with Dask when trying to access 10 million rows. It took 2.28 s to access the first 10 rows.

  • Accessing 100 million rows from Dask

    With 100 million rows, it takes a whopping 30 s.
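A likely root cause, rather than a Dask bug: the file was read into a single partition with unknown divisions (`npartitions=1`, `known_divisions == False`), so `.loc[1]` cannot jump to the partition containing that label; Dask has to load and scan the data to find it. Below is a minimal sketch of a workaround, assuming the parquet file was written with a sorted index and a dask version that supports `calculate_divisions` (older releases used `gather_statistics` instead):

```python
import dask.dataframe as dd

# Re-read the file so Dask learns the index divisions from the
# parquet metadata (requires the index to be sorted on disk).
dask_df = dd.read_parquet(
    "/content/drive/MyDrive/AffinityScore_STAGING/staging_affinity_block1.gzip",
    calculate_divisions=True,  # older dask versions: gather_statistics=True
)

print(dask_df.known_divisions)  # True -> .loc can route to a single partition

# With known divisions (and more than one partition), .loc[1] reads only
# the partition that contains index 1 instead of scanning everything.
dask_df.loc[1].compute()
```

If divisions cannot be recovered from the file metadata, repartitioning after a sorted `set_index`, or simply writing the data out as many smaller partitions, keeps any single lookup from materialising the whole dataset.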