pandas · parallel-processing · dask · dask-dataframe · dask-ml

Dask still slower than pandas on large dataset (3.2 GB)


I am currently trying Dask locally (parallel processing) for the first time on a large dataset (3.2 GB). I am comparing Dask's speed with pandas on simple computations. Using Dask seems to result in slower execution times on every task besides reading and transforming the data.

Example:

# pandas code
import pandas as pd

# Read the full 3.2 GB CSV into memory.
T = pd.read_csv("transactions_train.csv")

Data reading is slow: it takes about 1.5 minutes.
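For reference, the pandas read itself can often be sped up by giving read_csv hints instead of letting it infer everything. A minimal sketch; only the price column is named in this post, so the usecols list and dtype are assumptions:

import pandas as pd

# Read only the needed columns and declare their types up front,
# so pandas skips full type inference on the 3.2 GB file.
T = pd.read_csv(
    "transactions_train.csv",
    usecols=["price"],           # hypothetical: only the column used below
    dtype={"price": "float32"},  # assumed dtype
)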

Then I tried a simple computation:

%%time
T.price.mean()

This executes in about 3 seconds.

As for Dask:

from dask.distributed import Client, progress, LocalCluster

# Start a local cluster with default settings; evaluating `client`
# in a notebook cell displays the cluster summary.
client = Client()
client
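LocalCluster is imported above but not used; instead of relying on Client()'s defaults, the cluster can also be sized explicitly. A minimal sketch, assuming the 4 cores mentioned at the end of the post:

from dask.distributed import Client, LocalCluster

# One single-threaded worker per core (4 cores assumed from the post).
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)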

import dask.dataframe as dd

DT = dd.read_csv("transactions_train.csv")

This executed in 0.551 seconds.

%%time
DT.price.mean().compute()

This takes about 25 seconds to run.

It gets worse for heavier computations such as modelling.

Any help would be appreciated, as I am new to Dask and not sure whether I am using it correctly.

My PC has 4 cores.


Solution

  • Avoid calling compute repeatedly. For simple reductions like these, batch them into a single call so Dask can share the work between them:

     import dask

     xmin, xmax = dask.compute(df.x.min(), df.x.max())
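  • The 0.551 s "read" is lazy: dd.read_csv only builds a task graph, so every later compute() call re-reads the CSV from disk. Below is a sketch applying the advice above to the question's DataFrame; the persist() step is an extra suggestion and assumes the data fits in RAM:

     import dask
     import dask.dataframe as dd

     DT = dd.read_csv("transactions_train.csv")

     # Load the partitions into worker memory once, so later
     # computations do not re-read the 3.2 GB CSV from disk.
     DT = DT.persist()

     # Batch related reductions into a single compute() call so Dask
     # shares one pass over the data between them.
     price_mean, price_min, price_max = dask.compute(
         DT.price.mean(), DT.price.min(), DT.price.max()
     )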