I am currently trying Dask locally (parallel processing) for the first time on a large dataset (3.2 GB). I am comparing Dask's speed with pandas on simple computations. Using Dask seems to result in slower execution times for every task besides reading and transforming the data.
Example:
# pandas code
import numpy as np
import pandas as pd
import time

T = pd.read_csv("transactions_train.csv")
Reading the data is slow: it takes about 1.5 minutes. I then tried a simple computation:
%%time
T.price.mean()
This executes in about 3 seconds.
As for Dask:
from dask.distributed import Client, progress, LocalCluster
client = Client()
client
import dask.dataframe as dd
DT = dd.read_csv("transactions_train.csv")
This executed in 0.551 seconds.
%%time
DT.price.mean().compute()
This takes 25 seconds to run.
It gets worse for heavier computations like modelling.
Any help would be appreciated, as I am new to Dask and not sure whether I am using it right.
My PC has 4 cores.
Avoid calling compute repeatedly. Keep in mind that dd.read_csv is lazy: the 0.551 seconds you measured only builds the task graph, and the 3.2 GB file is actually read and parsed every time you call .compute(), which is why the mean looks so slow. For simple operations like these, batch them into a single compute so the file is only scanned once:
import dask

xmin, xmax = dask.compute(df.x.min(), df.x.max())
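Applied to your dataframe, a minimal sketch (assuming the same transactions_train.csv and its price column) would be:

import dask
import dask.dataframe as dd

DT = dd.read_csv("transactions_train.csv")

# One pass over the file: dask.compute merges the task graphs of all
# three aggregations, so the CSV is read and parsed only once.
mean_price, min_price, max_price = dask.compute(
    DT.price.mean(), DT.price.min(), DT.price.max()
)

If you plan to run many separate computations against the same data and it fits in RAM, DT = DT.persist() materializes the dataframe in the workers' memory once, so later .compute() calls skip the CSV parsing entirely.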