python, pandas, dask, dask-ml

Running two dask-ml imputers simultaneously instead of sequentially


I can impute the mean and the most frequent value using dask-ml like so, and this works fine:

import numpy as np
import pandas as pd
from dask_ml import impute

mean_imputer = impute.SimpleImputer(strategy='mean')
most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns=['Weight', 'Age', 'Height'])
# Mean-impute the first two columns, mode-impute the third
df.iloc[:, [0, 1]] = mean_imputer.fit_transform(df.iloc[:, [0, 1]])
df.iloc[:, [2]] = most_frequent_imputer.fit_transform(df.iloc[:, [2]])
print(df)


   Weight  Age  Height
0   100.0  2.0     5.0
1    85.0  4.5     5.0
2    70.0  7.0     5.0

But what if I have 100 million rows of data? It seems that dask would make two passes over the data when one would do. Is it possible to run both imputers simultaneously and/or in parallel instead of sequentially? What would sample code for that look like?


Solution

  • You can use dask.delayed, as suggested in the docs and the Dask tutorial, to parallelise the computation when the tasks are independent of one another.

    Your code would look like:

    from dask import delayed
    from dask.distributed import Client
    import numpy as np
    import pandas as pd
    from dask_ml import impute
    
    # Start a local cluster with 4 workers
    client = Client(n_workers=4)
    
    mean_imputer = impute.SimpleImputer(strategy='mean')
    most_frequent_imputer = impute.SimpleImputer(strategy='most_frequent')
    
    def fit_transform_mi(d):
        # Impute missing values with the column mean
        return mean_imputer.fit_transform(d)
    
    def fit_transform_mfi(d):
        # Impute missing values with the most frequent value
        return most_frequent_imputer.fit_transform(d)
    
    def setdf(a, b, df):
        # Write both imputed results back into the frame
        df.iloc[:, [0, 1]] = a
        df.iloc[:, [2]] = b
        return df
    
    data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
    df = pd.DataFrame(data, columns=['Weight', 'Age', 'Height'])
    
    # a and b do not depend on each other, so they can be scheduled in parallel
    a = delayed(fit_transform_mi)(df.iloc[:, [0, 1]])
    b = delayed(fit_transform_mfi)(df.iloc[:, [2]])
    # c depends on both a and b
    c = delayed(setdf)(a, b, df)
    df = c.compute()  # triggers execution of the whole graph
    print(df)
    client.close()
    

    The c object is a lazy Delayed object. It holds everything needed to compute the final result, including references to all the required functions, their inputs, and their relationships to one another. Nothing runs until c.compute() is called.
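
To see the laziness described above in isolation, here is a minimal sketch that does not need dask-ml or a distributed Client. The helpers fill_mean and fill_most_frequent are hypothetical plain-pandas stand-ins for the two imputers, not part of any library, and the threaded scheduler is an assumption made to keep the example local. The two delayed tasks are independent, so a single dask.compute call can evaluate both.

```python
import numpy as np
import pandas as pd
import dask
from dask import delayed


def fill_mean(frame):
    # Hypothetical stand-in for the mean imputer: fill NaNs with column means
    return frame.fillna(frame.mean())


def fill_most_frequent(frame):
    # Hypothetical stand-in for the most-frequent imputer: fill NaNs with the mode
    return frame.fillna(frame.mode().iloc[0])


data = [[100, 2, 5], [np.nan, np.nan, np.nan], [70, 7, 5]]
df = pd.DataFrame(data, columns=["Weight", "Age", "Height"])

# Building the tasks only records what to do; no imputation happens here
a = delayed(fill_mean)(df[["Weight", "Age"]])
b = delayed(fill_most_frequent)(df[["Height"]])

# One compute call evaluates both independent tasks (threaded scheduler assumed)
filled_a, filled_b = dask.compute(a, b, scheduler="threads")

# Stitch the two imputed pieces back together column-wise
result = pd.concat([filled_a, filled_b], axis=1)
print(result)
```

On a frame this small the scheduling overhead will likely outweigh any gain; whether running the two tasks side by side helps at 100 million rows depends on how the data is partitioned and how many workers are available.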