python pandas dataframe modin

Does Modin speed up the pandas apply function?


I have looked for an answer in many places but have not found a direct one yet. Does Modin speed up apply on DataFrames? Is it intelligent enough to parallelize the apply function across the DataFrame rather than processing it row by row as usual?

Or

Should we use Spark DataFrames to speed up the apply function?

Apologies if there is an easy answer available. The answers I find are always about how fast Modin is at reading data or at certain other functions, rarely about apply.


Solution

  • To understand how Modin speeds up pandas operations, a few words about its architecture are needed. A Modin Frame is a 2D array of partitions, where each partition is a pandas DataFrame (link to doc with explanatory images). Usually a DataFrame is split into N_cores partitions, so an operation on a Modin Frame runs in parallel on every partition. That is how Modin speeds up pandas computations.

    Modin has a flexible partitioning mechanism: it can repartition a frame on the fly depending on the operation. For example, when an operation requires knowledge of the whole row (as in df.apply(fn, axis=1), where fn receives a complete row), the Modin Frame is repartitioned into row partitions only, so

    modin_df.apply(fn, axis=1)
    

    will perform something like this (explanatory image). As the image shows, if we have a frame with shape (100000, 64) and apply a function, we get N parallel executions of .apply() on frames of shape (100000/N, 64), which gives a decent speedup.
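The row-partitioning idea described above can be imitated in plain pandas. This is only a rough sketch of the concept, not Modin's actual implementation: the names `row_score` and `partitioned_apply` are made up for illustration, and a thread pool stands in for the worker processes Modin would use, so the sketch shows how the work is split rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def row_score(row):
    # Hypothetical row-wise function: it needs every column of one row.
    return row.max() - row.min()


def partitioned_apply(df, fn, n_partitions=4):
    # Split the frame into row partitions. Each chunk keeps all columns,
    # so fn still receives complete rows.
    bounds = np.linspace(0, len(df), n_partitions + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    # Run the ordinary pandas apply on every chunk concurrently.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        parts = list(pool.map(lambda chunk: chunk.apply(fn, axis=1), chunks))
    # Stitch the per-partition results back together in row order.
    return pd.concat(parts)


df = pd.DataFrame(np.random.default_rng(0).random((1000, 8)))
result = partitioned_apply(df, row_score)
expected = df.apply(row_score, axis=1)
```

Here `result` should match `expected`, the output of a single pandas `apply` over the whole frame. With Modin itself none of this is written by hand: replacing `import pandas as pd` with `import modin.pandas as pd` leaves the `df.apply(fn, axis=1)` call unchanged.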