Tags: python, dask, dask-dataframe

Reducing tasks to complete when creating child dataframes in Dask


I am trying to understand the best practice for creating child dataframes in Dask, which I haven't found covered in the Dask documentation or its best-practices articles.

Let's say I have a very large Dask dataframe called df that takes many tasks to create. I then need to perform some operation on it, store the result in a child dataframe called child_df, and join child_df back to df.

Once the join has completed, I need to call .compute() to bring the data back into pandas and carry on with my work.
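In code, the pattern looks roughly like this (with dummy data standing in for my real, expensive pipeline; all names here are placeholders, not my actual code):

```python
import pandas as pd
import dask.dataframe as dd

# dummy data standing in for the expensive upstream pipeline
pdf = pd.DataFrame({"id": range(1000), "value": range(1000)})
df = dd.from_pandas(pdf, npartitions=8)
df = df.assign(value2=df.value * 2)  # placeholder for many real tasks

# derive a child dataframe from the parent...
child_df = df.groupby("id")["value2"].mean().to_frame("value2_mean").reset_index()

# ...join it back, then materialise the result in pandas
joined = df.merge(child_df, on="id")
result = joined.compute()
```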

I believe that child_df will duplicate the number of tasks it takes to create df, so I am wondering: is there a way to create child_df without rerunning the tasks that create df? Is my thinking correct that child_df doubles the work?

This is a very simplified view of what I am trying to achieve. I understand I could call df.compute() and build child_df from the result, but in my case that won't work, because df does not fit in memory and is filtered down further along in the process.

Hope this makes sense :)

[screenshot: task graph showing the apparent task duplication]


Solution

  • No, you will not be duplicating the tasks. The "child" does indeed need all the upstream tasks if you were to compute it alone, but when you join it back with the "parent" dataframe, Dask uses the unique key associated with every operation and combination of arguments to calculate each intermediate result only once and reuse it as many times as necessary (see the sketch after this list).

    (In some cases you may genuinely get some duplication, and may in fact want it, for example should one of your workers become slower than the others. This technique for improving performance and parallelism happens relatively rarely, and not at all if the system is under memory pressure.)
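To see why the work is not doubled, here is a minimal sketch. It uses dask.delayed rather than dataframes purely because the resulting graph is easy to inspect; the same key mechanism underlies dask.dataframe. The function names are hypothetical, and exact key strings vary between Dask versions:

```python
import dask

@dask.delayed
def expensive_parent():
    # stands in for the many tasks behind df
    return list(range(1, 11))

@dask.delayed
def derive_child(data):
    return sum(data)

@dask.delayed
def join(data, summary):
    return [x / summary for x in data]

parent = expensive_parent()
child = derive_child(parent)   # child depends on parent's key
result = join(parent, child)   # parent is consumed by two downstream tasks...

graph = dict(result.__dask_graph__())
print(list(graph))
# ...yet the graph, being a mapping of unique keys to tasks, contains
# the expensive_parent key exactly once, so it runs once per compute()
```

Within a single compute() call, result therefore executes expensive_parent once, holding its output in memory just long enough for both consumers to use it.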