[SOLVED] Ways of Creating List from Dask dataframe column

Ways of Creating List from Dask dataframe column

I want to create a list/set from Dask Dataframe column. Basically, i want to use this list to filter rows in another dataframe by matching values with a column in this dataframe. I have tried using list(df[column]) and set(df[column]) but it takes lot of time and ends up giving error regarding creating cluster or sometimes it restarts kernel when memory limit is reached.

Can i use dask.bag or Multiprocessing to create a list?

Solution

when you try to convert a column to a list or set with the regular list/set Python will load that into memory, that's why you get a memory limit issue.

I believe that by using dask.bag you might solve that issue since dask.bag will lazy load your data, although I'm not sure if the df[column] won't have to be read first. Also, be aware that turning that column into a bag will take a while depending on how big the data is.

Using a dask.bag allows you to run map, filter and aggregate so it seems it could be a good solution for your problem.

You can try to run this to see if it filters the list/bag as you expect.

import dask.bag as db

bag = db.from_sequence(df[column], npartitions=5) 

bag.filter(lamdba list_element: list_element == "filtered row")

Since this is just an example, you will need to change the npartitions and the lambda expression to fit your needs.

Let me know if this helps