I want to create a list/set from a Dask DataFrame column. Basically, I want to use this list to filter rows in another dataframe by matching values against a column in that dataframe. I have tried using list(df[column]) and set(df[column]), but it takes a lot of time and ends up raising an error about creating a cluster, or sometimes it restarts the kernel when the memory limit is reached.

Can I use dask.bag or multiprocessing to create the list?
When you convert a column to a list or set with the regular list/set, Python loads all of that data into memory at once, which is why you hit the memory limit. I believe that using dask.bag might solve that issue, since dask.bag loads your data lazily, although I'm not sure whether df[column] won't have to be read first anyway. Also, be aware that turning that column into a bag will take a while, depending on how big the data is.

A dask.bag lets you run map, filter, and aggregate operations lazily, so it seems like a good fit for your problem.
You can try to run this to see if it filters the list/bag as you expect.
import dask.bag as db

bag = db.from_sequence(df[column], npartitions=5)
filtered = bag.filter(lambda element: element == "filtered row")
result = filtered.compute()  # filter() is lazy; nothing runs until compute()
Since this is just an example, you will need to change the npartitions and the lambda expression to fit your needs.
Let me know if this helps