pythondaskdask-dataframe

Ways of Creating List from Dask dataframe column


I want to create a list/set from Dask Dataframe column. Basically, i want to use this list to filter rows in another dataframe by matching values with a column in this dataframe. I have tried using list(df[column]) and set(df[column]) but it takes lot of time and ends up giving error regarding creating cluster or sometimes it restarts kernel when memory limit is reached.

Can i use dask.bag or Multiprocessing to create a list?


Solution

  • when you try to convert a column to a list or set with the regular list/set Python will load that into memory, that's why you get a memory limit issue.

    I believe that by using dask.bag you might solve that issue since dask.bag will lazy load your data, although I'm not sure if the df[column] won't have to be read first. Also, be aware that turning that column into a bag will take a while depending on how big the data is.

    Using a dask.bag allows you to run map, filter and aggregate so it seems it could be a good solution for your problem.

    You can try to run this to see if it filters the list/bag as you expect.

    import dask.bag as db
    
    bag = db.from_sequence(df[column], npartitions=5) 
    
    bag.filter(lamdba list_element: list_element == "filtered row")
    
    

    Since this is just an example, you will need to change the npartitions and the lambda expression to fit your needs.

    Let me know if this helps