pythonpandascsvfilterdask

Efficiently filtering comma separated strings in pandas/dask


I have some data which has the following shape (with header)

Name,Signal,Date
MyName,"1,2,3,4,5,6,7,8,9,10",19-04-2024
MyName,"1,2,3,4,5,6,7,8,9,10",19-04-2024

I'm interested in filtering the rows based on the sum of the array in "Signal". So I tried the following :

df = read_csv("my_csv.csv", dtype={"Signal" : "string"}, parse_dates=True)

for i in df["Signal"]:
   t = np.array([int(x) for x in i.split(",")])
   if t.sum() == 100:
       #etc

This approach raises some issues though :

  1. How can I then record the index of the current row to then filter/drop it from my dataframe
  2. Can this operation be sped up/done more efficiently? I was thinking of allocating a 2d numpy array and then parsing in the numbers to only allocate once but not sure this would make a difference
  3. When using dask, which lacks a global row index, is there a more efficient way to filter the rows without allocating all the data into numpy arrays?

Solution

  • Having read in your dataframe, you can then convert the Signal values into a list of integers using this code:

    df['Signal'] = df['Signal'].apply(lambda s:list(map(int, s.split(','))))
    

    Output:

         Name                           Signal        Date
    0  MyName  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  19-04-2024
    1  MyName  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  19-04-2024
    

    Then you can play with your values to filter the dataframe. For example, to filter on sums:

    sums = df['Signal'].apply(sum)
    # 0    55
    # 1    55
    # Name: Signal, dtype: int64
    
    mask = sums == 100
    # 0    False
    # 1    False
    # Name: Signal, dtype: bool
    
    df_filtered = df[mask]
    

    If you want to use the sums more than once, it's probably best to save them in the dataframe e.g.

    df['sum'] = df['Signal'].apply(sum)