python pandas numpy

Group list of lists based on a condition


I have a list of lists and I need to split it into groups: each group should start with a pattern (the word "START" in this case) and end on the line before the next occurrence of the pattern. Here is an example of the input and the groups I expect:

lst = [
    ["abc"],
    ["START"],
    ["cdef"],
    ["START"],
    ["fhg"],
    ["cdef"],
]

group_a = [
    ["START"],
    ["cdef"],
]

group_b = [
    ["START"],
    ["fhg"],
    ["cdef"],
]

I tried with numpy and pandas too without any success. Many thanks in advance for your support. Regards Tommaso


Solution

  • If you want to use pandas, you can create a boolean mask that identifies the occurrences of START, then take its cumsum() to assign a group number to each row (the number increases by one at every START). Finally, groupby the group number, excluding the rows before the first occurrence of START (group 0):

    import pandas as pd
    import numpy as np
    
    lst = [
        ["abc"],
        ["START"],
        ["cdef"],
        ["START"],
        ["fhg"],
        ["cdef"],
    ]
    
    df = pd.DataFrame(lst, columns=['Input'])
    
    # Create a boolean mask marking the rows equal to 'START'
    mask = df['Input'].eq('START')
    
    # Intermediate result (mask):
    # 0    False
    # 1     True
    # 2    False
    # 3     True
    # 4    False
    # 5    False
    # Name: Input, dtype: bool
    
    # Assign a group number to each row; it increments at every 'START'
    df['Group'] = mask.cumsum()
    
    # Intermediate result (df):
    #    Input  Group
    # 0    abc      0
    # 1  START      1
    # 2   cdef      1
    # 3  START      2
    # 4    fhg      2
    # 5   cdef      2
    
    # Build one list per group, excluding the rows before the
    # first occurrence of 'START' (group 0)
    grouped_lists = [group['Input'].tolist() for _, group in df[df['Group'] > 0].groupby('Group')]
    
    print(grouped_lists)
    # [['START', 'cdef'], ['START', 'fhg', 'cdef']]
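
  • If pandas is not required, a plain loop can produce the same grouping. The following is a minimal sketch, not part of the original answer: group_by_marker and its marker parameter are hypothetical names chosen for illustration. Unlike the pandas version, which returns flat strings, it keeps each row as its original one-element list, matching group_a and group_b in the question:

```python
def group_by_marker(rows, marker="START"):
    """Split rows into groups, each beginning at a row equal to [marker].

    Hypothetical helper for illustration; assumes every row is a list
    and that a marker row contains only the marker string.
    """
    groups = []
    for row in rows:
        if row == [marker]:
            # A marker row opens a new group
            groups.append([row])
        elif groups:
            # Any other row joins the most recent group;
            # rows seen before the first marker are skipped
            groups[-1].append(row)
    return groups


lst = [
    ["abc"],
    ["START"],
    ["cdef"],
    ["START"],
    ["fhg"],
    ["cdef"],
]

groups = group_by_marker(lst)
print(groups)
# [[['START'], ['cdef']], [['START'], ['fhg'], ['cdef']]]
```

    With this input, groups[0] corresponds to group_a and groups[1] to group_b. If there is no START row at all, the result is an empty list.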