pythonpandaslistsubsetpartial-matches

subset pandas df columns with partial string match OR match before "?" using lists of names


I hope someone might help me.

I have a dataframe that inculdes columns with similar names (see example data)

I have 3 additional lists of column names which include the original names of the columns (i.e. the string occurring before the question mark (see lists of column names)

I need to subset the df dataframe into 3 separate dataframes, based on matching the first part of the column names present in the 3 lists. The expected output at the bottom.

It has to be in lists (or something programmatic) as I have lots and lots of columns like this. I tried pattern matching but because some names are very similar, they match to multiple lists.

thank you in advance!

example data

df = {'id': ['1','2','3','4'],
        'ab? op':  ['green', 'red', 'blue', 'None'],
        'ab? 1': ['red', 'yellow', 'None', 'None'],
        'cd': ['L', 'XL', 'M','L'],
        'efab? cba' : ['husband', 'wife', 'husband', 'None'],
        'efab? 1':['son', 'grandparent', 'son', 'None'],
        'efab? 2':['None', 'son', 'None', 'None'],
        'fab? 4':['9', '10', '5', '3'], 
        'fab? po':['England', 'Scotland', 'Wales', 'NA'] }


df = pd.DataFrame(df, columns = ['id','ab? op', 'ab? 1', 'cd', 'efab? cba', 'efab? 1', 'efab? 2', 'fab? 4', 'fab? po'])

list of column names in other 3 data frames


df1_lst = ['ab', 'cd']
df2_lst = ['efab']
df3_lst = ['fab']

desired output

df1 = ['ab? op', 'ab? 1', 'cd']
df2 = ['efab? cba', 'efab? 1', 'efab? 2']
df3 = ['fab? 4', 'fab? po']

Solution

  • You can form a dynamic regex for each df lists:

    df_lists = [df1_lst, df2_lst, df3_lst]
    
    result = [df.filter(regex=fr"\b({'|'.join(names)})\??") for names in df_lists]
    

    e.g., for the first list, the regex is \b(ab|cd)\?? i.e. look for either ab or cd but they should be standalone from the left side (\b) and there might be an optional ? afterwards.

    The desired entries are in the result list e.g.

    >>> result[1]
    
      efab? cba      efab? 1 efab? 2
    0   husband          son    None
    1      wife  grandparent     son
    2   husband          son    None
    3      None         None    None