I hope someone might help me.
I have a dataframe that inculdes columns with similar names (see example data)
I have 3 additional lists of column names which include the original names of the columns (i.e. the string occurring before the question mark (see lists of column names)
I need to subset the df dataframe into 3 separate dataframes, based on matching the first part of the column names present in the 3 lists. The expected output at the bottom.
It has to be in lists (or something programmatic) as I have lots and lots of columns like this. I tried pattern matching but because some names are very similar, they match to multiple lists.
thank you in advance!
example data
df = {'id': ['1','2','3','4'],
'ab? op': ['green', 'red', 'blue', 'None'],
'ab? 1': ['red', 'yellow', 'None', 'None'],
'cd': ['L', 'XL', 'M','L'],
'efab? cba' : ['husband', 'wife', 'husband', 'None'],
'efab? 1':['son', 'grandparent', 'son', 'None'],
'efab? 2':['None', 'son', 'None', 'None'],
'fab? 4':['9', '10', '5', '3'],
'fab? po':['England', 'Scotland', 'Wales', 'NA'] }
df = pd.DataFrame(df, columns = ['id','ab? op', 'ab? 1', 'cd', 'efab? cba', 'efab? 1', 'efab? 2', 'fab? 4', 'fab? po'])
list of column names in other 3 data frames
df1_lst = ['ab', 'cd']
df2_lst = ['efab']
df3_lst = ['fab']
desired output
df1 = ['ab? op', 'ab? 1', 'cd']
df2 = ['efab? cba', 'efab? 1', 'efab? 2']
df3 = ['fab? 4', 'fab? po']
You can form a dynamic regex for each df lists:
df_lists = [df1_lst, df2_lst, df3_lst]
result = [df.filter(regex=fr"\b({'|'.join(names)})\??") for names in df_lists]
e.g., for the first list, the regex is \b(ab|cd)\??
i.e. look for either ab
or cd
but they should be standalone from the left side (\b
) and there might be an optional ?
afterwards.
The desired entries are in the result
list e.g.
>>> result[1]
efab? cba efab? 1 efab? 2
0 husband son None
1 wife grandparent son
2 husband son None
3 None None None