Let's say I'm starting with the following input table below:
date | id | val1 | val2 | val3 | val4 | val5 | val_final |
---|---|---|---|---|---|---|---|
2023-08-29 | B3241 | 496C | |||||
2023-09-08 | A3290 | 349C | 078F | 274F | |||
2023-09-12 | D2903 | 349C | 072F | 307C | 170F | 201D | |
2023-09-14 | I13490 | 497C | 0349 | 303F | 101A |
The code to create the initial input table is below:
import pandas as pd
df = pd.DataFrame({'date':["2023-08-29","2023-09-08","2023-09-12", "2023-09-14"],'id':["B3241","A3290","D2903", "I13490"],'val1':["496C","349C","349C", "497C"], 'val2':["","078F","072F", "0349"], 'val3':["","274F","307C", "303F"], 'val4':["","","170F", "101A"], 'val5':["","","201D",""]})
I want to look at columns "val1" through "val5" and see which rows contain a value from my code list. I want to populate the "val_final" column accordingly (if the value is in the list).
code_list = ['349C', '303F', '201D', '497C']
If multiple columns contain values from code_list I want to pick the one that's on the right most column.
Given the above logic, my desired output table would look like this:
date | id | val1 | val2 | val3 | val4 | val5 | val_final |
---|---|---|---|---|---|---|---|
2023-08-29 | B3241 | 496C | |||||
2023-09-08 | A3290 | 349C | 078F | 274F | 349C | ||
2023-09-12 | D2903 | 349C | 072F | 307C | 170F | 201D | 201D |
2023-09-14 | I13490 | 497C | 349C | 303F | 101A | 303F |
I searched all over StackOverflow to try and solve this problem but to no avail. I assume doing if then else is an option starting with column "val5" and going down to column "val1" but I want to learn a more efficient way to do this.
IIUC you can use apply for each row with a function that returns the first value that exists in the code_list starting with the right-most columns:
# returns the first match
def lambda_func(row, code_list):
for x in row:
if x in code_list:
return x
return None
# define your list of "val" columns
val_cols = [x for x in df.columns if x.startswith('val')][::-1]
df['val_final'] = df.apply(lambda row: lambda_func(row[val_cols], code_list), axis=1)
print(df)
Output:
date id val1 val2 val3 val4 val5 val_final
0 2023-08-29 B3241 496C None
1 2023-09-08 A3290 349C 078F 274F 349C
2 2023-09-12 D2903 349C 072F 307C 170F 201D 201D
3 2023-09-14 I13490 497C 0349 303F 101A 303F