I have two dataframes say df1
and df2
, for example
import pandas as pd
col_1= ["A", ["B","C"], ["A","C","D"], "D"]
col_id = [1,2,3,4]
col_2 = [1,2,2,3,3,4,4]
d1 = {'ID': [1,2,3,4], 'Labels': col_1}
d2 = {'ID': col_2, }
d_2_get = {'ID': col_2, "Labels": ["A", "B", "C", "A", "C", "D", np.nan] }
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df_2_get = pd.DataFrame(data=d_2_get)
df1
looking like
ID col2
0 1 A
1 2 [B, C]
2 3 [A, C, D]
3 4 D
and df2
looking like
ID
0 1
1 2
2 2
3 3
4 3
5 4
6 4
I want to add a column Labels
to df2
, taken from df1
, in such a way that:
i
, start with the first value in df1
df2["ID"]
has a repeated entry, get the next value in df1
, if it exists. If not, set NaN
.Given df1
and df2
, the output should look like df_2_get
below
ID Labels
0 1 A
1 2 B
2 2 C
3 3 A
4 3 C
5 4 D
6 4 NaN
My current clumsy attempt is below,
from collections import Counter
def list_flattener(list_of_lists):
return [item for row in list_of_lists for item in row]
def my_dataframe_filler(df1, df2):
list_2_fill = []
repeats = dict(Counter(df2["ID"]))
for k in repeats.keys():
available_labels_list = df1[df1["ID"]==k]["Labels"].tolist()
available_labels_list+=[[np.nan]*10]
available_labels_list = list_flattener(available_labels_list)
list_2_fill+=available_labels_list[:repeats[k]]
return list_2_fill
and then use as
df2["Labels"] = my_dataframe_filler(df1, df2)
but I would like to learn how a pandas black belt would handle the problem, thanks
IIUC, you could explode
and perform a merge
after deduplication with groupby.cumcount
:
out = (df2
.assign(n=df2.groupby('ID').cumcount())
.merge(df1.explode('Labels').assign(n=lambda x: x.groupby('ID').cumcount()),
on=['ID', 'n'], how='left'
)
#.drop(columns='n')
)
Output:
ID n Labels
0 1 0 A
1 2 0 B
2 2 1 C
3 3 0 A
4 3 1 C
5 4 0 D
6 4 1 NaN
Alternatively, a pure python approach using iterators, map
and next
:
# for each list, build an iterator
d = dict(zip(df1['ID'], map(iter, df1['Labels'])))
# take the appropriate list and get the next available item
# default to None if exhausted
df2['Labels'] = df2['ID'].map(lambda x: next(d[x], None))
Output:
ID Labels
0 1 A
1 2 B
2 2 C
3 3 A
4 3 C
5 4 D
6 4 None