I have a file with some family relationship data and I would like to create a family id column based on id
and sibling_id
. My data looks like the following:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'id': [1, 1, 2, 2, 3, 3, 4, 5, 6, 7],
'field_a': list('AABBCCDEFG'),
'sibling_id': [2, 3, 1, 3, 1, 2, np.nan, np.nan, 7, 6],
'sibling_field_a': ['B', 'C', 'A', 'C' , 'A', 'B', np.nan, np.nan, 'G', 'F']
})
df['sibling_id'] = df['sibling_id'].astype('Int64')
id field_a sibling_id sibling_field_a
0 1 A 2 B
1 1 A 3 C
2 2 B 1 A
3 2 B 3 C
4 3 C 1 A
5 3 C 2 B
6 4 D <NA> NaN
7 5 E <NA> NaN
8 6 F 7 G
9 7 G 6 F
My expected output is
id field_a sibling_id sibling_field_a family_id
0 1 A 2 B 0
1 1 A 3 C 0
2 2 B 1 A 0
3 2 B 3 C 0
4 3 C 1 A 0
5 3 C 2 B 0
6 4 D <NA> NaN 1
7 5 E <NA> NaN 2
8 6 F 7 G 3
9 7 G 6 F 3
Thank you for the help in advance.
This is a graph problem, it cannot be solved easily with pandas only if you have complex relationships (e.g. half-sibling of half-siblings).
You can use networkx
's connected_components
after converting the DataFrame to graph:
# pip install networkx
import networkx as nx
# convert to graph
G = nx.from_pandas_edgelist(df,
source='field_a',
target='sibling_field_a')
# ensure NaNs are not a node
G.remove_nodes_from([np.nan])
# compute the groups/families
groups = {n: i for i, c in enumerate(nx.connected_components(G)) for n in c}
# map the groups to the ids
df['family_id'] = df['field_a'].map(groups)
Output:
id field_a sibling_id sibling_field_a family_id
0 1 A 2 B 0
1 1 A 3 C 0
2 2 B 1 A 0
3 2 B 3 C 0
4 3 C 1 A 0
5 3 C 2 B 0
6 4 D <NA> NaN 1
7 5 E <NA> NaN 2
8 6 F 7 G 3
9 7 G 6 F 3
Graph: