[SOLVED] Creating a link id between two related columns

Creating a link id between two related columns

I have a file with some family relationship data and I would like to create a family id column based on id and sibling_id. My data looks like the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3, 4, 5, 6, 7],
    'field_a': list('AABBCCDEFG'),
    'sibling_id': [2, 3, 1, 3, 1, 2, np.nan, np.nan, 7, 6],
    'sibling_field_a': ['B', 'C', 'A', 'C' , 'A', 'B', np.nan, np.nan, 'G', 'F']
})

df['sibling_id'] = df['sibling_id'].astype('Int64')

   id field_a  sibling_id sibling_field_a
0   1       A           2               B
1   1       A           3               C
2   2       B           1               A
3   2       B           3               C
4   3       C           1               A
5   3       C           2               B
6   4       D        <NA>             NaN
7   5       E        <NA>             NaN
8   6       F           7               G
9   7       G           6               F

My expected output is

   id field_a  sibling_id sibling_field_a  family_id
0   1       A           2               B          0
1   1       A           3               C          0
2   2       B           1               A          0
3   2       B           3               C          0
4   3       C           1               A          0
5   3       C           2               B          0
6   4       D        <NA>             NaN          1
7   5       E        <NA>             NaN          2
8   6       F           7               G          3
9   7       G           6               F          3

Thank you for the help in advance.

Solution

This is a graph problem, it cannot be solved easily with pandas only if you have complex relationships (e.g. half-sibling of half-siblings).

You can use networkx's connected_components after converting the DataFrame to graph:

# pip install networkx
import networkx as nx

# convert to graph
G = nx.from_pandas_edgelist(df,
                            source='field_a',
                            target='sibling_field_a')
# ensure NaNs are not a node
G.remove_nodes_from([np.nan])

# compute the groups/families
groups = {n: i for i, c in enumerate(nx.connected_components(G)) for n in c}

# map the groups to the ids
df['family_id'] = df['field_a'].map(groups)

Output:

   id field_a  sibling_id sibling_field_a  family_id
0   1       A           2               B          0
1   1       A           3               C          0
2   2       B           1               A          0
3   2       B           3               C          0
4   3       C           1               A          0
5   3       C           2               B          0
6   4       D        <NA>             NaN          1
7   5       E        <NA>             NaN          2
8   6       F           7               G          3
9   7       G           6               F          3

Graph: