python-3.xpandas

Creating a link id between two related columns


I have a file with some family relationship data and I would like to create a family id column based on id and sibling_id. My data looks like the following:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 3, 4, 5, 6, 7],
    'field_a': list('AABBCCDEFG'),
    'sibling_id': [2, 3, 1, 3, 1, 2, np.nan, np.nan, 7, 6],
    'sibling_field_a': ['B', 'C', 'A', 'C' , 'A', 'B', np.nan, np.nan, 'G', 'F']
})

df['sibling_id'] = df['sibling_id'].astype('Int64')

   id field_a  sibling_id sibling_field_a
0   1       A           2               B
1   1       A           3               C
2   2       B           1               A
3   2       B           3               C
4   3       C           1               A
5   3       C           2               B
6   4       D        <NA>             NaN
7   5       E        <NA>             NaN
8   6       F           7               G
9   7       G           6               F

My expected output is

   id field_a  sibling_id sibling_field_a  family_id
0   1       A           2               B          0
1   1       A           3               C          0
2   2       B           1               A          0
3   2       B           3               C          0
4   3       C           1               A          0
5   3       C           2               B          0
6   4       D        <NA>             NaN          1
7   5       E        <NA>             NaN          2
8   6       F           7               G          3
9   7       G           6               F          3

Thank you for the help in advance.


Solution

  • This is a graph problem, it cannot be solved easily with pandas only if you have complex relationships (e.g. half-sibling of half-siblings).

    You can use networkx's connected_components after converting the DataFrame to graph:

    # pip install networkx
    import networkx as nx
    
    # convert to graph
    G = nx.from_pandas_edgelist(df,
                                source='field_a',
                                target='sibling_field_a')
    # ensure NaNs are not a node
    G.remove_nodes_from([np.nan])
    
    # compute the groups/families
    groups = {n: i for i, c in enumerate(nx.connected_components(G)) for n in c}
    
    # map the groups to the ids
    df['family_id'] = df['field_a'].map(groups)
    

    Output:

       id field_a  sibling_id sibling_field_a  family_id
    0   1       A           2               B          0
    1   1       A           3               C          0
    2   2       B           1               A          0
    3   2       B           3               C          0
    4   3       C           1               A          0
    5   3       C           2               B          0
    6   4       D        <NA>             NaN          1
    7   5       E        <NA>             NaN          2
    8   6       F           7               G          3
    9   7       G           6               F          3
    

    Graph:

    pandas form families connected components