networkxgraph-tool

NetworkX largest connected component sharing attributes


I know there exist functions for computing the size of the connected components of a graph in NetworkX. You can add attributes to a node. In Axelrod's model for dissemination of culture, an interesting measurement is the size of the largest connected component whose nodes share several attributes. Is there a way of doing that in NetworkX? For example, let's say we have a population represented through a network. Each node has attributes of hair color and skin color. How can I get the size of the largest component of nodes such that in that subgraph each and every node has the same hair and skin color? Thank you


Solution

  • For general data analysis, it's best to use pandas. Use a graph library like networkx or graph-tool to determine the connected components, and then load that info into a DataFrame that you can analyze. In this case, the pandas groupby and nunique (number of unique elements) features will be useful.

    Here's a self-contained example using graph-tool (using this network). You could also compute the connected components via networkx.

    import numpy as np
    import pandas as pd
    import graph_tool.all as gt
    
    # Download an example graph
    # https://networks.skewed.de/net/baseball
    g = gt.collection.ns["baseball", 'user-provider']
    
    # Extract the player names
    names = g.vertex_properties['name'].get_2d_array([0])[0]
    
    # Extract connected component ID for each node
    cc, cc_sizes = gt.label_components(g)
    
    # Load into a DataFrame
    players = pd.DataFrame({
        'id': np.arange(g.num_vertices()),
        'name': names,
        'cc': cc.a
    })
    
    # Create some random attributes
    players['hair'] = np.random.choice(['purple', 'pink'], size=len(players))
    players['skin'] = np.random.choice(['green', 'blue'], size=len(players))
    
    # For the sake of this example, manipulate the data so
    # that some groups are homogenous with respect to some attributes.
    players.loc[players['cc'] == 2, 'hair'] = 'purple'
    players.loc[players['cc'] == 2, 'skin'] = 'blue'
    
    players.loc[players['cc'] == 4, 'hair'] = 'pink'
    players.loc[players['cc'] == 4, 'skin'] = 'green'
    
    # Now determine how many unique hair and skin colors we have in each group.
    group_stats = players.groupby('cc').agg({
        'hair': 'nunique',
        'skin': ['nunique', 'size']
    })
    
    # Simplify the column names
    group_stats.columns = ['hair_colors', 'skin_colors', 'player_count']
    
    # Select homogenous groups, i.e. groups for which only 1 unique
    # hair color is present and 1 unique skin color is present
    homogenous = group_stats.query('hair_colors == 1 and skin_colors == 1')
    
    # Sort from large groups to small groups
    homogenous = homogenous.sort_values('player_count', ascending=False)
    print(homogenous)
    

    That prints the following:

        hair_colors  skin_colors  player_count
    cc
    4             1            1             4
    2             1            1             3