I know there exist functions for computing the size of the connected components of a graph in NetworkX. You can add attributes to a node. In Axelrod's model for dissemination of culture, an interesting measurement is the size of the largest connected component whose nodes share several attributes. Is there a way of doing that in NetworkX? For example, let's say we have a population represented through a network. Each node has attributes of hair color and skin color. How can I get the size of the largest component of nodes such that in that subgraph each and every node has the same hair and skin color? Thank you
For general data analysis, it's best to use pandas
. Use a graph library like networkx
or graph-tool
to determine the connected components, and then load that info into a DataFrame
that you can analyze. In this case, the pandas groupby
and nunique
(number of unique elements) features will be useful.
Here's a self-contained example using graph-tool
(using this network). You could also compute the connected components via networkx
.
import numpy as np
import pandas as pd
import graph_tool.all as gt
# Download an example graph
# https://networks.skewed.de/net/baseball
g = gt.collection.ns["baseball", 'user-provider']
# Extract the player names
names = g.vertex_properties['name'].get_2d_array([0])[0]
# Extract connected component ID for each node
cc, cc_sizes = gt.label_components(g)
# Load into a DataFrame
players = pd.DataFrame({
'id': np.arange(g.num_vertices()),
'name': names,
'cc': cc.a
})
# Create some random attributes
players['hair'] = np.random.choice(['purple', 'pink'], size=len(players))
players['skin'] = np.random.choice(['green', 'blue'], size=len(players))
# For the sake of this example, manipulate the data so
# that some groups are homogenous with respect to some attributes.
players.loc[players['cc'] == 2, 'hair'] = 'purple'
players.loc[players['cc'] == 2, 'skin'] = 'blue'
players.loc[players['cc'] == 4, 'hair'] = 'pink'
players.loc[players['cc'] == 4, 'skin'] = 'green'
# Now determine how many unique hair and skin colors we have in each group.
group_stats = players.groupby('cc').agg({
'hair': 'nunique',
'skin': ['nunique', 'size']
})
# Simplify the column names
group_stats.columns = ['hair_colors', 'skin_colors', 'player_count']
# Select homogenous groups, i.e. groups for which only 1 unique
# hair color is present and 1 unique skin color is present
homogenous = group_stats.query('hair_colors == 1 and skin_colors == 1')
# Sort from large groups to small groups
homogenous = homogenous.sort_values('player_count', ascending=False)
print(homogenous)
That prints the following:
hair_colors skin_colors player_count
cc
4 1 1 4
2 1 1 3