python pandas dataframe networkx graph-theory

Adjacency matrix not square error from square dataframe with networkx


I have code that builds a graph from an adjacency matrix derived from a table relating workers to their managers. The source is a table with two columns (Worker, manager). It works perfectly on a small mock data set, but fails unexpectedly with the real data:

import pandas as pd
import networkx as nx

# Read input
df = pd.read_csv("org.csv")

# Create the input adjacency matrix
am = pd.DataFrame(0, columns=df["Worker"], index=df["Worker"])
# This way, it is impossible that the dataframe is not square,
# or that index and columns don't match

# Fill the matrix
for ix, row in df.iterrows():
    am.at[row["manager"], row["Worker"]] = 1

# At this point, am.shape returns a square dataframe (2825,2825)
# Generate the graph
G = nx.from_pandas_adjacency(am, create_using=nx.DiGraph)

This raises: NetworkXError: Adjacency matrix not square: nx,ny=(2825, 2829)

And indeed, the dimensions reported in the error are not the same as those of the input dataframe am.

Does anyone have an idea of what happens in from_pandas_adjacency that could lead to this mismatch?


Solution

  • In:

    am = pd.DataFrame(0, columns=df["Worker"], index=df["Worker"])
    # This way, it is impossible that the dataframe is not square,
    

    your DataFrame is indeed square right after it is created, but when you later assign values in the loop, any manager that is not in "Worker" has no matching row label, so the assignment creates a new row for it:

    am.at[row["manager"], row["Worker"]]
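
The row creation described above can be reproduced on a toy table. This is a minimal sketch: the three-row frame is made-up data, not the asker's org.csv, and the set difference at the end is just one possible way to list managers that never appear as workers.

```python
import pandas as pd

# Made-up data: managers A and B never appear in the "Worker" column
df = pd.DataFrame({"manager": ["A", "B", "A"], "Worker": ["D", "E", "F"]})

# Square 3x3 matrix indexed by workers only
am = pd.DataFrame(0, columns=df["Worker"], index=df["Worker"])
shape_before = am.shape

# .at adds a new row whenever the row label does not exist yet
for _, row in df.iterrows():
    am.at[row["manager"], row["Worker"]] = 1
shape_after = am.shape

# Managers missing from the "Worker" column; each one becomes an extra row
missing = sorted(set(df["manager"]) - set(df["Worker"]))

print(shape_before, shape_after, missing)
```

Running the same set difference on the real data should show which managers are responsible for the extra rows.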
    

    It is better to avoid the loop: build the matrix with a crosstab, then reindex both axes on the whole set of nodes (managers and workers):

    am = pd.crosstab(df['manager'], df['Worker'])
    nodes = am.index.union(am.columns)
    am = am.reindex(index=nodes, columns=nodes, fill_value=0)
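
As a quick check of the crosstab approach, the sketch below rebuilds the matrix from made-up data (not the asker's file) and confirms that it is square and accepted by nx.from_pandas_adjacency.

```python
import networkx as nx
import pandas as pd

# Made-up data standing in for the real table
df = pd.DataFrame({"manager": ["A", "B", "A"], "Worker": ["D", "E", "F"]})

# Rows are managers, columns are workers; cells count manager -> Worker pairs
am = pd.crosstab(df["manager"], df["Worker"])

# Every node must appear on both axes for the matrix to be square
nodes = am.index.union(am.columns)
am = am.reindex(index=nodes, columns=nodes, fill_value=0)

G = nx.from_pandas_adjacency(am, create_using=nx.DiGraph)
edges = sorted(G.edges())
print(am.shape, edges)
```

Note that crosstab counts occurrences, so a (manager, Worker) pair that appears more than once in the table gives a cell value above 1, which ends up as the edge weight.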
    

    Even better, if you don't really need the adjacency matrix, directly create the graph with nx.from_pandas_edgelist:

    G = nx.from_pandas_edgelist(df, source='manager', target='Worker',
                                create_using=nx.DiGraph)
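
A small sketch of the edge-list route, again on made-up data: managers that never appear as workers still become nodes, because nodes are taken from both columns.

```python
import networkx as nx
import pandas as pd

# Made-up data standing in for the real table
df = pd.DataFrame({"manager": ["A", "B", "A"], "Worker": ["D", "E", "F"]})

# Each row becomes one directed edge manager -> Worker
G = nx.from_pandas_edgelist(df, source="manager", target="Worker",
                            create_using=nx.DiGraph)

nodes = sorted(G.nodes())
edges = sorted(G.edges())
print(nodes, edges)
```

If the matrix is needed afterwards, nx.to_pandas_adjacency(G) can derive it from the graph.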
    

    Example:

    # input
    df = pd.DataFrame({'manager': ['A', 'B', 'A'], 'Worker': ['D', 'E', 'F']})
    
    # adjacency matrix (crosstab + reindex)
       A  B  D  E  F
    A  0  0  1  0  1
    B  0  0  0  1  0
    D  0  0  0  0  0
    E  0  0  0  0  0
    F  0  0  0  0  0
    
    # adjacency matrix with your code
    Worker    D    E    F
    Worker               
    D       0.0  0.0  0.0
    E       0.0  0.0  0.0
    F       0.0  0.0  0.0
    A       1.0  NaN  1.0  # those rows are created 
    B       NaN  1.0  NaN  # after initializing am
    

    Graph: [image]