python, networkx, graphml

Why is the Girvan–Newman algorithm in NetworkX so slow?


I have approximately 4000 nodes and 6000 edges in my GraphML file, and I have no problem transforming it into a NetworkX DiGraph. However, when I run girvan_newman() from NetworkX, it looks like it freezes: the script has been running for the past 10 hours without finishing (with 10 nodes and edges, it went through in about 5 minutes).

Here's my snippet:

import community as community_louvain
import networkx as nx
from networkx.algorithms.community.centrality import girvan_newman

G = nx.read_graphml('graph.graphml')
partition_girvan_newman = girvan_newman(G)
# materializing the generator runs the algorithm to completion
list(partition_girvan_newman)

My questions are:

  1. Does girvan_newman() from NetworkX only accept undirected graphs?
  2. If Girvan–Newman in NetworkX is indeed able to process this much data, what should I modify to make it faster?

Solution

  • The Girvan–Newman algorithm is computationally very expensive. As mentioned in the NetworkX docs:

    The Girvan–Newman algorithm detects communities by progressively removing edges from the original graph. The algorithm removes the “most valuable” edge, traditionally the edge with the highest betweenness centrality, at each step.

    By looking at the source code, the following is called:

    while g.number_of_edges() > 0:
        yield _without_most_central_edges(g, most_valuable_edge)
    

    Which in turn calls:

    while num_new_components <= original_num_components:
        edge = most_valuable_edge(G)
        G.remove_edge(*edge)
        new_components = tuple(nx.connected_components(G))
        num_new_components = len(new_components)
    return new_components
    

    So at each step the most valuable edge, defined as the edge with the highest betweenness centrality, is removed, and the connected components are recomputed. The total cost is therefore roughly the number of edges times the cost of one betweenness-centrality computation plus one connected-components pass. Betweenness centrality alone takes O(VE) time on unweighted graphs (Brandes' algorithm), and it is recomputed after every edge removal, so the full run is on the order of O(E²V). With 4000 nodes and 6000 edges, that is far beyond what will finish in minutes.
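
    To get a feel for the cost, here's a minimal sketch (assuming a random graph of comparable size can stand in for your GraphML data) that times a single betweenness-centrality pass, the dominant cost of one Girvan–Newman iteration:

    import time
    import networkx as nx

    # stand-in for the real graph: ~4000 nodes, ~6000 edges (assumption)
    G = nx.gnm_random_graph(4000, 6000, seed=42)

    start = time.perf_counter()
    nx.edge_betweenness_centrality(G)  # recomputed after every edge removal
    elapsed = time.perf_counter() - start

    # a full run repeats this up to ~6000 times (once per removed edge)
    print(f"one pass: {elapsed:.1f}s; naive full-run estimate: "
          f"~{elapsed * G.number_of_edges() / 3600:.1f} h")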

    The docs mention some approaches to slice the returned generator and keep the first k tuples of communities. Here's one if you want to run the algorithm up to the kth iteration:

    from itertools import islice, takewhile

    G = nx.fast_gnp_random_graph(10, 0.2)
    k = 2
    comp = girvan_newman(G)
    for communities in islice(comp, k):
        print(tuple(sorted(c) for c in communities))

    ([0, 3, 4, 8], [1, 5], [2], [6, 7, 9])
    ([0, 3], [1, 5], [2], [4, 8], [6, 7, 9])
    

    Or one using itertools.takewhile to take tuples until the number of communities surpasses some threshold. This is an interesting approach, since it lets you impose the number of clusters you want. For example:

    G = nx.fast_gnp_random_graph(10, 0.3)
    k = 4
    comp = girvan_newman(G)
    limited = takewhile(lambda c: len(c) <= k, comp)
    for communities in limited:
        print(tuple(sorted(c) for c in communities)) 
    
    ([0, 1, 2, 3, 4, 5, 6, 7, 8], [9])
    ([0, 2, 4, 7, 8], [1, 3, 5, 6], [9])
    ([0, 2, 4, 7, 8], [1], [3, 5, 6], [9])
    
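    Note that takewhile still computes the first partition that exceeds the threshold (it must, in order to test the predicate), but it stops there, so none of the more expensive later iterations ever run.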

    Answering your first question: you'll see in the source code that the graph is copied to an undirected graph, g = G.copy().to_undirected(), so yes, it is only intended to work with undirected graphs.
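
    As a side note on your second question: your snippet already imports python-louvain (import community as community_louvain). Here is a minimal sketch using it instead, assuming python-louvain is installed and that modularity-based communities suit your use case; it typically handles graphs of this size in seconds:

    import community as community_louvain
    import networkx as nx

    G = nx.read_graphml('graph.graphml')
    # Louvain, like Girvan-Newman, expects an undirected graph
    partition = community_louvain.best_partition(G.to_undirected())
    # partition maps each node to its community id
    print(max(partition.values()) + 1, "communities found")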