pythondepth-first-searchbreadth-first-search

How can I efficiently parallelize and optimize a large-scale graph traversal algorithm in Python to handle millions of nodes?


I'm working on Python project that involves processing a very large graph - it has millions of nodes and edges. The goal is to perform a breadth-first search (BFS) or depth-first search (DFS) from a given start node to compute shortest paths or reachability.

Here's the challenge :

Currently, I have a basic BFS implementation using a Python dictionary for adjacency, but it runs slowly and hits memory limits:

from collections import deque

def bfs(graph, start):
    visited = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node not in visited:
            visited.add(node)
            queue.extend(graph.get(node, []))
    return visited

Question

  1. How can I parallelize BFS/DFS in Python to speed up traversal? Should I use multiprocessing, concurrent.features, or another approach?

I'm open to suggestions involving alternative algorithms (like bidirectional BFS), external databases, or memory-mapped files.

I’ve tried the following:

I expected to be able to:


Solution

  • You can use combination of "pyspark + graphframes" to achieve this.

    Sample Code

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame
    
    # Create Spark session
    spark = SparkSession.builder \
        .appName("BFS") \
        .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.0-s_2.12") \
        .getOrCreate()
    
    # Define vertices and edges as DataFrames
    vertices = spark.createDataFrame([
        ("a", "Alice"),
        ("b", "Bob"),
        ("c", "Charlie"),
        ("d", "David"),
        ("e", "Esther")
    ], ["id", "name"])
    
    edges = spark.createDataFrame([
        ("a", "b"),
        ("b", "c"),
        ("c", "d"),
        ("d", "e")
    ], ["src", "dst"])
    
    # Create GraphFrame
    g = GraphFrame(vertices, edges)
    
    # Run BFS from "a" to "e"
    results = g.bfs(fromExpr="id = 'a'", toExpr="id = 'e'")
    results.show(truncate=False)