I'm working on a Python project that involves processing a very large graph: millions of nodes and edges. The goal is to perform a breadth-first search (BFS) or depth-first search (DFS) from a given start node to compute shortest paths or reachability.
Here's the challenge: currently I have a basic BFS implementation using a Python dictionary for adjacency, but it runs slowly and hits memory limits:
```python
from collections import deque

def bfs(graph, start):
    visited = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node not in visited:
            visited.add(node)
            queue.extend(graph.get(node, []))
    return visited
```
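One incidental issue with the version above: a node can be enqueued many times before it is finally marked visited, which inflates the queue on dense graphs. A sketch of the same BFS that marks nodes at enqueue time instead, and also records hop distances (since my goal includes shortest paths); `graph` is the same dict-of-lists adjacency as before:

```python
from collections import deque

def bfs_distances(graph, start):
    """BFS that marks nodes when enqueued rather than when dequeued,
    so each node enters the queue at most once; returns hop counts."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in dist:  # the visited check happens before enqueue
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

# Tiny example: a -> b, b -> c and b -> d
g = {"a": ["b"], "b": ["c", "d"], "c": [], "d": []}
print(bfs_distances(g, "a"))  # {'a': 0, 'b': 1, 'c': 2, 'd': 2}
```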
Question
How can I make this traversal fast and memory-efficient? Should I use `multiprocessing`, `concurrent.futures`, or another approach? I'm open to suggestions involving alternative algorithms (like bidirectional BFS), external databases, or memory-mapped files.
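For single-pair shortest paths, the bidirectional BFS I mention can shrink the explored frontier dramatically: search forward from the source and backward from the target, always expanding the smaller frontier, and stop when the two settled sets meet. A minimal sketch, assuming an undirected graph stored as a dict of adjacency lists (a directed graph would need a reversed adjacency for the backward side):

```python
def bidirectional_bfs(graph, source, target):
    """Shortest-path length between source and target, or -1 if disconnected.
    Expands the smaller of the two frontiers level by level."""
    if source == target:
        return 0
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = {source}, {target}
    while frontier_s and frontier_t:
        # Always grow the smaller side; swapping keeps the code symmetric.
        if len(frontier_s) > len(frontier_t):
            frontier_s, frontier_t = frontier_t, frontier_s
            dist_s, dist_t = dist_t, dist_s
        best = None
        next_frontier = set()
        for node in frontier_s:
            for nbr in graph.get(node, []):
                if nbr in dist_t:  # frontiers met; keep the best meeting point
                    cand = dist_s[node] + 1 + dist_t[nbr]
                    if best is None or cand < best:
                        best = cand
                elif nbr not in dist_s:
                    dist_s[nbr] = dist_s[node] + 1
                    next_frontier.add(nbr)
        if best is not None:
            return best
        frontier_s = next_frontier
    return -1

# Chain a-b-c-d-e: distance from a to e is 4
g = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(bidirectional_bfs(g, "a", "e"))  # 4
```

Finishing the whole level before returning (rather than returning on the first meeting) matters: different nodes in the same frontier can meet the other side at different depths, and the minimum over the level is the true shortest path.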
I’ve tried the following:

- A basic BFS using Python’s built-in `set`, `deque`, and `dict`, as shown in the code.
- Storing the entire adjacency list in memory using a dictionary of lists, but this doesn’t scale well for millions of nodes.
- Using the `multiprocessing` module to run multiple BFS instances in parallel from different starting points, but coordinating shared state and combining results without race conditions became complex.
- Looking into NetworkX, but I found it quite slow and memory-heavy for very large graphs.
- Reading adjacency data from a file line by line and processing it lazily, but the traversal logic became messy and error-prone.
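Regarding the dict-of-lists scaling problem: most of that memory goes to per-object overhead (hashed keys, list headers, boxed ints). One common remedy, sketched here with only the standard-library `array` module, is to relabel nodes as integers and pack the adjacency into two flat arrays in CSR (compressed sparse row) form; the function names and the 64-bit element width are my own choices:

```python
from array import array
from collections import deque

def build_csr(num_nodes, edge_list):
    """Pack directed integer edges (u, v) into CSR form:
    indices[indptr[u]:indptr[u+1]] are the neighbors of u."""
    counts = [0] * num_nodes
    for u, _ in edge_list:
        counts[u] += 1
    indptr = array("q", [0] * (num_nodes + 1))
    for u in range(num_nodes):
        indptr[u + 1] = indptr[u] + counts[u]
    indices = array("q", [0] * len(edge_list))
    cursor = list(indptr[:num_nodes])  # next write position per node
    for u, v in edge_list:
        indices[cursor[u]] = v
        cursor[u] += 1
    return indptr, indices

def bfs_csr(indptr, indices, start):
    """BFS over the CSR arrays; returns a distance list (-1 = unreachable)."""
    dist = [-1] * (len(indptr) - 1)
    dist[start] = 0
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for i in range(indptr[u], indptr[u + 1]):
            v = indices[i]
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

edges = [(0, 1), (1, 2), (2, 3), (1, 3)]
indptr, indices = build_csr(4, edges)
print(bfs_csr(indptr, indices, 0))  # [0, 1, 2, 2]
```

The two arrays cost roughly 8 bytes per edge plus 8 per node, versus hundreds of bytes per node for a dict of lists of Python ints, and the layout drops straight into `numpy` or a C extension later if needed.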
I expected to be able to:

- Traverse the graph quickly (within a few seconds) even with millions of nodes.
- Use multicore processing to speed up traversal.
- Efficiently manage memory without crashing due to RAM limitations.
- Ideally, stream data or keep only parts of the graph in memory at any given time.
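The streaming goal pairs naturally with the memory-mapped-file idea: if the adjacency is written to disk as fixed-width binary CSR arrays, `mmap` lets the OS page in only the regions the traversal actually touches. A minimal sketch using stdlib `mmap` and `struct`; the file layout (a node count followed by the `indptr` and `indices` arrays as little-endian 64-bit ints) is my own assumption, not any standard format:

```python
import mmap
import os
import struct
import tempfile
from collections import deque

WORD = struct.Struct("<q")  # one little-endian 64-bit int per entry

def write_csr_file(path, indptr, indices):
    """Write [num_nodes][indptr...][indices...] as flat 64-bit ints."""
    with open(path, "wb") as f:
        f.write(WORD.pack(len(indptr) - 1))
        for x in indptr:
            f.write(WORD.pack(x))
        for x in indices:
            f.write(WORD.pack(x))

def bfs_mmap(path, start):
    """BFS reading neighbor lists straight out of the memory-mapped file."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (n,) = WORD.unpack_from(mm, 0)
        indptr_off = WORD.size                    # indptr follows the count
        indices_off = indptr_off + (n + 1) * WORD.size

        def ptr(u):
            return WORD.unpack_from(mm, indptr_off + u * WORD.size)[0]

        dist = [-1] * n
        dist[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for i in range(ptr(u), ptr(u + 1)):
                (v,) = WORD.unpack_from(mm, indices_off + i * WORD.size)
                if dist[v] == -1:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

# Tiny demo: chain 0 -> 1 -> 2
fd, path = tempfile.mkstemp(suffix=".csr")
os.close(fd)
write_csr_file(path, [0, 1, 2, 2], [1, 2])
print(bfs_mmap(path, 0))  # [0, 1, 2]
```

Only the pages holding visited nodes' neighbor ranges are faulted in, so resident memory tracks the traversal's working set rather than the whole graph.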
I know Python isn’t the fastest language for this, but I’d like to push it as far as reasonably possible before considering rewriting in C++ or Rust.
You can use a combination of PySpark and GraphFrames to achieve this.
Sample Code
```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Create Spark session
spark = SparkSession.builder \
    .appName("BFS") \
    .config("spark.jars.packages",
            "graphframes:graphframes:0.8.2-spark3.0-s_2.12") \
    .getOrCreate()

# Define vertices and edges as DataFrames
vertices = spark.createDataFrame([
    ("a", "Alice"),
    ("b", "Bob"),
    ("c", "Charlie"),
    ("d", "David"),
    ("e", "Esther"),
], ["id", "name"])

edges = spark.createDataFrame([
    ("a", "b"),
    ("b", "c"),
    ("c", "d"),
    ("d", "e"),
], ["src", "dst"])

# Create GraphFrame
g = GraphFrame(vertices, edges)

# Run BFS from "a" to "e"
results = g.bfs(fromExpr="id = 'a'", toExpr="id = 'e'")
results.show(truncate=False)
```