I'm working on a Python project where I need to search for specific files across a very large directory structure. Currently, I'm using the glob (or rglob) method from the pathlib module, but it is quite slow due to the extensive number of files and directories.
Here's a simplified version of my current code:
from pathlib import Path
base_dir = Path("/path/to/large/directory")
files = list(base_dir.rglob("ind_stat.zpkl"))
This works, but it's too slow because it has to traverse through a massive number of directories and files. Ideally, I'd like to divide the directory traversal work across multiple threads or processes to improve performance. Are there optimizations or alternative libraries/methods that could help improve the performance?
Empirically, on my Mac, in a directory tree that has 73,429 files and 3,315 directories (and 3 files named Snap.wav):
~/Samples $ pwd
/Users/akx/Samples
~/Samples $ find . -type f | wc -l
73429
~/Samples $ find . -type d | wc -l
3315
a trivial os.walk()-based implementation is about 1.8x as fast as rglob:
import os
import pathlib
base_path = pathlib.Path("/Users/akx/Samples")
file_to_find = "Snap.wav"
def find_snaps_rglob():
    paths = []
    for path in base_path.rglob(file_to_find):
        paths.append(path)
    return paths
def find_snaps_walk():
    paths = []
    for dp, dn, fn in os.walk(base_path):
        for f in fn:
            if f == file_to_find:
                paths.append(pathlib.Path(os.path.join(dp, f)))
    return paths
assert sorted(find_snaps_rglob()) == sorted(find_snaps_walk())
name='find_snaps_rglob' iters=10 time=1.622 iters_per_sec=6.17
name='find_snaps_walk' iters=20 time=1.740 iters_per_sec=11.49
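For reference, timing lines in that format could be produced by a small harness along these lines. This is only a sketch: the helper name bench and the min_time parameter are assumptions for illustration, not the harness actually used for the numbers above.

```python
import time


def bench(func, min_time=1.5):
    """Call func repeatedly for at least min_time seconds and print throughput.

    Hypothetical helper: the name and the stopping rule are illustrative.
    """
    iters = 0
    start = time.perf_counter()
    while True:
        func()
        iters += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_time:
            break
    iters_per_sec = iters / elapsed
    print(
        f"name={func.__name__!r} iters={iters} "
        f"time={elapsed:.3f} iters_per_sec={iters_per_sec:.2f}"
    )
    return iters, elapsed
```

Calling bench(find_snaps_rglob) and bench(find_snaps_walk) back to back on a warm filesystem cache keeps the comparison reasonably fair.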
Implementing the walking with a stack and os.scandir() makes things a tiny bit faster still (iters_per_sec=12.98), but this likely doesn't take e.g. symlinks and errors into account like walk or rglob might.
def find_snaps_walk_manually():
    paths = []
    stack = [base_path]
    while stack:
        current = stack.pop()
        for child in os.scandir(current):
            if child.is_dir():
                stack.append(child)
            elif child.name == file_to_find:
                paths.append(pathlib.Path(child.path))
    return paths
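If symlinks and unreadable directories do matter, a somewhat more defensive variant of the manual walk might look like the sketch below. It mirrors the os.walk() defaults (symlinked directories are not descended into, and directories that cannot be listed are skipped silently). The function name find_files_scandir is made up for this example, and this variant was not part of the benchmark above.

```python
import os
import pathlib


def find_files_scandir(base_path, file_to_find):
    """Iterative os.scandir() walk; hypothetical, unbenchmarked variant."""
    paths = []
    stack = [os.fspath(base_path)]
    while stack:
        current = stack.pop()
        try:
            # The context manager closes the directory handle promptly.
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir():
                            # Do not descend into symlinked directories,
                            # which avoids symlink loops.
                            if not entry.is_symlink():
                                stack.append(entry.path)
                        elif entry.name == file_to_find:
                            paths.append(pathlib.Path(entry.path))
                    except OSError:
                        # Entry vanished or cannot be stat'ed; skip it.
                        continue
        except OSError:
            # Directory cannot be listed (permissions, removed, ...); skip it.
            continue
    return paths
```

For example, find_files_scandir(base_path, file_to_find) should return the same paths as find_snaps_walk() on a tree without symlinked directories.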
I also tried wrapping the Rust walkdir and jwalk crates with PyO3; they weren't much faster than find_snaps_walk_manually (about 13.14 and 13.34 iterations per second).
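Since the question asks about splitting the traversal across threads, here is a minimal sketch of one way to do that: list the top-level directory once, then run os.walk() on each immediate subdirectory in a thread pool. The names find_files_threaded and _walk_subtree and the max_workers default are assumptions for illustration. This was not timed here, and given the numbers above it may well not be faster, especially when the directory metadata is already cached.

```python
import os
import pathlib
from concurrent.futures import ThreadPoolExecutor


def _walk_subtree(root, file_to_find):
    """Return every path under root whose file name equals file_to_find."""
    return [
        pathlib.Path(dirpath, file_to_find)
        for dirpath, _dirnames, filenames in os.walk(root)
        if file_to_find in filenames
    ]


def find_files_threaded(base_path, file_to_find, max_workers=8):
    """Hypothetical helper: one os.walk() task per top-level subdirectory."""
    matches = []
    subdirs = []
    with os.scandir(base_path) as entries:
        for entry in entries:
            if entry.is_dir():
                # Like os.walk(), do not descend into symlinked directories.
                if not entry.is_symlink():
                    subdirs.append(entry.path)
            elif entry.name == file_to_find:
                matches.append(pathlib.Path(entry.path))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for found in pool.map(lambda d: _walk_subtree(d, file_to_find), subdirs):
            matches.extend(found)
    return matches
```

The work is only as balanced as the top-level subdirectories are, so a tree with one huge subdirectory would still be walked mostly by a single thread.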