python, parallel-processing, glob, pathlib

How to efficiently perform parallel file search using pathlib `glob` in Python for large directory structures?


I'm working on a Python project where I need to search for specific files across a very large directory structure. Currently I'm using the `glob` (or `rglob`) method from the pathlib module, but it is quite slow because of the sheer number of files and directories involved.

Here's a simplified version of my current code:

from pathlib import Path

base_dir = Path("/path/to/large/directory")
files = list(base_dir.rglob("ind_stat.zpkl"))

This works, but it's too slow because it has to traverse a massive number of directories and files. Ideally, I'd like to divide the directory traversal work across multiple threads or processes. Are there optimizations or alternative libraries/methods that could help improve performance?


Solution

  • Empirically, on my Mac, in a directory tree that has 73,429 files and 3,315 directories (and 3 files named Snap.wav):

    ~/Samples $ pwd
    /Users/akx/Samples
    ~/Samples $ find . -type f | wc -l
       73429
    ~/Samples $ find . -type d | wc -l
        3315
    

    a trivial os.walk()-based implementation is roughly 1.8x as fast as rglob:

    import os
    import pathlib
    
    base_path = pathlib.Path("/Users/akx/Samples")
    file_to_find = "Snap.wav"
    
    
    def find_snaps_rglob():
        paths = []
        for path in base_path.rglob(file_to_find):
            paths.append(path)
        return paths
    
    
    def find_snaps_walk():
        paths = []
        for dp, dn, fn in os.walk(base_path):  # dirpath, dirnames, filenames
            for f in fn:
                if f == file_to_find:
                    paths.append(pathlib.Path(os.path.join(dp, f)))
        return paths
    
    
    assert sorted(find_snaps_rglob()) == sorted(find_snaps_walk())
    
    name='find_snaps_rglob' iters=10 time=1.622 iters_per_sec=6.17
    name='find_snaps_walk' iters=20 time=1.740 iters_per_sec=11.49
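
    The harness that produced these numbers isn't shown; a minimal sketch that yields output in the same shape (assuming a simple fixed time budget per function; the exact budget and warm-up are guesses) could look like:

    import time


    def bench(fn, min_time=1.5):
        # Call fn repeatedly until at least min_time seconds have elapsed,
        # then report iterations per second in the format used above.
        iters = 0
        start = time.perf_counter()
        elapsed = 0.0
        while elapsed < min_time:
            fn()
            iters += 1
            elapsed = time.perf_counter() - start
        print(f"name={fn.__name__!r} iters={iters} time={elapsed:.3f} "
              f"iters_per_sec={iters / elapsed:.2f}")


    bench(find_snaps_rglob)
    bench(find_snaps_walk)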
    

    Implementing the walk manually with a stack and os.scandir() makes things a tiny bit faster still (iters_per_sec=12.98), but this likely doesn't take e.g. symlinks and errors into account the way walk or rglob might (a guarded variant is sketched after the code below).

    def find_snaps_walk_manually():
        paths = []
        stack = [base_path]
        while stack:
            current = stack.pop()
            for child in os.scandir(current):
                if child.is_dir():
                    stack.append(child)
                elif child.name == file_to_find:
                    paths.append(pathlib.Path(child.path))
        return paths
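
    For reference, a variant of the same walk that skips symlinked directories and ignores unreadable ones (my reading of "taking symlinks and errors into account"; this is only a sketch, not what walk or rglob actually do) could look like:

    def find_snaps_walk_guarded():
        paths = []
        stack = [base_path]
        while stack:
            current = stack.pop()
            try:
                entries = list(os.scandir(current))
            except OSError:
                continue  # e.g. permission denied; skip this directory
            for child in entries:
                # follow_symlinks=False avoids descending into symlinked
                # directories (and thus potential cycles).
                if child.is_dir(follow_symlinks=False):
                    stack.append(child.path)
                elif child.name == file_to_find:
                    paths.append(pathlib.Path(child.path))
        return paths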
    

    I also took a shot at wrapping the Rust walkdir and jwalk crates with PyO3; they weren't much faster than find_snaps_walk_manually (about 13.14 and 13.34 iterations per second, respectively).
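
    Not benchmarked here, but a minimal sketch of the question's parallelisation idea (one worker process per top-level subdirectory, which assumes the tree splits reasonably evenly at that level; it reuses base_path and file_to_find from above) could look like:

    from concurrent.futures import ProcessPoolExecutor


    def find_in_subtree(root):
        # The same stack-based scandir walk as above, limited to one subtree.
        # Returns plain strings so results pickle cheaply between processes.
        paths = []
        stack = [root]
        while stack:
            current = stack.pop()
            try:
                entries = list(os.scandir(current))
            except OSError:
                continue  # skip unreadable directories
            for child in entries:
                if child.is_dir(follow_symlinks=False):
                    stack.append(child.path)
                elif child.name == file_to_find:
                    paths.append(child.path)
        return paths


    def find_snaps_parallel():
        # Hand each top-level subdirectory to a worker process; files sitting
        # directly under base_path are checked inline.
        top_dirs = []
        found = []
        for child in os.scandir(base_path):
            if child.is_dir(follow_symlinks=False):
                top_dirs.append(child.path)
            elif child.name == file_to_find:
                found.append(child.path)
        with ProcessPoolExecutor() as pool:
            for result in pool.map(find_in_subtree, top_dirs):
                found.extend(result)
        return [pathlib.Path(p) for p in found]


    if __name__ == "__main__":
        print(len(find_snaps_parallel()))

    Whether this actually beats the single-process walk will depend on the filesystem and on how evenly the top-level directories split the work, so it's worth measuring with the same harness as above.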