python glob pathlib

Is the ordering of pathlib's `glob` method consistent between runs?


Will `Path('.').glob('*.ext')` produce results in a consistent order (assuming the files being globbed don't change)?

It seems the glob ordering is based on the file system order (at least, for the older `glob` module). Will pathlib's glob order change if files are added to the directory (files that do not match the pattern)? Will the file system change this order even if nothing is added to the specific directory (e.g., when other large file changes are made elsewhere on the system)? Over the course of several days? Or will the ordering remain consistent in all these cases?

Just to clarify, I can't simply convert to a list and sort, as there are too many file paths to fit into memory simultaneously. I'm hoping to get the same order each time because I will be doing some ML training and want to set aside every nth file as validation data. This training will take several days, which is why I want to know whether the order remains stable over long periods on the file system.


Solution

  • Checking the source code of the pathlib module, the latest commit happens to point directly at the relevant place:

    Use os.scandir() as context manager in Path.glob().

    So, under the hood, Path.glob uses os.scandir to get the directory entries. The documentation for that function states that the results are unordered:

    Return an iterator of os.DirEntry objects corresponding to the entries in the directory given by path. The entries are yielded in arbitrary order, and the special entries '.' and '..' are not included.

    (emphasis mine, on "arbitrary order")
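Since the order is arbitrary, one possible workaround is to make the train/validation decision depend on each file's name rather than on its position in the listing. The following is only a sketch under that assumption, not something taken from pathlib itself: the helper names `is_validation` and `split`, the `every_nth` parameter, and the choice of MD5 as a stable hash are all hypothetical. It selects roughly (not exactly) one file in every n, and it never builds a list of all paths.

```python
import hashlib
from pathlib import Path


def is_validation(path: Path, every_nth: int = 10) -> bool:
    """Return True for roughly one in every_nth files, decided by name only."""
    # Hash the file name instead of using the position in the listing, so the
    # result does not depend on the order in which glob yields entries.
    # hashlib is used because the built-in hash() of str varies between runs.
    digest = hashlib.md5(path.name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % every_nth == 0


def split(directory: Path, pattern: str = "*.ext", every_nth: int = 10):
    """Lazily yield (path, is_validation) pairs without building a list."""
    for path in directory.glob(pattern):
        yield path, is_validation(path, every_nth)
```

Because the decision uses only the file name, a given file lands in the same subset on every run, even if the directory listing comes back in a different order or unrelated files are added. Two files with the same name in different directories would be assigned to the same subset; hashing a relative path instead would avoid that.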