
Extracting file paths occasionally returns special characters (~$) in file names


When extracting file paths with os.walk, a few of the results contain the special characters ~$ at the start of the file name. I want to compare these file paths with another list, and the special characters prevent a proper match.

The current code:

import os

document_list = []  # collected .docx paths

# root is the top-level directory to search (defined elsewhere)
for path, sub_dirs, files in os.walk(root):
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding
        # it to our list
        if os.path.splitext(name)[1] == ".docx":
            document_list.append(os.path.join(path, name))

The majority of results are satisfactory, for example:

X:/Serial Numbers/6200\Test Company\6275 Documents\6275rA_Order_TEST_120221.docx

However, some results contain special characters that do not appear in the actual file name:

X:/Serial Numbers/6200\Test Company\6275 Documents\~$75rA_Order_MERZ_120221.docx

I would prefer a solution that does not rely on a string replace method.


Solution

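  • As has been pointed out in another answer, files beginning with "~$" are most likely Microsoft Office temporary (lock) files, which Word creates alongside a document while it is open. They are real files on disk, so the sensible fix is to skip them rather than to rewrite their names.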

    The pathlib module (often preferred over os.path nowadays) offers a more object-oriented approach to interacting with your filesystem.

    In this case I would suggest using a generator in preference to the current structure.

    Something like this:

    from pathlib import Path
    from collections.abc import Iterable
    
    def genpaths(directory: Path) -> Iterable[Path]:
        ignore = ("~$", ) # add any additional filename prefixes to be ignored here
        for fullpath in directory.rglob("*.docx"):
            if not fullpath.name.startswith(ignore):
                yield fullpath
    
    root = Path("the_root_directory")
    
    for c in genpaths(root):
        print(c) # the files of interest
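
If you would rather keep the os.walk structure from the question, the same prefix check can be applied there. The following is only a sketch: the function name find_docx and the parameter ignore_prefixes are made up for illustration and do not come from the original code.

```python
import os


def find_docx(root, ignore_prefixes=("~$",)):
    """Return paths of .docx files under root, skipping ignored name prefixes.

    find_docx and ignore_prefixes are hypothetical names used for this sketch.
    """
    matches = []
    for path, sub_dirs, files in os.walk(root):
        for name in files:
            # str.startswith accepts a tuple, so several prefixes
            # can be checked in a single call
            if name.startswith(ignore_prefixes):
                continue
            if os.path.splitext(name)[1] == ".docx":
                matches.append(os.path.join(path, name))
    return matches
```

With this in place, document_list = find_docx(root) should give the same list as the loop in the question, minus the "~$" entries. Nothing is replaced inside the names; the temporary files are simply never added.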