python pandas windows unc

In Python, how do I iterate a long path name through glob and read in all files in Windows?


I'm trying to write a Python script where I use glob to get a list of file paths, and then read them through pd.read_csv() to make one big data frame. This script works great on a Mac. However, I was trying to help a colleague run this in Jupyter Notebook on their PC, and it kept failing.

The Mac script is simple. It's just:

import glob

import pandas as pd

df = pd.DataFrame()

subfolders = ['sub1', 'sub2', 'sub3']

for sub in subfolders:
    for filePath in glob.glob('/server/folder1/folder2/'+sub+'/*/*/*/*/*/*/lastfolder/*output.txt'):
        df2 = pd.read_csv(filePath, sep='\t')
        df = pd.concat([df, df2])

They've run similar scripts in the past with no problem. However, the path is now longer than 260 characters, which I think is causing the problem.

For long paths, Windows adds a prefix to the front (...for reasons?), like: \\?\UNC\server\share\directory...

When I tried to read in a path with that additional "\\?\UNC" part, it broke things. Normally I tell them just to change the backslashes to forward slashes, but that doesn't work here (I'm sure there's a better way to do this, but this really isn't a big part of my job--I'm just trying to get things to run more efficiently).

I don't really know how to deal with Windows, so every time I try to help them it's an adventure. I tried to ask ChatGPT, and it led me to this solution:

# Define the UNC path with the long path prefix
long_unc_path = r'\\?\UNC\server\folder1\folder2\'


# Convert the long UNC path to a Path object
new_path = Path(long_unc_path)

df = pd.DataFrame()

subfolders = ['sub1', 'sub2', 'sub3']

for sub in subfolders:
        for filePath in glob.glob(new_path+sub+'/*/*/*/*/*/*/lastfolder/*output.txt'):
                df2 = pd.read_csv(filePath, sep='\t')
                df = pd.concat([df, df2])

But the first line for long_unc_path gave me an unterminated string literal error. I'm sure it's a simple solution, but I can't figure out what it is. We're using Python 3.11.

Could anyone help? Thank you!


Solution

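  • Drop the trailing '\' from long_unc_path (a raw string literal cannot end with a single backslash, because the backslash still escapes the closing quote), and build the file pattern with os.path.join instead of concatenating strings, something like this: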

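    import glob
    import os
    from pathlib import Path

    import pandas as pd

    # Define the UNC path with the long path prefix.
    # No trailing backslash: a raw string literal cannot end with a single '\'.
    long_unc_path = r'\\?\UNC\server\folder1\folder2'

    # Convert the long UNC path to a Path object
    new_path = Path(long_unc_path)

    subfolders = ['sub1', 'sub2', 'sub3']

    # Path components below each subfolder: six wildcard levels, then the target files
    pattern_parts = ['*'] * 6 + ['lastfolder', '*output.txt']

    frames = []
    for sub in subfolders:
        # os.path.join accepts the Path object and inserts the platform's separator
        for filePath in glob.glob(os.path.join(new_path, sub, *pattern_parts)):
            frames.append(pd.read_csv(filePath, sep='\t'))

    # Concatenate once at the end instead of growing df inside the loop
    df = pd.concat(frames) if frames else pd.DataFrame()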
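
  • As for why the original line failed: even in a raw string, a backslash placed directly before the closing quote keeps that quote from ending the literal, so r'...\' never terminates. The sketch below is only an illustration of that rule and of two ways around it. The names parses, bad_source and good_source are made up for this example, the pattern is shortened to a single wildcard level, and ntpath (the Windows flavour of os.path) is used so the snippet behaves the same on a Mac or a PC.

```python
import ntpath  # Windows implementation of os.path, importable on any OS

# The failing and the fixed assignment, held as text so this file itself parses.
bad_source = "long_unc_path = r'\\\\?\\UNC\\server\\folder1\\folder2\\'"
good_source = "long_unc_path = r'\\\\?\\UNC\\server\\folder1\\folder2'"


def parses(source):
    """Return True if `source` compiles as Python, False on a SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(parses(bad_source))   # trailing backslash swallows the closing quote
print(parses(good_source))  # same literal without the trailing backslash

# Option 1: drop the trailing backslash and let join() add the separators.
base = r'\\?\UNC\server\folder1\folder2'
pattern = ntpath.join(base, 'sub1', '*', 'lastfolder', '*output.txt')
print(pattern)

# Option 2: if a trailing separator is really needed, append it as a normal string.
with_sep = base + '\\'
print(with_sep)
```

    On the Windows machine itself, os.path.join does the same job as ntpath.join here. Whether glob then matches files under the \\?\ prefix depends on the Python version and the share, so it is worth checking the result of a single glob.glob call before running the whole loop.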