I'm trying to write a Python script where I use glob to get a list of file paths, and then read them through pd.read_csv() to make one big data frame. This script works great on a Mac. However, I was trying to help a colleague run this in Jupyter Notebook on their PC, and it kept failing.
The Mac script is simple. It's just:
import glob
import pandas as pd

df = pd.DataFrame()
subfolders = ['sub1', 'sub2', 'sub3']
for sub in subfolders:
    for filePath in glob.glob('/server/folder1/folder2/'+sub+'/*/*/*/*/*/*/lastfolder/*output.txt'):
        df2 = pd.read_csv(filePath, sep='\t')
        df = pd.concat([df, df2])
They've run similar scripts in the past with no problem. However, this path is now longer than 260 characters, which I think is causing the problem.
For long paths, Windows adds a prefix to the front (...for reasons?), like: \\?\UNC\server\share\directory...
When I tried to read in a path with that additional "\\?\UNC" part, it broke things. Normally I tell them just to change the backslashes to forward slashes, but that doesn't work here (I'm sure there's a better way to do this, but this really isn't a big part of my job; I'm just trying to get things to run more efficiently).
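A note on that prefix: `\\?\` is Windows's extended-length path syntax, and a path that carries it is passed through without the usual normalization, which is why forward slashes stop working there. A minimal sketch of adding the prefix to an ordinary absolute path might look like the following. The helper name `to_extended_path` is hypothetical (not part of any library), and it assumes the input is already an absolute drive or UNC path.

```python
def to_extended_path(path: str) -> str:
    """Add the Windows extended-length prefix to an absolute path.

    Hypothetical helper for illustration; assumes `path` is absolute.
    """
    # Prefixed paths are not slash-normalized by Windows, so use backslashes.
    path = path.replace('/', '\\')
    if path.startswith('\\\\?\\'):
        # Already in extended-length form; leave it alone.
        return path
    if path.startswith('\\\\'):
        # UNC share: \\server\share\dir -> \\?\UNC\server\share\dir
        return '\\\\?\\UNC\\' + path[2:]
    # Drive path: C:\dir -> \\?\C:\dir
    return '\\\\?\\' + path


print(to_extended_path('//server/folder1/folder2'))
print(to_extended_path(r'C:\data\file.txt'))
```

This is pure string handling, so it behaves the same on a Mac and on a PC; whether a given library call accepts the prefixed path is a separate question.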
I don't really know how to deal with Windows, so every time I try to help them it's an adventure. I tried to ask ChatGPT, and it led me to this solution:
# Define the UNC path with the long path prefix
long_unc_path = r'\\?\UNC\server\folder1\folder2\'
# Convert the long UNC path to a Path object
new_path = Path(long_unc_path)
df = pd.DataFrame()
subfolders = ['sub1', 'sub2', 'sub3']
for sub in subfolders:
    for filePath in glob.glob(new_path+sub+'/*/*/*/*/*/*/lastfolder/*output.txt'):
        df2 = pd.read_csv(filePath, sep='\t')
        df = pd.concat([df, df2])
But the first line for long_unc_path gave me an unterminated string literal error. I'm sure it's a simple solution, but I can't figure out what it is. We're using Python 3.11.
Could anyone help? Thank you!
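Drop the '\' at the end of long_unc_path: even in a raw string, a backslash right before the closing quote escapes that quote, so the literal never terminates. Then use os.path.join to build the file path instead of concatenating strings, something like this: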
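import glob
import os

import pandas as pd

# Define the UNC path with the long path prefix (no trailing backslash)
long_unc_path = r'\\?\UNC\server\folder1\folder2'
# Wildcard components under each subfolder, one list entry per path level
pattern_parts = ['*'] * 6 + ['lastfolder', '*output.txt']

frames = []
subfolders = ['sub1', 'sub2', 'sub3']
for sub in subfolders:
    # os.path.join inserts the separator the OS expects (backslash on Windows)
    for filePath in glob.glob(os.path.join(long_unc_path, sub, *pattern_parts)):
        frames.append(pd.read_csv(filePath, sep='\t'))
# Concatenate once at the end instead of growing df inside the loop
df = pd.concat(frames) if frames else pd.DataFrame()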
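To see both points without touching the server, here is a small sketch that can be run on any OS. It uses `ntpath`, the Windows flavour of `os.path` from the standard library, so the join behaves as it would on the PC. The variable names are just for illustration, and it only builds the pattern string; it does not check whether glob finds anything on your share.

```python
import ntpath  # Windows implementation of os.path, importable on any OS

# A raw string cannot end in a single backslash: the tokenizer still treats
# the backslash as escaping the closing quote, so the literal never ends.
bad_source = r"p = r'\\?\UNC\server\folder1\folder2\'"
try:
    compile(bad_source, "<demo>", "exec")
    trailing_backslash_ok = True
except SyntaxError:
    trailing_backslash_ok = False

# Leave the trailing backslash off and let join add the separators.
base = r'\\?\UNC\server\folder1\folder2'
parts = ['*'] * 6 + ['lastfolder', '*output.txt']
pattern = ntpath.join(base, 'sub1', *parts)

# If a trailing backslash is really needed, append it as a normal string.
base_with_sep = base + '\\'

print(trailing_backslash_ok)
print(pattern)
```

On Windows itself, `os.path.join` is the same function as `ntpath.join`, so the script above it does not need `ntpath` directly.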