I am trying to loop over some compressed files (extension '.gz') and I am running into a problem. I want to perform a specific action when the FIRST file ending in 'aa' is encountered. Any one of them will do; it doesn't have to be the first one in the listing. Only after that should Python check whether there are OTHER "aa" files in the folder, and if so, the 2nd rule has to be applied to them. (There may be one or many "aa" files.) Finally, the 3rd rule has to be applied to all other files not ending with "aa".
However, when I run the code below, not all the files get processed.
What am I doing wrong?
Thanks!
import os
import pandas as pd

inputPath = "write your path"
fileExt = r".gz"
flag = False
for item in os.listdir(inputPath):  # loop through items in dir
    if item.endswith(fileExt):  # check for ".gz" extension
        full_path = os.path.join(inputPath, item)  # get full path of files
        if item.endswith('aa' + fileExt) and flag == False:
            df = pd.read_csv(full_path, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")  # from gzip to pandas df
            # do something
            flag = True
            print('1 rule:', "The item processed is ", item)
        elif item.endswith('aa' + fileExt) and flag == True:
            df = pd.read_csv(full_path, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")  # from gzip to pandas df
            # do something else
            print('2 rule:', "The item processed is ", item)
        elif not (item.endswith('aa' + fileExt)) and flag == True:
            df = pd.read_csv(full_path, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")  # from gzip to pandas df
            # do something else
            print('3 rule:', "The item processed is ", item)
I believe this is because Python iterates over the list of files in alphabetical order, so the other files are ignored. How can I fix this issue?
LIST OF FILES:
File_202112311aa.gz
File_20211231ab.gz
File_20211231.gz
File_20211231aa.gz
OUTPUT
1 rule The item processed is File_202112311aa.gz
3 rule The item processed is File_20211231ab.gz
2 rule The item processed is File_20211231aa.gz
Largely untested, but something along the following lines should work.
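In your loop, the third branch also requires flag == True, so any file not ending in 'aa.gz' that is listed before the first 'aa.gz' file matches none of the branches and is silently skipped. The code below first processes one file ending in 'aa.gz' (note: it does not process all files ending in 'aa.gz' first, as the question does not ask for that), then processes the remaining files. There is no particular ordering for the remaining files: it depends on how Python has been built on the system and on what the (file)system does by default, and it is simply not guaranteed.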
import glob
import pandas as pd

# Obtain an unordered list of compressed files
filenames = glob.glob("*.gz")
# Now find a filename ending with 'aa.gz'
for i, filename in enumerate(filenames):
    if filename.endswith('aa.gz'):
        firstfile = filenames.pop(i)
        # We immediately break out of the loop,
        # so we're safe to have altered `filenames`
        break
else:
    # the sometimes useful and sometimes confusing else part
    # of a for-loop: what happens if `break` was not called:
    raise ValueError("no file ending in 'aa.gz' found!")
# Ignoring the `full_path` part
df = pd.read_csv(firstfile, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")
# do something
print(f"1 rule: The file processed is {firstfile}")
# Process the remaining files
for filename in filenames:
    df = pd.read_csv(filename, compression='gzip', header=0, sep='|', encoding="ISO-8859-1")
    if filename.endswith('aa.gz'):
        # do something
        print(f"2 rule: The file processed is {filename}")
    else:
        # do something else
        print(f"3 rule: The file processed is {filename}")
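If you want to check the rule assignment without touching any real files, one option is to separate the "which rule applies to which file" decision from the pandas part. The following is only a sketch under a couple of assumptions: `assign_rules` is a hypothetical helper name (not something from the question or from any library), and the names are sorted purely to make the result reproducible, which the question does not require.

```python
def assign_rules(filenames, suffix="aa.gz"):
    """Return a list of (rule, filename) pairs.

    Rule 1 goes to one file ending in `suffix`, rule 2 to any other
    file ending in `suffix`, and rule 3 to everything else.
    """
    # Sorting is an assumption made here for reproducibility only.
    names = sorted(filenames)
    # Pick the first name (in sorted order) that ends with the suffix.
    first = next((n for n in names if n.endswith(suffix)), None)
    if first is None:
        raise ValueError(f"no file ending in {suffix!r} found!")
    plan = [(1, first)]
    for name in names:
        if name == first:
            continue  # already scheduled under rule 1
        plan.append((2 if name.endswith(suffix) else 3, name))
    return plan


# The file names from the question, used as plain strings (no I/O).
files = [
    "File_202112311aa.gz",
    "File_20211231ab.gz",
    "File_20211231.gz",
    "File_20211231aa.gz",
]
for rule, name in assign_rules(files):
    print(f"{rule} rule: The file processed is {name}")
```

Every file name ends up in the plan exactly once, so nothing can be skipped by accident; the `pd.read_csv` call and the per-rule work can then be done in a second loop over the returned pairs.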