I have a directory of about 900 Word, Excel and PDF files. My end goal is to scan the directory for PDF documents only, move them to a single folder, date-stamp them, and then search them for certain company names, returning the file name and date stamp wherever the text is found.
My first step in coding this was to organise my files: strip out what I don't need, copy the PDF files over and, at the same time, rename each PDF file to include its creation date in the file name. However, I am struggling to get these basics working. Here is my code so far, run on a test directory with a handful of files. For now it prints each folder, subfolder and file name to check that the walk-through is working, and this part works:
import os
import datetime

os.chdir(r'H:\PyTest')

def modification_date(filename):
    t = os.path.getctime(filename)
    return datetime.datetime.fromtimestamp(t).year, datetime.datetime.fromtimestamp(t).month

#Test function works
modification_date(r'H:\PyTest\2010\Oct\Meeting Minutes.docx')
#output: (2020, 10)
#for loop walks through the main folder, each subfolder and each file and prints the name of each pdf file found
for folderName, subfolders, filenames in os.walk(r'H:\PyTest'):
    print('the current folder is ' + folderName)
    for subfolder in subfolders:
        print('SUBFOLDER OF ' + folderName + ':' + subfolder)
    for filename in filenames:
        if filename.endswith('pdf'):
            print(filename)
            #print(modification_date(filename))
Without the line I've commented out at the end, print(modification_date(filename)), this works and prints the directories and the names of any PDFs:
the current folder is H:\PyTest
SUBFOLDER OF H:\PyTest:2010
SUBFOLDER OF H:\PyTest:2011
SUBFOLDER OF H:\PyTest:2012
the current folder is H:\PyTest\2010
SUBFOLDER OF H:\PyTest\2010:Dec
SUBFOLDER OF H:\PyTest\2010:Oct
the current folder is H:\PyTest\2010\Dec
HF Cheat Sheet.pdf
the current folder is H:\PyTest\2010\Oct
the current folder is H:\PyTest\2011
SUBFOLDER OF H:\PyTest\2011:Dec
SUBFOLDER OF H:\PyTest\2011:Oct
the current folder is H:\PyTest\2011\Dec
HF Cheat Sheet.pdf
the current folder is H:\PyTest\2011\Oct
the current folder is H:\PyTest\2012
SUBFOLDER OF H:\PyTest\2012:Dec
SUBFOLDER OF H:\PyTest\2012:Oct
the current folder is H:\PyTest\2012\Dec
HF Cheat Sheet.pdf
the current folder is H:\PyTest\2012\Oct
However, with print(modification_date(filename)) included in my code, I get a FileNotFoundError. It seems the function doesn't know the directory path, and that's why it's falling over.
FileNotFoundError: [WinError 2] The system cannot find the file specified: 'HF Cheat Sheet.pdf'
Can anyone suggest edits to get the date stamps and then change each PDF name to include the stamp at either the beginning or the end? I'm looking for the date the file was last saved.
Many thanks
You must construct the full path of the file using the variable folderName. It will look like this:
for folderName, subfolders, filenames in os.walk(r'H:\PyTest'):
    print('the current folder is ' + folderName)
    for subfolder in subfolders:
        print('SUBFOLDER OF ' + folderName + ':' + subfolder)
    for filename in filenames:
        if filename.endswith('pdf'):
            print(filename)
            # join the current folder with the bare file name to get the full path
            print(modification_date(os.path.join(folderName, filename)))
folderName (this variable is usually called root) holds the path of the folder currently being visited, starting from the path you passed to os.walk(). The entries in filenames are bare file names with no directory part, so to get the complete path of a file you must join folderName with the file name.
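For the renaming part of the question, here is a minimal sketch of one possible approach rather than a definitive solution. It assumes the date wanted is the last-modified time (os.path.getmtime; on Windows, os.path.getctime returns the creation time instead), that a year-month prefix such as 2020-10_ is an acceptable stamp, and that the PDFs are copied rather than moved. The function names last_saved_stamp and collect_pdfs are made up for illustration.

```python
import datetime
import os
import shutil


def last_saved_stamp(path):
    """Return the file's last-modified time as a 'YYYY-MM' string."""
    t = os.path.getmtime(path)
    return datetime.datetime.fromtimestamp(t).strftime('%Y-%m')


def collect_pdfs(source_root, target_dir):
    """Copy every PDF under source_root into target_dir with a date prefix.

    Assumption: target_dir lies outside source_root, otherwise os.walk
    would also visit the copies.
    """
    os.makedirs(target_dir, exist_ok=True)
    copied = []
    for folder_name, subfolders, filenames in os.walk(source_root):
        for filename in filenames:
            if not filename.lower().endswith('.pdf'):
                continue
            # os.walk yields bare names, so build the full path first
            full_path = os.path.join(folder_name, filename)
            new_name = last_saved_stamp(full_path) + '_' + filename
            destination = os.path.join(target_dir, new_name)
            # Same name and stamp can occur in several subfolders;
            # add a numeric suffix instead of overwriting.
            stem, ext = os.path.splitext(new_name)
            counter = 1
            while os.path.exists(destination):
                destination = os.path.join(target_dir, f'{stem}_{counter}{ext}')
                counter += 1
            shutil.copy2(full_path, destination)  # copy2 keeps the timestamps
            copied.append(destination)
    return copied
```

It could be called as collect_pdfs(r'H:\PyTest', r'H:\PyTestPDFs'), where the second folder is a hypothetical target. The numeric suffix matters for the test directory above, where HF Cheat Sheet.pdf appears in several subfolders.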