I am working on an invoice scraper for work, where I have successfully written all the code to scrape the fields that I need using PyPDF2. However, I am having trouble figuring out how to put this code into a for loop so I can iterate through all the invoices stored in my directory. There could be anywhere from 1 to 250+ files depending on which project I am using this for.
I thought I would be able to use "*.pdf" in place of the pdf name, but it does not work for me. I am relatively new to Python and have not used that many loops before, so any guidance would be appreciated!
import PyPDF2
import re
pdfFileObj = open(r'C:\Users\notylerhere\Desktop\Test Invoices\SampleInvoice.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
#Print all text on page
#print(pageObj.extractText())
#Grab Account Number Meter Number
accountNumber = re.compile(r'\d\d\d\d\d-\d\d\d\d\d')
meterNumber = re.compile(r'(\d\d\d\d\d\d\d\d)')
moAccountNumber = accountNumber.search(pageObj.extractText())
moMeterNumber = meterNumber.search(pageObj.extractText())
print('Account Number: '+moAccountNumber.group())
print('Meter Number: '+moMeterNumber.group(1))
Thanks very much!
Another option is glob:
import glob
files = glob.glob("c:/mydirectory/*.pdf")
for file in files:
    print(file)  # do your processing of file here
Make sure every line that belongs to the loop body (everything after the colon) is indented consistently.
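To tie this back to the code in the question, here is a rough sketch of how the glob loop and the scraping could fit together. It reuses the two regexes and the PyPDF2 calls exactly as posted (`PdfFileReader`, `getPage`, `extractText`); newer PyPDF2 releases renamed these, so adjust if your version complains. The function names (`extract_fields`, `first_page_text`, `scrape_folder`) and the `read_text` parameter are my own, not anything from PyPDF2, and I am assuming the fields you want are always on the first page. It also returns `None` instead of crashing when an invoice has no match.

```python
import glob
import os
import re

# Patterns copied from the question.
ACCOUNT_RE = re.compile(r'\d\d\d\d\d-\d\d\d\d\d')
METER_RE = re.compile(r'(\d\d\d\d\d\d\d\d)')


def extract_fields(text):
    """Return (account, meter) found in text; either may be None."""
    mo_account = ACCOUNT_RE.search(text)
    mo_meter = METER_RE.search(text)
    account = mo_account.group() if mo_account else None
    meter = mo_meter.group(1) if mo_meter else None
    return account, meter


def first_page_text(path):
    """Read the text of page 0 with the same PyPDF2 calls as the question."""
    import PyPDF2  # imported here so the helpers above work without it
    with open(path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return reader.getPage(0).extractText()


def scrape_folder(folder, read_text=first_page_text):
    """Yield (filename, account, meter) for every .pdf file in folder.

    read_text is a hypothetical hook: any function that takes a path and
    returns the page text, so a different PDF reader can be swapped in.
    """
    for path in sorted(glob.glob(os.path.join(folder, '*.pdf'))):
        account, meter = extract_fields(read_text(path))
        yield os.path.basename(path), account, meter


if __name__ == '__main__':
    folder = r'C:\Users\notylerhere\Desktop\Test Invoices'
    for name, account, meter in scrape_folder(folder):
        print(name, 'Account Number:', account, 'Meter Number:', meter)
```

The reason "*.pdf" did not work in your original code is that `open()` takes one literal file name and does not expand wildcards; `glob.glob()` does the expansion and hands back a list of real paths you can loop over.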