I am trying to open a zip file and iterate through the PDFs inside it. I want to scrape a certain portion of the text from each PDF. I am using the following code:
def get_text(part):
    #Create path
    path = f'C:\\Users\\user\\Data\\Part_{part}.zip'
    with zipfile.ZipFile(path) as data:
        listdata = data.namelist()
        onlypdfs = [k for k in listdata if '_2018' in k or '_2019' in k or '_2020' in k or '_2021' in k or '_2022' in k]
        for file in onlypdfs:
            with data.open(file, "r") as f:
                #Get the pdf
                pdffile = pdftotext.PDF(f)
                text = ("\n\n".join(pdffile))
            #Remove the newline characters
            text = text.replace('\r\n', ' ')
            text = text.replace('\r', ' ')
            text = text.replace('\n', ' ')
            text = text.replace('\x0c', ' ')
            #Get the text that will talk about what I want
            try:
                text2 = re.findall(r'FEES (.+?) Types', text, re.IGNORECASE)[-1]
            except:
                text2 = 'PROBLEM'
            #Return the file name and the text
            return file, text2
Then in the next line I am running:
info = []
for i in range(1,2):
    info.append(get_text(i))
info
My output is only the first file and text. I have 4 PDFs in the zip folder. Ideally, I want it to iterate through the 30+ zip files. But I am having trouble with just one. I've seen this question asked before, but the solutions didn't fit my problem. Is it something with the with statement?
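A return statement inside the for loop exits the function on the first iteration, which is why you only get the first file. You need to process all the files, store each result as you iterate, and return only after the loop has finished. One way to do this is to store the results in a list of tuples: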
file_list = []
for file in onlypdfs:
    ...
    file_list.append((file, text2))
return file_list
You could then use this like so:
info = []
for i in range(1,2):
    results = get_text(i)
    for file_text in results:
        info.append(file_text)
print(info)
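For reference, here is a minimal self-contained sketch of the whole pattern, so you can see where the return ends up. It makes a few assumptions that are not in your code: the function takes a path or file-like object instead of a part number, the text extraction step is passed in as a parameter named extract_text (a hypothetical name), and the demo builds a small in-memory zip of plain-text files rather than real PDFs so it runs without pdftotext. With real PDFs you would pass something like lambda f: "\n\n".join(pdftotext.PDF(f)) instead.

```python
import io
import re
import zipfile


def default_extract_text(fileobj):
    # Hypothetical stand-in for the PDF step: decodes the raw bytes as text.
    # With pdftotext you would use: "\n\n".join(pdftotext.PDF(fileobj))
    return fileobj.read().decode("utf-8", errors="replace")


def get_text(zip_source,
             years=("_2018", "_2019", "_2020", "_2021", "_2022"),
             extract_text=default_extract_text):
    """Return a list of (file name, matched text) tuples, one per matching member."""
    results = []
    with zipfile.ZipFile(zip_source) as data:
        wanted = [name for name in data.namelist()
                  if any(year in name for year in years)]
        for name in wanted:
            with data.open(name, "r") as f:
                text = extract_text(f)
            # Collapse runs of newlines, carriage returns and form feeds into one space
            text = re.sub(r"[\r\n\x0c]+", " ", text)
            matches = re.findall(r"FEES (.+?) Types", text, re.IGNORECASE)
            # Keep the last match, or a marker when nothing matched
            results.append((name, matches[-1] if matches else "PROBLEM"))
    # Return only after the loop has visited every file
    return results


def build_demo_zip():
    # Assumption for illustration: text files standing in for PDFs
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("report_2018.pdf", "FEES audit 100 Types of service")
        zf.writestr("report_2019.pdf", "Fees tax 200\nTypes of service")
        zf.writestr("report_2015.pdf", "FEES old 50 Types")
        zf.writestr("report_2020.pdf", "nothing relevant here")
    buffer.seek(0)
    return buffer


info = get_text(build_demo_zip())
print(info)
```

Since the function now returns a list, info.extend(get_text(...)) does the same job as the nested loop above. Also note that range(1,2) only yields 1, so to cover all of your zip files the upper bound has to be one more than the number of the last part.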