Tags: python, for-loop, with-statement, pdftotext

Python: For loop only iterates once - also using a with statement


I am trying to open a zip file and iterate through the PDFs in the zip file. I want to scrape a certain portion of the text in the pdf. I am using the following code:

def get_text(part):
    #Create path
    path = f'C:\\Users\\user\\Data\\Part_{part}.zip'
    
    with zipfile.ZipFile(path) as data:
        listdata = data.namelist()
        onlypdfs = [k for k in listdata if '_2018' in k or '_2019' in k or '_2020' in k or '_2021' in k or '_2022' in k]

        for file in onlypdfs:
            with data.open(file, "r") as f:
                #Get the pdf
                pdffile = pdftotext.PDF(f)
                text = ("\n\n".join(pdffile))

    
                #Remove the newline characters
                text = text.replace('\r\n', ' ')
                text = text.replace('\r', ' ')
                text = text.replace('\n', ' ')
                text = text.replace('\x0c', ' ')

                #Get the text that will talk about what I want
                try:
                    text2 = re.findall(r'FEES (.+?) Types', text, re.IGNORECASE)[-1]

                except:
                    text2 = 'PROBLEM'

                #Return the file name and the text
                return file, text2

Then in the next line I am running:

info = []
for i in range(1,2):
    info.append(get_text(i))
info

The output contains only the first file and its text, even though the zip file holds 4 PDFs. Ideally, I want this to iterate through 30+ zip files, but I am having trouble with just one. I've seen this question asked before, but the solutions didn't fit my problem. Is it something to do with the with statement?


Solution

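  • The return statement sits inside the for loop, so the function exits at the end of the first iteration and the remaining PDFs are never read. The with statement is not the cause. You need to process all the files, store each result as you iterate, and return only after the loop has finished. One way to do this is to store the results in a list of tuples: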

    file_list = []
    for file in onlypdfs:
        ...
        file_list.append((file, text2))
    return file_list
    

    You could then use this like so:

    info = []
    for i in range(1,2):
        # get_text now returns a list of (file, text2) tuples for one zip file
        results = get_text(i)
        for file_text in results:
            info.append(file_text)
    print(info)
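
Putting the pieces together, here is a minimal sketch of the whole function with the return moved out of the loop. It rests on a few assumptions that are not in the original post: the function takes the zip path directly, and it takes a hypothetical extract_text callable that turns an open file object into a string, so the sketch runs without pdftotext installed. It also collapses each run of newline or form-feed characters into a single space, which differs slightly from the chained replace calls in the question.

```python
import re
import zipfile

# Substrings that mark the files of interest (taken from the question).
YEARS = ("_2018", "_2019", "_2020", "_2021", "_2022")


def get_text(path, extract_text):
    """Return a list of (file name, matched text) tuples, one per matching file.

    extract_text is a hypothetical callable: it receives the open file
    object and returns the document text as a single string.
    """
    results = []
    with zipfile.ZipFile(path) as data:
        wanted = [name for name in data.namelist()
                  if any(year in name for year in YEARS)]

        for name in wanted:
            with data.open(name, "r") as f:
                text = extract_text(f)

            # Replace runs of newline and form-feed characters with one space.
            text = re.sub(r"[\r\n\x0c]+", " ", text)

            # Keep the last match, or a marker when nothing matches.
            matches = re.findall(r"FEES (.+?) Types", text, re.IGNORECASE)
            results.append((name, matches[-1] if matches else "PROBLEM"))

    # Return only after every file has been processed.
    return results
```

With pdftotext, the callable could be something like lambda f: "\n\n".join(pdftotext.PDF(f)), which mirrors the two lines in the question. Checking for an empty match list instead of using a bare except also means that unrelated errors, such as a corrupt PDF, are no longer silently reported as 'PROBLEM'.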