
How to retrieve ALL pages from a PDF as a single string in Python 3 using PyPDF2


In order to get a single string from a multi-paged PDF I'm doing this:

import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    output = page.extractText()
output

The result is a string from a single page (the last page in the document), just as it should be according to the PyPDF2 documentation. I tried this approach because I've read that some people suggest it for reading a whole PDF, but it does not work in my case.

Obviously, this is a basic operation, and I apologize in advance for my lack of experience. I tried other solutions like Tika, PDFMiner and Textract, but PyPDF seems to be the only one that has gotten me this far.

Any help would be appreciated.

Update:

As suggested, I defined output as a list and then (as I thought) appended all pages to it in a loop like this:

for i in range(count):
    page = pdfReader.getPage(i)
    output = []
    output.append(page.extractText())

The result, though, is a list containing a single string, like ['sample content from the last page of PDF']


Solution

  • Could it be because of this line:

    output = page.extractText()
    

    Try this instead:

    output += page.extractText()
    

    In your code, each pass through the loop overwrites the value of the "output" variable instead of appending to it, so only the last page survives. Don't forget to initialize "output" before the for loop: put output = '' before for i in range(count):
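
Putting the pieces together, here is a minimal sketch of the accumulate-before-the-loop idea. It uses the same legacy PyPDF2 calls as the question (PdfFileReader, numPages, getPage, extractText). The helper names join_page_texts and pdf_to_string are made up for illustration and are not part of PyPDF2. The list version from the update fails for a similar reason: output = [] sits inside the loop, so the list is reset on every pass.

```python
def join_page_texts(pages):
    """Concatenate the text of every page object into one string."""
    texts = []  # created once, outside the loop, so nothing gets reset
    for page in pages:
        texts.append(page.extractText())
    # Join once at the end instead of overwriting a variable per page.
    return "".join(texts)


def pdf_to_string(path):
    """Return the text of all pages of the PDF at `path` as one string."""
    # Imported here so join_page_texts can be used without PyPDF2 installed.
    # Assumes an older PyPDF2 release that still has the question's API.
    import PyPDF2

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        pages = [reader.getPage(i) for i in range(reader.numPages)]
        # Extract while the file is still open.
        return join_page_texts(pages)


# Usage (assuming a file named sample.pdf exists next to the script):
# text = pdf_to_string("sample.pdf")
```

If you are on PyPDF2 3.0 or later, these names were replaced (PdfReader, reader.pages, page.extract_text()), so the sketch would need the newer calls; the accumulation pattern stays the same.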