In order to get a single string from a multi-paged PDF I'm doing this:
import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    output = page.extractText()
output
The result is a string from a single page (the last page in the document), just as it should be according to the PyPDF2 documentation. I tried this method because I've read suggestions that it reads the whole PDF, which does not work in my case.
Obviously, this is a basic operation, and I apologize in advance for my lack of experience. I tried other solutions like Tika, PDFMiner and Textract, but PyPDF2 seems to be the only one that has gotten me this far.
Any help would be appreciated.
Update:
As suggested, I defined output as a list and then (so I thought) appended every page to it in a loop like this:
for i in range(count):
    page = pdfReader.getPage(i)
    output = []
    output.append(page.extractText())
The result, though, is a list containing a single string, like ['sample content from the last page of PDF']
Could it be because of this line:
output = page.extractText()
Try this instead:
output += page.extractText()
In your code, each iteration overwrites the value of the "output" variable instead of appending to it, so only the last page survives. Don't forget to initialize "output" before the for loop: put output = '' before for i in range(count):
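Putting that together, here is a minimal sketch of the whole loop. The helper name extract_all_text is made up for illustration, and joining pages with a newline is my own choice (the += version above adds no separator). It only relies on the numPages, getPage and extractText calls already used in the question, so it assumes the legacy PyPDF2 API (as far as I know, PyPDF2 3.0 and later removed those names in favour of PdfReader, .pages and extract_text()).

```python
def extract_all_text(reader):
    """Return the text of every page of a PdfFileReader-like object as one string.

    `reader` is assumed to expose numPages and getPage(i), as in the question.
    """
    parts = []  # created once, before the loop, so earlier pages are not lost
    for i in range(reader.numPages):
        page = reader.getPage(i)
        parts.append(page.extractText())
    # The newline separator is an assumption; use "".join(parts) for none.
    return "\n".join(parts)


# Usage with the legacy PyPDF2 API from the question (needs a real sample.pdf):
#
#     import PyPDF2
#     with open('sample.pdf', 'rb') as pdfFileObject:
#         pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
#         output = extract_all_text(pdfReader)
```

The same reasoning explains the list attempt in the update: output = [] sits inside the loop, so the list is recreated on every pass and only ever holds the last page. Creating it once before the loop, as parts is here, keeps all pages.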