Hello, I am trying to extract text from a PDF using PyPDF2. It works correctly, but messages like Superfluous whitespace found in object header b'60' b'0' appear in the terminal.
My code is below:
pdfFile = open(filePath, 'rb')
pdfReader = ppdf.PdfFileReader(pdfFile)
for pageIndex in range(pdfReader.numPages):
    page = pdfReader.pages[pageIndex]
    words = page.extract_text(0).split()
    for word in words:
        main_text.append(word)
# print(main_text)
I was using print(main_text) and thought that was the source of the problem, but after removing it I still get this annoying message in the terminal. Is there a way to prevent it?
Superfluous whitespace found in object header b'36' b'0'
Superfluous whitespace found in object header b'47' b'0'
Superfluous whitespace found in object header b'50' b'0'
Superfluous whitespace found in object header b'53' b'0'
Superfluous whitespace found in object header b'56' b'0'
This is a log message informing you that the PDF file you're processing does not follow the PDF standard. It's only a log message and not an exception because PyPDF2 is confident that it can still handle the file. I think it has the log level WARNING.
If you don't want to see it, just set the logger's level to something higher, e.g. ERROR:
import logging
logger = logging.getLogger("PyPDF2")
logger.setLevel(logging.ERROR)
PyPDF2 does not emit CRITICAL messages, so you can disable all log messages by setting the level to CRITICAL.
See https://pypdf2.readthedocs.io/en/3.x/user/suppress-warnings.html#log-messages
The project has moved from PyPDF2 to pypdf. The setting is the same, except that the logger is named "pypdf":
https://pypdf.readthedocs.io/en/stable/user/suppress-warnings.html#log-messages
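To see why setting the level on the top-level logger is enough, here is a minimal sketch that uses only the standard logging module, so it runs without PyPDF2 installed. The child logger name "PyPDF2._reader", the ListHandler class and the error message are stand-ins for illustration only; the assumption is that the library creates its loggers per module under the "PyPDF2" name, so they inherit the level set on the parent.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: stores messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
parent = logging.getLogger("PyPDF2")
parent.addHandler(handler)

# Stand-in for a logger created inside the library (assumed name).
child = logging.getLogger("PyPDF2._reader")

# Default effective level is WARNING, so this message gets through.
child.warning("Superfluous whitespace found in object header b'60' b'0'")
before = list(handler.messages)

# Raising the parent's level also raises the child's effective level.
parent.setLevel(logging.ERROR)
child.warning("Superfluous whitespace found in object header b'36' b'0'")
child.error("a hypothetical error message")
after = handler.messages[len(before):]

print(before)
print(after)
```

After the call to setLevel, the warning is dropped while the error still arrives. The same pattern should apply to pypdf with the logger name "pypdf".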