I have a large number of files, some of them are scanned images into PDF and some are full/partial text PDF.
Is there a way to check these files to ensure that we are only processing files which are scanned images and not those that are full/partial text PDF files?
environment: PYTHON 3.6
The below code will work to extract data text data from both searchable and non-searchable PDF's.
import fitz
text = ""
path = "Your_scanned_or_partial_scanned.pdf"
doc = fitz.open(path)
for page in doc:
text += page.get_text()()
You can refer to this link for more information.
If you don't have the fitz
module you need to do this:
pip install --upgrade pymupdf