python, python-3.x, pypdf, pdfminer, pdf-extraction

How to check if a PDF is a scanned image or contains text


I have a large number of files. Some of them are scanned images saved as PDF, and some are full or partial text PDFs.

Is there a way to check these files so that we only process the ones that are scanned images, and not the full or partial text PDFs?

Environment: Python 3.6


Solution

  • The code below extracts the embedded text layer from a PDF. Pages that are pure scanned images have no text layer, so they return an empty string unless the file has been run through OCR.

    import fitz  # provided by the PyMuPDF package
    
    text = ""
    path = "Your_scanned_or_partial_scanned.pdf"
    
    # Concatenate the text layer of every page.
    doc = fitz.open(path)
    for page in doc:
        text += page.get_text()
    

    See the PyMuPDF documentation for more information.

    If the fitz module is missing, install PyMuPDF, the package that provides it:

    pip install --upgrade pymupdf
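
To answer the original question directly, the extracted text can be used as a signal: a page with no text layer is most likely a scanned image. The following is a minimal sketch of that idea, not a definitive detector. The function names `classify_pages` and `page_texts_from_pdf` and the `min_chars` threshold are hypothetical, chosen for illustration. Note that a scanned PDF that has already been OCR'd carries a hidden text layer and would be reported as "text" here.

```python
def classify_pages(page_texts, min_chars=20):
    """Classify a document from the text extracted from each page.

    page_texts: list of strings, one per page.
    min_chars:  hypothetical threshold; pages with fewer non-whitespace-
                trimmed characters are treated as having no text layer.
    Returns "scanned", "text", "mixed", or "empty" (no pages).
    """
    if not page_texts:
        return "empty"
    has_text = [len(t.strip()) >= min_chars for t in page_texts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "scanned"
    return "mixed"


def page_texts_from_pdf(path):
    """Return the text layer of every page, using PyMuPDF."""
    import fitz  # imported lazily so classify_pages works without PyMuPDF

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


# Usage (the path is a placeholder):
# kind = classify_pages(page_texts_from_pdf("Your_scanned_or_partial_scanned.pdf"))
# if kind == "scanned":
#     ...  # process only the scanned-image PDFs
```

Files reported as "mixed" contain both kinds of pages; whether to process them depends on how you want to treat partial-text PDFs.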