python machine-learning pdf deep-learning pdftotext

Check if a searchable PDF has been OCR'd or is a true searchable PDF


Is there any Python way to identify whether a PDF has been OCR'd (the text quality is poor) versus a true searchable PDF (the text quality is perfect)?

Using the PDF's metadata:

import pprint
import PyPDF2

def get_doc_info(path):
    pp = pprint.PrettyPrinter(indent=4)
    with open(path, 'rb') as f:
        pdf_file = PyPDF2.PdfFileReader(f)
        doc_info = pdf_file.getDocumentInfo()
    pp.pprint(doc_info)
    return doc_info

I find:

result = get_doc_info("PDF_SEARCHABLE_HAS_BEEN_OCRD.pdf")
{   '/Author': 'NAPS2',
    '/CreationDate': "D:20200701104101+02'00'",
    '/Creator': 'NAPS2',
    '/Keywords': '',
    '/ModDate': "D:20200701104101+02'00'",
    '/Producer': 'PDFsharp 1.50.4589 (www.pdfsharp.com)'}



result = get_doc_info("PDF_SEARCHABLE_TRUE.pdf")
{   '/CreationDate': 'D:20210802122000Z',
    '/Creator': 'Quadient CXM AG~Inspire~14.3.49.7',
    '/Producer': ''}

Can I check the type of the PDF (true PDF or OCR PDF) using the Creator field from the PDF's metadata, e.g. with a heuristic like the sketch below?
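A crude sketch of that idea (the tool names in the list are my own guesses at common OCR producers, and metadata is optional and easily wrong, so treat this as fragile):

import PyPDF2

# Hypothetical allow-list of tools that typically produce OCR'd PDFs; extend as needed
KNOWN_OCR_TOOLS = ("NAPS2", "Tesseract", "ABBYY")

def looks_ocrd_by_metadata(path: str) -> bool:
    """Guess 'OCR PDF' if /Creator or /Producer mentions a known OCR tool."""
    with open(path, "rb") as f:
        info = PyPDF2.PdfFileReader(f).getDocumentInfo() or {}
    fields = "{} {}".format(info.get("/Creator", ""), info.get("/Producer", ""))
    return any(tool.lower() in fields.lower() for tool in KNOWN_OCR_TOOLS)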

Is there another way to do this in Python?

If there is no direct solution, how can I use deep learning/machine learning to detect whether a searchable PDF is true or OCR'd?

Here is a video explaining the difference between a TRUE PDF and an OCR PDF: https://www.youtube.com/watch?v=xs8KQbxsMcw


Solution

  • Not long ago I ran into the same problem!

    I developed this function (based on some SO post I cannot recall):

    import fitz  # PyMuPDF (this code uses the older camelCase API)

    def get_scanned_pages_percentage(filepath: str) -> float:
        """
        INPUT: path to a PDF file
        OUTPUT: percentage of non-empty pages whose text comes from OCR
        """
        total_pages = 0
        total_scanned_pages = 0
        with fitz.open(filepath) as doc:
            for page in doc:
                text = page.getText().strip()
                if len(text) == 0:
                    # Ignore "empty" pages
                    continue
                total_pages += 1
                pix1 = page.getPixmap(alpha=False)  # render the page to an image
                remove_all_text(doc, page)          # strip the page's text layer
                pix2 = page.getPixmap(alpha=False)  # render again, now without text
                img1 = pix1.getImageData("png")
                img2 = pix2.getImageData("png")
                if img1 == img2:
                    # The renders are identical, so the removed text was an
                    # invisible OCR layer over a scanned image
                    total_scanned_pages += 1
        if total_pages == 0:
            return 0
        return (total_scanned_pages / total_pages) * 100
    

    This function will return 100 (or close to it) if the PDF is an image containing OCR'd text, and 0 if it is a native digital PDF.
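    A minimal usage sketch (the file name and the 50% threshold are placeholders I chose, to be tuned for your documents):

    pct = get_scanned_pages_percentage("some_document.pdf")  # placeholder path
    if pct > 50:
        print("Mostly OCR'd (scanned) pages")
    else:
        print("Looks like a native digital PDF")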

    remove_all_text:

    def remove_all_text(doc, page):
        """Removes all text objects from a PDF page's content stream."""
        page.cleanContents()  # syntactic cleaning of the page's appearance commands

        # xref of the cleaned command source (a bytes object)
        xref = page.getContents()[0]

        cont = doc.xrefStream(xref)  # read it
        # Work on the command stream as bytes: search for the operators that
        # bracket text objects (BT ... ET) and delete everything between them.
        ba_cont = bytearray(cont)  # a modifiable version
        pos = 0
        changed = False  # switch indicating changes
        while pos < len(cont) - 1:
            pos = ba_cont.find(b"BT\n", pos)  # begin text object
            if pos < 0:
                break  # not (more) found
            pos2 = ba_cont.find(b"ET\n", pos)  # end text object
            if pos2 <= pos:
                break  # major error in the PDF page definition!
            ba_cont[pos: pos2 + 2] = b""  # remove the text object
            changed = True
        if changed:  # we have indeed removed some text
            doc.updateStream(xref, ba_cont)  # write back the command stream without text
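
    If you want to see what remove_all_text actually does, you can strip a copy of a scanned file and save it under a new name (a sketch; the file names are placeholders, and it assumes the same camelCase PyMuPDF API as the code above):

    import fitz

    with fitz.open("scanned.pdf") as doc:  # placeholder input path
        for page in doc:
            remove_all_text(doc, page)
        doc.save("scanned_no_text.pdf")    # open this copy: text selection should now find nothing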