Tags: python, pdf, pypdf, strikethrough, pymupdf

How to identify strike-out text from PDF files using Python


I would like to extract only the strike-out text from a .pdf file. The code below works with one sample PDF file I have, but it does not work with another PDF file, which I think is a scanned one. Is there a standard way to extract only the strike-out text from a PDF file using Python? Any help would be really appreciated.

This is the code I was using:

from pdf2docx import parse
from typing import Tuple
from docx import Document

def convert_pdf2docx(input_file: str, output_file: str, pages: Tuple = None):
    """Converts pdf to docx and returns a summary of the conversion"""
    if pages:
        # Accept page numbers given either as ints or as numeric strings
        pages = [int(i) for i in pages if str(i).isnumeric()]
    parse(pdf_file=input_file,
          docx_with_path=output_file, pages=pages)
    return {
        "File": input_file, "Pages": str(pages), "Output File": output_file
    }

if __name__ == "__main__":
    pdf_file = 'D:/AWS practice/sample_striken_out.pdf'
    doc_file = 'D:/AWS practice/sample_striken_out.docx'
    convert_pdf2docx(pdf_file, doc_file)
    document = Document(doc_file)
    with open('D:/AWS practice/sample_striken_out.txt', 'w') as f:
        for p in document.paragraphs:
            for run in p.runs:
                # Keep only the runs that carry strike-through formatting
                if run.font.strike:
                    f.write(run.text)
                    print(run.text)
            f.write('\n')

Note: I convert the PDF to DOCX first and then try to identify the strike-out text. This works with a sample file, but not with the scanned PDF file: the PDF-to-DOCX conversion runs, but the strike-through is not detected.


Solution

  • Q.

    another pdf file which I think is a scanned one. Is there any standard way to extract only strike-out text from a pdf file using python?

    A.

    You can use any language, including Python. However, like many reversal tasks that amount to decompiling a very complex but dumb compiled page-description file, this is not one task but many, often working on single characters. For some of the better solutions in PDF extraction, see "Detect Bold, Italic and Strike Through text using PDFBox with VB.NET" and also "Amazon Textract to identify strike through text from pdf file".

    In general, every source and target format has a very different way of describing a line placed through text. Let's look at a few of the many. In PDF, strikeout is not tied to the text, and it can come in many forms depending on the application that wrote the file. Here is just one, an annotation added after the plain text:

    23 0 obj
    <<
      /Type /Annot
      /Subtype /StrikeOut
      /C [ 1 0 0 ]
      /P 3 0 R
      /F 4
      /M (D:20220614085648Z)
      /T (K)
      /Rect [ 26.577025 361.84715 70.29766 393.2207 ]
      /AP <<
        /N 24 0 R
      >>
      /QuadPoints [ 28.32 391.47773 68.55469 391.47773 28.32 363.59013
          68.55469 363.59013 ]
      /Contents (AEI)
    >>
    endobj
    
    24 0 obj
    <<
      /Type /XObject
      /Subtype /Form
      /BBox [ 26.577025 361.84715 70.29766 393.2207 ]
      /Matrix [ 1 0 0 1 0 0 ]
      /Length 62
    >>
    stream
    1 0 0 RG
    1.7429752 w
    28.32 375.54197 m
    68.55469 375.54197 l
    S
    
    endstream
    endobj
    

    Although in this case the annotation confirms that the line lies over /Contents (AEI), that is not usually the case, since the line is drawn independently of the text. The only tie-in is the location, defined as a /Rect somewhere on the page. The PDF object above is the red line on the left in this screenshot. The black, red, blue and green lines are different strike-through lines produced from a source .txt file; they are tied to the text by colour in addition to position. (Note that the text is spaced differently from the lines, yet the lines appear to be one continuous line.)

    [Screenshot: strike-through lines in red, black, blue and green]
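
    Because position is the only link between the line and the text, one way to recover the struck text is to compare coordinates. The sketch below is a minimal illustration rather than a standard method. It assumes you already have word bounding boxes and horizontal stroke segments in the same PDF coordinate space (for example from an extraction library). The names Word, is_struck and struck_words, the 25% margins and the 50% coverage threshold are hypothetical choices. The first word and the stroke reuse the numbers from the annotation above; the second word is invented for contrast.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """A word and its bounding box in PDF user space (origin at bottom-left)."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


# A horizontal stroke described as (x_start, x_end, y).
Stroke = Tuple[float, float, float]


def is_struck(word: Word, stroke: Stroke, min_cover: float = 0.5) -> bool:
    """Return True if the stroke crosses the middle band of the word box."""
    sx0, sx1, sy = stroke
    if sx0 > sx1:
        sx0, sx1 = sx1, sx0
    width = word.x1 - word.x0
    height = word.y1 - word.y0
    if width <= 0 or height <= 0:
        return False
    # An underline sits near y0 and an overline near y1, so only accept
    # strokes that fall in the central half of the box (assumed margins).
    in_band = word.y0 + 0.25 * height <= sy <= word.y1 - 0.25 * height
    # Horizontal overlap between the stroke and the word box.
    overlap = min(word.x1, sx1) - max(word.x0, sx0)
    return in_band and overlap >= min_cover * width


def struck_words(words: List[Word], strokes: List[Stroke]) -> List[str]:
    """Return the text of every word crossed by at least one stroke."""
    return [w.text for w in words if any(is_struck(w, s) for s in strokes)]


words = [
    # Box taken from /QuadPoints in the annotation above.
    Word("AEI", 28.32, 363.59013, 68.55469, 391.47773),
    # Hypothetical neighbouring word with no line through it.
    Word("XYZ", 80.0, 363.59013, 120.0, 391.47773),
]
# Line taken from the appearance stream: 28.32 375.54197 m 68.55469 375.54197 l
strokes = [(28.32, 68.55469, 375.54197)]

print(struck_words(words, strokes))
```

    A stroke near the bottom of the box (an underline) is rejected by the band test, which is why the margins matter. Real files may draw the line as a thin filled rectangle or split it into several segments, so the stroke list usually needs normalising first.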

    In DOCX, common text such as the underlined I0X is grouped differently: the formatting is stored "in-line" with the text.

    <w:r>
    <w:rPr>
    <w:rFonts w:ascii="Verdana" w:hAnsi="Verdana" w:cs="Verdana" w:eastAsia="Verdana"/>
    <w:strike w:val="true"/>
    <w:color w:val="auto"/>
    <w:spacing w:val="0"/>
    <w:position w:val="0"/>
    <w:sz w:val="50"/>
    <w:u w:val="single"/>
    <w:shd w:fill="auto" w:val="clear"/>
    </w:rPr>
    <w:t xml:space="preserve">I0X</w:t>
    </w:r>
    

    Thus the monochrome text is grouped into a single run that carries both the underline and the strike-through as properties.
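
    Since the strike is a property of the run, detecting it in DOCX comes down to reading that property. The sketch below uses only the standard library to parse WordprocessingML like the fragment above. The helper names _toggle_on and struck_runs and the SAMPLE paragraph are hypothetical. It only sees formatting applied directly to the run; a strike inherited from a style would need the style definitions as well.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:r and w:strike.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def _toggle_on(rpr, tag):
    """True if the toggle property (e.g. w:strike) is present and not off."""
    el = rpr.find(f"w:{tag}", NS)
    if el is None:
        return False
    val = el.get(f"{{{W_NS}}}val")
    # A missing w:val means the property is switched on.
    return val not in ("false", "0", "off")


def struck_runs(xml_fragment):
    """Return the text of every run with direct (double) strike formatting."""
    root = ET.fromstring(xml_fragment.strip())
    result = []
    for run in root.iter(f"{{{W_NS}}}r"):
        rpr = run.find("w:rPr", NS)
        if rpr is None:
            continue
        if _toggle_on(rpr, "strike") or _toggle_on(rpr, "dstrike"):
            result.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return result


# Hypothetical paragraph: one struck run, one with strike switched off,
# and one with no run properties at all.
SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:r>
    <w:rPr><w:strike w:val="true"/><w:u w:val="single"/></w:rPr>
    <w:t xml:space="preserve">I0X</w:t>
  </w:r>
  <w:r><w:rPr><w:strike w:val="false"/></w:rPr><w:t>kept</w:t></w:r>
  <w:r><w:t>plain</w:t></w:r>
</w:p>
"""

print(struck_runs(SAMPLE))
```

    This only helps when the PDF-to-DOCX converter has already recognised the line as strike formatting. If the converter emits the line as a separate drawing, or the page is an image, no w:strike element will be present.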

    For this and many other reasons, it is not easy for a program to decide how to handle such cases, and every library will do it differently depending on its inputs. The one thing they will generally agree on is that there is not much chance of a basic PDF converter turning a row of pixels in a scanned page into strike-through formatting on OCR text.
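
    To show what a scanned page would require, here is a rough sketch of a pixel-level check. It is a hypothetical heuristic, not something a converter is known to do. It assumes the page has already been binarised into rows of 0/1 values and that an OCR step has supplied a bounding box for each word. The name looks_struck and the 90% fill threshold are assumptions.

```python
from typing import List, Tuple

Bitmap = List[List[int]]         # rows from top to bottom, 1 = dark pixel
Box = Tuple[int, int, int, int]  # left, top, right, bottom (right/bottom exclusive)


def looks_struck(bitmap: Bitmap, box: Box, min_fill: float = 0.9) -> bool:
    """Guess whether a dark horizontal line runs through the middle of a box."""
    left, top, right, bottom = box
    width, height = right - left, bottom - top
    if width <= 0 or height < 4:
        return False
    # Skip the top and bottom quarters so underlines and overlines are ignored.
    for y in range(top + height // 4, bottom - height // 4):
        dark = sum(bitmap[y][left:right])
        # Letters leave gaps between strokes; a strike line fills almost the whole row.
        if dark >= min_fill * width:
            return True
    return False


# Tiny synthetic "word": three vertical stems on a 10 x 8 pixel canvas.
plain = [[0] * 10 for _ in range(8)]
for y in range(1, 7):
    for x in (1, 4, 7):
        plain[y][x] = 1

# Same word with a solid row of dark pixels through the middle.
struck = [row[:] for row in plain]
struck[4] = [1] * 10

box = (0, 0, 10, 8)
print(looks_struck(plain, box), looks_struck(struck, box))
```

    Even this simple test has obvious false positives, such as dashes, table rules or skewed scans where the line is not horizontal, which illustrates why basic converters do not attempt it.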