Tags: python, text-parsing, text-extraction, pdfminer, pdf-scraping

Python PDFMiner - How to get the orientation of each word/sentence in a PDF?


Target: I want to extract the orientation of each word or sentence from a PDF like the attached one. The reason is that I want to keep only the text with an orientation of 0 degrees and discard the text at 90, 180 or 270 degrees.

What I have tried: The first thing I tried was the detect_vertical parameter of PDFMiner's LAParams, but it does not solve the problem.

With "detect_vertical=True" I get the text from all orientations, but the sentences rotated by 180 degrees (the inverted ones) come out in the wrong order:

Upper side, third line
Upper side, second line
This is the upper side of the box.

With "detect_vertical=False" the text on the sides (90 and 270 degrees) is returned one character at a time, and the text rotated by 180 degrees (the inverted one) is still returned in the wrong order.

Since I only want to keep the text with an orientation of 0 degrees, neither setting helps me.

The code used for this is the following:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LAParams

page_info = list(extract_pages('pdfminer/text_with_orientation.pdf',
                               laparams=LAParams(detect_vertical=True)))

for page in page_info:
    for element in page:
        # Text boxes are the only layout elements that carry text
        if isinstance(element, LTTextContainer):
            print(element.get_text())

The second thing I tried was to read this information from the lowest level of the PDF layout (LTChar), as described here: https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#working-with-rotated-characters

The code I used for this attempt is below. Unfortunately, I can only get the font name, the font size and the coordinates of each character, not its orientation:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LAParams, LTChar

page_info = list(extract_pages('pdfminer/text_with_orientation.pdf',
                               laparams=LAParams(detect_vertical=True)))
for page in page_info:
    for element in page:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    # Skip LTAnno objects (inserted spaces and newlines)
                    if isinstance(character, LTChar):
                        print('======================')
                        print('text:', character.get_text())
                        # [7:] drops a subset prefix such as "ABCDEF+"
                        print('fontname:', character.fontname[7:])
                        print('size:', character.size)
                        print('adv:', character.adv)  # textwidth * fontsize * scaling
                        print('matrix:', character.matrix)
                        (_, _, x, y) = character.bbox  # upper-right corner (x1, y1)
                        print('x dim:', x, 'and y dim:', y)
                        print('\n')

What I do not want to use:

I do not want to use Tesseract. I have already tried it and the results are not as good as with PDFMiner.

Any suggestions on this?


Solution

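  • After a lot of investigation I finally found a way to do this at character level, using the matrix attribute of LTChar.

    The matrix is a 6-tuple (a, b, c, d, e, f). Its first entry, a, is proportional to the cosine of the rotation angle: it is positive at 0 degrees, zero at 90 and 270 degrees, and negative at 180 degrees. So, to get all of the characters at 0 degrees, I do the following: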

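    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LAParams, LTChar

    # label_pages was not defined in the original snippet; it is assumed
    # here to be the extract_pages() result from the question.
    label_pages = list(extract_pages('pdfminer/text_with_orientation.pdf',
                                     laparams=LAParams(detect_vertical=True)))

    for page in label_pages:
        for element in page:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        # matrix[0] > 0 keeps only the upright characters
                        if isinstance(character, LTChar) and character.matrix[0] > 0:
                            print('======================')
                            print('text:', character.get_text())
                            print('matrix:', character.matrix)
                            (_, _, x, y) = character.bbox  # upper-right corner (x1, y1)
                            print('x dim:', x, 'and y dim:', y)
                            print('\n')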
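A note on the check above: matrix[0] > 0 is enough when the text is only rotated in multiples of 90 degrees, but it would also accept text at, say, 45 degrees. If you need the actual angle, a minimal sketch is to derive it from the first two matrix entries with atan2. The helper names rotation_degrees and upright_lines below are my own (hypothetical, not part of pdfminer.six), and treating a line as upright only when every LTChar in it is at 0 degrees is an assumption that may need adjusting for your PDFs.

```python
import math


def rotation_degrees(matrix):
    """Rotation encoded in an (a, b, c, d, e, f) matrix, in whole degrees.

    For a rotation by theta, a is proportional to cos(theta) and b to
    sin(theta), so atan2(b, a) recovers theta regardless of the scale.
    """
    a, b = matrix[0], matrix[1]
    return round(math.degrees(math.atan2(b, a))) % 360


def upright_lines(pdf_path):
    """Return the text of every line whose characters are all at 0 degrees.

    Hypothetical helper: keeps whole lines (including the spaces that
    pdfminer.six inserts as LTAnno) instead of single characters.
    """
    # Imported here so rotation_degrees() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    kept = []
    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for text_line in element:
                chars = [c for c in text_line if isinstance(c, LTChar)]
                # Assumption: a line counts as upright only if all of its
                # characters have a rotation of 0 degrees.
                if chars and all(rotation_degrees(c.matrix) == 0 for c in chars):
                    kept.append(text_line.get_text().rstrip("\n"))
    return kept
```

With the file from the question, this would be called as upright_lines('pdfminer/text_with_orientation.pdf'). Comparing rotation_degrees(character.matrix) against 90, 180 or 270 gives you the other orientations in the same way.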