Target: I want to extract the orientation of each word or sentence from a PDF like the attached one. The reason is that I want to keep only the text oriented at 0 degrees, not the text at 90, 180 or 270 degrees.
What I have tried: The first thing I tried was the detect_vertical parameter of PDFMiner's LAParams, but it does not help.
With "detect_vertical=True" I get the text from all orientations, but the sentences at 180 degrees (the upside-down ones) come out in the wrong order:
*Upper side, third line
Upper side, second line
This is the upper side of the box. *
With "detect_vertical=False" the text from the sides comes out one character at a time, and the text at 180 degrees (the upside-down one) is still in the wrong order.
Since I only want to keep the text oriented at 0 degrees, neither setting helps me.
The code used for this is the following:
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LAParams

    page_info = list(extract_pages('pdfminer/text_with_orientation.pdf',
                                   laparams=LAParams(detect_vertical=True)))
    for page in page_info:
        for element in page:
            if isinstance(element, LTTextContainer):
                print(element.get_text())
The second thing I tried was to get this info from the lowest level of the PDF layout (LTChar), as described here: https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#working-with-rotated-characters
The code I used for this attempt is below. Unfortunately I can only get the font name, the font size and the coordinates of each character, not the orientation:
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LAParams, LTChar

    page_info = list(extract_pages('pdfminer/text_with_orientation.pdf',
                                   laparams=LAParams(detect_vertical=True)))
    for page in page_info:
        for element in page:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            print('======================')
                            print('text:', character.get_text())
                            print('fontname:', character.fontname[7:])
                            print('size:', character.size)
                            print('adv:', character.adv)  # textwidth * fontsize * scaling
                            print('matrix:', character.matrix)
                            (_, _, x, y) = character.bbox
                            print('x dim:', x, 'and y dim:', y)
                            print('\n')
What I do not want to use:
I do not want to use Tesseract, as I have already tried it and the results are not as good as with PDFMiner.
Any suggestions?
After a lot of investigation I finally found a way to do this at the character level, using the matrix attribute of LTChar.
To get all of the characters at 0 degrees, I check that the first element of the matrix is positive:
    # label_pages is presumably the list returned by extract_pages(...),
    # like page_info in the question
    for page in label_pages:
        for element in page:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            # matrix[0] > 0 keeps the upright (0 degree) characters
                            if character.matrix[0] > 0:
                                print('======================')
                                print('text:', character.get_text())
                                print('matrix:', character.matrix)
                                (_, _, x, y) = character.bbox
                                print('x dim:', x, 'and y dim:', y)
                                print('\n')
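If the actual angle is needed rather than just a 0-degree filter, the same matrix can be turned into a rotation value. The following is only a minimal sketch, not something taken from the PDFMiner docs: it assumes the matrix is the usual 6-tuple (a, b, c, d, e, f) whose first two entries describe where the character's baseline points, and that positive angles are counter-clockwise. The names rotation_degrees and upright_text are made up for this illustration, and the angle is snapped to the nearest multiple of 90.

```python
import math


def rotation_degrees(matrix):
    """Return the rotation implied by a 6-tuple matrix, snapped to 0/90/180/270.

    Assumption: (a, b) = matrix[0:2] is the direction of the text baseline,
    so atan2(b, a) gives its angle. Scaling (e.g. font size) does not matter.
    """
    a, b = matrix[0], matrix[1]
    angle = math.degrees(math.atan2(b, a))
    return int(round(angle / 90.0)) * 90 % 360


def upright_text(chars):
    """Join the text of all characters whose rotation is 0 degrees.

    `chars` is any iterable of objects exposing `.matrix` and `.get_text()`,
    which is what LTChar provides.
    """
    return "".join(
        ch.get_text() for ch in chars if rotation_degrees(ch.matrix) == 0
    )
```

In the loop above, the `character.matrix[0] > 0` test could then be replaced by `rotation_degrees(character.matrix) == 0`, or the LTChar objects of a line could be passed to upright_text. For text rotated only in 90-degree steps, both checks should select the same characters.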