I am trying to extract text from this PDF using the Python library PyMuPDF.
The problem I am facing is that some of the sentences are being split across several lines. For example, the sentence "Art. 1º Esta Lei dispõe sobre a atividade profissional de musicoterapeuta." is extracted like this:
"Art.
1º
Esta
Lei
dispõe
sobre
a
atividade
profissional de musicoterapeuta. "
This happens because each word is separated by a large gap of whitespace, so PyMuPDF interprets each gap as a line break. I tried using the flag "TEXT_INHIBIT_SPACES" to solve that, as explained in the documentation, but the extracted text did not change. Can somebody help me with this problem, please?
When I use pypdf (a different library), I don't have this issue, but there are some functions I need that I only managed to use in PyMuPDF.
The code I am using:
import fitz  # PyMuPDF

def get_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        flags = fitz.TEXT_INHIBIT_SPACES
        text += page.get_text("text", flags=flags, sort=True)
    return text
Thanks!
Recompose the text lines from the coordinates of each word. This should deliver lines as expected in most cases, at least as long as the single characters have not been written in an arbitrary permutation of the "natural" reading sequence.
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")

for page in doc:
    words = page.get_text("words", sort=True)  # words sorted vertical, then horizontal
    if not words:  # page without any text: nothing to recompose
        continue
    line = [words[0]]  # list of words in same line
    for w in words[1:]:
        w0 = line[-1]  # get previous word
        if abs(w0[3] - w[3]) <= 3:  # same line (approx. same bottom coord)
            line.append(w)
        else:  # new line starts
            line.sort(key=lambda w: w[0])  # sort words in line left-to-right
            # print text of line
            text = " ".join([w[4] for w in line])
            print(text)
            line = [w]  # init line list again
    # print last line (sorted left-to-right like the others)
    line.sort(key=lambda w: w[0])
    text = " ".join([w[4] for w in line])
    print(text)
    print(chr(12))  # print a form feed char as page break
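If you need the text returned as a string rather than printed, as in the question's get_text, the same grouping can be wrapped in a small helper. This is only a sketch: the name words_to_lines and the tolerance parameter are made up for illustration, and the sample tuples are hand-written stand-ins for the (x0, y0, x1, y1, text, block_no, line_no, word_no) items that page.get_text("words") returns.

```python
def words_to_lines(words, tolerance=3):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    Hypothetical helper: words whose bottom coordinates (index 3) differ
    by at most `tolerance` points are treated as one line.
    """
    lines = []    # finished text lines
    current = []  # words of the line being built
    # Sort by bottom coordinate, then by left coordinate.
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if current and abs(current[-1][3] - w[3]) > tolerance:
            # Bottom coordinate jumped: close the current line.
            current.sort(key=lambda w: w[0])  # left-to-right
            lines.append(" ".join(x[4] for x in current))
            current = []
        current.append(w)
    if current:  # flush the last line
        current.sort(key=lambda w: w[0])
        lines.append(" ".join(x[4] for x in current))
    return lines


# Hand-made word tuples standing in for page.get_text("words") output.
sample = [
    (50, 100, 70, 112.0, "Art.", 0, 0, 0),
    (120, 100.5, 135, 112.5, "1º", 0, 1, 0),
    (180, 99.8, 210, 111.8, "Esta", 0, 2, 0),
    (50, 120, 70, 132.0, "Lei", 0, 3, 0),
]
lines = words_to_lines(sample)
print("\n".join(lines))
```

With a real document, you would call words_to_lines(page.get_text("words")) for each page and join the results with whatever page separator you prefer. A tolerance of 3 points matches the threshold used above; it may need adjusting for very small or very large fonts.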
Note: I am a maintainer and the original creator of PyMuPDF.