python pdf pymupdf

How to avoid PyMuPDF (Fitz) interpreting large gaps between words as a newline character?


I am trying to extract text from this PDF using the Python library PyMuPDF.

The problem I am facing is that some of the sentences are being split into several sentences. For example, the sentence "Art. 1º Esta Lei dispõe sobre a atividade profissional de musicoterapeuta." is extracted like this:

"Art.

Esta

Lei

dispõe

sobre

a

atividade

profissional de musicoterapeuta. "

This happens because the words are separated by large gaps of whitespace, which PyMuPDF interprets as line breaks. I tried the flag "TEXT_INHIBIT_SPACES" to solve that, as described in the documentation, but the extracted text did not change. Can somebody help me with this problem, please?

When I use pypdf (a different library), I don't have this issue, but there are some functions I need that I only managed to use in PyMuPDF.

The code I am using:

import fitz  # PyMuPDF

def get_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        flags = fitz.TEXT_INHIBIT_SPACES
        text += page.get_text("text", flags=flags, sort=True)
    return text

Thanks!


Solution

  • Recompose the text lines from the coordinates of each word. This should deliver the expected lines in most cases, at least as long as the individual characters have not been written in an arbitrary permutation of the "natural" reading sequence.

    import fitz  # PyMuPDF
    
    doc = fitz.open("input.pdf")
    for page in doc:
        words = page.get_text("words", sort=True)  # words sorted vertically, then horizontally
        if not words:  # page has no extractable text
            print(chr(12))  # still emit the page break
            continue
        line = [words[0]]  # words belonging to the current line
        for w in words[1:]:
            w0 = line[-1]  # previous word
            if abs(w0[3] - w[3]) <= 3:  # same line (approx. same bottom coordinate)
                line.append(w)
            else:  # a new line starts
                line.sort(key=lambda w: w[0])  # sort words in line left-to-right
                print(" ".join(w[4] for w in line))  # print text of line
                line = [w]  # start collecting the next line
        # print the last line of the page
        line.sort(key=lambda w: w[0])
        print(" ".join(w[4] for w in line))
        print(chr(12))  # print a form feed char as page break
    

    Note: I am a maintainer and the original creator of PyMuPDF.
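
    For anyone who wants the lines returned as a list instead of printed (closer to the get_text function in the question), the same grouping idea can be sketched as a small pure-Python helper. This is not part of the original answer: the function name group_words_into_lines and the tolerance parameter are made up for illustration, and the sample tuples are invented data that only mimic the (x0, y0, x1, y1, text, block_no, line_no, word_no) layout of items from page.get_text("words").

```python
def group_words_into_lines(words, tolerance=3):
    """Join word tuples into text lines by comparing bottom coordinates.

    Hypothetical helper: each item is assumed to be laid out like a tuple
    from page.get_text("words"), i.e. (x0, y0, x1, y1, text, ...).
    """
    lines = []
    current = []
    # Sort by bottom coordinate first, then by left edge.
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        # A bottom coordinate further away than `tolerance` starts a new line.
        if current and abs(current[-1][3] - w[3]) > tolerance:
            lines.append(current)
            current = []
        current.append(w)
    if current:
        lines.append(current)
    # Within each line, order the words left-to-right and join their text.
    return [
        " ".join(w[4] for w in sorted(line, key=lambda w: w[0]))
        for line in lines
    ]


# Invented sample data: three widely spaced words on one line, one word below.
sample = [
    (150, 101, 180, 113, "Esta", 0, 2, 0),
    (10, 100, 30, 112, "Art.", 0, 0, 0),
    (80, 100, 100, 112.5, "1º", 0, 1, 0),
    (10, 120, 40, 132, "Lei", 1, 0, 0),
]
print(group_words_into_lines(sample))
```

    With PyMuPDF this could be called per page as group_words_into_lines(page.get_text("words")). The tolerance of 3 is taken from the code above and may need tuning for documents with very small or very large fonts.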