[SOLVED] How to avoid PyMuPDF (Fitz) interpreting large gaps between words as a newline character?

I am trying to extract text from this PDF using the Python library PyMuPDF.

The problem I am facing is that some of the sentences are being split into several sentences. For example, the sentence "Art. 1º Esta Lei dispõe sobre a atividade profissional de musicoterapeuta." is extracted like this:

"Art.

1º

Esta

Lei

dispõe

sobre

atividade

profissional de musicoterapeuta. "

This seems to happen because the words are separated by large gaps of whitespace, which PyMuPDF treats as line breaks. I tried the flag "TEXT_INHIBIT_SPACES" to solve that, as explained in the documentation, but the extracted text did not change. Can somebody help me with this problem, please?

When I use pypdf (a different library), I don't have this issue, but there are some functions I need that I only managed to use in PyMuPDF.

The code I am using:

import fitz  # PyMuPDF

def get_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        flags = fitz.TEXT_INHIBIT_SPACES
        text += page.get_text("text", flags=flags, sort=True)
    return text

Thanks!

Solution

Recompose the text lines from the coordinates of each word. This should deliver the expected lines in most cases, at least as long as the individual characters have not been written in an arbitrary permutation of the "natural" reading sequence.

import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
for page in doc:
    words = page.get_text("words", sort=True)  # words sorted vertically, then horizontally
    if not words:  # page without text: only emit the page break
        print(chr(12))
        continue
    line = [words[0]]  # list of words in same line
    for w in words[1:]:
        w0 = line[-1]  # get previous word
        if abs(w0[3] - w[3]) <= 3:  # same line (approx. same bottom coord)
            line.append(w)
        else:  # new line starts
            line.sort(key=lambda w: w[0])  # sort words in line left-to-right
            # print text of line
            text = " ".join([w[4] for w in line])
            print(text)
            line = [w]  # init line list again
    # print last line (sorted left-to-right like the others)
    line.sort(key=lambda w: w[0])
    text = " ".join([w[4] for w in line])
    print(text)
    print(chr(12))  # print a form feed char as page break
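
If the text is needed as a string (as in the question's get_text function) rather than printed, the same grouping idea can be wrapped in a helper. The following is only a minimal sketch under stated assumptions: the names words_to_lines and tolerance are hypothetical, and the input is assumed to be word tuples laid out as (x0, y0, x1, y1, text, ...), which is the layout page.get_text("words") returns.

```python
def words_to_lines(words, tolerance=3):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines of text.

    Hypothetical helper: a word is treated as belonging to the current line
    when its bottom coordinate (index 3) differs from the previous word's
    by at most `tolerance` (assumed to be in PDF points).
    """
    # Sort by bottom coordinate first, then by left edge.
    ordered = sorted(words, key=lambda w: (w[3], w[0]))
    lines = []
    current = []  # words collected for the line being built
    for w in ordered:
        if current and abs(current[-1][3] - w[3]) > tolerance:
            # Bottom coordinate jumped: finish the current line.
            current.sort(key=lambda x: x[0])  # left-to-right
            lines.append(" ".join(x[4] for x in current))
            current = []
        current.append(w)
    if current:  # flush the last line, if any
        current.sort(key=lambda x: x[0])
        lines.append(" ".join(x[4] for x in current))
    return lines
```

With PyMuPDF, the text of one page could then be built as "\n".join(words_to_lines(page.get_text("words"))). The tolerance of 3 simply mirrors the threshold used above and may need tuning for documents with very small or very large fonts.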

Note: I am a maintainer and the original creator of PyMuPDF.