Tags: pdf, pdfminer, pdftotext, pymupdf, pdfium

Ways to separate passages in a PDF using the gap between them?


I have some PDFs with 2-3 passages on every page. Each passage is separated from the next by a line gap, but when reading the page with PyMuPDF I cannot see any machine-readable separator between passages. Is there another way, or another library, that can do this?

Code:

import fitz  # PyMuPDF

doc = fitz.open('IT_past.pdf')
page = doc.load_page(0)  # put the page number here (0-based)
text = page.get_text('text')
print(text)

Page screenshot: [image]

PDF: Full pdf [link]


Solution

  • There is no gap as such. For the moment, since it is much easier, let's look closer at the rendering in your linked viewer:

    [image: viewer rendering of the page]

       
    So let's replicate what is inside the real PDF (which has no web-side HTML <p> markers):

    support, product design, HR Management, knowledge process outsourcing for
    pharmaceutical companies and large complex projects.
    Software exports make up 20 % of India's total export revenue in 2003-04, up from 4.9 %
    in 1997.This figure is expected to go up to 44% of annual exports by 2010. Though India
    

    See, there is "no gap", just left-aligned, non-justified (ragged) text. Each piece of text needs a style, such as a font name, and explicit positions to hold it in place on a page devoid of line feeds or true carriage returns. (Occasionally there are some backspaces or vertical/horizontal moves, but they are generally meaningless as line-printer text.) Even tabs, indents and some spacing characters are normally discarded in a PDF printout.

    If you want gaps or line wraps, you need to add them yourself.
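One way to add the gaps back with PyMuPDF is to compare the vertical positions of consecutive lines. The following is only a minimal sketch, assuming a single-column page and that a passage break shows up as a noticeably larger line-to-line distance than usual. The helper names `lines_from_page` and `group_by_gap` and the `factor` threshold are made up for illustration and are not part of PyMuPDF; only `page.get_text("dict")` is a library call.

```python
from statistics import median


def lines_from_page(page):
    """Collect (y_top, text) for every text line of a PyMuPDF page."""
    lines = []
    for block in page.get_text("dict")["blocks"]:
        if block.get("type") != 0:  # 0 = text block; skip image blocks
            continue
        for line in block["lines"]:
            y_top = line["bbox"][1]
            text = "".join(span["text"] for span in line["spans"])
            if text.strip():
                lines.append((y_top, text))
    return lines


def group_by_gap(lines, factor=1.5):
    """Start a new passage when the distance to the previous line
    exceeds factor * the median line-to-line distance (assumed threshold)."""
    lines = sorted(lines, key=lambda item: item[0])
    if len(lines) < 2:
        return [text for _, text in lines]
    pitches = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    typical = median(pitches)
    passages = [[lines[0][1]]]
    for pitch, (_, text) in zip(pitches, lines[1:]):
        if pitch > factor * typical:
            passages.append([text])  # unusually large jump: new passage
        else:
            passages[-1].append(text)
    return ["\n".join(p) for p in passages]
```

With the page object from the question this would be called as `group_by_gap(lines_from_page(page))`. The threshold will likely need tuning per document, and multi-column pages would need the lines grouped by column first.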

    A good alternative is to export the text with -layout using pdftotext from poppler or xpdf. Here the output goes to - (the console); you can pipe it, or replace the - with a path/name.txt. Many other options are available, such as -nopgbrk.

    xpdf-tools-win-4.04\bin32>pdftotext -f 1 -l 1 -layout IT_past.pdf -

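The same export can be driven from Python. This is a rough sketch, assuming `pdftotext` is on the PATH and that the gap between passages comes out as one or more blank lines in the -layout text, which is worth checking on your own file. The function names `layout_text` and `split_on_blank_lines` are hypothetical helpers; the command-line flags are the ones used above.

```python
import re
import subprocess


def layout_text(pdf_path, page):
    """Run pdftotext -layout on a single page and return the text."""
    result = subprocess.run(
        ["pdftotext", "-f", str(page), "-l", str(page), "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_on_blank_lines(text):
    """Split text into passages wherever one or more blank lines occur."""
    text = text.replace("\f", "")  # drop the form feed written at each page end
    parts = re.split(r"\n\s*\n", text)
    return [part.strip("\n") for part in parts if part.strip()]
```

Usage would be along the lines of `split_on_blank_lines(layout_text("IT_past.pdf", 1))`, which returns one string per passage.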