python pdf text-search pymupdf

Can text be searched block-wise in a PDF using PyMuPDF?


page.getTextBlocks()

Output

[(42.5, 86.45002746582031, 523.260009765625, 100.22002410888672, TEXT, 0, 0),
(65.75, 103.4000244140625, 266.780029296875, 159.59010314941406, TEXT, 1, 0),
(48.5, 86.123456, 438.292048492, 100.92920404974, TEXT, 0, 0)]

(x0, y0, x1, y1, "lines in block", block_type, block_no)

My main aim is:

to search for a text in a PDF and highlight it. The text to be searched for can occur on a page any number of times. Using tp.search(text,hit_max=1) limits the maximum number of occurrences, but that won't solve the problem: it selects the first occurrence of the text, whereas for me the second or third occurrence may be the important one.

My Idea is:

getTextBlocks extracts the text as shown above. Using this information, specifically the block_no, I want to run the page.searchFor function on that particular block only. Logically it should be possible, but in practice I need help on how to do it.

I would appreciate any input on achieving the main aim.

Thanks


Solution

  • As a preface let me say that your question would benefit the issue page of my repository.

    Page.searchFor() searches for any number of text items on the page. The only restriction is the number of hits, for which you must specify a limit in the call. But you can use any number here (100, for example). This method extracts no text, ignores character casing, and also supports non-horizontal text or text spread across multiple lines. Its output can be used directly to create text marker annotations and more.

    You are of course free to extract text by using variations of Page.getText(option) and then apply your finesse to find what you want in the output. option may be "text", "words", "blocks", "dict", "rawdict", "html", "xhtml", or "xml". Each output obviously has its pros and cons. Many of the variants come with text position information, or font information including text color, etc. But as I said, it is up to you how you locate things. Let me suggest again that we continue this conversation on the GitHub repo issue page, where I can better point to other resources. Or feel free to use my private e-mail.

    If what you want is to (1) locate text occurrences and then (2) link each occurrence to a text block number, just make a list of block rectangles and check, for each occurrence, whether it is contained in one of them:

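    # block_rectangles must hold fitz.Rect objects, e.g. built from the block tuples
    block_rectangles = [fitz.Rect(b[:4]) for b in page.getTextBlocks()]
    for j, rect in enumerate(page.searchFor(text, hit_max=100)):
        for i, bbox in enumerate(block_rectangles):
            if rect in bbox:  # True if the hit lies inside the block
                print("occurrence %i is contained in block %i" % (j, i))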
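
To pick, say, the second occurrence inside a particular block, the containment check above can be turned into a lookup from block index to hit indices. The following is only a minimal sketch in plain Python, so it runs without PyMuPDF. The helper names `contains` and `hits_by_block` are hypothetical, the coordinates are made up, and rectangles are assumed to be plain `(x0, y0, x1, y1)` tuples. With real data, the hits would come from `page.searchFor(...)` and the blocks from `page.getTextBlocks()`.

```python
from typing import Dict, List, Sequence, Tuple

RectTuple = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def contains(outer: RectTuple, inner: RectTuple, tol: float = 0.0) -> bool:
    """Return True if `inner` lies inside `outer`, with `tol` points of slack."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return (ix0 >= ox0 - tol and iy0 >= oy0 - tol
            and ix1 <= ox1 + tol and iy1 <= oy1 + tol)


def hits_by_block(hits: Sequence[Sequence[float]],
                  blocks: Sequence[Sequence]) -> Dict[int, List[int]]:
    """Map each block index to the indices of the hits it contains.

    Only the first four fields of a block (its bounding box) are used;
    the text and the trailing number fields are ignored.
    """
    result: Dict[int, List[int]] = {}
    for j, hit in enumerate(hits):
        for i, block in enumerate(blocks):
            if contains(tuple(block[:4]), tuple(hit)):
                result.setdefault(i, []).append(j)
                break  # assign the hit to the first block that contains it
    return result


# Made-up data for illustration (hypothetical text and coordinates).
blocks = [
    (42.5, 86.45, 523.26, 100.22, "first block text", 0, 0),
    (65.75, 103.40, 266.78, 159.59, "second block text", 1, 0),
]
hits = [
    (50.0, 88.0, 90.0, 99.0),     # lies inside block 0
    (70.0, 110.0, 110.0, 121.0),  # lies inside block 1
    (70.0, 140.0, 110.0, 151.0),  # lies inside block 1
]

mapping = hits_by_block(hits, blocks)
print(mapping)

# Index (into `hits`) of the second occurrence within block 1.
second_in_block_1 = mapping[1][1]
print(second_in_block_1)
```

Once the hit index is known, only that one rectangle needs to be passed on for highlighting. The `tol` parameter is an assumption added for cases where a hit rectangle sticks out of its block by a fraction of a point.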