python pdf nlp pdfminer

Get page number of certain string using pdfminer


I would like to find the page number of a certain string in a PDF document using pdfminer.six. A sample PDF document is available for reproduction. We can use the extract_pages function to find the number of pages and extract_text to extract the text, but I'm not sure how to find the page on which a given string occurs. Suppose we want the page number of the string "File 2", which is on page 2. According to another answer, we could use the page_numbers argument of extract_pages. Here is some code I tried:

from pdfminer.high_level import extract_pages, extract_text

file = 'sample.pdf'

for i in range(len(list(extract_pages(file)))):
    extract_pages(file, page_numbers=i, maxpages=len(list(extract_pages(file))))

But I don't see how to get the page number of a given string from this. Could anyone explain how to find the page number of a certain string in a PDF document?


Solution

  • To find the page number of a certain string, iterate over the pages returned by the extract_pages function and check whether the string occurs in the text of each page's layout elements. When the string is found, record the page number.

    Here's an example:

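    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer
    
    file = 'sample.pdf'
    search_string = "abc"
    
    # extract_pages yields one layout object per page, in document order
    for page_number, page in enumerate(extract_pages(file), start=1):
        for element in page:
            # Only text containers have get_text(); skip figures, lines, images
            if not isinstance(element, LTTextContainer):
                continue
            if search_string in element.get_text():
                print(f"Found '{search_string}' on page {page_number}")
                break  # report each page only once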
    

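    This code iterates over each page and its layout elements, takes the text of each element, and searches it for the string "abc". When the string is found, it prints the number of the page on which it occurs. Page numbers from enumerate start at 0, so the printed number is adjusted to be 1-based.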
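
    If you need the page numbers as a list rather than printed output, the same idea can be wrapped in two small functions. This is only a sketch: the names iter_page_texts and find_pages are hypothetical helpers, not part of pdfminer.six, and collapsing whitespace before matching is an assumption made so that a string broken across lines in the layout can still match.

```python
def iter_page_texts(path):
    """Yield the text of each page of the PDF at `path`, in page order."""
    # Imported lazily so find_pages() can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page in extract_pages(path):
        # Join the text of all text containers on the page; other layout
        # elements (figures, lines, images) carry no text and are skipped.
        yield "".join(
            element.get_text()
            for element in page
            if isinstance(element, LTTextContainer)
        )


def find_pages(page_texts, search_string):
    """Return the 1-based numbers of the pages whose text contains search_string."""
    # Assumption: runs of whitespace are treated as a single space on both sides.
    needle = " ".join(search_string.split())
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        if needle in " ".join(text.split()):
            hits.append(page_number)
    return hits
```

    For the sample file in the question, this would be called as find_pages(iter_page_texts('sample.pdf'), "File 2"). Keeping the search separate from the extraction means the matching logic can be tried on plain strings without a PDF at hand.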