I would like to find the page number of a certain string in a PDF document using pdfminer.six. Here you can find a reproducible PDF document. We can use the extract_pages function to find the number of pages and extract_text to extract the text, but I'm not sure how to find the page a certain string is on. Suppose we want the page number of the string "File 2", which is on page 2. According to this answer, we could use the page_numbers argument of extract_pages. Here is some code I tried:
from pdfminer.high_level import extract_pages, extract_text
file = 'sample.pdf'
for i in range(len(list(extract_pages(file)))):
    extract_pages(file, page_numbers=i, maxpages=len(list(extract_pages(file))))
But I don't understand how to get from this to the page number of a certain string. Could anyone explain how to find the page number of a certain string in a PDF document?
To find the page number of a certain string, you can iterate over the pages returned by the extract_pages function and search the text of each page's elements for the string. When the string is found, you can record the page number.
Here's an example:
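from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

file = 'sample.pdf'
search_string = "abc"

# extract_pages yields one layout object per page; count pages from 1
for page_number, page in enumerate(extract_pages(file), start=1):
    for element in page:
        # Only text containers have get_text(); figures, lines, etc. do not
        if isinstance(element, LTTextContainer) and search_string in element.get_text():
            print(f"Found '{search_string}' on page {page_number}")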
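This code iterates through each page, reads the text of each element on it, and searches that text for the string "abc". Each time the string is found, it prints the page number where it is located. Because the search runs per element, a string that is split across two layout elements will not be matched.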
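If you want the page numbers back as a list rather than printed, the same idea can be split into two small helpers. This is only a sketch: iter_page_texts and find_pages are names made up for this example, not part of pdfminer.six, and it assumes that joining the text of all LTTextContainer elements on a page is good enough for your document. Joining the page text first also lets a match span several layout elements.

```python
from typing import Iterable, Iterator, List


def iter_page_texts(pdf_path: str) -> Iterator[str]:
    """Yield the text of each page, in order (hypothetical helper; needs pdfminer.six)."""
    # Imported lazily so find_pages below can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path):
        # Concatenate only the elements that actually carry text.
        yield "".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )


def find_pages(page_texts: Iterable[str], needle: str) -> List[int]:
    """Return the 1-based numbers of the pages whose text contains needle (hypothetical helper)."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text
    ]
```

With the sample file from the question, the call would look like find_pages(iter_page_texts('sample.pdf'), "File 2"). An empty list means the string was not found on any page.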