python python-3.x pdf pdf-manipulation

PDF file manipulation (open a large PDF file, find a keyword, record the pages where it was found, then split out those pages and merge them into one PDF)


I'm working on a project for a friend of mine. I want to find one specific keyword in a large PDF file (40-60 pages or more). The keyword appears on multiple pages and is also duplicated in other places in the file. I then want to keep in memory the pages where the keyword was found, split those pages out of the original PDF file, and finally merge them together.

I'm thinking about using PDFMiner or PyPDF2 (I'm open to other suggestions as well).

I've already written most of the code, but I can't figure out a good, efficient way to search the file for the keyword, given that it also appears in other places in the same file. I also need to make sure that the data I extract from the original file contains no duplicates and that nothing is missed.

Thanks in Advance.


Solution

  • Have you tried splitting the PDF file into a few blocks and searching each block for the keyword using multithreading? This should be faster.
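
A minimal sketch of the block-wise search, not a definitive implementation. It assumes the `pypdf` package (the successor to PyPDF2) is installed; the helper names `matching_pages`, `block_ranges`, `search_in_blocks`, `search_pdf_block` and `extract_keyword_pages`, the block size and the worker count are all hypothetical choices made for illustration. Each block opens its own reader so that threads share nothing, and the hits are collected in a set, so a page is kept once even if the keyword appears on it several times. Whether threads actually give a speedup is not guaranteed: text extraction is CPU-bound and CPython's GIL limits what threads can gain there, so it is worth timing against a plain single pass.

```python
from concurrent.futures import ThreadPoolExecutor


def matching_pages(texts, keyword, offset=0):
    """Return 0-based page indices whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [offset + i for i, text in enumerate(texts)
            if needle in (text or "").lower()]


def block_ranges(num_pages, block_size):
    """Split range(num_pages) into consecutive (start, stop) blocks."""
    return [(start, min(start + block_size, num_pages))
            for start in range(0, num_pages, block_size)]


def search_in_blocks(search_block, num_pages, block_size=10, workers=4):
    """Run search_block(start, stop) on each block in a thread pool.

    Results are merged into a set, so every page index appears once.
    """
    hits = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for found in pool.map(lambda r: search_block(*r),
                              block_ranges(num_pages, block_size)):
            hits.update(found)
    return sorted(hits)


def search_pdf_block(src_path, start, stop, keyword):
    """Search pages [start, stop) of a PDF using a reader private to this call."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    reader = PdfReader(src_path)
    texts = (reader.pages[i].extract_text() for i in range(start, stop))
    return matching_pages(texts, keyword, offset=start)


def extract_keyword_pages(src_path, keyword, out_path):
    """Copy every page containing keyword into a new PDF; return the indices."""
    from pypdf import PdfReader, PdfWriter  # assumption: pypdf is installed

    reader = PdfReader(src_path)
    indices = search_in_blocks(
        lambda start, stop: search_pdf_block(src_path, start, stop, keyword),
        len(reader.pages),
    )
    writer = PdfWriter()
    for i in indices:  # sorted, so the original page order is preserved
        writer.add_page(reader.pages[i])
    with open(out_path, "wb") as fh:
        writer.write(fh)
    return indices
```

With hypothetical file names, `extract_keyword_pages("input.pdf", "keyword", "matches.pdf")` would write the matching pages to `matches.pdf` and return their 0-based page numbers. Note that this only finds text that the library can extract; scanned pages without a text layer would need OCR first.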