r pdf split tabulizer

Split PDF according to pages in R


I have a PDF file with multiple pages, but I am interested in only a subset of them. For example, my original PDF has 30 pages and I want only pages 10 to 16.

I tried the function split_pdf from the tabulizer package, which can only split the PDF page by page (resulting in 200 files, one for each page), followed by merge_pdfs (which merges PDF files). It works properly, but it takes ages (and I have around 2000 PDF files to split).

This is the code I am using:

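library(tabulizer)

# Split the PDF into one file per page; returns the paths of the page files
split = split_pdf('file_path')

start = 10
end = 16

# Merge only the wanted pages back into a single PDF
merge_pdfs(split[start:end], 'saving_path')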

I couldn't find a better way to do this. Any help would be appreciated.


Solution

  • Unfortunately, it is not clear to me what kind of data your PDF contains or what you are trying to extract from it, so I outline two approaches.

    1. If the PDF contains tables, you should be able to extract the data from those pages using:

      tab <- tabulizer::extract_tables(file = "path/file.pdf", pages = 10:16)

    2. If you only want the text, use pdftools, which is a lot faster:

      text <- pdftools::pdf_text("path/file.pdf")[10:16]
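
If what you actually want is a smaller PDF that contains only pages 10 to 16, rather than the extracted tables or text, one option that avoids the split-then-merge round trip is pdf_subset from the qpdf package (pdftools re-exports it). The following is a minimal sketch, not a tested solution for your files: extract_pages and extract_pages_batch are hypothetical helper names, and the directory paths in the usage comment are placeholders.

```r
# Requires the qpdf package: install.packages("qpdf")
library(qpdf)

# Hypothetical helper: copy the pages in `pages` from `input` into a new
# PDF at `output`, without writing one file per page first.
extract_pages <- function(input, output, pages = 10:16) {
  n <- pdf_length(input)  # number of pages in the input PDF
  if (any(pages < 1 | pages > n)) {
    stop("Requested pages are outside 1:", n, " for ", input)
  }
  pdf_subset(input, pages = pages, output = output)
}

# Hypothetical helper: apply extract_pages to every PDF in `in_dir`,
# writing the results under the same file names in `out_dir`.
extract_pages_batch <- function(in_dir, out_dir, pages = 10:16) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(in_dir, pattern = "\\.pdf$", full.names = TRUE)
  vapply(
    files,
    function(f) extract_pages(f, file.path(out_dir, basename(f)), pages),
    character(1)
  )
}

# Usage (placeholder paths):
# extract_pages("path/file.pdf", "path/file_p10-16.pdf", pages = 10:16)
# extract_pages_batch("path/to/pdfs", "path/to/output", pages = 10:16)
```

Because each file is read and written once, this should be considerably cheaper than creating a file for every page, although I have not benchmarked it against your 2000 files.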