r pdf

Split PDF file into separate files by n pages in R


I want to split a large PDF file into separate files of 3 pages each. However, qpdf can write only one PDF file per call, so I have to reopen the large PDF again and again. Is there a way to write every chunk without reopening the whole file each time?


Solution

  • You should be able to get a result in under 10 seconds by shelling out to an OS binary. I tested a 1332-page file (1332splitMe.pdf, 16,536,595 bytes) with this cpdf shell command:

    ..\cpdf -split 1332splitMe.pdf -o chunk-0%%%.pdf -chunk 3
    

    In return I got the requested 3-page files, chunk-0001.pdf to chunk-0444.pdf, with a combined total of 44,371,068 bytes.

    The increase in total size is expected and cannot be avoided: each new file must carry its own copy of the overhead that was shared when the pages were originally combined, merged, or printed into one file.

    However, not all chunk files carry the same overhead. For example, my original has 16 fonts, but a given smaller file may need only a few subset fonts.

    Binaries are at https://github.com/coherentgraphics/cpdf-binaries

    There are many other tools, such as a Ghostscript rewrite (also AGPL), which, like qpdf, can be instructed to cleave off 3 pages. However, most other PDF command-line tools require looping over a set of page numbers again and again, which defeats your programming question of how to avoid repeatedly reopening the file.

    The core gain comes from using one PDF command that loops internally, rather than calling a similar command from your own numeric loop.
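
    To illustrate the difference, here is a minimal shell sketch of the do-it-yourself loop that a single-call split replaces. The function name chunk_ranges is hypothetical (it is not part of cpdf or qpdf); it only prints the page ranges that a range-based tool would have to be called with, once per chunk.

    ```shell
    # Hypothetical helper: print one "first-last" page range per chunk.
    # Usage: chunk_ranges TOTAL_PAGES PAGES_PER_CHUNK
    chunk_ranges() {
        total=$1
        size=$2
        first=1
        while [ "$first" -le "$total" ]; do
            last=$((first + size - 1))
            # The final chunk may be shorter than the others.
            if [ "$last" -gt "$total" ]; then
                last=$total
            fi
            echo "${first}-${last}"
            first=$((last + 1))
        done
    }

    # Small demonstration: a 7-page document in chunks of 3.
    chunk_ranges 7 3
    ```

    For the 1332-page test file and chunks of 3 this produces 444 ranges, so a range-based tool would be invoked 444 times against the same input.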

    One simpler alternative is to shell out to the pdfcpu binary, which is significantly slower (by several times, but still under a minute). Its output file names are more useful because they include page numbers, although without leading zeros.

    pdfcpu split 1332splitMe.pdf . 3
    
    splitting 1332splitMe.pdf to ./...
    
    optimizing...
    writing 1332splitMe_1-3.pdf...
    writing 1332splitMe_4-6.pdf...
    writing 1332splitMe_7-9.pdf...
    writing 1332splitMe_10-12.pdf...
    ...
    writing 1332splitMe_1321-1323.pdf...
    writing 1332splitMe_1324-1326.pdf...
    writing 1332splitMe_1327-1329.pdf...
    writing 1332splitMe_1330-1332.pdf...
    

    The 444 files total a similar 44,436,556 bytes (just a tad bigger than the cpdf output).

    For prebuilt binaries see https://github.com/pdfcpu/pdfcpu/releases

    If you want to convert those numeric names into more meaningful ones (perhaps based on content), you will need other shell tools or commands to rename the files.
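
    As a starting point for such renaming, the sketch below zero-pads the page numbers in pdfcpu-style names so the files sort in page order. It assumes names of the form stem_first-last.pdf with unpadded numbers, as in the log above; padded_name and rename_chunks are hypothetical helper names, and the 4-digit width is an arbitrary choice.

    ```shell
    # Hypothetical helper: turn "stem_1-3.pdf" into "stem_0001-0003.pdf".
    # Names that do not end in _<digits>-<digits>.pdf are printed unchanged.
    # Assumes the numbers carry no leading zeros already.
    padded_name() {
        name=$1
        stem=${name%.pdf}      # e.g. 1332splitMe_1-3
        prefix=${stem%_*}      # e.g. 1332splitMe
        range=${stem##*_}      # e.g. 1-3
        first=${range%-*}      # e.g. 1
        last=${range#*-}       # e.g. 3
        case $first in ''|*[!0-9]*) printf '%s\n' "$name"; return 0 ;; esac
        case $last in ''|*[!0-9]*) printf '%s\n' "$name"; return 0 ;; esac
        printf '%s_%04d-%04d.pdf\n' "$prefix" "$first" "$last"
    }

    # Hypothetical helper: rename every matching PDF in the given directory.
    rename_chunks() {
        dir=$1
        for path in "$dir"/*_*.pdf; do
            [ -e "$path" ] || continue      # the glob matched nothing
            base=${path##*/}
            new=$(padded_name "$base")
            [ "$base" = "$new" ] || mv -- "$path" "$dir/$new"
        done
    }

    # Example (not run here): rename_chunks .
    ```

    Content-based names would need an extra step to extract text or metadata from each file, which this sketch does not attempt.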

    NOTE: if each chunk of pages is already bookmarked in the input PDF, it may be possible to name the output files from the outline.