If I have a multi-page PDF and split it into separate pages using the excellent poppler package (installed on macOS with brew install poppler), like this:
pdfseparate foo.pdf bar-%04d.pdf
and then rejoin the resulting bar-####.pdf files, like so:
pdfunite bar-*.pdf baz.pdf
The resulting baz.pdf appears to have the same content, but the file is much larger.
At first I assumed this was due to duplicate metadata in the result, or something along those lines. But even if I strip all metadata from all files, i.e. from the input, the intermediate bar-####.pdf files, and the resulting output file, using exiftool and qpdf like this:
# command line steps to strip metadata from (and re-linearize) example.pdf :
exiftool -all= -overwrite_original example.pdf ;
mv example.pdf temp.pdf ;
qpdf --linearize temp.pdf example.pdf
then the resulting baz.pdf is still much larger than the original input.
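To get a rough idea of where the extra bytes go (just a quick check, using the same foo.pdf and baz.pdf names as above), I can compare the file sizes and the number of objects in each file's cross-reference table with qpdf:

# compare file sizes of the original and the rejoined file
ls -l foo.pdf baz.pdf

# count the PDF objects in each file
# (qpdf --show-xref prints one line per entry in the cross-reference table)
qpdf --show-xref foo.pdf | wc -l
qpdf --show-xref baz.pdf | wc -l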
What can be the cause of this? What else can there be in a multi-page PDF file other than its bare contents? That is assuming that poppler's pdfseparate and pdfunite leave the actual content itself untouched, and that my stripping of metadata is correct.
Or is it possible that pdfseparate and pdfunite somehow decompose and reconstruct the PDF contents in a way that is lossless but sub-optimal? (I don't know enough about the inner structure of PDF files, but I can imagine there are a lot of different ways to encode the same content.)
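One way I could test that idea (a rough sketch, assuming a qpdf version that supports object streams, which recent versions do; the -opt file names are arbitrary) is to push both files through the same qpdf optimization pass and compare the results. If baz.pdf shrinks back to roughly the original size, the bloat was just sub-optimal encoding rather than extra content:

# rewrite both files with identical settings (compressed streams, object streams)
qpdf --compress-streams=y --object-streams=generate foo.pdf foo-opt.pdf
qpdf --compress-streams=y --object-streams=generate baz.pdf baz-opt.pdf

# compare the normalized sizes
ls -l foo-opt.pdf baz-opt.pdf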
By the way, if I inspect any of the involved PDF files using exiftool somefile.pdf, it does indeed show no metadata at all (and Linearized: Yes).
PDF pages use shared resources such as fonts, images, etc. When you split the document, the shared resources are copied into each resulting file. When you merge those files, the resources might not be merged back (this depends on how the merging tool is implemented), thus resulting in a much larger file.
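One way to check this (a rough sketch using the file names from the question; pdffonts ships with poppler, and Ghostscript would need to be installed separately, e.g. with brew install ghostscript) is to list the embedded fonts of both files, and then see whether a rewrite that consolidates duplicates shrinks the merged file again:

# list embedded fonts with their object IDs; in the merged file the same
# font typically shows up once per page instead of once per document
pdffonts foo.pdf
pdffonts baz.pdf

# possible workaround: let Ghostscript rebuild the merged file, which
# usually consolidates the duplicated fonts and images
gs -q -o baz-rebuilt.pdf -sDEVICE=pdfwrite baz.pdf

Note that the Ghostscript route re-encodes the document, so it is not guaranteed to be byte-for-byte identical; check the output before discarding the original.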