Tags: pdf, metadata, exiftool, poppler, qpdf

Splitting and rejoining PDFs with Poppler results in a larger file (despite stripping metadata)?


If I have a multi-page PDF and split it into separate pages using the excellent Poppler package (installed on macOS with brew install poppler), like this:

pdfseparate foo.pdf bar-%04d.pdf

and then rejoin the resulting bar-####.pdf files, like so:

pdfunite bar-*.pdf baz.pdf

The resulting baz.pdf appears to have the same content, but the file is much larger.

At first I assumed this was because of duplicate metadata in the result, or something similar. But even if I strip all metadata from all files (the input, the intermediate bar-####.pdf files, and the resulting output file) using exiftool and qpdf, like this:

# command line steps to strip metadata from (and re-linearize) example.pdf :
exiftool -all= -overwrite_original example.pdf ;
mv example.pdf temp.pdf ;
qpdf --linearize temp.pdf example.pdf

Even then, the resulting baz.pdf file is much larger than the original input.

What could be causing this? What else can a multi-page PDF file contain besides its bare contents? I am assuming that Poppler's pdfseparate and pdfunite leave the actual content untouched, and that my stripping of metadata is correct.

Or is it possible that pdfseparate and pdfunite somehow decompose and reconstruct the PDF contents in a way that is lossless but sub-optimal? (I don't know enough about the inner structure of PDF files, but I can imagine there are many different ways to encode the same content.)

By the way, if I inspect any of the PDF files involved using exiftool somefile.pdf, it does indeed show no metadata at all (and Linearized: Yes).


Solution

  • PDF pages use shared resources such as fonts and images. When you split the document, the shared resources are copied into each resulting file. When you merge those files, the copies might not be merged back into one (this depends on how the merge tool is implemented), which results in a much larger file.
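
One way to check whether this is what happened is to compare how many font objects the original and the rejoined file contain. The following is only a rough sketch: it assumes Poppler's pdffonts is on the PATH and that its listing starts with two header lines (column names and a row of dashes), and the helper names count_rows and report are made up for illustration. If baz.pdf reports many more font entries than foo.pdf, often the same font name repeated with different object IDs, the shared resources were probably duplicated. A similar check for images should be possible with pdfimages -list.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two PDFs; not part of Poppler itself.

# Count the data rows of a Poppler listing read from stdin.
# Assumption: the listing begins with two header lines, as pdffonts prints.
count_rows() {
    tail -n +3 | wc -l | tr -d ' '
}

# Print the size in bytes and the number of font entries for one PDF.
report() {
    size=$(wc -c < "$1" | tr -d ' ')
    fonts=$(pdffonts "$1" | count_rows)
    printf '%s: %s bytes, %s font entries\n' "$1" "$size" "$fonts"
}

# Usage, with the file names from the question:
#   report foo.pdf
#   report baz.pdf
```

The counts only hint at duplication; they do not prove it, since a document can legitimately embed many distinct fonts.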