pdf latex pdf-generation ghostscript pdfrw

Crop PDF Content

I have a pdf that I would like to impose. It has 8.5x11" pages, media box, and crop box. I want the pdf to have 17x11" pages, by merging adjacent pages. Unfortunately, most pages have content either completely outside or straddling the crop box. Because each page can only have a single stream and crop box, when imposed, the overlapping content becomes visible. This is bad.

I don't want to rasterize my pdf because that would fix the DPI ahead-of-time. So I won't consider exporting pages as images, appending the images (imagemagick), then embedding these paired images into a new pdf.

I've also had problems imposing in postscript - issues with transparency, font rasterization, and other visual glitches during the pdf->ps->pdf conversions.

The answer should be scriptable.

So far I've tried:

podofo imposition scripts (lua)
PyPDF2 (python)
ghostscript
latex

The question "Ghostscript removes content outside the crop box?" suggests that ghostscript's pdfwrite module, when generating an output pdf file, will rasterize and crop content according to the crop box. So I'd only have to pipe my pdf through ghostscript's pdfwrite module. Unfortunately, this doesn't work.

I was about to give up when I tried printing the pdf to another pdf through evince. It works perfectly - text & vector elements within the crop box are not rasterized, and elements outside the crop box are removed (I haven't tested straddling elements yet). The quality is high - resolution (page size) and appearance are identical. In fact, everything seems to be the same except for the metadata.

So:

the question is possible
the answer already exists

How can I access it?

I think this functionality might be provided by cup's pdftopdf binary. I don't have any problems calling an external binary.... but can't figure out how to use pdftopdf.

Edit: Link to test pdf. It contains raster, vector, and text items - some partially occluded by partially transparent items - that span as well as abut adjacent pages. Once again, printing this PDF through cups appears to crop all content outside the crop box. However, opening the filtered pdf in inkscape shows that the off-page items are individually masked, not cropped - except text, which is trimmed.

Solution

The trick is to use Form XObjects to impose multiple pages within a single page. Form XObjects can reference entire PDF pages, and maintain independent clips. PyPDF2 doesn't support Form XObjects, so merging unifies the stream of all input pages such that they share the clip/media box of the output page. I've been successful in using both pdflatex and pdfrw (python) - test programs are inlined below. Since Form XObjects are derived from a similar postscript level 2 feature, as suggested by KenS it should be possible to achieve the same goal in ghostscript using "page clips". In fact he shared a ghostscript 2x1 imposition script in another answer, but it appears horrendously complicated. Combined with the font rasterization issues of poppler's pdftops (even with compatibility level > 1.4), I've abandoned the ghostscript approach.

Latex script derived from How to stitch two PDF pages together as one big page?. Requires pdflatex:

\documentclass{article}
\usepackage{pdfpages}
\usepackage[paperwidth=8.5in, paperheight=11in]{geometry}
\usepackage[multidot]{grffile}
\pagestyle{plain}

\begin{document}
    \setlength\voffset{+0.0in}
    \setlength\hoffset{+0.0in}

    \includepdf[ noautoscale=true
               , frame=false
               , pages={1}
               ]
               {<file.pdf>}

    \eject \paperwidth=17in \pdfpagewidth=17in \paperheight=11in \pdfpageheight=11in 

    \includepdf[ nup=2x1
               , noautoscale=true
               , frame=false
               , pages={2-,}
               ]
               {<file.pdf>}
\end{document}

pdfrw (python script) derived from pdfrw:examples:booklet. Requires pdfrw >= 0.2:

#!/usr/bin/env python3

# Copyright:
#   Yclept Nemo
#       2016
# License:
#   GPLv3

import itertools
import argparse
import pdfrw

# from itertool recipes in the python documentation
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

def pagemerge(page, *pages):
    merged = pdfrw.PageMerge() + page
    for page in reversed(list(itertools.takewhile(lambda i: i is not None, reversed(pages)))):
        merged = merged + page
        merged[-1].x = merged[-2].x + merged[-2].w
    return merged.render()

parser = argparse.ArgumentParser(description='Impose PDF files using Form XOBjects')

parser.add_argument\
    ( "source"
    , help="PDF, source path"
    , type=pdfrw.PdfReader
    )
parser.add_argument\
    ( "-s", "--spacer"
    , help="PDF, spacer path"
    , type=lambda fp: next(iter(pdfrw.PdfReader(fp).pages), None)
    )
parser.add_argument\
    ( "target"
    , help="PDF, target path"
    )

args = parser.parse_args()

pages = args.source.pages[:1]

for pair in grouper(args.source.pages[1:], 2):
    assert pair[0] is not None
    pages.append(pagemerge(pair[0], args.spacer, pair[1]))

# include metadata in target
target = pdfrw.PdfWriter()
target.addpages(pages)
target.trailer.Info = args.source.Info
target.write(args.target)

Some idiosyncrasies as of pdfrw 0.2:

Note that the operations +=, append and extend are not defined for pdfrw.PageMerge, even though it behaves like a list. Furthermore + acts like += in that it modifies the left-hand-side object.