java pdf pdfbox

Split A4 page into A7 sections


I'm given a document of A4 pages with 8 A7 sections on each page. I need to extract the data from each A7 area of each page because they're related.

Is it possible to break each A4 page into its 8 A7 sections and go through the data in each one?

This is the PDF file I'm dealing with: https://s3.us-east-2.amazonaws.com/s3.barcodegen-website.io/programada+pdf+teste.pdf

(Regarding A4/A7 paper sizes, see ISO 216 at Wikipedia.)


Solution

  • Splitting PDF pages raises a number of secondary issues, such as what to do with half a glyph (or half a hyperlink). Internal hyperlinks will therefore usually be discarded, though external ones may need keeping.

    We also need to check for duplicated resources: a source A4 of 526 KB (539,607 bytes) may come out at a slightly different size, here 537 KB (550,093 bytes). The output is sometimes oddly smaller, but in this case it is only slightly larger.


    Rasterising the pages to images is not acceptable, since at this scale the barcodes are likely to be destroyed.

    Comparison figure: raster image on the left (notice the bad infill), vector on the right, which stays accurate for scanning.

    Cropped duplication is not always a good solution, as contents can overlap within a page. In this case, however, the page can be decimated cleanly into 4 x 2 pages, seen here in facing pairs. At that stage we can also see that the offsets vary and are not perfect for such splitting, so either the source positions need altering or the page boundaries need sliding in different directions.


    Corrected result, as seen in Acrobat Reader etc.:
    mutool poster -x 4 -y 2 -r programada.pdf output.pdf


    The nearest to the desired cropping is:

    cpdf -shift-boxes "-20 0" TOTVS.pdf -o tempout1.pdf
    cpdf -chop "4 2" tempout1.pdf -o tempout2.pdf
    mutool trim -b MediaBox -o final.pdf tempout2.pdf
    

    or

    cpdf -shift-boxes "-20 0" TOTVS.pdf -o tempout1.pdf
    mutool poster -x 4 -y 2 -r tempout1.pdf tempout2.pdf
    mutool trim -b MediaBox -o final.pdf tempout2.pdf
    

    Both should produce similar, cleaner A7-size pages.
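
Since the question is tagged Java, the 4 x 2 split used by the commands above can also be worked out as plain rectangles. The following is only a minimal sketch of that arithmetic, not a full solution: the class name `A7Grid` and the method `tiles` are made up for illustration, it assumes the A4 page is in landscape orientation (which is what a 4-across, 2-down grid of A7 tiles implies), and the offset parameters are a stand-in for the box shifting mentioned above.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or any other library.
class A7Grid {

    /**
     * Splits a page of the given size (in PDF points) into cols x rows equal
     * tiles, listed row by row starting at the top-left tile.
     * Coordinates use a top-left origin, with y growing downward.
     * offsetX/offsetY slide the whole grid, for sources whose content is not
     * aligned with the page edges.
     */
    static List<Rectangle2D.Double> tiles(double pageWidth, double pageHeight,
                                          int cols, int rows,
                                          double offsetX, double offsetY) {
        double w = pageWidth / cols;
        double h = pageHeight / rows;
        List<Rectangle2D.Double> out = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                out.add(new Rectangle2D.Double(offsetX + c * w, offsetY + r * h, w, h));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed page: A4 landscape, 297 mm x 210 mm, converted at 72 points per inch.
        double a4Width = 297 / 25.4 * 72;
        double a4Height = 210 / 25.4 * 72;
        for (Rectangle2D.Double tile : tiles(a4Width, a4Height, 4, 2, 0, 0)) {
            System.out.println(tile);
        }
    }
}
```

If you only need the text of each A7 area, rectangles like these could be passed to PDFBox's `PDFTextStripperByArea.addRegion`, which as far as I know expects top-left-origin coordinates. PDF page boxes themselves use a bottom-left origin, so the y values would need flipping before being used as crop boxes. Any offset has to be found by inspecting the actual file.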