Tags: image, pdfbox, poppler

Images inverted and split when extracting images from a PDF document using PDFBox or Poppler


I want to extract the whole images on each page of a PDF document using PDFBox in Java, but all extracted images are inverted and split. Note that this is not a bug in PDFBox or Poppler; it results from how the PDF document itself is structured. How can I piece the whole image together and get the correct orientation for every image? Could anybody give me some advice? A snippet of Java code is preferred. My PDF link: download


Solution

  • At first glance, each of the figures in question appeared to be drawn in a separate block of content stream instructions that is enclosed by text objects but does not itself contain any. One approach to isolating the figures is therefore to export each such block of instructions to a new page of its own. You can then post-process these new pages, e.g. by rendering them as bitmap images using a PDFRenderer.

    I based the code for this on the PdfContentStreamEditor (originally from this answer):

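    // PDFBox 2.x API; PdfContentStreamEditor is the helper class mentioned above, not part of PDFBox itself
    PDDocument document = PDDocument.load(...);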
    
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
            ByteArrayOutputStream commonRaw = null;
            ContentStreamWriter commonWriter = null;
            int depth = 0;
    
            @Override
            public void processPage(PDPage page) throws IOException {
                commonRaw = new ByteArrayOutputStream();
                try {
                    commonWriter = new ContentStreamWriter(commonRaw);
                    startFigurePage(page);
                    super.processPage(page);
                } finally {
                    endFigurePage();
                    commonRaw.close();
                }
            }
    
            @Override
            protected void write(ContentStreamWriter contentStreamWriter, Operator operator,
                    List<COSBase> operands) throws IOException {
                String operatorString = operator.getName();
                if (operatorString.equals("BT")) {
                    endFigurePage();
                }
                if (operatorString.equals("q")) {
                    depth++;
                }
                writeFigure(operator, operands);
                if (operatorString.equals("Q")) {
                    depth--;
                }
                if (operatorString.equals("ET")) {
                    startFigurePage(getCurrentPage());
                }
    
                super.write(contentStreamWriter, operator, operands);
            }
    
            OutputStream figureRaw = null;
            ContentStreamWriter figureWriter = null;
            PDPage figurePage = null;
            int xobjectsDrawn = 0;
            int pathsPainted = 0;
    
            void startFigurePage(PDPage currentPage) throws IOException {
                figurePage = new PDPage(currentPage.getMediaBox());
                figurePage.setResources(currentPage.getResources());
                PDStream stream = new PDStream(document);
                figurePage.setContents(stream);
                figureWriter = new ContentStreamWriter(figureRaw = stream.createOutputStream(COSName.FLATE_DECODE));
                figureRaw.write(commonRaw.toByteArray());
                xobjectsDrawn = 0;
                pathsPainted = 0;
            }
    
            void endFigurePage() throws IOException {
                if (figureWriter != null) {
                    figureWriter = null;
                    figureRaw.close();
                    figureRaw = null;
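                    // heuristic: keep the page only if it plausibly shows a figure,
                    // i.e. at least one XObject was drawn or more than 3 paths were painted
                    if (xobjectsDrawn > 0 || pathsPainted > 3)
                        document.addPage(figurePage);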
                    figurePage = null;
                }
            }
    
            final List<String> PATH_PAINTING_OPERATORS = Arrays.asList("S", "s", "F", "f", "f*",
                    "B", "B*", "b", "b*");
    
            void writeFigure(Operator operator, List<COSBase> operands) throws IOException {
                if (figureWriter != null) {
                    String operatorString = operator.getName();
                    boolean isXObjectDo = operatorString.equals("Do");
                    boolean isPathPainting = PATH_PAINTING_OPERATORS.contains(operatorString);
                    if (isXObjectDo)
                        xobjectsDrawn++;
                    if (isPathPainting)
                        pathsPainted++;
                    figureWriter.writeTokens(operands);
                    figureWriter.writeToken(operator);
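                    // outside any q ... Q pair, also record the instruction in the common prefix
                    // (XObject drawing is skipped, path painting is replaced by the no-op "n")
                    if (depth == 0) {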
                        if (!isXObjectDo) {
                            if (isPathPainting)
                                operator = Operator.getOperator("n");
                            commonWriter.writeTokens(operands);
                            commonWriter.writeToken(operator);
                        }
                    }
                }
            }
        };
        editor.processPage(page);
    }
    
    document.save(new File(RESULT_FOLDER, "my-isolatedFigures.pdf"));
    

    (IsolateFigures test testIsolateInMy)
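    The splitting rule used by the editor above can be looked at in isolation. The following is only a simplified sketch that works on plain operator names instead of a real content stream; the class `FigureBlockSketch` and the method `isolateFigureBlocks` are hypothetical names introduced for illustration, and the sketch leaves out the common prefix and the `q`/`Q` depth tracking of the actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model (an assumption for illustration, not PDFBox API):
// operates on operator names only, without operands or a real content stream.
final class FigureBlockSketch {

    // Same path-painting operators as in the editor code
    static final List<String> PATH_PAINTING_OPERATORS =
            Arrays.asList("S", "s", "F", "f", "f*", "B", "B*", "b", "b*");

    // Splits the operator sequence at text objects (BT ... ET) and keeps
    // only the blocks that draw an XObject or paint more than 3 paths.
    static List<List<String>> isolateFigureBlocks(List<String> operators) {
        List<List<String>> figures = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean inFigure = true;   // a block is open from the start of the page
        int xobjectsDrawn = 0;
        int pathsPainted = 0;

        for (String op : operators) {
            if (op.equals("BT") && inFigure) {
                // a text object starts: close the current block
                if (xobjectsDrawn > 0 || pathsPainted > 3) {
                    figures.add(current);
                }
                inFigure = false;
            }
            if (inFigure) {
                current.add(op);
                if (op.equals("Do")) {
                    xobjectsDrawn++;
                }
                if (PATH_PAINTING_OPERATORS.contains(op)) {
                    pathsPainted++;
                }
            }
            if (op.equals("ET")) {
                // the text object ended: open a fresh block
                current = new ArrayList<>();
                xobjectsDrawn = 0;
                pathsPainted = 0;
                inFigure = true;
            }
        }
        // the end of the page closes the last open block
        if (inFigure && (xobjectsDrawn > 0 || pathsPainted > 3)) {
            figures.add(current);
        }
        return figures;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
                "q", "cm", "Do", "Q",          // image block: kept
                "BT", "Tf", "Tj", "ET",        // text object: skipped
                "m", "l", "S",                 // a single line: dropped
                "BT", "Tj", "ET",
                "re", "f", "re", "f", "re", "f", "re", "f"); // 4 fills: kept
        System.out.println(isolateFigureBlocks(page));
    }
}
```

    The thresholds mirror the `xobjectsDrawn > 0 || pathsPainted > 3` condition of the editor; they are a heuristic for this document and may need tuning for other PDFs.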

    The first figures are extracted quite well:

    [Images: S30 a, S30 b, S31 a, S31 b]

    Certain figures, though, turn out to contain text objects themselves; they are therefore split into partial images and lose their text content:

    [Images: S32 b 1, S32 b 2, S32 b 3, S32 b 4]