java pdfbox pdf-parsing

Apache PDFBox - vertical match between image and text position


I need help to achieve a mapping between text and image objects in a PDF document.

As the first figure shows, my PDF documents have 3 images arranged randomly in the y-direction. To the left of them are texts. The texts extend along the height of the images.

[Figure 1: three images at arbitrary vertical positions, with text to the left of each]

My goal is to combine the texts into "ImObj" objects (see the class ImObj).

The second figure shows that I want to use the height of each image to select the texts that belong to it (any text outside the vertical extent of an image should be ignored). In the example, the 3 images yield 3 ImObj objects.

[Figure 2: the vertical extent of each image selects the text assigned to it]

The PDF file is available here (on WeTransfer): [enter link description here][3]

But my mapping does not work, probably because I use the wrong coordinates for the image. I have already looked at some examples, but I still don't understand how to get the coordinates of text and images into the same coordinate system. Here is my code:

import java.awt.Image;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

public class ImExample extends PDFTextStripper {

    public static void main(String[] args) {
        File file = new File("C://example document.pdf");

        try {
            PDDocument document = PDDocument.load(file);

            ImExample example = new ImExample();

            for (int pnr = 0; pnr < document.getPages().getCount(); pnr++) {
                PDPage page = document.getPages().get(pnr);
                PDResources res = page.getResources();
            
                example.processPage(page);
             
                int idx = 0;

                for (COSName objName : res.getXObjectNames()) {
                    PDXObject xObj = res.getXObject(objName);
                    if (xObj instanceof PDImageXObject) {

                        System.out.println("...add a new image");

                        PDImageXObject imXObj = (PDImageXObject) xObj;
                        BufferedImage image = imXObj.getImage();

                        // Here is my mistake ... but I do not know how to solve it.
                        ImObj imObj = new ImObj(image, idx++, pnr, image.getMinY(), image.getMinY() + image.getHeight());
                        example.imObjects.add(imObj);
                    }
                } 
            }

            example.setSortByPosition(true);
            example.getText(document);

            // Output
            for (ImObj iObj : example.imObjects)
                System.out.println(iObj.idx + " -> " + iObj.text);

            document.close();

        } catch (Exception e) {
            e.printStackTrace();
        } 
    }

    public List<ImObj> imObjects = new ArrayList<ImObj>();

    public ImExample() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {

        // match between imagesize and textposition
        TextPosition txtPos = textPositions.get(0);

        for (ImObj im : imObjects) {
            if(im.page == (this.getCurrentPageNo()-1))
                if (im.minY < txtPos.getY() && (txtPos.getY() + txtPos.getHeight()) < im.maxY)
                    im.text.append(text + " "); 
        }
    }
}

class ImObj {

    float minY, maxY;

    Image image = null;
    StringBuilder text = new StringBuilder("");
    int idx, page = 0;

    public ImObj(Image im, int idx, int pnr, float yMin, float yMax) {
        this.idx = idx;
        this.image = im;
        this.minY = yMin;
        this.maxY = yMax;
        this.page = pnr;
    } 
}

Best regards


Solution

  • You're looking for the images in the (somewhat) wrong place!

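    You iterate over the image XObject resources of the page itself and inspect them. But this is not helpful: a resource entry only says which images are available to the page, not where (or how often) they are drawn. The BufferedImage you retrieve carries no page position either; its getMinY() is always 0 and its getHeight() is the bitmap height in pixels.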

    What you actually need is to parse the page content stream for uses of images and the current transformation matrix at the time of use. For a basic implementation of this have a look at the PDFBox example PrintImageLocations.
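To illustrate why the transformation matrix matters: PDF paints an image into the unit square, and the current transformation matrix maps that square onto the page. The sketch below is not PDFBox code; it uses only java.awt.geom, and the class ImageExtent, the method verticalExtent and the sample matrix values are hypothetical. In PrintImageLocations the matrix is read with getGraphicsState().getCurrentTransformationMatrix() while the "Do" operator is being processed.

```java
import java.awt.geom.AffineTransform;

class ImageExtent {

    /**
     * Returns {minY, maxY} of the unit square after mapping it through the
     * matrix [a b c d e f], i.e. the vertical extent of an image drawn with
     * that matrix, in PDF user space (origin at the bottom left, y upward).
     */
    static double[] verticalExtent(double a, double b, double c,
                                   double d, double e, double f) {
        // AffineTransform takes its six values in the same order as a PDF matrix.
        AffineTransform ctm = new AffineTransform(a, b, c, d, e, f);

        // The four corners of the unit square as x,y pairs.
        double[] corners = {0, 0, 1, 0, 0, 1, 1, 1};
        double[] mapped = new double[8];
        ctm.transform(corners, 0, mapped, 0, 4);

        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 1; i < mapped.length; i += 2) { // odd indices hold y values
            min = Math.min(min, mapped[i]);
            max = Math.max(max, mapped[i]);
        }
        return new double[] {min, max};
    }

    public static void main(String[] args) {
        // Hypothetical matrix: an image scaled to 200 x 150 units,
        // lower-left corner at (300, 500).
        double[] y = verticalExtent(200, 0, 0, 150, 300, 500);
        System.out.println(y[0] + " .. " + y[1]); // prints 500.0 .. 650.0
    }
}
```

Taking the minimum and maximum over all four corners keeps the result valid for rotated or mirrored images, where the matrix entry d alone is not the drawn height.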


    The next problem you'll run into is that the coordinates PDFBox returns from the TextPosition methods getX and getY are not in the original coordinate system of the PDF page in question but in a coordinate system normalized for easier handling in the text extraction code. Thus, you most likely should use the un-normalized coordinates.

    You can find information on that in this answer.
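Once image extents and text positions are in the same coordinate system, the matching itself is a plain interval test. The sketch below shows only that step, with no PDFBox dependency; the class VerticalMatcher, its Band type and all sample values are hypothetical. It assumes both the band limits and the text baseline are given in PDF user space (origin at the bottom left). In PDFBox 2.x, txtPos.getTextMatrix().getTranslateY() is one candidate for such an un-normalized baseline value, but check that against the linked answer before relying on it.

```java
import java.util.ArrayList;
import java.util.List;

class VerticalMatcher {

    /** A vertical band on the page, for example the extent of one image. */
    static final class Band {
        final double minY;
        final double maxY;
        final StringBuilder text = new StringBuilder();

        Band(double minY, double maxY) {
            this.minY = minY;
            this.maxY = maxY;
        }
    }

    /** Appends the text to every band whose vertical range contains the baseline. */
    static void assign(List<Band> bands, String text, double baselineY) {
        for (Band band : bands) {
            if (band.minY <= baselineY && baselineY <= band.maxY) {
                if (band.text.length() > 0) {
                    band.text.append(' ');
                }
                band.text.append(text);
            }
        }
    }

    public static void main(String[] args) {
        List<Band> bands = new ArrayList<>();
        bands.add(new Band(500, 650)); // hypothetical extent of the first image
        bands.add(new Band(200, 300)); // hypothetical extent of the second image

        assign(bands, "first", 640);   // inside the first band
        assign(bands, "ignored", 400); // between the bands, so dropped
        assign(bands, "second", 250);  // inside the second band
        assign(bands, "line", 210);    // also inside the second band

        for (Band band : bands) {
            System.out.println(band.text); // prints "first", then "second line"
        }
    }
}
```

The test here uses only the baseline. Requiring the full glyph height to fit inside the band, as the question's code does, would be a stricter variant.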