java pdf accessibility pdfbox tagged-pdf

How to get a tag's bounding boxes in PDFBox when there are no Layout attributes (/A in the document catalog structure)?

I want to highlight the bounding boxes of a particular tag when the user selects that tag in the structure tree. I can already get the bboxes when the tag contains attributes (/A).

However, in some PDFs there are no attributes (/A), yet Adobe Acrobat DC can still highlight the content (bboxes) when you select a particular tag. How can I get the bboxes in this case? The code I used to get the attribute-based bboxes is:

String inputPdfFile = "D:/Documents/pdfs/res.pdf";
// try-with-resources closes the document when done
try (PDDocument old_document = PDDocument.load(new File(inputPdfFile))) {
    PDStructureTreeRoot treeRoot = old_document.getDocumentCatalog().getStructureTreeRoot();
    for (Object kid : treeRoot.getKids()) {
        // kids can also be marked-content references, so check before casting
        if (!(kid instanceof PDStructureElement))
            continue;
        for (Object kid2 : ((PDStructureElement) kid).getKids()) {
            if (!(kid2 instanceof PDStructureElement))
                continue;
            PDStructureElement kid2c = (PDStructureElement) kid2;
            for (Object kid3 : kid2c.getKids()) {
                if (kid3 instanceof PDStructureElement) {
                    PDStructureElement kid3c = (PDStructureElement) kid3;
                    System.out.println(kid3c.getAttributes());
                }
            }
        }
    }
}

The pdf link is https://drive.google.com/file/d/1_-tuWuReaTvrDsqQwldTnPYrMHSpXIWp/view?usp=sharing

Any help would be appreciated.

Solution

To determine the actual bounding boxes of the text of some marked content (in contrast to those given in Structure Element Layout Attributes), you can use the PDFBox PDFMarkedContentExtractor and combine its results with the PDF structure tree data.

The following code does so and creates an output PDF in which the determined bounding boxes are enclosed in colored rectangles:

PDDocument document = PDDocument.load(SOURCE);

Map<PDPage, Map<Integer, PDMarkedContent>> markedContents = new HashMap<>();

for (PDPage page : document.getPages()) {
    PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor();
    extractor.processPage(page);

    Map<Integer, PDMarkedContent> theseMarkedContents = new HashMap<>();
    markedContents.put(page, theseMarkedContents);
    for (PDMarkedContent markedContent : extractor.getMarkedContents()) {
        addToMap(theseMarkedContents, markedContent);
    }
}

PDStructureNode root = document.getDocumentCatalog().getStructureTreeRoot();
Map<PDPage, PDPageContentStream> visualizations = new HashMap<>();
showStructure(document, root, markedContents, visualizations);
for (PDPageContentStream canvas : visualizations.values())
    canvas.close();

document.save(RESULT);

(from the VisualizeMarkedContent method visualize)

It uses the following helper method for recursively mapping the PDMarkedContent objects by their MCID:

void addToMap(Map<Integer, PDMarkedContent> theseMarkedContents, PDMarkedContent markedContent) {
    theseMarkedContents.put(markedContent.getMCID(), markedContent);
    for (Object object : markedContent.getContents()) {
        if (object instanceof PDMarkedContent) {
            addToMap(theseMarkedContents, (PDMarkedContent)object);
        }
    }
}

(VisualizeMarkedContent helper method)
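
To illustrate the recursive indexing idea without a PDFBox dependency, here is a minimal sketch. The class `Marked` is a hypothetical stand-in for PDMarkedContent, and the names `McidIndex` and `index` are made up for this example. It also assumes that a sequence without an MCID reports a negative value and skips such entries so that they do not overwrite each other under a shared key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class McidIndex {
    /** Hypothetical stand-in for PDMarkedContent: an MCID (negative if absent) plus nested sequences. */
    static final class Marked {
        final int mcid;
        final List<Marked> children;

        Marked(int mcid, Marked... children) {
            this.mcid = mcid;
            this.children = new ArrayList<>(Arrays.asList(children));
        }
    }

    /** Recursively registers the given sequence and all nested sequences under their MCIDs. */
    static void index(Map<Integer, Marked> byMcid, Marked marked) {
        if (marked.mcid >= 0) { // assumption: a negative value means "no MCID"
            byMcid.put(marked.mcid, marked);
        }
        for (Marked child : marked.children) {
            index(byMcid, child);
        }
    }

    public static void main(String[] args) {
        Marked nested = new Marked(7);
        Marked outer = new Marked(-1, new Marked(3, nested));
        Map<Integer, Marked> byMcid = new HashMap<>();
        index(byMcid, outer);
        System.out.println(byMcid.keySet()); // contains 3 and 7, but not -1
    }
}
```

The lookup by MCID then works regardless of how deeply a sequence is nested in the page content.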

The method showStructure recursively determines the bounding box of each structure element and draws a rectangle for it. A structure element can contain content across pages, so its boxes variable holds a mapping of pages to bounding boxes:

Map<PDPage, Rectangle2D> showStructure(PDDocument document, PDStructureNode node, Map<PDPage, Map<Integer, PDMarkedContent>> markedContents, Map<PDPage, PDPageContentStream> visualizations) throws IOException {
    Map<PDPage, Rectangle2D> boxes = null;
    PDPage page = null;
    if (node instanceof PDStructureElement) {
        PDStructureElement element = (PDStructureElement) node;
        page = element.getPage();
    }
    Map<Integer, PDMarkedContent> theseMarkedContents = markedContents.get(page);
    for (Object object : node.getKids()) {
        if (object instanceof COSArray) {
            for (COSBase base : (COSArray) object) {
                if (base instanceof COSDictionary) {
                    boxes = union(boxes, showStructure(document, PDStructureNode.create((COSDictionary) base), markedContents, visualizations));
                } else if (base instanceof COSNumber) {
                    boxes = union(boxes, page, showContent(((COSNumber)base).intValue(), theseMarkedContents));
                } else {
                    System.out.printf("?%s\n", base);
                }
            }
        } else if (object instanceof PDStructureNode) {
            boxes = union(boxes, showStructure(document, (PDStructureNode) object, markedContents, visualizations));
        } else if (object instanceof Integer) {
            boxes = union(boxes, page, showContent((Integer)object, theseMarkedContents));
        } else {
            System.out.printf("?%s\n", object);
        }

    }
    if (boxes != null) {
        Color color = new Color((int)(Math.random() * 256), (int)(Math.random() * 256), (int)(Math.random() * 256));

        for (Map.Entry<PDPage, Rectangle2D> entry : boxes.entrySet()) {
            page = entry.getKey();
            Rectangle2D box = entry.getValue();
            if (box == null)
                continue;

            PDPageContentStream canvas = visualizations.get(page);
            if (canvas == null) {
                canvas = new PDPageContentStream(document, page, AppendMode.APPEND, false, true);
                visualizations.put(page, canvas);
            }
            canvas.saveGraphicsState();
            canvas.setStrokingColor(color);
            canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
            canvas.stroke();
            canvas.restoreGraphicsState();
        }
    }
    return boxes;
}

(VisualizeMarkedContent method)
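
The page-to-rectangle bookkeeping can be looked at in isolation. The following is only a sketch under assumptions: an `Integer` page index stands in for PDPage, and `PageBoxes`, `add` and `addAll` are hypothetical names. It uses `Map.merge` with `Rectangle2D.createUnion`, which returns a new rectangle instead of modifying an existing one.

```java
import java.awt.geom.Rectangle2D;
import java.util.HashMap;
import java.util.Map;

final class PageBoxes {
    /** Grows the rectangle stored for pageIndex so that it also covers box; null boxes are ignored. */
    static void add(Map<Integer, Rectangle2D> boxes, int pageIndex, Rectangle2D box) {
        if (box != null) {
            // createUnion builds a fresh rectangle, so stored rectangles are never mutated
            boxes.merge(pageIndex, box, Rectangle2D::createUnion);
        }
    }

    /** Folds all per-page boxes of a child element into the parent's map. */
    static void addAll(Map<Integer, Rectangle2D> parent, Map<Integer, Rectangle2D> child) {
        child.forEach((page, box) -> add(parent, page, box));
    }

    public static void main(String[] args) {
        Map<Integer, Rectangle2D> parent = new HashMap<>();
        add(parent, 0, new Rectangle2D.Double(10, 10, 20, 5));
        Map<Integer, Rectangle2D> child = new HashMap<>();
        add(child, 0, new Rectangle2D.Double(10, 30, 40, 5));
        add(child, 1, new Rectangle2D.Double(0, 0, 1, 1));
        addAll(parent, child);
        System.out.println(parent); // one enclosing rectangle per page
    }
}
```

Each structure element ends up with at most one enclosing rectangle per page, which is what the drawing loop iterates over.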

The method showContent determines the bounding box of text associated with a given MCID, recursing if need be.

Rectangle2D showContent(int mcid, Map<Integer, PDMarkedContent> theseMarkedContents) throws IOException {
    Rectangle2D box = null;
    PDMarkedContent markedContent = theseMarkedContents != null ? theseMarkedContents.get(mcid) : null;
    List<Object> contents = markedContent != null ? markedContent.getContents() : Collections.emptyList();
    StringBuilder textContent =  new StringBuilder();
    for (Object object : contents) {
        if (object instanceof TextPosition) {
            TextPosition textPosition = (TextPosition)object;
            textContent.append(textPosition.getUnicode());

            int[] codes = textPosition.getCharacterCodes();
            if (codes.length != 1) {
                System.out.printf("<!-- text position with unexpected number of codes: %d -->", codes.length);
            } else {
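                // calculateGlyphBounds returns null when no glyph path is available
                Shape glyphBounds = calculateGlyphBounds(textPosition.getTextMatrix(), textPosition.getFont(), codes[0]);
                if (glyphBounds != null)
                    box = union(box, glyphBounds.getBounds2D());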
            }
        } else if (object instanceof PDMarkedContent) {
            PDMarkedContent thisMarkedContent = (PDMarkedContent) object;
            box = union(box, showContent(thisMarkedContent.getMCID(), theseMarkedContents));
        } else {
            textContent.append("?" + object);
        }
    }
    return box;
}

(VisualizeMarkedContent method)

The previous two methods showStructure and showContent make use of the following helpers to build the (page-wise) union of bounding boxes:

Map<PDPage, Rectangle2D> union(Map<PDPage, Rectangle2D>... maps) {
    Map<PDPage, Rectangle2D> result = null;
    for (Map<PDPage, Rectangle2D> map : maps) {
        if (map != null) {
            if (result != null) {
                for (Map.Entry<PDPage, Rectangle2D> entry : map.entrySet()) {
                    PDPage page = entry.getKey();
                    Rectangle2D rectangle = union(result.get(page), entry.getValue());
                    if (rectangle != null)
                        result.put(page, rectangle);
                }
            } else {
                result = map;
            }
        }
    }
    return result;
}

Map<PDPage, Rectangle2D> union(Map<PDPage, Rectangle2D> map, PDPage page, Rectangle2D rectangle) {
    if (map == null)
        map = new HashMap<>();
    map.put(page, union(map.get(page), rectangle));
    return map;
}

Rectangle2D union(Rectangle2D... rectangles)
{
    Rectangle2D box = null;
    for (Rectangle2D rectangle : rectangles) {
        if (rectangle != null) {
            if (box != null)
                box.add(rectangle);
            else
                box = rectangle;
        }
    }
    return box;
}

(VisualizeMarkedContent helper methods)
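
Note that the Rectangle2D helper above reuses the first non-null rectangle as the accumulator, so that argument is modified in place. This is harmless in the code shown here, but if the inputs are needed unchanged elsewhere, a defensive variant could look like the following sketch (the class name `RectangleUnion` is made up for this example):

```java
import java.awt.geom.Rectangle2D;

final class RectangleUnion {
    /** Returns the union of all non-null rectangles, or null if there are none; inputs stay unmodified. */
    static Rectangle2D union(Rectangle2D... rectangles) {
        Rectangle2D box = null;
        for (Rectangle2D rectangle : rectangles) {
            if (rectangle == null) {
                continue;
            }
            if (box == null) {
                // copy the first rectangle so that later add() calls do not touch the caller's object
                box = (Rectangle2D) rectangle.clone();
            } else {
                box.add(rectangle);
            }
        }
        return box;
    }

    public static void main(String[] args) {
        Rectangle2D a = new Rectangle2D.Double(0, 0, 10, 10);
        Rectangle2D b = new Rectangle2D.Double(5, 5, 10, 10);
        System.out.println(union(a, null, b)); // covers both a and b
        System.out.println(a);                 // a keeps its original size
    }
}
```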

Finally, the method calculateGlyphBounds has been borrowed from the PDFBox example DrawPrintTextLocations to calculate the individual glyph bounding boxes:

private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
{
    GeneralPath path = null;
    AffineTransform at = textRenderingMatrix.createAffineTransform();
    at.concatenate(font.getFontMatrix().createAffineTransform());
    if (font instanceof PDType3Font)
    {
        // It is difficult to calculate the real individual glyph bounds for type 3 fonts
        // because these are not vector fonts, the content stream could contain almost anything
        // that is found in page content streams.
        PDType3Font t3Font = (PDType3Font) font;
        PDType3CharProc charProc = t3Font.getCharProc(code);
        if (charProc != null)
        {
            BoundingBox fontBBox = t3Font.getBoundingBox();
            PDRectangle glyphBBox = charProc.getGlyphBBox();
            if (glyphBBox != null)
            {
                // PDFBOX-3850: glyph bbox could be larger than the font bbox
                glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                path = glyphBBox.toGeneralPath();
            }
        }
    }
    else if (font instanceof PDVectorFont)
    {
        PDVectorFont vectorFont = (PDVectorFont) font;
        path = vectorFont.getPath(code);

        if (font instanceof PDTrueTypeFont)
        {
            PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
            int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
            at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
        }
        if (font instanceof PDType0Font)
        {
            PDType0Font t0font = (PDType0Font) font;
            if (t0font.getDescendantFont() instanceof PDCIDFontType2)
            {
                int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
        }
    }
    else if (font instanceof PDSimpleFont)
    {
        PDSimpleFont simpleFont = (PDSimpleFont) font;

        // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
        // which is why PDVectorFont is tried first.
        String name = simpleFont.getEncoding().getName(code);
        path = simpleFont.getPath(name);
    }
    else
    {
        // shouldn't happen, please open issue in JIRA
        System.out.println("Unknown font class: " + font.getClass());
    }
    if (path == null)
    {
        return null;
    }
    return at.createTransformedShape(path.getBounds2D());
}

(VisualizeMarkedContent method)
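
The coordinate arithmetic at the end of calculateGlyphBounds can be tried out with plain java.awt.geom classes. The sketch below is an illustration with made-up numbers, not PDFBox code: it assumes a glyph box of 500 x 700 glyph-space units, a font matrix that scales by 0.001, and a text matrix for 12 pt text positioned at (100, 700). The names `GlyphBoxTransform` and `toPageSpace` are hypothetical.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

final class GlyphBoxTransform {
    /** Maps a glyph-space box to page space: the font matrix is applied first, then the text matrix. */
    static Rectangle2D toPageSpace(Rectangle2D glyphBox, AffineTransform textMatrix, AffineTransform fontMatrix) {
        AffineTransform at = new AffineTransform(textMatrix);
        at.concatenate(fontMatrix); // concatenate() makes fontMatrix act on the coordinates first
        return at.createTransformedShape(glyphBox).getBounds2D();
    }

    public static void main(String[] args) {
        // assumed values for illustration only
        AffineTransform textMatrix = new AffineTransform(12, 0, 0, 12, 100, 700);
        AffineTransform fontMatrix = AffineTransform.getScaleInstance(0.001, 0.001);
        Rectangle2D glyphBox = new Rectangle2D.Double(0, 0, 500, 700);
        System.out.println(toPageSpace(glyphBox, textMatrix, fontMatrix));
    }
}
```

With these numbers the resulting box starts at (100, 700) and is roughly 6 units wide and 8.4 units high, which is the kind of rectangle showContent then merges into the box of the marked content.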

The result for your example document: