java pdf pdfbox tagged-pdf

Java PDFBox: Creating an artifact tag for lines and underlines in a tagged PDF


I am creating an accessible PDF from a tagged PDF. The accessibility checker reports a "path object is not tagged" error. The PDF contains lines and underlined text, so I am trying to add an "Artifact" tag to the untagged line items. I am able to get the lines using PDFGraphicsStreamEngine. Could anyone help me with this?

[Screenshots: PDF page; PAC3 error]

Solution

  • You can use the PdfContentStreamEditor class from this answer to edit the page content streams as desired by customizing and calling it like this:

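    // Imports needed by this fragment; PdfContentStreamEditor is the class from the referenced answer.
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;
    import org.apache.pdfbox.contentstream.operator.Operator;
    import org.apache.pdfbox.cos.COSBase;
    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    
    PDDocument document = ...;
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfContentStreamEditor markEditor = new PdfContentStreamEditor(document, page) {
            // Operators that build a path and operators that paint (or end) it.
            final List<String> PATH_CONSTRUCTION = Arrays.asList("m", "l", "c", "v", "y", "h", "re");
            final List<String> PATH_PAINTING = Arrays.asList("s", "S", "f", "F", "f*", "B", "B*", "b", "b*", "n");
    
            // Nesting depth of marked content already present in the stream (0 = unmarked).
            int markedContentDepth = 0;
            // True while we are inside an /Artifact sequence opened by this editor.
            boolean inArtifact = false;
    
            @Override
            public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
                if (inArtifact) {
                    // A path was started but never painted before new marked content begins.
                    System.err.println("Structural error in content stream: Path not properly closed by path painting instruction.");
                }
                markedContentDepth++;
                super.beginMarkedContentSequence(tag, properties);
            }
    
            @Override
            public void endMarkedContentSequence() {
                markedContentDepth--;
                super.endMarkedContentSequence();
            }
    
            @Override
            protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
                String operatorString = operator.getName();
    
                boolean unmarked = markedContentDepth == 0;
                boolean inArtifactBefore = inArtifact;
    
                // First path construction operator of an unmarked path: open /Artifact BMC.
                if (unmarked && (!inArtifactBefore) && PATH_CONSTRUCTION.contains(operatorString)) {
                    super.write(contentStreamWriter, Operator.getOperator("BMC"), Collections.singletonList(COSName.ARTIFACT));
                    inArtifact = true;
                }
    
                // Always write the original instruction unchanged.
                super.write(contentStreamWriter, operator, operands);
    
                // Painting operator ends the path: close the artifact with EMC.
                if (unmarked && inArtifactBefore && PATH_PAINTING.contains(operatorString)) {
                    super.write(contentStreamWriter, Operator.getOperator("EMC"), Collections.emptyList());
                    inArtifact = false;
                }
            }
        };
        markEditor.processPage(page);
    }
    document.save(...);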
    

    (EditMarkedContent test testMarkUnmarkedPathsAsArtifactsTradeSimple1)

    The beginMarkedContentSequence and endMarkedContentSequence overrides track the current marked content nesting depth, in particular whether or not the current content is marked at all.

    For instructions that are not yet marked, the write override then encloses each sequence of path construction and painting instructions in /Artifact BMC ... EMC.

    Beware: this code only considers content in page content streams. It does not descend into form XObjects, patterns, etc.

    Furthermore, for content streams that already contain errors (e.g. path construction without a painting instruction), this code may introduce additional errors (e.g. unbalanced marked content starts and ends).
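
To make the marking logic easier to follow in isolation, here is a minimal sketch that applies the same state machine to a plain list of operator names instead of a real content stream. It does not use PDFBox at all. The class name `ArtifactMarkingSketch`, the method `mark`, and the string `"/Artifact BMC"` standing in for the written operator are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone model of the write override above; not part of PDFBox.
class ArtifactMarkingSketch {
    static final List<String> PATH_CONSTRUCTION = Arrays.asList("m", "l", "c", "v", "y", "h", "re");
    static final List<String> PATH_PAINTING = Arrays.asList("s", "S", "f", "F", "f*", "B", "B*", "b", "b*", "n");

    /** Returns a copy of the operator sequence with unmarked paths wrapped as artifacts. */
    static List<String> mark(List<String> operators) {
        List<String> out = new ArrayList<>();
        int depth = 0;              // nesting depth of marked content already in the stream
        boolean inArtifact = false; // true while inside an artifact opened by this method

        for (String op : operators) {
            // Depth is updated before the instruction is written, as in the editor.
            if (op.equals("BMC") || op.equals("BDC")) {
                depth++;
            } else if (op.equals("EMC")) {
                depth--;
            }

            boolean unmarked = depth == 0;
            boolean inArtifactBefore = inArtifact;

            // First construction operator of an unmarked path opens the artifact.
            if (unmarked && !inArtifactBefore && PATH_CONSTRUCTION.contains(op)) {
                out.add("/Artifact BMC");
                inArtifact = true;
            }

            out.add(op);

            // The painting operator that ends the path closes the artifact.
            if (unmarked && inArtifactBefore && PATH_PAINTING.contains(op)) {
                out.add("EMC");
                inArtifact = false;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A marked rectangle followed by an unmarked line.
        System.out.println(mark(Arrays.asList("BMC", "re", "f", "EMC", "m", "l", "S")));
        // prints [BMC, re, f, EMC, /Artifact BMC, m, l, S, EMC]
    }
}
```

The already marked rectangle is left alone, while the unmarked line is wrapped. Feeding the sketch a broken sequence such as `m l` with no painting operator yields `/Artifact BMC m l` without a closing `EMC`, which is the kind of additional error the caveat above warns about.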