I am creating an accessible PDF from a tagged PDF. I get a "path object is not tagged" error. The PDF contains lines and underlined text, so I am trying to add an "ARTIFACT" tag to the untagged line items. I am able to get the lines using PDFGraphicsStreamEngine. Could anyone help me with this?
You can use the PdfContentStreamEditor class from this answer to edit the page content streams as desired by customizing and calling it like this:
    PDDocument document = ...;
    for (PDPage page : document.getDocumentCatalog().getPages()) {
        PdfContentStreamEditor markEditor = new PdfContentStreamEditor(document, page) {
            // Path operators as they appear in the content stream
            final List<String> PATH_CONSTRUCTION = Arrays.asList("m", "l", "c", "v", "y", "h", "re");
            final List<String> PATH_PAINTING = Arrays.asList("s", "S", "f", "F", "f*", "B", "B*", "b", "b*", "n");

            // Current nesting depth of marked content; 0 means the content is unmarked
            int markedContentDepth = 0;
            // True while an /Artifact sequence added by this editor is still open
            boolean inArtifact = false;

            @Override
            public void beginMarkedContentSequence(COSName tag, COSDictionary properties) {
                if (inArtifact) {
                    System.err.println("Structural error in content stream: Path not properly closed by path painting instruction.");
                }
                markedContentDepth++;
                super.beginMarkedContentSequence(tag, properties);
            }

            @Override
            public void endMarkedContentSequence() {
                markedContentDepth--;
                super.endMarkedContentSequence();
            }

            @Override
            protected void write(ContentStreamWriter contentStreamWriter, Operator operator, List<COSBase> operands) throws IOException {
                String operatorString = operator.getName();
                boolean unmarked = markedContentDepth == 0;
                boolean inArtifactBefore = inArtifact;

                // First construction operator of an unmarked path: open the artifact
                if (unmarked && (!inArtifactBefore) && PATH_CONSTRUCTION.contains(operatorString)) {
                    super.write(contentStreamWriter, Operator.getOperator("BMC"), Collections.singletonList(COSName.ARTIFACT));
                    inArtifact = true;
                }

                // Write the original instruction unchanged
                super.write(contentStreamWriter, operator, operands);

                // Painting operator ends the path: close the artifact
                if (unmarked && inArtifactBefore && PATH_PAINTING.contains(operatorString)) {
                    super.write(contentStreamWriter, Operator.getOperator("EMC"), Collections.emptyList());
                    inArtifact = false;
                }
            }
        };
        markEditor.processPage(page);
    }
    document.save(...);
(EditMarkedContent test testMarkUnmarkedPathsAsArtifactsTradeSimple1)
The beginMarkedContentSequence and endMarkedContentSequence overrides track the current marked content nesting depth, in particular whether or not the current content is marked at all.
For instructions that are not yet marked, the write override then encloses each sequence of path construction and painting instructions in /Artifact BMC ... EMC.
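To see the wrapping logic in isolation, here is a minimal sketch in plain Java that does not use PDFBox at all. It is a toy model, not the editor itself: each instruction is a plain string whose last space-separated token is the operator, and the class name ArtifactWrapSketch and the helpers operatorOf and wrapUnmarkedPaths are hypothetical names chosen for illustration. It assumes, as the code above does, that the marked content depth is already updated by the time an instruction is written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the write override; instructions are plain strings, not PDFBox objects. */
class ArtifactWrapSketch {
    static final List<String> PATH_CONSTRUCTION = Arrays.asList("m", "l", "c", "v", "y", "h", "re");
    static final List<String> PATH_PAINTING = Arrays.asList("s", "S", "f", "F", "f*", "B", "B*", "b", "b*", "n");

    /** Returns the operator, assumed to be the last space-separated token. */
    static String operatorOf(String instruction) {
        String trimmed = instruction.trim();
        return trimmed.substring(trimmed.lastIndexOf(' ') + 1);
    }

    /** Encloses every unmarked path construction/painting sequence in /Artifact BMC ... EMC. */
    static List<String> wrapUnmarkedPaths(List<String> instructions) {
        List<String> out = new ArrayList<>();
        int markedContentDepth = 0;
        boolean inArtifact = false;
        for (String instruction : instructions) {
            String op = operatorOf(instruction);
            // Assumption: depth is updated before the instruction is written
            if (op.equals("BMC") || op.equals("BDC")) {
                markedContentDepth++;
            } else if (op.equals("EMC")) {
                markedContentDepth--;
            }
            boolean unmarked = markedContentDepth == 0;
            boolean inArtifactBefore = inArtifact;
            // First construction operator of an unmarked path opens the artifact
            if (unmarked && !inArtifactBefore && PATH_CONSTRUCTION.contains(op)) {
                out.add("/Artifact BMC");
                inArtifact = true;
            }
            out.add(instruction);
            // The painting operator that ends the path closes the artifact
            if (unmarked && inArtifactBefore && PATH_PAINTING.contains(op)) {
                out.add("EMC");
                inArtifact = false;
            }
        }
        return out;
    }
}
```

For example, the instructions q, 10 10 m, 100 10 l, S, Q come back as q, /Artifact BMC, 10 10 m, 100 10 l, S, EMC, Q, while a path that already sits inside a /P BMC ... EMC sequence is returned unchanged.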
Beware: this code only considers content in page content streams; it does not descend into form XObjects, patterns, etc.
Furthermore, if a content stream already contains errors (e.g. path construction without a painting instruction), this code may introduce additional errors (e.g. unbalanced marked content starts and ends).
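One way to guard against the second caveat is to check the operator sequence for such problems before editing. The following is only a sketch in plain Java without PDFBox: it works on a list of bare operator names, the names ContentStreamPrecheckSketch and findProblems are hypothetical, and it only covers the two error kinds mentioned above plus marked content starting inside an open path.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy pre-check on a list of bare operator names (operands left out). */
class ContentStreamPrecheckSketch {
    static final List<String> PATH_CONSTRUCTION = Arrays.asList("m", "l", "c", "v", "y", "h", "re");
    static final List<String> PATH_PAINTING = Arrays.asList("s", "S", "f", "F", "f*", "B", "B*", "b", "b*", "n");

    /** Returns human-readable problem descriptions; an empty list means none were found. */
    static List<String> findProblems(List<String> operators) {
        List<String> problems = new ArrayList<>();
        int depth = 0;            // marked content nesting depth
        boolean pathOpen = false; // path constructed but not yet painted
        for (int i = 0; i < operators.size(); i++) {
            String op = operators.get(i);
            if (op.equals("BMC") || op.equals("BDC")) {
                if (pathOpen) {
                    problems.add("marked content starts inside an open path at index " + i);
                }
                depth++;
            } else if (op.equals("EMC")) {
                if (depth == 0) {
                    problems.add("EMC without matching BMC/BDC at index " + i);
                } else {
                    depth--;
                }
            } else if (PATH_CONSTRUCTION.contains(op)) {
                pathOpen = true;
            } else if (PATH_PAINTING.contains(op)) {
                pathOpen = false;
            }
        }
        if (pathOpen) {
            problems.add("path constructed but never painted");
        }
        if (depth > 0) {
            problems.add(depth + " marked content sequence(s) not closed");
        }
        return problems;
    }
}
```

If findProblems returns a non-empty list for a page, it is probably safer to skip or repair that page than to run the editor on it.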