We are creating PDF documents in Java using pdfBox. Since they should be accessible to screen readers, we use tags, set up a ParentTree, and add it to the document catalog.
Please find an example file here.
When we check the resulting PDF with the PAC3 validator, we get 25 errors for inconsistent entries in the structural parent tree.
The Adobe Preflight syntax error check gives the same result with more detail. The error message is:
Inconsistent ParentTree mapping (ParentTree element 0) for structure element
Traversal Path:->StructTreeRoot->K->K->[1]->K->[3]->K->[4]
When I try to follow that traversal path in the pdfBox Debugger, I see an element referencing the ID 22.
Now my questions are:
I think building accessible PDFs with pdfBox, as well as the error messages from common validation tools, are rather poorly documented. Where can I find more information about it?
Thanks a lot for your help.
The issue in your PDF is very reminiscent of the one discussed in the last section, "Yet another issue with parent tree entries", of this answer to the question “Find Tag from Selection” is not working in tagged pdf? by fascinating coder:
In your parent tree you do not reference the actual parent structure element of the MCID but you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent (not actually being one of its kids) and also claims to have the MCID in question as kid.
Instead you should simply reference the actual parent structure element of the MCID.
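To make the rule concrete, here is a minimal sketch of the consistency condition in plain Java. It does not use pdfBox: the class `ParentTreeCheck`, its `Elem` type and the method `isConsistent` are hypothetical stand-ins for structure elements with a /P parent and /K kids, and the check is only my reading of what the validators complain about.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of structure elements; these are not pdfBox classes.
final class ParentTreeCheck {

    static final class Elem {
        final String name;
        final Elem parent;                            // models the /P entry
        final List<Object> kids = new ArrayList<>();  // models /K: Elem or Integer (MCID)

        Elem(String name, Elem parent) {
            this.name = name;
            this.parent = parent;
        }
    }

    // A parent tree entry for an MCID is treated as consistent if the referenced
    // element lists the MCID as a kid AND is itself listed among its parent's kids.
    static boolean isConsistent(Elem entry, int mcid) {
        boolean ownsMcid = entry.kids.contains(mcid);
        boolean linkedFromParent = entry.parent == null || entry.parent.kids.contains(entry);
        return ownsMcid && linkedFromParent;
    }

    public static void main(String[] args) {
        Elem root = new Elem("Document", null);
        Elem paragraph = new Elem("P", root);
        root.kids.add(paragraph);
        paragraph.kids.add(0);                        // MCID 0 belongs to the paragraph

        // Broken entry: a fresh node that claims the paragraph as its parent and
        // MCID 0 as its kid, but is not among the paragraph's kids.
        Elem detached = new Elem("Span", paragraph);
        detached.kids.add(0);

        System.out.println(isConsistent(paragraph, 0)); // prints true
        System.out.println(isConsistent(detached, 0));  // prints false
    }
}
```

The `detached` node mirrors the situation in your file: it looks plausible in isolation, but nothing in the structure hierarchy points to it.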
As your question title asks how to heal inconsistent parent tree mappings in a PDF created by pdfBox, here is an approach to fix your parent tree by rebuilding it from the structure tree.
First recursively collect MCIDs and their parent structure tree elements by page, e.g. using a method like this:
    void collect(PDPage page, PDStructureNode node, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        COSDictionary pageDictionary = node.getCOSObject().getCOSDictionary(COSName.PG);
        if (pageDictionary != null) {
            page = new PDPage(pageDictionary);
        }
        for (Object object : node.getKids()) {
            if (object instanceof COSArray) {
                for (COSBase base : (COSArray) object) {
                    if (base instanceof COSDictionary) {
                        collect(page, PDStructureNode.create((COSDictionary) base), parentsByPage);
                    } else if (base instanceof COSNumber) {
                        setParent(page, node, ((COSNumber)base).intValue(), parentsByPage);
                    } else {
                        System.out.printf("?%s\n", base);
                    }
                }
            } else if (object instanceof PDStructureNode) {
                collect(page, (PDStructureNode) object, parentsByPage);
            } else if (object instanceof Integer) {
                setParent(page, node, (Integer)object, parentsByPage);
            } else {
                System.out.printf("?%s\n", object);
            }
        }
    }
(RebuildParentTreeFromStructure method)
with this helper method
    void setParent(PDPage page, PDStructureNode node, int mcid, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        if (node == null) {
            System.err.printf("Cannot set null as parent of MCID %s.\n", mcid);
        } else if (page == null) {
            System.err.printf("Cannot set parent of MCID %s for null page.\n", mcid);
        } else {
            Map<Integer, PDStructureNode> parents = parentsByPage.get(page);
            if (parents == null) {
                parents = new HashMap<>();
                parentsByPage.put(page, parents);
            }
            if (parents.containsKey(mcid)) {
                System.err.printf("MCID %s already has a parent. New parent rejected.\n", mcid);
            } else {
                parents.put(mcid, node);
            }
        }
    }
(RebuildParentTreeFromStructure helper method)
and then rebuild based on the collected information:
    void rebuildParentTreeFromData(PDStructureTreeRoot root, Map<PDPage, Map<Integer, PDStructureNode>> parentsByPage) {
        int parentTreeMaxkey = -1;
        Map<Integer, COSArray> numbers = new HashMap<>();
        for (Map.Entry<PDPage, Map<Integer, PDStructureNode>> entry : parentsByPage.entrySet()) {
            // getInt returns -1 if the page has no StructParents entry
            int parentsId = entry.getKey().getCOSObject().getInt(COSName.STRUCT_PARENTS);
            if (parentsId < 0) {
                System.err.printf("Page without StructParents. Ignoring %s MCIDs.\n", entry.getValue().size());
            } else {
                if (parentTreeMaxkey < parentsId)
                    parentTreeMaxkey = parentsId;
                // one array per page, indexed by MCID
                COSArray array = new COSArray();
                for (Map.Entry<Integer, PDStructureNode> subEntry : entry.getValue().entrySet()) {
                    array.growToSize(subEntry.getKey() + 1);
                    array.set(subEntry.getKey(), subEntry.getValue());
                }
                numbers.put(parentsId, array);
            }
        }
        PDNumberTreeNode numberTreeNode = new PDNumberTreeNode(PDParentTreeValue.class);
        numberTreeNode.setNumbers(numbers);
        root.setParentTree(numberTreeNode);
        root.setParentTreeNextKey(parentTreeMaxkey + 1);
    }
(RebuildParentTreeFromStructure method)
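For readers unfamiliar with the layout this produces, here is a small plain-Java sketch of the same idea without pdfBox. The class `ParentTreeLayout` and its methods `toArray` and `nextKey` are hypothetical names introduced for illustration, and strings stand in for structure elements: each page's StructParents value is a key in the number tree, the value is an array indexed by MCID, and the next free key is the largest key plus one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified, hypothetical model: strings stand in for structure elements.
final class ParentTreeLayout {

    // Turns "MCID -> parent" for one page into a list indexed by MCID,
    // padding unused MCIDs with null, much like growToSize does for a COSArray.
    static List<String> toArray(Map<Integer, String> parentsByMcid) {
        List<String> array = new ArrayList<>();
        for (Map.Entry<Integer, String> e : parentsByMcid.entrySet()) {
            while (array.size() <= e.getKey()) {
                array.add(null);
            }
            array.set(e.getKey(), e.getValue());
        }
        return array;
    }

    // The next free key is one more than the largest StructParents key in use.
    static int nextKey(Map<Integer, ?> numbers) {
        int max = -1;
        for (int key : numbers.keySet()) {
            max = Math.max(max, key);
        }
        return max + 1;
    }

    public static void main(String[] args) {
        Map<Integer, String> page0 = new TreeMap<>();
        page0.put(0, "H1");
        page0.put(2, "P");                      // MCID 1 is not used on this page

        Map<Integer, List<String>> numbers = new TreeMap<>();
        numbers.put(0, toArray(page0));         // key 0 = StructParents of the page

        System.out.println(numbers);            // prints {0=[H1, null, P]}
        System.out.println(nextKey(numbers));   // prints 1
    }
}
```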
Applied like this (rebuildParentTree wraps the two steps above: collect, then rebuildParentTreeFromData):
    PDDocument document = PDDocument.load(SOURCE);
    rebuildParentTree(document);
    document.save(RESULT);
(RebuildParentTreeFromStructure test testTestdatei)
PAC3 and Adobe Preflight (at least that of my old Acrobat 9.5) go all green for the result.
Beware: This is not a generic parent tree rebuilder yet. It is made to work for the test file at hand, with a specific kind of structure tree nodes and content only in page content streams. A generic tool would also have to cope with other kinds of nodes and to process, for example, marked content in embedded XObjects.