I have tagged a PDF using PDFBox.
How I tagged it: Instead of extracting the text and tagging it, I am adding MCIDs to the existing content stream (both opening and closing operators, e.g. /p<< MCID 0 >> BDC .. .. .. EMC), and then I am adding that marked content to the structure tree in the document root catalog.
What is working: Almost everything works, as in a completely tagged PDF. It also passes the PAC3 accessibility checker.
// Adding tags: insert "<tag> <properties> BDC" into the page's token list
tokens.add(++ind, type_check(t_ype, page)); // tag operand, as returned by type_check
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
if (altText != null && !altText.isEmpty()) {
    currentMarkedContentDictionary.setString(COSName.ALT, altText);
}
mcid++;
tokens.add(++ind, currentMarkedContentDictionary); // property dictionary operand
tokens.add(++ind, Operator.getOperator("BDC"));
// Adding marked content to root structure
structureElement.appendKid(markedContent);
currentSection.appendKid(structureElement);
What is not working: After tagging, one feature is missing from the tag structure. The option "Find Tag from Selection" does not work: when I select some text and choose "Find Tag from Selection", it jumps to the last tag in the root structure. Please find the PDF at the link below.
https://drive.google.com/file/d/11Lhuj50Bb9kChvD0kL_GOHQn4RNKZ0hR/view?usp=sharing
Parent tree:
https://drive.google.com/file/d/109xhUpqsQSFLPJB2nhXoU9ssMKnyht3G/view?usp=sharing
extra doc with tagging and parent tree: https://drive.google.com/file/d/1yzZSsjkb5_dGfq1Wu3VxsH73vr3alRmC/view?usp=sharing
Please help me to solve this problem.
New problem: While JAWS is reading my tagged document, I press Ctrl+Shift+5 on a Windows machine (in Adobe Acrobat DC). This shows a dialog with a drop-down offering "Read based on tagged structure" or "Top left to bottom right", and below it two radio buttons: "Read current page" and "Read all pages".
I selected "Read based on tagged structure" and "Read current page". Now JAWS does not read the tag structure. But if I use the same document with "Read entire document", it reads it perfectly.
Link to doc:
https://drive.google.com/file/d/1CguMHa4DikFMP15VGERnPNWRq5vO3u6I/view?usp=sharing
Any help?
How I was tagged: Instead of extract text and tagging I am adding mcid's to the existing content stream (both open and closing ex: /p<< MCID 0 >> BDC .. .. .. EMC)
You're doing this incorrectly. See for example the start of the page content stream in your document:
BT
0 i
/C0_0 18 Tf
41.91 740.175 Td
/H2 <</MCID 0 >> BDC
( \) F M M P 8 P S M E) Tj
ET
/TouchUp_TextEdit MP
BT
/C0_1 14 Tf
EMC
Focusing on the beginning and end of text objects and marked content, we see that you have BT ... BDC ... ET ... BT ... EMC
According to the specification, though:
When the marked-content operators BMC, BDC, and EMC are combined with the text object operators BT and ET (see 9.4, “Text Objects”), each pair of matching operators (BMC…EMC, BDC…EMC, or BT…ET) shall be properly (separately) nested. Therefore, the sequences
BMC BT … ET EMC and BT BMC … EMC ET
are valid, but
BMC BT … EMC ET and BT BMC … ET EMC
are not valid.
(ISO 32000-1 section 14.6 "Marked Content")
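As a rough illustration of that nesting rule, here is a minimal sketch in plain Java (no PDFBox involved). It assumes the content stream has already been reduced to a list of operator names; the class and method names are hypothetical. It also rejects a BT inside an open BT, since text objects themselves must not be nested.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class MarkedContentNesting {
    /**
     * Returns true if every BT...ET and BMC/BDC...EMC pair in the given
     * operator sequence is properly nested and closed.
     */
    static boolean isProperlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "BT":
                    // Text objects must not be nested inside each other.
                    if (open.contains("BT")) return false;
                    open.push(op);
                    break;
                case "BMC":
                case "BDC":
                    open.push(op);
                    break;
                case "ET":
                    // ET must close the most recently opened construct, a BT.
                    if (open.isEmpty() || !open.pop().equals("BT")) return false;
                    break;
                case "EMC":
                    // EMC must close the most recently opened BMC or BDC.
                    if (open.isEmpty() || open.pop().equals("BT")) return false;
                    break;
                default:
                    break; // other operators do not affect nesting
            }
        }
        return open.isEmpty();
    }
}
```

Called with the operator order from the stream above (BT, BDC, ET, BT, EMC), the check fails at the ET, because the innermost open construct at that point is the BDC rather than the BT.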
This issue was fixed in the second shared PDF, res1.pdf.
The problem your question focuses on is
There is an option called "Find Tag from Selection" . Is not working.
Finding a tag from selection essentially means that you have the MCID of some content stream instruction and you search for the structure element in the structure tree that references that marked content ID.
How PDF processors are expected to do this, is described in section 14.7.4.4 "Finding Structure Elements from Content Items" of the PDF specification ISO 32000-1 (or section 14.7.5.4 in ISO 32000-2):
Because a stream cannot contain object references, there is no way for content items that are marked-content sequences to refer directly back to their parent structure elements (the ones to which they belong as content items). Instead, a different mechanism, the structural parent tree, shall be provided for this purpose. For consistency, content items that are entire PDF objects, such as XObjects, shall also use the parent tree to refer to their parent structure elements.
The parent tree is a number tree, accessed from the ParentTree entry in a document’s structure tree root. The tree shall contain an entry for each object that is a content item of at least one structure element and for each content stream containing at least one marked-content sequence that is a content item.
Your PDF does not have that ParentTree at all, and your page does not contain a StructParents entry to look up in a parent tree. Thus, the prescribed way to get from marked content to the structure tree cannot be followed.
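To make the lookup path concrete, here is a small sketch that models it with plain Java collections rather than PDFBox objects. The parent tree is represented as a map from a page's StructParents value to an array of structure elements, and structure elements are stood in for by string labels. All names here are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

class ParentTreeLookup {
    /**
     * Models "find tag from selection": the page's StructParents value is the
     * key into the parent tree, and the MCID is a zero-based index into the
     * array stored under that key.
     */
    static Optional<String> findStructElem(Map<Integer, List<String>> parentTree,
                                           Integer structParents, int mcid) {
        if (structParents == null) {
            return Optional.empty(); // page has no StructParents entry
        }
        List<String> elementsForPage = parentTree.get(structParents);
        if (elementsForPage == null || mcid < 0 || mcid >= elementsForPage.size()) {
            return Optional.empty(); // no parent tree entry, or MCID out of range
        }
        return Optional.ofNullable(elementsForPage.get(mcid));
    }
}
```

With neither a parent tree nor a StructParents entry, every lookup in this model ends in the empty result, which corresponds to a viewer being unable to find a tag for the selection.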
A ParentTree was added in the third shared PDF, new.pdf.
While in new.pdf you have a ParentTree, its contents are clearly incorrect:
The ParentTree is a number tree, i.e. integers are mapped to something here, so there obviously must not be multiple entries for the same integer key.
Furthermore, looking inside one of those values, one sees that you claim that the following StructElem is the value for all marked content IDs:
Inspecting this StructElem further, one sees that it represents the final paragraph on the final page.
Thus, your observation
Now instead of "selection not found " it is highlighting the last <P> tag in parent tree. Irrespective of what what we selected.
is what one can expect. If one expects any reasonable behavior at all, that is, with a ParentTree structure broken so badly.
Actually there was not only this new.pdf but also res.pdf and tagged without altext.pdf with ParentTrees, but all these ParentTrees were broken like the tree of new.pdf.
You might want to start inspecting the structures you create when analyzing an unwanted behavior.
The previously described issue in parent trees meanwhile has been resolved, different pages now have different struct parents and the parent tree arrays now reference the struct elements for distinct MCIDs.
For some documents a different error occurs now, though, e.g. "res29_08_19.pdf". Here the parent tree starts like this:
In particular the first entry in the array is for MCID 3, the second for MCID 4, ...
This is invalid according to the specification:
The array element corresponding to each sequence shall be found by using the sequence’s marked-content identifier as a zero-based index into the array.
(ISO 32000-1 section 14.7.4.4 "Finding Structure Elements from Content Items")
Thus, the first entry must be for MCID 0, the second for MCID 1, ...
You objected in a comment:
No I used 0 and 1 Mcid's for Artifacts.
But as a corollary of the above: Do not give MCIDs to marked content sequences you don't have a structure element for! MCIDs are for going back and forth between the structure hierarchy and the content streams. If you mark a piece of content without having a structure element for it, don't give it an MCID.
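One way to keep the parent tree array aligned with the MCIDs is to derive both in the same pass. The following is a minimal sketch in plain Java under the assumption that each marked piece of content on a page is described by the label of its owning structure element, or by null for an artifact. The class name and the -1 convention are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class PageTagging {
    /** MCID assigned to each content piece, in page order; -1 means "no MCID" (artifact). */
    final List<Integer> mcids = new ArrayList<>();
    /** Parent tree array for the page: the element at index i owns MCID i. */
    final List<String> parentArray = new ArrayList<>();

    /** owners: owning structure element per content piece, null for artifacts. */
    PageTagging(List<String> owners) {
        for (String owner : owners) {
            if (owner == null) {
                mcids.add(-1); // artifact: marked in the stream, but without an MCID
            } else {
                // The next free array index becomes the MCID, so the array
                // always starts at MCID 0 and has no gaps.
                mcids.add(parentArray.size());
                parentArray.add(owner);
            }
        }
    }
}
```

For a page that starts with two artifacts followed by a heading and a paragraph, this yields MCIDs 0 and 1 for the heading and the paragraph, and a two-entry parent array whose first entry is for MCID 0.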
You again report problems with your newest file, mathpdf.pdf. And indeed, there are issues; Adobe Acrobat Preflight reports a five-page list of inconsistent parent tree mappings like this:
In contrast to the previous issues the cause does not become clear by looking at the parent tree alone, one also has to look at the structure hierarchy.
Doing so, though, one peculiarity immediately hits the eye: In your parent tree you do not reference the actual parent structure element of the MCID but you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent (not actually being one of its kids) and also claims to have the MCID in question as kid.
For example let's look at the MCID 0 on the first page. In the structure hierarchy you have:
In the parent tree you have:
You should have simply referenced object 238 (the structure hierarchy parent of MCID 0) directly from the parent tree array for page one instead of that in-between object 62 which claims to have that object 238 as parent and MCID 0 as kid.
The reported inconsistency may be due to the fact that the node referenced from the parent tree (in object 62) claims to be a P paragraph with a parent node (in object 238) which is a Span. That is not allowed: a paragraph may contain a span, but it cannot be contained in one.
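The kind of mismatch described here can be expressed as a simple check. This is only a sketch over a hypothetical in-memory model of structure nodes (not PDFBox classes, and not necessarily what Preflight actually tests): a node referenced from the parent tree for some MCID should list that MCID among its kids, and it should itself be listed among the kids of the node it names as its parent.

```java
import java.util.ArrayList;
import java.util.List;

class StructNode {
    final String type;                            // e.g. "Span" or "P"
    StructNode parent;                            // what this node claims as its parent
    final List<Object> kids = new ArrayList<>();  // StructNode children or Integer MCIDs

    StructNode(String type) { this.type = type; }

    void addKid(StructNode kid) { kids.add(kid); kid.parent = this; }
    void addMcid(int mcid) { kids.add(mcid); }
}

class ParentTreeCheck {
    /** True if 'referenced' is a plausible parent tree entry for the given MCID. */
    static boolean isConsistent(StructNode referenced, int mcid) {
        boolean ownsMcid = referenced.kids.contains(mcid);
        // A node that names a parent must also appear among that parent's kids.
        boolean attached = referenced.parent == null
                || referenced.parent.kids.contains(referenced);
        return ownsMcid && attached;
    }
}
```

In this model, a node playing the role of object 238 (which has MCID 0 as a kid and is part of the hierarchy) passes the check, while an in-between node like object 62, which only claims 238 as its parent without being one of its kids, does not.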