java pdf itext pdfbox tagged-pdf

"Find Tag from Selection" is not working in tagged pdf?


I have tagged a PDF using PDFBox.

How I tagged it: Instead of extracting the text and tagging it, I add MCIDs to the existing content stream (both the opening and the closing operator, e.g. /P <</MCID 0>> BDC ... EMC), and then I add that marked content to the structure tree in the document catalog.

What works: Almost everything works as in a completely tagged PDF. The file also passes the PAC3 accessibility checker.

//Adding tags
tokens.add(++ind, type_check(t_ype, page));
currentMarkedContentDictionary = new COSDictionary();
currentMarkedContentDictionary.setInt(COSName.MCID, mcid);
if (altText != null && !altText.isEmpty()) {
    currentMarkedContentDictionary.setString(COSName.ALT, altText);
}
mcid++;
tokens.add(++ind, currentMarkedContentDictionary);
tokens.add(++ind, Operator.getOperator("BDC"));

// Adding marked content to root structure
structureElement.appendKid(markedContent);

currentSection.appendKid(structureElement);             

What does not work: After tagging, one feature is missing from the tag structure. The option "Find Tag from Selection" does not work: when I select some text and choose "Find Tag from Selection", it jumps to the last tag in the root structure. The PDF is available at the link below.

https://drive.google.com/file/d/11Lhuj50Bb9kChvD0kL_GOHQn4RNKZ0hR/view?usp=sharing

Parent tree:

https://drive.google.com/file/d/109xhUpqsQSFLPJB2nhXoU9ssMKnyht3G/view?usp=sharing

extra doc with tagging and parent tree: https://drive.google.com/file/d/1yzZSsjkb5_dGfq1Wu3VxsH73vr3alRmC/view?usp=sharing

Please help me to solve this problem.

New problem: While JAWS is reading my tagged document, I press Ctrl+Shift+5 on a Windows machine (in Adobe DC). This shows a drop-down with the options "Read based on tagged structure" and "Top left to bottom right", and below it two radio buttons, "Read current page" and "Read all pages".

I selected "Read based on tagged structure" and "Read current page". Now JAWS does not read the tag structure. But if I use the same document with "Read entire document", it is read perfectly.

Link to doc:

https://drive.google.com/file/d/1CguMHa4DikFMP15VGERnPNWRq5vO3u6I/view?usp=sharing

Any help?


Solution

    A nesting issue

    How I tagged it: Instead of extracting the text and tagging it, I add MCIDs to the existing content stream (both the opening and the closing operator, e.g. /P <</MCID 0>> BDC ... EMC)

    You're doing this incorrectly. See for example the start of the page content stream in your document:

    BT
    0 i
    /C0_0 18 Tf
    41.91 740.175 Td
    /H2 <</MCID  0  >> BDC
    ( \) F M M P  8 P S M E) Tj
    ET
    /TouchUp_TextEdit MP
    BT
    /C0_1 14 Tf
    EMC 
    

    Focusing on the beginning and end of text objects and marked content, we see that you have BT ... BDC ... ET ... BT ... EMC

    According to the specification, though:

    When the marked-content operators BMC, BDC, and EMC are combined with the text object operators BT and ET (see 9.4, “Text Objects”), each pair of matching operators (BMC…EMC, BDC…EMC, or BT…ET) shall be properly (separately) nested. Therefore, the sequences

    BMC             BT
      BT              BMC
        …    and         …
      ET              EMC
    EMC             ET
    

    are valid, but

    BMC             BT
      BT              BMC
        …    and         …
      EMC             ET
    ET              EMC
    

    are not valid.

    (ISO 32000-1 section 14.6 "Marked Content")

    This issue was fixed in the second shared PDF, res1.pdf.
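
The nesting rule quoted above can be checked mechanically before a file is saved. The following is only a minimal sketch in plain Java, not PDFBox code: it assumes you have already reduced the content stream to a list of operator names, and the class and method names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

final class MarkedContentNesting {

    /**
     * Returns true if BT/ET pairs and BMC|BDC/EMC pairs in the given operator
     * sequence are properly nested. All other tokens are ignored.
     */
    static boolean isProperlyNested(List<String> operators) {
        Deque<String> open = new ArrayDeque<>();
        for (String op : operators) {
            switch (op) {
                case "BT":
                case "BMC":
                case "BDC":
                    open.push(op);
                    break;
                case "ET":
                    // ET must close a text object, not a marked-content sequence.
                    if (open.isEmpty() || !open.pop().equals("BT")) {
                        return false;
                    }
                    break;
                case "EMC":
                    // EMC must close the innermost BMC or BDC, not a text object.
                    if (open.isEmpty() || open.pop().equals("BT")) {
                        return false;
                    }
                    break;
                default:
                    break; // unrelated operators such as Tf, Td, Tj, MP
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        // Operator order of the page content stream excerpt shown above.
        System.out.println(isProperlyNested(
                Arrays.asList("BT", "Tf", "Td", "BDC", "Tj", "ET", "MP", "BT", "Tf", "EMC")));
        // The same operators with the marked content closed inside the text object.
        System.out.println(isProperlyNested(
                Arrays.asList("BT", "Tf", "Td", "BDC", "Tj", "EMC", "ET", "MP", "BT", "Tf", "ET")));
    }
}
```

The sketch only looks at pairing; it does not check other content stream rules, such as the ban on nested text objects.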

    Missing ParentTree and StructParents

    The problem your question focuses on is

    There is an option called "Find Tag from Selection" . Is not working.

    Finding a tag from a selection essentially means that you have the MCID of some content stream instruction and search the structure tree for the structure element referencing that marked-content ID.

    How PDF processors are expected to do this is described in section 14.7.4.4 "Finding Structure Elements from Content Items" of the PDF specification ISO 32000-1 (or section 14.7.5.4 in ISO 32000-2):

    Because a stream cannot contain object references, there is no way for content items that are marked-content sequences to refer directly back to their parent structure elements (the ones to which they belong as content items). Instead, a different mechanism, the structural parent tree, shall be provided for this purpose. For consistency, content items that are entire PDF objects, such as XObjects, shall also use the parent tree to refer to their parent structure elements.

    The parent tree is a number tree, accessed from the ParentTree entry in a document’s structure tree root. The tree shall contain an entry for each object that is a content item of at least one structure element and for each content stream containing at least one marked-content sequence that is a content item.

    Your PDF does not have a ParentTree at all, and your page does not contain a StructParents entry to look up in a parent tree. Thus, the prescribed way to get from marked content to the structure tree cannot be followed.

    A ParentTree was added in the third shared PDF, new.pdf.
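
To make the lookup path concrete, here is a rough model of what a viewer has to do for "Find Tag from Selection": take the page's StructParents value, use it as the key into the parent tree, and use the MCID as the index into the array found there. This is a plain-Java sketch with made-up names; structure elements are represented by simple string labels, and none of this is PDFBox API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ParentTreeLookup {

    /** Parent tree model: StructParents key -> structure element labels indexed by MCID. */
    private final Map<Integer, List<String>> parentTree = new HashMap<>();

    void put(int structParentsKey, List<String> elementsByMcid) {
        parentTree.put(structParentsKey, elementsByMcid);
    }

    /**
     * Resolves a marked-content sequence to its structure element. Returns null
     * when the page has no StructParents entry or the parent tree has no match.
     */
    String find(Integer pageStructParents, int mcid) {
        if (pageStructParents == null) {
            return null; // page without StructParents: nothing to look up
        }
        List<String> byMcid = parentTree.get(pageStructParents);
        if (byMcid == null || mcid < 0 || mcid >= byMcid.size()) {
            return null;
        }
        return byMcid.get(mcid);
    }

    public static void main(String[] args) {
        ParentTreeLookup tree = new ParentTreeLookup();
        // Hypothetical page with StructParents 0 and two marked-content sequences.
        tree.put(0, Arrays.asList("H2 for MCID 0", "P for MCID 1"));
        System.out.println(tree.find(0, 1));
        // A page without StructParents, as in the first shared PDF.
        System.out.println(tree.find(null, 1));
    }
}
```

With neither a ParentTree nor a StructParents entry, both steps of this lookup fail, which matches the behavior described in the question.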

    Incorrect ParentTree entries

    While in new.pdf you have a ParentTree, its contents are clearly incorrect:

    Screenshot ParentTree

    The ParentTree is a number tree, i.e. integers are mapped to something here, so there obviously must not be multiple entries for the same integer key.

    Furthermore, looking inside one of those values:

    Screenshot ParentTree, first entry opened

    one sees that you claim that the following StructElem is the value for all marked content IDs:

    common value of all content IDs

    Inspecting this StructElem further, one sees that it represents the final paragraph on the final page.

    Thus, your observation

    Now instead of "selection not found" it is highlighting the last <P> tag in parent tree. Irrespective of what we selected.

    is what one can expect, if one can expect any reasonable behavior at all from a ParentTree structure broken so badly.

    Besides new.pdf, res.pdf and "tagged without altext.pdf" also had ParentTrees, but all of them were broken in the same way as the tree of new.pdf.

    You might want to start inspecting the structures you create when analyzing an unwanted behavior.
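
One cheap inspection of that kind is to check the keys of the number tree before saving. In a Nums array the keys must be sorted in ascending order, so any repeated key is an error. The following plain-Java sketch assumes you have already collected the keys into an int array; the class name, method name, and sample key lists are hypothetical.

```java
final class NumberTreeKeys {

    /**
     * Returns the index of the first key that is not strictly greater than its
     * predecessor, or -1 if the keys are strictly ascending (and thus unique).
     */
    static int firstInvalidKey(int[] keys) {
        for (int i = 1; i < keys.length; i++) {
            if (keys[i] <= keys[i - 1]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up key lists: one with a repeated key, one strictly ascending.
        System.out.println(firstInvalidKey(new int[] {0, 0, 0, 0}));
        System.out.println(firstInvalidKey(new int[] {0, 1, 2, 3}));
    }
}
```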

    Another issue with parent tree entries

    The previously described issue in the parent trees has meanwhile been resolved: different pages now have different struct parents, and the parent tree arrays now reference the struct elements for distinct MCIDs.

    For some documents a different error occurs now, though, e.g. "res29_08_19.pdf". Here the parent tree starts like this:

    Screenshot ParentTree

    In particular the first entry in the array is for MCID 3, the second for MCID 4, ...

    This is invalid according to the specification:

    The array element corresponding to each sequence shall be found by using the sequence’s marked-content identifier as a zero-based index into the array.

    (ISO 32000-1 section 14.7.4.4 "Finding Structure Elements from Content Items")

    Thus, the first entry must be for MCID 0, the second for MCID 1, ...

    You objected in a comment

    No I used 0 and 1 Mcid's for Artifacts.

    But as a corollary of the above: Do not give MCIDs to marked-content sequences you don't have a structure element for! MCIDs are for going back and forth between the structure hierarchy and the content streams. If you mark a piece of content without having a structure element for it, don't give it an MCID.
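
A way to get both points right at once is to hand out MCIDs only while building the parent tree array for the page, so that the array index and the MCID cannot drift apart. This is a minimal plain-Java sketch under that assumption; the Content class and the string labels for structure elements are invented for illustration and are not PDFBox types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class McidAssignment {

    /** A piece of page content; structElem is null for artifacts. Hypothetical model. */
    static final class Content {
        final String structElem;
        Integer mcid; // stays null for artifacts

        Content(String structElem) {
            this.structElem = structElem;
        }
    }

    /**
     * Assigns MCIDs 0, 1, 2, ... to tagged content only and returns the parent
     * tree array for the page, in which the index equals the MCID.
     */
    static List<String> assign(List<Content> pageContent) {
        List<String> parentArray = new ArrayList<>();
        for (Content c : pageContent) {
            if (c.structElem == null) {
                continue; // artifact: mark it in the stream, but without an MCID
            }
            c.mcid = parentArray.size();
            parentArray.add(c.structElem);
        }
        return parentArray;
    }

    public static void main(String[] args) {
        // Two artifacts followed by a heading and a paragraph.
        List<Content> page = Arrays.asList(
                new Content(null), new Content(null), new Content("H2"), new Content("P"));
        System.out.println(assign(page));
        System.out.println(page.get(2).mcid);
    }
}
```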

    Yet another issue with parent tree entries

    You again report problems with your newest file mathpdf.pdf. And indeed, there are issues; Adobe Acrobat Preflight reports a five-page list of inconsistent parent tree mappings like this:

    preflight report excerpt

    In contrast to the previous issues, the cause does not become clear by looking at the parent tree alone; one also has to look at the structure hierarchy.

    Doing so, though, one peculiarity immediately catches the eye: In your parent tree you do not reference the actual parent structure element of the MCID. Instead you reference a new structure tree node which claims to have the actual parent node from the structure hierarchy as its own parent (without actually being one of its kids) and also claims to have the MCID in question as a kid.

    For example let's look at the MCID 0 on the first page. In the structure hierarchy you have:

    structure hierarchy screen shot

    In the parent tree you have:

    parent tree screen shot

    You should have simply referenced object 238 (the structure hierarchy parent of MCID 0) directly from the parent tree array for page one instead of that in-between object 62 which claims to have that object 238 as parent and MCID 0 as kid.

    The reported inconsistency may be due to the fact that the node referenced from the parent tree (object 62) claims to be a P paragraph with a parent node (object 238) which is a Span. That is not allowed: a paragraph may contain a span, but it cannot be contained in one.
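
The mismatch between the parent tree and the structure hierarchy can be expressed as a simple check: the element referenced from the parent tree must list the MCID among its kids and must itself be reachable from the structure tree root. The sketch below models this in plain Java for a single page; Elem is a stand-in class, the "Document" root type is an assumption, and the two nodes merely play the roles of objects 238 and 62.

```java
import java.util.ArrayList;
import java.util.List;

final class ParentTreeConsistency {

    /** Minimal stand-in for a structure element; not a PDFBox class. */
    static final class Elem {
        final String type;
        Elem parent;
        final List<Object> kids = new ArrayList<>(); // Elem or Integer (MCID)

        Elem(String type) {
            this.type = type;
        }

        /** Adds a child element and sets its parent link. */
        void addKid(Elem kid) {
            kids.add(kid);
            kid.parent = this;
        }

        /** Adds a marked-content reference as a kid. */
        void addMcid(int mcid) {
            kids.add(Integer.valueOf(mcid));
        }
    }

    /** Returns true if target can be reached from the given node by following kids. */
    static boolean isReachable(Elem from, Elem target) {
        if (from == target) {
            return true;
        }
        for (Object kid : from.kids) {
            if (kid instanceof Elem && isReachable((Elem) kid, target)) {
                return true;
            }
        }
        return false;
    }

    /** Checks one parent tree entry against the structure hierarchy. */
    static boolean isConsistent(Elem root, Elem referenced, int mcid) {
        return referenced.kids.contains(Integer.valueOf(mcid)) && isReachable(root, referenced);
    }

    public static void main(String[] args) {
        Elem root = new Elem("Document");
        Elem span = new Elem("Span"); // plays the role of object 238
        root.addKid(span);
        span.addMcid(0);

        Elem detached = new Elem("P"); // plays the role of object 62
        detached.parent = span;        // claims span as parent, but span does not list it
        detached.addMcid(0);

        System.out.println(isConsistent(root, detached, 0));
        System.out.println(isConsistent(root, span, 0));
    }
}
```

The check deliberately ignores element types; whether a P may have a Span as its parent is a separate question from whether the parent tree points into the hierarchy.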