python accessibility adobe wcag alt-attribute

I am trying to add alt text to images inside a PDF programmatically


I have the alt text generated; I just need to attach it to the images under the Figure tag. A little background: I want to make my PDF accessible to the WCAG 2.1 AA standard, and I am using Adobe's auto-tag feature to tag the PDF. It tags the images as /Figure. I can extract the figures and generate alt text, but I can't find a way to embed that alt text on the image so that the result is WCAG 2.1 AA compliant. Ultimately I also want to run this in an AWS Lambda function. Is there any way I could do so? Thank you!

I tried several open-source libraries (pikepdf, PyMuPDF, and some more) and also tried converting the PDF to HTML or XML, but the problem there is that the result can't be converted back to exactly what the PDF was. I also tried adding the alt text directly in code, but the file ends up corrupt.


Solution

  • The MCID for alt text is either allocated when the PDF is generated (for this HTML web page, by the browser's PDF generator) or can easily be assigned manually in a GUI while checking the other content that needs human verification. Acrobat Preflight is therefore the simplest and easiest point at which to index alt text during the mandatory PDF/UA post-production.

    In a web page there is a direct 1:1 relationship: the alt= attribute sits right alongside the img src. That direct association is not maintained in a PDF.

    <div class="gravatar-wrapper-32">
    <img src="https://www.gravatar.com/avatar/f50f5b351c1d07d2a5a8f023e1731768?s=64&amp;d=identicon&amp;r=PG&amp;f=y&amp;so-version=2"
    alt="Aryan Khanna's user avatar"
    width="32" height="32" class="bar-sm"></div>
    

    Attempting to add all the interconnected components to a PDF stream after production is usually fraught with problems, since existing file components need re-indexing and renumbering, which becomes a massive, slow internal task.

    Adding the required object and all its dependents or ancestors (Parent = 119) and/or any child objects midway through a PDF is not easy. This is object 120 of 156. The image can be anywhere in the file, because the image and the /Alt text are not directly related; they are linked only by numbers in a page index. In this case the image was placed way back in the document, as object number 11.

    120 0 obj
    <</Type /StructElem /S /Figure /Alt (Aryan Khanna's user avatar) /P 119 0 R /K [ << /Type /MCR /Pg 2 0 R /MCID 34 >> ] /ID (node00000026) >>
    endobj
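
As a rough illustration of what object 120 contains, here is a minimal Python sketch that serializes such a /Figure structure element from generated alt text. The helper names `pdf_literal_string` and `figure_struct_elem` are hypothetical (they come from no library), and the sketch only builds the object text. It does not register the element in the parent's /K array or in the structure tree's ParentTree, which a valid tagged PDF also needs.

```python
def pdf_literal_string(text: str) -> str:
    """Encode text as a PDF string object (hypothetical helper).

    ASCII text becomes a literal string "(...)" with backslashes and
    parentheses escaped. Anything else becomes a hex string holding
    UTF-16BE with a byte-order mark, which PDF text strings allow.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return "<" + ("\ufeff" + text).encode("utf-16-be").hex().upper() + ">"
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return f"({escaped})"


def figure_struct_elem(obj_num: int, alt: str, parent_num: int,
                       page_num: int, mcid: int) -> str:
    """Serialize a /Figure structure element shaped like object 120."""
    return (
        f"{obj_num} 0 obj\n"
        f"<</Type /StructElem /S /Figure /Alt {pdf_literal_string(alt)} "
        f"/P {parent_num} 0 R "
        f"/K [ << /Type /MCR /Pg {page_num} 0 R /MCID {mcid} >> ] >>\n"
        "endobj\n"
    )


print(figure_struct_elem(120, "Aryan Khanna's user avatar", 119, 2, 34))
```

The escaping step matters in practice: generated alt text that contains an unbalanced parenthesis would otherwise end the string early, which is one plausible way a hand-edited file ends up corrupt.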
    

    To place the tag, find the image's number and look for it in the page contents; here it is invoked as /X11.

    /X11 Do
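
Looking for the image in the page contents can be sketched in Python as below. It assumes the content stream has already been decoded (decompressed) into bytes by whatever PDF library is in use, and `find_xobject_calls` is a hypothetical helper name. A regular expression is a simplification: it would also match `/Name Do` text that happens to sit inside a string, and `Do` invokes form XObjects as well as images, so each name should still be checked against the page's /XObject resources.

```python
import re

# Matches an XObject invocation such as "/X11 Do" in a decoded content stream.
DO_PATTERN = re.compile(rb"/([^\s/\[\]<>()]+)\s+Do\b")


def find_xobject_calls(content: bytes) -> list[tuple[str, int]]:
    """Return (resource name, byte offset) for every `/Name Do` found."""
    return [
        (match.group(1).decode("latin-1"), match.start())
        for match in DO_PATTERN.finditer(content)
    ]


# Hypothetical decoded page content with two XObject invocations.
sample = b"q 100 0 0 50 72 700 cm /X11 Do Q\nq 80 0 0 80 72 500 cm /X12 Do Q"
for name, offset in find_xobject_calls(sample):
    print(name, offset)
```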
    

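    Now inject the related tag, carrying the /MCID 34 number, before it, and close the marked-content sequence with EMC after it (every BDC needs a matching EMC):

    /P<</MCID 34>>BDC
    /X11 Do
    EMC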
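
The injection step could be automated roughly as follows. This is a sketch under assumptions, not a complete tagging tool: it expects the decoded content stream as bytes, assumes the `Do` call is not already inside a marked-content sequence, and uses a hypothetical helper name `tag_xobject`. The tag defaults to `P` only to mirror the snippet above. The modified stream still has to be written back with a PDF library, and the MCID must match the one in the structure element.

```python
import re


def tag_xobject(content: bytes, name: str, mcid: int, tag: str = "P") -> bytes:
    """Wrap the first `/name Do` in a BDC ... EMC marked-content sequence."""
    pattern = re.compile(rb"/" + re.escape(name.encode("ascii")) + rb"\s+Do\b")
    tag_bytes = tag.encode("ascii")

    def wrap(match: re.Match) -> bytes:
        # Open the marked content, keep the original operator, then close it.
        return b"/%s <</MCID %d>> BDC\n%s\nEMC" % (tag_bytes, mcid, match.group(0))

    new_content, count = pattern.subn(wrap, content, count=1)
    if count == 0:
        raise ValueError(f"/{name} Do not found in content stream")
    return new_content


# Hypothetical decoded page content that draws the image registered as /X11.
page = b"q 100 0 0 50 72 700 cm /X11 Do Q"
print(tag_xobject(page, "X11", 34).decode("ascii"))
```

Raising an error when the operator is missing, rather than returning the stream unchanged, makes it harder to ship a PDF whose structure element points at an MCID that no content carries.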
    

    That is where the link to the tag is manually placed, just before the correct image, as a child reference, so the image is seen as tagged content.

    However, since EVERY PDF needs AT LEAST two manual visual checks to verify images, it is easiest to check the image alt data at the same time.
