java pdfbox apache-tika

How to extract ALT-Texts and Images from a PDF


I have a PDF that contains text and images. All images have an ALT-Text for accessibility readers.

Can someone tell me how to extract <BufferedImage, String> pairs, where the BufferedImage is the image and the String is its ALT-Text?

For me, it doesn't matter whether I use PDFBox or Apache Tika.

Example PDF: GitHub Repository with example PDF


Solution

  • Here's some code that expands on the solution from the question "Apache PDFBox PDFTextStripper access text parts of the page, how can I?". It has been tested only on your file, so you may encounter surprises. The call to getNumberTreeAsMap sits inside the loop, which is inefficient; it should be moved to the top so that the parent tree is flattened only once. Null checks and class checks are missing.

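    // imports for the PDFBox 3.x API (Loader exists only in 3.x)
    import java.io.File;
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.pdfbox.Loader;
    import org.apache.pdfbox.cos.COSArray;
    import org.apache.pdfbox.cos.COSDictionary;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.COSObjectable;
    import org.apache.pdfbox.pdmodel.common.PDNumberTreeNode;
    import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDParentTreeValue;
    import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;
    import org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
    import org.apache.pdfbox.text.PDFMarkedContentExtractor;

    public static void main(String[] args) throws IOException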
    {
        PDDocument document = Loader.loadPDF(new File("mixed-3-images.pdf"));
        PDFMarkedContentExtractor markedContentExtractor = new PDFMarkedContentExtractor();
        PDPage page = document.getPage(0); // TODO: loop over all pages
        markedContentExtractor.processPage(page);
        List<PDMarkedContent> markedContents = markedContentExtractor.getMarkedContents();
        for (PDMarkedContent pdMarkedContent : markedContents)
        {
            COSDictionary pdmcProperties = pdMarkedContent.getProperties();
            if (pdmcProperties == null)
                continue;
            // actual answer starts here
            if (pdMarkedContent.getContents().size() >= 1 && pdMarkedContent.getContents().get(0) instanceof PDImageXObject)
            {
                PDImageXObject img = (PDImageXObject) pdMarkedContent.getContents().get(0);
                int mcid = pdmcProperties.getInt(COSName.MCID);
                System.out.println("MCID: " + mcid + "; " + img.getImage());
                
                PDStructureTreeRoot structureTreeRoot = document.getDocumentCatalog().getStructureTreeRoot();
                int sp = page.getStructParents();
                PDNumberTreeNode parentTree = structureTreeRoot.getParentTree();
                Map<Integer, COSObjectable> numberTreeAsMap = getNumberTreeAsMap(parentTree);
                PDParentTreeValue val = (PDParentTreeValue) numberTreeAsMap.get(sp);
                COSArray mcidArray = (COSArray) val.getCOSObject();
                COSDictionary obj = (COSDictionary) mcidArray.getObject(mcid);
                System.out.println("ALT: " + obj.getDictionaryObject(COSName.ALT));
            }
        }
    }
    
    // adapted from the PDFMergerUtility source code
    // PDNumberTreeNode.getNumbers() only returns the entries of a single level,
    // so the kids have to be walked recursively
    static Map<Integer, COSObjectable> getNumberTreeAsMap(PDNumberTreeNode tree) throws IOException
    {
        if (tree == null)
        {
            return new LinkedHashMap<>();
        }
        Map<Integer, COSObjectable> numbers = tree.getNumbers();
        if (numbers == null)
        {
            numbers = new LinkedHashMap<>();
        }
        else
        {
            // must copy because the map is read only
            numbers = new LinkedHashMap<>(numbers);
        }
        List<PDNumberTreeNode> kids = tree.getKids();
        if (kids != null)
        {
            for (PDNumberTreeNode kid : kids)
            {
                numbers.putAll(getNumberTreeAsMap(kid));
            }
        }
        return numbers;
    }
    

    Output:

    MCID: 25; BufferedImage@6475472c: type = 1 DirectColorModel: rmask=ff0000 gmask=ff00 bmask=ff amask=0 IntegerInterleavedRaster: width = 478 height = 699 #Bands = 3 xOff = 0 yOff = 0 dataOffset[0] 0
    ALT: COSString{Japanese Mask }
    

    The same code run on the second page prints:

    MCID: 6; BufferedImage@43aaf813: type = 1 DirectColorModel: rmask=ff0000 gmask=ff00 bmask=ff amask=0 IntegerInterleavedRaster: width = 461 height = 645 #Bands = 3 xOff = 0 yOff = 0 dataOffset[0] 0
    ALT: COSString{Black Dog and White Cat }
    MCID: 7; BufferedImage@328cf0e1: type = 1 DirectColorModel: rmask=ff0000 gmask=ff00 bmask=ff amask=0 IntegerInterleavedRaster: width = 413 height = 645 #Bands = 3 xOff = 0 yOff = 0 dataOffset[0] 0
    ALT: null
    MCID: 8; BufferedImage@201b6b6f: type = 1 DirectColorModel: rmask=ff0000 gmask=ff00 bmask=ff amask=0 IntegerInterleavedRaster: width = 913 height = 877 #Bands = 3 xOff = 0 yOff = 0 dataOffset[0] 0
    ALT: null
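
The two caveats mentioned at the top (flatten the parent tree once, and add null checks) can be sketched without PDFBox. What follows is only a minimal model, not PDFBox code: Node is a hypothetical stand-in for PDNumberTreeNode, the strings stand in for the /Alt entries of the structure elements, and the keys and MCIDs in sampleTree are made up for illustration. It flattens the tree a single time and then does a guarded lookup by the page's StructParents key and the MCID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NumberTreeSketch {

    /** Hypothetical stand-in for PDNumberTreeNode: own entries plus child nodes. */
    static final class Node {
        final Map<Integer, List<String>> numbers = new LinkedHashMap<>();
        final List<Node> kids = new ArrayList<>();
    }

    /** Recursively merges the entries of a node and of all its descendants. */
    static Map<Integer, List<String>> flatten(Node node) {
        Map<Integer, List<String>> result = new LinkedHashMap<>();
        if (node == null) {
            return result;
        }
        result.putAll(node.numbers);
        for (Node kid : node.kids) {
            result.putAll(flatten(kid));
        }
        return result;
    }

    /** Returns the ALT text for (structParents, mcid), or null if any step is missing. */
    static String altFor(Map<Integer, List<String>> flat, int structParents, int mcid) {
        List<String> perMcid = flat.get(structParents);
        if (perMcid == null || mcid < 0 || mcid >= perMcid.size()) {
            return null;
        }
        return perMcid.get(mcid);
    }

    /** Made-up sample data: a root with two kids, one entry per page. */
    static Node sampleTree() {
        Node root = new Node();
        Node kidA = new Node();
        kidA.numbers.put(0, Arrays.asList("Japanese Mask"));
        Node kidB = new Node();
        kidB.numbers.put(1, Arrays.asList("Black Dog and White Cat", null));
        root.kids.add(kidA);
        root.kids.add(kidB);
        return root;
    }

    public static void main(String[] args) {
        // Flatten once, before looping over pages and marked contents.
        Map<Integer, List<String>> flat = flatten(sampleTree());
        System.out.println(altFor(flat, 0, 0)); // entry found in the first kid
        System.out.println(altFor(flat, 1, 1)); // element exists but has no ALT text
        System.out.println(altFor(flat, 5, 0)); // unknown StructParents key
    }
}
```

In the real code the same shape would apply: call getNumberTreeAsMap once before the loop, and check each intermediate result (parent tree entry, array element, its class) before casting.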