Tags: pdf, hyperlink, adobe, itext, pdf-extraction

Hyperlink Detection from PDF


I have some PDFs containing hyperlinks, both URLs and mailto links. Is there any way or tool (possibly third-party) to extract hyperlink metadata from a PDF, such as the coordinates, link type and destination address? Any help is highly appreciated.

I have already tried iText and PDFBox without much success, and even some third-party software does not give me the desired output.

I have tried the following code in Java using iText:

        // Requires java.util.ArrayList and iText's PdfReader, PdfDictionary, PdfArray and PdfName.
        PdfReader myReader = new PdfReader("pdf File Path");
        // Only the first page is inspected; page numbers start at 1.
        PdfDictionary pageDict = myReader.getPageN(1);
        PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
        System.out.println(annots);
        ArrayList<String> dests = new ArrayList<String>();
        if (annots != null)
        {
            for (int i = 0; i < annots.size(); ++i)
            {
                PdfDictionary annotDict = annots.getAsDict(i);
                if (annotDict == null)
                {
                    continue; // entry is not a dictionary
                }
                PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
                if (subType != null && PdfName.LINK.equals(subType))
                {
                    PdfDictionary action = annotDict.getAsDict(PdfName.A);
                    if (action != null && PdfName.URI.equals(action.getAsName(PdfName.S))
                            && action.getAsString(PdfName.URI) != null)
                    {
                        dests.add(action.getAsString(PdfName.URI).toString());
                    } // else: not a URI action (for example, an internal link)
                }
            }
        }
        myReader.close();
        System.out.println(dests);
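
The loop above collects only the destination string. In the PDF format, the position of a link annotation is stored in its /Rect entry as four numbers describing two opposite corners, normally measured from the lower-left corner of the page. With iText that array should be reachable through something like `annotDict.getAsArray(PdfName.RECT)`. The sketch below covers only the arithmetic on those four numbers, using plain Java; `LinkRect` and its methods are hypothetical names, and page rotation or a shifted page origin are not handled.

```java
/** Hypothetical helper: normalizes the four numbers of a /Rect entry. */
final class LinkRect {
    final double x;      // left edge, in PDF units
    final double y;      // bottom edge, measured from the bottom of the page
    final double width;
    final double height;

    private LinkRect(double x, double y, double width, double height) {
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Accepts [x1, y1, x2, y2]; the two corners may come in either order. */
    static LinkRect fromPdfRect(double[] rect) {
        if (rect == null || rect.length != 4) {
            throw new IllegalArgumentException("expected exactly four numbers");
        }
        double left = Math.min(rect[0], rect[2]);
        double bottom = Math.min(rect[1], rect[3]);
        double w = Math.abs(rect[2] - rect[0]);
        double h = Math.abs(rect[3] - rect[1]);
        return new LinkRect(left, bottom, w, h);
    }

    /** Distance of the top edge from the top of the page (top-left origin). */
    double topFromPageTop(double pageHeight) {
        return pageHeight - (y + height);
    }
}
```

A call such as `LinkRect.fromPdfRect(new double[] {100, 650, 300, 700})` would then give a box at x = 100, y = 650 that is 200 units wide and 50 units high.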

Solution

  • You can use the Docotic.Pdf library to extract links (disclaimer: I work for the company).

    Below is code that opens the specified file, finds all hyperlinks, collects the position of each link, and draws a rectangle around each one.

    After that, the code saves a new PDF (with the links outlined by rectangles) and a text file with the collected information. Finally, both files are opened in their default viewers.

    public static void ListAndHighlightLinks(string inputFile, string outputFile, string outputTxt)
    {
        using (PdfDocument doc = new PdfDocument(inputFile))
        {
            StringBuilder sb = new StringBuilder();
    
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                PdfPage page = doc.Pages[i];
                foreach (PdfWidget widget in page.Widgets)
                {
                    PdfActionArea actionArea = widget as PdfActionArea;
                    if (actionArea == null)
                        continue;
    
                    PdfUriAction linkAction = actionArea.Action as PdfUriAction;
                    if (linkAction == null)
                        continue;
    
                    Uri url = linkAction.Uri;
                    PdfRectangle rect = actionArea.BoundingBox;
    
                    // add information about found link into string buffer
                    // (i is zero-based, so the page number is reported as i + 1)
                    sb.Append("Page ");
                    sb.Append((i + 1).ToString());
                    sb.Append(" : ");
                    sb.Append(rect.ToString());
                    sb.Append(" ");
                    sb.AppendLine(url.ToString());
    
                    // draw rectangle around found link
                    page.Canvas.DrawRectangle(rect);
                }
            }
    
            // save document with highlighted links and text information about links to files
            doc.Save(outputFile);
            System.IO.File.WriteAllText(outputTxt, sb.ToString());
    
            // open created PDF and text file in default viewers
            System.Diagnostics.Process.Start(outputTxt);
            System.Diagnostics.Process.Start(outputFile);
        }
    }
    

    You can use the sample code with a call like this:

    ListAndHighlightLinks("input.pdf", "output.pdf", "links.txt");
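
The question also asks for the link type (URL versus mailto). Both kinds arrive as plain destination strings, so one possible approach is to classify them by scheme after extraction. The following is a minimal sketch in plain Java, the language used in the question; `LinkTypes`, `Kind`, `classify` and `mailtoAddress` are hypothetical names that belong to neither iText nor Docotic.Pdf, and only the http, https and mailto schemes are recognized.

```java
import java.util.Locale;

/** Hypothetical helper: sorts extracted link destinations by scheme. */
final class LinkTypes {
    enum Kind { WEB, MAILTO, OTHER }

    private static final String MAILTO_PREFIX = "mailto:";

    /** Classifies a destination string by its scheme, ignoring case. */
    static Kind classify(String destination) {
        if (destination == null) {
            return Kind.OTHER;
        }
        String d = destination.trim().toLowerCase(Locale.ROOT);
        if (d.startsWith(MAILTO_PREFIX)) {
            return Kind.MAILTO;
        }
        if (d.startsWith("http://") || d.startsWith("https://")) {
            return Kind.WEB;
        }
        return Kind.OTHER;
    }

    /** Returns the address part of a mailto link, or null for other kinds. */
    static String mailtoAddress(String destination) {
        if (classify(destination) != Kind.MAILTO) {
            return null;
        }
        String rest = destination.trim().substring(MAILTO_PREFIX.length());
        int query = rest.indexOf('?'); // drop "?subject=..." and similar
        return query >= 0 ? rest.substring(0, query) : rest;
    }
}
```

Each string collected from the PDF could be passed through `LinkTypes.classify(...)` before it is written out, so the report carries the link type next to the coordinates and the destination.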