html pdf annotations scribd document-conversion

What technology is used behind A.nnotate.com?


I would like to know how services like A.nnotate.com, Scribd, and Google Docs render PDF, .doc, or any other document as HTML, and how the annotation system works.


Solution

  • A.nnotate.com does server-side conversion of PDF pages into PNG images at a given zoom level using xpdf - these are what get displayed in the browser.

    The text highlighting is done by extracting the text positions from the PDF, then adding a transparent overlay on top of the page images, with absolutely positioned HTML divs over the words. Annotations then use an Ajax GUI to attach notes to the highlighted text.

    Other formats (MS Word, PPT, etc.) are first converted to PDF using OpenOffice, then to images and text overlays as for PDFs.

    I think the other HTML document sites do something similar for rendering PDFs as HTML (i.e. page images plus a word overlay of transparent divs). An alternative trick is to convert the PDF's embedded fonts to HTML5 CSS fonts and use absolutely positioned divs for the text (and extract and position the images too).
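
To make the "page image plus word overlay" idea concrete, here is a minimal Python sketch of the overlay step only. It is not A.nnotate's actual code. It assumes the word positions have already been extracted (for example by a tool that reports word bounding boxes), that they are given in PDF points with a top-left origin, and that the page image was rendered at a known DPI. The names `WordBox` and `overlay_html`, the CSS class names, and the choice to keep the word text inside each div with a transparent color are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class WordBox:
    """One word and its bounding box in PDF points (1/72 inch).

    Assumption: the origin is the top-left corner of the page.
    """
    text: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def overlay_html(words, page_width_pt, page_height_pt, image_url, dpi=144):
    """Build HTML for one page: the rendered image plus one div per word."""
    scale = dpi / 72.0  # PDF points -> pixels at the rendered zoom level
    parts = [
        '<div class="page" style="position:relative;'
        f'width:{page_width_pt * scale:.1f}px;'
        f'height:{page_height_pt * scale:.1f}px">',
        f'<img src="{escape(image_url, quote=True)}" alt="">',
    ]
    for index, word in enumerate(words):
        left = word.x_min * scale
        top = word.y_min * scale
        width = (word.x_max - word.x_min) * scale
        height = (word.y_max - word.y_min) * scale
        # Each word gets an absolutely positioned, visually transparent div
        # that sits exactly over the word in the page image.
        parts.append(
            f'<div class="word" data-index="{index}" style="position:absolute;'
            f'left:{left:.1f}px;top:{top:.1f}px;'
            f'width:{width:.1f}px;height:{height:.1f}px;'
            f'color:transparent">{escape(word.text)}</div>'
        )
    parts.append("</div>")
    return "\n".join(parts)
```

For example, `overlay_html([WordBox("Hello", 72, 72, 108, 84)], 612, 792, "page1.png")` yields a container div wrapping the page image and a single word div. A highlighting or annotation UI could then attach notes by referring to the `data-index` values of the selected word divs, which is one possible way to link notes to text rather than the way any particular site does it.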