xml pdf mupdf xpdf

How to convert a PDF document to XML and extract the sections that contain table data


I have a PDF document that I want to convert to XML or HTML.

Because the PDF document contains some tables, once it has been converted to XML or HTML I cannot tell which parts are table data and which are ordinary text.

I want to extract the table data and store it in a database.

Can xpdf or MuPDF do this?

Thanks.


Solution

  • PDF does not (in general) contain information about the structure of text. Text is just text; nothing in the file identifies it as belonging to a table.

    Therefore there is no reliable way for any PDF reading application to identify text as being part of a table. So MuPDF will not be able to tell you this.

    You can, of course, attempt to apply heuristics yourself, identifying text in rows at the same vertical offset, and looking for text spaced horizontally at regular x offsets.
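
As a rough illustration of the heuristic described above, here is a minimal Python sketch. It assumes you have already extracted word boxes as `(x0, y0, x1, y1, text)` tuples by some means (for example, a word-level extraction mode in a PDF library; PyMuPDF's `page.get_text("words")` returns tuples that begin with these five fields). The function names `group_rows` and `find_tables` and the tolerance values are hypothetical choices made for this example, not part of MuPDF or xpdf, and the approach assumes each table cell is a single word.

```python
def group_rows(words, y_tol=2.0):
    """Group word boxes into rows whose top edges (y0) lie within y_tol."""
    rows = []
    for w in sorted(words, key=lambda w: (w[1], w[0])):
        # Compare against the first word of the current row.
        if rows and abs(rows[-1][0][1] - w[1]) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Order the words in each row from left to right.
    return [sorted(r, key=lambda w: w[0]) for r in rows]


def find_tables(rows, x_tol=3.0, min_rows=2, min_cols=2):
    """Return runs of consecutive rows whose words start at matching x offsets."""

    def aligned(a, b):
        # Same number of cells, enough columns, and left edges that line up.
        return len(a) == len(b) >= min_cols and all(
            abs(p[0] - q[0]) <= x_tol for p, q in zip(a, b)
        )

    tables, current = [], []
    for row in rows:
        if current and aligned(current[-1], row):
            current.append(row)
        else:
            if len(current) >= min_rows:
                tables.append(current)
            current = [row]
    if len(current) >= min_rows:
        tables.append(current)
    # Keep only the text of each cell.
    return [[[w[4] for w in row] for row in table] for table in tables]


if __name__ == "__main__":
    # Made-up coordinates: a heading, a 3x3 table, and a trailing line.
    sample = [
        (50, 100, 120, 110, "Introduction"),
        (50, 130, 80, 140, "Name"), (150, 130, 170, 140, "Qty"), (250, 130, 280, 140, "Price"),
        (50, 145, 85, 155, "Apple"), (150, 145.5, 160, 155, "3"), (251, 145, 275, 155, "1.20"),
        (50, 160, 80, 170, "Pear"), (150, 160, 160, 170, "7"), (250, 160, 275, 170, "0.80"),
        (50, 190, 90, 200, "Summary"),
    ]
    for table in find_tables(group_rows(sample)):
        for row in table:
            print(row)
```

Real documents will need more than this: cells that contain several words, empty cells, right-aligned numbers, and wrapped lines all break the simple "same count, same x offset" test, so treat the result as a starting point and tune the tolerances for your own files.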