[SOLVED] Repairing pdfs with damaged xref table

Are there any solutions (preferably in Python) that can repair pdfs with damaged xref tables?

I have a pdf that I tried to convert to a png in Ghostscript and received the following error:

    **** Error: An error occurred while reading an XREF table.
    **** The file has been damaged. This may have been caused
    **** by a problem while converting or transfering the file.

However, I am able to open the pdf in Preview on my Mac and when I export the pdf using Preview, I am able to convert the exported pdf.

Is there any way to repair pdfs without having to manually open them and export them?

Solution

If the file renders as expected in Ghostscript then you can run it through GS to the pdfwrite device and create a new PDF file which won't be damaged.
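Since the question asks for something scriptable from Python, here is a minimal sketch of that approach, assuming Ghostscript is installed and reachable as `gs` on the PATH (on Windows the console executable is usually named `gswin64c`). The function names `build_repair_command` and `repair_pdf` are made up for this example; only the `-o` and `-sDEVICE=pdfwrite` switches come from Ghostscript itself.

```python
import shutil
import subprocess


def build_repair_command(src, dst, gs="gs"):
    """Return the argv that asks Ghostscript to rewrite src as a new PDF at dst."""
    return [
        gs,
        "-o", str(dst),        # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",   # re-emit the document through the pdfwrite device
        str(src),
    ]


def repair_pdf(src, dst, gs="gs"):
    """Run Ghostscript and return its combined output so warnings can be inspected."""
    if shutil.which(gs) is None:
        raise FileNotFoundError(f"Ghostscript executable {gs!r} not found on PATH")
    result = subprocess.run(
        build_repair_command(src, dst, gs),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams into one text blob
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"Ghostscript failed:\n{result.stdout}")
    return result.stdout


if __name__ == "__main__":
    # "damaged.pdf" and "repaired.pdf" are placeholder file names.
    print(repair_pdf("damaged.pdf", "repaired.pdf"))
```

The returned text is worth keeping, because it is where Ghostscript reports any problems it found while reading the original file.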

Preview (like Acrobat) is almost certainly silently repairing the problem in the background. Ghostscript does the same, but unlike other applications we feel you need to know that the file has a problem: first, so that you know it's broken, and second, so that if the file renders incorrectly in Ghostscript (or indeed other applications) you know why.

Note that there are two main reasons for a damaged xref. In the first, the developer of the application that produced the file didn't read the specification carefully enough, so the file offsets in the xref are correct but the format is incorrect (this is not uncommon, and a repair by GS will be harmless). In the second, the file genuinely has been damaged, either in transit or by editing it.

In the latter case there may be other problems, and Ghostscript will try to warn you about those too. If you don't get any other warnings or errors, then it's probably just a malformed xref table.
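When repairing many files, it can help to pull those warnings out of Ghostscript's output automatically. The sketch below assumes diagnostic lines start with `****`, as they do in the error quoted in the question; that prefix is a convention observed there rather than a guaranteed interface, and `gs_diagnostics` is a hypothetical helper name.

```python
def gs_diagnostics(output):
    """Return the lines of Ghostscript output that look like warnings or errors.

    Assumption: diagnostics are prefixed with '****', as in the quoted message.
    """
    stripped = (line.strip() for line in output.splitlines())
    return [line for line in stripped if line.startswith("****")]


if __name__ == "__main__":
    sample = (
        "Processing pages 1 through 1.\n"
        "   **** Error: An error occurred while reading an XREF table.\n"
        "   **** The file has been damaged.\n"
        "Page 1\n"
    )
    for line in gs_diagnostics(sample):
        print(line)
```

If the only lines reported concern the XREF table, the file most likely falls into the harmless first category; anything beyond that is a reason to check the repaired output by eye.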