A colleague of mine filled in a dynamic PDF form, saved it, and sent it to me. However, probably because of some unusual character typed into it, the file does not open on either my colleague's PC or mine. It gives an XML parsing error: not well-formed (invalid token) (error code 4). The document contains a lot of important information, so I really need a way to recover it.
I tried many of the commonly recommended fixes, but none of them worked.
The only thing that partly worked was opening the PDF in the default Windows Notepad. It showed XML-formatted text, but most of the content was encoded (the gist shows only a small part of the encoded content at the end; there is much more). It looked something like this:
%PDF-1.6
%âãÏÓ
1 0 obj
<</AcroForm 59 0 R/MarkInfo<</Marked true>>/Metadata 2 0 R/Names 60 0 R/Pages 235 0 R/Type/Catalog/Perms 233 0 R/StructTreeRoot 243 0 R/NeedsRendering true>>
endobj
2 0 obj
<</Length 4114/Subtype/XML/Type/Metadata>>stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.4-c005 78.150055, 2013/08/07-22:58:47 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:pdf="http://ns.adobe.com/pdf/1.3/"
xmlns:xmp="http://ns.adobe.com/xap/1.0/"
xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
xmlns:desc="http://ns.adobe.com/xfa/promoted-desc/">
<dc:format>application/pdf</dc:format>
<dc:creator>
<rdf:Seq>
<rdf:li>DAAD</rdf:li>
</rdf:Seq>
</dc:creator>
<dc:title>
<rdf:Alt>
<rdf:li xml:lang="x-default">PBF: Gutachtenformular</rdf:li>
</rdf:Alt>
</dc:title>
<pdf:Producer>Adobe XML Form Module Library</pdf:Producer>
<xmp:CreateDate>2008-08-14T09:56:29+02:00</xmp:CreateDate>
<xmp:CreatorTool>Adobe LiveCycle Designer ES 10.4</xmp:CreatorTool>
<xmp:MetadataDate>2017-03-17T09:14:06+01:00</xmp:MetadataDate>
<xmp:ModifyDate>2017-03-17T09:14:06+01:00</xmp:ModifyDate>
<xmpMM:DocumentID>uuid:d62a53c0-8974-4b14-888e-569579f416d8</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:c097e78e-1dd1-11b2-0a00-9e91daf58acd</xmpMM:InstanceID>
<desc:embeddedHref rdf:parseType="Resource">
<rdf:value>G:\Z2\00- Verbindliche Formulare, Vorlagen\___Logo_fuer_Formulare_06_2015\DAAD_Globe_Logo-Supplement_eng_tl_rgb_300dpi.jpg</rdf:value>
<desc:ref>/template/subform[1]/pageSet[1]/pageArea[1]/draw[2]</desc:ref>
</desc:embeddedHref>
<desc:Schema-Anmerkung rdf:parseType="Resource">
<rdf:value>16 byte UUID in 32 chars (hexadecimal encoded)</rdf:value>
<desc:ref>/template/subform[1]/subform[1]/field[1]</desc:ref>
</desc:Schema-Anmerkung>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj
214 0 obj
<</Filter[/FlateDecode]/Length 419>>stream
H‰¼“[kÂ0Çßýg}²LŠ¦àæC7'†nÞØžB°§.,¶¥IÕáüîKÓ8[´2˜¬”^’ÿ¹äwÎa>Tåg„¡_]û”°@HÊ9z6t:`%‡>гàërº%Æ‚…Á1UnnáÊiØ•M
I tried many different decoding tools, with no success.
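You need to decode the streams with the FlateDecode method specifically: the /Filter[/FlateDecode] entry in the stream dictionary means the data is zlib/deflate-compressed, so generic decoding tools will not help. There is a working solution written by Stephen Haywood. I verified that it works in Python 2. Just change the PDF file name to yours and run the script in a terminal with the python command. Here is the gist.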
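#!/usr/bin/env python
import re
import zlib

# Read the whole PDF as raw bytes; change the file name to yours.
with open("some_doc.pdf", "rb") as f:
    pdf = f.read()

# Capture the body of every stream whose dictionary mentions FlateDecode.
stream = re.compile(br'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        # Under Python 3 this prints the bytes repr (b'...').
        print(zlib.decompress(s))
        print("")
    except zlib.error:
        # Not a valid zlib stream (or damaged data); skip it.
        pass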
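Once the streams are inflated, it may help to find out which one contains the character the parser rejects. The following is only a sketch for Python 3, not part of the original gist: the names `STREAM_RE`, `inflate_streams` and `find_xml_error` are made up for illustration, and it assumes the form data sits in Flate-compressed streams as XML, which is typical for forms made with LiveCycle Designer but is not guaranteed for every file. It uses `zlib.decompressobj()` so that trailing line breaks before `endstream` do not cause a failure, and `xml.parsers.expat` to report where the first "invalid token" occurs.

```python
import re
import zlib
from xml.parsers import expat

# "stream" keyword, its end-of-line marker, the body, then "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.S)


def inflate_streams(pdf_bytes):
    """Return the decompressed bytes of every stream that inflates cleanly."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj ignores trailing bytes after the zlib data.
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate-compressed, or too damaged to start
        if data:
            results.append(data)
    return results


def find_xml_error(data):
    """Return (line, column, message) for the first XML error, or None."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except expat.ExpatError as err:
        return err.lineno, err.offset, expat.ErrorString(err.code)
    return None
```

To use it, read the file with `open("some_doc.pdf", "rb").read()`, pass the bytes to `inflate_streams`, and call `find_xml_error` on each result. Streams that are not XML at all (fonts, images) will also report errors, so only the ones that start with `<` are worth a closer look. The reported line and column point at the byte that has to be removed or escaped by hand.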