pdfacrobat

Minimal PDF file according to PDF-2.0 spec results in corrupted document structure


I am trying to create a minimal PDF file example using the PDF-2.0 standard, based on the ISO specification. I would like to avoid the classic xref table and use only a cross-reference stream dictionary, with no trailer section.

The file opens in Adobe Acrobat, but when I close it, Acrobat offers to save it. I assume it does that to fix what it considers a corrupted document structure.

So I guess that my PDF does not comply with PDF-2.0. But why not?

Here is my code for the PDF-2.0 File:

UPDATE: I tried to follow some of the comments, thanks for these inputs. Update was:

I still don't know which bit is missing for Acrobat to recognize the file as valid and stop prompting with a save dialog.

%PDF-2.0
%Óëéá
1 0 obj
<</Type /Catalog
/Pages 2 0 R
/Metadata 5 0 R
>>
endobj
2 0 obj
<</Type /Pages
/Kids [3 0 R 4 0 R]
/Count 2
>>
endobj
3 0 obj
<</Type /Page
/Parent 2 0 R
/MediaBox [0 0 595 842]
>>
endobj
4 0 obj
<</Type /Page
/Parent 2 0 R
/MediaBox [0 0 595 842]
>>
endobj
5 0 obj
<</Type /Metadata
/Subtype /XML
/Length /2880
>>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <pdf:Producer>PdfProd</pdf:Producer>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:CreateDate>2024-02-28T23:46:34+01:00</xmp:CreateDate>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
      <xmpMM:DocumentID>f2015454-8669-45e4-9218-ad61ad0e2082</xmpMM:DocumentID>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
<?xpacket end="w"?>
endstream
endobj
6 0 obj
<</Type /XRef
/Index [0 7]
/Size 7
/W [1 2 1]
/Root 1 0 R
/ID [<1f7139e82f1c048ff020a6c953c3addd><1f7139e82f1c048ff020a6c953c3addd>]
/Length 77
>>
stream
00 0000 00
01 000F 01
01 004F 02
01 008D 03
01 00D3 04
01 0119 05
01 0CAB 06
endstream
endobj
startxref
3405
%%EOF


2ND UPDATE: I tried to implement all suggestions; many thanks for the very useful input in the comments. After these changes, an online PDF validator says the file is OK. But for Acrobat it is now even worse: when I try to open the file, Acrobat is no longer able to open it at all ("The file is damaged and could not be repaired."). Thanks in advance for any help!

%PDF-2.0
%Óëéá
1 0 obj
<</Type /Catalog
/Metadata 2 0 R
/Pages 3 0 R
>>
endobj
2 0 obj
<</Type /Metadata
/Length 2881
/Subtype /XML
>>
stream
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <pdf:Producer>Pdf2You</pdf:Producer>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
      <xmp:CreateDate>2024-03-04T23:42:40+01:00</xmp:CreateDate>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:format>application/pdf</dc:format>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
      <xmpMM:DocumentID>7525a6cc-24c0-4d27-a995-17ee0436f906</xmpMM:DocumentID>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>

                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
                                                                                                   
<?xpacket end="w"?>
endstream
endobj
3 0 obj
<</Type /Pages
/Kids [4 0 R]
/Count 1
>>
endobj
4 0 obj
<</Type /Page
/Parent 3 0 R
/MediaBox [0 0 595 842]
/Resources<<>>
>>
endobj
5 0 obj
<</Type /XRef
/Length 66
/Index [0 6]
/Filter /ASCIIHexDecode
/Size 6
/W [1 2 1]
/Root 1 0 R
/ID [<884dfb9a4ffe1d4accf3d4454478960f><884dfb9a4ffe1d4accf3d4454478960f>]
>>
stream
00 0000 00
01 000F 00
01 004F 00
01 0BE0 00
01 0C18 00
01 0C6D 00
endstream
endobj
startxref
3181
%%EOF

PS: the line endings are all LF. PPS: the validation tool that says the file is valid is https://www.pdf-online.com/osa/validate.aspx


Solution

  • I am showing the more common features of your file from the ISO 2.0 standard that allow acceptance by most, if not all, version 1 or 2 PDF readers, without Acrobat attempting any "fix" on opening or closing.

    The smallest possible with a "Trailer" and acceptable to Acrobat etc. is roughly 300 bytes (303 with preferred EOL after EOF).

    %PDF-2.0
    1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
    2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
    3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]>>endobj
    xref
    0 4
    0000000000 65535 f 
    0000000009 00000 n 
    0000000052 00000 n 
    0000000101 00000 n 
    trailer<</Size 4/Root 1 0 R>>
    startxref
    164
    %%EOF
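
The offsets in the xref table above are plain byte counts from the start of the file. A minimal Python sketch, assuming LF line endings and the one-line object layout of the listing (the helper name `build_classic_pdf` is made up for illustration), reproduces them. It writes 65535 as the generation number of the free-list head, the value ISO 32000 prescribes.

```python
def build_classic_pdf(bodies, root=1):
    """Assemble a PDF with a classic xref table.

    bodies[i] is the dictionary text of object i + 1 (hypothetical helper).
    """
    out = bytearray(b"%PDF-2.0\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))      # byte offset of "N 0 obj"
        out += f"{number} 0 obj{body}endobj\n".encode("ascii")
    startxref = len(out)              # byte offset of the "xref" keyword
    size = len(bodies) + 1            # object 0 is the head of the free list
    out += f"xref\n0 {size}\n".encode("ascii")
    out += b"0000000000 65535 f \n"   # every entry is exactly 20 bytes
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")
    out += f"trailer<</Size {size}/Root {root} 0 R>>\n".encode("ascii")
    out += f"startxref\n{startxref}\n%%EOF\n".encode("ascii")
    return bytes(out), offsets, startxref


bodies = [
    "<</Type/Catalog/Pages 2 0 R>>",
    "<</Type/Pages/Count 1/Kids[3 0 R]>>",
    "<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]>>",
]
pdf, offsets, startxref = build_classic_pdf(bodies)
print(offsets, startxref, len(pdf))   # [9, 52, 101] 164 303
```

Any change to an object body shifts every later offset, which is why hand-edited files so often end up with a stale table.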
    

    The smallest with an XRef stream that is not "fixed" or "rejected" by Acrobat Viewer 6 or later is 371 bytes (perhaps 370 if you ignore the EOL after %%EOF)!

    %PDF-2.0
    %ÞЃ²
    2 0 obj<</Type/Catalog/Pages 4 0 R>>endobj
    4 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
    3 0 obj<</Type/Page/Parent 4 0 R/MediaBox[0 0 612 792]>>endobj
    1 0 obj<</Type/XRef/Root 2 0 R/Length 10/Size 5/Index [0 5]/W [1 1 0]/ID [<0123456789ABCDEF0123456789ABCDEF><0123456789ABCDEF0123456789ABCDEF>]>>
    stream
      ªk: 
    endstream endobj
    startxref
    170
    %%EOF
    % The following hex table is a pictorial representation in HexCode of the streams
    % 10 byte binary data hence we must include the binary 2nd line (%ÞЃ²) marker.
    stream
    00 00
    01 AA
    01 0F
    01 6B
    01 3A
    endstream endobj
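
To read the hex table above: each row is split into fields whose byte widths come from /W. A rough Python sketch of that decoding follows, assuming the stream data has already been decoded; `parse_xref_stream` is a hypothetical helper, not part of any library.

```python
def parse_xref_stream(data, widths):
    """Split decoded cross-reference stream data into one tuple per object."""
    row_len = sum(widths)
    if row_len == 0 or len(data) % row_len:
        raise ValueError("stream length is not a multiple of the row width")
    entries = []
    for start in range(0, len(data), row_len):
        pos = start
        fields = []
        for index, width in enumerate(widths):
            if width == 0:
                # An absent first field means type 1; this sketch uses 0 for
                # any other absent field (the default for the third field).
                fields.append(1 if index == 0 else 0)
            else:
                fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        entries.append(tuple(fields))
    return entries


data = bytes.fromhex("00 00 01 AA 01 0F 01 6B 01 3A")
entries = parse_xref_stream(data, [1, 1, 0])
for number, entry in enumerate(entries):
    print(number, entry)
```

With /W [1 1 0] this gives type 1 entries at byte offsets 170, 15, 107 and 58 for objects 1 to 4, after the free entry for object 0.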
    

    It does not matter in which order the objects are numbered; commonly the metadata would be the 3rd object if an Info dictionary were at the first location. Acrobat will normally "fix" a PDF by adding a duplicate Info dictionary first, with entries selected from the metadata. However, here, for "minimal acceptable" to Acrobat conforming readers, there is no /Info entry.

    Note the standard says there does not "need" to be a metadata stream, so that can be deleted to give a smaller example.

    The following file is not strictly the minimum acceptable because it contains a page content stream (Contents in the page object) and a metadata stream. These objects were included to make this file useful as a starting point for creating other, more realistic PDF files.

    Usually the /Type is found as the last dictionary entry, while we may logically expect or prefer it first. Comments related to altering your version are at the end.

    %PDF-2.0
    %ÞЃ²
    1 0 obj
    <</Type/Catalog/Pages 2 0 R/Metadata 5 0 R>>
    endobj
    2 0 obj
    <</Type/Pages/Count 1/Kids[3 0 R]>>
    endobj
    3 0 obj
    <</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R/Resources<<>>>>
    endobj
    4 0 obj
    <</Length 0>>
    stream
    endstream
    endobj
    5 0 obj
    <</Type/Metadata/Subtype/XML/Length 1059>>
    stream
    <?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
    <x:xmpmeta xmlns:x="adobe:ns:meta/">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
    <pdf:Producer>name</pdf:Producer>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmp="http://ns.adobe.com/xap/1.0/">
    <xmp:CreatorTool>name</xmp:CreatorTool>
    <xmp:CreateDate>2012-12-25T12:34:56Z</xmp:CreateDate>
    <xmp:ModifyDate>2012-12-25T12:34:56Z</xmp:ModifyDate>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:format>application/pdf</dc:format>
    <dc:title><rdf:Alt>
    <rdf:li xml:lang="x-default">title</rdf:li>
    </rdf:Alt></dc:title>
    <dc:creator><rdf:Seq>
    <rdf:li>author</rdf:li>
    </rdf:Seq></dc:creator>
    </rdf:Description>
    <rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
    <xmpMM:DocumentID>GUID of document</xmpMM:DocumentID>
    <xmpMM:InstanceID>GUID save change</xmpMM:InstanceID>
    </rdf:Description>
    </rdf:RDF>
    </x:xmpmeta>
     
    <?xpacket end="w"?>
    endstream
    endobj
    xref
    0 6
    0000000000 65535 f 
    0000000015 00000 n 
    0000000075 00000 n 
    0000000126 00000 n 
    0000000220 00000 n 
    0000000266 00000 n 
    trailer
    <</Size 6/Root 1 0 R>>
    startxref
    1401
    %%EOF
    

    Alterations

    Pages should include a /Count, even if there is only a single page (index 0).
    <</Type/Pages/Count 1/Kids[3 0 R]>>

    A page should reference some contents (even if we declare them empty).
    <</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R/Resources<<>>>>

    A minimal page content stream can be as small as:

    # # obj
    <</Length 0>>
    stream
    endstream
    endobj
    

    Others have suggested an altered "xref" structure. However, although the standard hints at the type you want (an expanded, human-readable cross-reference stream), I have yet to see that form accepted by Adobe Acrobat. I have no working examples other than /FlateDecode-encoded ones.

    In the standard this is "hinted" as:

    /Filter /ASCIIHexDecode %For readability only
    data has been encoded in hexadecimal representation for readability; in actual practice, a lossless decompression filter such as FlateDecode can be used.

    Adobe added an explanation in their 1.7 Reference, in the same Appendix H, which still seems to be current policy.

    "3.4.7, “Cross-Reference Streams” (Cross-Reference Stream Dictionary) 20. FlateDecode is the only filter supported by Acrobat 6.0 and later viewers for cross-reference streams. These viewers also support unencoded cross-reference streams."

    So a file should commonly use either fully Flate-compressed or fully unencoded cross-reference data and never mix the two, especially when there are edits (incremental additions, alterations, etc.).

    I have used the smaller example (in this case the 2.0 H2 example) in its uncompressed text form above.
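As a rough illustration of the FlateDecode route, the Python sketch below compresses the six xref rows from the question with the standard `zlib` module and derives the /Length value from the compressed size. The helper name `flate_xref` and the dictionary fragment it prints are illustrative only; a real XRef dictionary also needs /Type, /Size, /W, /Root and /ID.

```python
import zlib


def flate_xref(raw):
    """Return the Flate-compressed data and a matching dictionary fragment."""
    compressed = zlib.compress(raw)
    # /Length must count the bytes between "stream" and "endstream",
    # that is, the compressed size and not the size of the raw table.
    fragment = f"/Filter/FlateDecode/Length {len(compressed)}"
    return compressed, fragment


# The six rows from the question, W [1 2 1], as raw bytes.
raw = bytes.fromhex(
    "00000000" "01000F00" "01004F00" "010BE000" "010C1800" "010C6D00"
)
compressed, fragment = flate_xref(raw)
print(len(raw), fragment)
```

For a table this small the compressed form may well be no shorter than the raw bytes, so the point here is acceptance by Acrobat rather than size.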

    [Later EDIT]

    As @mkl has pointed out, your XRef stream can drop the ASCII /Filter and use a pure binary stream instead, and all readers including Adobe Acrobat viewers will accept that as an equivalent working format. In effect it meets the "unencoded cross-reference streams" statement.

    So replace

    5 0 obj
    <</Type /XRef
    /Length 66
    /Index [0 6]
    /Filter /ASCIIHexDecode
    /Size 6
    /W [1 2 1]
    /Root 1 0 R
    /ID [<884dfb9a4ffe1d4accf3d4454478960f><884dfb9a4ffe1d4accf3d4454478960f>]
    >>
    stream
    00 0000 00
    01 000F 00
    01 004F 00
    01 0BE0 00
    01 0C18 00
    01 0C6D 00
    endstream
    endobj
    startxref
    3181
    %%EOF
    

    with

    5 0 obj
    <</Type /XRef
    /Length 24
    /Index [0 6]
    /Size 6
    /W [1 2 1]
    /Root 1 0 R
    /ID [<884dfb9a4ffe1d4accf3d4454478960f><884dfb9a4ffe1d4accf3d4454478960f>]
    >>
    stream
           O à  m 
    endstream
    endobj
    startxref
    3181
    %%EOF
    

    Here the stream, which includes null bytes, is more compact than the hex text rows. Shown in hex, its 24 bytes are:
    0000000001000F0001004F00010BE000010C1800010C6D00
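The 24 bytes can be produced mechanically from the table rows. The following Python sketch, with a made-up helper name `pack_xref_rows`, packs each field big-endian into the width given by /W [1 2 1] and prints the same hex string.

```python
def pack_xref_rows(rows, widths):
    """Pack (type, offset, generation) rows into raw xref stream bytes."""
    out = bytearray()
    for row in rows:
        for value, width in zip(row, widths):
            out += value.to_bytes(width, "big")   # fields are big-endian
    return bytes(out)


rows = [
    (0, 0x0000, 0),   # object 0: free entry
    (1, 0x000F, 0),
    (1, 0x004F, 0),
    (1, 0x0BE0, 0),
    (1, 0x0C18, 0),
    (1, 0x0C6D, 0),   # object 5: the XRef stream itself
]
data = pack_xref_rows(rows, (1, 2, 1))
print(len(data), data.hex().upper())
# 24 0000000001000F0001004F00010BE000010C1800010C6D00
```

The data must be written to the file as bytes, since any text-mode or encoding conversion would corrupt values such as 0xE0.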

    However, for pure ANSI text editing that would be an unworkable method.

    Most editors that manage to open it would simply replace that section with a classic ASCII xref table, for example replacing 6 with 7:

    6 0 obj
    <<>>
    endobj
    7 0 obj
    <</Creator (\(FlexiPDF\))/ICNAppName (FlexiPDF)/ICNAppPlatform (Win)/ICNAppVersion (3.0.7)/ModDate (D:20240305155621)>>
    endobj
    xref
    0 8
    0000000005 65535 f 
    0000000009 00000 n 
    0000000223 00000 n 
    0000003183 00000 n
    .... etc.
    

    or convert it to a Flate-compressed stream:

    6 0 obj
    <</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode
    /ID [<884DFB9A4FFE1D4ACCF3D4454478960F><884DFB9A4FFE1D4ACCF3D4454478960F>]
    /Length 41/Root 1 0 R/Size 7/Type/XRef/W [1 2 1]>>
    stream
    xÚcb``øÏÄÈÀÏÈÄÀàÈ dma`úÏÝd}­`dúÏý Y:{
    endstream
    endobj
    startxref
    3167
    %%EOF
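
The /Predictor 12 with /Columns 4 in the dictionary above means each 4-byte row is stored as a PNG "Up" delta against the previous row before Flate compression. A small Python sketch of that idea, with hypothetical helper names and handling only the Up filter:

```python
import zlib


def png_up_encode(data, columns):
    """Prefix each row with filter type 2 and store deltas to the row above."""
    prev = bytes(columns)
    out = bytearray()
    for start in range(0, len(data), columns):
        row = data[start:start + columns]
        out.append(2)                      # PNG filter type 2 = Up
        out += bytes((cur - up) & 0xFF for cur, up in zip(row, prev))
        prev = row
    return bytes(out)


def png_up_decode(data, columns):
    """Undo png_up_encode; other PNG filter types are not handled here."""
    prev = bytes(columns)
    out = bytearray()
    stride = columns + 1
    for start in range(0, len(data), stride):
        if data[start] != 2:
            raise ValueError("only the Up filter is handled in this sketch")
        deltas = data[start + 1:start + stride]
        row = bytes((delta + up) & 0xFF for delta, up in zip(deltas, prev))
        out += row
        prev = row
    return bytes(out)


raw = bytes.fromhex("0000000001000F0001004F00010BE000010C1800010C6D00")
predicted = png_up_encode(raw, 4)
stream = zlib.compress(predicted)          # what goes between stream/endstream
restored = png_up_decode(zlib.decompress(stream), 4)
print(restored == raw)
```

Because neighbouring xref rows differ in only a byte or two, the deltas are mostly zeros, which is what makes the predictor worthwhile on larger tables.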
    

    ISO-compliant PDF readers (apart from Acrobat) will also consider this combination as meeting the standard, so it is easier to use an uncompressed stream with ISO 2.0 compliant readers, but NOT with Acrobat!

    /Filter/ASCII85Decode>>
    stream
    z!<<W1!<>mq!=Rfc!=TeF!=WfF
    endstream
    endobj
    startxref
    3181
    %%EOF
    

    The above ASCII85 string (nominally about 125% of the binary size, with a zero group that would take 5 characters shown as one `z`) is acceptable to most readers (apart from Acrobat DC). Even the Acrobat-powered plug-in within Edge will accept it!

    Here Acrobat Reader in Edge (IE mode) refuses to open the file. The same file in the same Edge tab, simply switched from IE mode so that it uses the lighter "Powered by Adobe Acrobat" plug-in, works.
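To check that the ASCII85 text really carries the same table, it can be decoded with Python's standard `base64.a85decode`. This is only a verification sketch. Note that the listing above has no `~>` end-of-data marker, so the string is decoded without Adobe framing.

```python
import base64

encoded = b"z!<<W1!<>mq!=Rfc!=TeF!=WfF"
decoded = base64.a85decode(encoded)   # "z" expands to four zero bytes
print(len(decoded), decoded.hex().upper())
# 24 0000000001000F0001004F00010BE000010C1800010C6D00
```

The 24 decoded bytes are the same six rows of width 1 + 2 + 1 that the binary and ASCIIHex variants contain.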