
PDF Cross Reference Streams


I'm developing a PDF parser/writer, but I'm stuck at generating cross reference streams. My program reads this file and then removes its linearization, and decompresses all objects in object streams. Finally it builds the PDF file and saves it.

This works really well when I use the normal cross reference & trailer, as you can see in this file.

When I try to generate a cross reference stream object instead (which results in this file), Adobe Reader can't view it.

Does anyone have experience with PDFs and can help me find out what the problem is?

Note that the cross reference is the ONLY difference between file 2 and file 3. The first 34127 bytes are the same.

If someone needs the content of the decoded reference stream, download this file and open it in a hex editor. I've checked this reference table again and again but I could not find anything wrong. The dictionary seems to be OK, too.

Thanks so much for your help!!!

Update

I've now completely solved the problem. You can find the new PDF here.


Solution

  • Two problems I see (without looking at the stream data itself).

    1. "Size integer (Required) The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary."

      your size should be... 14.

    2. "Index array (Optional) An array containing a pair of integers for each subsection in this section. The first integer shall be the first object number in the subsection; the second integer shall be the number of entries in the subsection. The array shall be sorted in ascending order by object number. Subsections cannot overlap; an object number may have at most one entry in a section. Default value: [0 Size]."

      Your index should probably skip around a bit. You have no objects 2-4 or 7. The index array needs to reflect that.

    3. Your data Ain't Right either (and I just learned how to read an xref stream. Yay me.)

    00 00 00  
    01 00 0a  
    01 00 47  
    01 01 01  
    01 01 70  
    01 02 fd  
    01 76 f1  
    01 84 6b  
    01 84 a1  
    01 85 4f
    

    Because you have no /Index, this data is interpreted as object numbers 0 through 9, with the following offsets:

    0 is unused.  Fine.  
    1 is at 0x0a.  Yep, sure is  
    2 is at 0x47.  Nope.  That lands near the beginning of "1 0"'s stream. This probably isn't a coincidence.  
    3 is at 0x101.  Nope.  0x101 is still within "1 0"'s stream.  
    4 is at 0x170.  Ditto  
    5 is at 0x2fd.  Ditto  
    6 is at 0x76f1. Nope, and this time buried inside that image's stream.
    

    I think you get the idea. So even if you had a correct /Index, your offsets are all wrong (and completely different from what's in resultNormal.pdf, even allowing for dec-hex confusion).
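
If you want to repeat this check yourself, the rows can be split by the /W widths and numbered from /Index. Here is a minimal sketch in Python, assuming the stream has already been inflated and the predictor undone. `parse_xref_rows` is a hypothetical helper name, and the defaults for zero-width fields (type 1, everything else 0) are my reading of the spec, so double-check them.

```python
def parse_xref_rows(data, widths, index):
    """Split decoded xref-stream bytes into (obj_num, type, field2, field3) tuples."""
    row_len = sum(widths)
    # Assumed defaults for zero-width (absent) fields: type 1, other fields 0.
    defaults = (1, 0, 0)
    entries = []
    pos = 0
    # /Index is a flat array of (first object number, entry count) pairs.
    for first, count in zip(index[0::2], index[1::2]):
        for obj_num in range(first, first + count):
            row = data[pos:pos + row_len]
            if len(row) < row_len:
                raise ValueError("stream data is shorter than /Index implies")
            fields = []
            start = 0
            for width, default in zip(widths, defaults):
                if width == 0:
                    fields.append(default)
                else:
                    fields.append(int.from_bytes(row[start:start + width], "big"))
                start += width
            entries.append((obj_num, *fields))
            pos += row_len
    return entries


# The ten rows quoted above, read with /W [1 2 0] and the default /Index [0 Size].
rows = bytes.fromhex(
    "000000 01000a 010047 010101 010170 "
    "0102fd 0176f1 01846b 0184a1 01854f"
)
entries = parse_xref_rows(rows, widths=[1, 2, 0], index=[0, 10])
for obj_num, entry_type, offset, gen in entries:
    print(f"object {obj_num}: type {entry_type}, offset {offset} (0x{offset:x})")
```

Comparing each printed offset against the byte position of the matching "N 0 obj" in the file is a quick way to spot entries that point into the middle of a stream.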

    What you want can be found in resultNormal's xref:

    xref  
    0 2  
    0000000000 65535 f  
    0000000010 00000 n  
    5 2  
    0000003460 00000 n  
    0000003514 00000 n  
    8 5  
    0000003688 00000 n  
    0000003749 00000 n  
    0000003935 00000 n  
    0000004046 00000 n  
    0000004443 00000 n  
    

    So your index should be (if I'm reading this right): /Index [0 2 5 2 8 5]. And the data:

    0 0 0  
    1 0 a  
    1 3460 (that's decimal)  
    1 3514 (ditto)  
    1 3688  
    etc
    

    Interestingly, the PDF spec says that the size must be BOTH the number of entries in this and all previous XRefs AND the number one higher than the highest object number in use.

    I don't think the latter part is ever enforced, but I wouldn't be surprised to find that xref streams are stricter than the normal cross reference tables. Might be the same code handling both, might not.
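
Going the other way, the /Index array and the row data can be derived from the same object offsets that go into the classic table. A rough sketch in Python, assuming /W [1 2 0] as in your dictionary and only in-use (type 1) objects plus the free object 0. `build_xref_parts` is a made-up name, and the offsets are the ones from resultNormal's xref above.

```python
def build_xref_parts(offsets):
    """Return (size, index, data) for /W [1 2 0] from {object number: byte offset}."""
    # Object 0 is the head of the free list: type 0, next free object 0.
    entries = {0: (0, 0)}
    for num, offset in offsets.items():
        entries[num] = (1, offset)  # type 1 = in use, field 2 = byte offset

    nums = sorted(entries)
    index = []
    data = bytearray()
    run_start = prev = None
    for num in nums:
        # Start a new subsection whenever the object numbers stop being consecutive.
        if prev is None or num != prev + 1:
            if run_start is not None:
                index += [run_start, prev - run_start + 1]
            run_start = num
        prev = num
        entry_type, field2 = entries[num]
        # One type byte, then a two-byte big-endian offset; the third field has width 0.
        # to_bytes raises OverflowError if an offset does not fit in two bytes.
        data += entry_type.to_bytes(1, "big") + field2.to_bytes(2, "big")
    index += [run_start, prev - run_start + 1]

    size = nums[-1] + 1  # one greater than the highest object number
    return size, index, bytes(data)


# Offsets copied from resultNormal's classic xref table.
offsets = {1: 10, 5: 3460, 6: 3514, 8: 3688, 9: 3749,
           10: 3935, 11: 4046, 12: 4443}
size, index, data = build_xref_parts(offsets)
print(size, index, data.hex(" ", 3))
```

This leaves out the xref stream object itself (13 0 in your file). Once the writer knows where that object starts, adding its offset to the dictionary extends the last subsection by one and raises the size to 14.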


    @mtraut:

    Here's what I see:

    13 0 obj <</Size 10/Length 44/Filter /FlateDecode/DecodeParms <</Columns 3/Predictor 12>>/W [1 2 0]/Type /XRef/Root 8 0 R>>
    stream  
    ...  
    endstream  
    endobj