I am generating PDF files on-the-fly. The files contain JPEG images in the Adobe RGB (1998) colourspace, with the profile embedded. The PDF generation toolkit embeds the images correctly, but sets the /ColorSpace image object stream metadata to /DeviceRGB, with no way of changing it. When the PDF is printed, the image colours are incorrect, probably because they are not interpreted in the Adobe RGB colourspace.
Example of the original PDF object structure:
obj
<<
/Type /XObject
/Subtype /Image
/Width 2000
/Height 2000
/BitsPerComponent 8
/Interpolate true
/Filter /DCTDecode
/ColorSpace /DeviceRGB
/Length 174780
>>
stream (jpeg data)
endstream
endobj
Therefore I am trying to alter the PDF after the fact, to change the /ColorSpace key to use the Adobe RGB ICC profile. Using the code below, the object structure becomes as follows, which looks correct against other PDFs I have seen, but results in a corrupted PDF. Where have I gone wrong?
obj
<<
/Type /XObject
/Subtype /Image
/Width 2000
/Height 2000
/BitsPerComponent 8
/Interpolate true
/Filter /DCTDecode
/ColorSpace [ /ICCBased <<
/N 3
/Filter /FlateDecode
/Length 284
>> stream (icc data)
endstream ]
/Length 174780
>>
stream (jpeg data)
endstream
endobj
This is the pypdf code which loads original.pdf, locates every image, replaces /ColorSpace /DeviceRGB with /ColorSpace /ICCBased, and writes out to edited.pdf.
from pathlib import Path

from pypdf import PdfWriter
from pypdf.generic import NameObject, ArrayObject, StreamObject

writer = PdfWriter(clone_from="original.pdf")

icc_stream = StreamObject()
icc_stream.set_data(Path("AdobeRGB1998.icc").read_bytes())

colorspace = ArrayObject([
    NameObject("/ICCBased"),
    icc_stream.flate_encode()
])

for page in writer.pages:
    for image in page.images:
        image.indirect_reference.get_object()[NameObject("/ColorSpace")] = colorspace

with open("edited.pdf", "wb") as fp:
    writer.write(fp)
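The problem was a rookie error with the PDF format. A stream must always be an indirect object, so it can never appear inline inside an array or dictionary. I was embedding the ICC profile stream directly within the image object's dictionary: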
10 0 obj
<<
/Type /XObject
/Subtype /Image
/Width 2000
/Height 2000
/BitsPerComponent 8
/Interpolate true
/Filter /DCTDecode
/ColorSpace [ /ICCBased
<<
/N 3
/Filter /FlateDecode
/Length 284
>> stream (icc data)
endstream
]
/Length 174780
>>
stream (jpeg data)
endstream
endobj
when I should have been using an indirect reference to the ICC data instead:
10 0 obj
<<
/Type /XObject
/Subtype /Image
/Width 2000
/Height 2000
/BitsPerComponent 8
/Interpolate true
/Filter /DCTDecode
/ColorSpace [ /ICCBased 20 0 R ]
/Length 174780
>>
stream (jpeg data)
endstream
endobj
20 0 obj
<<
/N 3
/Filter /FlateDecode
/Length 284
>>
stream (icc data)
endstream
endobj
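To make the layout above concrete, here is a minimal sketch, using only the standard library, of how the ICC profile ends up as its own numbered object while the image refers to it by number. The helper names `serialize_icc_object` and `icc_colorspace_entry` are hypothetical, the profile bytes are a placeholder, and this is not a complete PDF writer (no cross-reference table is produced); it only illustrates the object syntax.

```python
import zlib


def serialize_icc_object(obj_num: int, icc_data: bytes, components: int = 3) -> bytes:
    """Serialise an ICC profile as a standalone, Flate-compressed stream object."""
    compressed = zlib.compress(icc_data)
    header = (
        f"{obj_num} 0 obj\n"
        f"<<\n/N {components}\n/Filter /FlateDecode\n/Length {len(compressed)}\n>>\n"
        "stream\n"
    ).encode("ascii")
    return header + compressed + b"\nendstream\nendobj\n"


def icc_colorspace_entry(obj_num: int) -> bytes:
    """Return a /ColorSpace entry that points at the ICC object by reference."""
    return f"/ColorSpace [ /ICCBased {obj_num} 0 R ]".encode("ascii")


# Placeholder bytes stand in for the real AdobeRGB1998.icc contents (assumption).
fake_profile = b"\x00" * 560
icc_object = serialize_icc_object(20, fake_profile)
entry = icc_colorspace_entry(20)
```

The image dictionary only ever contains the short `20 0 R` reference; the stream data itself lives in the separate object.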
The corrected version of the pypdf code from the question would be:
from pathlib import Path

from pypdf import PdfWriter
from pypdf.generic import ArrayObject, NameObject, NumberObject, StreamObject

writer = PdfWriter(clone_from="original.pdf")

# Build the ICC profile stream; /N is the number of colour components (3 for RGB).
icc_stream = StreamObject()
icc_stream.set_data(Path("AdobeRGB1998.icc").read_bytes())
icc_stream[NameObject("/N")] = NumberObject(3)

# Register the compressed stream as its own indirect object.
# Note: _add_object is a private pypdf method; it returns an indirect reference.
icc_ref = writer._add_object(icc_stream.flate_encode())

# Point every image at the shared ICC profile by reference.
for page in writer.pages:
    for image in page.images:
        obj = image.indirect_reference.get_object()
        obj[NameObject("/ColorSpace")] = ArrayObject(
            [NameObject("/ICCBased"), icc_ref]
        )

with open("edited.pdf", "wb") as fp:
    writer.write(fp)
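The `/N 3` entry has to agree with the number of colour components in the profile itself. As a rough sanity check, the sketch below reads the data colour space signature from the ICC header. It assumes the standard ICC header layout (128 bytes, colour space signature at bytes 16 to 19, the `acsp` file signature at bytes 36 to 39); `icc_component_count` is a hypothetical helper, not part of pypdf, and it only covers the three component counts that /ICCBased allows.

```python
# Data colour space signatures mapped to the /N values permitted for /ICCBased.
_COMPONENTS = {b"GRAY": 1, b"RGB ": 3, b"CMYK": 4}


def icc_component_count(profile: bytes) -> int:
    """Return the number of colour components declared in an ICC profile header."""
    if len(profile) < 128 or profile[36:40] != b"acsp":
        raise ValueError("data does not look like an ICC profile")
    signature = profile[16:20]
    if signature not in _COMPONENTS:
        raise ValueError(f"unsupported colour space signature: {signature!r}")
    return _COMPONENTS[signature]


# Synthetic header used purely for demonstration (not a real profile).
demo_header = bytearray(128)
demo_header[16:20] = b"RGB "
demo_header[36:40] = b"acsp"
demo_components = icc_component_count(bytes(demo_header))
```

With a real profile, the result could be passed to `NumberObject(...)` in place of the hard-coded 3 when setting `/N`.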