I'm decoding PDF files using Python with reference to the 2008 spec: https://web.archive.org/web/20081203002256/https://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf particularly section 7.4.4.4.
Images are usually embedded in PDF as byte streams, and each stream is associated with a dictionary with information about the stream. For example, often the stream is a compressed form of the original data; such details are described by the Filter entry in the dictionary.
When I've got a stream whose filter is FlateDecode, this means the data were compressed using deflate, and this is easily reversed with zlib.decompress. But... to improve compression the original data may be preprocessed by a filter, for example to difference adjacent bytes - when the data have a lot of similar values the result then compresses better. The preprocessing is identified by the Predictor entry in the dictionary.
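To make that concrete, here's a small sketch (with made-up sample data) showing both that FlateDecode is just deflate, reversed by zlib.decompress, and why differencing adjacent bytes first can help compression:

```python
import zlib

# FlateDecode is plain deflate: zlib.decompress reverses it.
# Differencing adjacent bytes (what a predictor does) turns
# slowly-varying data into highly repetitive data, which deflates better.
raw = bytes(range(256)) * 16                     # a smooth ramp, sample data
diffed = bytes([raw[0]]) + bytes(
    (raw[i] - raw[i - 1]) & 0xFF for i in range(1, len(raw))
)                                                # now mostly 0x01 bytes
plain = zlib.compress(raw)
pred = zlib.compress(diffed)
assert zlib.decompress(plain) == raw             # the FlateDecode reversal
assert len(pred) < len(plain)                    # differencing helped
```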
The Predictor value 15 means to use a PNG differencing algorithm; unfortunately the 2008 PDF document basically says "PNG prediction (on encoding, PNG optimum)". Yay.
Can someone explain to me (a) exactly which PNG filter algorithm this means (with a reference to its specification) and (b) ideally point me at a library which will reverse it? Lacking the latter I'd have to reverse it in pure Python, which will be slow - acceptably slow for my initial use case, and I guess I can write it as a C extension (much) later if my needs become more frequent.
Where I am at present is:

- a decoded bytes object, which is raw pixel data
- a Predictor value, 15 in my present example document

Currently my image property method looks like this:
    # assumes: from pprint import pprint; from PIL import Image
    @property
    def image(self):
        im = self._image
        if im is None:
            decoded_bs = self.decoded_payload
            print(".image: context_dict:")
            print(decoded_bs[:10])
            pprint(self.context_dict)
            decode_params = self.context_dict.get(b'DecodeParms', {})
            color_transform = decode_params.get(b'ColorTransform', 0)
            color_space = self.context_dict[b'ColorSpace']
            bits_per_component = decode_params.get(b'BitsPerComponent')
            if not bits_per_component:
                bits_per_component = {b'DeviceRGB': 8, b'DeviceGray': 8}[color_space]
            colors = decode_params.get(b'Colors')
            if not colors:
                colors = {b'DeviceRGB': 3, b'DeviceGray': 1}[color_space]
            mode_index = (color_space, bits_per_component, colors, color_transform)
            width = self.context_dict[b'Width']
            height = self.context_dict[b'Height']
            print("mode_index =", mode_index)
            PIL_mode = {
                (b'DeviceGray', 1, 1, 0): 'L',
                (b'DeviceGray', 8, 1, 0): 'L',
                (b'DeviceRGB', 8, 3, 0): 'RGB',
            }[mode_index]
            print(
                "Image.frombytes(%r,(%d,%d),%r)..."
                % (PIL_mode, (width, height), decoded_bs[:32])
            )
            im = Image.frombytes(PIL_mode, (width, height), decoded_bs)
            im.show()
            exit(1)
            self._image = im
        return im
This shows me the "edgy", skewed image because I'm decoding difference data as colour data and decoding the per-row filter-type bytes as pixel data, which shifts each subsequent row slightly.
The predictor used for each row is given by the first byte in each row, if the "Predictor" parameter is 10 or more. In that case, the value of that parameter has no further meaning. It doesn't matter that it's 15, other than the fact that 15 >= 10.
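That per-row dispatch can be sketched in pure Python. This is a hypothetical `png_unfilter` helper (not from any particular library), assuming 8-bit components and whole-byte pixels; `columns` and `colors` would come from the stream's DecodeParms:

```python
def png_unfilter(data: bytes, columns: int, colors: int = 3, bpc: int = 8) -> bytes:
    # Reverse the per-row PNG filters (None, Sub, Up, Average, Paeth)
    # applied before deflate when Predictor >= 10.
    bpp = max(1, (colors * bpc) // 8)           # bytes per complete pixel
    rowlen = (columns * colors * bpc + 7) // 8  # bytes per filtered row
    out = bytearray()
    prev = bytearray(rowlen)                    # row above; zeros for first row
    for pos in range(0, len(data), rowlen + 1):
        ftype = data[pos]                       # the per-row filter-type byte
        row = bytearray(data[pos + 1:pos + 1 + rowlen])
        for i in range(rowlen):
            a = row[i - bpp] if i >= bpp else 0   # left
            b = prev[i]                           # up
            c = prev[i - bpp] if i >= bpp else 0  # upper left
            if ftype == 0:                        # None
                x = row[i]
            elif ftype == 1:                      # Sub
                x = row[i] + a
            elif ftype == 2:                      # Up
                x = row[i] + b
            elif ftype == 3:                      # Average
                x = row[i] + (a + b) // 2
            elif ftype == 4:                      # Paeth
                p = a + b - c
                pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
                if pa <= pb and pa <= pc:
                    predicted = a
                elif pb <= pc:
                    predicted = b
                else:
                    predicted = c
                x = row[i] + predicted
            else:
                raise ValueError("unknown PNG filter type %d" % ftype)
            row[i] = x & 0xFF
        out.extend(row)
        prev = row
    return bytes(out)

# Two rows of 3 grey pixels: row 1 unfiltered, row 2 Up-filtered.
assert png_unfilter(bytes([0, 1, 2, 3, 2, 1, 1, 1]),
                    columns=3, colors=1) == bytes([1, 2, 3, 2, 3, 4])
```

This is quadratic-ish pure Python and will be slow on large images, but it matches the structure of the data: strip one tag byte per row, then undo that row's filter against the row above.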
You can find the filter types here: