Tags: pdf, jpeg, ghostscript, mupdf, poppler

Why does a JPEG embedded into a PDF render slightly differently than the original?


The PDF container allows one to embed a complete JPEG file (including header and all) into a document. But, even when the JPEG stored in a PDF is bit-by-bit identical to the original file, it will render slightly differently than the original JPEG does. I want to know why that is and how to make the JPEG in the PDF render precisely like the original JPEG file. Here is how to reproduce my findings:

Setup

Take this test image (md5sum: 5085774e481966b3359df0745c57daca)


curl https://i.sstatic.net/fRHo8.jpg > test.jpg

Then put it into a PDF container, for example with the tool img2pdf:

img2pdf --producer="" --nodate test.jpg > test.pdf

Alternatively, use this gzipped, base64-encoded copy of the PDF:

H4sICHImeFsAA291dC5wZGYAlVZ7PFT7Fp8xM8yMPMYrFDbK4wgzxgw5Ut6mkWEMyqMaY8cwZsxD
Ht2euKfI60QkeVPqFJEi1emirh6SmLx6EQp5HKXELWePHrqf7j93ffZn77W/67u+a63fP7+11tPJ
xZRgRsSuHXzZ3oklAHiAFxyOtbW1s8OC3BCJb/EdM/dkhoJCgAgBdMCcERcFAuaOTBGTwwtdphOX
6Y68aK4IIADmVHaIEAiwlOQFfUtc0lpOs/wxjSsCuSIhQPpSaAsYwmY68GKBADwEWJKgJ0iSL4BI
35qhg0JetIAFCrEAAElspQWHgyyRxKVE4gGyhGQH2Y/Fl2uTlmu7g9xQURhgaQ1FhSIByIzE8rGW
JDM8ZMCX8l/9//qwIrFLhZx4WC+J6tfUr/rkZX0HtkjoCQoceZFRPK6kf2voCHkcnsA7ismCGnMC
d7NZIN3VATB3YXNEoACCHBlOIIsXAkXdQHZomAgakYwHvrVKsFoP/XhHB4uWJqNEQqN9G/PbMZj7
sUMgKhm/PNViz2I/zBGGkJKY5I2ADCmNlBhaWholjUFjsRg0BoNdobACKysvi8HI4eTkFRQVcYrY
FUrKSorKkK8oEYFLMhFINJSoKIuRVfy/bfEmDIeG2cJsEXAcTAoHR+Dgiy0wLRhMCo6Cw2HfTVoG
LYVAojAwSdhMEQZHIOFwNByB+hKGQ0EYShono4tWUlYh2Hsx9VSLO/Tp/IP/gvir4TApBBz+g54M
GomShkNHAEUNcDA4UgqS+k6Q0JESNSVdZRU9gr2FqtdiH2wFAsJxCBxsI2zYJy4uiJlkW52VOvmn
QmwY3eTm6NO+Lmb6AN3wAyPAZqMPRWjUs9/LMbDDsHWa6nVpuwWFE5O54IM80dl+KEU0T3/UfsxX
rj+V0Yc9yXtfqfo2vwDjo1FwPDeXqpnTnb93QZhoXAM7bxqQmhtrv27kCpk0ZG2qd5DSUflipsZM
6/ShG35+Kw/0iuNE8pvNmjkF++jkU6liM4/2suYmg7A583ccq4zC3Nmkw/yOituzfzk2Ht9hkiys
u75P09Nq8ModUL9oi+/vU+Jazrqwl0nUh/QZN9OpT34njZkLOzBjt0zcoqhPxlUcj5p21XW79l2q
zz8oD8vCjUlXqjy86oJMbnQihFQiI1wO+wzNamJptSptlHtqKApj7DKdUW0yutoja1q8kD6/+rSh
Sp9tjOLmiCmD0vC56a6Wd5datDG0HvW3KWxjhw3OsaImwbNXJBI+KISfGLBzsLFugD3NWpcwu9U0
h3FMqWnQLhv8VHKdMCNgBfxqHaQdZtx6d9V8kcEOlvpH2cCLl+OPILRuCjSGzB58UD/e7auvXFYY
PqkjTKZsGg4lBogjfPK1GRrD6wwik0dmApmbUDrhO6s13CZa0MbevRbZE/58JhUEyjIi6AomNmgT
VviRwXr5M8XlEaGG+CGjWtro6f3qrbuU24r9tbPaGI9enAvU7vIcHWky6ZHj0/tTOOg7a/F9uGfI
5Huher/paEyfGF+1mt85Jf8f76rkem6/0rPO9lsed+3H8rZZrd+WMbkTE99nhanIrUtz39az96ry
js60A7C5suYzVxtzg0ykbrvmHXe8l599LYVX91FmG78sGhnvqX0stCo0fa5m1e6esMe264UvL9HU
muvjTmTmrhqLxwaLri0MTLCSd/zCs6DtqXvwyZO7kPqr9Ot3PIWgTnc5atPWhMySNErtze6ZyouW
lvKVD0Z9NySdj5G9UuBVnicYaHXIiC5133OuriOq4eidG0eLJ37h7abeKb1uZxNyRgs3vimtxEvI
ignvGYovnjg7L1gvalS7a3q+3x/D1/9L5WyI0eZRmq7cyPOzvXxhwR+gxh8GJVVXB2Oq8lZNzE81
nB18UVkVja3bXdBtNHpqeDx/vt9hXontenfgJL3zGnCh/9rGtff8/7HmBl9VeaJq7oBWWdcTxj81
XNHRafWLMJFW+9ihJwkhzGsldAeZqk+WTcyJ+i62+3up8af2v2UVXbTLLJssvJA9rP7wsdLhnBPz
R3r9PJ2J1W5r0jMqV6PpOrUXvOeDxFk5e/7M+FA8Mpo+O2CYaP60H5/YkCRaqXZfJzOSveWG22R6
vH+N9ci/dScxEQ2nGnwdng2euz6kFAjaexTS8lS5b5t9Sw8dIPpyCo8lbw9cY9zNKDfZy17TU8F5
nO9fQXa1jbM/cUucSz6dVXdwLi8h5cLoLpjJG1M/GrMIhap9cySnJo6Klalq761QxrldLnVgV1u/
IqdIuQWMqKnZdOLOvDG77fHcqDBYezq56MlJf5cu4wbxB//3g2HC7BWzKZtJzRa1YEH/65hNuCn/
+0l3u5l2DvdHCPtm9B6p3BtME125oOw1THUD4GJGPU9aU7edSd3XWebeS+3perQn09S7H9Xg1qHz
lLLS870xZ30HgalR1Y6oN3XMOToiliu3rjR6bReGDKNxt8f6kfDH7r9AMPgj+0LTzkw5pwBah2w0
9cc1mxMSnuFyPDoLtNdkGKXGtmqdjNptH/XZdSSVVHnZqNkbQ09gJhITbTEy5QG3Z5pb/ZCCWn23
aGSMT4AtOfipYIOV0SOXzveXt9B+L3Q1OiLflvlubsrzFEKhx4lsWzEr1/IxLnkLk8w5f85dz09w
1pxl4m4X6X7OVQ+khMGj7Ar3t8FqeuYMM3KSkq9UPTT7rKbtXDg0ZVHU8Krq4+dw03XOLo0ZYkOd
gyXdUupZcajBAZJYQyE7+ny5TUfL88NOmqq6SaXVujMen8LbAuFNi70/39GxAnAXFg9YYfHfDSCT
SEQSsEvuO0aANoWlCHcZI/4PzNriJ4xAtPoJsyATfsKIpB94IgGTzQEFWMnmQOHu4gGEr9sOjycC
LL743ux4ELBautiZAtHSGBZEArTQrXWmuWD/BrnwaYrgCQAA

Restore the PDF by pasting the base64 text into a file (here test.txt) and running:

$ base64 -d test.txt | gzip -cd > test.pdf

md5sum of the final PDF: 156994ee6590ef8421fad1325378906d

The part of the PDF that is probably crucial is the image object:

6 0 obj
<</BitsPerComponent 8 /ColorSpace /DeviceRGB /Filter /DCTDecode /Height
  60 /Length 1790 /Subtype /Image /Type /XObject /Width 60>>
stream

Opening the PDF in a viewer appears to show the original image, but there are tiny differences, which the following steps reveal. To test how the PDF is rendered, I use three different rendering engines, so that any difference found is systematic and not a quirk of one particular engine.

ghostscript

$ gs -dNOPAUSE -dBATCH -sDEVICE=png16m -r96 -sOutputFile=gs.png test.pdf

poppler

We use the pdftocairo tool from the poppler-utils package in Debian and derivatives.

$ pdftocairo -r 96 -png test.pdf poppler

mupdf

$ mutool draw -o mupdf.png -r 96 test.pdf

Evaluation

First observation: all three tools produce the same rendering of the PDF. We compare using ImageMagick, whose AE metric counts the pixels that differ.

$ compare -metric AE mupdf.png gs.png null:
0
$ compare -metric AE poppler-1.png mupdf.png null:
0
$ compare -metric AE gs.png poppler-1.png null:
0

Now we compare to the original input image:

$ compare -metric AE mupdf.png test.jpg null:
105
$ compare -metric AE poppler-1.png test.jpg null:
105
$ compare -metric AE gs.png test.jpg null:
105

Let's visualize the differences:

[Images: mupdf diff, poppler diff, gs diff]

One might think that some data was changed when test.jpg was embedded into the PDF container, so let's extract the JPEG from the PDF:

$ pdfimages -j test.pdf extracted

The data that pdfimages extracts from the PDF is byte-for-byte identical to the input image test.jpg:

$ cmp test.jpg extracted-000.jpg || echo different
$ md5sum extracted-000.jpg test.jpg
5085774e481966b3359df0745c57daca  extracted-000.jpg
5085774e481966b3359df0745c57daca  test.jpg
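
The same check can be made without pdfimages by pulling the image stream straight out of the PDF bytes. The following is only a minimal sketch: the function names extract_dct_streams and md5_hex are my own, and it assumes a simple PDF like the one above, where the image object is stored uncompressed with a direct /Length value (as in the "6 0 obj" dictionary shown earlier). It is not a general PDF parser.

```python
import hashlib
import re
import sys


def extract_dct_streams(pdf_bytes):
    """Return the raw bytes of every /DCTDecode stream found in pdf_bytes.

    Assumption: each image dictionary carries a direct integer /Length
    (not an indirect reference such as "/Length 12 0 R").
    """
    streams = []
    end_of_last = 0
    # Find every "stream" keyword that is not part of "endstream".
    for m in re.finditer(rb"(?<!end)stream\r?\n", pdf_bytes):
        if m.start() < end_of_last:
            continue  # this match lies inside stream data we already took
        obj_start = pdf_bytes.rfind(b" obj", 0, m.start())
        if obj_start < 0:
            continue
        header = pdf_bytes[obj_start:m.start()]
        if b"/DCTDecode" not in header:
            continue
        length = re.search(rb"/Length\s+(\d+)\b(?!\s+\d+\s+R)", header)
        if length is None:
            continue  # indirect or missing length: skipped in this sketch
        end_of_last = m.end() + int(length.group(1))
        streams.append(pdf_bytes[m.end():end_of_last])
    return streams


def md5_hex(data):
    """Hex MD5 digest, in the same format that md5sum prints."""
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python extract.py test.pdf test.jpg
    with open(sys.argv[1], "rb") as f:
        embedded = extract_dct_streams(f.read())
    with open(sys.argv[2], "rb") as f:
        original = f.read()
    for data in embedded:
        print(md5_hex(data), "identical" if data == original else "different")
```

If the embedded stream really is the untouched file, its digest should equal the md5sum of test.jpg printed above.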

Conclusion

So, evidently, the JPEG embedded in the PDF file is bit-for-bit the same as the original JPEG. Still, rendering the PDF produces a slightly different result than the input. What is more, three different PDF engines produce the same difference.

Why is that?

How can I make the PDF display exactly like the input JPEG?


Solution

  • At a guess, I'd say that ImageMagick is using a different JPEG decoder from the other three engines, since they agree with each other. I know that Ghostscript and MuPDF use jpeglib; I don't know about poppler.

    So what you are showing isn't that the PDF renderings are 'wrong', just that they are 'not the same as compare' (the ImageMagick tool?), which isn't (IMO) the same thing. You can have Ghostscript render the original JPEG too, by using viewjpeg.ps in the lib directory, and I'm fairly sure MuPDF can render a JPEG file directly as well. I'd bet that they render the JPEG the same way they render the PDF containing the JPEG.

    JPEG is a lossy format, and decoding it means inverting the Discrete Cosine Transform, a mathematical transformation that each library evaluates with finite precision. I'd bet this is simply due to rounding differences in the maths the libraries use when reconstructing the samples from the stored coefficients. I strongly suspect you cannot detect any difference by eye. Have you looked at the RGB values of the samples at the relevant locations in the images?

    I'd suggest you try using JPEG as direct input to MuPDF and Ghostscript (and poppler if it'll do it). My expectation is that the result will match the rendering of the PDF.

    In which case, who is the odd man out?
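
To follow up on the question about RGB values at the differing locations, something like the sketch below could list each differing pixel and the largest per-channel deviation. The helper names diff_pixels and max_channel_delta are hypothetical, and the script part assumes the Pillow library is installed for decoding; note that Pillow brings its own JPEG decoder, so its view of test.jpg is yet another data point and not a ground truth.

```python
import sys


def diff_pixels(rgb_a, rgb_b, width):
    """Return (x, y, pixel_a, pixel_b) for every pixel that differs.

    rgb_a and rgb_b are raw 8-bit RGB buffers (3 bytes per pixel,
    row-major) describing images of identical dimensions.
    """
    if len(rgb_a) != len(rgb_b):
        raise ValueError("images differ in size")
    diffs = []
    for i in range(0, len(rgb_a), 3):
        pa, pb = tuple(rgb_a[i:i + 3]), tuple(rgb_b[i:i + 3])
        if pa != pb:
            pixel = i // 3
            diffs.append((pixel % width, pixel // width, pa, pb))
    return diffs


def max_channel_delta(diffs):
    """Largest absolute per-channel difference among the differing pixels."""
    return max(
        (abs(a - b) for _, _, pa, pb in diffs for a, b in zip(pa, pb)),
        default=0,
    )


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python pixeldiff.py mupdf.png test.jpg
    from PIL import Image  # assumption: Pillow is available

    img_a = Image.open(sys.argv[1]).convert("RGB")
    img_b = Image.open(sys.argv[2]).convert("RGB")
    diffs = diff_pixels(img_a.tobytes(), img_b.tobytes(), img_a.size[0])
    for x, y, pa, pb in diffs:
        print(f"({x},{y}): {pa} vs {pb}")
    print(len(diffs), "pixels differ, max channel delta", max_channel_delta(diffs))
```

A small maximum delta across all reported pixels would be consistent with the rounding explanation; larger deviations would point to something else.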