Having ghostscript leave JBIG2 files alone


I'm using gs to remove some bad OCR from PDFs that are essentially images of book pages with invisible text layers. The page images in some of these are encoded as JBIG2. When I run them through gs, it re-encodes the images as CCITT fax, which usually isn't bad, but the results can be anywhere from 10 to 20 times bigger than the JBIG2 versions.

I was looking for a way either to have gs leave them alone (as PassThroughJPEGImages does for JPEGs) or to re-encode them as JBIG2 with MonoImageFilter, but I was unsuccessful. I didn't find any analogous passthrough option, and I got an error on setting the filter to JBIG2Encode. I assume from what I did find that the latter isn't a standard option, but requires Luratech libraries.

Can anyone confirm or - preferably - explain my mistake?


Solution

  • There's no current way to have Ghostscript pass JBIG2 images unchanged.

    The pdfwrite device doesn't accept JBIG2Encode as an encoding filter, so you can't use that.

    The result is that you can only use CCITTFaxEncode as the MonoImageFilter parameter.

    In general JBIG2 is little if any better than CCITTFax. The exception is text, where significant savings can be achieved by storing each recurring glyph shape once and reusing that segment wherever it appears (this is also the source of the JBIG2 symbol-substitution bug in Xerox scanners that hit the news in 2013). It sounds like your images are encoded that way, so yes, you are going to get larger images out.
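
To make the CCITTFaxEncode fallback concrete, here is a minimal sketch of a pdfwrite invocation that sets the monochrome filter explicitly. It is an illustration, not the asker's actual command: the file names and the helper `build_gs_command` are hypothetical, and `-dFILTERTEXT` is only an assumption about how the OCR text layer is being dropped. The helper just builds the argument list, so it can be inspected without Ghostscript installed.

```python
import subprocess


def build_gs_command(src, dst, gs="gs"):
    """Build an argv list for a pdfwrite run (hypothetical helper).

    Monochrome images are re-encoded as CCITT fax, since pdfwrite
    does not offer JBIG2Encode as a MonoImageFilter value.
    """
    return [
        gs,
        "-o", dst,                            # output file; implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",
        "-dFILTERTEXT",                       # assumption: drop all text, including the OCR layer
        "-dEncodeMonoImages=true",
        "-dMonoImageFilter=/CCITTFaxEncode",
        src,
    ]


def run_gs(src, dst):
    """Run Ghostscript on src and write dst; raises if gs exits non-zero."""
    subprocess.run(build_gs_command(src, dst), check=True)


if __name__ == "__main__":
    # Print the command only; "in.pdf" and "out.pdf" are placeholder names.
    print(" ".join(build_gs_command("in.pdf", "out.pdf")))
```

Comparing the size of the output with the original file is the quickest way to see how much the loss of JBIG2 symbol reuse costs for a given scan.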