pdf itext redaction

iText 7 pdfSweep and JPX encoded composite image (MRC compressed PDF)


I have an MRC-compressed PDF (the images are JPX encoded) that I cannot redact with iText 7 pdfSweep, because an ImageReadException is thrown.

Caused by: org.apache.commons.imaging.ImageReadException: Can't parse this format.
at org.apache.commons.imaging.Imaging.getImageParser(Imaging.java:731)
at org.apache.commons.imaging.Imaging.getImageInfo(Imaging.java:703)
at org.apache.commons.imaging.Imaging.getImageInfo(Imaging.java:637)
at com.itextpdf.pdfcleanup.PdfCleanUpFilter.processImage(PdfCleanUpFilter.java:343)
... 13 more

Do you know of any workaround or solution for this issue? An obvious workaround would be to replace the JP2 (JPX) images in the PDF with some other image format and perform the redaction on the modified PDF. In that case, however, the benefits of MRC compression are lost, and converting first and redacting afterwards would be slow overall.


Solution

  • (iText developer here)

    As you can see from the stack trace, iText uses org.apache.commons.imaging to handle the images. In the past we have had some problems with known bugs in this external library. A possible solution is to fork the org.apache.commons project, implement a fix, and submit a pull request.

    This way, everyone benefits, and the change would automatically be included in iText as well.

    Of course, should you be a paying customer, reporting this problem through the iText support board might prompt us to submit the pull request ourselves.

    As for a workaround, I think you've already suggested the appropriate idea.

    In more detail (steps 1 and 2):

    Using an IEventListener, you can obtain the underlying BufferedImage of a given image resource. You can then use ImageIO together with a ByteArrayOutputStream to re-encode the image as standard JPG or PNG. Finally, you can use iText to change the dictionary entry for that particular resource.
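
The re-encoding step could look roughly like the sketch below. It uses only the JDK (ImageIO and ByteArrayOutputStream) and deliberately leaves out the iText side: how the BufferedImage is obtained inside the IEventListener and how the resource dictionary is updated afterwards are not shown. The class name ImageReencoder and the method toPng are made up for this illustration, and the small in-memory image merely stands in for the decoded JPX image. Note also that, as far as I know, a stock JDK ships no JPEG 2000 reader for ImageIO, so decoding the original image may itself need an extra ImageIO plugin.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; not part of iText or pdfSweep.
final class ImageReencoder {

    /** Re-encodes an already decoded image as PNG and returns the encoded bytes. */
    static byte[] toPng(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // ImageIO.write returns false if no registered writer accepts the image.
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available for this image type");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the image decoded from the PDF (assumption for this demo).
        BufferedImage decoded = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        decoded.setRGB(1, 1, 0xFF0000);

        byte[] png = toPng(decoded);

        // Read the bytes back to confirm they form a valid PNG of the same size.
        BufferedImage roundTrip = ImageIO.read(new ByteArrayInputStream(png));
        System.out.println(roundTrip.getWidth() + "x" + roundTrip.getHeight());
    }
}
```

PNG is used here because it is lossless. Swapping in "jpg" would give smaller output for photographic layers but adds compression artifacts, and either way the size advantage of the original MRC/JPX encoding is lost, as the question already points out.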