python pdf watermark

How to find and remove watermarks in a PDF using Python?


I am currently using Python to remove watermarks from PDF files. For example, I have a file like this: [image]

The green shape in the center of the page is the watermark. I don't think it is stored in the PDF as text, because I can't find it by searching in the Edge browser (which can read PDF files). I also can't find it as an image: I extracted all images from the PDF using PyMuPDF, and the watermark (which should appear on every page) is not among them.

The code I used for extracting is like this:

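        # excerpt from a method; assumes `import fitz` (PyMuPDF) at module level
        document = fitz.open(self.input)
        for each_page in document:
            # one entry per image referenced by the page; item 0 is the image's xref
            image_list = each_page.getImageList()
            for image_info in image_list:
                pix = fitz.Pixmap(document, image_info[0])
                png = pix.tobytes()  # image data as PNG bytes (the default format)
                # watermark_image: PNG bytes of the watermark, prepared elsewhere
                if png == watermark_image:
                    document._deleteObject(image_info[0])
        document.save(out_filename)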

So how do I find and remove the watermark using Python libraries? How is a watermark like this stored inside a PDF?

Are there any other "magic" libraries that can do this task, other than PyMuPDF?


Solution

  • For anyone interested in the details, see the solution provided here. Removing the type of watermark used in this file works with PyMuPDF's low-level code interface. There is no direct, specialized high-level API for doing this.
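
To give an idea of what "low-level" can mean here, below is a minimal sketch, not the linked solution itself. It assumes the watermark is drawn directly in the page's content stream and wrapped in a marked-content section of the form /Artifact <<... /Watermark ...>> BDC ... EMC, which is one common way PDF editors store watermarks. That may not be how this particular file stores it, so dump the stream first and check. The names WATERMARK_RE, strip_watermark_artifacts and remove_watermarks are made up for this example; clean_contents, get_contents, xref_stream and update_stream are PyMuPDF calls.

```python
import re

# Assumption: the watermark is wrapped in a marked-content section such as
#   /Artifact <</Subtype /Watermark /Type /Pagination>> BDC ... EMC
# Marked-content sections nested inside the watermark are not handled.
WATERMARK_RE = re.compile(
    rb"/Artifact\s*<<[^>]*?/Watermark[^>]*?>>\s*BDC\b.*?\bEMC\b",
    re.DOTALL,
)


def strip_watermark_artifacts(stream: bytes) -> tuple[bytes, int]:
    """Return the stream without watermark sections, plus how many were cut."""
    return WATERMARK_RE.subn(b"", stream)


def remove_watermarks(in_path: str, out_path: str) -> int:
    """Strip watermark sections from every page; returns the number removed."""
    import fitz  # PyMuPDF, imported here so the helper above works without it

    removed = 0
    doc = fitz.open(in_path)
    for page in doc:
        page.clean_contents()          # concatenate the page's content streams
        xrefs = page.get_contents()    # xrefs of the page's content streams
        if not xrefs:                  # page has no content stream at all
            continue
        xref = xrefs[0]
        new_stream, n = strip_watermark_artifacts(doc.xref_stream(xref))
        if n:
            doc.update_stream(xref, new_stream)
            removed += n
    doc.save(out_path, garbage=3, deflate=True)
    return removed
```

Before removing anything, it helps to print doc.xref_stream(xref).decode(errors="replace") for one page and look at how the watermark is actually drawn. If it turns out to be a form XObject invoked with something like /Fm0 Do, or plain drawing operators without an /Artifact wrapper, the pattern above will match nothing and has to be adapted to what the stream contains.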