I am currently using Python to remove watermarks from PDF files. For example, I have a file like this:
The green shape in the center of the page is the watermark. I think it is not stored in the PDF as text, because I can't find that text by searching in the Edge browser (which can read PDF files). It also does not seem to be stored as an image: I extracted all images from the PDF using PyMuPDF, and the watermark (which should appear on every page) was not among them.
The code I used for the extraction looks like this:
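import fitz  # PyMuPDF

document = fitz.open(self.input)
for each_page in document:
    # every image referenced by this page; item[0] is the image's xref
    image_list = each_page.getImageList()
    for image_info in image_list:
        pix = fitz.Pixmap(document, image_info[0])
        png = pix.tobytes()  # image data as PNG bytes
        # delete the image object if it is byte-identical to the known watermark
        if png == watermark_image:
            document._deleteObject(image_info[0])
document.save(out_filename)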
So how do I find and remove the watermark using Python libraries? How is the watermark stored inside a PDF?
Are there any other "magic" libraries that can do this task, other than PyMuPDF?
For anyone interested in the details, see the solution provided here. The type of watermark used in this file can be removed with PyMuPDF's low-level code interface. There is no direct, specialized high-level API for doing this.
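To give an idea of what working at that low level can look like, here is a minimal sketch. It is not the linked solution. It assumes the watermark is drawn inside a marked-content artifact block of the form `/Artifact <<... /Watermark ...>> BDC ... EMC` in the page's content stream, which is one common way PDF tools add watermarks but may not match this particular file. The names `WATERMARK_BLOCK`, `strip_watermark_blocks` and `remove_watermarks` are made up for illustration, and the PyMuPDF calls assume a reasonably recent version with snake_case method names.

```python
import re

# Assumption: the watermark is wrapped in a marked-content artifact block, e.g.
#   /Artifact <</Subtype /Watermark /Type /Pagination>> BDC ... EMC
# Files that draw the watermark some other way will not match this pattern.
WATERMARK_BLOCK = re.compile(
    rb"/Artifact\s*<<[^>]*?/Watermark[^>]*?>>\s*BDC\b.*?\bEMC\b",
    re.DOTALL,
)


def strip_watermark_blocks(stream: bytes) -> bytes:
    """Return the content stream with watermark artifact blocks removed.

    Limitation: the non-greedy match stops at the first EMC, so marked
    content nested inside the watermark block is not handled.
    """
    return WATERMARK_BLOCK.sub(b"", stream)


def remove_watermarks(in_path: str, out_path: str) -> None:
    """Apply strip_watermark_blocks to every page's content streams."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    doc = fitz.open(in_path)
    for page in doc:
        page.clean_contents()  # consolidate and normalize the content streams
        for xref in page.get_contents():
            cleaned = strip_watermark_blocks(doc.xref_stream(xref))
            doc.update_stream(xref, cleaned)
    doc.save(out_path)
```

Before relying on something like this, it is worth printing `doc.xref_stream(xref)` for one page to see how the watermark is actually drawn in your file, and adjusting the pattern accordingly. Calling `remove_watermarks("input.pdf", "output.pdf")` (placeholder file names) would then write a cleaned copy.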