python pdf-reader pdf2image pdf-writer

PDF file produces a blank image


I am creating a PDF file without text from a PDF file with text, using the following program:

import PyPDF2

def remove_text_from_pdf(pdf_path_in, pdf_path_out):
    '''Removes the text from the PDF file and saves it as a new PDF file'''
    # Open the PDF file with the diagram and the text in binary read mode
    pdf_file = open(pdf_path_in, 'rb')

    # Create a PDF reader and writer object
    pdf_reader = PyPDF2.PdfReader(pdf_file)  # older version was in use
    pdf_writer = PyPDF2.PdfWriter(pdf_file)  # older version 'PdfFileWriter' was in use

    # Get the first page from the PDF reader
    page = pdf_reader.pages[0]

    # Add the page from the PDF reader to the PDF writer
    pdf_writer.add_page(page)

    # Remove the text from all pages added to the writer
    pdf_writer.remove_text()

    # Open the output PDF file in binary write mode
    out_file = open(pdf_path_out, 'wb')

    # Save the text-free PDF to the output file
    pdf_writer.write(out_file)

    return

I then convert the output to a PNG file using the following function:

import fitz  # PyMuPDF

def convert_pdf_to_png(pdf_path, png_path):
    '''Converts the first page of a PDF file to a PNG file'''
    doc = fitz.open(pdf_path)
    page = doc.load_page(0)  # first page (0-based index)
    pix = page.get_pixmap()  # render the page to a raster image
    pix.save(png_path)
    doc.close()

But it gives me a PNG file that is just a blank white page.

I was expecting a non-blank output file.


Solution

  • PyMuPDF lets you do all of the above in one go! No need to use additional packages.

    As an aside: You cannot convert a complete PDF with multiple pages into one PNG file. You must either create a new PDF with text-free pages or create multiple PNG images, one for each page (with text removed).

    Here is the code that removes all text from all pages and then saves the resulting PDF under a new name:

    import fitz
    doc = fitz.open(pdf_file)
    for page in doc:
        page.add_redact_annot(page.rect)
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # leave images untouched
    doc.save("notext-" + pdf_file, garbage=4, deflate=True)  # save under new name
    # DONE!
    

    If you instead want page images with no text, do this:

    doc = fitz.open(pdf_file)
    for page in doc:
        page.add_redact_annot(page.rect)
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # leave images untouched
        pix = page.get_pixmap()
        pix.save(f"page-{page.number}.png")
    # DONE!
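
One caveat about the save call in the first snippet: `"notext-" + pdf_file` only works when `pdf_file` is a bare filename. For a path such as `docs/diagram.pdf` it yields `notext-docs/diagram.pdf`, which points into a directory that most likely does not exist. Below is a minimal sketch of a safer way to build the output name, using only the standard library. The helper name `notext_path` is hypothetical and not part of PyMuPDF.

```python
from pathlib import Path


def notext_path(pdf_file):
    """Return the path of pdf_file with 'notext-' prefixed to the filename only."""
    path = Path(pdf_file)
    # with_name() replaces just the final path component, so the directory is kept
    return path.with_name("notext-" + path.name)


print(notext_path("diagram.pdf").as_posix())       # notext-diagram.pdf
print(notext_path("docs/diagram.pdf").as_posix())  # docs/notext-diagram.pdf
```

The result can then be passed to the save call, for example `doc.save(str(notext_path(pdf_file)), garbage=4, deflate=True)`.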