I am creating a PDF file without text from a PDF file with text, using the following program:
import PyPDF2

def remove_text_from_pdf(pdf_path_in, pdf_path_out):
    '''Removes the text from the PDF file and saves it as a new PDF file'''
    # Open the PDF file with the diagram and the text in binary read mode
    pdf_file = open(pdf_path_in, 'rb')
    # Create a PDF reader and writer object
    pdf_reader = PyPDF2.PdfReader(pdf_file)  # an older version was in use
    pdf_writer = PyPDF2.PdfWriter(pdf_file)  # older version 'PdfFileWriter' was in use
    # Get the first page from the PDF reader
    page = pdf_reader.pages[0]
    # Add the page from the PDF reader to the PDF writer
    pdf_writer.add_page(page)
    # Remove the text from all pages added to the writer
    pdf_writer.remove_text()
    # Open the output PDF file in binary write mode
    out_file = open(pdf_path_out, 'wb')
    # Save the text-free page to the output file
    pdf_writer.write(out_file)
    return
I am converting the output to a PNG file using the following function:
import fitz

def convert_pdf_to_png(pdf_path, png_path):
    '''Converts a PDF file to a PNG file'''
    # Set the image maximum pixels to be none so that it doesn't give a DOS attack error
    pdffile = pdf_path
    doc = fitz.open(pdffile)
    page = doc.load_page(0)  # page number (0-based)
    pix = page.get_pixmap()
    output = png_path
    pix.save(output)
    doc.close()
But it gives me a PNG file that is just a blank white page.
I was expecting a PDF file that is not blank.
PyMuPDF lets you do all of the above in one go! No need to use additional packages.
As an aside: you cannot convert a complete PDF with multiple pages into one PNG file. You must either create a new PDF with text-free pages, or create multiple PNG images, one for each page (with text removed).
Here is the code that removes all text from all pages and then saves the resulting PDF under a new name:
import fitz

# pdf_file is the path of the input PDF
doc = fitz.open(pdf_file)
for page in doc:
    page.add_redact_annot(page.rect)  # redaction covering the whole page
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # leave images untouched
doc.save("notext-" + pdf_file, garbage=4, deflate=True)  # save under new name
# DONE!
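One caveat with `"notext-" + pdf_file`: if `pdf_file` contains a directory (for example `docs/report.pdf`), the prefix lands in front of the directory instead of the file name. Here is a minimal sketch of a workaround using only the standard library; `notext_path` is a hypothetical helper name, not part of PyMuPDF.

```python
from pathlib import Path

def notext_path(pdf_file, prefix="notext-"):
    """Return pdf_file with prefix added to the file name only.

    Hypothetical helper for illustration; the directory part is kept as-is.
    """
    p = Path(pdf_file)
    return str(p.with_name(prefix + p.name))
```

You could then call `doc.save(notext_path(pdf_file), garbage=4, deflate=True)` in place of the string concatenation.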
If you instead want page images with no text, do this:
doc = fitz.open(pdf_file)
for page in doc:
    page.add_redact_annot(page.rect)  # redaction covering the whole page
    page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # leave images untouched
    pix = page.get_pixmap()
    pix.save(f"page-{page.number}.png")  # one PNG per page
# DONE!
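If you process several PDFs, the fixed `page-{n}.png` names will overwrite each other. The following is only a sketch of one way around that, assuming you want the PDF's base name in each image name; `page_png_path` and `pages_to_png_without_text` are hypothetical names, and the naming scheme is my own choice. It uses `fitz.PDF_REDACT_IMAGE_NONE`, the constant name given in the PyMuPDF documentation.

```python
from pathlib import Path

def page_png_path(pdf_path, page_number, out_dir="."):
    """Build the PNG path for one page, e.g. out_dir/report-page-0.png.

    The naming scheme is an assumption made for this example.
    """
    return Path(out_dir) / f"{Path(pdf_path).stem}-page-{page_number}.png"

def pages_to_png_without_text(pdf_path, out_dir="."):
    """Write one text-free PNG per page and return the list of paths."""
    import fitz  # PyMuPDF; imported here so the naming helper works without it

    written = []
    doc = fitz.open(pdf_path)
    for page in doc:
        page.add_redact_annot(page.rect)  # redaction covering the whole page
        page.apply_redactions(images=fitz.PDF_REDACT_IMAGE_NONE)  # keep images
        target = page_png_path(pdf_path, page.number, out_dir)
        page.get_pixmap().save(str(target))
        written.append(target)
    doc.close()
    return written
```

For a single page, as in your `convert_pdf_to_png`, you can simply take the first entry of the returned list.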