python-3.x pdf pdf-form

How to read the field names and filled-in values from a PDF form


I am writing a Python script that, as part of a larger workflow, needs to pull the data entered into a PDF form. I tried PyPDF3, but while it shows me the strings in the form, it does not show the filled-in data. I have a form in which I entered the value 'XXX' into a field, and I want the script to return both that value and the name of the field, but I can't seem to read the value. The fillpdfs module is very helpful, but AFAICT it returns only the field names, not the data. I have this snippet:

    from PyPDF3 import PdfFileReader

    # Open the PDF file; it must stay open while the reader is in use
    with open('filename.pdf', 'rb') as pdf_file:
        pdf_reader = PdfFileReader(pdf_file)

        # Extract the text of each page and report whether it contains the entered value
        for page_num in range(pdf_reader.numPages):
            page = pdf_reader.getPage(page_num)
            print(page_num, 'XXX' in page.extractText())

Solution

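  • The reader has a method for PDF forms, getFormTextFields(), which returns a dictionary mapping each text field's name to its filled-in value:

    form_fields = pdf_reader.getFormTextFields()  # dict: field name -> entered value
    print(form_fields)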
    

    See the documentation for getFormTextFields().
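
Note that getFormTextFields() only covers text fields. If the form also has checkboxes or dropdowns, a rough sketch along the following lines may help. It assumes that getFields() returns a dictionary keyed by field name (or None when the PDF has no form) and that each entry stores its value under the '/V' key. The helper names field_values and read_form are made up for illustration, and 'filename.pdf' is the placeholder from the question.

```python
def field_values(fields):
    """Map each field name to its filled-in value (None if the field is empty).

    `fields` is expected to look like the result of PdfFileReader.getFields():
    a dict of field name -> dict-like field object, or None if there is no form.
    """
    return {name: field.get('/V') for name, field in (fields or {}).items()}


def read_form(path):
    """Open a PDF and return {field name: value} for every form field."""
    # Imported here so field_values() can be used without PyPDF3 installed.
    from PyPDF3 import PdfFileReader

    with open(path, 'rb') as pdf_file:
        reader = PdfFileReader(pdf_file)
        # Read the fields while the file is still open.
        return field_values(reader.getFields())


if __name__ == '__main__':
    for name, value in read_form('filename.pdf').items():
        print(name, '=', value)
```

Checkbox values usually come back as PDF names such as '/Yes' or '/Off' rather than plain text, so they may need some mapping before use.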