
PyPDF2 PdfFileMerger losing PDF module (form) data in merged file


I am merging PDF files with PyPDF2, but when one of the files contains a PDF module (a form) filled with data (a typical application-filled PDF), the module in the merged file is empty: no data is shown.

Here are the two methods I am using to merge the PDFs:

from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter

# out_root_path and cf are module-level variables defined elsewhere in the script

def merge_pdf_files(pdf_files, i):
    pdf_merger = PdfFileMerger(strict=False)
    for pdf in pdf_files:
        pdf_merger.append(pdf)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    pdf_merger.write(output_filename)

def merge_pdf_files2(pdf_files, i):
    output = PdfFileWriter()
    for pdf in pdf_files:
        reader = PdfFileReader(pdf)
        for page in reader.pages:
            output.addPage(page)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    with open(output_filename, 'wb') as output_stream:
        output.write(output_stream)

I would expect the final, merged PDF to show all the data filled in the PDF module. Alternatively, can someone point me to another Python library that does not suffer from this apparent bug? Thanks

UPDATE: I also tried PyMuPDF, with the same results.

import fitz  # PyMuPDF

def merge_pdf_files4(pdf_files, i):
    output = fitz.open()
    for pdf in pdf_files:
        source_doc = fitz.open(pdf)
        output.insertPDF(source_doc)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    output.save(output_filename)

I also tried PyPDF4, with the same result as PyPDF2.

I also tried external tools, launched from the script with a command line:

subprocess.call(cmd, shell=True)
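
As a side note on the call itself, an external tool can also be launched without `shell=True` by passing the arguments as a list, which avoids quoting problems with file names that contain spaces. The sketch below only illustrates the call mechanics; it assumes pdftk's `cat ... output` syntax, and the helper names and file names are hypothetical (the actual `cmd` string is not shown here).

```python
import subprocess


def build_pdftk_merge_cmd(pdf_files, output_filename):
    """Return the argument list for merging pdf_files with pdftk (hypothetical helper)."""
    if not pdf_files:
        raise ValueError("at least one input PDF is required")
    # pdftk syntax: pdftk in1.pdf in2.pdf ... cat output merged.pdf
    return ["pdftk", *pdf_files, "cat", "output", output_filename]


def merge_with_pdftk(pdf_files, output_filename):
    """Run pdftk without a shell; raises CalledProcessError on a non-zero exit code."""
    cmd = build_pdftk_merge_cmd(pdf_files, output_filename)
    subprocess.run(cmd, check=True)
```

With `check=True`, a failing tool raises an exception instead of being silently ignored, which makes it easier to tell a failed merge from a merge that merely lost the form data.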

I tried pdftk at first, but it failed too. The only one that worked was the commercial version of PDFill, so $19 spent on the task... :( Too bad I couldn't find an open source, platform-independent solution.


Solution

  • I finally worked it out by myself, and I am sharing it here in the hope that it will be useful to others.

    It's been a tough task.

    In the end I settled on the pdfrw library (https://pypi.org/project/pdfrw/ and https://github.com/pmaupin/pdfrw), which gives a good PDF-DOM representation, very close to the PDF structure publicly documented in Adobe's official reference (https://www.adobe.com/devnet/pdf/pdf_reference.html).

    Using this library, PyCharm's object inspector and Adobe's documentation I could experiment with the output file's structure and found out that the simple 1-line-merge:

        from pdfrw import PdfReader, PdfWriter
    
        output = PdfWriter()
        input = PdfReader(pdf_filename)
        output.addpages(input.pages)
    

    would not add the AcroForm node to the output PDF file, hence losing all form fields.

    So I had to write my own code to merge, as best as I can, the AcroForm nodes of the various input files.

    I stress the "as best as I can" part, because the merge function I ended up with is far from perfect, but at least it works for me and can help others build on it if they need to.

    One important thing to do is to rename the form fields in order to avoid name conflicts between files, so I renamed them to FILE_{file_num}_FIELD_{field_num}_{original_name}.
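
    The renaming step can be sketched in isolation as follows. This is only an illustration: `rename_field` is a hypothetical helper, plain Python strings stand in for pdfrw string objects, and the pattern mirrors the `FILE_{n}_FIELD_{m}_{on}` format string used in the merge code.

```python
def rename_field(file_num, field_num, raw_name):
    """Build a unique field name from a raw "(name)" style string (hypothetical helper)."""
    # Strip the literal-string parentheses around the original name.
    original_name = raw_name.replace('(', '').replace(')', '')
    return 'FILE_{n}_FIELD_{m}_{on}'.format(n=file_num, m=field_num, on=original_name)


# Two files that both contain a field called "surname" no longer collide.
print(rename_field(0, 0, '(surname)'))  # FILE_0_FIELD_0_surname
print(rename_field(1, 0, '(surname)'))  # FILE_1_FIELD_0_surname
```

    Prefixing with both the file index and the field index keeps names unique even when the same file is merged more than once at different positions.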

    Then, not knowing exactly how to merge the CO, DA, DR and NeedAppearances nodes, I simply keep the nodes of the first source file that has them. If the same node is present in a subsequent file, I skip it.

    The one exception is fonts: I merge the contents of the Font subnode of each DR node, adding any font that is not already present.
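
    The "first one wins, except for fonts" policy can be sketched on plain dictionaries. This is a simplified illustration, not the pdfrw code itself: ordinary dicts with string keys stand in for the AcroForm nodes, the function name and sample values are made up, and the Fields entry is left out because fields are concatenated separately.

```python
def merge_acroform_entries(output_acroform, source_acroform):
    """Merge source into output in place: first value wins, except /DR /Font."""
    for key, value in source_acroform.items():
        if key == '/Fields':
            continue  # field lists are concatenated elsewhere
        # setdefault keeps an existing entry and only adds a missing one.
        output_acroform.setdefault(key, value)
    source_fonts = source_acroform.get('/DR', {}).get('/Font', {})
    if source_fonts:
        # /DR is guaranteed to exist in the output after the loop above.
        output_fonts = output_acroform['/DR'].setdefault('/Font', {})
        for font_key, font in source_fonts.items():
            output_fonts.setdefault(font_key, font)  # add new fonts only
    return output_acroform


first = {'/DA': 'da-first', '/DR': {'/Font': {'/Helv': 'font-1'}}}
second = {'/DA': 'da-second', '/NeedAppearances': 'true',
          '/DR': {'/Font': {'/Helv': 'font-2', '/Arial': 'font-3'}}}
merged = merge_acroform_entries(first, second)
print(merged['/DA'])           # da-first
print(merged['/DR']['/Font'])  # {'/Helv': 'font-1', '/Arial': 'font-3'}
```

    Keeping the first DA while merging fonts means later files may be rendered with the first file's default appearance, which is one reason this approach is only a best effort.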

    One last note: in my first attempt, all of the above manipulation was done on the output's trailer attribute. Then I found out that each time I added the pages from a new input file, pdfrw seemed to erase any AcroForm already present in the trailer. I don't know the reason, but I had to build a separate output_acroform variable and assign it to the output file on the line just before writing out the final PDF.

    In the end, here's my code. Forgive me if it's not pythonic, I just hope it clarifies the points above.

    from pdfrw import PdfReader, PdfWriter, PdfName


    def merge_pdf_files_pdfrw(pdf_files, output_filename):
        output = PdfWriter()
        output_acroform = None
        # Collect fields in a separate list so that no file's fields are added twice
        output_fields = []
        for num, pdf in enumerate(pdf_files):
            reader = PdfReader(pdf, verbose=False)
            output.addpages(reader.pages)
            if PdfName('AcroForm') not in reader[PdfName('Root')].keys():  # Not all PDFs have an AcroForm node
                continue
            source_acroform = reader[PdfName('Root')][PdfName('AcroForm')]
            if PdfName('Fields') in source_acroform:
                source_fields = list(source_acroform[PdfName('Fields')])
            else:
                source_fields = []
            for num2, form_field in enumerate(source_fields):
                key = PdfName('T')
                old_name = form_field[key].replace('(', '').replace(')', '')  # Field names are in the "(name)" format
                form_field[key] = 'FILE_{n}_FIELD_{m}_{on}'.format(n=num, m=num2, on=old_name)
            output_fields.extend(source_fields)
            if output_acroform is None:
                # Copy the first AcroForm node
                output_acroform = source_acroform
            else:
                for key in source_acroform.keys():
                    # Add AcroForm keys that the output does not have yet
                    if key not in output_acroform:
                        output_acroform[key] = source_acroform[key]
                # Add missing font entries from the /DR node of the source file
                if (PdfName('DR') in source_acroform.keys()) and (PdfName('Font') in source_acroform[PdfName('DR')].keys()):
                    source_fonts = source_acroform[PdfName('DR')][PdfName('Font')]
                    if PdfName('Font') not in output_acroform[PdfName('DR')].keys():
                        # The output has a /DR node without /Font: add the whole /Font node
                        output_acroform[PdfName('DR')][PdfName('Font')] = source_fonts
                    else:
                        # Otherwise add only the fonts that are not there yet
                        output_fonts = output_acroform[PdfName('DR')][PdfName('Font')]
                        for font_key in source_fonts.keys():
                            if font_key not in output_fonts:
                                output_fonts[font_key] = source_fonts[font_key]
        if output_acroform is not None:
            output_acroform[PdfName('Fields')] = output_fields
        # Assign the merged AcroForm just before writing (see the note above)
        output.trailer[PdfName('Root')][PdfName('AcroForm')] = output_acroform
        output.write(output_filename)
    

    Hope this helps.