I am merging PDF files with PyPDF2, but when one of the files contains a PDF form filled with data (a typical application form filled in by the user), the form in the merged file is empty: no data is shown.
Here are the two methods I am using to merge the PDFs:
from PyPDF2 import PdfFileMerger, PdfFileReader, PdfFileWriter

# out_root_path and cf are defined elsewhere in the script
def merge_pdf_files(pdf_files, i):
    pdf_merger = PdfFileMerger(strict=False)
    for pdf in pdf_files:
        pdf_merger.append(pdf)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    pdf_merger.write(output_filename)

def merge_pdf_files2(pdf_files, i):
    output = PdfFileWriter()
    for pdf in pdf_files:
        reader = PdfFileReader(pdf)
        # Copy every page of the input file into the output
        for page in reader.pages:
            output.addPage(page)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    with open(output_filename, 'wb') as output_stream:
        output.write(output_stream)
I would expect the final, merged PDF to show all the data filled into the PDF form. Alternatively, can someone point me to another Python library that does not suffer from this (apparent) bug? Thanks.
UPDATE: I also tried PyMuPDF, with the same results.
import fitz  # PyMuPDF

def merge_pdf_files4(pdf_files, i):
    output = fitz.open()  # new, empty PDF
    for pdf in pdf_files:
        source = fitz.open(pdf)
        output.insertPDF(source)
    output_filename = '{out_root}{prog}.{cf}.pdf'.format(out_root=out_root_path, prog=i+1, cf=cf)
    output.save(output_filename)
I also tried PyPDF4, with the same result as PyPDF2.
I also tried external tools, launched from the script with a command line:
subprocess.call(cmd, shell=True)
I tried pdftk first, but it failed too. The only one that worked was PDFill, commercial version, $19 spent on the task... :( Too bad I couldn't find an open-source, platform-independent solution.
Finally I worked it out by myself, and I am sharing it here in the hope that it will be useful to others.
It's been a tough task.
In the end I stuck with the pdfrw library (https://pypi.org/project/pdfrw/ and https://github.com/pmaupin/pdfrw), which gives a good PDF-DOM representation, very close to the PDF structure publicly documented in Adobe's official reference (https://www.adobe.com/devnet/pdf/pdf_reference.html).
Using this library, PyCharm's object inspector and Adobe's documentation, I could experiment with the output file's structure, and I found out that the simple merge:
from pdfrw import PdfReader, PdfWriter
output = PdfWriter()
input = PdfReader(pdf_filename)
output.addpages(input.pages)
would not add the AcroForm node to the output PDF file, hence losing all form fields.
So I had to write my own code to merge, as best as I can, the AcroForm nodes of the various input files.
I stress the "as best as I can" part, because the merge function I ended up with is far from perfect, but at least it works for me and can help others build on it if they need to.
One important thing to do is to rename the form fields in order to avoid name conflicts between files, so I renamed them to FILE_{file_num}_FIELD_{field_num}_{original_name}.
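The renaming step can be sketched in isolation. This is only a minimal illustration on plain Python strings, assuming the field name arrives in the raw "(name)" form; the helper name rename_field is hypothetical and is not part of pdfrw.

```python
def rename_field(raw_name, file_num, field_num):
    """Build a unique field name from a raw PDF literal string such as '(name)'.

    Hypothetical helper for illustration only; it is not part of pdfrw.
    """
    # Drop the parentheses that delimit a PDF literal string
    original_name = raw_name.replace('(', '').replace(')', '')
    return 'FILE_{n}_FIELD_{m}_{on}'.format(n=file_num, m=field_num, on=original_name)


# The same field name coming from two different files no longer clashes
print(rename_field('(surname)', 0, 0))  # FILE_0_FIELD_0_surname
print(rename_field('(surname)', 1, 0))  # FILE_1_FIELD_0_surname
```

Keeping the original name as a suffix makes it easy to trace a merged field back to its source form.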
Then, not knowing exactly how to merge the CO, DA, DR and NeedAppearances nodes, I simply add the nodes of the first source file that has them. If the same node is present in subsequent files, I skip it.
The only exception is fonts: I merge the contents of the Font subnode of the DR node.
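The "first file wins, except for fonts" rule can be sketched roughly as follows. Plain dicts stand in for pdfrw's dictionary objects here, and the function name merge_acroform_nodes and the sample values are made up for the illustration.

```python
def merge_acroform_nodes(output_acroform, source_acroform):
    """Merge source_acroform into output_acroform in place; existing keys win.

    Plain dicts stand in for PDF dictionaries. Illustration only.
    """
    # Copy only the top-level nodes (CO, DA, DR, NeedAppearances...) that are still missing
    for key, value in source_acroform.items():
        if key not in output_acroform:
            output_acroform[key] = value
    # Exception: fonts under DR are merged entry by entry
    source_fonts = source_acroform.get('/DR', {}).get('/Font', {})
    if source_fonts:
        output_fonts = output_acroform.setdefault('/DR', {}).setdefault('/Font', {})
        for font_key, font in source_fonts.items():
            output_fonts.setdefault(font_key, font)  # keep the font already present
    return output_acroform


first = {'/DA': '(/Helv 0 Tf 0 g)', '/DR': {'/Font': {'/Helv': 'font-1'}}}
second = {'/DA': '(/Cour 0 Tf 0 g)', '/NeedAppearances': True,
          '/DR': {'/Font': {'/Helv': 'font-2', '/Cour': 'font-3'}}}
merged = merge_acroform_nodes(first, second)
print(merged['/DA'])          # (/Helv 0 Tf 0 g)
print(merged['/DR']['/Font'])  # {'/Helv': 'font-1', '/Cour': 'font-3'}
```

The DA of the first file is kept, NeedAppearances is picked up from the second file, and only the font that was missing (/Cour) is added.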
One last note: in my first attempt, all the above manipulation was done on the output's trailer attribute. Then I found out that each time I added the pages from a new input file, pdfrw seemed to erase any AcroForm already present in the trailer. I don't know the reason, but I had to build a separate output_acroform variable and assign it to the output file on the line just before writing out the final PDF.
In the end, here's my code. Forgive me if it's not Pythonic; I just hope it clarifies the points above.
from pdfrw import PdfReader, PdfWriter, PdfName

def merge_pdf_files_pdfrw(pdf_files, output_filename):
    output = PdfWriter()
    num = 0
    output_acroform = None
    fields_key, dr_key, font_key = PdfName('Fields'), PdfName('DR'), PdfName('Font')
    for pdf in pdf_files:
        reader = PdfReader(pdf, verbose=False)
        output.addpages(reader.pages)
        if PdfName('AcroForm') in reader[PdfName('Root')].keys():  # Not all PDFs have an AcroForm node
            source_acroform = reader[PdfName('Root')][PdfName('AcroForm')]
            if fields_key in source_acroform:
                source_formfields = source_acroform[fields_key]
            else:
                source_formfields = []
            # Rename the fields so that names coming from different files cannot clash
            name_key = PdfName('T')
            for num2, form_field in enumerate(source_formfields):
                old_name = form_field[name_key].replace('(', '').replace(')', '')  # Field names are in the "(name)" format
                form_field[name_key] = 'FILE_{n}_FIELD_{m}_{on}'.format(n=num, m=num2, on=old_name)
            if output_acroform is None:
                # Copy the first AcroForm node (it already holds its own fields)
                output_acroform = source_acroform
            else:
                # Append the new fields; doing it only here avoids adding the first file's fields twice
                if fields_key in output_acroform:
                    output_acroform[fields_key] += source_formfields
                # Add new AcroForm keys (including Fields, if still missing)
                for key in source_acroform.keys():
                    if key not in output_acroform:
                        output_acroform[key] = source_acroform[key]
                # Add missing font entries from the /DR node of the source file
                if (dr_key in source_acroform.keys()) and (font_key in source_acroform[dr_key].keys()):
                    source_fonts = source_acroform[dr_key][font_key]
                    if font_key not in output_acroform[dr_key].keys():
                        # The /Font node is missing entirely under an existing /DR: simply add it
                        output_acroform[dr_key][font_key] = source_fonts
                    else:
                        # Otherwise add new fonts only
                        for font_name in source_fonts.keys():
                            if font_name not in output_acroform[dr_key][font_key]:
                                output_acroform[dr_key][font_key][font_name] = source_fonts[font_name]
        num += 1
    # Attach the merged AcroForm only now: adding pages seems to reset it in the trailer
    if output_acroform is not None:
        output.trailer[PdfName('Root')][PdfName('AcroForm')] = output_acroform
    output.write(output_filename)
Hope this helps.