I tried to create a script that loops through a parent folder and its subfolders and merges all of the PDFs into one. Below is the code I have written so far, but I don't know how to combine the pieces into one script.
Reference: Merge PDF files
The first function loops through all of the subfolders under the parent folder to get the path of each PDF.
import os
from PyPDF2 import PdfFileMerger
root = r"folder path"
path = os.path.join(root, "folder path")
def list_dir():
    for path,subdirs,files in os.walk(root):
        for name in files:
            if name.endswith(".pdf") or name.endswith(".ipynb"):
                print (os.path.join(path,name))
Second, I created a list to collect the paths of all the PDF files in the subfolders, and tried to merge them into one combined file. At this step I get the following error:
TypeError: listdir: path should be string, bytes, os.PathLike or None, not list
root_folder = []
root_folder.append(list_dir())
def pdf_merge():
    merger = PdfFileMerger()
    allpdfs = [a for a in os.listdir(root_folder)]
    for pdf in allpdfs:
        merger.append(open(pdf,'rb'))
    with open("Combined.pdf","wb") as new_file:
        merger.write(new_file)
pdf_merge()
What should I change in the code, and where, to avoid the error and combine the two functions?
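First, you have to write a function that builds a list of all the files and returns it. Your list_dir() only prints the paths and returns None, so root_folder ends up as [None], and os.listdir() raises the TypeError because it expects a single path, not a list.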
def list_dir(root):
    result = []
    for path, dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith( (".pdf", ".ipynb") ):
                result.append(os.path.join(path, name))
    return result
I also use .lower() to catch extensions like .PDF. endswith() can take a tuple with all the extensions.
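A quick check of both points, as a minimal sketch. The file names are made up purely for illustration:

```python
# Hypothetical file names, chosen only to illustrate the matching rules.
names = ["report.pdf", "SCAN.PDF", "notes.ipynb", "photo.png"]

# str.endswith() accepts a tuple of suffixes and returns True if any of them matches.
wanted = (".pdf", ".ipynb")

# With lower(), the comparison ignores the case of the extension.
matched = [name for name in names if name.lower().endswith(wanted)]

# Without lower(), "SCAN.PDF" is missed.
case_sensitive = [name for name in names if name.endswith(wanted)]

print(matched)
print(case_sensitive)
```

Note that the suffixes must be passed as a str or a tuple; a list raises a TypeError.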
It is also good practice to pass external values in as arguments: list_dir(root) instead of list_dir().
Then you can call it, e.g. allpdfs = list_dir("folder path"), inside the merge function:
def pdf_merge(root):
    merger = PdfFileMerger()
    allpdfs = list_dir(root)
    for pdf in allpdfs:
        # list_dir() also returns .ipynb files; only PDFs can be merged
        if pdf.lower().endswith(".pdf"):
            merger.append(open(pdf, 'rb'))
    with open("Combined.pdf", 'wb') as new_file:
        merger.write(new_file)
pdf_merge("folder path")
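For background, the original TypeError can be reproduced without PyPDF2. This is only a sketch: list_dir_printing is a hypothetical stand-in for the question's print-only list_dir(), and a temporary folder replaces the real one. The exact wording of the message can differ between Python versions and platforms.

```python
import os
import tempfile

def list_dir_printing(root):
    # Like the question's version: prints each path but has no return statement.
    for path, dirs, files in os.walk(root):
        for name in files:
            print(os.path.join(path, name))

with tempfile.TemporaryDirectory() as tmp:
    returned = list_dir_printing(tmp)  # None, because nothing is returned

root_folder = []
root_folder.append(returned)           # root_folder is now [None]

error_message = ""
try:
    os.listdir(root_folder)            # receives a list instead of a path
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Returning the list from the function, and passing that list straight to the loop instead of to os.listdir(), removes the error.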
EDIT:
The first function could be even more universal if it also accepted the extensions as an argument:
import os
def list_dir(root, exts=None):
    result = []
    for path, dirs, files in os.walk(root):
        for name in files:
            if exts and not name.lower().endswith(exts):
                continue
            result.append(os.path.join(path, name))
    return result
all_files = list_dir('folder_path')
all_pdfs = list_dir('folder_path', '.pdf')
all_images = list_dir('folder_path', ('.png', '.jpg', '.gif'))
print(all_files)
print(all_pdfs)
print(all_images)
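To try this without touching real data, here is a small sketch that runs the same function against a temporary folder tree. The file names and the tree layout are invented for the demonstration:

```python
import os
import tempfile

def list_dir(root, exts=None):
    # Same function as above: exts may be None, a str, or a tuple of str.
    result = []
    for path, dirs, files in os.walk(root):
        for name in files:
            if exts and not name.lower().endswith(exts):
                continue
            result.append(os.path.join(path, name))
    return result

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny, made-up tree: two files at the top, two in a subfolder.
    os.makedirs(os.path.join(tmp, "sub"))
    samples = ("a.pdf", "b.PNG", os.path.join("sub", "c.PDF"), os.path.join("sub", "d.txt"))
    for rel in samples:
        open(os.path.join(tmp, rel), "w").close()

    # Convert to paths relative to tmp and sort, since os.walk() order is not guaranteed.
    everything = sorted(os.path.relpath(p, tmp) for p in list_dir(tmp))
    pdfs = sorted(os.path.relpath(p, tmp) for p in list_dir(tmp, ".pdf"))
    images = sorted(os.path.relpath(p, tmp) for p in list_dir(tmp, (".png", ".jpg", ".gif")))

print(everything)
print(pdfs)
print(images)
```

Because the file name is lowercased before the comparison, the extensions passed in exts should be lowercase too.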
EDIT:
For a single extension you can also use glob:
import glob
all_pdfs = glob.glob('folder_path/**/*.pdf', recursive=True)
It needs ** together with recursive=True to search in subfolders.
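The effect of ** and recursive=True can be checked on a throwaway tree. This is a sketch with invented file names; it compares the recursive pattern with a top-level pattern and with the same ** pattern when the flag is left out:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Made-up tree: one PDF at the top, one a level down, one two levels down.
    os.makedirs(os.path.join(tmp, "sub", "deeper"))
    samples = (
        "top.pdf",
        os.path.join("sub", "mid.pdf"),
        os.path.join("sub", "deeper", "low.pdf"),
        os.path.join("sub", "skip.txt"),
    )
    for rel in samples:
        open(os.path.join(tmp, rel), "w").close()

    pattern = os.path.join(tmp, "**", "*.pdf")

    # With recursive=True, ** matches zero or more nested folders.
    recursive_hits = sorted(os.path.basename(p) for p in glob.glob(pattern, recursive=True))

    # A plain *.pdf pattern only looks in the top folder.
    top_only = sorted(os.path.basename(p) for p in glob.glob(os.path.join(tmp, "*.pdf")))

    # Without recursive=True, ** behaves like a single *, i.e. exactly one folder level.
    without_flag = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(recursive_hits)
print(top_only)
print(without_flag)
```

Unlike the .lower() approach above, glob follows the platform's case rules, so on a case-sensitive file system a *.pdf pattern will not pick up files ending in .PDF.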