
Splitting a PDF with PyPDF2 in a Lambda function

I've got the following Lambda function, which is meant to split an uploaded PDF into individual pages. When I upload an 8-page PDF, it creates 8 identical copies of the original PDF instead of 8 single-page files.

I must be doing something stupid, but I'm not sure what. Help!

import boto3
from PyPDF2 import PdfReader, PdfWriter

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Retrieve the uploaded file details from the event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    file_name = file_key.split('/')[-1]  # Extract the original file name

    # Prepare the output directory path
    output_dir = 'PCP/temp/'  # Specify your desired output directory
    output_prefix = file_name.split('.')[0] + '-'  # Prefix for split file names

    # Download the uploaded file to temp storage
    temp_file_path = '/tmp/' + file_name
    s3.download_file(bucket_name, file_key, temp_file_path)

    # Read the uploaded PDF file
    pdf = PdfReader(temp_file_path)

    # Split the PDF into individual pages and save them
    for page_number in range(len(pdf.pages)):
        print(f"Page {page_number}")
        temp_output_path = f"/tmp/{output_prefix}{page_number + 1}.pdf"
        output_page_path = f"{output_dir}{output_prefix}{page_number + 1}.pdf"
        output_pdf = PdfWriter()
        output_pdf.add_page(pdf.pages[page_number])

        # Write the current page to its own temp file
        with open(temp_output_path, 'wb') as output_file:
            output_pdf.write(output_file)

        # Upload the split page to S3 bucket
        s3.upload_file(temp_file_path, bucket_name, output_page_path)

    return {
        'statusCode': 200,
        'body': 'PDF splitting completed successfully.'
    }

  • When you call s3.upload_file, you are passing temp_file_path, which references the original downloaded file, rather than temp_output_path, which is where you wrote the current page inside the for loop. Each iteration therefore uploads the whole original PDF under a new key, which is why you get 8 identical copies.

    I recommend using more descriptive variable names to avoid mistakes like this, which are easy to miss when the names are similar and generic. Consider renaming temp_file_path to downloaded_pdf_path and temp_output_path to current_page_path.
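
Here is a minimal sketch of the corrected loop using the more descriptive names. The function name split_and_upload and its new_writer and upload parameters are hypothetical, introduced only so the loop can be tried without S3 access; it assumes pdf exposes .pages and that the writer has add_page and write, as PyPDF2's PdfReader and PdfWriter do.

```python
import os


def split_and_upload(pdf, new_writer, upload, output_prefix, output_dir, tmp_dir="/tmp"):
    """Write each page of `pdf` to its own file, then upload that file.

    pdf        -- object with a .pages sequence (e.g. a PyPDF2 PdfReader)
    new_writer -- hypothetical factory returning a fresh writer (e.g. PdfWriter)
    upload     -- hypothetical callable taking (local_path, s3_key)
    """
    uploaded_keys = []
    for page_number, page in enumerate(pdf.pages, start=1):
        # Local file that holds only the current page
        current_page_path = os.path.join(tmp_dir, f"{output_prefix}{page_number}.pdf")
        # Destination key for the current page
        current_page_key = f"{output_dir}{output_prefix}{page_number}.pdf"

        writer = new_writer()
        writer.add_page(page)
        with open(current_page_path, "wb") as page_file:
            writer.write(page_file)

        # Upload the single-page file, not the original download
        upload(current_page_path, current_page_key)
        uploaded_keys.append(current_page_key)
    return uploaded_keys
```

Inside lambda_handler this could be called as split_and_upload(PdfReader(downloaded_pdf_path), PdfWriter, lambda path, key: s3.upload_file(path, bucket_name, key), output_prefix, output_dir). With the two paths named this differently, passing the wrong one to the upload call is much harder to overlook.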