python-3.x amazon-web-services aws-lambda aws-lambda-layers pymupdf

Reading a PDF in AWS Lambda using PyMuPDF


I am trying to read a PDF in AWS Lambda. The PDF is stored in an S3 bucket. I need to extract the text from the PDF and translate it into any required language. My code runs fine in my notebook, but when I run it on Lambda I get this error message in my CloudWatch logs: Task timed out after 3.01 seconds.

import fitz
import base64
from io import BytesIO
from PIL import Image
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    client_textract = boto3.client('textract')
    translate_client = boto3.client('translate')
    try:
        
        print("Inside handler")  
        s3_bucket = "my_bucket"
        pdf_file_name = 'sample.pdf'
        pdf_file = s3.get_object(Bucket=s3_bucket, Key=pdf_file_name)
        file_content = pdf_file['Body'].read()
        print("Before reading ")
        with fitz.open(stream=file_content, filetype="pdf") as doc:
               
            
            

Solution

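  • Try extending the timeout, which defaults to 3 seconds. That default matches the 3.01 seconds in your error message.

    The setting is in the Lambda console under Configuration > General configuration.

    If that does not help, try increasing the allocated memory. Lambda allocates CPU power in proportion to the configured memory, so a higher memory setting also makes the function run faster.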

    Also, consider moving

        s3 = boto3.client('s3')
        client_textract = boto3.client('textract')
        translate_client = boto3.client('translate')
    

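    out of your handler and placing it right after the imports. Code at module level runs once when the execution environment starts, so later invocations in the same environment reuse the clients instead of recreating them.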
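The effect of moving the clients to module level can be sketched without any AWS dependency. This is only an illustration: `make_client` and `CLIENTS_CREATED` are hypothetical stand-ins (not part of boto3) that count how often a client is constructed. In the real function you would call `boto3.client(...)` in the same place.

```python
CLIENTS_CREATED = 0  # counts how often a client is constructed


def make_client(service_name):
    """Hypothetical stand-in for boto3.client(service_name)."""
    global CLIENTS_CREATED
    CLIENTS_CREATED += 1
    return {"service": service_name}


# Module scope: runs once when the execution environment is initialised
# (a cold start), not on every invocation.
s3 = make_client("s3")
client_textract = make_client("textract")
translate_client = make_client("translate")


def lambda_handler(event, context):
    # The handler only reuses the clients created above.
    clients = (s3, client_textract, translate_client)
    return {
        "services": [client["service"] for client in clients],
        "clients_created": CLIENTS_CREATED,
    }


# Two invocations in the same environment construct no additional clients.
first = lambda_handler({}, None)
second = lambda_handler({}, None)
```

Both calls report the same construction count, because all setup happened at import time. Had the three `make_client` calls stayed inside the handler, every invocation would construct three more clients.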