python, aws-lambda, multiprocessing, amazon-textract

AWS Textract asynchronous operations within multiprocessing


I am working in a Lambda function within AWS. I have two functions which asynchronously call Textract to return the extracted text from an image. Given the volume of images I need Textract to process, switching to this asynchronous operation from singular one-at-a-time calls (each of which must wait for its result to complete before a new request can be submitted) reduced my Textract processing time from 8 minutes to about 3 minutes, a vast improvement.

But I am looking into multiprocessing to see if I can reduce the time even further. However, multiprocessing.Pool.map and multiprocessing.Pool.starmap do not work in AWS Lambda, since the Lambda execution environment lacks the shared memory (/dev/shm) that Pool and Queue rely on. I saw some recommendations for using multiprocessing.Process and multiprocessing.Pipe instead, but it isn't clear whether that will actually make a big impact.
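For reference, the Process/Pipe pattern I saw recommended looks roughly like this (just a sketch; jobs is a stand-in for my own list of (image, loc) pairs):

from multiprocessing import Process, Pipe

def worker(conn, img, loc):
    # Run one upload + StartDocumentTextDetection call and send the
    # JobId back to the parent through the pipe
    conn.send(extract_text_async(img, loc))
    conn.close()

processes = []
parent_connections = []
for img, loc in jobs:
    parent_conn, child_conn = Pipe()
    parent_connections.append(parent_conn)
    p = Process(target=worker, args=(child_conn, img, loc))
    processes.append(p)
    p.start()

for p in processes:
    p.join()  # JobIds are tiny, so joining before recv() won't fill the pipe

job_ids = [conn.recv() for conn in parent_connections]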

Based on my code below, will leveraging multiprocessing.Process and multiprocessing.Pipe make a noticeable improvement in processing time, or is it not worth the effort? If it is worth it, can anyone suggest how to actually implement it given my code? I am brand new to multiprocessing, and there's a lot to wrap my head around, further complicated by also trying to implement it in AWS.

import io
import time
from datetime import datetime

import boto3
from PIL import Image

# key_id, bucket_name, output_bucket, snsarn, and rolearn are defined elsewhere
s3 = boto3.resource('s3')
textract_client = boto3.client('textract')

def extract_text_async(img, loc):
    # Convert the image array to PNG bytes and upload it to S3
    img_obj = Image.fromarray(img).convert('RGB')
    out_img_obj = io.BytesIO()
    img_obj.save(out_img_obj, format="png")
    out_img_obj.seek(0)
    file_name = key_id + "_" + loc + ".png"
    s3.Bucket(bucket_name).put_object(Key=file_name, Body=out_img_obj, ContentType="image/png")
    # Start the asynchronous detection job and return its JobId
    response = textract_client.start_document_text_detection(
        DocumentLocation={'S3Object': {'Bucket': bucket_name, 'Name': file_name}},
        JobTag=key_id + loc,
        NotificationChannel={'SNSTopicArn': snsarn, 'RoleArn': rolearn},
        OutputConfig={'S3Bucket': output_bucket,
                      'S3Prefix': str(datetime.now()).replace(" ", "_") + key_id + "_" + loc + "_textract_output"})
    return response['JobId']

def fetch_textract_async(jobid):
    # 'Blocks' is only present once the job has finished, so poll until
    # the status leaves IN_PROGRESS
    response = textract_client.get_document_text_detection(JobId=jobid, MaxResults=1000)
    while response['JobStatus'] == 'IN_PROGRESS':
        time.sleep(1)
        response = textract_client.get_document_text_detection(JobId=jobid, MaxResults=1000)
    # Record the length of each text block, then keep the longest one
    text_len = {}
    for y, block in enumerate(response.get('Blocks', [])):
        if 'Text' in block:
            text_len[y] = len(block['Text'])
    if text_len:
        extracted_text = response['Blocks'][max(text_len, key=text_len.get)]['Text']
        if extracted_text == '-':
            extracted_text = ''
    else:
        extracted_text = ''
    return extracted_text

# example function calls
s1_1 = extract_text_async(cropped_images['Section 1']['1'], "s1_1")
s1_2 = extract_text_async(cropped_images['Section 1']['2'], "s1_2")
s1_3 = extract_text_async(cropped_images['Section 1']['3'], "s1_3")
s1_1_result = fetch_textract_async(s1_1)
s1_2_result = fetch_textract_async(s1_2)
s1_3_result = fetch_textract_async(s1_3)

Solution

  • In a well-architected, scalable setup for running Amazon Textract, the callback itself should be event-driven through SNS (which, judging from your snippet, you're already using). So your Lambdas will just be 1) kicking off jobs and 2) reading the results.
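    As a minimal sketch of that read side, assuming the topic from your NotificationChannel triggers a Lambda like the hypothetical sns_handler below (reusing your fetch_textract_async):

        import json
        import boto3

        textract_client = boto3.client('textract')

        def sns_handler(event, context):
            # Textract publishes one JSON completion notice per job
            for record in event['Records']:
                message = json.loads(record['Sns']['Message'])
                if message['Status'] == 'SUCCEEDED':
                    text = fetch_textract_async(message['JobId'])
                    # ...store or forward the extracted text here...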

    If you're considering spiky workloads with very high concurrency (e.g. dumping a large number of documents at once and processing them as fast as possible), it's worth checking your applied quotas, e.g. StartDocumentTextDetection and GetDocumentTextDetection TPS and the maximum number of concurrent jobs. As mentioned on this page, smoothing out the workload is one good way to improve overall throughput.
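    You can read the applied values programmatically; a quick sketch using the Service Quotas API:

        import boto3

        # Print the applied Textract quotas for this account/region
        sq = boto3.client('service-quotas')
        paginator = sq.get_paginator('list_service_quotas')
        for page in paginator.paginate(ServiceCode='textract'):
            for quota in page['Quotas']:
                print(quota['QuotaName'], quota['Value'])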

    On the (1) job kick-off side:
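    A sketch of what this could look like, feeding submissions through an SQS queue so that spikes get smoothed out (the bucket, key, sns_arn, and role_arn message fields are assumptions, not a fixed schema):

        import json
        import boto3

        textract_client = boto3.client('textract')

        def kickoff_handler(event, context):
            # Each SQS message names one image already uploaded to S3
            for record in event['Records']:
                body = json.loads(record['body'])
                textract_client.start_document_text_detection(
                    DocumentLocation={'S3Object': {'Bucket': body['bucket'], 'Name': body['key']}},
                    NotificationChannel={'SNSTopicArn': body['sns_arn'], 'RoleArn': body['role_arn']})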

    On the (2) job retrieval side:
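    One thing worth handling here in any case: a single GetDocumentTextDetection call returns at most 1,000 blocks, so page through NextToken. A sketch, with fetch_all_blocks as a hypothetical helper:

        def fetch_all_blocks(job_id):
            # Collect every block for a finished job, following NextToken
            blocks, next_token = [], None
            while True:
                kwargs = {'JobId': job_id, 'MaxResults': 1000}
                if next_token:
                    kwargs['NextToken'] = next_token
                response = textract_client.get_document_text_detection(**kwargs)
                blocks.extend(response.get('Blocks', []))
                next_token = response.get('NextToken')
                if not next_token:
                    return blocks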

    In terms of staying under throttling quotas, AWS typically suggests retries with backoff rather than explicit quota management, because the former scales to distributed systems while the latter requires a central monitor of current consumption, with all the associated risks of deadlock and so on.
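    With boto3 you get most of this for free via the built-in retry modes; a sketch:

        import boto3
        from botocore.config import Config

        # 'adaptive' mode adds client-side rate limiting on top of
        # retries with exponential backoff
        textract_client = boto3.client(
            'textract',
            config=Config(retries={'max_attempts': 10, 'mode': 'adaptive'}))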


    In summary, focusing on multiprocessing might be a bit premature, because it only addresses scaling within one running instance. It might be better to check whether the overall distributed architecture is well-optimized, including how those Lambdas get invoked and how requests get batched.

    For more examples, I'd suggest checking out: