amazon-web-services state-machine aws-step-functions

What is the average performance benchmark on AWS Step Functions?

We adopted AWS Step Functions a couple of months ago. We expect a large volume of payloads (~1M records) to run through each state machine in the near future, and we're busy testing performance. The results on an almost blank state machine are much worse than we expected.

EXAMPLE

// Payload

{
    "uuid": "2f20e60e-494d-4597-9760-2e103b1a6379",
    "amount": 17554,
    "accountNumber": "1111111",
    "bank": "TYME",
    "branchCode": "979592",
    "type": "RTC",
    "date": "2024-01-08T15:10:37.979Z",
    "description": "test 6",
    "accountHolder": "Bob",
    "accountType": "SAVINGS"
}

The Express state machine below read its data from S3 and processed 15k records like the example above. The Prepare Data state just passes the data through; it doesn't perform any kind of I/O or computation.

"Prepare Data": {
    "Type": "Pass",
    "Parameters": {
        "pk.$": "States.Format('Payment#{}', $.uuid)",
        "sk": "PAYMENT",
        "amount.$": "States.Format('{}', $.amount)",
        "bank.$": "$.bank",
        "accountNumber.$": "$.accountNumber",
        "accountHolder.$": "$.accountHolder",
        "accountType.$": "$.accountType",
        "branchCode.$": "$.branchCode",
        "type.$": "$.type",
        "date.$": "$.date",
        "description.$": "$.description",
        "status": "CREATED"
    },
    "ResultPath": "$.preparedItem",
    "Next": "Finish"
}

The Maximum Concurrency for this execution was set to 10000.

Is this the kind of performance we should expect from Step Functions, or is there something I'm missing? I expected this configuration to process everything in less than a minute.

RESULTS

Our state machines use services like SQS, DynamoDB and Lambda, and we thought these might be degrading performance somewhat. But after stripping things bare, it seems the issue might be Step Functions itself.

Solution

When processing large numbers of small items with Step Functions Distributed Map, you will want to employ batching. Batching reduces the cumulative overhead of managing discrete units of work when that isn't necessary or helpful.

Batching is a built-in capability of Distributed Map. You configure it using the ItemBatcher field. You can control the maximum batch size by the number of items per batch or by the size in bytes per batch.
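As a rough sketch, an ItemBatcher block on the Distributed Map state could look like the fragment below. The values are illustrative assumptions, not tuned recommendations, and the rest of the Map state (ItemReader, ItemProcessor and so on) is omitted. With 100 items per batch, the 15k records in the question would be handled by about 150 child executions instead of 15,000.

```json
{
    "ItemBatcher": {
        "MaxItemsPerBatch": 100,
        "MaxInputBytesPerBatch": 262144
    }
}
```

If both limits are set, a batch is closed as soon as either one is reached.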

Once you employ batching, you will need to handle the array of items in your ItemProcessor, which you can do either sequentially or in parallel (using an Inline Map state in your Express ItemProcessor).
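With batching enabled, each child execution receives its batch as an array under the Items key of its input. A minimal sketch of an Express ItemProcessor that fans that array out through an Inline Map might look like this. The state name "Process Batch", the inner MaxConcurrency of 10, and the trimmed-down Prepare Data state are assumptions for illustration; adapt them to your own workflow.

```json
{
    "ItemProcessor": {
        "ProcessorConfig": {
            "Mode": "DISTRIBUTED",
            "ExecutionType": "EXPRESS"
        },
        "StartAt": "Process Batch",
        "States": {
            "Process Batch": {
                "Type": "Map",
                "ItemsPath": "$.Items",
                "MaxConcurrency": 10,
                "ItemProcessor": {
                    "ProcessorConfig": {
                        "Mode": "INLINE"
                    },
                    "StartAt": "Prepare Data",
                    "States": {
                        "Prepare Data": {
                            "Type": "Pass",
                            "Parameters": {
                                "pk.$": "States.Format('Payment#{}', $.uuid)",
                                "sk": "PAYMENT",
                                "status": "CREATED"
                            },
                            "End": true
                        }
                    }
                },
                "End": true
            }
        }
    }
}
```

Setting the inner MaxConcurrency to 1 would process the batch sequentially instead.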

If you want to learn more about Distributed Map and how to use it at scale, here are a couple of re:Invent sessions from 2022 and 2023 that you might find useful.