Tags: python, amazon-web-services, amazon-systems-manager

AWS Systems Manager "In Progress" commands limit to 5?


Firstly, I looked around for an existing thread on the issue I'm facing, but I haven't found anything. I've also posted this on the AWS forums and got no answer. If there is an existing thread here for this already, I apologize. I'll also apologize in advance for the relatively long post.

Now, what I am trying to do is run multiple (blocking) processes of the same app using the AWS-RunShellScript document. The problem is that I can't get more than 5 processes started using this method. If I start them via SSH or even manually, I can start dozens without any issues.

The instance I am using runs Ubuntu. I am manipulating AWS resources using Python 3.7.4 (boto3), but the same occurs when using the AWS Console as well.

Each command would normally block the terminal (i.e. prevent you from issuing further commands in that terminal, if you were running it manually), which in turn keeps its status, as seen by AWS SSM, as In Progress. Essentially, the command is not complete from SSM's point of view until the process is killed or stopped (more on that below).

The problem is that I can run up to 4 processes through SSM and still be able to manipulate them using SSM (killing, inspecting, etc.), meaning a maximum of 4 commands In Progress. However, when I launch a 5th one, while they all continue to run, I can't use SSM anymore: no other command gets executed, whether it is a new process or any other command.

The easiest way to reproduce this is to send 5 simple sleep 60 commands via the AWS-RunShellScript document and then attempt any new command: you'll notice in SSM that they all show up as In Progress, but if you tail the amazon-ssm-agent.log file, no new commands actually get executed. What's odder, you'll notice that the log stops after this block:

2019-08-13 08:25:12 INFO [MessagingDeliveryService] SendReply Response{
  Description: "Reply e82b5dcb-0e81-4698-8f6e-fe1411f18300 was successfully sent.",
  MessageId: "aws.ssm.1af47ba7-0d28-41ac-83dd-3bffbaa7db2d.i-08d3f4176a025a07b",
  ReplyId: "e82b5dcb-0e81-4698-8f6e-fe1411f18300",
  ReplyStatus: "QUEUED"
}

No further commands get processed past this point, and no further information is logged. However, using our example, once the sleeps end, the QUEUED commands get executed as soon as a slot opens up (assuming only 5 commands can be in flight at a time, as I believe is the case, although it's not mentioned anywhere).
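
For reference, a minimal, self-contained version of this reproduction (the instance ID is a placeholder; boto3 picks up credentials and region from the environment) looks roughly like this:

import boto3

ssm = boto3.client('ssm')

# Send 6 blocking "sleep 60" commands. The first 5 start executing right away;
# the 6th shows up as In Progress in SSM but stays queued on the agent until
# one of the running sleeps finishes.
for _ in range(6):
    ssm.send_command(
        InstanceIds=['i-0123456789abcdef0'],  # placeholder instance ID
        DocumentName='AWS-RunShellScript',
        Parameters={'commands': ['sleep 60']},
    )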

Note: While I've mentioned the AWS-RunShellScript document, the same issue occurs with the AWS-RunRemoteScript document as well.

Since I have to provide some code, please find below snippets from the example mentioned, using Python:

# ssm is a boto3 SSM client; sleep_time, workingDirectory, executionTimeout,
# bucket_name, bucket_prefix and project_name are defined elsewhere in the script.
run_cmd_shell = lambda: ssm.send_command(
    Targets=[
        {
            'Key': 'tag:Name',
            'Values': ['test_ssm']
        },
        {
            'Key': 'tag:Role',
            'Values': ['slave']
        }
    ],
    DocumentName='AWS-RunShellScript',
    Parameters={
        'commands': [f'sleep {sleep_time}'],
        'workingDirectory': [workingDirectory],
        'executionTimeout': [executionTimeout]
    },
    OutputS3BucketName=bucket_name,
    OutputS3KeyPrefix=bucket_prefix,
    MaxConcurrency='150'
)


remote_cmd_script = lambda: ssm.send_command(
    Targets=[
        {
            'Key': 'tag:Name',
            'Values': ['test_ssm']
        },
        {
            'Key': 'tag:Role',
            'Values': ['slave']
        }
    ],
    DocumentName='AWS-RunRemoteScript',
    Parameters={
        'sourceType': ['S3'],
        'sourceInfo': [f'{{"path":"https://s3.amazonaws.com/{bucket_name}/agents/{project_name}"}}'],
        'commandLine': [f'sleep {sleep_time}'],
        'workingDirectory': [workingDirectory],
        'executionTimeout': [executionTimeout]
    },
    OutputS3BucketName=bucket_name,
    OutputS3KeyPrefix=bucket_prefix,
    MaxConcurrency='150'
)
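
For completeness, a simplified sketch of how one of these commands can be sent and its per-instance status inspected afterwards (the polling here is purely illustrative):

# Send one of the blocking commands and watch its per-instance status.
command_id = run_cmd_shell()['Command']['CommandId']

# Each targeted instance gets its own invocation record; while the process is
# still running (or waiting behind the agent's queue) the status shows In Progress.
for inv in ssm.list_command_invocations(CommandId=command_id)['CommandInvocations']:
    print(inv['InstanceId'], inv['Status'])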

I would expect to be able to run as many blocking commands as I can via SSH or manually (which is a lot more than 5), but either I am doing something wrong SSM-wise, or AWS SSM is limited.


Solution

  • Short answer: increase the CommandWorkersLimit setting in the amazon-ssm-agent.json file.

    A slightly longer explanation of how I tracked it down:

    From the release notes in the agent's source code:

    Removed the upper limit for the maximum number of parallel executing documents on the agent (previously the max was 10) You can configure this number by setting the “CommandWorkerLimit” attribute in amazon-ssm-agent.json file

    And if we take a peek at the amazon-ssm-agent.json.template file, in the Mds section you can see it set to 5:

    {
        "Profile":{
            "ShareCreds" : true,
            "ShareProfile" : ""
        },
        "Mds": {
            "CommandWorkersLimit" : 5,
            "StopTimeoutMillis" : 20000,
            "Endpoint": "",
            "CommandRetryLimit": 15
        },
    ... <LOTS DELETED> 
    }
    

    Directions on editing the config file can be found in the agent's documentation.
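
    As a rough sketch (assuming an Ubuntu instance where the active config lives at /etc/amazon/ssm/amazon-ssm-agent.json, created from the .template file, and where the agent is managed by systemd; if the agent was installed as a snap, restart it with snap restart amazon-ssm-agent instead), the change could be scripted like this:

    import json
    import subprocess

    CONFIG_PATH = '/etc/amazon/ssm/amazon-ssm-agent.json'  # created from amazon-ssm-agent.json.template

    # Raise the number of parallel command workers (run on the instance, as root).
    with open(CONFIG_PATH) as f:
        config = json.load(f)

    config.setdefault('Mds', {})['CommandWorkersLimit'] = 20  # pick a limit that suits your workload

    with open(CONFIG_PATH, 'w') as f:
        json.dump(config, f, indent=4)

    # Restart the agent so the new limit takes effect.
    subprocess.run(['systemctl', 'restart', 'amazon-ssm-agent'], check=True)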