Tags: azure, azure-machine-learning-service, azure-ml-pipelines, azure-ml-component

Concurrency on Azure VM for ML Endpoint


I am currently working on deploying an Azure ML endpoint and facing an issue with achieving parallel execution. Below is the code snippet I am using:

from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    ManagedOnlineDeployment,
    OnlineRequestSettings,
)

def create_online_deployment(self, ml_client, build_suffix, online_endpoint_name):
    env = Environment(
        conda_file='./env.yml',
        image='mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cpu-inference:latest',
    )
    deployment_name = f'deploy-{build_suffix}'
    lc_deployment = ManagedOnlineDeployment(
        name=deployment_name,
        environment=env,
        code_configuration=CodeConfiguration(
            code='./', scoring_script='ai_api/amp.py'
        ),
        request_settings=OnlineRequestSettings(request_timeout_ms=120000, 
                                               max_concurrent_requests_per_instance=4,
                                               max_queue_wait_ms=60000),
        endpoint_name=online_endpoint_name,
        instance_type='Standard_E2s_v3',
        instance_count=1,
    )
    ml_client.online_deployments.begin_create_or_update(lc_deployment).result()
    self.logger.info(f'Created online deployment: {deployment_name}')

In this setup, our scoring script amp.py is designed to run serially, and we don't want to implement threading inside the script. Instead, we want requests to be processed in parallel at the endpoint level.
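For clarity, amp.py follows the standard init()/run() scoring-script contract for Azure ML online endpoints. The sketch below only illustrates its serial shape; the time.sleep is a stand-in for the real work, which is omitted here:

import time

def init():
    # One-time setup (e.g. loading the model); runs once per worker process.
    pass

def run(raw_data):
    # Purely serial work, no threading inside the script.
    time.sleep(10)  # stand-in for the ~10 s of real processing
    return {'status': 'done'}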

For example, if amp.py takes 10 seconds to execute and 10 requests hit the API simultaneously, I expect that in the first 10 seconds, 4 requests are processed concurrently; in the next 10 seconds, the following 4 are processed; and in the last 10 seconds, the remaining 2 are processed. That would give a total execution time of roughly 30 seconds for all 10 requests. However, what I'm currently seeing in the Azure ML endpoint logs is that the requests are processed serially, leading to a total time of 100 seconds.

Could anyone guide me on how to configure the deployment to achieve this parallelism without modifying the scoring script to handle threading? Any code snippets or advice would be greatly appreciated. Thanks in advance!
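For reference, a simple way to reproduce the timing test from the client side looks roughly like this (a minimal sketch; the scoring URL, key, and JSON payload below are placeholders, and the requests package is assumed to be installed):

import time
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder values for illustration only.
SCORING_URL = 'https://<endpoint-name>.<region>.inference.ml.azure.com/score'
API_KEY = '<endpoint-key>'

def call_endpoint(i):
    start = time.time()
    resp = requests.post(
        SCORING_URL,
        json={'request_id': i},  # hypothetical payload
        headers={'Authorization': f'Bearer {API_KEY}'},
        timeout=120,
    )
    return i, resp.status_code, time.time() - start

# Fire 10 requests at once; with 4 concurrent slots the batch should
# finish in roughly 30 seconds rather than 100.
with ThreadPoolExecutor(max_workers=10) as pool:
    for i, status, elapsed in pool.map(call_endpoint, range(10)):
        print(f'request {i}: status={status}, took {elapsed:.1f}s')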



Solution

  • WORKER_COUNT needs to be added to the deployment's environment variables. It controls how many worker processes the Azure ML inference server starts, so setting it to 4 lets 4 requests be handled in parallel without adding threading to the scoring script. Refer to: https://learn.microsoft.com/en-us/azure/machine-learning/reference-yaml-deployment-managed-online?view=azureml-api-2

    Code changes required:

    code_configuration=CodeConfiguration(code="./", scoring_script="ai_api/amp.py"),
    environment_variables={"WORKER_COUNT": 4},
    request_settings=OnlineRequestSettings(request_timeout_ms=120000, max_concurrent_requests_per_instance=4),
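    For context, this is roughly how the full deployment function from the question could look with that change applied. It is a sketch assembled from the snippets above, not a verified drop-in; the paths, build-suffix handling, and instance choice are carried over from the question:

    from azure.ai.ml.entities import (
        CodeConfiguration,
        Environment,
        ManagedOnlineDeployment,
        OnlineRequestSettings,
    )

    def create_online_deployment(self, ml_client, build_suffix, online_endpoint_name):
        env = Environment(
            conda_file='./env.yml',
            image='mcr.microsoft.com/azureml/minimal-ubuntu20.04-py38-cpu-inference:latest',
        )
        deployment_name = f'deploy-{build_suffix}'
        lc_deployment = ManagedOnlineDeployment(
            name=deployment_name,
            environment=env,
            code_configuration=CodeConfiguration(
                code='./', scoring_script='ai_api/amp.py'
            ),
            # WORKER_COUNT tells the inference server how many worker processes
            # to start; matching it to max_concurrent_requests_per_instance lets
            # 4 requests run in parallel without changing the scoring script.
            environment_variables={'WORKER_COUNT': '4'},
            request_settings=OnlineRequestSettings(
                request_timeout_ms=120000,
                max_concurrent_requests_per_instance=4,
                max_queue_wait_ms=60000,
            ),
            endpoint_name=online_endpoint_name,
            instance_type='Standard_E2s_v3',
            instance_count=1,
        )
        ml_client.online_deployments.begin_create_or_update(lc_deployment).result()
        self.logger.info(f'Created online deployment: {deployment_name}')

    Note that each worker process runs its own copy of init(), so memory usage scales with WORKER_COUNT; on a 2-vCPU SKU like Standard_E2s_v3 you may want to confirm that 4 workers actually fit.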