python, amazon-web-services, distributed-computing, amazon-sagemaker, embarrassingly-parallel

How to parallelize a for loop in a SageMaker Processing job


I'm running a Python script in a SageMaker Processing job, specifically with SKLearnProcessor. The script runs a for loop with 200 iterations (each iteration is independent), and each iteration takes about 20 minutes. For example, script.py:

for i in list:
   run_function(i)

I'm kicking off the job from a notebook:

sklearn_processor = SKLearnProcessor(
    framework_version="1.0-1", role=role,
    instance_type="ml.m5.4xlarge", instance_count=1,
    sagemaker_session = Session()
)

out_path = 's3://' + os.path.join(bucket, prefix,'outpath')

sklearn_processor.run(
    code="script.py",
    outputs=[
        ProcessingOutput(output_name="load_training_data",
                         source = '/opt/ml/processing/output',
                         destination = out_path),
    ],
    arguments=["--some-args", "args"]
)

I want to parallelize this code and make the SageMaker Processing job use its full capacity to run as many concurrent executions as possible. How can I do that?


Solution

  • There are basically 3 paths you can take, depending on the context.

    Parallelising function execution

    This solution has nothing to do with SageMaker. It is applicable to any Python script, regardless of the ecosystem, as long as you have the resources needed to parallelise the task.

    Based on the needs of your software, you have to work out whether to parallelise with multiple threads or multiple processes. This question may clarify some doubts in this regard: Multiprocessing vs. Threading Python

    Here is a simple example of how to parallelise:

    from multiprocessing import Pool
    import os
    
    POOL_SIZE = os.cpu_count()
    
    your_list = [...]
    
    def run_function(i):
        # ...
        return your_result
    
    
    if __name__ == '__main__':
        with Pool(POOL_SIZE) as pool:
            print(pool.map(run_function, your_list))
    

    Splitting input data into multiple instances

    This solution depends on the quantity and size of the data. If the items are completely independent of each other and of considerable size, it may make sense to split the data over several instances. Execution will be faster, and there may also be a cost reduction, depending on the instances chosen compared with the initial, larger instance.

    In your case, it is clearly the instance_count parameter that you need to set, as the documentation says:

    instance_count (int or PipelineVariable) - The number of instances to run the Processing job with. Defaults to 1.

    This should be combined with splitting the input via ProcessingInput, as sketched below.
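
    For illustration, here is a minimal sketch of how the notebook call could look with several instances and a sharded input. The S3 input prefix and instance size are placeholders, and role and out_path are assumed to be defined as in the question; ShardedByS3Key makes each instance receive only its own subset of the objects under the prefix.

    from sagemaker import Session
    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.sklearn.processing import SKLearnProcessor
    
    sklearn_processor = SKLearnProcessor(
        framework_version="1.0-1", role=role,
        instance_type="ml.m5.xlarge",   # several smaller instances instead of one large one
        instance_count=5,
        sagemaker_session=Session(),
    )
    
    sklearn_processor.run(
        code="script.py",
        inputs=[
            ProcessingInput(
                source="s3://your-bucket/your-prefix/input/",  # placeholder input prefix
                destination="/opt/ml/processing/input",
                # each instance gets only its own shard of the S3 objects
                s3_data_distribution_type="ShardedByS3Key",
            ),
        ],
        outputs=[
            ProcessingOutput(output_name="load_training_data",
                             source="/opt/ml/processing/output",
                             destination=out_path),
        ],
        arguments=["--some-args", "args"],
    )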

    P.S.: This approach makes sense if the data can be retrieved before the script is executed. If the data is generated internally, the generation logic must be changed so that it works across multiple instances.

    Combined approach

    One can undoubtedly combine the two previous approaches, i.e. write a script that parallelises the execution of a function over a list and run it on several parallel instances.

    An example of use could be processing a number of CSV files. If there are 100 CSVs, we may decide to launch 5 instances so that 20 files are passed to each instance, and within each instance parallelise the reading and/or processing of the files and/or rows in the relevant functions, as in the sketch below.
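
    For example, a minimal sketch of what script.py could look like inside each instance, assuming the sharded ProcessingInput shown earlier; process_csv is a hypothetical placeholder for the per-file work:

    from multiprocessing import Pool
    import os
    
    INPUT_DIR = "/opt/ml/processing/input"    # each instance sees only its own shard of files here
    OUTPUT_DIR = "/opt/ml/processing/output"
    POOL_SIZE = os.cpu_count()
    
    def process_csv(path):
        # hypothetical per-file work: read, transform, write results under OUTPUT_DIR
        ...
    
    if __name__ == "__main__":
        files = [os.path.join(INPUT_DIR, f) for f in os.listdir(INPUT_DIR)]
        with Pool(POOL_SIZE) as pool:
            pool.map(process_csv, files)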

    If you pursue such an approach, you must monitor carefully whether it is really improving the system rather than wasting resources.