azure-machine-learning, pytorch, azure-sdk

How can I run EAST / custom PyTorch code on Azure Machine Learning?


I can train EAST locally without any problems:

https://github.com/foamliu/EAST

But the thing is, I need to train it in the cloud on Azure Machine Learning, and all I can see there are "Notebooks". Is there a way to use the Azure SDK directly, so that I can access the datasets in the cloud and save the model as a job output without using any Jupyter notebooks? How can I run PyTorch on Azure Machine Learning?


Solution

  • You can follow the approach below.

    First, create a compute cluster in your ML workspace. I created a GPU cluster as shown below.

    (screenshot: GPU compute cluster in the ML workspace)

    Next, connect to the ML workspace and get a client object.

    from azure.identity import DefaultAzureCredential
    from azure.ai.ml import MLClient

    credential = DefaultAzureCredential()

    try:
        # Works when a workspace config.json is present alongside the code
        ml_client = MLClient.from_config(credential)
    except Exception as ex:
        print(ex)
        # Fall back to explicit workspace details
        subscription_id = "your_sub_id"
        resource_group = "your_resource_group"
        workspace = "your_workspace_name"
        ml_client = MLClient(credential, subscription_id, resource_group, workspace)
    

    Next, define the command job, providing the arguments and inputs as below.

    from azure.ai.ml import command
    
    inputs = {
        "network": "r50",
        "pretrain": True,
        "epoch": 10,
        "batch_size": 64,
        "weight_decay": 0.1,
        "optim": "adam",
        "lr": 0.001,
        "mom": 0.9,
    }

    job = command(
        code="./",
        command="python train.py --network ${{inputs.network}} --end-epoch ${{inputs.epoch}} --pretrained ${{inputs.pretrain}} --batch-size ${{inputs.batch_size}} --weight-decay ${{inputs.weight_decay}} --lr ${{inputs.lr}} --mom ${{inputs.mom}} --optimizer ${{inputs.optim}}",
        inputs=inputs,
        environment="azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6",
        compute="gpu-cluster",
        instance_count=2,
        distribution={
            "type": "PyTorch",
            "process_count_per_instance": 1,
        },
    )
    

    Here, pass the source directory containing all your scripts in the code parameter; in my case it is the current directory itself. Then, based on your requirements, pass the arguments in the command parameter as above, and give the name of the compute cluster you created in compute.

    Next submit the job.

    ml_client.jobs.create_or_update(job)
    

    You will get output like the one below: all the required code is uploaded and the job starts to run.

    (screenshot: job submission output)

    Output:

    (screenshot: job run output)

    Here you can see that the job was created under the experiment EAST. As per the training script, a checkpoint is saved after each epoch.
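    Once the run finishes, those checkpoints can be retrieved programmatically as well. A small helper as a sketch (the function name is mine; it wraps the SDK's `jobs.stream` and `jobs.download` operations and pulls the job's default outputs):

```python
def fetch_outputs(ml_client, job_name, download_path="./artifacts"):
    """Wait for an Azure ML job to finish, then download its outputs locally."""
    ml_client.jobs.stream(job_name)        # blocks until completion, printing logs
    ml_client.jobs.download(job_name,      # pulls the job's output artifacts
                            download_path=download_path)
```

    Usage, after submitting: `returned_job = ml_client.jobs.create_or_update(job)` followed by `fetch_outputs(ml_client, returned_job.name)`.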