azure-machine-learning, pytorch, azure-sdk

How can I run EAST / custom PyTorch code on Azure Machine Learning?


I can train EAST locally without any problems:

https://github.com/foamliu/EAST

But the thing is, I need to train it in the cloud on Azure Machine Learning, and all I can see there are "Notebooks". Is there a way to use the Azure SDK directly, so that I can access the datasets in the cloud and save the model as a job output without using any Jupyter notebooks? How can I run PyTorch on Azure Machine Learning?


Solution

  • You can follow the approach below.

    First, create a compute cluster in your ML workspace. I created a GPU cluster as shown below.

    (screenshot: GPU compute cluster in the ML workspace)

    Next, connect to the ML workspace and get a client object.

    from azure.identity import DefaultAzureCredential
    from azure.ai.ml import MLClient

    credential = DefaultAzureCredential()

    try:
        # Works when a workspace config.json is present alongside the code
        ml_client = MLClient.from_config(credential)
    except Exception as ex:
        print(ex)
        # Fall back to explicit workspace details
        subscription_id = "your_sub_id"
        resource_group = "your_resource_group"
        workspace = "your_workspace_name"
        ml_client = MLClient(credential, subscription_id, resource_group, workspace)
    

    Next, define the command job, providing the arguments and inputs as below.

    from azure.ai.ml import command
    
    inputs = {
        "network": "r50",
        "pretrain": True,
        "epoch": 10,
        "batch_size": 64,
        "weight_decay": 0.1,
        "optim": "adam",
        "lr": 0.001,
        "mom": 0.9,
    }

    job = command(
        code="./",
        command="python train.py --network ${{inputs.network}} --end-epoch ${{inputs.epoch}} --pretrained ${{inputs.pretrain}} --batch-size ${{inputs.batch_size}} --weight-decay ${{inputs.weight_decay}} --lr ${{inputs.lr}} --mom ${{inputs.mom}} --optimizer ${{inputs.optim}}",
        inputs=inputs,
        environment="azureml:AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu:6",
        compute="gpu-cluster",
        instance_count=2,
        distribution={
            "type": "PyTorch",
            "process_count_per_instance": 1,
        },
    )
    

    Here, pass the source directory containing all your scripts in the code parameter; in my case it is the current directory itself. Then, based on your requirements, pass the arguments in the command parameter as above, and give the name of the compute cluster you created in compute.

    Next submit the job.

    ml_client.jobs.create_or_update(job)
    

    You will get output like the one below: all the required code is uploaded and the job starts to run.

    (screenshot: job submission output)

    Output:

    (screenshot: job run output)

    Here you can see that the job was created under the experiment EAST. As per the training script, a checkpoint is saved after each epoch.
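    Once the run finishes, those checkpoints can be retrieved programmatically as well. A small helper as a sketch (the function name is mine; it wraps the SDK's `jobs.stream` and `jobs.download` operations and pulls the job's default outputs):

```python
def fetch_outputs(ml_client, job_name, download_path="./artifacts"):
    """Wait for an Azure ML job to finish, then download its outputs locally."""
    ml_client.jobs.stream(job_name)        # blocks until completion, printing logs
    ml_client.jobs.download(job_name,      # pulls the job's output artifacts
                            download_path=download_path)
```

    Usage, after submitting: `returned_job = ml_client.jobs.create_or_update(job)` followed by `fetch_outputs(ml_client, returned_job.name)`.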