pytorch, azure-machine-learning-service, azure-ml-pipelines

Can't prepare Pytorch environment in ML Studio


My model worked in notebooks, where I just pip installed the required packages. But now I am trying to run it in a pipeline, and when I submit the pipeline as a job it gets stuck in the Preparing phase. I believe this is due to the environment not resolving correctly. I've attached my .yml files, and I'm wondering where the issue might be.

conda_dependencies.yml

name: pytorch-env-with-optuna
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy
  - pandas
  - scikit-learn
  - pip
  - pip:
      - torch
      - optuna
      - mltable
      - azure-ai-ml
      - azure-identity

component.yml

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name:
display_name:
version: 6.0.0
type: command

description: >

inputs:

outputs:

code: .
environment:
  image: mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
  conda_file: conda_dependencies.yml

command: >

I've also tried removing optuna from the dependencies list just so I can get past the preparing phase, and that didn't work either. I'm not sure what I'm doing wrong.


Solution

  • The issue you're facing is a common one when working with Azure Machine Learning's curated environments. Your job is getting stuck in the "Preparing" phase due to dependency conflicts between your conda_dependencies.yml file and the packages that are already pre-installed in the Docker image.

    The Problem: Redundant Dependencies

    The curated image you've selected, mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9, is designed to be a ready-to-use environment. As the image name indicates, it already includes:

      • Python 3.8
      • PyTorch 1.11, built against CUDA 11.3
      • common data science packages such as numpy, pandas, and scikit-learn

    When you provide a conda_file that lists these same packages again, you are asking the conda package manager to resolve a complex and conflicting set of requirements. The resolver tries to find a version of every package that satisfies both the pre-built environment and your new requests, which often leads to an endless loop or a timeout, causing your job to appear "stuck."

    The Solution: Specify Only New Packages

    To fix this, you should modify your Conda dependencies file to only include packages that are not already in the base image. Your goal is to add to the environment, not redefine it.

    Here is a corrected version of your conda_dependencies.yml that should resolve correctly:

    YAML

    name: pytorch-env-with-optuna
    channels:
      - conda-forge
    dependencies:
      - pip
      - pip:
          - optuna
          - mltable
          - azure-ai-ml
          - azure-identity
    

    By removing python, torch, numpy, pandas, and scikit-learn from the file, you eliminate the conflicts. Azure ML will use the versions already present in the curated image and simply pip install the additional libraries you need (optuna, mltable, and the Azure SDK packages).
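
    Once the job gets past the Preparing phase, it can be worth confirming at runtime which package versions the environment actually resolved. A minimal sketch you could drop at the top of your training script (standard library only; the package names are taken from your dependency list):

    ```python
    from importlib import metadata

    def report_versions(packages):
        """Return a {package: version} dict, with 'not installed' for missing packages."""
        versions = {}
        for name in packages:
            try:
                versions[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                versions[name] = "not installed"
        return versions

    if __name__ == "__main__":
        # Log what the job sees, so a silent fallback or missing pip install is obvious.
        for pkg, ver in report_versions(["torch", "optuna", "mltable", "azure-ai-ml"]).items():
            print(f"{pkg}: {ver}")
    ```

    The printed versions should match the curated image for torch and show your pip additions for the rest; "not installed" for any of them means the conda file was not applied.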

    Best Practice

    Before customizing a curated environment, it's always a good idea to check what's already included. You can find the full package list for each Azure ML curated environment in the official documentation, or on the environment's details page in ML Studio.
    This will help you create minimal dependency files that build quickly and reliably.
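
    If several pipeline components need the same setup, you can also register the image-plus-conda-file combination once as a standalone environment asset and reference it by name, so it only builds a single time. A sketch of such an asset spec (the name and version here are illustrative, not from your files):

    ```yaml
    # environment.yml — register with: az ml environment create --file environment.yml
    $schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
    name: pytorch-optuna-env
    version: "1"
    image: mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
    conda_file: conda_dependencies.yml
    ```

    Each component can then point at it with `environment: azureml:pytorch-optuna-env:1` instead of repeating the inline image/conda_file block.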