I am running into a ModuleNotFoundError for pandas while using the following code to orchestrate my Azure Machine Learning Pipeline:
import os

from azureml.core import Experiment, ScriptRunConfig
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Loading run config
print("Loading run config")
task_1_run_config = RunConfiguration.load(
    os.path.join(WORKING_DIR, 'pipeline', 'task_runconfigs', 'T01_Test_Task.yml')
)
task_1_script_run_config = ScriptRunConfig(
    source_directory=os.path.join(WORKING_DIR, 'pipeline', 'task_scripts'),
    run_config=task_1_run_config
)
task_1_py_script_step = PythonScriptStep(
    name='Task_1_Step',
    script_name=task_1_script_run_config.script,
    source_directory=task_1_script_run_config.source_directory,
    compute_target=compute_target
)
pipeline_run_config = Pipeline(workspace=workspace, steps=[task_1_py_script_step])  # , task_2
pipeline_run = Experiment(workspace, 'Test_Run_New_Pipeline').submit(pipeline_run_config)
pipeline_run.wait_for_completion()
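One way to narrow this down (a debugging sketch of mine, not part of the original script) is to print the dependencies the loaded run config actually carries before submitting, and compare them with what the step reports in the Studio UI:

# Hypothetical sanity check: inspect the environment attached to the loaded
# RunConfiguration before the pipeline is submitted. If environment.yml was
# not picked up, pandas will be missing from this output.
print(task_1_run_config.environment.name)
print(task_1_run_config.environment.python.conda_dependencies.serialize_to_string())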
The environment.yml:
name: phinmo_pipeline_env
dependencies:
  - python=3.8
  - pip:
    - pandas
    - azureml-core==1.43.0
    - azureml-sdk
    - scipy
    - scikit-learn
    - numpy
    - pyyaml==6.0
    - datetime
    - azure
channels:
  - conda-forge
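As a side check (again a sketch of mine, not part of the original setup), the same file can be loaded into an Environment object and built in the workspace, to confirm that conda can resolve the specification independently of the pipeline; the file path is an assumption:

from azureml.core import Environment

# Hypothetical sanity check: build the environment from the same YAML file
# to confirm the specification itself resolves (pandas included).
env = Environment.from_conda_specification(
    name='phinmo_pipeline_env',
    file_path='environment.yml'  # adjust to wherever the file actually lives
)
build = env.build(workspace)  # kicks off an image build in the workspace
build.wait_for_completion(show_output=True)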
The loaded RunConfiguration in T01_Test_Task.yml looks like this:
# The script to run.
script: T01_Test_Task.py
# The arguments to the script file.
arguments: [
  "--test", False,
  "--date", "2022-07-26"
]
# The name of the compute target to use for this run.
compute_target: phinmo-compute-cluster
# Framework to execute inside. Allowed values are "Python", "PySpark", "CNTK", "TensorFlow", and "PyTorch".
framework: Python
# Maximum allowed duration for the run.
maxRunDurationSeconds: 6000
# Number of nodes to use for running the job.
nodeCount: 1
# Environment details.
environment:
  # Environment name
  name: phinmo_pipeline_env
  # Environment version
  version:
  # Environment variables set for the run.
  #environmentVariables:
  #  EXAMPLE_ENV_VAR: EXAMPLE_VALUE
  # Python details
  python:
    # userManagedDependencies=true indicates that the environment will be user
    # managed; false indicates that AzureML will manage the user environment.
    userManagedDependencies: false
    # The python interpreter path
    interpreterPath: python
    # Path to the conda dependencies file to use for this run. If a project
    # contains multiple programs with different sets of dependencies, it may be
    # convenient to manage those environments with separate files.
    condaDependenciesFile: environment.yml
    # The base conda environment used for incremental environment creation.
    baseCondaEnvironment: AzureML-sklearn-0.24-ubuntu18.04-py37-cpu
  # Docker details
# History details.
history:
  # Enable history tracking -- this allows status, logs, metrics, and outputs
  # to be collected for a run.
  outputCollection: true
  # Whether to take snapshots for history.
  snapshotProject: true
  # Directories to sync with FileWatcher.
  directoriesToWatch:
    - logs
# Data reference configuration details.
dataReferences: {}
# The configuration details for data.
data: {}
# Project share datastore reference.
sourceDirectoryDataStore:
I already tried a few things, like overwriting the environment attribute of the RunConfiguration object with an environment.python.conda_dependencies object, pinning pandas to a specific version in the environment.yml, and changing the location of the environment.yml, but I am at a loss as to what else to try. The T01_Test_Task.py runs without issues on its own; putting it into a pipeline just does not seem to work.
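For reference, the conda_dependencies overwrite I tried looked roughly like this (a reconstructed sketch; the exact package list is an assumption):

from azureml.core.conda_dependencies import CondaDependencies

# Reconstructed sketch of the attempted workaround: replace the dependencies
# on the loaded run config programmatically instead of via environment.yml.
conda_deps = CondaDependencies.create(
    pip_packages=['pandas', 'azureml-core==1.43.0']  # assumed subset
)
task_1_run_config.environment.python.conda_dependencies = conda_deps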
Okay, I found the issue: I was unnecessarily using ScriptRunConfig, which overwrites the assigned environment with a default AzureML environment. I could only see this in the task description in the Azure Machine Learning Studio UI.
I was able to just remove that part, and now it works:
task_1_run_config = RunConfiguration.load(
    os.path.join(WORKING_DIR, 'pipeline', 'task_runconfigs', 'T01_Test_Task.yml')
)
task_1_py_script_step = PythonScriptStep(
    name='Task_1_Step',
    script_name='T01_Test_Task.py',
    source_directory=os.path.join(WORKING_DIR, 'pipeline', 'task_scripts'),
    runconfig=task_1_run_config,
    compute_target=compute_target
)
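The key difference is the runconfig=task_1_run_config argument: without it, the step fell back to a default environment, which is why pandas was never installed. With the runconfig passed directly to the step, the pipeline is built and submitted exactly as before:

pipeline = Pipeline(workspace=workspace, steps=[task_1_py_script_step])
pipeline_run = Experiment(workspace, 'Test_Run_New_Pipeline').submit(pipeline)
pipeline_run.wait_for_completion()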