python, amazon-web-services, amazon-sagemaker, mlops

Share code across SageMaker pipeline steps without building a custom Docker image


I am trying to create a SageMaker pipeline with multiple steps, and I have some code that I would like to share across the different steps. The following example is not exact, but a simplified version for illustration.

My folder structure looks like this:

source_scripts/
├── utils/
│   └── logger.py
├── models/
│   ├── ground_truth.py
│   └── document.py
├── processing/
│   ├── processing.py
│   └── main.py
└── training/
    ├── training.py
    └── main.py

I would like to use code from models and utils inside training.py. Since I don't know where exactly the code is mounted on the SageMaker instance, I am using a relative import:

from ..models.ground_truth import GroundTruthRow

While building the pipeline, I create a processing step and a training step:

script_processor = FrameworkProcessor()
args = script_processor.get_run_args(
    source_dir="source_scripts",
    code="processing/main.py",
)
step_process = ProcessingStep(
    processor=script_processor,
    code=args.code,
)

estimator = Estimator(
    source_dir="source_scripts",
    entry_point="training/main.py",
)
step_train = TrainingStep(
    estimator=estimator,
)

But during pipeline execution it results in the following error:

ImportError: attempted relative import with no known parent package

Any suggestions on how to share code across several SageMaker jobs in a single pipeline, without building a custom Docker image?


Solution

  • The quickest solution for this issue is to move the entry point scripts (processing/main.py, training/main.py) one level up, directly under source_scripts/, like this:

    source_scripts/
    ├── utils
    │   └── logger.py
    ├── models/
    │   ├── ground_truth.py
    │   └── document.py
    ├── processing_main.py
    └── training_main.py
    

    and avoid using relative imports with .. in the entry point scripts:

    from models.ground_truth import GroundTruthRow
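
    With this flattened layout, the step definitions from the question only need to point at the new entry point names. Here is a minimal sketch, reusing the same simplified processor/estimator setup from the question (the omitted constructor arguments are still assumed):

    args = script_processor.get_run_args(
        source_dir="source_scripts",
        code="processing_main.py",  # entry point now sits at the source_dir root
    )
    step_process = ProcessingStep(
        processor=script_processor,
        code=args.code,
    )

    estimator = Estimator(
        source_dir="source_scripts",
        entry_point="training_main.py",  # same for the training entry point
    )
    step_train = TrainingStep(
        estimator=estimator,
    )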
    

    The reason you see the ImportError is that the entry point script's __name__ variable doesn't contain package structure information, unlike other modules.

    If you print __name__ in models/ground_truth.py, you would see something like models.ground_truth, which contains the package structure. But if you print __name__ in training/main.py, you see __main__, so Python cannot figure out what .. refers to.
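
    A quick way to see this for yourself, just as an illustration (run the modules locally):

    # models/ground_truth.py
    print(__name__)     # "models.ground_truth" when imported as part of the package
    print(__package__)  # "models" -- this is what relative imports resolve against

    # training/main.py, executed directly as an entry point script
    print(__name__)     # "__main__"
    print(__package__)  # None (or ""), so there is no parent package for ".." to resolve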

    If you need to keep the current directory structure, there are more involved solutions, but moving the entry point files up one level is the simplest.
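
    For example, one such workaround (my own sketch, not something prescribed by SageMaker) is to prepend the source_scripts/ root to sys.path at the top of training/main.py and switch to absolute imports:

    # training/main.py -- keeps the nested layout
    import os
    import sys

    # Add the parent of this file's directory (i.e. source_scripts/) to sys.path
    sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

    from models.ground_truth import GroundTruthRow  # absolute import now resolves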

    Also, I recommend testing the solution in a local Python environment before running your SageMaker Pipeline.