amazon-web-services, machine-learning, containers, amazon-sagemaker, amz-sagemaker-distributed-training

Unable to run training using a custom algorithm


I am trying to run training using SageMaker training jobs and the SageMaker Python SDK; the training script relies on some custom libraries. From my understanding, because of these custom dependencies, I need to build a custom image from a Dockerfile and register it to ECR (Elastic Container Registry). The environment below is a SageMaker Studio Code Editor.

The error I get is "Failed to parse hyperparameter". See below for my setup and what I've tried as a solution.

Directory

working directory
    |----- Dockerfile
    |----- train.py
    |----- requirements.txt

Dockerfile

# Use python image as base
FROM python:3.10

# Install system dependencies
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        libpq-dev \
        gcc \
    && rm -rf /var/lib/apt/lists/*

# Set working directory in container
WORKDIR /code

# Install Python dependencies
COPY requirements.txt /code/
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install sagemaker-training

# Copies the training code inside the container
COPY train.py /opt/ml/code/train.py

# Defines train.py as script entrypoint
ENV SAGEMAKER_PROGRAM=train.py

# Set environment variables
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE

requirements.txt

simpletransformers==0.70.0
pandas==2.1.1
numpy==1.26.0
torch==2.2.1
sklearn-deap==0.3.0
sklearn-genetic-opt==0.10.1
boto3==1.33.3
sagemaker

train.py

import argparse
import os
import logging
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from simpletransformers.classification import ClassificationModel
import torch

from sagemaker.pytorch import PyTorchModel

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--test_size", type=float, default=0.2)
    parser.add_argument("--target_column", type=str, default="annotation")
    parser.add_argument("--vertical", type=str, default="some_category")
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--val", type=str, default=os.environ.get("SM_CHANNEL_VAL"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))

    args, _ = parser.parse_known_args()

    model_data = None
    role = None
    entry_point = None

    # ... (script continues)

Launching script:

import sagemaker
from sagemaker.session import TrainingInput
from sagemaker.estimator import Estimator

vertical = 'some_category'
s3_bucket = 'some_bucket'
prefix = 'classification'
instance_type = 'ml.m4.xlarge' 
print("Instance Type: {}".format(instance_type))

region = sagemaker.Session().boto_region_name
print("AWS Region: {}".format(region))

role = sagemaker.get_execution_role()
print("RoleArn: {}".format(role))

s3_output_location='s3://{}/{}/{}'.format(s3_bucket, prefix, 'classifier')
container = '############.###.###.##-####-#.amazonaws.com/some-name/ml-training:latest'
print("Image Container: {}".format(container))

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type=instance_type,
    volume_size=10,
    output_path=s3_output_location,
    sagemaker_session=sagemaker.Session()
)

estimator.set_hyperparameters(vertical=vertical,
                              s3_bucket=s3_bucket,
                              target_column='annotation',
                              test_size=0.2)

estimator.fit()

Error

Failed to parse hyperparameter 

What I've tried as a solution:

  1. This seems to be an open issue for the sagemaker-training library. The only suggestion there was to wrap the hyperparameters in a function one user proposed, but that no longer appears to work (I got this error: TypeError: Estimator.set_hyperparameters() takes 1 positional argument but 2 were given).
  2. There are some suggestions here, but I could not figure out how to apply them to my situation.
  3. This SO post claims argparse is not compatible with SageMaker (even though all the official AWS SageMaker documentation uses argparse). Their suggested solution is unclear to me.

Solution

  • There are several topics to address here. First of all, you don't need to create a container just to include additional dependencies.

    Using additional dependencies with an Estimator

    You can add dependencies to an Estimator by providing source_dir and including a requirements.txt file in the referenced source directory.

    From the Estimator API documentation:

    source_dir: The absolute, relative, or S3 URI path to a directory with any other training source code dependencies aside from the entry point file. If source_dir is an S3 URI, it must point to a tar.gz file. Structure within this directory is preserved when training on Amazon SageMaker.

    The most straightforward way to include a source_dir is to have it locally next to your notebook.

    |----- example-notebook.ipynb
    |----- src
            |----- train.py
            |----- requirements.txt
    

    You can then configure your estimator to use the source directory with the following configuration:

    estimator = Estimator(
        [...]
        entry_point="train.py",
        source_dir="src",
        [...]
    )
    

    If source_dir is specified, then entry_point must point to a file located at the root of source_dir. The training job will automatically install dependencies from the provided requirements.txt.
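
    For reference, here is a filled-in sketch of that configuration, reusing the names from your launch script (container, role, instance_type, s3_output_location) and the src layout shown above. Note that script mode with a custom image requires sagemaker-training to be installed in the image, which your Dockerfile already does:

    estimator = Estimator(
        image_uri=container,
        role=role,
        instance_count=1,
        instance_type=instance_type,
        volume_size=10,
        output_path=s3_output_location,
        sagemaker_session=sagemaker.Session(),
        entry_point="train.py",  # must sit at the root of source_dir
        source_dir="src",        # its requirements.txt is installed when the job starts
    )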

    Since you're using Scikit-Learn, you could also use the SKLearn Estimator, which already bundles several dependencies and provides a simplified interface compared to the general Estimator.
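
    As a sketch of that route (framework_version below is an assumption; use whichever published SKLearn container version suits you):

    from sagemaker.sklearn.estimator import SKLearn

    sklearn_estimator = SKLearn(
        entry_point="train.py",
        source_dir="src",             # requirements.txt pulls in simpletransformers, torch, etc.
        framework_version="1.2-1",    # assumed version; check what's available in your region
        instance_type="ml.m4.xlarge",
        instance_count=1,
        role=role,
        output_path=s3_output_location,
    )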

    Fixing Hyperparameter Parsing

    If you'd like to keep your code as is, you could adapt it as follows:

    import json
    
    # JSON encode hyperparameters
    def json_encode_hyperparameters(hyperparameters):
        return {str(k): json.dumps(v) for (k, v) in hyperparameters.items()}
    
    
    hyperparameters = json_encode_hyperparameters({
        "vertical": vertical,
        "s3_bucket": s3_bucket,
        "target_column": target_column, 
        "test_size": 0.2
    })
    
    estimator = Estimator(
        image_uri=container,
        role=role,
        instance_count=1,
        instance_type=instance_type,
        volume_size=10,
        output_path=s3_output_location,
        sagemaker_session=sagemaker.Session(),
        hyperparameters=hyperparameters
    )
    

    set_hyperparameters expects its input as keyword arguments, while the hyperparameters constructor argument accepts a dict. That's why passing the JSON-encoded dict to set_hyperparameters as a positional argument raises the TypeError you saw: pass the dict to the constructor instead, or unpack it into keyword arguments, as sketched below.
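
    For illustration, either of the following works; both attach hyperparameters without triggering the TypeError:

    # Unpack the (JSON-encoded) dict into keyword arguments
    estimator.set_hyperparameters(**hyperparameters)

    # Or spell out plain kwargs, as in your original launch script
    estimator.set_hyperparameters(vertical=vertical,
                                  s3_bucket=s3_bucket,
                                  target_column='annotation',
                                  test_size=0.2)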