Tags: python, tensorflow, amazon-sagemaker

SageMaker pipeline endpoint failing to deploy - CannotStartContainerError


I have a SageMaker pipeline that looks as follows (with parameters and other variables omitted):

# 1. Model training step
estimator = TensorFlow(
    entry_point="train.py",
    source_dir=src_dir,
    role=role,
    instance_count=1,
    instance_type="ml.m4.4xlarge",
    framework_version="2.1",
    py_version="py3",
    base_job_name="quantitative-scores-training",
    output_path=s3_training_output_file,
    code_location=f"{base_dir}/code/"
)

training_inputs = {
    'train': TrainingInput(
        s3_data=s3_training_data_input_file,
        content_type='text/csv',
        input_mode='FastFile'
    )
}

training_step = TrainingStep(
    name='Train',
    estimator=estimator,
    inputs=training_inputs,
)

# 2. Create model step
model = Model(
    entry_point='inference.py',
    source_dir=src_dir,
    model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
    role=role,
    sagemaker_session=sagemaker_session,
    image_uri=estimator.training_image_uri(),
)

create_model_step = ModelStep(
    name="ModelStep",
    step_args=model.create(
        instance_type='ml.m4.4xlarge'
    ),
)

# 3. Deploy model to endpoint step
deploy_model_lambda_function = Lambda(
    function_name="sagemaker-deploy-quant-score",
    execution_role_arn=create_sagemaker_lambda_role("deploy-model-lambda-role"),
    script="/home/ec2-user/SageMaker/my_path/src/util/deploy_model_lambda.py",
    handler="deploy_model_lambda.lambda_handler",
)

deploy_model_step = LambdaStep(
    name="DeployModelStep",
    lambda_func=deploy_model_lambda_function,
    inputs={
        "model_name": create_model_step.properties.ModelName,
        "endpoint_config_name": "quantitative-scoring-pipeline-config",
        "endpoint_name": endpoint_name,
        "endpoint_instance_type": "ml.m4.xlarge",
    },
)
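For context, the deploy script itself isn't shown above, so here is a minimal sketch of what a handler matching these `LambdaStep` inputs might look like. The function body is an assumption based on the inputs, not my actual `deploy_model_lambda.py`; the `sm_client` parameter is an addition so the sketch can be exercised without AWS credentials.

```python
def lambda_handler(event, context, sm_client=None):
    """Hypothetical deploy handler matching the LambdaStep inputs above.

    `sm_client` is injectable for testing; in Lambda it defaults to boto3.
    """
    if sm_client is None:
        import boto3  # imported lazily so the sketch runs without boto3 in tests
        sm_client = boto3.client("sagemaker")

    config_name = event["endpoint_config_name"]
    endpoint_name = event["endpoint_name"]

    # Note: CreateEndpointConfig fails if the config name already exists;
    # production code would delete the old config or use unique names.
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": event["model_name"],
            "InstanceType": event["endpoint_instance_type"],
            "InitialInstanceCount": 1,
        }],
    )

    # Update the endpoint if it already exists, otherwise create it.
    existing = sm_client.list_endpoints(NameContains=endpoint_name)["Endpoints"]
    if any(e["EndpointName"] == endpoint_name for e in existing):
        sm_client.update_endpoint(
            EndpointName=endpoint_name, EndpointConfigName=config_name
        )
    else:
        sm_client.create_endpoint(
            EndpointName=endpoint_name, EndpointConfigName=config_name
        )

    return {"statusCode": 200, "endpoint_name": endpoint_name}
```

Note that the Lambda only submits the CreateEndpoint call, which is why it reports success even though the endpoint later fails to come in service.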

# Connect pipeline
pipe = Pipeline(
    name="QuantitativeScoringPipeline",
    steps=[
        training_step,
        create_model_step,
        deploy_model_step
    ],
    parameters=[
        # I omitted these definitions above
        s3_training_data_input_file,
        s3_training_output_file,
        endpoint_name
    ],
)
pipe.upsert(role_arn=role)
execution = pipe.start()

However, when it comes to the Lambda deploying the endpoint, the Lambda itself succeeds, but the creation of the endpoint always fails some time later. The container is never spun up, so there are no logs in CloudWatch; all I get is the message: `CannotStartContainerError. Please ensure the model container for variant AllTraffic starts correctly when invoked with 'docker run <image> serve'.` Clearly I'm not using a custom container, though.

Curiously, if I create/update the endpoint using the SageMaker SDK as below, with the same model S3 URI, it works absolutely fine. This is the exact same model that fails above.

model = TensorFlowModel(
    entry_point='inference.py',
    source_dir='src',
    model_data="s3://sagemaker-eu-west-1-558091818291/tensorflow-training-2024-04-25-12-11-21-401/pipelines-dgstz6rrp8u9-ModelStep-RepackMode-P5O95TSntC/output/model.tar.gz",
    role=role,
    framework_version="2.1",
)
predictor = model.deploy(instance_type='ml.m4.xlarge', initial_instance_count=1, endpoint_name=endpoint_name)

This second approach creates a new model tarball, however. I have inspected the contents of both tarballs, and in each the inference code and model data look identical. I'm really stumped as to why the pipeline fails to update the endpoint while this succeeds. All I can think of is that here I pass a framework version instead of a specific image URI, but I'm not sure how to get around that in the pipeline.
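Since the tarballs are identical, one thing worth comparing is the container image each model record points at, via DescribeModel. A small diagnostic sketch (the model names in the example are placeholders, and `sm_client` is injectable so the sketch can run without AWS credentials):

```python
def container_images(model_names, sm_client=None):
    """Return {model name: primary container image URI} via DescribeModel.

    Diagnostic sketch; pass the names of the pipeline-created model and the
    SDK-created model to see which container image each one would serve from.
    """
    if sm_client is None:
        import boto3  # lazy import; real use needs AWS credentials
        sm_client = boto3.client("sagemaker")
    return {
        name: sm_client.describe_model(ModelName=name)["PrimaryContainer"]["Image"]
        for name in model_names
    }
```

If the pipeline model's image is a training repository (e.g. a `tensorflow-training:...` URI) while the working model's image is a `tensorflow-inference:...` URI, that mismatch explains the error: the training image has no `serve` entrypoint, so `docker run <image> serve` fails exactly as the CannotStartContainerError message describes.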


Solution

  • The problem was solved by replacing the `Model` object with a `TensorFlowModel`. This let me stop specifying an image URI and simply pass the framework version to use.

    I.e. this:

    model = Model(
        entry_point='inference.py',
        source_dir=src_dir,
        model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
        role=role,
        sagemaker_session=sagemaker_session,
        image_uri=estimator.training_image_uri(),
    )
    

    Became this:

    model = TensorFlowModel(
        entry_point='inference.py',
        source_dir=src_dir,
        framework_version="2.1",
        model_data=training_step.properties.ModelArtifacts.S3ModelArtifacts,
        sagemaker_session=sagemaker_session,
        role=role
    )
    

    Despite coming from the SageMaker examples, `estimator.training_image_uri()` is not appropriate here: it points at the TensorFlow training image, which has no serving stack, so `docker run <image> serve` fails, which is exactly what the CannotStartContainerError reports. `TensorFlowModel` instead resolves the proper TensorFlow Serving inference image from `framework_version` 🤷.