Tags: huggingface-transformers, amazon-sagemaker, endpoint, huggingface, amazon-sagemaker-studio

Received server error (500) while deploying HuggingFace model on SageMaker


I've successfully fine-tuned a sentence-transformers model (all-MiniLM-L12-v2) on our data in SageMaker Studio, and the model was saved to S3 as a model.tar.gz.

I want to deploy this model for inference (all code snippets are included below). According to the HuggingFace docs, these types of models require a custom inference module. So I downloaded and unpacked the model.tar.gz that training produced, followed the tutorial to add code/inference.py, and pushed it back to S3 as a new model.tar.gz.
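For completeness, the repackaging step can be done roughly like this in Python (a sketch; the local paths and the S3 bucket/key below are placeholders, not the actual values used):

import tarfile
import boto3

# 1. Unpack the trained model.tar.gz locally
with tarfile.open("model.tar.gz", "r:gz") as tar:
    tar.extractall("models/model")

# 2. After writing models/model/code/inference.py, repack everything.
#    arcname="." keeps the files at the root of the archive, as SageMaker expects.
with tarfile.open("model_with_code.tar.gz", "w:gz") as tar:
    tar.add("models/model", arcname=".")

# 3. Push the new archive back to S3 (bucket and key are placeholders)
boto3.client("s3").upload_file("model_with_code.tar.gz", "my-bucket", "models/model.tar.gz")
model_path = "s3://my-bucket/models/model.tar.gz"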

The endpoint is created successfully, but as soon as I call predictor.predict() it crashes with the following error:

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary with message "{
  "code": 500,
  "type": "InternalServerException",
  "message": "Worker died."
}

Looking in CloudWatch, I see a lot of info messages where the instance seems to set up successfully, then I get this warning message:

2024-07-30T13:19:09,702 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.

Here are the relevant code snippets:

Endpoint creation:

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker import get_execution_role, image_uris

role            = get_execution_role()
estimator_image = image_uris.retrieve(framework='pytorch', region='eu-west-1', version='2.0.0', py_version='py310', image_scope='inference', instance_type='ml.g5.4xlarge')
sm_model_ref    = model_path  # S3 URI of the repackaged model.tar.gz

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data    = sm_model_ref,
    role          = role,                                                     
    image_uri     = estimator_image,
)
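For reference, the deploy and predict calls (not shown above) look roughly like this; the instance type and the example payload are assumptions based on the other snippets:

predictor = huggingface_model.deploy(
    initial_instance_count = 1,
    instance_type          = 'ml.g5.4xlarge',   # assumed: same instance type passed to image_uris.retrieve
)

# payload keys match what inference.py expects ("premise" / "hypothesis")
result = predictor.predict({
    "premise":    ["A man is eating food."],
    "hypothesis": ["A man is eating a meal."],
})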

The custom inference.py file and its location in the model.tar.gz:

%%writefile models/model/code/inference.py

import torch

# Custom predict_fn to override the container's default prediction behaviour
def predict_fn(data, model):

    # create sentences pair
    sentences1 = data["premise"]
    sentences2 = data["hypothesis"]
 
    # Compute token embeddings
    with torch.no_grad():
        embeddings1 = model.encode(sentences1)
        embeddings2 = model.encode(sentences2)
        
        # Compute cosine similarities        
        similarities = model.similarity(embeddings1, embeddings2)
 
    return similarities

And its location:

model.tar.gz
 |__ 1_Pooling
 |__ 2_Normalize
 |__ checkpoint-8300
 |__ checkpoint-8334
 |__ code
 |    |__ inference.py
 |__ config_sentence_transformers.json
 |__ config.json
 |__ model.safetensors
 |__ modules.json
 |__ README.md
 |__ sentence_bert_config.json
 |__ special_tokens_map.json
 |__ tokenizer_config.json
 |__ tokenizer.json
 |__ vocab.txt
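Note that predict_fn only works as written if the model argument it receives is a SentenceTransformer (so that encode and similarity exist); if the container's default model loader does not return one, a custom model_fn is typically needed as well. A minimal sketch, assuming sentence-transformers is installed in the container (e.g. via code/requirements.txt):

from sentence_transformers import SentenceTransformer

def model_fn(model_dir):
    # model_dir is the directory where SageMaker unpacked model.tar.gz
    return SentenceTransformer(model_dir)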

Solution

  • It seems most of the documentation on this topic, including the HuggingFace docs, is out of date. You no longer need to repackage the model.tar.gz with code/inference.py.

    All I had to do was pass the S3 path of my initial model.tar.gz (straight from training) as model_data, and point entry_point and source_dir at the local location of inference.py and requirements.txt.

    huggingface_model = HuggingFaceModel(
        entry_point   = 'inference.py',      # custom inference script
        source_dir    = 'code',              # local directory containing inference.py and requirements.txt
        model_data    = sm_model_ref,        # S3 URI of the original model.tar.gz from training
        role          = role,                # IAM role with permissions to create an endpoint
        image_uri     = estimator_image,
    )
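    With this approach the code/ directory is just a local folder next to the notebook (the SDK packages and uploads it for you), roughly:

    code/
     |__ inference.py
     |__ requirements.txt   # e.g. sentence-transformers, so the custom inference code can import it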