I've successfully fine-tuned a sentence-transformers model (all-MiniLM-L12-v2) on our data in SageMaker Studio, and the model was saved in S3 as a model.tar.gz.
I want to deploy this model for inference (all code snippets are included below). According to the Hugging Face docs, this type of model requires a custom inference module, so I downloaded and unpacked the model.tar.gz that training produced, followed the tutorial to add code/inference.py, and pushed it back to S3 as a new model.tar.gz.
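The repackaging step itself was roughly the following (a minimal sketch; the bucket name, keys, and local paths are placeholders, not the exact ones I used):

import tarfile
import boto3

# Unpack the training artifact locally
with tarfile.open("model.tar.gz", "r:gz") as tar:
    tar.extractall("models/model")

# ... write models/model/code/inference.py (see below) ...

# Repack everything, including code/inference.py, into a new archive
with tarfile.open("model_repacked.tar.gz", "w:gz") as tar:
    tar.add("models/model", arcname=".")

# Upload the new archive to S3 (placeholder bucket/key)
boto3.client("s3").upload_file("model_repacked.tar.gz", "my-bucket", "models/model_repacked.tar.gz")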
The endpoint is created successfully, but as soon as I call predictor.predict() it crashes with the following error:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary with message "{
"code": 500,
"type": "InternalServerException",
"message": "Worker died."
}
Looking in CloudWatch, I see a lot of info messages where the instance seems to be setting up successfully, and then I get this warning message:
2024-07-30T13:19:09,702 [WARN ] W-9000-model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: model, error: Worker died.
Here are the relevant code snippets:
Endpoint creation:
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker import get_execution_role, image_uris

role = get_execution_role()

# Retrieve the PyTorch inference container image
estimator_image = image_uris.retrieve(
    framework='pytorch',
    region='eu-west-1',
    version='2.0.0',
    py_version='py310',
    image_scope='inference',
    instance_type='ml.g5.4xlarge',
)

# S3 path of the repacked model.tar.gz
sm_model_ref = model_path

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data = sm_model_ref,
    role = role,
    image_uri = estimator_image,
)
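The model is then deployed with something like the following (a minimal sketch; the instance type simply mirrors the one used for the image lookup above):

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.4xlarge',
)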
The custom inference.py file and its location in the model.tar.gz:
%%writefile models/model/code/inference.py
import torch

# Custom inference script to override the default predict method
def predict_fn(data, model):
    # create the sentence pairs
    sentences1 = data["premise"]
    sentences2 = data["hypothesis"]

    # Compute token embeddings
    with torch.no_grad():
        embeddings1 = model.encode(sentences1)
        embeddings2 = model.encode(sentences2)

    # Compute cosine similarities
    similarities = model.similarity(embeddings1, embeddings2)
    return similarities
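The call that then triggers the error above looks roughly like this (the example sentences are placeholders, and the predictor's default JSON serializer is assumed):

similarities = predictor.predict({
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "Someone is performing music.",
})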
And the layout of the repacked model.tar.gz:
model.tar.gz
|__ 1_Pooling
|__ 2_Normalize
|__ checkpoint-8300
|__ checkpoint-8334
|__ code
|   |__ inference.py
|__ config_sentence_transformers.json
|__ config.json
|__ model.safetensors
|__ modules.json
|__ README.md
|__ sentence_bert_config.json
|__ special_tokens_map.json
|__ tokenizer_config.json
|__ tokenizer.json
|__ vocab.txt
It seems most of the documentation on the topic, including the Hugging Face docs, was out of date. You no longer need to repackage the model.tar.gz with code/inference.py. All I had to do was pass the S3 path of my original model.tar.gz from training to the HuggingFaceModel as model_data, and pass the location of inference.py and requirements.txt via entry_point and source_dir.
huggingface_model = HuggingFaceModel(
    entry_point = 'inference.py',
    source_dir = 'code',
    model_data = sm_model_ref,   # S3 path of the original model.tar.gz from training
    role = role,                 # IAM role with permissions to create an endpoint
    image_uri = estimator_image,
)
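For clarity, source_dir = 'code' points at a plain local directory next to the notebook (it is no longer inside the model.tar.gz), laid out simply as:

code
|__ inference.py
|__ requirements.txt

The deploy and predict calls themselves stay the same as before.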