
Deploy AWS SageMaker endpoint for Hugging Face embedding model

I would like to deploy a Hugging Face text embedding model as an endpoint via AWS SageMaker.

Here is my code so far:

import sagemaker
from sagemaker.huggingface.model import HuggingFaceModel

# sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hub model configuration
hub = {
  'HF_MODEL_ID':'sentence-transformers/all-MiniLM-L12-v2', # model_id from the Hugging Face Hub
  'HF_TASK':'feature-extraction' # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub, # configuration for loading model from Hub
    role=role, # iam role with permissions to create an Endpoint
    transformers_version="4.6", # transformers version used
    pytorch_version="1.7", # pytorch version used
    py_version="py36", # python version used (assumed; cut off in the original post)
)

# deploy the model to a real-time endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1, # assumed; the arguments were cut off in the original post
    instance_type="ml.m5.xlarge", # assumed instance type
)

data = {
    "inputs": ["This is an example sentence", "Each sentence is converted"]
}
result = predictor.predict(data)

While this does deploy an endpoint successfully, it does not behave the way I expect. For each string in the input list, I expect to get a 1x384 list of floats as output. Instead, I get a 7x384 list for each sentence. Did I maybe use the wrong pipeline?


  • There are two ways to deploy Hugging Face models as SageMaker endpoints:

    1. The way you have done it, passing env=hub to the HuggingFaceModel class. This is a quick way to get inferences from the model without any custom preprocessing: you send a request and get back the raw output of the pipeline. For the feature-extraction task, that raw output is one 384-dimensional vector per token rather than one per sentence, which is why you see a 7x384 result instead of 1x384.
    2. If you want to do more with each request sent to the model, i.e. preprocess the inputs, change the behaviour of the model, and/or postprocess the output, you will need to use custom scripts. Sample HuggingFaceModel class parameters are:
    huggingface_model = HuggingFaceModel(
       model_data=s3_location,         # path to your model and script
       role=role,                      # iam role with permissions to create an Endpoint
       transformers_version="4.37.0",  # transformers version used
       pytorch_version="2.1.0",        # pytorch version used
       py_version='py310',             # python version used
    )

    This is the complete reference you need:

    Additional Info: The handler file that will run with each request your endpoint receives:
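To make the custom-script route more concrete, here is a rough sketch of what such a handler file might look like. It is not taken from the linked reference. It assumes the file is saved as code/inference.py inside the model.tar.gz that model_data points to, and that it overrides the model_fn and predict_fn hooks of the Hugging Face inference toolkit. Mean pooling over the attention mask is my assumption about how to get one vector per sentence (it is the usual choice for sentence-transformers models), and the mean_pool helper and the "vectors" response key are made-up names for illustration. The final normalization step that some models apply is left out.

```python
# inference.py: hypothetical handler sketch, assumed to live under code/ in model.tar.gz


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of one sentence, skipping padding positions."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def model_fn(model_dir):
    """Load the model and tokenizer once, when the endpoint container starts."""
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModel.from_pretrained(model_dir)
    model.eval()
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    """Return one pooled vector per input sentence instead of one per token."""
    import torch  # imported lazily

    model, tokenizer = model_and_tokenizer
    sentences = data["inputs"]
    encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)

    token_embeddings = output.last_hidden_state.tolist()
    masks = encoded["attention_mask"].tolist()
    # "vectors" is a hypothetical response key, not a toolkit convention
    return {"vectors": [mean_pool(t, m) for t, m in zip(token_embeddings, masks)]}
```

With something like this in place, predictor.predict(data) should return one list of floats per input sentence, whose length is the model's hidden size (384 for this model).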