Based on the documentation here, https://github.com/aws/amazon-sagemaker-examples/blob/main/inference/nlp/realtime/triton/multi-model/bert_trition-backend/bert_pytorch_trt_backend_MME.ipynb, I have set up a multi-model endpoint using a GPU instance type and the NVIDIA Triton container. In the setup from the link, the model is invoked by passing tokens instead of passing the text directly. Is it possible to pass text directly to the model, given that the input type is set to the string data type in config.pbtxt (sample below)? I am looking for any examples around this.
config.pbtxt

name: "..."
platform: "..."
max_batch_size: 0
input [
  {
    name: "INPUT_0"
    data_type: TYPE_STRING
    ...
  }
]
output [
  {
    name: "OUTPUT_1"
    ...
  }
]
multi-model invocation

import json

import boto3

client = boto3.client("sagemaker-runtime")

text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
input_ids, attention_mask = tokenize_text(text_triton)  # tokenizer helper from the notebook

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

# endpoint_name and i are defined earlier in the notebook
response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel=f"bert-{i}.tar.gz",
)
If you want, you could make use of an ensemble model in Triton, where the first model tokenizes the text and passes the tokens on to the BERT model. Take a look at this link that describes the strategy: https://blog.ml6.eu/triton-ensemble-model-for-deploying-transformers-into-production-c0f727c012e3
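As a rough sketch of that strategy (all names here are placeholders: a "tokenizer" model on Triton's Python backend chained to a "bert" TensorRT model, and "bert-base-uncased" standing in for whatever tokenizer your engine was built with; this is a simplified version of what the linked post walks through), the tokenizer's model.py could look like:

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        # Load the same tokenizer the TensorRT engine was built for
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def execute(self, requests):
        responses = []
        for request in requests:
            # TYPE_STRING input arrives as a numpy array of bytes objects
            text_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT_0")
            texts = [t.decode("utf-8") for t in text_tensor.as_numpy().reshape(-1)]
            encoded = self.tokenizer(
                texts,
                padding="max_length",
                truncation=True,
                max_length=128,
                return_tensors="np",
            )
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[
                        pb_utils.Tensor("token_ids", encoded["input_ids"].astype(np.int32)),
                        pb_utils.Tensor("attn_mask", encoded["attention_mask"].astype(np.int32)),
                    ]
                )
            )
        return responses

An ensemble config.pbtxt would then expose the string input and wire the tokenizer's outputs into the BERT model's inputs, along these lines (output dims are a placeholder; adjust to your engine's actual output shape):

name: "bert_ensemble"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "INPUT_0"
    data_type: TYPE_STRING
    dims: [1]
  }
]
output [
  {
    name: "OUTPUT_1"
    data_type: TYPE_FP32
    dims: [-1]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "INPUT_0" value: "INPUT_0" }
      output_map { key: "token_ids" value: "tokens" }
      output_map { key: "attn_mask" value: "mask" }
    },
    {
      model_name: "bert"
      model_version: -1
      input_map { key: "token_ids" value: "tokens" }
      input_map { key: "attn_mask" value: "mask" }
      output_map { key: "OUTPUT_1" value: "OUTPUT_1" }
    }
  ]
}

With this in place, clients send the raw string to the ensemble (as in the BYTES payload sketch above) and tokenization happens server-side.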