I have been looking at examples and ran into this one from AWS: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-triton/ensemble/sentence-transformer-trt/examples/ensemble_hf/bert-trt/config.pbtxt. Based on this example, we need to define the inputs and outputs and their data types. The example is not clear on what dims (presumably dimensions) represents. Is it the number of elements in an array of inputs? Also, what is max_batch_size? And at the bottom we have to specify an instance group, where kind is set to KIND_GPU. I assume that if we are using a CPU-based instance we can change this to KIND_CPU. Do we also need to specify how many CPUs we want to use?
name: "bert-trt"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [128]
  }...
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [128, 384]
  }...
]
instance_group [
  {
    kind: KIND_GPU
  }
]
I have tested the given example, but if we want to use a text-based input and do the tokenization on the server, what would this config.pbtxt file look like?
The max_batch_size entry specifies the maximum batch size to use with Triton dynamic batching. Triton will combine multiple requests into a single batch in order to increase throughput.
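Dynamic batching itself is turned on by adding a dynamic_batching block to the model config. A minimal sketch based on the example above (the preferred batch sizes and queue delay are just illustrative values, not taken from the AWS example):
name: "bert-trt"
platform: "tensorrt_plan"
max_batch_size: 16
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}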
If you set max_batch_size to zero, you need to define the batch dimension yourself in the config.pbtxt, i.e.
name: "bert-trt"
platform: "tensorrt_plan"
max_batch_size: 0
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [-1, 128]
  }...
]
In this case, -1 means that the batch dimension is variable (you can also set the sequence dimension to -1).
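For example, if the underlying TensorRT engine was also built with a dynamic sequence length, both dimensions can be left variable. A sketch, only valid if the engine actually supports dynamic shapes on that axis:
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [-1, -1]
  }
]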
In order to tokenize on the server, you need to create a Python backend and either an ensemble model, or use Python for both the tokenizer and the model. The tokenizer's config.pbtxt looks something like this (a sketch of the matching model.py follows below):
name: "tokenizer"
max_batch_size: 0
backend: "python"
input [
  {
    name: "text"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
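A minimal sketch of the matching model.py for that Python backend could look like the following. The tokenizer checkpoint, the output tensor names (token_ids, attn_mask) and the max_length of 128 are assumptions chosen to line up with the bert-trt config above, not something prescribed by Triton:
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        # Assumed checkpoint; use whatever tokenizer matches your TensorRT model
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    def execute(self, requests):
        responses = []
        for request in requests:
            # TYPE_STRING tensors arrive as numpy arrays of bytes objects
            text = pb_utils.get_input_tensor_by_name(request, "text").as_numpy()
            texts = [t.decode("utf-8") for t in text.reshape(-1)]
            enc = self.tokenizer(texts, padding="max_length", truncation=True, max_length=128)
            token_ids = np.array(enc["input_ids"], dtype=np.int32)
            attn_mask = np.array(enc["attention_mask"], dtype=np.int32)
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("token_ids", token_ids),
                pb_utils.Tensor("attn_mask", attn_mask),
            ]))
        return responses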
I found triton-ensemble-model-for-deploying-transformers-into-production a good resource.
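If you go the ensemble route, the top-level ensemble model gets its own config.pbtxt that wires the tokenizer's outputs into the TensorRT model. A rough sketch, where the tensor names and the output shape are assumptions consistent with the configs above:
name: "ensemble"
platform: "ensemble"
max_batch_size: 0
input [
  {
    name: "text"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1, 128, 384 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "tokenizer"
      model_version: -1
      input_map { key: "text" value: "text" }
      output_map { key: "token_ids" value: "token_ids" }
      output_map { key: "attn_mask" value: "attn_mask" }
    },
    {
      model_name: "bert-trt"
      model_version: -1
      input_map { key: "token_ids" value: "token_ids" }
      input_map { key: "attn_mask" value: "attn_mask" }
      output_map { key: "output" value: "output" }
    }
  ]
}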