Tags: nvidia, amazon-sagemaker, inference, tritonserver, triton

How to set up a configuration file for SageMaker Triton inference?


I have been looking at examples and ran into this one from AWS: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-triton/ensemble/sentence-transformer-trt/examples/ensemble_hf/bert-trt/config.pbtxt. Based on this example, we need to define the inputs and outputs and the data types for each of them. The example is not clear on what dims (presumably dimensions) represents: is it the number of elements in an array of inputs? Also, what is max_batch_size? And at the bottom we have to specify an instance group, where kind is set to KIND_GPU; I assume that if we are using a CPU-based instance we can change this to CPU. Do we need to specify how many CPUs we want to use?

name: "bert-trt"
platform: "tensorrt_plan"
max_batch_size: 16
input [
  {
    name: "token_ids"
    data_type: TYPE_INT32
    dims: [128]
  }...
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [128, 384]
  }...
]
instance_group [
    {
      kind: KIND_GPU
    }
  ]

I have tested the given example, but if we want to use text-based input and do tokenization on the server, what does the config.pbtxt file look like?


Solution

  • The max_batch_size entry specifies the maximum batch size that the model supports and that Triton can exploit for dynamic batching. With dynamic batching enabled, Triton combines multiple incoming requests into a single batch to increase throughput.

    If you set max_batch_size to zero, you need to define the batch dimension yourself in the dims of config.pbtxt, e.g.

    name: "bert-trt"
    platform: "tensorrt_plan"
    max_batch_size: 0
    input [
      {
        name: "token_ids"
        data_type: TYPE_INT32
        dims: [-1, 128]
      }...
    ]
    

    In this case the -1 means that the batch dimension is variable (you can also set the sequence dimension to -1 if the sequence length varies).
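
    With a non-zero max_batch_size, the batch dimension is left out of dims and you can opt in to dynamic batching explicitly. A sketch of that variant (the dynamic_batching settings are illustrative values, not something taken from the AWS example):

    name: "bert-trt"
    platform: "tensorrt_plan"
    max_batch_size: 16
    # Let Triton merge requests; wait up to 100 microseconds to fill a batch
    dynamic_batching {
      preferred_batch_size: [ 8, 16 ]
      max_queue_delay_microseconds: 100
    }
    input [
      {
        name: "token_ids"
        data_type: TYPE_INT32
        dims: [ 128 ]
      }
    ]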

    In order to tokenize on the server, you need to create a Python backend model for the tokenizer, and then either combine it with the TensorRT model in an ensemble or implement both the tokenizer and the model in Python. The tokenizer's config.pbtxt looks roughly like this (its output section has to mirror the input of the downstream model, so the token_ids name and shape below simply follow the question; sketches of the accompanying model.py and an ensemble config follow further down):

    name: "tokenizer"
    max_batch_size: 0
    backend: "python"
    
    input [
        {
            name: "text"
            data_type: TYPE_STRING
            dims: [ -1 ]
        }
    ]
    
    # Output fed to the downstream model; the name and shape must match what
    # the bert-trt config expects (add attention_mask etc. if your model needs it)
    output [
        {
            name: "token_ids"
            data_type: TYPE_INT32
            dims: [ -1, 128 ]
        }
    ]
    

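    The actual tokenization happens in a model.py placed next to that config.pbtxt. A minimal sketch of the Python backend, assuming a Hugging Face tokenizer, the tensor names used above, and a fixed sequence length of 128 (the tokenizer name and length are illustrative and have to match your own config and the downstream model):

    import numpy as np
    import triton_python_backend_utils as pb_utils
    from transformers import AutoTokenizer


    class TritonPythonModel:
        def initialize(self, args):
            # Load the tokenizer once per model instance
            self.tokenizer = AutoTokenizer.from_pretrained(
                "sentence-transformers/all-MiniLM-L6-v2"  # assumption: use your own model
            )

        def execute(self, requests):
            responses = []
            for request in requests:
                # "text" matches the input name declared in config.pbtxt
                text_tensor = pb_utils.get_input_tensor_by_name(request, "text")
                # TYPE_STRING tensors arrive as numpy arrays of bytes objects
                texts = [
                    t.decode("utf-8") if isinstance(t, bytes) else str(t)
                    for t in text_tensor.as_numpy().reshape(-1)
                ]

                encoded = self.tokenizer(
                    texts, padding="max_length", truncation=True, max_length=128
                )
                token_ids = np.array(encoded["input_ids"], dtype=np.int32)

                # "token_ids" matches the output name declared in config.pbtxt
                out_tensor = pb_utils.Tensor("token_ids", token_ids)
                responses.append(
                    pb_utils.InferenceResponse(output_tensors=[out_tensor])
                )
            return responses
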
    I found triton-ensemble-model-for-deploying-transformers-into-production to be a good resource.
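
    If you go the ensemble route, a third model (its version directory can stay empty) wires the two together. A sketch of its config.pbtxt, assuming the tensor names above and the output shape from the question (model names, versions and shapes are illustrative and must match your actual models):

    name: "ensemble"
    platform: "ensemble"
    max_batch_size: 0
    input [
      {
        name: "text"
        data_type: TYPE_STRING
        dims: [ -1 ]
      }
    ]
    output [
      {
        name: "output"
        data_type: TYPE_FP32
        # must match the shape the bert-trt config declares for its output
        dims: [ -1, 128, 384 ]
      }
    ]
    ensemble_scheduling {
      step [
        {
          model_name: "tokenizer"
          model_version: -1
          input_map { key: "text" value: "text" }
          output_map { key: "token_ids" value: "_token_ids" }
        },
        {
          model_name: "bert-trt"
          model_version: -1
          input_map { key: "token_ids" value: "_token_ids" }
          output_map { key: "output" value: "output" }
        }
      ]
    }

    The client then sends a single text tensor to the ensemble model and gets the embeddings back, with tokenization handled entirely on the server.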