python, numpy, torch, bytesio, torchserve

TorchServe: How to convert bytes output to tensors


I have a model that is served using TorchServe, and I communicate with the TorchServe server over gRPC. The postprocess method of the custom handler, the final step in the pipeline, returns a list, which is converted to bytes for transfer over the network.

The postprocess method:

def postprocess(self, data):
    # data type - torch.Tensor
    # data shape - [1, 17, 80, 64] and data dtype - torch.float32
    return data.tolist()

The main issue is on the client side, where converting the bytes received from TorchServe into a torch.Tensor is done inefficiently via ast.literal_eval:

from ast import literal_eval
import torch

# This takes 0.3 seconds
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
# This takes 0.84 seconds
predictions = torch.as_tensor(literal_eval(
    response.prediction.decode('utf-8')))
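
For completeness, inference_stub is the TorchServe gRPC stub, set up roughly like this (assuming the inference_pb2 / inference_pb2_grpc modules generated from TorchServe's inference.proto and the default gRPC inference port 7070):

import grpc
import inference_pb2
import inference_pb2_grpc
# channel to the TorchServe gRPC inference endpoint
channel = grpc.insecure_channel("localhost:7070")
inference_stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)
# input_data maps field names to bytes, e.g. {"data": image_bytes}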

Using numpy.frombuffer or torch.frombuffer fails, because the payload is the UTF-8 text of a nested Python list rather than raw float32 bytes, so its length is not a multiple of the element size:

import numpy as np

np.frombuffer(response.prediction)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size

np.frombuffer(response.prediction, dtype=np.float32)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size

Using torch

import torch
torch.frombuffer(response.prediction, dtype=torch.float32)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer length (2601542 bytes) after offset (0 bytes) must be a multiple of element size (4)
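
For comparison, frombuffer should work if the payload were the raw tensor bytes instead of the text form of a list. A hypothetical sketch of such a handler variant:

def postprocess(self, data):
    # send the raw float32 bytes instead of a Python list
    return [data.cpu().numpy().tobytes()]

The client could then decode with frombuffer and reshape to [1, 17, 80, 64] (1 * 17 * 80 * 64 * 4 = 348160 bytes, a multiple of 4):

import numpy as np
import torch
preds = torch.from_numpy(
    np.frombuffer(response.prediction, dtype=np.float32).copy()
).reshape(1, 17, 80, 64)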

Is there an alternative, more efficient way of converting the received bytes into a torch.Tensor?


Solution

  • Update:

    There's an even faster method that should completely remove the bottleneck: use tf.io.serialize_tensor from TensorFlow to serialize the tensor inside postprocess:

    import tensorflow as tf
    def postprocess(self, data):
        return [tf.io.serialize_tensor(data.cpu()).numpy()]
    

    Decode it on the client using tf.io.parse_tensor:

    import tensorflow as tf
    import torch
    response = self.inference_stub.Predictions(
        inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
    prediction = response.prediction
    preds = torch.as_tensor(tf.io.parse_tensor(prediction, out_type=tf.float32).numpy())
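
    If you want to avoid the TensorFlow dependency, a similar binary round trip can be done with torch.save into an in-memory io.BytesIO buffer; a minimal, untested sketch:

    import io
    import torch
    def postprocess(self, data):
        # serialize the tensor into an in-memory buffer
        buffer = io.BytesIO()
        torch.save(data.cpu(), buffer)
        return [buffer.getvalue()]

    On the client, load the tensor back from the received bytes:

    preds = torch.load(io.BytesIO(response.prediction))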
    

    Previous:

    One hack I've found that significantly improves performance when sending large tensors is to return a list containing a JSON-serializable dict.

    In your handler's postprocess function:

    def postprocess(self, data):
        output_data = {}
        output_data['data'] = data.tolist()
        return [output_data]
    

    On the client side, when you receive the gRPC response, decode it with json.loads:

    import json
    import torch
    response = self.inference_stub.Predictions(
        inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
    decoded_output = response.prediction.decode('utf-8')
    preds = torch.as_tensor(json.loads(decoded_output))
    

    preds should now contain the output tensor.
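
    A quick sanity check against the shape and dtype from the question:

    assert preds.shape == (1, 17, 80, 64)
    assert preds.dtype == torch.float32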