I have a model that is served using TorchServe. I'm communicating with the TorchServe server using gRPC. The final postprocess
method of the custom handler defined returns a list which is converted into bytes for transfer over the network.
The post process method
def postprocess(self, data):
# data type - torch.Tensor
# data shape - [1, 17, 80, 64] and data dtype - torch.float32
return data.tolist()
The main issue is at the client where converting the received bytes from TorchServe to a torch Tensor is inefficiently done via ast.literal_eval
# This takes 0.3 seconds
response = self.inference_stub.Predictions(
inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
# This takes 0.84 seconds
predictions = torch.as_tensor(literal_eval(
response.prediction.decode('utf-8')))
Using numpy.frombuffer
or torch.frombuffer
return the following error.
import numpy as np
np.frombuffer(response.prediction)
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size
np.frombuffer(response.prediction, dtype=np.float32)
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size
Using torch
import torch
torch.frombuffer(response.prediction, dtype = torch.float32)
Traceback (most recent call last):
File "<string>", line 1, in <module>
ValueError: buffer length (2601542 bytes) after offset (0 bytes) must be a multiple of element size (4)
Is there an alternative, more efficient solution of converting the received bytes into torch.Tensor
?
Update:
There's an even faster method and should completely solve the bottleneck. Use tf.io.serialize_tensor
from tensorflow to serialize your tensor inside postprocess
def postprocess(self, data):
return [tf.io.serialize_tensor(data.cpu()).numpy()]
Decode it using tf.io.parse_tensor
response = self.inference_stub.Predictions(
inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
prediction = response.prediction
torch.as_tensor(tf.io.parse_tensor(prediction, out_type=tf.float32).numpy())
Previous:
One hack I've found that has significantly increased the performance while sending large tensors is to return a list of json.
In your handler's postprocess function:
def postprocess(self, data):
output_data = {}
output_data['data'] = data.tolist()
return [output_data]
At the clients side when you receive the grpc response, decode it using json.loads
response = self.inference_stub.Predictions(
inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
decoded_output = response.prediction.decode('utf-8')
preds = torch.as_tensor(json.loads(decoded_output))
preds
should have the output tensor