Tags: python, docker, pytorch, caffe2, onnx

Caffe2: Load ONNX model and run single-threaded inference on a multi-core host / Docker


I'm having trouble running inference on a model in Docker when the host has several cores. The model is exported via the PyTorch 1.0 ONNX exporter:

torch.onnx.export(pytorch_net, dummyseq, ONNX_MODEL_PATH)
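
For reference, a minimal sketch of that export with placeholder values (the real dummyseq shape/dtype, the output path, and pytorch_net are model-specific):

import torch

ONNX_MODEL_PATH = "model.onnx"                  # placeholder output path
dummyseq = torch.zeros(1, 2, dtype=torch.long)  # placeholder dummy input with the expected shape
pytorch_net.eval()                              # switch to eval mode before export
torch.onnx.export(pytorch_net, dummyseq, ONNX_MODEL_PATH)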

Starting the model server (wrapped in Flask) with a single core yields acceptable performance (cpuset pins the process to specific CPUs):

docker run --rm -p 8081:8080 --cpus 0.5 --cpuset-cpus 0 my_container

Response from ab -c 1 -n 1000 http://0.0.0.0:8081/predict\?itemids\=5,100:

Percentage of the requests served within a certain time (ms)
  50%      5
  66%      5
  75%      5
  80%      5
  90%      7
  95%     46
  98%     48
  99%     49

But pinning it to four cores gives completely different stats for the same ab call:

docker run --rm -p 8081:8080 --cpus 0.5 --cpuset-cpus 0,1,2,3 my_container

Percentage of the requests served within a certain time (ms)
  50%      9
  66%     12
  75%     14
  80%     18
  90%     62
  95%     66
  98%     69
  99%     69
 100%     77 (longest request)

Model inference is done like this, and apart from this issue it seems to work as expected. (This runs in a completely separate environment from the model export, of course.)

from caffe2.python import workspace
from caffe2.python.onnx.backend import Caffe2Backend as c2
from onnx import ModelProto


class Model:
    def __init__(self, onnx_file_path):
        self.predictor = self.create_caffe2_predictor(onnx_file_path)

    @staticmethod
    def create_caffe2_predictor(onnx_file_path):
        # Parse the ONNX protobuf and convert it into Caffe2 init/predict nets.
        with open(onnx_file_path, 'rb') as onnx_model:
            onnx_model_proto = ModelProto()
            onnx_model_proto.ParseFromString(onnx_model.read())
            init_net, predict_net = c2.onnx_graph_to_caffe2_net(onnx_model_proto)
            predictor = workspace.Predictor(init_net, predict_net)
        return predictor

    def predict(self, numpy_array):
        # '0' is the name of the model's input blob in the exported graph.
        return self.predictor.run({'0': numpy_array})

** wrapper Flask app which calls Model.predict() on requests to /predict **
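
For reference, a minimal sketch of what that wrapper could look like, assuming the Model class above (with the path passed to __init__) and the itemids query parameter from the ab calls; the model path, input dtype and shape are placeholders:

# Hypothetical Flask wrapper, not the original app.
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)
model = Model("model.onnx")  # placeholder path to the exported ONNX file

@app.route('/predict')
def predict():
    # e.g. /predict?itemids=5,100 -> a (1, 2) array fed to the Caffe2 predictor
    itemids = [int(i) for i in request.args['itemids'].split(',')]
    numpy_array = np.array([itemids], dtype=np.int64)  # dtype/shape depend on the model
    outputs = model.predict(numpy_array)
    return jsonify([o.tolist() for o in outputs])

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)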

OMP_NUM_THREADS=1 is also set in the container environment, which had some effect, but it does not resolve the underlying issue.
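
For completeness, those thread-count variables can also be set from Python before caffe2/torch are imported, so they take effect even if the container entrypoint does not export them; a minimal sketch (MKL_NUM_THREADS is an assumption that only matters for MKL-backed builds):

# Set thread limits before importing caffe2/torch so the OpenMP/MKL runtimes pick them up.
import os
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")  # assumption: only relevant for MKL builds

from caffe2.python import workspace  # import after the environment is configured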

The benchmark stats shown here were run on a local machine with 8 hyperthreads, so I shouldn't be saturating the machine and affecting the test. The same pattern also shows up in my Kubernetes environment, where I'm seeing a large amount of CFS (Completely Fair Scheduler) throttling.

Since I'm running in Kubernetes, there's no way for me to control how many CPUs the host exposes, and doing some sort of pinning there seems a bit hacky as well.

Is there any way to pin caffe2 model inference to a single processor? Am I doing something obviously wrong here? Is the caffe2.Predictor object not suited to this task?

Any help appreciated.

EDIT:

I've added the simplest possible reproducible example I can think of here, with a Docker container and run script included: https://github.com/NegatioN/Caffe2Struggles


Solution

  • This is not a direct answer to the question, but if your goal is to serve PyTorch models (and only PyTorch models, as mine is now) in production, simply using PyTorch Tracing seems to be the better choice.

    You can then load the traced model directly into the C++ frontend, similarly to what you would do through Caffe2, but PyTorch tracing seems better maintained. From what I can see there is no slowdown, and it is a whole lot easier to configure.

    An example of this, to get good performance in a single-core container, is to run with OMP_NUM_THREADS=1 as before and export the model as follows:

    import torch
    from torch import jit

    ### Create a model
    model.eval()
    # an_array_with_input_size is a numpy array matching the model's expected input shape
    traced = jit.trace(model, torch.from_numpy(an_array_with_input_size))
    traced.save("traced.pt")
    

    And then simply run the model in production in pure C++ following the above guide, or through the Python interface like this:

    from torch import jit
    model = jit.load("traced.pt")
    output = model(some_input)
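
    If the goal is still single-threaded inference per worker, the intra-op thread count can also be pinned from inside the process rather than only through the environment; a minimal sketch, assuming the traced.pt produced above and the same placeholder some_input:

    import torch
    from torch import jit

    torch.set_num_threads(1)       # limit intra-op parallelism, similar to OMP_NUM_THREADS=1
    model = jit.load("traced.pt")
    with torch.no_grad():          # inference only, skip autograd bookkeeping
        output = model(some_input)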