tensorrt

Why does a TensorRT engine need to "warmUp" to get accurate inference profiling?


Running trtexec --help shows the flag --warmUp=N, described as "Run for N milliseconds to warmup before measuring performance (default = 200)".

However, why is a warmup needed? If the model (and thus the intermediate buffers necessary for the forward pass) is allocated at model load time, then the only remaining performance bottleneck should be the host-to-device memory transfers, and the NVIDIA docs indicate that these are hidden by their enqueuing strategy.
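(To be concrete, the pattern I understand the docs to describe is roughly the sketch below; context, deviceBindings, hostInput, hostOutput, and the byte sizes are placeholder names of my own, not anything from the docs.)

    // Rough sketch of the overlapped-transfer pattern (hypothetical names;
    // assumes a deserialized engine and pre-allocated host/device buffers).
    #include <cuda_runtime_api.h>
    #include <NvInfer.h>

    void infer(nvinfer1::IExecutionContext* context, void** deviceBindings,
               const void* hostInput, void* hostOutput,
               size_t inputBytes, size_t outputBytes, cudaStream_t stream)
    {
        // All three operations are queued asynchronously on one stream, so
        // the copies are serialized with the kernels on the device while the
        // host thread is free to keep enqueuing work.
        cudaMemcpyAsync(deviceBindings[0], hostInput, inputBytes,
                        cudaMemcpyHostToDevice, stream);
        context->enqueueV2(deviceBindings, stream, nullptr);
        cudaMemcpyAsync(hostOutput, deviceBindings[1], outputBytes,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);  // block only when the result is needed
    }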

Therefore I'm not sure what else could result in an initial performance bottleneck. Any insight on why this is needed would be much appreciated.


Solution

  • TensorRT needs warmup for multiple reasons. One is visible directly in
    nvidia-smi: an idle GPU drops into a low-power performance state (note
    the P8 in the Perf column of the output below) with down-clocked SMs and
    memory, and it only ramps back up to a high-performance state under
    sustained load, so the first iterations run on reduced clocks. Other
    one-time costs, such as lazy CUDA context and kernel initialization, also
    land on the very first launches. Warmup absorbs all of this so that only
    steady-state latency is measured; a sketch of the warmup-then-measure
    pattern follows the output.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.89.02    Driver Version: 528.49       CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  On   | 00000000:65:00.0  On |                  N/A |
    |  0%   47C    P8    36W / 350W |    473MiB / 12288MiB |     14%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
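
    A minimal sketch of what trtexec's warmup amounts to when profiling by
    hand, assuming an already-deserialized engine and bound device buffers;
    profileEngine, warmupIters, and timedIters are made-up names:

    // Warm up, then time with CUDA events so the one-time costs above are
    // excluded from the measurement.
    #include <cuda_runtime_api.h>
    #include <NvInfer.h>

    float profileEngine(nvinfer1::IExecutionContext* context, void** bindings,
                        cudaStream_t stream, int warmupIters, int timedIters)
    {
        // Untimed warmup iterations: let clocks ramp up out of P8 and pay
        // lazy-initialization costs before measuring.
        for (int i = 0; i < warmupIters; ++i)
            context->enqueueV2(bindings, stream, nullptr);
        cudaStreamSynchronize(stream);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, stream);
        for (int i = 0; i < timedIters; ++i)
            context->enqueueV2(bindings, stream, nullptr);
        cudaEventRecord(stop, stream);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms / timedIters;  // mean steady-state latency per inference, ms
    }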