I have been experimenting with various options to speed up some for-loop-heavy logic in PyTorch. The two obvious options are using numba or writing a custom C++ extension.
As an example I have picked a "variable length delay line" from digital signal processing. This can be written trivially but inefficiently using a plain Python for loop:
def delay_line(samples, delays):
    """
    :param samples: Float tensor of shape (N,)
    :param delays: Int tensor of shape (N,)

    The goal is basically to mix each `samples[i]` with the delayed sample
    specified by a per-sample `delays[i]`.
    """
    for i in range(len(samples)):
        delay = int(delays[i].item())
        index_delayed = i - delay
        if index_delayed < 0:
            index_delayed = 0
        samples[i] = 0.5 * (samples[i] + samples[index_delayed])
Knowing how badly for loops perform in Python, my hope was that I could get significantly better performance by implementing the same logic in C++. Following the tutorial, I came up with this literal translation from Python to C++:
void delay_line(torch::Tensor samples, torch::Tensor delays) {
    int64_t input_size = samples.size(-1);
    for (int64_t i = 0; i < input_size; ++i) {
        int64_t delay = delays[i].item<int64_t>();
        int64_t index_delayed = i - delay;
        if (index_delayed < 0) {
            index_delayed = 0;
        }
        samples[i] = 0.5 * (samples[i] + samples[index_delayed]);
    }
}
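For completeness, the function is exposed to Python with the usual pybind11 binding block from the tutorial; the binding itself is not what's being measured, and the names below are just illustrative:

// Binding boilerplate; <torch/extension.h> pulls in pybind11.
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("delay_line", &delay_line, "Variable-length delay line (in-place)");
}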
I have also taken the Python function and wrapped it in various jit decorators to obtain numba and TorchScript versions (see my other answer for details on the numba wrapping). I then ran my benchmark for all versions, with the tensors residing either on the CPU or on the GPU. The results are rather surprising:
╭──────────────┬──────────┬────────────────────╮
│ Method │ Device │ Median time [ms] │
├──────────────┼──────────┼────────────────────┤
│ plain_python │ CPU │ 13.481 │
│ torchscript │ CPU │ 6.318 │
│ numba │ CPU │ 0.016 │
│ cpp │ CPU │ 9.056 │
│ plain_python │ GPU │ 45.412 │
│ torchscript │ GPU │ 47.809 │
│ numba │ GPU │ 0.236 │
│ cpp │ GPU │ 31.145 │
╰──────────────┴──────────┴────────────────────╯
Notes:
- The sample buffer size was fixed to 1024.
- Results are medians over 100 executions, to exclude artifacts from the initial jit overhead.
- Input data creation and moving it to the device are excluded from the measurements.
- Full benchmark script: gist
The most notable result: the C++ variant is surprisingly slow. The fact that numba is more than two orders of magnitude faster indicates that the problem can indeed be solved faster. And the fact that the C++ variant is still pretty close to the known-to-be-slow Python for loop suggests that something isn't quite right.
I'm wondering what could explain the poor performance of the C++ extension. The first thing that comes to mind is missing optimization; however, I've made sure that the compilation uses optimization, and switching from -O2 to -O3 also didn't make a difference.
To isolate the overhead of the pybind11 function call, I replaced the C++ function body with an empty one, i.e., one that does nothing at all. This reduced the time to 2-3 μs, which means that the time really is spent in that particular function body.
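In other words, the no-op baseline was something along these lines:

// Empty-body version used only to measure the Python -> C++ call overhead.
void delay_line_noop(torch::Tensor samples, torch::Tensor delays) {
    (void)samples;  // intentionally unused
    (void)delays;
}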
Any ideas why I'm observing such poor performance, and is there anything I could do on the C++ side to match the performance of the numba implementation?
Bonus question: Is it expected that the GPU versions are much slower than the CPU versions?
I realize that this is an older question, but I'd like to provide an answer for anyone who ends up here looking to improve the speed of their C++ extension. As mentioned in the GitHub issue, the problem is that torch::Tensor::operator[] is much slower than you might expect, since it needs to fetch the data and convert it to the relevant type, unlike a typical std::vector::operator[]. The solution is to access the raw data in the Tensor directly.
For a contiguous Tensor such as in this case, that is not too difficult:
#include <torch/extension.h>

#include <cassert>
#include <span>  // std::span requires C++20

void delay_line_forward(torch::Tensor samples, torch::Tensor delays) {
    const int64_t input_size = samples.size(-1);
    assert(samples.is_contiguous() && delays.is_contiguous());
    // Access the raw data directly instead of going through Tensor::operator[].
    // This assumes samples is a float32 tensor and delays is an int64 (Long)
    // tensor, matching the dtypes described in the question.
    std::span<float> samples_span(samples.data_ptr<float>(), input_size);
    std::span<int64_t> delays_span(delays.data_ptr<int64_t>(), input_size);
    for (int64_t i = 0; i < input_size; ++i) {
        int64_t delay = delays_span[i];
        int64_t index_delayed = i - delay;
        if (index_delayed < 0) {
            index_delayed = 0;
        }
        samples_span[i] = 0.5 * (samples_span[i] + samples_span[index_delayed]);
    }
}
And we can see that it has the desired effect (I was having issues with the GPU execution that I didn't feel like debugging, so I'm only showing CPU results):
╭──────────────┬──────────┬────────────────────╮
│ Method │ Device │ Median time [ms] │
├──────────────┼──────────┼────────────────────┤
│ plain_python │ CPU │ 6.077 │
│ torchscript │ CPU │ 4.273 │
│ numba │ CPU │ 0.007 │
│ cpp │ CPU │ 0.002 │
╰──────────────┴──────────┴────────────────────╯
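As a side note, an alternative to building the spans by hand is PyTorch's accessor API, which gives similarly cheap element access on CPU tensors. A minimal sketch under the same dtype assumptions as above (not what I benchmarked):

// Same loop written with TensorAccessor instead of raw pointers/spans.
// Assumes a float32 samples tensor and an int64 delays tensor on the CPU.
void delay_line_accessor(torch::Tensor samples, torch::Tensor delays) {
    const int64_t input_size = samples.size(-1);
    auto samples_acc = samples.accessor<float, 1>();
    auto delays_acc = delays.accessor<int64_t, 1>();
    for (int64_t i = 0; i < input_size; ++i) {
        int64_t index_delayed = i - delays_acc[i];
        if (index_delayed < 0) {
            index_delayed = 0;
        }
        samples_acc[i] = 0.5f * (samples_acc[i] + samples_acc[index_delayed]);
    }
}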
Also, as for why the GPU execution is so much slower: the code here is inherently serial, and a CPU will always win out over a GPU in serial execution. If this code could be parallelized using Tensor operations to run as a batch, then I imagine you'd see the GPU really shine. However, I don't think that's possible here, since earlier iterations affect later ones.