Tags: python, c++, python-bindings

Pybind11 is slower than Pure Python


I created Python bindings using pybind11. Everything worked perfectly, but when I ran a speed test the result was disappointing.

Basically, I have a function in C++ that adds two numbers, and I want to use that function from a Python script. I also included a for loop that runs 100 times to make the difference in processing time easier to see.

For the function "imported" from C++ using pybind11, I obtain 0.002310514450073242 to 0.0034799575805664062 seconds.

For the simple Python script, I obtain 0.0012788772583007812 to 0.0015883445739746094 seconds.

main.cpp file:

#include <pybind11/pybind11.h>
namespace py = pybind11;

double sum(double a, double b) {
    return a + b;
}

PYBIND11_MODULE(SumFunction, var) {
    var.doc() = "pybind11 example module";
    var.def("sum", &sum, "This function adds two input numbers");
}

main.py file:

from build.SumFunction import *
import time

start = time.time()
for i in range(100):
    print(sum(2.3,5.2))
end = time.time()

print(end - start)

CMakeLists.txt file:

cmake_minimum_required(VERSION 3.0.0)
project(Projectpybind11 VERSION 0.1.0)

include(CTest)
enable_testing()

add_subdirectory(pybind11)
pybind11_add_module(SumFunction main.cpp)

set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

Simple Python script:

import time

def summ(a,b):
        return a+b
start = time.time()
for i in range(100):
        print(summ(2.3,5.2))
end = time.time()

print(end - start)

Solution

    1. Benchmarking is complicated; it is practically a systems-engineering discipline in its own right.

      Many things can interfere with a benchmark: NIC interrupt handling, keyboard or mouse input, OS scheduling, and so on. I have seen one of my processes blocked by the OS for up to 15 seconds! So, as other commenters have pointed out, calling print() inside the timed loop adds unnecessary interference.

    2. Your test computation is too simple.

      You need to be clear about what you are comparing. Passing arguments between Python and C++ is obviously slower than passing them within Python. So I assume you want to compare the computation speed of the two, not the argument-passing speed. If so, your computation is too simple: the measured time is mostly the time spent passing arguments, and the computation itself is only a minor part of the total. So I have put my own sample below; I would be glad to see anyone polish it.

    3. Your loop count is too low.

      The fewer the loops, the more randomness. As with point 1, the measured time is only a few milliseconds, so the running process can easily be disturbed by the OS. I think the test should last at least a few seconds.

    4. C++ is not always faster than Python. Nowadays many Python modules/libraries can use the GPU for heavy computation, or run matrix operations in parallel even on the CPU alone. I guess you are evaluating whether to use pybind11 in your project. A comparison like this is not worth much for that decision, because the best tool depends on the real requirements, but it is a good learning exercise. I recently encountered a case where Python was faster than C++ in a deep-learning task. Funny, isn't it?
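
Points 1 to 3 can be illustrated with a minimal sketch, assuming the standard-library `timeit` module is acceptable for the measurement. `summ` mirrors the pure-Python function from the question; `best_time_per_call` is a hypothetical helper name introduced here, not part of pybind11 or the original post. It keeps `print()` out of the timed region, runs many iterations per measurement, and keeps the fastest of several repeats, which is usually the run least disturbed by the OS.

```python
import timeit


def summ(a, b):
    """Pure-Python stand-in for the function being benchmarked."""
    return a + b


def best_time_per_call(func, *args, number=200_000, repeat=5):
    """Return the best observed time per call, in seconds.

    Hypothetical helper: times `number` calls, `repeat` times over,
    and keeps the fastest run to reduce the influence of OS noise.
    """
    timer = timeit.Timer(lambda: func(*args))
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number


if __name__ == "__main__":
    per_call = best_time_per_call(summ, 2.3, 5.2)
    # Print once, outside the timed region.
    print(f"best time per call: {per_call:.3e} s")
```

The same helper could presumably be pointed at the pybind11-bound `sum` from the question to compare per-call overhead. Note that the wrapping lambda adds a small constant cost to every measurement, so only the difference between two results is meaningful.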

    Finally, I ran my sample on my PC and found that the C++ computation was up to 100 times faster than the Python one.

    ComplexCpp.cpp:

    #include <algorithm>  // std::max
    #include <cmath>      // std::log
    #include <pybind11/numpy.h>
    #include <pybind11/pybind11.h>
    
    namespace py = pybind11;
    
    double Compute( double x, py::array_t<double> ys ) {
    //  std::cout << "x:" << std::setprecision( 16 ) << x << std::endl;
        auto r = ys.unchecked<1>();
        for( py::ssize_t i = 0; i < r.shape( 0 ); ++i ) {
            double y = r( i );
    //      std::cout << "y:" << std::setprecision( 16 ) << y << std::endl;
            x += y;
            x *= y;
            y = std::max( y, 1.001 );
            x /= y;
            x *= std::log( y );
        }
        return x;
    }
    
    PYBIND11_MODULE( ComplexCpp, m ) {
        m.def( "Compute", &Compute, "a more complicated computing" );
    }
    

    tryComplexCpp.py:

    import ComplexCpp
    import math
    import numpy as np
    import random
    import time
    
    
    def PyCompute(x: float, ys: np.ndarray) -> float:
        #print(f'x:{x}')
        for y in ys:
            #print(f'y:{y}')
            x += y
            x *= y
            y = max(y, 1.001)
            x /= y
            x *= math.log(y)
        return x
    
    
    LOOPS: int = 100000000
    
    if __name__ == "__main__":
        # random starting value
        x0 = random.random()
    
        # Store all args in one array and pass that same array to both the C++
        # function and the Python function, so both sides see identical inputs.
        # np.random.random fills the array without a slow Python-level loop.
        args = np.random.random(LOOPS)
    
        print('Args are ready, now start...')
    
        # try it with C++
        start_time = time.perf_counter()
        x = ComplexCpp.Compute(x0, args)
        print(f'Computing with C++ in {time.perf_counter() - start_time} seconds.\n')
        # print the result so it is actually used and both runs can be compared
        print(f'The result is {x}\n')
    
        # try it with Python
        start_time = time.perf_counter()
        x = PyCompute(x0, args)
        print(f'Computing with Python in {time.perf_counter() - start_time} seconds.\n')
        # print the result so it is actually used and both runs can be compared
        print(f'The result is {x}\n')
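
Before committing to a run with 100 million elements, it may help to check on a small input that both implementations agree and to time each one more than once. The following is only a sketch under stated assumptions: `py_compute`, `best_of`, and `compare` are hypothetical names introduced here, and the compiled `ComplexCpp` module is used only if it happens to be importable, so the script still runs without it.

```python
import math
import random
import time


def py_compute(x, ys):
    """Same recurrence as PyCompute above, for any iterable of floats."""
    for y in ys:
        x += y
        x *= y
        y = max(y, 1.001)
        x /= y
        x *= math.log(y)
    return x


def best_of(func, *args, repeats=3):
    """Run func(*args) `repeats` times; return (last result, fastest time in s)."""
    best = math.inf
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


def compare(implementations, x0, ys, repeats=3):
    """Time every implementation on identical inputs and check that they agree."""
    timings = {}
    reference = None
    for name, func in implementations.items():
        result, elapsed = best_of(func, x0, ys, repeats=repeats)
        if reference is None:
            reference = result
        elif not math.isclose(result, reference, rel_tol=1e-9, abs_tol=1e-12):
            raise ValueError(f"{name} disagrees: {result} != {reference}")
        timings[name] = elapsed
    return timings


if __name__ == "__main__":
    random.seed(0)
    ys = [random.random() for _ in range(200_000)]
    implementations = {"python": py_compute}
    try:
        import ComplexCpp  # compiled module from this answer, if it has been built
        implementations["c++"] = ComplexCpp.Compute
    except ImportError:
        pass  # fall back to timing the Python version alone
    for name, seconds in compare(implementations, 0.5, ys).items():
        print(f"{name}: {seconds:.4f} s")
```

This sketch passes a plain list, so any list-to-array conversion on the C++ side is included in the measured time; passing a NumPy array, as the sample above does, avoids that cost.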