python, performance, assembly, time, integer-arithmetic

Time it takes to square in Python


I was wondering whether x**2 or x*x is faster:

def sqr(x):
    # square x repeatedly with the ** operator
    for i in range(20):
        x = x**2
    return x

def sqr_(x):
    # square x repeatedly with the * operator
    for i in range(20):
        x = x*x
    return x

When I time it, this is what I get:

The time it takes for x**2: 101230500
The time it takes for x*x: 201469200

I have tried it many times: they are either equal, or x ** 2 is faster than x * x, but x * x is never faster than x ** 2.
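
For reference, numbers like the ones above can be produced with something along these lines (only a rough sketch using time.perf_counter_ns and the sqr/sqr_ functions above; the original harness and input value are not shown, so the input here is just an example):

import time

start = time.perf_counter_ns()
sqr(5)          # example input; the original value isn't shown
print("The time it takes for x**2:", time.perf_counter_ns() - start)

start = time.perf_counter_ns()
sqr_(5)
print("The time it takes for x*x: ", time.perf_counter_ns() - start)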

So I disassembled the code:

For x**2:

  5          12 LOAD_FAST                0 (x)
             14 LOAD_CONST               2 (2)
             16 BINARY_POWER
             18 STORE_FAST               0 (x)
             20 JUMP_ABSOLUTE            8

For x*x:

  9          12 LOAD_FAST                0 (x)
             14 LOAD_FAST                0 (x)
             16 BINARY_MULTIPLY
             18 STORE_FAST               0 (x)
             20 JUMP_ABSOLUTE            8
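
Listings like these can be obtained with the dis module, for example:

import dis

dis.dis(sqr)    # bytecode of the x**2 version
dis.dis(sqr_)   # bytecode of the x*x version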

Is it because LOAD_CONST is slightly faster than LOAD_FAST?

LOAD_CONST takes the literal value at the given index of co_consts (here 2) and pushes it onto the stack.

LOAD_FAST accesses a local variable's value in the fast-locals array by index.

Or is BINARY_POWER faster than BINARY_MULTIPLY (I actually don't know the algorithm behind BINARY_POWER)?


Solution

  • For small integers, x*x is significantly faster than x**2, since CPython does a lot more work internally to compute a**b. On my machine, x*x is actually 4 times faster (processor i5-9600KF, CPython 3.8.1, on Windows).

    That being said, in your code the numbers grow very quickly and Python integers are unbounded: each squaring doubles the size of the binary representation. Since the exponents multiply together, the loop ends up computing x**(2*2*2*...*2) = x**(2**20) = x**1048576. For x=2 the final number takes 128 KiB in memory, and for x=100 it takes about 850 KiB, which is pretty big. Each iteration of your loop is therefore dominated by the computation of such huge numbers. As a result, for large numbers x*x and x**2 are equally fast: the same internal big-integer computation is done in both cases, and the overhead of the CPython interpreter becomes negligible compared to it.
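
    A rough way to see both effects with timeit (only a sketch; the exact numbers depend on the machine and CPython version):

        import timeit

        # Small integer: interpreter overhead dominates, so x*x should win clearly.
        print(timeit.timeit("x * x", setup="x = 7"))
        print(timeit.timeit("x ** 2", setup="x = 7"))

        # Huge integer (~128 KiB, like x = 2 after the 20 squarings): the
        # big-integer arithmetic dominates, so both take about the same time.
        print(timeit.timeit("x * x", setup="x = 2 ** (2 ** 20)", number=20))
        print(timeit.timeit("x ** 2", setup="x = 2 ** (2 ** 20)", number=20))

        # Size check for the figure above (bits -> KiB):
        print((2 ** (2 ** 20)).bit_length() / 8 / 1024)   # ~128 KiB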


    Under the hood

    Internally, CPython appears to use _PyNumber_PowerNoMod, which calls PyNumber_Power, which calls ternary_op; multiplication goes through PyNumber_Multiply, which calls binary_op1. Note that CPython is not specially optimized for x**2: internally it computes the equivalent of pow(x, 2, None), the generic modular-exponentiation entry point (calling pow explicitly from Python is even a bit less efficient, since CPython then also has to check that pow has not been overwritten). This modular exponentiation path is much more expensive than a plain x * x because it is a very generic function.
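
    The extra cost of calling pow explicitly can be seen in the bytecode (a small illustration using pre-3.11 opcode names, matching the listings above):

        import dis

        # x ** 2 compiles down to a single BINARY_POWER opcode.
        dis.dis(compile("x ** 2", "<demo>", "eval"))

        # pow(x, 2) additionally has to load the name `pow` each time (which is
        # how the "has pow been overwritten?" check manifests) and perform a call.
        dis.dis(compile("pow(x, 2)", "<demo>", "eval"))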

    In the end, it appears long_mul and long_pow are the functions called in your case (note that long_pow calls long_mul internally, so long_pow has to execute more instructions overall).

    For large numbers, CPython uses Karatsuba multiplication (see k_mul).
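
    For illustration, here is a toy version of the Karatsuba idea in pure Python (CPython's k_mul operates on arrays of 30-bit digits and is considerably more elaborate; this only sketches the recursion):

        def karatsuba(a, b):
            # Split each operand into high/low halves and replace the four
            # half-sized products of schoolbook multiplication with three.
            if a < 10 or b < 10:                 # small operands: multiply directly
                return a * b
            m = max(a.bit_length(), b.bit_length()) // 2
            a_hi, a_lo = a >> m, a & ((1 << m) - 1)
            b_hi, b_lo = b >> m, b & ((1 << m) - 1)
            z0 = karatsuba(a_lo, b_lo)
            z2 = karatsuba(a_hi, b_hi)
            z1 = karatsuba(a_lo + a_hi, b_lo + b_hi) - z0 - z2
            return (z2 << (2 * m)) + (z1 << m) + z0

        a = 12345678901234567890
        b = 98765432109876543210
        assert karatsuba(a, b) == a * b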

    Note that CPython is actually very slow in both cases, since it takes several nanoseconds (at least on my machine) and performs dozens of checks and many function calls just to multiply two integers, whereas mainstream x86-64 processors can multiply two 64-bit integers natively in a single cycle. Large integers, however, cannot be handled natively by mainstream processors and require a much more expensive computation.