Tags: python, multithreading, gil

Why is this multi-threaded code faster than the sequential one, if the task is 100% CPU-bound? - Python 3.10


I have two simple code examples of the same calculation, which creates load on the CPU. The sequential one is significantly slower than the multi-threaded one. If I understand the concept of the GIL correctly, that should not be the case, as only one thread can run at a time. So the overall time should be the same, as there is no I/O wait time.

What could be an explanation for this? I have another example with the multiprocessing library, which runs even faster than the multi-threaded one, which is expected, as it creates load on 3 cores in parallel.
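
A stripped-down sketch of that multiprocessing variant (simplified for illustration, not the exact script):

import multiprocessing
import time

def my_process_function(x):
    # same CPU-bound loop, but each input runs in its own process on its own core
    start_time = time.time()
    for y in range(x):
        y*y
    print(f"finished input {x}: {time.time() - start_time:.2f} seconds")

if __name__ == "__main__":
    start_time = time.time()
    processes = [
        multiprocessing.Process(target=my_process_function, args=(x,))
        for x in [100000000, 200000000, 300000000]
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
    print(f"{time.time() - start_time:.2f} seconds")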

Sequential code - ~31 seconds:

import time

start_time = time.time()
my_threads = [100000000, 200000000, 300000000]
for x in my_threads:
    print(f"start {x}")
    for y in range(x):
        y*y
    print(f"finished input {x}: {time.time() - start_time:.2f} seconds")

print(f"{time.time() - start_time:.2f} seconds")

Multi-threaded code - ~15 seconds execution time:

EDIT: removed all unnecessary code to make it simpler (this does not change the outcome)

import threading
import time

def my_threaded_function(s, x):
    with s:
        start_time = time.time()
        for y in range(x):
            y*y
        print(f"finished input {x}: {time.time() - start_time:.2f} seconds")

s = threading.Semaphore(25)

my_threads = [100000000, 200000000, 300000000]
threads = []

start_time = time.time()

for x in my_threads:
    t = threading.Thread(
        target=my_threaded_function, name=f"Thread-{x}",
        args=(s, x,)
    )
    t.daemon = True
    threads.append(t)
    t.start()

for thread in threads:
    thread.join()

print(f"{time.time() - start_time:.2f} seconds")

I expect the two examples to have the same execution time.

EDIT: full output:

❯ python3 code/sequential.py
finished input 100000000: 4.57 seconds
finished input 200000000: 13.61 seconds
finished input 300000000: 27.09 seconds
27.09 seconds
❯ python3 code/threading_example.py
finished input 100000000: 7.84 seconds
finished input 200000000: 12.44 seconds
finished input 300000000: 14.65 seconds
14.68 seconds

Solution

  • The reason is that in the sequential code you read and write global variables, but in the threaded code the variables are local.

    Local variables are stored (STORE_FAST) and loaded (LOAD_FAST) much faster, because a function's local namespace is actually an array rather than a dictionary, and Python accesses its slots by index (see the dis sketch at the end of this answer).

    Put all of the sequential code in a function and run it again. You'll see that the two timings are almost the same (a sketch of this check is at the end of this answer).

    Having:

    for i in range(n):
        x = i * i
    

    On the first line of the loop, Python stores the number produced by range in i. On the second line, it looks up i twice (i * i) and stores the result in x. All of these name operations are slower in the global namespace.
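
    For example, here is a sketch of that check: the OP's sequential code wrapped in a function, otherwise unchanged, so that x and y become local variables:

    import time

    def run_sequential():
        start_time = time.time()
        my_threads = [100000000, 200000000, 300000000]
        for x in my_threads:
            print(f"start {x}")
            for y in range(x):  # y is now a local: STORE_FAST/LOAD_FAST instead of STORE_NAME/LOAD_NAME
                y*y
            print(f"finished input {x}: {time.time() - start_time:.2f} seconds")
        print(f"{time.time() - start_time:.2f} seconds")

    run_sequential()

    With the loops inside a function, the total time should come much closer to the threaded run's ~15 seconds.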
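
    You can also see the difference directly with the dis module; this minimal sketch disassembles the same loop once inside a function and once as module-level code:

    import dis

    def local_loop(n):
        # inside a function, y is a local: the loop compiles to STORE_FAST / LOAD_FAST
        for y in range(n):
            y*y

    dis.dis(local_loop)

    # the same loop compiled as module-level code uses STORE_NAME / LOAD_NAME for y,
    # which are dictionary lookups in the global namespace
    dis.dis("for y in range(100):\n    y*y")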