I was experimenting with Python's multiprocessing and threading modules and stumbled upon something rather strange.
Have a look at the following program:
import threading
import time
import multiprocessing

target_count = int(100e6)
mutex = threading.Lock()

def thread_fn(thread_count):
    y = 0
    for i in range(target_count // thread_count):
        y += 1

x = 0
start = time.time()
for i in range(target_count):
    x += 1
end = time.time()
print(f"single thread time: {(end - start) * 1000}ms")

thread_count = 1
trds = []
start = time.time()
for i in range(thread_count):
    t = threading.Thread(target=thread_fn, args=(thread_count, ))
    t.start()
    trds.append(t)
for i in range(thread_count):
    trds[i].join()
end = time.time()
print(f"{thread_count} threads time: {(end - start) * 1000}ms")
The main thread increments a variable until it reaches target_count; similarly, a single worker thread running thread_fn() increments a local variable.
I was expecting the two timings to be about the same, but the result I get from running this code is:
single thread time: 9375.770568847656ms
1 threads time: 5802.408456802368ms
Is this due to erroneous code, or is something weird happening with the GIL?
Also, if I set thread_count = 8 and change thread_fn() as follows (I know that using a mutex here is nonsensical, but I happened to leave it there accidentally):
def thread_fn(thread_count):
    y = 0
    with mutex:
        for i in range(target_count // thread_count):
            y += 1
running the program gives:
single thread time: 10064.738273620605ms
8 threads time: 6456.046581268311ms
Why do I get a speed boost with the addition of the mutex? I was expecting it to be slower due to the unnecessary locking and unlocking, since, as far as I know, threads run sequentially because of the GIL.
And without the mutex, still using 8 threads:
single thread time: 10910.805225372314ms
8 threads time: 13055.136442184448ms
The difference you are seeing with 1 worker thread -
single thread time: 9375.770568847656ms
1 threads time: 5802.408456802368ms
has nothing to do with threads - a worker thread is not "faster" than the main thread. It comes from using local variables inside a function, as compared to running your for
loop at the module top level, where x
is a global variable: function-local variable access is much more optimized than global variable access, which always requires a dictionary lookup in the module namespace (whereas, in CPython, each local variable inside a function uses a single slot, with the index to access it embedded in the code object at compile time).
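A minimal way to see this locals-vs-globals gap in isolation (the function names below are my own, and exact timings will vary by interpreter version):

```python
import time

N = 5_000_000

g = 0

def bump_global():
    # g lives in the module namespace: every access is a dict lookup
    # (LOAD_GLOBAL / STORE_GLOBAL in the bytecode)
    global g
    for _ in range(N):
        g += 1

def bump_local():
    # y lives in a fixed slot of the frame (LOAD_FAST / STORE_FAST)
    y = 0
    for _ in range(N):
        y += 1
    return y

t0 = time.perf_counter()
bump_global()
t_global = time.perf_counter() - t0

t0 = time.perf_counter()
result = bump_local()
t_local = time.perf_counter() - t0

print(f"global: {t_global * 1000:.0f}ms  local: {t_local * 1000:.0f}ms")
```

On a typical CPython build the local version is noticeably faster, for exactly the reason above.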
As for why the timing using the mutex
is much faster than without it: with the mutex held around the whole loop, the code really doesn't run concurrently - but, on the other hand, there are essentially no thread context switches on the Python side: the execution frame and other state in the Python VM stay fixed from the first count of i
to the last within each thread. Without the mutex, all 8 threads take turns holding the GIL, and each switch has a cost. You don't say which Python version you are using, but this context switching should have less overhead in Python 3.12 (and still less in upcoming versions) than it did up to version 3.11 - if you run it with different interpreters there should be a noticeable difference.
Also, while this context switching adds to the overhead, entering the lock is done just once in each thread - that is 1 lock acquisition per roughly 100,000,000 increments: it has to be negligible.
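As an aside, CPython exposes the thread switch interval through sys, so you can inspect or tune how often the interpreter offers to switch threads (a quick sketch):

```python
import sys

# How often (in seconds) CPython offers to switch between threads
print(sys.getswitchinterval())  # the documented default is 0.005

# A larger value means fewer switches: less overhead, but worse
# responsiveness between threads
sys.setswitchinterval(0.01)
```

Raising the interval in your no-mutex 8-thread run should reduce some of the switching overhead, at the cost of coarser interleaving.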
If you change your lock to be acquired inside the for i
loop, then the same code should take much longer to run. (Still, acquiring the lock around each operation is the pattern we tend to write when using mutexes in real-world code):
def thread_fn(thread_count):
    y = 0
    for i in range(target_count // thread_count):
        with mutex:
            y += 1
Still, if y
were made a global variable that all worker threads add to, this usage of the mutex is the only one which would ensure a correct result while running concurrent code (with all the counts interleaving in a
transparent way).
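A sketch of that pattern - a shared global counter with the lock taken per increment (smaller numbers here just to keep it quick; the names are my own):

```python
import threading

target_count = 1_000_000
thread_count = 8

counter = 0
counter_lock = threading.Lock()

def worker():
    global counter
    for _ in range(target_count // thread_count):
        # serialize each read-modify-write on the shared counter
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(thread_count)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

With the lock, the final count is exactly target_count every run; without it, the unsynchronized `counter += 1` read-modify-write can lose updates.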