I was experimenting with Python's multiprocessing and threading modules and stumbled upon something rather strange.
Have a look at the following program:
import threading
import time
import multiprocessing

target_count = int(100e6)
mutex = threading.Lock()

def thread_fn(thread_count):
    y = 0
    for i in range(target_count // thread_count):
        y += 1

x = 0
start = time.time()
for i in range(target_count):
    x += 1
end = time.time()
print(f"single thread time: {(end - start) * 1000}ms")

thread_count = 1
trds = []
start = time.time()
for i in range(thread_count):
    t = threading.Thread(target=thread_fn, args=(thread_count, ))
    t.start()
    trds.append(t)
for i in range(thread_count):
    trds[i].join()
end = time.time()
print(f"{thread_count} threads time: {(end - start) * 1000}ms")
The main thread increments a variable until it reaches target_count; similarly, a single worker thread running thread_fn() increments a local variable.
I was expecting the two timings to be about the same, but the result I get from running this code is:
single thread time: 9375.770568847656ms
1 threads time: 5802.408456802368ms
Is this due to erroneous code, or is something weird happening with the GIL?
Also, if I set thread_count = 8 and change thread_fn() as follows (I know that using a mutex here is nonsensical, but I happened to leave it there accidentally):
def thread_fn(thread_count):
    y = 0
    with mutex:
        for i in range(target_count // thread_count):
            y += 1
running the program gives:
single thread time: 10064.738273620605ms
8 threads time: 6456.046581268311ms
Why do I get a speed boost with the addition of the mutex? I was expecting it to be slower due to the unnecessary locking and unlocking, since, as far as I know, threads run sequentially because of the GIL.
And without the mutex, still using 8 threads:
single thread time: 10910.805225372314ms
8 threads time: 13055.136442184448ms
The difference you are seeing with 1 worker thread -
single thread time: 9375.770568847656ms
1 threads time: 5802.408456802368ms
has nothing to do with threads - a worker thread is not "faster" than the main thread. It comes from using local variables inside a function, as compared to running your for
loop at the module top level, where x
is a global variable: function-local variable access is much more optimized than global variable access, which always requires a dictionary lookup in the module namespace (whereas, in CPython, each local variable inside a function uses a single slot, with the index to access it embedded in the code object at compile time).
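A minimal way to see this locals-vs-globals gap in isolation (the function names below are my own, and exact timings will vary by interpreter version):

```python
import time

N = 5_000_000

g = 0

def bump_global():
    # g lives in the module namespace: every access is a dict lookup
    # (LOAD_GLOBAL / STORE_GLOBAL in the bytecode)
    global g
    for _ in range(N):
        g += 1

def bump_local():
    # y lives in a fixed slot of the frame (LOAD_FAST / STORE_FAST)
    y = 0
    for _ in range(N):
        y += 1
    return y

t0 = time.perf_counter()
bump_global()
t_global = time.perf_counter() - t0

t0 = time.perf_counter()
result = bump_local()
t_local = time.perf_counter() - t0

print(f"global: {t_global * 1000:.0f}ms  local: {t_local * 1000:.0f}ms")
```

On a typical CPython build the local version is noticeably faster, for exactly the reason above.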
As for why the timing using the mutex
is much faster than without it: with the mutex held around the whole loop, the code really doesn't run concurrently - but, on the other hand, there are essentially no thread context switches on the Python side: the execution frame and other state in the Python VM stay fixed from the first count of i
to the last within each thread. Without the mutex, all 8 threads take turns holding the GIL, and each switch has a cost. You don't say which Python version you are using, but this context switching should have less overhead in Python 3.12 (and still less in upcoming versions) than it did up to version 3.11 - if you run it with different interpreters there should be a noticeable difference.
Also, while this context switching adds to the overhead, entering the lock is done just once in each thread - that is 1 lock acquisition per roughly 100,000,000 increments: it has to be negligible.
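As an aside, CPython exposes the thread switch interval through sys, so you can inspect or tune how often the interpreter offers to switch threads (a quick sketch):

```python
import sys

# How often (in seconds) CPython offers to switch between threads
print(sys.getswitchinterval())  # the documented default is 0.005

# A larger value means fewer switches: less overhead, but worse
# responsiveness between threads
sys.setswitchinterval(0.01)
```

Raising the interval in your no-mutex 8-thread run should reduce some of the switching overhead, at the cost of coarser interleaving.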
If you change your lock to be acquired inside the for i
loop, then the same code should take much longer to run. (Still, acquiring the lock around each operation is the pattern we tend to write when using mutexes in real-world code):
def thread_fn(thread_count):
    y = 0
    for i in range(target_count // thread_count):
        with mutex:
            y += 1
Still, if y
were made a global variable that all worker threads add to, this usage of the mutex is the only one which would ensure a correct result while running concurrent code (with all the counts interleaving in a
transparent way).
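A sketch of that pattern - a shared global counter with the lock taken per increment (smaller numbers here just to keep it quick; the names are my own):

```python
import threading

target_count = 1_000_000
thread_count = 8

counter = 0
counter_lock = threading.Lock()

def worker():
    global counter
    for _ in range(target_count // thread_count):
        # serialize each read-modify-write on the shared counter
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(thread_count)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)
```

With the lock, the final count is exactly target_count every run; without it, the unsynchronized `counter += 1` read-modify-write can lose updates.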