I ran a simple performance test on Python 3.12.0 against Python 3.13.0b3 compiled with the --disable-gil flag. The program computes Fibonacci numbers using ThreadPoolExecutor or ProcessPoolExecutor. PEP 703, which introduces the option to disable the GIL, says there is some overhead, mostly due to biased reference counting followed by per-object locking (https://peps.python.org/pep-0703/#performance), but it puts the overhead on the pyperformance benchmark suite at around 5-8%. My simple benchmark shows a much larger difference. Python 3.13 without the GIL does utilize all CPUs with a ThreadPoolExecutor, but it is much slower than Python 3.12 with the GIL. Judging by the CPU utilization and the elapsed time, Python 3.13 spends several times more clock cycles than 3.12 on the same work.
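As a side note, the "more clock cycles" observation can be checked directly rather than inferred from a CPU monitor. The following is a minimal sketch, not part of the original benchmark: the helper name `measure` and the small workload sizes are assumptions chosen for illustration. It compares wall-clock time (`time.perf_counter`) with CPU time summed over all threads of the current process (`time.process_time`).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fibonacci(n: int) -> int:
    # Deliberately naive recursion: a pure-Python, CPU-bound workload.
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


def measure(workers: int, tasks: int, n: int) -> tuple[float, float]:
    """Return (wall_seconds, cpu_seconds) for running `tasks` calls of fibonacci(n).

    `measure` is a hypothetical helper. process_time() only covers the
    current process, so this works for threads but would not account for
    the children of a ProcessPoolExecutor.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    with ThreadPoolExecutor(workers) as executor:
        # list() forces all results, so worker exceptions are not lost.
        list(executor.map(fibonacci, [n] * tasks))
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


if __name__ == "__main__":
    wall, cpu = measure(workers=2, tasks=8, n=20)
    print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

With the GIL, CPU time should stay close to wall time because only one thread runs Python code at a time. On a free-threaded build, CPU time can approach wall time multiplied by the number of busy cores, so a large CPU total with no matching wall-clock gain points at per-operation overhead.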
Program code:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import datetime
from functools import partial
import sys
import logging
import multiprocessing

logging.basicConfig(
    format='%(levelname)s: %(message)s',
)
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

cpus = multiprocessing.cpu_count()
pool_executor = ProcessPoolExecutor if len(sys.argv) > 1 and sys.argv[1] == '1' else ThreadPoolExecutor
python_version_str = f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}'
logger.info(f'Executor={pool_executor.__name__}, python={python_version_str}, cpus={cpus}')


def fibonacci(n: int) -> int:
    if n < 0:
        raise ValueError("Incorrect input")
    elif n == 0:
        return 0
    elif n == 1 or n == 2:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)


start = datetime.datetime.now()

with pool_executor(8) as executor:
    for task_id in range(30):
        executor.submit(partial(fibonacci, 30))
    executor.shutdown(wait=True)

end = datetime.datetime.now()
elapsed = end - start
logger.info(f'Elapsed: {elapsed.total_seconds():.2f} seconds')
Test results:
# TEST Linux 5.15.0-58-generic, Ubuntu 20.04.6 LTS
INFO: Executor=ThreadPoolExecutor, python=3.12.0, cpus=2
INFO: Elapsed: 10.54 seconds
INFO: Executor=ProcessPoolExecutor, python=3.12.0, cpus=2
INFO: Elapsed: 4.33 seconds
INFO: Executor=ThreadPoolExecutor, python=3.13.0b3, cpus=2
INFO: Elapsed: 22.48 seconds
INFO: Executor=ProcessPoolExecutor, python=3.13.0b3, cpus=2
INFO: Elapsed: 22.03 seconds
Can anyone explain why I see such a large difference compared with the overhead reported for the pyperformance benchmark suite?
I also tried pool_executor(cpus) instead of pool_executor(8) and still got similar results.
Results:
Version of python: 3.12.0a7 (main, Oct 8 2023, 12:41:37) [GCC 9.4.0]
GIL cannot be disabled
Single-threaded: 78498 primes in 6.67 seconds
Threaded: 78498 primes in 7.89 seconds
Multiprocessed: 78498 primes in 5.85 seconds
Version of python: 3.13.0b3 experimental free-threading build (heads/3.13.0b3:7b413952e8, Jul 27 2024, 11:19:31) [GCC 9.4.0]
GIL is disabled
Single-threaded: 78498 primes in 61.42 seconds
Threaded: 78498 primes in 32.29 seconds
Multiprocessed: 78498 primes in 39.85 seconds
So this is yet another test on my machine where performance ends up several times slower. By the way, the video shows overhead similar to what the PEP describes.
As @ekhumoro suggested, I configured the build with the following flags:
./configure --disable-gil --enable-optimizations
and it seems the --enable-optimizations flag makes a significant difference in these benchmarks. The previous build was done with the following configuration:
./configure --with-pydebug --disable-gil
Test results:
INFO: Executor=ThreadPoolExecutor, python=3.12.0, cpus=2
INFO: Elapsed: 10.25 seconds
INFO: Executor=ProcessPoolExecutor, python=3.12.0, cpus=2
INFO: Elapsed: 4.27 seconds
INFO: Executor=ThreadPoolExecutor, python=3.13.0, cpus=2
INFO: Elapsed: 6.94 seconds
INFO: Executor=ProcessPoolExecutor, python=3.13.0, cpus=2
INFO: Elapsed: 6.94 seconds
Version of python: 3.12.0a7 (main, Oct 8 2023, 12:41:37) [GCC 9.4.0]
GIL cannot be disabled
Single-threaded: 78498 primes in 5.77 seconds
Threaded: 78498 primes in 7.21 seconds
Multiprocessed: 78498 primes in 3.23 seconds
Version of python: 3.13.0b3 experimental free-threading build (heads/3.13.0b3:7b413952e8, Aug 3 2024, 14:47:48) [GCC 9.4.0]
GIL is disabled
Single-threaded: 78498 primes in 7.99 seconds
Threaded: 78498 primes in 4.17 seconds
Multiprocessed: 78498 primes in 4.40 seconds
So the main gain from moving from Python 3.12 multiprocessing to Python 3.13 no-GIL multi-threading is a significant memory saving (we only have a single process).
When we compare CPU overhead for the machine with only 2 cores:
[Fibonacci] Python 3.13 multi-threading against Python 3.12 multiprocessing: (6.94 - 4.27) / 4.27 * 100% ~= 63% overhead
[Prime numbers] Python 3.13 multi-threading against Python 3.12 multiprocessing: (4.17 - 3.23) / 3.23 * 100% ~= 29% overhead
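The two percentages above follow from a plain relative-difference formula. A minimal sketch that reproduces them from the measured timings; the function name `overhead_pct` is introduced here for illustration only:

```python
def overhead_pct(candidate_seconds: float, baseline_seconds: float) -> float:
    """Relative slowdown of `candidate` versus `baseline`, in percent."""
    return (candidate_seconds - baseline_seconds) / baseline_seconds * 100


# Timings taken from the optimized-build results above:
# Python 3.13 multi-threading vs. Python 3.12 multiprocessing.
fibonacci_overhead = overhead_pct(6.94, 4.27)
primes_overhead = overhead_pct(4.17, 3.23)

print(f"Fibonacci:     {fibonacci_overhead:.0f}% overhead")
print(f"Prime numbers: {primes_overhead:.0f}% overhead")
```

These are single runs on a 2-core machine, so the figures are rough indications rather than stable measurements.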
From the latest question edits, it seems the Python 3.13 build used for testing was compiled with debug mode enabled (--with-pydebug) and without optimisations (--enable-optimizations). The former flag in particular can have a large impact on performance testing, whilst the latter will have a much smaller, but still significant, impact. In general, it's best to avoid drawing any conclusions about performance issues when testing with development builds of Python.
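Before benchmarking, it can help to confirm how the interpreter in use was actually built. The sketch below is one possible check, not an official recipe: the helper name `describe_build` is hypothetical, and it relies on `sysconfig.get_config_var`, on `sys.gettotalrefcount` (present only in debug builds), and on `sys._is_gil_enabled()` (added in Python 3.13, hence the guard). The configure arguments may be unavailable on some platforms, and a build can be optimized without `--enable-optimizations` appearing there, so treat that field as a hint.

```python
import sys
import sysconfig


def describe_build() -> dict:
    """Summarize build properties that commonly distort benchmark results."""
    # CONFIG_ARGS holds the ./configure arguments; it may be None (e.g. on Windows).
    config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    # sys._is_gil_enabled exists from Python 3.13; older versions always have a GIL.
    gil_check = getattr(sys, "_is_gil_enabled", None)
    return {
        "version": sys.version,
        # sys.gettotalrefcount is only defined in debug (--with-pydebug) builds.
        "debug_build": hasattr(sys, "gettotalrefcount"),
        # Py_GIL_DISABLED is 1 for free-threaded (--disable-gil) builds.
        "free_threaded_build": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
        "gil_enabled_now": bool(gil_check()) if gil_check else True,
        "optimizations_requested": "--enable-optimizations" in config_args,
        "config_args": config_args,
    }


if __name__ == "__main__":
    for key, value in describe_build().items():
        print(f"{key}: {value}")
```

If `debug_build` is true, the timings are probably not representative of a release interpreter, which matches what the question's results showed.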