Tags: python, performance, python-internals

Why is bytes(lst) slower than bytearray(lst)?


With lst = [0] * 10**6 I get times like these:

  5.4 ± 0.4 ms  bytearray(lst)
  5.6 ± 0.4 ms  bytes(bytearray(lst))
 13.1 ± 0.7 ms  bytes(lst)

Python:
3.13.0 (main, Nov  9 2024, 10:04:25) [GCC 14.2.1 20240910]
namespace(name='cpython', cache_tag='cpython-313', version=sys.version_info(major=3, minor=13, micro=0, releaselevel='final', serial=0), hexversion=51183856, _multiarch='x86_64-linux-gnu')

I would've expected bytes(lst) and bytearray(lst) to be equally fast or bytearray(lst) to be the slower one, since it's the more complicated type (has more functionality, as it's mutable). But it's the other way around. And even the detour bytes(bytearray(lst)) is much faster than bytes(lst)! Why is bytes(lst) so slow?

Benchmark script (Attempt This Online!):

from timeit import timeit
from statistics import mean, stdev
import random
import sys

setup = 'lst = [0] * 10**6'

codes = [
    'bytes(lst)',
    'bytearray(lst)',
    'bytes(bytearray(lst))'
]

times = {c: [] for c in codes}
def stats(c):
    ts = [t * 1e3 for t in sorted(times[c])[:5]]
    return f'{mean(ts):5.1f} ± {stdev(ts):3.1f} ms '
for _ in range(25):
    random.shuffle(codes)
    for c in codes:
        t = timeit(c, setup, number=10) / 10
        times[c].append(t)
for c in sorted(codes, key=stats):
    print(stats(c), c)

print('\nPython:')
print(sys.version)
print(sys.implementation)

Inspired by an answer that also found bytearray to be faster.


Solution

  • As pointed out by @Homer512 in the comments, bytearray and bytes are implemented quite differently in CPython. While bytearray was given a fast path for creation from a list or tuple in GitHub issue #91149, bytes did not receive the same optimization.

    This fast path for bytearray takes advantage of PySequence_Fast_ITEMS, which, given a list or a tuple, returns the underlying C array of object pointers, so all items can be obtained by simply iterating over that array.

    The creation of bytes from a list, on the other hand, has to make a PyList_GET_ITEM call for each index to fetch each item, hence the comparatively poor performance.
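    Consistent with a fast path that applies only to lists and tuples, constructing a bytearray from a plain iterator (which cannot expose an item array) falls back to generic item-by-item iteration and is noticeably slower. A small sketch to illustrate; the timings are machine-dependent and the variable names are mine:

    ```python
    from timeit import timeit

    lst = [0] * 10**6

    # Fast path: bytearray(list) can walk the list's internal item array directly.
    t_list = timeit('bytearray(lst)', globals=globals(), number=10) / 10
    # No fast path: a plain iterator forces the generic iteration protocol.
    t_iter = timeit('bytearray(iter(lst))', globals=globals(), number=10) / 10

    print(f'from list:     {t_list * 1e3:6.1f} ms')
    print(f'from iterator: {t_iter * 1e3:6.1f} ms')

    # Both routes produce the same result.
    assert bytearray(lst) == bytearray(iter(lst))
    ```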

    I've refactored the code for bytes creation from a list or tuple in my PR #128214, which, benchmarked with:

    python -m timeit -n 100 -s 'a = [40] * 100000' 'bytes(a)'
    

    shows an ~81% increase in performance:

    100 loops, best of 5: 3.43 msec per loop # current
    100 loops, best of 5: 1.89 msec per loop # PR #128214
    

    and is indeed faster than the same benchmark constructing bytearray instead:

    100 loops, best of 5: 2.14 msec per loop
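    Until a fix along those lines ships in a release, the measurements in the question suggest the bytes(bytearray(lst)) detour as a practical workaround on 3.13 for converting large lists. A minimal sketch; the helper name is mine, not from the post:

    ```python
    def fast_bytes(lst):
        """Convert a list of ints in range(256) to bytes via the bytearray fast path.

        On CPython 3.13 this takes the optimized bytearray-from-list route
        plus a cheap bytearray-to-bytes copy, which the question's timings
        show to be roughly twice as fast as bytes(lst) directly.
        """
        return bytes(bytearray(lst))

    assert fast_bytes([72, 105]) == b'Hi'
    assert fast_bytes([0] * 10**6) == bytes(10**6)  # million zero bytes
    ```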