python string python-internals python-3.12

In Python 3.12, why does 'Öl' take less memory than 'Ö'?

I just read PEP 393 and learned that Python's str type uses different internal representations depending on the content. So I experimented a bit and was surprised by the results:

>>> sys.getsizeof('')
41
>>> sys.getsizeof('H')
42
>>> sys.getsizeof('Hi')
43
>>> sys.getsizeof('Ö')
61
>>> sys.getsizeof('Öl')
59

I understand that in the first three cases, the strings don't contain any non-ASCII characters, so an encoding with 1 byte per char can be used. Putting a non-ASCII character like Ö in a string forces the interpreter to use a different encoding. Therefore, I'm not surprised that 'Ö' takes more space than 'H'.

However, why does 'Öl' take less space than 'Ö'? I assumed that whatever internal representation is used for 'Öl' allows for an even shorter representation of 'Ö'.

I'm using Python 3.12; apparently this is not reproducible in earlier versions.

Solution

This test code (the structure layouts follow the 3.12.4 source only, and even so I haven't thoroughly double-checked them)

import ctypes
import sys


class PyUnicodeObject(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
        ("length", ctypes.c_ssize_t),
        ("hash", ctypes.c_ssize_t),
        ("state", ctypes.c_uint64),
    ]


class StateBitField(ctypes.LittleEndianStructure):
    _fields_ = [
        ("interned", ctypes.c_uint, 2),
        ("kind", ctypes.c_uint, 3),
        ("compact", ctypes.c_uint, 1),
        ("ascii", ctypes.c_uint, 1),
        ("statically_allocated", ctypes.c_uint, 1),
        ("_padding", ctypes.c_uint, 24),
    ]

    def __repr__(self):
        return ", ".join(f"{k}: {getattr(self, k)}" for k, *_ in self._fields_ if not k.startswith("_"))


def dump_s(s: str):
    o = PyUnicodeObject.from_address(id(s))
    state_int = o.state
    state = StateBitField.from_buffer(ctypes.c_uint64(state_int))
    print(f"{s!r}".ljust(8), f"{o.length=}, {sys.getsizeof(s)=}, {state}")


dump_s('5')
dump_s('a')
dump_s('ä')
dump_s('vvv')
dump_s('ÖÖÖ')
dump_s(str(chr(214)))  # built at runtime so it is not a constant interned from the module source
dump_s(str(chr(214) + chr(108)))  # built at runtime so it is not a constant interned from the module source

prints out

'5'      o.length=1, sys.getsizeof(s)=42, interned: 3, kind: 1, compact: 1, ascii: 1, statically_allocated: 1
'a'      o.length=1, sys.getsizeof(s)=42, interned: 3, kind: 1, compact: 1, ascii: 1, statically_allocated: 1
'ä'      o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
'vvv'    o.length=3, sys.getsizeof(s)=44, interned: 2, kind: 1, compact: 1, ascii: 1, statically_allocated: 0
'ÖÖÖ'    o.length=3, sys.getsizeof(s)=60, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 0
'Ö'      o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
'Öl'     o.length=2, sys.getsizeof(s)=59, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 0

The smoking gun seems to be the statically_allocated flag, which is set on 'Ö' (and on the other single-character strings) but not on 'Öl'.

I think that stems from this line in pycore_runtime_init_generated, where it looks like the runtime statically allocates objects for all single-character Latin-1 strings (among others). As discussed in the comments, this CPython PR added UTF-8 representations of all of these statically allocated strings, so Ö is statically stored as both Latin-1 (1 byte) and UTF-8 (2 bytes). The reported size counts both buffers, which is why 'Ö' comes out at 61 bytes while 'Öl', which has no cached UTF-8 copy, comes out at 59.
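The numbers above can be reproduced with a bit of arithmetic. The following is only a sketch: the header sizes 40 and 56 are assumptions for a 64-bit CPython 3.12 build (inferred from the getsizeof values in the question), and expected_size is a hypothetical helper, not anything CPython provides.

```python
ASCII_HEADER = 40    # assumed sizeof(PyASCIIObject) on a 64-bit build
COMPACT_HEADER = 56  # assumed sizeof(PyCompactUnicodeObject) on a 64-bit build


def expected_size(s: str, utf8_cached: bool = False) -> int:
    """Estimate sys.getsizeof(s) for a string stored with 1 byte per character."""
    if max(map(ord, s), default=0) > 0xFF:
        raise ValueError("this sketch only covers ASCII and Latin-1 strings")
    if s.isascii():
        # ASCII data is already valid UTF-8, so no separate buffer is counted.
        return ASCII_HEADER + len(s) + 1
    size = COMPACT_HEADER + len(s) + 1  # Latin-1 data plus NUL terminator
    if utf8_cached:
        # A cached UTF-8 copy adds its byte length plus a NUL terminator.
        size += len(s.encode("utf-8")) + 1
    return size


cases = [("", False), ("H", False), ("Hi", False),
         ("Öl", False), ("ÖÖÖ", False), ("Ö", True)]
for text, cached in cases:
    print(repr(text), expected_size(text, cached))
```

With utf8_cached=True for 'Ö' this gives 56 + 2 + 3 = 61, while 'Öl' without a cached copy gives 56 + 3 = 59, matching the output above.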

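Also, I should note that for strings getsizeof() actually forwards to unicode_sizeof_impl, which computes the size from the object's fields (including a cached UTF-8 buffer, if there is one); it's not just measuring allocated memory.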
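The effect of a cached UTF-8 buffer on the reported size can be observed on an ordinary string as well. This is a minimal sketch that assumes CPython: it calls the C API function PyUnicode_AsUTF8AndSize through ctypes.pythonapi, which makes the interpreter create and cache the UTF-8 form of a non-ASCII string. The variable names are made up for the example.

```python
import ctypes
import sys

# Declare the C API function's signature so ctypes passes the str object itself.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
as_utf8.restype = ctypes.c_char_p

s = chr(214) + chr(108)      # 'Öl', built at runtime, no UTF-8 copy cached yet
before = sys.getsizeof(s)
as_utf8(s, None)             # asks CPython for the UTF-8 form, which it caches on the object
after = sys.getsizeof(s)

print(f"{before=}, {after=}, difference={after - before}")
```

The difference should be the UTF-8 length of 'Öl' (3 bytes) plus one NUL terminator, which is the same extra cost that the statically allocated 'Ö' carries from the start.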