
In Python 3.12, why does 'Öl' take less memory than 'Ö'?


I just read PEP 393 and learned that Python's str type uses different internal representations depending on the content. I experimented a little and was surprised by the results:

>>> import sys
>>> sys.getsizeof('')
41
>>> sys.getsizeof('H')
42
>>> sys.getsizeof('Hi')
43
>>> sys.getsizeof('Ö')
61
>>> sys.getsizeof('Öl')
59

I understand that in the first three cases, the strings don't contain any non-ASCII characters, so an encoding with 1 byte per char can be used. Putting a non-ASCII character like Ö in a string forces the interpreter to use a different encoding. Therefore, I'm not surprised that 'Ö' takes more space than 'H'.

However, why does 'Öl' take less space than 'Ö'? I assumed that whatever internal representation is used for 'Öl' allows for an even shorter representation of 'Ö'.

I'm using Python 3.12; apparently this is not reproducible in earlier versions.


Solution

  • This test code (the structure layouts follow the CPython 3.12.4 source, and I haven't fully double-checked them):

    import ctypes
    import sys
    
    
    class PyUnicodeObject(ctypes.Structure):
        _fields_ = [
            ("ob_refcnt", ctypes.c_ssize_t),
            ("ob_type", ctypes.c_void_p),
            ("length", ctypes.c_ssize_t),
            ("hash", ctypes.c_ssize_t),
            ("state", ctypes.c_uint64),  # 32-bit state bitfield plus padding, read as 8 bytes (little-endian assumed)
        ]
    
    
    class StateBitField(ctypes.LittleEndianStructure):
        _fields_ = [
            ("interned", ctypes.c_uint, 2),
            ("kind", ctypes.c_uint, 3),
            ("compact", ctypes.c_uint, 1),
            ("ascii", ctypes.c_uint, 1),
            ("statically_allocated", ctypes.c_uint, 1),
            ("_padding", ctypes.c_uint, 24),
        ]
    
        def __repr__(self):
            return ", ".join(f"{k}: {getattr(self, k)}" for k, *_ in self._fields_ if not k.startswith("_"))
    
    
    def dump_s(s: str):
        o = PyUnicodeObject.from_address(id(s))
        state_int = o.state
        state = StateBitField.from_buffer(ctypes.c_uint64(state_int))
        print(f"{s!r}".ljust(8), f"{o.length=}, {sys.getsizeof(s)=}, {state}")
    
    
    dump_s('5')
    dump_s('a')
    dump_s('ä')
    dump_s('vvv')
    dump_s('ÖÖÖ')
    dump_s(str(chr(214)))  # built at runtime so it is not a constant interned from the module source
    dump_s(str(chr(214) + chr(108)))  # built at runtime for the same reason
    

    prints out

    '5'      o.length=1, sys.getsizeof(s)=42, interned: 3, kind: 1, compact: 1, ascii: 1, statically_allocated: 1
    'a'      o.length=1, sys.getsizeof(s)=42, interned: 3, kind: 1, compact: 1, ascii: 1, statically_allocated: 1
    'ä'      o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
    'vvv'    o.length=3, sys.getsizeof(s)=44, interned: 2, kind: 1, compact: 1, ascii: 1, statically_allocated: 0
    'ÖÖÖ'    o.length=3, sys.getsizeof(s)=60, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 0
    'Ö'      o.length=1, sys.getsizeof(s)=61, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 1
    'Öl'     o.length=2, sys.getsizeof(s)=59, interned: 0, kind: 1, compact: 1, ascii: 0, statically_allocated: 0
    

    The smoking gun seems to be statically_allocated, which is set for 'Ö', 'ä', and the other single-character strings, but not for 'Öl'.

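    I think that stems from this line in pycore_runtime_init_generated, where it looks like the runtime statically allocates objects for all single-character Latin-1 strings (among others). As discussed in the comments, this CPython PR added UTF-8 representations to all of these statically allocated strings, so 'Ö' is statically stored as both Latin-1 (1 byte) and UTF-8 (2 bytes). That extra UTF-8 copy is counted in its reported size, whereas 'Öl' is created at runtime without one, which is why the longer string reports the smaller size.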
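
    To make the two stored forms concrete, here is a small sketch that only uses str.encode to compare the Latin-1 and UTF-8 encodings of 'Ö'. It does not inspect the interpreter's internals; the variable names are just for illustration.

```python
# Compare the encoded byte lengths of 'Ö' (U+00D6) in Latin-1 and UTF-8.
s = chr(0xD6)

latin1_bytes = s.encode("latin-1")  # one byte per character
utf8_bytes = s.encode("utf-8")      # two bytes for this code point

print(len(s), len(latin1_bytes), len(utf8_bytes))
```

    The string has one character, but its UTF-8 form needs two bytes, so keeping both forms costs more than keeping the Latin-1 form alone.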

    Also, note that sys.getsizeof() forwards to unicode_sizeof_impl (the implementation of str.__sizeof__), which computes the size from the object's fields; it does not measure the actual allocation.
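
    As a rough sketch of that computation, the sizes in this post are consistent with "header + (length + 1) * bytes per character, plus the UTF-8 copy and its terminator if one is stored". The header sizes below (40 and 56 bytes) are assumptions inferred from the numbers above for a 64-bit CPython 3.12 build, and estimate_sizeof is a hypothetical helper, not a CPython API.

```python
# Assumed struct sizes for a 64-bit CPython 3.12 build, inferred from the
# getsizeof() values quoted above; they are not read from the interpreter.
ASCII_HEADER = 40    # assumed sizeof(PyASCIIObject)
COMPACT_HEADER = 56  # assumed sizeof(PyCompactUnicodeObject)


def estimate_sizeof(s: str, utf8_cached: bool = False) -> int:
    """Hypothetical helper: estimate the reported size of a compact str."""
    max_cp = max(map(ord, s), default=0)
    if max_cp < 0x80:
        # Compact ASCII: one byte per character plus a trailing NUL.
        return ASCII_HEADER + len(s) + 1
    # kind is the number of bytes per character (1, 2 or 4).
    kind = 1 if max_cp < 0x100 else 2 if max_cp < 0x10000 else 4
    size = COMPACT_HEADER + (len(s) + 1) * kind
    if utf8_cached:
        # A separately stored UTF-8 copy is counted too, plus its NUL.
        size += len(s.encode("utf-8", "surrogatepass")) + 1
    return size


for text, cached in [("", False), ("Hi", False), ("Ö", True), ("Öl", False), ("ÖÖÖ", False)]:
    print(repr(text), estimate_sizeof(text, utf8_cached=cached))
```

    With utf8_cached=True for the statically allocated 'Ö' and False for the strings built at runtime, this gives 61 for 'Ö' and 59 for 'Öl', matching the values reported above.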