pythoncachingimplementationcpythonstring-interning

How to create the str "1" at two different memory locations?


We are able to defeat the small integer intern in this way (a calculation allows us to avoid the caching layer):

>>> n = 674039
>>> one1 = 1
>>> one2 = (n ** 9 + 1) % (n ** 9)
>>> one1 == one2
True
>>> one1 is one2
False

How can you defeat the small string intern, i.e. to see the following result:

>>> one1 = "1"
>>> one2 = <???>
>>> type(one2) is str and one1 == one2
True
>>> one1 is one2
False

sys.intern mentions that "Interned strings are not immortal", but there's no context about how a string could kicked out of the intern, or how you can create a str instance avoiding the caching layer.

Since interning is CPython implementation detail, answers relying on undocumented implementation details are ok/expected.


Solution

  • Unicode consisting of only one character (with value smaller than 128 or more precisely from latin1) is the most complicated case, because those strings aren't really interned but (more similar to the integer pool or identically to the behavior for bytes) are created at the start and are stored in an array as long as the interpreter is alive:

    truct _Py_unicode_state {
        ...
        /* Single character Unicode strings in the Latin-1 range are being
           shared as well. */
        PyObject *latin1[256];
        ...
        /* This dictionary holds all interned unicode strings...
        */
        PyObject *interned;
        ...
    };
    

    So every time a length 1 unicode is created, the character value gets looked up if it is in the latin1-array. E.g. in unicode_decode_utf8:

    /* ASCII is equivalent to the first 128 ordinals in Unicode. */
        if (size == 1 && (unsigned char)s[0] < 128) {
            if (consumed) {
                *consumed = 1;
            }
            return get_latin1_char((unsigned char)s[0]);
        }
    

    One could even argue, if there is a way to circumvent this in the interpreter - we speak about a (performance-) bug.

    A possibility is to populate the unicode-data by ourselves using C-API. I use Cython for the proof of concept, but also ctypes could be used to the same effect:

    %%cython
    cdef extern from *:
        """
        PyObject* create_new_unicode(char *ch) 
        {
           PyUnicodeObject *ob = (PyUnicodeObject *)PyUnicode_New(1, 127);
           Py_UCS1 *data = PyUnicode_1BYTE_DATA(ob);
           data[0]=ch[0]; //fill data without using the unicode_decode_utf8
           return (PyObject*)ob;
        }
        """
        object create_new_unicode(char *ch)
        
    def gen1():
        return create_new_unicode(b"1")
    

    Noteworthy details:

    And now:

    a,b=gen1(), gen1()
    a is b, a == b
    # yields (False, True)
    

    as wanted.


    Here is a similar idea, but implemented with ctypes:

    from ctypes import POINTER, py_object, c_ssize_t, byref, pythonapi
    PyUnicode_New = pythonapi.PyUnicode_New
    PyUnicode_New.argtypes = (c_ssize_t, c_ssize_t)
    PyUnicode_New.restype = py_object
    PyUnicode_CopyCharacters = pythonapi._PyUnicode_FastCopyCharacters
    PyUnicode_CopyCharacters.argtypes = (py_object, c_ssize_t, py_object, c_ssize_t, c_ssize_t)
    PyUnicode_CopyCharacters.restype = c_ssize_t
    
    def clone(orig):
        cloned = PyUnicode_New(1,127)
        PyUnicode_CopyCharacters(cloned, 0, orig, 0, 1)
        return cloned
    

    Noteworthy details:

    And now:

    a="1"
    b=clone(a)
    a is b, a==b
    # yields (False, True)
    

    For strings longer than 1 character, it is a lot easier to avoid interning, e.g.:

    a="12"
    b="123"[0:2]
    a is b, a == b
    #yields (False, True)