We can defeat the small-integer cache this way (computing the value at runtime avoids the caching layer):
>>> n = 674039
>>> one1 = 1
>>> one2 = (n ** 9 + 1) % (n ** 9)
>>> one1 == one2
True
>>> one1 is one2
False
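For context, a hedged sketch of that cache's boundary: CPython (as an implementation detail, so the exact range may vary between versions) preallocates the integers -5 through 256, so runtime arithmetic landing in that range returns the cached singleton, while values just outside it produce fresh objects:

```python
# CPython preallocates small ints (-5..256) as singletons; this is an
# implementation detail, not a language guarantee.
n = 256
a = n - 1 + 1      # runtime arithmetic; the result 256 comes from the cache
print(a is n)      # True on CPython: both are the cached 256

m = 257
b = m - 1 + 1      # the result 257 lies outside the cache: a fresh object
print(b is m)      # False on CPython
```

The variables (rather than literals) keep the compiler from constant-folding the arithmetic, which would otherwise reuse the same constant object.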
How can you defeat the small string intern, i.e. produce the following result:
>>> one1 = "1"
>>> one2 = <???>
>>> type(one2) is str and one1 == one2
True
>>> one1 is one2
False
The documentation for sys.intern
mentions that "Interned strings are not immortal", but it gives no context about how a string could be kicked out of the intern dictionary, or how you can create a str
instance that avoids the caching layer.
Since interning is a CPython implementation detail, answers relying on undocumented implementation details are fine/expected.
A unicode object consisting of only one character (with a value smaller than 128, or more precisely from the latin1 range) is the most complicated case, because such strings aren't really interned. Instead (similar to the small-integer pool, and identical to the behavior for bytes) they are created at interpreter start-up and kept in an array for as long as the interpreter is alive:
struct _Py_unicode_state {
...
/* Single character Unicode strings in the Latin-1 range are being
shared as well. */
PyObject *latin1[256];
...
/* This dictionary holds all interned unicode strings...
*/
PyObject *interned;
...
};
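The effect of this array can be observed from pure Python: every construction path that ends in a one-character latin1 string hands back the same preallocated object. A small sketch, relying on this CPython implementation detail:

```python
# All of these build the one-character string "1" through different code
# paths; on CPython each returns the same preallocated latin1 singleton.
a = "1"        # literal
b = chr(49)    # from a code point
c = "123"[0]   # from indexing
print(a is b, a is c)  # True True on CPython
```

No matter how the string is produced, the length-1 fast paths all funnel into the same lookup.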
So every time a unicode object of length 1 is created, its character value is looked up in the latin1
array. E.g. in unicode_decode_utf8
:
/* ASCII is equivalent to the first 128 ordinals in Unicode. */
if (size == 1 && (unsigned char)s[0] < 128) {
    if (consumed) {
        *consumed = 1;
    }
    return get_latin1_char((unsigned char)s[0]);
}
One could even argue that if there were a way to circumvent this in the interpreter, we would be talking about a (performance) bug.
One possibility is to populate the unicode data ourselves using the C-API. I use Cython
for the proof of concept, but ctypes
could be used to the same effect:
%%cython
cdef extern from *:
    """
    PyObject* create_new_unicode(char *ch)
    {
        PyUnicodeObject *ob = (PyUnicodeObject *)PyUnicode_New(1, 127);
        Py_UCS1 *data = PyUnicode_1BYTE_DATA(ob);
        data[0] = ch[0]; // fill data without going through unicode_decode_utf8
        return (PyObject*)ob;
    }
    """
    object create_new_unicode(char *ch)

def gen1():
    return create_new_unicode(b"1")
Noteworthy details:

- PyUnicode_New would not look the character up in latin1, because the characters aren't set yet.
- We pass 127 as maxchar to PyUnicode_New. As a result, we can access the data via PyUnicode_1BYTE_DATA, which makes it easy to manipulate it manually without much ado.

And now:
a, b = gen1(), gen1()
a is b, a == b
# yields (False, True)
as wanted.
Here is a similar idea, but implemented with ctypes:
from ctypes import py_object, c_ssize_t, pythonapi

PyUnicode_New = pythonapi.PyUnicode_New
PyUnicode_New.argtypes = (c_ssize_t, c_ssize_t)
PyUnicode_New.restype = py_object

PyUnicode_CopyCharacters = pythonapi._PyUnicode_FastCopyCharacters
PyUnicode_CopyCharacters.argtypes = (py_object, c_ssize_t, py_object, c_ssize_t, c_ssize_t)
PyUnicode_CopyCharacters.restype = c_ssize_t

def clone(orig):
    cloned = PyUnicode_New(1, 127)
    PyUnicode_CopyCharacters(cloned, 0, orig, 0, 1)
    return cloned
Noteworthy details:

- PyUnicode_1BYTE_DATA cannot be used with ctypes, because it is a macro. An alternative would be to calculate the offset to the data member and access this memory directly (but that depends on the platform and doesn't feel very portable).
- PyUnicode_CopyCharacters is used instead (there are probably also other possibilities to achieve the same), which is more abstract and portable than directly calculating/accessing the memory.
- _PyUnicode_FastCopyCharacters is used, because PyUnicode_CopyCharacters would check that the target unicode has multiple references and throw. _PyUnicode_FastCopyCharacters doesn't perform those checks and does as asked.

And now:
a = "1"
b = clone(a)
a is b, a == b
# yields (False, True)
For strings longer than one character, it is a lot easier to avoid interning, e.g.:
a = "12"
b = "123"[0:2]
a is b, a == b
# yields (False, True)
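Any multi-character string assembled at runtime works the same way, since such results are not looked up in the interned dict (again a CPython implementation detail):

```python
# join() builds a new str object at runtime; multi-character results are
# not checked against the interned dict, so we get a distinct object.
a = "12"
b = "".join(["1", "2"])
print(a is b, a == b)  # False True on CPython
```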