I was looking into how Python represents strings after PEP 393 and I do not understand the difference between PyASCIIObject and PyCompactUnicodeObject.
My understanding is that strings are represented with the following structures:
typedef struct {
    PyObject_HEAD
    Py_ssize_t length;          /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    struct {
        unsigned int interned:2;
        unsigned int kind:3;
        unsigned int compact:1;
        unsigned int ascii:1;
        unsigned int ready:1;
        unsigned int :24;
    } state;
    wchar_t *wstr;              /* wchar_t representation (null-terminated) */
} PyASCIIObject;

typedef struct {
    PyASCIIObject _base;
    Py_ssize_t utf8_length;
    char *utf8;
    Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

typedef struct {
    PyCompactUnicodeObject _base;
    union {
        void *any;
        Py_UCS1 *latin1;
        Py_UCS2 *ucs2;
        Py_UCS4 *ucs4;
    } data;
} PyUnicodeObject;
Correct me if I am wrong, but my understanding is that PyASCIIObject is used for strings containing only ASCII characters, PyCompactUnicodeObject embeds PyASCIIObject and is used for strings with at least one non-ASCII character, and PyUnicodeObject is used for legacy functions. Is that correct?
Also, why does PyASCIIObject use wchar_t? Isn't a char enough to represent ASCII strings? In addition, if PyASCIIObject already has a wchar_t pointer, why does PyCompactUnicodeObject also have a char pointer? My understanding is that both pointers point to the same location, but why would you include both?
PEP 393 is really the best reference for your questions, though the C API docs are sometimes needed too. Let's address your questions one by one:
You have the types right. But there is one non-obvious wrinkle: when you're using either of the "compact" types (PyASCIIObject or PyCompactUnicodeObject), the structure itself is just a header. The string's actual data is stored immediately after the structure in memory. The encoding used by the data is described by the kind field, and will depend on the largest character value in the string.
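You can observe this from C through the public accessor macros, which hide exactly where the data lives. Here is a minimal sketch, assuming a regular CPython 3.3+ build compiled and linked against libpython, with error checking omitted for brevity; the string literals and variable names are just illustrative:

#include <stdio.h>
#include <Python.h>

int main(void)
{
    Py_Initialize();

    /* Three strings whose widest code point forces a different "kind". */
    PyObject *strings[] = {
        PyUnicode_FromString("hello"),            /* ASCII only             */
        PyUnicode_FromString("h\xc3\xa9llo"),     /* U+00E9 -> 1-byte kind   */
        PyUnicode_FromString("h\xe2\x82\xacllo"), /* U+20AC -> 2-byte kind   */
    };

    for (int i = 0; i < 3; i++) {
        PyObject *s = strings[i];
        int kind = (int)PyUnicode_KIND(s);   /* bytes per stored character  */
        void *data = PyUnicode_DATA(s);      /* block right after the header */
        printf("ascii=%d compact=%d kind=%d len=%ld first=U+%04X\n",
               PyUnicode_IS_ASCII(s) ? 1 : 0,
               PyUnicode_IS_COMPACT(s) ? 1 : 0,
               kind,
               (long)PyUnicode_GET_LENGTH(s),
               (unsigned int)PyUnicode_READ(kind, data, 0));
        Py_DECREF(s);
    }

    Py_Finalize();
    return 0;
}

For the first string the header is a bare PyASCIIObject; for the other two it is a PyCompactUnicodeObject, and the reported kind grows with the widest code point in the string.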
The wstr and utf8 pointers in the first two structures are places where a transformed representation can be stored if one is requested by C code. For an ASCII string (using the PyASCIIObject), no cache pointer is needed for UTF-8 data, since the ASCII data itself is UTF-8 compatible. The wide-character cache is only used by deprecated functions.

The two cache pointers will never point to the same place, since their types are not directly compatible. For compact strings, they are only allocated when a function that needs a UTF-8 buffer (e.g. PyUnicode_AsUTF8AndSize) or a Py_UNICODE buffer (e.g. the deprecated PyUnicode_AS_UNICODE) gets called.
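You can see the UTF-8 cache come into existence by asking for it explicitly. Another small sketch, under the same assumptions as above (regular CPython 3.3+ build, error handling omitted): for the pure-ASCII string the returned pointer is simply the inline data after the header, while for the non-ASCII string a separate UTF-8 buffer is allocated the first time you ask and kept alive as long as the string object itself.

#include <stdio.h>
#include <Python.h>

int main(void)
{
    Py_Initialize();

    PyObject *ascii = PyUnicode_FromString("plain ascii");
    PyObject *other = PyUnicode_FromString("caf\xc3\xa9");   /* "café" */

    Py_ssize_t n;
    const char *u8;

    /* For ASCII strings the UTF-8 view is just the data after the header. */
    u8 = PyUnicode_AsUTF8AndSize(ascii, &n);
    printf("ascii: %s (%ld bytes), aliases data: %d\n",
           u8, (long)n, (void *)u8 == PyUnicode_DATA(ascii));

    /* For non-ASCII strings this call allocates and caches a UTF-8 buffer. */
    u8 = PyUnicode_AsUTF8AndSize(other, &n);
    printf("other: %s (%ld bytes), aliases data: %d\n",
           u8, (long)n, (void *)u8 == PyUnicode_DATA(other));

    Py_DECREF(ascii);
    Py_DECREF(other);
    Py_Finalize();
    return 0;
}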
For strings created with the deprecated Py_UNICODE-based APIs, the wstr pointer has an extra use. It points to the only version of the string data until the PyUnicode_READY macro is called on the string. The first time the string is readied, a new data buffer is created and the characters are stored in it, using the most compact of the Latin-1, UCS-2 and UCS-4 representations. The wstr buffer is kept when it can be shared with that new data buffer (i.e. when the chosen width matches the platform's wchar_t); otherwise it is freed, and deprecated API functions that want a Py_UNICODE view of the string will rebuild it on demand.
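To make that legacy path concrete, here is a sketch of what creating and readying such a string looks like. This only applies to CPython 3.3 through 3.11, and the APIs used are deprecated, so expect compiler warnings on recent versions; again, error checking is mostly omitted:

#include <stdio.h>
#include <Python.h>

int main(void)
{
    Py_Initialize();

    /* Create an "unready" legacy string: only the wstr buffer exists so far. */
    PyObject *s = PyUnicode_FromUnicode(NULL, 3);      /* deprecated API */
    Py_UNICODE *w = PyUnicode_AS_UNICODE(s);           /* i.e. the wstr pointer */
    w[0] = 'a'; w[1] = 'b'; w[2] = 'c';

    printf("ready before: %d\n", PyUnicode_IS_READY(s) ? 1 : 0);

    /* Readying builds the compact data buffer from the wstr contents. */
    if (PyUnicode_READY(s) < 0) {
        PyErr_Print();
        return 1;
    }
    printf("ready after: %d, kind=%d\n",
           PyUnicode_IS_READY(s) ? 1 : 0, (int)PyUnicode_KIND(s));

    Py_DECREF(s);
    Py_Finalize();
    return 0;
}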
It is interesting that you're asking about CPython's internal string representations right now, as there's a discussion currently ongoing about whether the deprecated string API functions and implementation details like the wchar_t *wstr pointer can be removed in an upcoming version of Python. It looks like it might happen for Python 3.11.0 (which is expected to be released in 2022), though plans could still change before then, especially if the impact on code being used in the wild is more severe than expected.