python python-c-api

u# format character removed from Python 3.12 C-API, how to account for it?


A bunch of Unicode-related functionality was removed from the Python 3.12 C API. Unfortunately for me, a very old piece of code (~2010) in our library uses it, and I need to migrate that code since we're looking to upgrade to 3.12 eventually. One thing I'm specifically struggling with is the removal of the u# format unit. The following code parses a single positional argument passed to foo, which must be a Unicode string, and stores a pointer to its characters in input and their count in length:

static PyObject *
foo(PyObject *self, PyObject *args) {
    Py_UNICODE *input;
    Py_ssize_t length;
    
    if (!PyArg_ParseTuple(args, "u#", &input, &length)) {
        return NULL;
    }
    ...
}

However, according to the docs, the u# has been removed:

Changed in version 3.12: u, u#, Z, and Z# are removed because they used a legacy Py_UNICODE* representation.

and the current code simply fails with something like bad-format-character when the extension is compiled and called from Python.

Py_UNICODE is just a typedef for wchar_t, so that part is easily fixed. But with u# removed, I am not sure how to get PyArg_ParseTuple to accept Unicode input arguments. Using s# instead of u# does not work since it won't handle anything wide-char. How do I migrate this call to Python 3.12?


Solution

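  • s# handles Unicode fine, but it gives you a UTF-8 encoded char * buffer rather than wchar_t. If you specifically need a wchar_t representation, you can get one from a string object (obtained, for example, by parsing the argument with the U or O format unit) with PyUnicode_AsWideCharString: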

    Py_ssize_t size;
    wchar_t *wchar_representation = PyUnicode_AsWideCharString(stringobj, &size);
    
    if (!wchar_representation) {
        // An error occurred and a Python exception is set; handle it (typically by returning NULL).
    }
    
    // do stuff with wchar_representation, then once you're done,
    
    PyMem_Free(wchar_representation);
    

    Unlike the old Py_UNICODE API, this allocates a new buffer, which you have to free with PyMem_Free when you're done with it.