c++ c utf-8 strncpy

UTF-8 aware strncpy


I find it hard to believe I'm the first person to run into this problem, but I searched for quite some time and didn't find a solution.

I'd like to use strncpy, but have it be UTF-8 aware so that it never writes only part of a UTF-8 code point into the destination string.

Otherwise, when the source string is longer than the maximum length, you can never be sure that the resulting string is valid UTF-8, even if you know the source is.

Validating the resulting string can work, but if this is called a lot it would be better to have a strncpy-like function that does the check itself.
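For reference, the "copy, then validate" route could look roughly like the sketch below. It only inspects the tail of the copied bytes rather than the whole string, and it assumes the source is valid UTF-8 with sequences of at most 4 bytes. The names utf8_trim_partial_tail and copy_then_trim are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: buf holds the first len bytes of a valid UTF-8
 * string. Returns len with an incomplete trailing sequence removed.
 * Input that does not look like UTF-8 is left unchanged. */
static size_t utf8_trim_partial_tail(const char *buf, size_t len)
{
    size_t i = len;
    size_t need;
    unsigned char c;

    /* Step back over at most 3 continuation bytes (10xxxxxx). */
    while (i > 0 && len - i < 3 && ((unsigned char)buf[i - 1] & 0xC0) == 0x80) {
        i--;
    }
    if (i == 0) {
        return len; /* empty, or no lead byte found */
    }

    c = (unsigned char)buf[i - 1]; /* first byte of the last sequence */
    if (c < 0x80) need = 1;
    else if ((c & 0xE0) == 0xC0) need = 2;
    else if ((c & 0xF0) == 0xE0) need = 3;
    else if ((c & 0xF8) == 0xF0) need = 4;
    else return len; /* not a lead byte */

    /* Bytes actually present for the last sequence: len - (i - 1). */
    return (len - (i - 1) < need) ? i - 1 : len;
}

/* Hypothetical wrapper: byte-limited copy followed by the tail check.
 * dst_size counts the terminating '\0'. Returns the bytes written,
 * not counting the terminator. */
static size_t copy_then_trim(char *dst, const char *src, size_t dst_size)
{
    size_t n;

    if (dst_size == 0) {
        return 0;
    }
    n = strlen(src);
    if (n > dst_size - 1) {
        n = dst_size - 1;
    }
    memcpy(dst, src, n);
    n = utf8_trim_partial_tail(dst, n);
    dst[n] = '\0';
    return n;
}
```

This costs a strlen plus a constant-time check per call, which is the overhead I was hoping to avoid with a dedicated copy function.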

glib has g_utf8_strncpy, but it copies a given number of Unicode characters, whereas I'm looking for a copy function that limits by byte length.

To be clear, by "UTF-8 aware" I mean that it must not exceed the size of the destination buffer and must never copy only part of a UTF-8 code point. (Given valid UTF-8 input, it must never produce invalid UTF-8 output.)


Note:

Some replies have pointed out that strncpy pads the rest of the destination with null bytes and does not guarantee null termination. In retrospect I should have asked for a UTF-8 aware strlcpy, but at the time I didn't know this function existed.
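A small sketch of the two strncpy behaviours mentioned above, using only standard C. The helper name strncpy_nul_count and the 4-byte buffer are arbitrary choices for the illustration.

```c
#include <string.h>

/* Hypothetical demo: strncpy src into a 4-byte buffer that was filled
 * with 'x' beforehand, then count how many '\0' bytes it contains. */
static size_t strncpy_nul_count(const char *src)
{
    char buf[4];
    size_t i;
    size_t count = 0;

    memset(buf, 'x', sizeof buf);
    strncpy(buf, src, sizeof buf);

    for (i = 0; i < sizeof buf; i++) {
        if (buf[i] == '\0') {
            count++;
        }
    }
    return count;
}
```

With a short source such as "a" the count is 3, because strncpy zero-fills the remainder of the buffer. With a source of 4 bytes or more, such as "abcdef", the count is 0: the buffer is full and there is no terminator at all.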


Solution

  • To answer my own question, here's the C function I ended up with (I'm not using C++ for this project):

    Notes:
    - This is not a clone of strncpy for UTF-8; it's more like strlcpy from OpenBSD.
    - utf8_skip_data is copied from glib's gutf8.c.
    - It doesn't validate the UTF-8, which is what I intended.

    Hope this is useful to others. I'm interested in feedback, but please no pedantic zealotry about null-termination behavior unless it's an actual bug or misleading/incorrect behavior.

    Thanks to James Kanze, who provided the basis for this; his version was incomplete and in C++ (I need a C version).

    #include <stddef.h>

    /* Sequence length in bytes, indexed by the first byte of a sequence. */
    static const size_t utf8_skip_data[256] = {
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
        2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
        3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,6,6,1,1
    };
    
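    /* Copy src into dst, writing at most maxncpy bytes including the
     * terminating '\0', and never splitting a multi-byte sequence.
     * src must be valid UTF-8: a truncated sequence at the end of src
     * would make the loop read past the terminator. */
    char *strlcpy_utf8(char *dst, const char *src, size_t maxncpy)
    {
        char *dst_r = dst;
        size_t utf8_size;
    
        if (maxncpy > 0) {
            /* "< maxncpy" rather than "<=" keeps one byte free for the '\0'. */
            while (*src != '\0' && (utf8_size = utf8_skip_data[*((const unsigned char *)src)]) < maxncpy) {
                maxncpy -= utf8_size;
                /* Copy utf8_size bytes; every case falls through on purpose. */
                switch (utf8_size) {
                    case 6: *dst++ = *src++; /* fall through */
                    case 5: *dst++ = *src++; /* fall through */
                    case 4: *dst++ = *src++; /* fall through */
                    case 3: *dst++ = *src++; /* fall through */
                    case 2: *dst++ = *src++; /* fall through */
                    case 1: *dst++ = *src++;
                }
            }
            *dst = '\0';
        }
        return dst_r;
    }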
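
Since the function above trusts the lead byte, a source that ends in the middle of a sequence makes it read past the terminator. If that matters, a guarded variant could look roughly like the sketch below. This is not the code I use, just an illustration: the names utf8_seq_len and strlcpy_utf8_checked are invented here, and it assumes the modern 4-byte limit on sequences instead of the 6-byte glib table. It still does not validate continuation bytes.

```c
#include <stddef.h>

/* Sequence length implied by a lead byte, assuming at most 4 bytes per
 * sequence. Bytes that cannot start a sequence count as length 1. */
static size_t utf8_seq_len(unsigned char c)
{
    if (c < 0xC0) return 1; /* ASCII or a stray continuation byte */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    return 1;               /* 0xF8..0xFF */
}

/* Hypothetical variant of strlcpy_utf8: same byte limit (maxncpy counts
 * the terminating '\0'), but it stops instead of reading past the end
 * of src when the last sequence is truncated. */
char *strlcpy_utf8_checked(char *dst, const char *src, size_t maxncpy)
{
    char *dst_r = dst;

    if (maxncpy == 0) {
        return dst_r;
    }
    maxncpy--; /* reserve one byte for the terminator */

    while (*src != '\0') {
        size_t len = utf8_seq_len((unsigned char)*src);
        size_t i;

        if (len > maxncpy) {
            break; /* the whole sequence does not fit */
        }
        /* Make sure all len bytes exist before the terminator. */
        for (i = 1; i < len; i++) {
            if (src[i] == '\0') {
                break;
            }
        }
        if (i < len) {
            break; /* source ends mid-sequence */
        }
        for (i = 0; i < len; i++) {
            *dst++ = *src++;
        }
        maxncpy -= len;
    }
    *dst = '\0';
    return dst_r;
}
```

For example, copying "a", then U+00E9 (bytes C3 A9), then "b" into a 3-byte buffer leaves just "a", because the two-byte sequence plus the terminator would not fit. A 4-byte buffer gets "a" and the full two-byte sequence.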