c++ unicode compiler-warnings c++23 char8-t

Can you warn/error when mixing char8_t and char32_t in expressions?


I have a code base which makes extensive use of char8_t and char32_t to represent UTF-8 code units and Unicode code points respectively. A common bug in this code base is to compare a char8_t to a char32_t literal, or to call a function taking char32_t with a char8_t argument.

While the conversion char8_t -> char32_t loses no information, it is conceptually wrong:

bool contains_oe(std::u8string_view str) {
    for (char8_t c : str)
        if (c == U'ö') // comparison always fails
            return true;
    return false;
}

Assuming that str is correctly UTF-8 encoded, this function always returns false because ö is UTF-8 encoded as 0xC3 0xB6. Also, ö is U+00F6, and no UTF-8 code unit can be 0xF6.

A bug like this could have been easily prevented if I could somehow detect comparisons of char8_t and char32_t automatically.

Is there a way to do that using GCC compiler flags, Clang compiler flags, clang-tidy, or some other automatic tool?


Solution

  • As of GCC 15 and Clang/Clang-Tidy 20, there seem to be no warnings or checks that would catch this case. However, you could write your own Clang-Tidy check.


    Alternatively, you could design your interfaces to avoid implicit conversions by using either an enum class or a class. For instance:

    class CodeUnit
    {
    public:
        explicit CodeUnit(char8_t codeUnit = {}) : codeUnit(codeUnit) {}
        explicit CodeUnit(char32_t codeUnit) = delete;
        bool operator==(const CodeUnit&) const = default;
    
        // ...
    
    private:
        char8_t codeUnit;
    };
    

    You can then keep functions that need to work with raw code units (e.g. conversion to UTF-32) inside the CodeUnit class. Then, if you try to write

    bool contains_oe(std::span<const CodeUnit> str) {
        for (CodeUnit c : str)
            if (c == U'ö') // error: no operator== for CodeUnit and char32_t
                return true;
        return false;
    }
    

    you get a compile-time error instead of a comparison that is silently always false.

    Of course, that doesn't solve the root issue, but it will at least limit where it can occur.