c++ unicode compiler-warnings c++23 char8-t

Can you warn/error when mixing char8_t and char32_t in expressions?

I have a code base which makes extensive use of char8_t and char32_t to represent UTF-8 code units and Unicode code points respectively. A common mistake/bug in this code base is to compare char8_t to char32_t literals, or call functions taking char32_t using a char8_t argument.

While the implicit conversion char8_t -> char32_t loses no precision, mixing the two is conceptually wrong:

bool contains_oe(std::u8string_view str) {
    for (char8_t c : str)
        if (c == U'ö') // comparison always fails
            return true;
    return false;
}

Assuming that str is correctly UTF-8 encoded, this function always returns false. ö is U+00F6, so every code unit is compared against 0xF6, but ö is encoded in UTF-8 as the two code units 0xC3 0xB6, and 0xF6 never appears as a code unit in valid UTF-8.

A bug like this could have been easily prevented if I could somehow detect comparisons of char8_t and char32_t automatically.

Is there a way to do that using GCC compiler flags, Clang compiler flags, clang-tidy, or some other automatic tool?

Solution

As of GCC 15 and Clang/Clang-Tidy 20, there don't seem to be any warnings or checks that would help in this case. However, you could write your own Clang-Tidy check.

Alternatively, you could design your interfaces to avoid implicit conversions by using either an enum class or a class. For instance:

class CodeUnit
{
public:
    explicit CodeUnit(char8_t codeUnit = {}) : codeUnit(codeUnit) {}
    explicit CodeUnit(char32_t codeUnit) = delete;
    bool operator==(const CodeUnit&) const = default;

    // ...

private:
    char8_t codeUnit;
};

You can then keep functions that need to use code units (e.g. conversion to UTF-32) inside the CodeUnit class. Then if you try to do

bool contains_oe(std::span<const CodeUnit> str) {
    for (CodeUnit c : str)
        if (c == U'ö') // error: char32_t does not implicitly convert to CodeUnit
            return true;
    return false;
}

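you get a compile-time error, because the constructors are explicit and nothing converts a char32_t to a CodeUnit implicitly.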

Of course, that doesn't solve the root issue, but it will at least limit where it can occur.