I have a code base which makes extensive use of char8_t and char32_t to represent UTF-8 code units and Unicode code points respectively.
A common mistake/bug in this code base is to compare char8_t to char32_t literals, or call functions taking char32_t using a char8_t argument.
While no loss of precision occurs in char8_t -> char32_t, it is conceptually wrong:
bool contains_oe(std::u8string_view str) {
    for (char8_t c : str)
        if (c == U'ö') // comparison always fails
            return true;
    return false;
}
Assuming that str is correctly UTF-8 encoded, this function always returns false because ö is UTF-8 encoded as 0xC3 0xB6.
Also, ö is U+00F6, and no UTF-8 code unit can be 0xF6.
A bug like this could have been easily prevented if I could somehow detect comparisons of char8_t and char32_t automatically.
Is there a way to do that using GCC compiler flags, Clang compiler flags, clang-tidy, or some other automatic tool?
As of GCC 15, Clang and Clang-Tidy 20, it seems there are no warnings or checks that would be helpful in this case. However, you could write your own Clang-Tidy check.
Alternatively, you could design your interfaces to avoid implicit conversions by using either an enum class or a class. For instance:
class CodeUnit
{
public:
    explicit CodeUnit(char8_t codeUnit = {}) : codeUnit(codeUnit) {}
    explicit CodeUnit(char32_t codeUnit) = delete;
    bool operator==(const CodeUnit&) const = default;
    // ...
private:
    char8_t codeUnit;
};
You can then keep functions that need to use code units (e.g. conversion to UTF-32) inside the CodeUnit class. Then if you try to do
bool contains_oe(std::span<CodeUnit> str) {
    for (CodeUnit c : str)
        if (c == U'ö') // error: no implicit conversion from char32_t to CodeUnit
            return true;
    return false;
}
you just get an error.
Of course, that doesn't solve the root issue, but it will at least limit where it can occur.