[SOLVED] Can isspace() give false positives with UTF-8 text?

I know isspace() is meant to work for ASCII, but I have UTF-8 text. If isspace() only matches values in the 7-bit range, where UTF-8 and ASCII overlap, it should be safe to use.

By safe to use I mean that it won't classify a non-whitespace Unicode character as whitespace. I know there may be Unicode whitespace characters that it will not detect, but that is not a problem for me.

I.e. I'm OK with false negatives, so long as there are no false positives. Is it correct to assume that?

Solution

isspace() is subject to locale definitions of whitespace characters at runtime.

In C, whitespace characters are defined by the locale selected by a call to setlocale() with category LC_ALL or LC_CTYPE.

In C++, whitespace characters are defined by the locale specified by either:

a call to std::setlocale() with category LC_ALL or LC_CTYPE, when using the version of std::isspace() from the <cctype> header.
the std::locale argument passed to the call, when using the version of std::isspace() from the <locale> header.
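To make the difference between the two C++ versions concrete, here is a minimal sketch. The helper names (is_space_global, is_space_classic, count_classic_spaces, check_locale_examples) are made up for illustration, and the self-check assumes the program has not changed the global locale from its default.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <locale>
#include <string>

// <cctype> version: consults the global C locale set via std::setlocale.
// The cast to unsigned char matters: passing a negative value other than
// EOF is undefined behavior, and UTF-8 bytes >= 0x80 are negative where
// plain char is signed.
bool is_space_global(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// <locale> version: the locale is an explicit argument.
// std::locale::classic() is the "C" locale, whatever the global locale is.
bool is_space_classic(char c) {
    return std::isspace(c, std::locale::classic());
}

// Count the bytes of s that the "C" locale classifies as whitespace.
std::size_t count_classic_spaces(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if (is_space_classic(c)) ++n;
    }
    return n;
}

// Self-check, assuming the global locale is still the default "C" locale.
void check_locale_examples() {
    assert(is_space_global(' '));
    assert(!is_space_global('\xA0'));
    // "\xC2\xA0" is U+00A0 (no-break space) in UTF-8; neither byte matches.
    assert(count_classic_spaces("a b\t\xC2\xA0") == 2);
}
```

If the text is always UTF-8, the second form is arguably the safer choice, because its result does not change when some other part of the program calls std::setlocale().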

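The default locale is the "C" locale. Every byte of a multi-byte UTF-8 sequence is in the range 0x80 to 0xFF, and the "C" locale classifies none of those values as whitespace, so in this locale isspace() gives no false positives on UTF-8 text. Pass each byte as unsigned char, because a negative argument other than EOF is undefined behavior. Other locales, for example single-byte ones, may classify additional values above 0x7F as whitespace.

The "C" locale defines exactly the following whitespace characters, which are encoded identically in ASCII and UTF-8: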

' '  (0x20) space (SPC) 
'\t' (0x09) horizontal tab (TAB) 
'\n' (0x0a) newline (LF) 
'\v' (0x0b) vertical tab (VT) 
'\f' (0x0c) form feed (FF) 
'\r' (0x0d) carriage return (CR)
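
If you want a check that cannot be affected by the locale at all, one option is to compare against these six characters directly. This is only a sketch, and the names is_ascii_space, count_ascii_spaces and check_ascii_space_examples are hypothetical. It cannot produce false positives on valid UTF-8, because every byte of a multi-byte sequence is 0x80 or above, while all six characters are below 0x80.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true only for the six "C"-locale whitespace
// characters listed above. It never consults a locale.
constexpr bool is_ascii_space(char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

// Compile-time checks: a plain space matches, a UTF-8 lead byte does not.
static_assert(is_ascii_space(' '), "space is whitespace");
static_assert(!is_ascii_space('\xC3'), "UTF-8 lead byte is not whitespace");

// Count ASCII whitespace bytes in a UTF-8 encoded string.
std::size_t count_ascii_spaces(const std::string& utf8) {
    std::size_t n = 0;
    for (char c : utf8) {
        if (is_ascii_space(c)) ++n;
    }
    return n;
}

// Runtime self-check: "caf\xC3\xA9" is "café" encoded as UTF-8.
void check_ascii_space_examples() {
    assert(count_ascii_spaces("caf\xC3\xA9 au\tlait\n") == 3);
}
```

As in the question, this accepts false negatives: Unicode whitespace outside ASCII, such as U+00A0, is not detected.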