c++ utf-8 isspace

Can isspace() give false positives with UTF-8 text?


I know isspace() is meant to work for ASCII, but I have UTF-8 text. If isspace() looks only at the lower 7 bits, where UTF-8 and ASCII overlap, it should be safe to use.

By safe to use I mean that it won't detect a Unicode character that is not a whitespace as whitespace. I know that there might be special Unicode whitespaces which it will not detect, but that is not a problem for me.

I.e. I'm OK with false negatives, so long as there are no false positives. Is it correct to assume that?


Solution

  • isspace() is subject to locale definitions of whitespace characters at runtime.

    In C, whitespace characters are defined by the locale selected in a call to setlocale() with the LC_ALL or LC_CTYPE category.

    In C++, whitespace characters are defined by the locale specified by either:

    1. a call to std::setlocale() with the LC_ALL or LC_CTYPE category, when using the version of std::isspace() from the <cctype> header.

    2. an input locale parameter, when using the version of std::isspace() from the <locale> header.

    If no locale has been set, the default is the "C" locale. It defines the following whitespace characters. They are encoded identically in ASCII and UTF-8, and most ASCII-compatible locales treat them as whitespace too, but other locales may define whitespace differently:

    ' '  (0x20) space (SPC) 
    '\t' (0x09) horizontal tab (TAB) 
    '\n' (0x0a) newline (LF) 
    '\v' (0x0b) vertical tab (VT) 
    '\f' (0x0c) form feed (FF)
    '\r' (0x0d) carriage return (CR)
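
Since the result depends on the active locale, one way to rule out false positives on UTF-8 bytes is to avoid the locale altogether, or to pin the check to the "C" locale. The following is a minimal sketch of both approaches, not a definitive implementation; the helper names is_ascii_space, is_space_classic and is_space_current_locale are made up for illustration. Note that the <cctype> version has undefined behavior for negative values other than EOF, so a char taken from UTF-8 text should be cast to unsigned char first.

```cpp
#include <cctype>
#include <locale>

// Hypothetical helper: locale-independent test for the six "C"-locale
// whitespace characters listed above. Bytes >= 0x80, which is every lead
// and continuation byte of a multi-byte UTF-8 sequence, never match.
constexpr bool is_ascii_space(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' ||
           c == '\v' || c == '\f' || c == '\r';
}

// Hypothetical helper: the <locale> overload with an explicit locale
// argument, pinned to the classic "C" locale so that calls to
// std::setlocale() or std::locale::global() elsewhere do not affect it.
inline bool is_space_classic(char c) {
    return std::isspace(c, std::locale::classic());
}

// Hypothetical helper: the <cctype> version, which follows whatever C
// locale is currently set. The cast avoids passing a negative value
// when char is signed and the byte is >= 0x80.
inline bool is_space_current_locale(char c) {
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

With the first two helpers the answer does not change if the program later switches locale. The third one can, on some platforms and single-byte locales, report bytes above 0x7F as whitespace, which is exactly the false positive the question is worried about.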