c++ language-lawyer undefined-behavior toupper tolower

Do I need to cast to unsigned char before calling toupper(), tolower(), et al.?


A while ago, someone with high reputation here on Stack Overflow wrote in a comment that it is necessary to cast a char argument to unsigned char before calling std::toupper and std::tolower (and similar functions).

On the other hand, Bjarne Stroustrup does not mention the need to do so in The C++ Programming Language. He just uses toupper like this:

string name = "Niels Stroustrup";

void m3() {
  string s = name.substr(6,10);  // s = "Stroustrup"
  name.replace(0,5,"nicholas");  // name becomes "nicholas Stroustrup"
  name[0] = toupper(name[0]);   // name becomes "Nicholas Stroustrup"
}

(Quoted from said book, 4th edition.)

The reference says that the input needs to be representable as unsigned char. To me, this sounds as though it holds for every char, since char and unsigned char have the same size.

So is this cast unnecessary or was Stroustrup careless?

Edit: The libstdc++ manual mentions that the input character must be from the basic source character set, but does not cast. I guess this is covered by @Keith Thompson's reply: they all have a positive representation as both signed char and unsigned char?


Solution

  • Yes, the argument to toupper needs to be converted to unsigned char to avoid the risk of undefined behavior.

    The types char, signed char, and unsigned char are three distinct types. char has the same range and representation as either signed char or unsigned char. (Plain char is very commonly signed and able to represent values in the range -128..+127.)
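Whether plain char is signed can be checked at compile time. The following is a minimal sketch; the helper names plain_char_is_signed and plain_char_min are hypothetical and exist only for illustration.

```cpp
#include <limits>
#include <type_traits>

// char, signed char, and unsigned char are three distinct types, even though
// char shares its range and representation with one of the other two.
static_assert(!std::is_same<char, signed char>::value,
              "char and signed char are distinct types");
static_assert(!std::is_same<char, unsigned char>::value,
              "char and unsigned char are distinct types");

// Hypothetical helper: true if plain char is signed on this implementation.
constexpr bool plain_char_is_signed() {
    return std::numeric_limits<char>::is_signed;
}

// Hypothetical helper: the smallest value plain char can hold
// (commonly -128 when char is signed, 0 when it is unsigned).
constexpr int plain_char_min() {
    return std::numeric_limits<char>::min();
}
```

On an implementation where plain_char_is_signed() returns true, plain_char_min() is negative, and that is the case in which the rest of this answer matters.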

    The toupper function takes an int argument and returns an int result. Quoting the C standard, section 7.4 paragraph 1:

    In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.

    (C++ incorporates most of the C standard library, and defers its definition to the C standard.)

    The [] indexing operator on std::string returns a reference to char. If plain char is a signed type, and if the value of name[0] happens to be negative, then the expression

    toupper(name[0])
    

    has undefined behavior.

    The language guarantees that, even if plain char is signed, all members of the basic character set have non-negative values, so given the initialization

    string name = "Niels Stroustrup";
    

    the program doesn't risk undefined behavior. But yes, in general a char value passed to toupper (or to any of the functions declared in <cctype> / <ctype.h>) needs to be converted to unsigned char, so that the implicit conversion to int won't yield a negative value and cause undefined behavior.
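One way to apply the conversion consistently is to wrap it in a small function. This is a sketch, not a standard facility; to_upper_char and to_upper_copy are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: converts to unsigned char first, so the int that
// reaches std::toupper is always in the range 0..UCHAR_MAX.
inline char to_upper_char(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: returns an uppercased copy of s. The lambda takes its
// parameter as unsigned char, so each element is converted before the call.
inline std::string to_upper_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}
```

With these, the line from the question could be written as name[0] = to_upper_char(name[0]); and would remain well defined even if name[0] held a negative value.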

    The <ctype.h> functions are commonly implemented using a lookup table. Something like:

    // assume plain char is signed
    char c = -2;
    c = toupper(c); // undefined behavior
    

    may index outside the bounds of that table.
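To make the lookup-table point concrete, here is a toy table-based function. It is purely illustrative and not taken from any real library; toy_upper_table and toy_toupper are hypothetical names, and the sketch assumes EOF is -1 and an ASCII-like encoding.

```cpp
#include <array>
#include <climits>   // UCHAR_MAX
#include <cstddef>   // std::size_t
#include <cstdio>    // EOF

// Assumption for this toy only: EOF is -1, as it is on most implementations.
static_assert(EOF == -1, "toy table assumes EOF == -1");

// One slot for EOF plus one slot per unsigned char value.
constexpr std::size_t kTableSize = static_cast<std::size_t>(UCHAR_MAX) + 2;

inline const std::array<int, kTableSize>& toy_upper_table() {
    static const std::array<int, kTableSize> table = [] {
        std::array<int, kTableSize> t{};
        t[0] = EOF;  // index 0 corresponds to the argument EOF
        for (int v = 0; v <= UCHAR_MAX; ++v) {
            // Assumes 'a'..'z' are contiguous, as in ASCII.
            t[static_cast<std::size_t>(v) + 1] =
                (v >= 'a' && v <= 'z') ? v - 'a' + 'A' : v;
        }
        return t;
    }();
    return table;
}

// Valid only for EOF or 0..UCHAR_MAX. An argument of -2 would compute
// index -1, which lies outside the table.
inline int toy_toupper(int c) {
    return toy_upper_table()[static_cast<std::size_t>(c + 1)];
}
```

The table covers exactly the argument values the C standard permits, so any other negative argument reads memory the table does not own.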

    Note that converting to unsigned:

    char c = -2;
    c = toupper((unsigned)c); // undefined behavior
    

    doesn't avoid the problem. If int is 32 bits, converting the char value -2 to unsigned yields 4294967294. This is then implicitly converted to int (the parameter type); the result of that conversion is implementation-defined before C++20 and in practice is -2 again.
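The difference between the two conversions can be checked at compile time. In this sketch, signed char stands in for a plain char that happens to be signed, and the helper names via_unsigned and via_unsigned_char are hypothetical.

```cpp
#include <climits>

// Converting straight to unsigned wraps modulo UINT_MAX + 1, giving a huge value.
constexpr unsigned via_unsigned(signed char c) {
    return static_cast<unsigned>(c);
}

// Converting to unsigned char wraps modulo UCHAR_MAX + 1, so the result
// always fits in the range that the <cctype> functions accept.
constexpr int via_unsigned_char(signed char c) {
    return static_cast<unsigned char>(c);
}

static_assert(via_unsigned(-2) == UINT_MAX - 1u,
              "4294967294 when unsigned is 32 bits");
static_assert(via_unsigned_char(-2) == UCHAR_MAX - 1,
              "254 when char is 8 bits");
```

Only the second value is guaranteed to be representable as an unsigned char, which is what the C standard requires of the argument.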

    toupper can be implemented so it behaves sensibly for negative values (accepting all values from CHAR_MIN to UCHAR_MAX), but it's not required to do so. Furthermore, the functions in <ctype.h> are required to accept an argument with the value EOF, which is typically -1.

    The C++ standard makes adjustments to some C standard library functions. For example, strchr and several other functions are replaced by overloaded versions that enforce const correctness. There are no such adjustments for the functions declared in <cctype>.