A while ago, someone with high reputation here on Stack Overflow wrote in a comment that it is necessary to cast a char argument to unsigned char before calling std::toupper and std::tolower (and similar functions).
On the other hand, Bjarne Stroustrup does not mention the need to do so in The C++ Programming Language. He just uses toupper like this:
string name = "Niels Stroustrup";

void m3()
{
    string s = name.substr(6,10);    // s = "Stroustrup"
    name.replace(0,5,"nicholas");    // name becomes "nicholas Stroustrup"
    name[0] = toupper(name[0]);      // name becomes "Nicholas Stroustrup"
}
(Quoted from said book, 4th edition.)
The reference says that the input needs to be representable as unsigned char. To me this sounds like it holds for every char, since char and unsigned char have the same size.
So is this cast unnecessary or was Stroustrup careless?
Edit: The libstdc++ manual mentions that the input character must be from the basic source character set, but does not cast. I guess this is covered by @Keith Thompson's reply: do they all have a positive representation as both signed char and unsigned char?
Yes, the argument to toupper needs to be converted to unsigned char to avoid the risk of undefined behavior.
The types char, signed char, and unsigned char are three distinct types. char has the same range and representation as either signed char or unsigned char. (Plain char is very commonly signed and able to represent values in the range -128..+127.)
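The distinction can be checked at compile time. The following is only a sketch: the names plain_char_is_signed and char_range_matches are made up for this illustration.

```cpp
#include <cassert>
#include <climits>
#include <type_traits>

// char, signed char, and unsigned char are three distinct types,
// even though char shares its range with one of the other two.
static_assert(!std::is_same<char, signed char>::value, "distinct types");
static_assert(!std::is_same<char, unsigned char>::value, "distinct types");
static_assert(!std::is_same<signed char, unsigned char>::value, "distinct types");

// Whether plain char is signed is implementation-defined.
// (Hypothetical name, used only for this illustration.)
constexpr bool plain_char_is_signed = std::is_signed<char>::value;

// Returns true if plain char has the same range as the type it mirrors.
bool char_range_matches() {
    return plain_char_is_signed
        ? (CHAR_MIN == SCHAR_MIN && CHAR_MAX == SCHAR_MAX)
        : (CHAR_MIN == 0 && CHAR_MAX == UCHAR_MAX);
}
```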
The toupper function takes an int argument and returns an int result. Quoting the C standard, section 7.4 paragraph 1:
In all cases the argument is an int, the value of which shall be representable as an unsigned char or shall equal the value of the macro EOF. If the argument has any other value, the behavior is undefined.
(C++ incorporates most of the C standard library, and defers its definition to the C standard.)
The [] indexing operator on std::string returns a reference to char. If plain char is a signed type, and if the value of name[0] happens to be negative, then the expression
toupper(name[0])
has undefined behavior.
The language guarantees that, even if plain char is signed, all members of the basic character set have non-negative values, so given the initialization
string name = "Niels Stroustrup";
the program doesn't risk undefined behavior. But yes, in general a char value passed to toupper (or to any of the functions declared in <cctype> / <ctype.h>) needs to be converted to unsigned char, so that the implicit conversion to int won't yield a negative value and cause undefined behavior.
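In practice the conversion is usually wrapped once and reused. A minimal sketch follows; to_upper_safe and to_upper_copy are hypothetical helper names, not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: upper-case one char without risking a negative argument.
char to_upper_safe(char c) {
    // Convert to unsigned char first, so the int passed to std::toupper
    // is in the range 0..UCHAR_MAX even when plain char is signed.
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Hypothetical helper: upper-case a copy of a whole string.
std::string to_upper_copy(std::string s) {
    // Declaring the lambda parameter as unsigned char performs the
    // conversion implicitly for every element.
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    return s;
}
```

With these, the line from the book could be written as name[0] = to_upper_safe(name[0]); and would stay well defined for any string contents.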
The <ctype.h> functions are commonly implemented using a lookup table. Something like:
// assume plain char is signed
char c = -2;
c = toupper(c); // undefined behavior
may index outside the bounds of that table.
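To see why, here is a sketch of how such a table might be indexed. It is a hypothetical model, not any real library's code, and it assumes EOF is -1; kTableSize, table_index, and lookup_in_bounds are names invented for the illustration.

```cpp
#include <cassert>
#include <climits>

// Hypothetical model of a table-driven toupper: slot 0 is reserved for EOF
// (assumed here to be -1), slots 1..UCHAR_MAX+1 hold the unsigned char values.
// Real library implementations differ in detail.
constexpr int kTableSize = UCHAR_MAX + 2;

// Index such an implementation might compute for an argument.
constexpr int table_index(int arg) { return arg + 1; }

// True if the lookup for `arg` would stay inside the table.
constexpr bool lookup_in_bounds(int arg) {
    return table_index(arg) >= 0 && table_index(arg) < kTableSize;
}

static_assert(lookup_in_bounds(-1), "EOF (assumed -1) maps to slot 0");
static_assert(lookup_in_bounds(UCHAR_MAX), "largest unsigned char value fits");
static_assert(!lookup_in_bounds(-2), "a negative char such as -2 falls before the table");
```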
Note that converting to unsigned:
char c = -2;
c = toupper((unsigned)c); // undefined behavior
doesn't avoid the problem. If int is 32 bits, converting the char value -2 to unsigned yields 4294967294. This is then implicitly converted to int (the parameter type), which probably yields -2.
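The two conversions can be compared directly. This sketch uses signed char rather than plain char so that it does not depend on the platform's choice; the variable names are illustrative only.

```cpp
#include <cassert>
#include <climits>

// Use signed char explicitly so the example does not depend on whether
// plain char is signed on the current platform.
constexpr signed char sample = -2;

// Converting to unsigned wraps modulo UINT_MAX + 1
// (4294967294 when unsigned is 32 bits).
constexpr unsigned as_unsigned = static_cast<unsigned>(sample);

// Converting to unsigned char wraps modulo UCHAR_MAX + 1
// (254 when char is 8 bits).
constexpr unsigned char as_uchar = static_cast<unsigned char>(sample);

static_assert(as_unsigned == UINT_MAX - 1, "far outside the unsigned char range");
static_assert(as_uchar == UCHAR_MAX - 1, "inside the range toupper accepts");
```

Only the second value is guaranteed to be a valid argument for toupper.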
toupper can be implemented so it behaves sensibly for negative values (accepting all values from CHAR_MIN to UCHAR_MAX), but it's not required to do so. Furthermore, the functions in <ctype.h> are required to accept an argument with the value EOF, which is typically -1.
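The EOF requirement is one more reason for the cast: where plain char is signed, a char holding -1 would be indistinguishable from a typical EOF if passed directly. A small sketch, with ctype_arg and collides_with_eof as hypothetical helper names, and assuming int can represent every unsigned char value:

```cpp
#include <cassert>
#include <cstdio>   // EOF

// Hypothetical helper: the int value a <cctype> function should receive for c.
// The result is never negative, so it can never equal EOF.
constexpr int ctype_arg(char c) {
    return static_cast<unsigned char>(c);
}

// True if passing c directly (without the cast) would hand the library
// the same int value as EOF. Only possible when plain char is signed.
constexpr bool collides_with_eof(char c) {
    return static_cast<int>(c) == EOF;
}
```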
The C++ standard makes adjustments to some C standard library functions. For example, strchr and several other functions are replaced by overloaded versions that enforce const correctness. There are no such adjustments for the functions declared in <cctype>.