The PHP strtolower() function is supposed to convert strings to lowercase. However, the PHP Manual says (emphasis added):
Returns string with all alphabetic characters converted to lowercase.
Note that 'alphabetic' is determined by the current locale. This means that in i.e. the default "C" locale, characters such as umlaut-A (Ä) will not be converted.
The manual is silent about encodings here, but strtolower() is known to corrupt UTF-8 strings, for which you are supposed to use mb_strtolower() instead.
I'm looking for a solution for cases where the mbstring extension is not available, and I want to know when it is safe to use strtolower().
Thanks to pointers from people commenting on this question, it seems that the relevant part of the PHP source is the call to the tolower() function from the ctype.h library. The library documentation says (emphasis added):
If the argument of tolower() represents an uppercase letter, and there exists a corresponding lowercase letter (as defined by character type information in the program locale category LC_CTYPE ), the result shall be the corresponding lowercase letter.
According to my tests, in PHP with setlocale( LC_CTYPE, 'C' ); characters such as Ä (encoded in ISO-8859-1) are left untouched. In some other locales, the function returns the lowercase ä (again, in ISO-8859-1). However, changing the locale to one that uses a UTF-8 character set does not make PHP strtolower() work on the UTF-8 character Ä.
Considering the increasing number of I18N-related issues and multilingual environments, this information can be critically important. Many applications rely on strtolower() for a simple case-insensitive check. Consider:
$_POST['username'] = 'Michèlle';
if ( strtolower( $_POST['username'] ) == $database['username'] ) ...
Now, depending on the encoding, locales and maybe some other variables, the above code will work in some environments, but not in others.
The question is: Given that the PHP strtolower() function uses the ctype.h library's tolower() function, which depends on the "program locale category", when is it safe to count on this function? Can the behaviour be counted on in the following cases?
(Edit: Question reworded completely on 26 Nov 2013.)
The strtolower() PHP function does use the tolower() C function in its implementation, and it applies that function to each single byte (octet) of the passed string parameter.
This is the reason why strtolower() does not corrupt UTF-8 encoded strings after setlocale(LC_CTYPE, 'C'); has been called: in that locale it won't change bytes > 127. That is, it only changes the case of the US-ASCII characters A-Z.
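To make the byte-level reasoning concrete, here is a minimal sketch. It is not PHP's actual implementation: latin1_bytewise_lower() is a hypothetical helper that only imitates what a byte-wise tolower() might do under an ISO-8859-1 locale, so the example does not depend on such a locale being installed. The ASCII-only variant uses strtr() to stand in for what the "C" locale amounts to.

```php
<?php
// Hypothetical helper (not part of PHP): imitates a byte-wise tolower()
// under an ISO-8859-1 locale by shifting every uppercase byte by 0x20.
function latin1_bytewise_lower(string $s): string
{
    $out = '';
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $b = ord($s[$i]);
        $isAsciiUpper  = ($b >= 0x41 && $b <= 0x5A);
        // 0xC0-0xDE are uppercase letters in ISO-8859-1, except 0xD7 (multiplication sign).
        $isLatin1Upper = ($b >= 0xC0 && $b <= 0xDE && $b !== 0xD7);
        $out .= chr(($isAsciiUpper || $isLatin1Upper) ? $b + 0x20 : $b);
    }
    return $out;
}

$utf8 = "\xC3\x84BC"; // "ÄBC" encoded as UTF-8

// ASCII-only lowering: bytes > 127 are never touched.
$asciiOnly = strtr($utf8, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
// Byte-wise Latin-1 lowering: the lead byte 0xC3 is mistaken for "Ã" and becomes 0xE3.
$latin1 = latin1_bytewise_lower($utf8);

echo bin2hex($asciiOnly), "\n"; // c3846263
echo bin2hex($latin1), "\n";    // e3846263

// preg_match() with the /u modifier fails on malformed UTF-8,
// so it can serve as a validity check without mbstring.
var_dump(preg_match('//u', $asciiOnly) === 1); // bool(true)
var_dump(preg_match('//u', $latin1) === 1);    // bool(false)
```

The second result is no longer valid UTF-8, which is the kind of corruption a single-byte, non-"C" locale can cause.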
The "C" locale is set by default, so you do not need to set it explicitly with setlocale() unless other parts of the application have set it to a different value.
This also explains why setting LC_CTYPE to a UTF-8 locale like "de_DE.UTF-8" would not convert "Ä" to "ä": that letter is encoded with two bytes, 0xC3 0x84, and each of them is passed separately as a single character (octet) to the tolower() C function. Both are left unchanged, because on a single byte a UTF-8 lowercase operation can only deal with characters < 128, which again is effectively A-Z only. The result is effectively the same as with the C locale.
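The byte values mentioned above can be inspected directly. This is only an illustrative sketch; the strings are written with explicit escape sequences so that the result does not depend on the encoding of the source file.

```php
<?php
$upper = "\xC3\x84"; // "Ä" (U+00C4) encoded as UTF-8
$lower = "\xC3\xA4"; // "ä" (U+00E4) encoded as UTF-8

// strlen() counts bytes, not characters.
echo strlen($upper), "\n"; // 2

// Neither byte falls in the US-ASCII range, so a per-byte conversion
// that only knows A-Z has nothing to change.
foreach (str_split($upper) as $byte) {
    printf("0x%02X is ASCII: %s\n", ord($byte), ord($byte) < 128 ? 'yes' : 'no');
}
// 0xC3 is ASCII: no
// 0x84 is ASCII: no

// The case difference is carried by the second byte only (0x84 vs 0xA4),
// and that is not something a single-byte tolower() can know.
echo bin2hex($upper), ' vs ', bin2hex($lower), "\n"; // c384 vs c3a4
```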
So setting LC_CTYPE to "C" prevents strtolower() from breaking UTF-8 strings.
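If other code may have changed the locale, one possible approach is to switch LC_CTYPE to "C" only for the duration of the call. The following is a sketch under that assumption; ascii_strtolower() is a hypothetical name, not a PHP function. It relies on the documented behaviour that passing "0" to setlocale() returns the current setting without changing it.

```php
<?php
// Hypothetical wrapper: lowercases A-Z only and leaves all other bytes alone.
function ascii_strtolower(string $s): string
{
    // "0" queries the current LC_CTYPE setting without modifying it.
    $previous = setlocale(LC_CTYPE, '0');
    setlocale(LC_CTYPE, 'C');

    $result = strtolower($s);

    // Restore whatever the rest of the application had configured.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }
    return $result;
}

$input = "MICH\xC3\x88LLE"; // "MICHÈLLE" encoded as UTF-8
$out   = ascii_strtolower($input);

echo bin2hex($out), "\n"; // 6d696368c3886c6c65 ("michÈlle": È is untouched)

// The string is intact, but it is not equal to "michèlle" (è = 0xC3 0xA8).
var_dump($out === "mich\xC3\xA8lle"); // bool(false)
```

Note that this keeps the string valid but does not lowercase non-ASCII letters, so a comparison like the username check in the question would still fail for "È" versus "è".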