The PHP strtolower() function is supposed to convert strings to lowercase. However, the PHP Manual says (emphasis added):
Returns string with all alphabetic characters converted to lowercase.
Note that 'alphabetic' is determined by the current locale. This means that in i.e. the default "C" locale, characters such as umlaut-A (Ä) will not be converted.
The manual is silent about encodings here, but strtolower() is known to corrupt UTF-8 strings; for those you are supposed to use mb_strtolower() instead.
I'm looking for a solution for cases where the mbstring extension is not available, and I want to know when it is safe to use strtolower().
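To make the "no mbstring" case concrete, here is a minimal sketch of one possible fallback, not a definitive answer. The helper names ascii_lowercase() and lowercase_fallback() are hypothetical (they are not PHP built-ins), and the sketch assumes the input is UTF-8 whenever mbstring is present. The fallback branch uses strtr(), which maps bytes without consulting the locale.

```php
<?php
// Hypothetical helper: lowercases only the 26 ASCII letters, byte for byte.
// strtr() does not consult the locale, and every byte of a multi-byte
// UTF-8 sequence is >= 0x80, so such sequences pass through unchanged.
function ascii_lowercase(string $s): string
{
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// Hypothetical wrapper: prefers mb_strtolower() when mbstring is loaded.
// Assumption: the input is UTF-8.
function lowercase_fallback(string $s): string
{
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($s, 'UTF-8');
    }
    return ascii_lowercase($s);
}
```

The trade-off is that without mbstring, non-ASCII letters such as È are left in their original case rather than being lowercased, so the two branches do not give identical results for non-ASCII input.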
Thanks to pointers from people commenting on this question, it seems that the relevant part of the PHP source is the call to the tolower() function declared in the C header ctype.h. The library documentation says (emphasis added):
If the argument of tolower() represents an uppercase letter, and there exists a corresponding lowercase letter (as defined by character type information in the program locale category LC_CTYPE ), the result shall be the corresponding lowercase letter.
According to my tests, in PHP with setlocale( LC_CTYPE, 'C' ); characters such as Ä (encoded in ISO-8859-1) are left untouched. But in some other locales, the function returns the lowercase ä (again, in ISO-8859-1). However, changing the locale to one that uses a UTF-8 character set does not make PHP strtolower() work on the UTF-8 character Ä.
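The kind of test described above can be sketched as a small probe script. The locale names below are assumptions: they vary by operating system and may not be installed, in which case setlocale() returns false. The script prints bytes as hex so the terminal encoding does not affect what you see. Results will differ between platforms and PHP versions, so no expected output is given.

```php
<?php
// "\xC4" is Ä in ISO-8859-1; "\xC3\x84" is Ä in UTF-8.
$latin1 = "\xC4";
$utf8   = "\xC3\x84";

// Assumed locale names; adjust them to whatever your system provides.
foreach (['C', 'de_DE.ISO-8859-1', 'de_DE.UTF-8'] as $locale) {
    if (setlocale(LC_CTYPE, $locale) === false) {
        echo "$locale: not available\n";
        continue;
    }
    // Show input and output bytes for both encodings under this locale.
    printf(
        "%s: latin1 %s -> %s, utf8 %s -> %s\n",
        $locale,
        bin2hex($latin1),
        bin2hex(strtolower($latin1)),
        bin2hex($utf8),
        bin2hex(strtolower($utf8))
    );
}
```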
Given the growing number of I18N-related issues and multilingual environments, this information can be critically important. Many applications rely on strtolower() for a simple case-insensitive check. Consider:
$_POST['username'] = 'Michèlle';
if ( strtolower( $_POST['username'] ) == $database['username'] ) ...
Depending on the encoding, the locale, and maybe some other variables, the above code will work in some environments but not in others.
The question is: Given that the PHP strtolower() function uses the tolower() function from ctype.h, which depends on the "program locale category", when is it safe to count on this function? Can the behaviour be counted on in the following cases?
- The string is ASCII
- The string is encoded in ISO-8859-1
- The string is encoded in some other encoding with the corresponding locale set.
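For the first case in the list, a program at least needs a way to tell whether a given string is pure ASCII. The check below is a minimal sketch; is_ascii() is a hypothetical helper, not a PHP built-in. It only classifies the input. It does not by itself establish that strtolower() is safe for ASCII under every locale, which is the open question here.

```php
<?php
// Hypothetical helper: true when the string contains no byte >= 0x80.
// The pattern has no /u flag, so PCRE matches byte by byte.
function is_ascii(string $s): bool
{
    return preg_match('/[\x80-\xFF]/', $s) === 0;
}
```

With this check, the 'Michèlle' value from the example above is classified as non-ASCII in both ISO-8859-1 and UTF-8, because è is encoded with at least one byte above 0x7F in either encoding.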
(Edit: Question reworded completely on 26 Nov 2013.)