When is it safe to use the PHP strtolower() function?

The PHP strtolower() function is supposed to convert strings to lowercase. However, the PHP Manual says (emphasis added):

Returns string with all alphabetic characters converted to lowercase.

Note that 'alphabetic' is determined by the current locale. This means that in i.e. the default "C" locale, characters such as umlaut-A (Ä) will not be converted.

The manual is silent about encodings here, but strtolower() is known to be able to corrupt UTF-8 strings, for which you are supposed to use mb_strtolower() instead.

I'm looking for a solution in cases where the mbstring extension is not available, and wanted to know when it is safe to use strtolower().

Thanks to pointers from people commenting on this question, it seems that the relevant part of the PHP source is the call to the tolower() function from the ctype.h library. The library documentation says (emphasis added):

If the argument of tolower() represents an uppercase letter, and there exists a corresponding lowercase letter (as defined by character type information in the program locale category LC_CTYPE ), the result shall be the corresponding lowercase letter.

According to my tests, in PHP with setlocale( LC_CTYPE, 'C' ); characters such as Ä (encoded in ISO-8859-1) are left untouched. In some other locales, however, the function returns the lowercase ä (again, in ISO-8859-1). Changing the locale to one that uses a UTF-8 character set still does not make PHP strtolower() work on the UTF-8 character Ä.
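A test of this kind could look roughly like the sketch below. It only covers the "C" locale case, because which other locale names are installed differs from system to system. The byte values (0xC4 for Ä in ISO-8859-1) and the variable names are illustrative assumptions, not something taken from the original tests.

```php
<?php
// Pin the character-type locale to "C" so only A-Z are treated as uppercase.
setlocale(LC_CTYPE, 'C');

// "ÄBC" written as raw ISO-8859-1 bytes: 0xC4 is Ä in that encoding.
$upper = "\xC4BC";
$lower = strtolower($upper);

// Compare bytes rather than printed characters, so the terminal's own
// encoding cannot confuse the result.
echo bin2hex($upper), "\n"; // c44243
echo bin2hex($lower), "\n"; // c46263: B and C were folded, 0xC4 was not
```

Printing the hex form makes it easy to see exactly which bytes strtolower() touched.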

Considering the increasing amount of I18N-related issues and multilingual environments, this information can be critically important. Many applications rely on strtolower() for a simple case-insensitive check. Consider:

$_POST['username'] = 'Michèlle';
if ( strtolower( $_POST['username'] ) == $database['username'] ) ...

Now, depending on the encoding, locales and maybe some other variables, the above code will work in some environments, but not in others.

The question is: Given that the PHP strtolower() function uses ctype.h library's tolower function, which depends on the "program locale category", when is it safe to count on this function? Can the behaviour be counted on in the following cases?

  1. The string is ASCII
  2. The string is encoded in ISO-8859-1
  3. The string is encoded in some other encoding with the corresponding locale set.

(Edit: Question reworded completely on 26 Nov 2013.)

Poteat answered 20/11, 2013 at 16:25 Comment(7)
PHP is open source, so find it in the source code. - Geodesic
Here's the relevant part of the source. - Sectarianism
@AmalMurali Actually, the work is done here: lxr.php.net/xref/PHP_TRUNK/ext/standard/string.c#1376 - Communion
"Note that 'alphabetic' is determined by the current locale". So you may want to take a look at the setlocale function. Its documentation lists "LC_CTYPE for character classification and conversion, for example strtoupper()", so I guess that covers strtolower as well. A locale can also specify an encoding, so maybe that could help. - Chartist
@KevinCittadini thanks, I know about this function and the locales, but that still doesn't answer the question of character sets and how they are used here. - Poteat
@HeikkiU: That's why I posted a comment. Anyway, you asked "Is this indeed the internal encoding used by strtolower". I don't know the answer, but running some tests with different setlocale configurations and encodings might answer your questions. Or, of course, check the source if you prefer. - Chartist
I think the <ctype> tag is appropriate here, since the answer is actually buried somewhere in there. - Poteat

The PHP strtolower() function is implemented with the C tolower() function, which it calls on each individual byte (octet) of the string it is given.

This is why strtolower() does not corrupt UTF-8 encoded strings under setlocale(LC_CTYPE, 'C' ); in that locale it does not change bytes > 127. That is, it only changes the case of the US-ASCII characters A-Z.
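As a rough illustration of that point, the sketch below lowercases a UTF-8 string in the "C" locale and checks that the result is still valid UTF-8. The sample word and variable names are assumptions made for the example. The validity check uses preg_match() with the /u modifier, which does not need mbstring.

```php
<?php
setlocale(LC_CTYPE, 'C');

// "ÄPFEL" as UTF-8: Ä is the two-byte sequence 0xC3 0x84.
$utf8   = "\xC3\x84PFEL";
$result = strtolower($utf8);

// The ASCII letters are folded, the two bytes of Ä are passed through.
echo bin2hex($result), "\n"; // c3847066656c

// An empty pattern with /u matches only if the subject is valid UTF-8.
$stillValid = (preg_match('//u', $result) === 1);
var_dump($stillValid); // bool(true)
```

Note that the Ä is preserved, not lowercased: the string stays intact, but a non-ASCII capital remains a capital.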

The "C" locale is set by default, so you only need to set it explicitly with setlocale() if other parts of the application have set it to a different value.

This also explains why setting LC_CTYPE to a UTF-8 locale such as "de_DE.UTF-8" does not convert "Ä" to "ä": the letter is encoded as two bytes, 0xC3 0x84, and each of them is passed to the C tolower() function separately as a single character (octet). Neither byte is changed, because on a single byte UTF-8 lowercasing can only deal with characters < 128, which again means A-Z only. The result is effectively the same as in the "C" locale.
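The byte-level view can be made concrete with a small sketch. It does not call strtolower() at all; it just splits the UTF-8 form of Ä into its bytes and checks whether either one falls in the ASCII A-Z range (0x41 to 0x5A), which is the only range a byte-wise conversion can meaningfully fold here. The variable names are illustrative.

```php
<?php
// "Ä" encoded as UTF-8.
$char = "\xC3\x84";

// str_split() works on bytes, not characters, so this yields two entries.
$bytes = array_map('ord', str_split($char));

foreach ($bytes as $byte) {
    // A byte can only be an ASCII uppercase letter if it lies in 0x41..0x5A.
    $isAsciiUpper = ($byte >= 0x41 && $byte <= 0x5A);
    printf("0x%02X ascii-upper: %s\n", $byte, $isAsciiUpper ? 'yes' : 'no');
}
// 0xC3 ascii-upper: no
// 0x84 ascii-upper: no
```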

So setting LC_CTYPE to "C" prevents strtolower() from breaking UTF-8 strings; non-ASCII letters are simply left as they are.
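One way to apply this without permanently changing the application's locale is a small wrapper that switches LC_CTYPE to "C", lowercases, and then restores the previous setting. This is a minimal sketch, not a definitive implementation: the function name ascii_strtolower() is made up for the example, and it relies on setlocale() returning the current setting when the locale argument is "0".

```php
<?php
/**
 * Hypothetical helper: lowercase only the ASCII letters A-Z,
 * leaving every byte > 127 (for example UTF-8 sequences) untouched.
 */
function ascii_strtolower(string $text): string
{
    // Passing "0" queries the current LC_CTYPE setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    setlocale(LC_CTYPE, 'C');
    $result = strtolower($text);

    // Put the caller's locale back if we were able to read it.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $result;
}

echo bin2hex(ascii_strtolower("\xC3\x84BC")), "\n"; // c3846263
echo ascii_strtolower('HELLO'), "\n";               // hello
```

Keep in mind that the locale is a per-process setting, so switching it like this may be unsuitable on multithreaded servers. As far as I know, PHP 8.2 and later make strtolower() ASCII-only regardless of locale, in which case the wrapper is harmless but no longer needed.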

Repatriate answered 24/1, 2016 at 14:58 Comment(0)

It uses the C function tolower (ref: http://www.acm.uiuc.edu/webmonkeys/book/c_guide/2.2.html) from the ctype.h library.

You can view the relevant sections of the source here:

Communion answered 20/11, 2013 at 16:35 Comment(5)
From the link you provided: "If the character matches the appropriate condition, then it is converted. [...] If the character is an uppercase character (A to Z), then it is converted to lowercase (a to z)" This is apparently not the whole truth, since on my system strtolower() WILL convert (ISO-8859-1 encoded) Ä to ä. - Poteat
@HeikkiU hmm, I am looking at the source, and php_strtolower is really straightforward. If you have a C/C++ test environment, try to reproduce those results using tolower directly. The only other thing I can see is that strtolower calls zend_parse_parameters, but I don't see anything in there that would change the value in a way that makes tolower behave differently than normal. - Communion
I haven't got that option for testing. But there must be something more to it, otherwise the manual would just say "Converts A-Z to a-z", wouldn't it? And, before you mention it, I don't have mbstring overloading enabled. - Poteat
I've edited the original question using some of the information found in your answer. - Poteat
Three out of four links are (effectively) broken (the ctype.h one works). The first link times out and the last two redirect to https://heap.space/. - Halpin