What does setlocale(LC_CTYPE, 'C'); actually do?


When my PHP script handles UTF-8 encoded text containing non-ASCII characters, some PHP string functions such as strtolower() don't process those characters correctly.

I could use mb_strtolower, but this script can be run on all sorts of different platforms and configurations, and the multibyte string extension might not be available. I could check whether the function exists before use, but I have string functions littered throughout my code and would rather not replace every instance.

Someone suggested using setlocale(LC_CTYPE, 'C'), which he says causes the string functions to work correctly. This sounds fine, but I don't want to introduce that change without understanding exactly what it is doing. I have used setlocale to change the formatting of numbers before, but I have not used the LC_CTYPE category before, and I don't really understand what it does. What does the value 'C' mean?

Chiquia answered 8/3, 2011 at 11:8 Comment(1)
Reference: php.net/manual/en/function.setlocale.php (It doesn't explain what C does; not meant as an RTFM, just for completeness' sake) - Venditti

C means "use whatever locale is hard-coded", i.e. the minimal default locale that a program starts in before any other locale is selected (and since most *NIX programs are written in C, it's called C). However, it is usually not a UTF-8 locale.
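
A minimal sketch of what this means in practice, assuming the input is UTF-8 encoded (the variable names are only illustrative). With LC_CTYPE set to C, byte-oriented case conversion only touches the ASCII letters A-Z, so the bytes of a multibyte character pass through unchanged: they are neither lowercased nor corrupted.

```php
<?php
// "0" queries the current LC_CTYPE setting without changing it.
$previous = setlocale(LC_CTYPE, '0');

// Switch only the character-classification category to the "C" locale.
// setlocale() returns the new locale name on success, or false on failure.
$active = setlocale(LC_CTYPE, 'C');

// "\xC3\x80" is the UTF-8 encoding of "À" (U+00C0).
$input  = "\xC3\x80BC";
$result = strtolower($input);

// Only the ASCII letters B and C are lowercased; the two bytes of "À" are untouched.
var_dump($active);                   // string(1) "C"
var_dump($result === "\xC3\x80bc");  // bool(true)

// Restore the earlier setting so the rest of the script is unaffected.
if ($previous !== false) {
    setlocale(LC_CTYPE, $previous);
}
```

So the call makes strtolower() safe for UTF-8 input, but it does not make it lowercase non-ASCII letters.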

If you are using multibyte charsets such as UTF-8, you cannot use the regular string functions - using the mb_ counterparts is required. That said, almost every PHP installation should have the mbstring extension enabled.
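
For a script that cannot assume mbstring is present, one option is a small wrapper that prefers the mb_ function and degrades gracefully. This is only a sketch: utf8_lower() is a hypothetical helper name, and the fallback deliberately lowercases ASCII letters only rather than guessing at non-ASCII ones.

```php
<?php
/**
 * Hypothetical helper: lowercase a UTF-8 string.
 * Uses mbstring when it is available; otherwise lowercases ASCII letters
 * only and leaves every other byte untouched.
 */
function utf8_lower(string $text): string
{
    if (function_exists('mb_strtolower')) {
        return mb_strtolower($text, 'UTF-8');
    }

    // Fallback: strtr() with explicit character lists maps byte for byte
    // and does not consult the locale, so multibyte sequences stay intact.
    return strtr(
        $text,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz'
    );
}

// "\xC3\x80" is "À"; with mbstring this prints "àbc", without it "Àbc".
echo utf8_lower("\xC3\x80BC"), "\n";
```

Routing calls through one helper like this keeps the mbstring check in a single place instead of repeating it around every string function call.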

Guthry answered 8/3, 2011 at 11:13 Comment(7)
Thanks for the explanation - if I make the value configurable by the user, would that work? E.g. the user could enter their actual locale in a config file, and I then call setlocale(LC_CTYPE, $config_value); - would that negate the need for using mb_ functions? Or would I still have to use them anyway? - Chiquia
You can activate the mb_* functions globally! - Enrollment
@Enrollment not if he's on shared hosting that doesn't support it. - Venditti
I have no control over the environment it is run on - this is a script that is out in the wild! I don't think Multibyte string function overloading would work with ini_set, so it is out of my hands. - Chiquia
link says: "you can’t rely on users to be able to change the locale correctly to suit your applications needs - on a shared host they probably won’t be able to change the locale for the user that Apache is running with. Bottom line - locales are not the way to go for applications intended to be “write once, run anywhere”." So I guess I will just have to do a search and replace to use mb_ wherever possible. :/ - Chiquia
@Guthry What do you mean by saying "locale is hardcoded"? Does it mean that the C locale is fully maintained by the creators of PHP at the C/C++ level and should normally work the same on all platforms and/or versions of PHP? Or does it mostly depend on the C/C++ compiler and/or the flags used during compilation, so that we should treat the C locale more like a random one? The former or the latter? - Spellbound
You say "you cannot use the regular string functions - using the mb_ counterparts is required", which isn't strictly true. Indeed, UTF-8 was devised so that a large number of programs, which use the regular string functions, can remain unmodified. The main exception is when calculating the visible width - that must be done using wcwidth. - Mullin
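
To illustrate the last comment, here is a small sketch (with made-up sample text) of byte-oriented functions that behave correctly on UTF-8, assuming both the subject and the search strings are valid UTF-8. Searching, replacing and splitting are safe because a valid UTF-8 sequence can never match in the middle of another character; lengths and offsets, however, are counted in bytes.

```php
<?php
// "naïve café" written with explicit UTF-8 byte escapes.
$text = "na\xC3\xAFve caf\xC3\xA9";

// Splitting and replacing on whole UTF-8 substrings is byte-safe.
$words    = explode(' ', $text);
$replaced = str_replace("caf\xC3\xA9", 'tea', $text);

// strpos() and strlen() report bytes, not characters.
$offset = strpos($text, "caf\xC3\xA9");
$bytes  = strlen($text);

var_dump(count($words));  // int(2)
var_dump($replaced);      // "naïve tea"
var_dump($offset);        // int(7): "naïve " occupies 7 bytes
var_dump($bytes);         // int(12): 10 characters, two of them 2 bytes long
```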
