You're right, UTF-8 is a good choice for web applications.
Encoding is meta-information about the data being processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost when you don't know the encoding. I often call this a chain: if the encoding chain is broken, the data will be broken. This is true both for displaying data and for security.
As a rule of thumb, PHP is binary; it's the context, or you, who specifies the encoding (e.g. by how you save your PHP source-code files).
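A minimal sketch of what "PHP is binary" means in practice. The byte sequences are written as escapes so the example does not depend on how the file itself is saved; the variable names are only illustrative.

```php
<?php
// The same character "é" as two different byte sequences.
$latin1 = "\xE9";     // "é" in ISO-8859-1: one byte
$utf8   = "\xC3\xA9"; // "é" in UTF-8: two bytes

// PHP's core string functions count and compare bytes, not characters.
var_dump(strlen($latin1));   // int(1)
var_dump(strlen($utf8));     // int(2)
var_dump(bin2hex($utf8));    // string(4) "c3a9"
var_dump($latin1 === $utf8); // bool(false)
```

Nothing in the strings themselves says which encoding they use; that knowledge has to come from you.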
So let's tackle a short (and incomplete) list:
The OS
Environment variables might tell you about the locale in use and the encoding. File systems have their own encoding for the names of files and directories, for example. I'm not very firm on this subject; normally we try to name our files in English so that we only use characters in the range of US-ASCII, which is safe for the extended Latin charsets like ISO-8859-1 in your case as well as for UTF-8.
Just keep this in mind when you save files your users upload: filter filenames down to basic letters and punctuation (a-z, A-Z, 0-9, ., -, _) and you'll have nearly no hassles. You can even make them all lowercase for visual purposes.
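A filter along these lines could look like the following sketch. The function name sanitize_filename and the fallback name "upload" are assumptions made for illustration, not an existing PHP API.

```php
<?php
// Hypothetical helper, not part of PHP: reduce a client-supplied filename
// to lowercase a-z, 0-9, ".", "-" and "_".
function sanitize_filename(string $name): string
{
    // Keep only the part after the last slash or backslash.
    $parts = preg_split('~[/\\\\]~', $name);
    $name = end($parts);

    // Replace every run of other bytes with one underscore. There is no /u
    // modifier, so this works byte-wise on any ASCII-compatible encoding.
    $name = preg_replace('/[^A-Za-z0-9._-]+/', '_', $name);

    // Only ASCII is left now, so strtolower() is safe; also drop leading dots.
    $name = ltrim(strtolower($name), '.');

    // Fall back to a fixed name (an arbitrary choice) if nothing survived.
    return $name === '' ? 'upload' : $name;
}

echo sanitize_filename('Report 2011.TXT'), "\n"; // report_2011.txt
```

Different uploads can collapse to the same sanitized name, so you would still need to handle collisions when writing to disk.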
If you feel that this degrades usability and the file system does not offer the Unicode range of characters that UTF-8 does, you can fall back to simple encodings like rawurlencode (percent-encoding, triplets) and offer files for download by resolving that name on disk.
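As a rough sketch of that fallback: the name is percent-encoded before it touches the file system and decoded again when it is shown to the user. The variable names are illustrative, and the original name is assumed to be UTF-8.

```php
<?php
// Assumed: the user-supplied name is UTF-8 ("Übersicht 2011.pdf").
$original = "\xC3\x9Cbersicht 2011.pdf";

// Percent-encode it so that only US-ASCII characters reach the file system.
$onDisk = rawurlencode($original);
echo $onDisk, "\n"; // %C3%9Cbersicht%202011.pdf

// The encoding is reversible, so the display name can be restored later.
$restored = rawurldecode($onDisk);
var_dump($restored === $original); // bool(true)
```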
Normally you just need to deal with what you have. Ask a typical sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.
HTML
This is mostly independent of PHP. It's about the output your scripts produce, so it is still part of your field of work.
Rule of thumb is: Specify it. If you didn't specify it (HTML files, CSS files, JavaScript files), don't expect it to work precisely. So just do it. Encoding is a chain: if there are many components, ensure that each one knows its encoding. Otherwise browsers can only guess. UTF-8 is a good choice here, but our job is to take care and make this precise and well defined.
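For output generated by PHP, specifying it could be sketched like this. The helper content_type_header is a hypothetical name introduced for illustration; header() and headers_sent() are the regular PHP functions.

```php
<?php
// Hypothetical helper: build a Content-Type header line that names the charset.
function content_type_header(string $mimeType, string $charset = 'UTF-8'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mimeType, $charset);
}

// Headers must go out before any output, so only send while that is possible.
if (!headers_sent()) {
    header(content_type_header('text/html'));
}
```

In the HTML document itself you would additionally declare the encoding, for example with <meta charset="UTF-8"> in the head, and a CSS file can start with @charset "UTF-8";.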
PHP Settings
As a general rule of thumb, start by reading the php.ini file that ships with the PHP package of your Linux distro. It comes with readable documentation in its comments and further links. Some settings that come to my mind:
- Strings Docs - By default strings in PHP are binary. As long as you use them with binary-safe functions, you get what you expect. Since PHP 5.2.1 you can cast strings explicitly to binary strings. That's for forward compatibility with the then-planned PHP 6 Unicode support: $binary = (binary) $string; or $binary = b"binary string";.
- mb_internal_encoding() Docs - Get or set it; the matching INI setting is mbstring.internal_encoding. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
- iconv_set_encoding() Docs - The counterpart for the iconv extension. See also the iconv configuration settings.
- Various: Some functions that deal with character sequences allow you to specify a charset encoding, for example htmlspecialchars Docs. Make use of these parameters and check the docs for their default value. Often it is ISO-8859-1 (the default of htmlspecialchars before PHP 5.4), but you're looking for UTF-8. Other functions like html_entity_decode Docs use UTF-8 by default (as of PHP 5.4). Some, like htmlspecialchars_decode, do not specify a charset at all, so you need to read the PHP source code for a concrete understanding of how the function deals with the (binary) string.
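The points in this list can be sketched together in one short example. It assumes the mbstring and iconv extensions are loaded, and the variable names are only illustrative. The bytes are written as escapes so the result does not depend on how the file is saved.

```php
<?php
// Assumption: the mbstring and iconv extensions are available.
$utf8   = "caf\xC3\xA9"; // "café" encoded as UTF-8
$latin1 = "caf\xE9";     // "café" encoded as ISO-8859-1

// Tell mbstring once which encoding your strings use.
mb_internal_encoding('UTF-8');
var_dump(strlen($utf8));    // int(5), bytes
var_dump(mb_strlen($utf8)); // int(4), characters

// Name the charset explicitly instead of relying on the default.
$escaped = htmlspecialchars($utf8 . ' <b>', ENT_QUOTES, 'UTF-8');
var_dump($escaped === "caf\xC3\xA9 &lt;b&gt;"); // bool(true)

// A broken chain: ISO-8859-1 bytes declared as UTF-8 are invalid input,
// and with these flags htmlspecialchars() returns an empty string.
var_dump(htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8')); // string(0) ""

// Convert at the boundary when you know the source encoding.
$converted = iconv('ISO-8859-1', 'UTF-8', $latin1);
var_dump($converted === $utf8); // bool(true)
```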
To answer your question: which settings and parameters you need always depends on the components you use. For the general ones like the browser or the web server, it's possible to recommend settings that get them configured for UTF-8. But with everything else it depends. The most important thing is to look for it and to ensure that you know the encoding and can configure or specify it. Often it's documented. As long as you don't need to write portable code, this is much simpler, because you either control the environment or only need to deal with one specific environment. Write code defensively with encoding in mind and you should be fine.
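One way to read "defensively" is to check incoming data against the encoding you expect before it travels further down the chain. This is only a sketch: require_utf8 is a hypothetical helper name, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper: pass valid UTF-8 through, reject everything else.
// Assumption: the mbstring extension is available.
function require_utf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    return $input;
}

try {
    require_utf8("caf\xE9"); // ISO-8859-1 bytes, not valid UTF-8
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n"; // Input is not valid UTF-8.
}
```

Whether you reject, convert, or log such input depends on your application; the point is that the decision is made explicitly and not left to chance.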