Preparing PHP application to use with UTF-8
UTF-8 is the de facto standard for web applications now, but it is not the default encoding for PHP (until 6.0). Most servers are set up for ISO-8859-1 by default.

How can I override the default settings in .htaccess to be sure that everything works for UTF-8, locale, etc.? Are there any options to set for the web server or the Unix OS?

Is there a comprehensive list of those settings, e.g. the mbstring options, iconv settings, locale, etc. that I should set up for each multi-language project? Is there a predefined .htaccess I could use as an example?

(In my particular case I need a setup for these languages: English, Dutch and Russian. The server is in Ukraine.)

Hoke answered 8/8, 2011 at 20:8 Comment(2)
PHP 6.0 is not -- and will never be; at least, not as we expected it to be. - Must
@hakre, Pascal; Right, I meant Unicode aware, not UTF-8. - Hoke
G
15

Some useful options to have in .htaccess:

########################################
# Locale settings
########################################

# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"

SetEnv   LC_ALL  nl_NL.UTF-8

########################################
# Set up UTF-8 encoding
########################################

AddDefaultCharset UTF-8
AddCharset UTF-8 .php

php_value default_charset "UTF-8"

php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"

php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
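# Boolean directives are set with php_flag, not php_value
php_flag mbstring.encoding_translation On
# Note: mbstring.func_overload is deprecated as of PHP 7.2 and removed in PHP 8.0
php_value mbstring.func_overload 6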

# See also php functions:
# mysql_set_charset
# mysql_client_encoding

# database settings
#CREATE DATABASE db_name
#   CHARACTER SET utf8
#   DEFAULT CHARACTER SET utf8
#   COLLATE utf8_general_ci
#   DEFAULT COLLATE utf8_general_ci
#   ;
#
#ALTER DATABASE db_name
#   CHARACTER SET utf8
#   DEFAULT CHARACTER SET utf8
#   COLLATE utf8_general_ci
#   DEFAULT COLLATE utf8_general_ci
#   ;

#ALTER TABLE tbl_name
#   DEFAULT CHARACTER SET utf8
#   COLLATE utf8_general_ci
#   ;
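
To see which of these values actually reached PHP (hosts may ignore php_value lines in .htaccess, for example under CGI/FPM), a small check script can help. This is only a sketch; the list of names is my own pick, and a directive whose extension is not loaded shows up as "(not available)".

```php
<?php
// Print the effective value of a few encoding-related directives.
// ini_get() returns false when the directive does not exist
// (for example because the extension is not loaded).
$names = ['default_charset', 'date.timezone', 'mbstring.internal_encoding'];

foreach ($names as $name) {
    $value = ini_get($name);
    printf(
        "%-28s %s\n",
        $name,
        $value === false ? '(not available)' : var_export($value, true)
    );
}
```

Run it through the web server, not the command line, since .htaccess only applies there.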
Greerson answered 8/8, 2011 at 22:50 Comment(0)
C
5

You're right, UTF-8 is a good choice for web applications.

Encoding is meta-information about the data being processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost if you don't know the encoding. I often call this a chain: if the encoding chain is broken, the data will be broken. This is true both for displaying data and for security.

As a rule of thumb, PHP is binary; it's the context (that is, you) that specifies the encoding, e.g. through how you save your PHP source files.

So let's tackle a short (and incomplete) list:

The OS

Environment variables might tell you about the locale in use and the encoding. File systems have their own encoding for the names of files and directories, for example. I'm not very well versed in this subject; normally we try to name our files in English so that they use only characters in the US-ASCII range, which is safe for the extended Latin charsets like ISO-8859-1 in your case as well as for UTF-8.

Just keep this in mind when you save files your users upload: filter filenames down to basic letters, digits and punctuation (a-z, A-Z, 0-9, ., -, _) and you'll have nearly no hassle. You can even make them all lowercase for visual purposes.

If you feel that this degrades usability and the file system does not offer the Unicode range of characters that UTF-8 covers, you can fall back to a simple encoding like rawurlencode (percent-encoding, triplets) and offer files for download by resolving that name to the one on disk.
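
A minimal sketch of both approaches, assuming PHP 7+ and a source file saved as UTF-8. The function name sanitize_filename and the fallback name 'file' are made up for illustration; they are not part of PHP.

```php
<?php
// Hypothetical helper: reduce an uploaded file name to a-z, 0-9, ".", "-", "_".
function sanitize_filename(string $name): string
{
    // Keep only the part after the last slash or backslash (byte-wise, so it
    // is safe for UTF-8 input, whose multi-byte sequences never contain "/").
    $name = str_replace('\\', '/', $name);
    $pos = strrpos($name, '/');
    if ($pos !== false) {
        $name = substr($name, $pos + 1);
    }

    // Replace every run of bytes outside the whitelist with one underscore.
    $name = preg_replace('/[^A-Za-z0-9._-]+/', '_', $name);

    // Only ASCII is left at this point, so strtolower() is safe here.
    $name = strtolower($name);

    // Strip leading/trailing dots and underscores; never return an empty name.
    $name = trim($name, '._');
    return $name === '' ? 'file' : $name;
}

echo sanitize_filename('My Photo (1).JPG'), "\n";
echo sanitize_filename('../../etc/passwd'), "\n";

// Alternative: keep the full name but store it percent-encoded.
$original = 'Отчёт.pdf';
$onDisk   = rawurlencode($original);   // ASCII-only name for the file system
$restored = rawurldecode($onDisk);     // name to offer for download
var_dump($restored === $original);
```

The whitelist variant loses information (non-ASCII characters collapse into underscores), while the rawurlencode variant is reversible but produces long names.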

Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.

HTML

This is mostly independent of PHP; it's about the output your scripts provide, so it is still part of your field of work.

The rule of thumb is: specify it. If you didn't specify the encoding (for HTML files, CSS files, JavaScript files), don't expect it to work precisely. Just do it. Encoding is a chain: if there are many components, ensure that each one knows its encoding. Otherwise browsers can only guess. UTF-8 is a good choice, but our job is to take care and make this precise and well defined.

PHP Settings

As a general rule of thumb, start by reading the php.ini file that ships with the PHP package of your Linux distro. It comes with readable documentation in its comments and further links. Some settings that come to mind:

Strings

  • Strings (docs) - By default, strings in PHP are binary. As long as you use them with binary-safe functions, you get what you expect. Since PHP 5.2.1 you can cast strings explicitly to binary strings, for forward compatibility with the planned PHP 6 Unicode support: $binary = (binary) $string; or $binary = b"binary string";.
  • mb_internal_encoding() (docs) - Get or set it; the INI setting is mbstring.internal_encoding. The internal encoding is the character encoding name used for the HTTP input character encoding conversion, HTTP output character encoding conversion, and the default character encoding for string functions defined by the mbstring module.
  • iconv_set_encoding() (docs) - The comparable function for the iconv extension. See also the iconv configuration settings.
  • Various: Some functions that deal with character sequences allow you to specify a charset encoding, for example htmlspecialchars (docs). Make use of these parameters and check the docs for their default value. Often it is ISO-8859-1 but you're looking for UTF-8. Other functions, like html_entity_decode (docs), use UTF-8 by default. Some, like htmlspecialchars_decode, do not specify a charset at all, so you need to read the PHP source code for a concrete understanding of how the function deals with the (binary) string.
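
As a rough illustration of "specify it" on the PHP side, here is a small sketch. It assumes the mbstring extension is loaded and that the source file itself is saved as UTF-8; the sample strings are arbitrary.

```php
<?php
// Tell mbstring once which encoding the application works in.
mb_internal_encoding('UTF-8');

$text = 'Привет & "wereld"';

// Pass the flags and the charset explicitly instead of relying on defaults.
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
echo $escaped, "\n";

// A broken byte sequence: with an explicit UTF-8 charset (PHP 5.4+),
// htmlspecialchars() returns an empty string unless ENT_SUBSTITUTE or
// ENT_IGNORE is given.
$broken = "abc\xC3\x28";
var_dump(htmlspecialchars($broken, ENT_QUOTES, 'UTF-8'));
var_dump(mb_check_encoding($broken, 'UTF-8'));
```

An empty result from htmlspecialchars() is therefore a hint that the encoding chain is broken somewhere upstream, not that the input was empty.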

To answer your question: the settings and parameters you need always depend on the components you use. For general ones like the browser or the web server, it's possible to recommend settings that configure them for UTF-8. For everything else, it depends. The most important thing is to look for the encoding, to ensure that you know it, and to be able to configure or specify it. Often it's documented. As long as you don't need to write portable code, this is much simpler, because you either control the environment or only need to deal with one specific environment. Write code defensively with encoding in mind and you should be fine.

Chemosphere answered 8/8, 2011 at 21:51 Comment(1)
What's the difference between iconv and mb_string? - Nostradamus
T
3
  1. All your files have to be saved in UTF-8 (without BOM) using your code editor.
  2. The web server may be configured to send inappropriate headers, so it's recommended to override them at the application level. For instance:

    header('Content-Type: text/html; charset=utf-8');
    
  3. Add HTML meta content-type:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    
  4. Use htmlspecialchars() instead of htmlentities() because the former is enough in utf-8 and the latter is incompatible with utf-8 by default.

  5. Avoid PHP's standard string functions where possible, because many of them are incompatible with utf-8. Try to find their counterparts in Multibyte String or other libraries. (Don't forget to set the default charset for the library before using it, because the library supports many encodings and utf-8 is just one of them.)
  6. For regular expressions use u modifier. For example:

    preg_match('/ž{3,5}/u', $string, $matches);
    

    In addition, this is the most reliable way to check whether a given string is a valid utf-8 string:

    if (@preg_match('//u', $string) === false) {
        // NOT valid!
    } else {
        // Valid!
    }
    
  7. If you use the database then always set appropriate connection encoding right after the connection is made. Example for MySQL:

    mysql_set_charset('utf8', $link);
    

    Also check whether the columns in the database are in utf-8. It's not always needed, but it is recommended.
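
To make points 5 and 6 concrete, here is a small comparison. It is a sketch that assumes the mbstring extension is available and that the file is saved as UTF-8 (point 1), so that ž is stored as two bytes.

```php
<?php
$word = 'žžž';   // three characters, six bytes in UTF-8

// Byte-oriented standard function vs. its multibyte counterpart.
var_dump(strlen($word));                      // counts bytes
var_dump(mb_strlen($word, 'UTF-8'));          // counts characters
var_dump(mb_substr($word, 0, 1, 'UTF-8'));    // first character, not first byte
var_dump(mb_strtoupper('привет', 'UTF-8'));

// With the u modifier the quantifier applies to characters ...
var_dump(preg_match('/^ž{3}$/u', $word));
// ... without it, {3} applies only to the last byte of ž, so nothing matches.
var_dump(preg_match('/^ž{3}$/', $word));
```

Passing the encoding explicitly to each mb_* call, as done here, avoids depending on whatever mb_internal_encoding() happens to be set to.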

Titograd answered 8/8, 2011 at 22:32 Comment(3)
Does the /u modifier in regexes require any specific notation for the unicode characters?Hoke
@Hoke Not sure I understand what you mean. If you want, you can use this notation for Unicode characters: \x{nnnn}. But usually it's not needed if your files are saved in UTF-8, because you can write Unicode characters directly into the regex like I did in my example. In UTF-8 some characters take more than 1 byte. Let's say we have this regex: /ž{3}/u. Here the number 3 counts characters (not bytes) when the u modifier is on. There are also special Unicode properties for regular expressions: php.net/manual/en/regexp.reference.unicode.php - Titograd
Thanks, this is what I was asking about.Hoke
D
1

Basically I do three things to work correctly with the Czech language:

1) define locale in PHP:

setlocale(LC_COLLATE, "cs_CZ");
setlocale(LC_CTYPE, "cs_CZ");

so you would use something like:

setlocale(LC_ALL, "en_US.utf8");
setlocale(LC_ALL, "nl_NL.utf8");

depending on the language that is currently selected.

2) define charset for the database:

mysql_query("set names latin2 collate latin2_czech_cs");

3) define the charset of PHP/HTML code:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">

I don't use any .htaccess settings. You can adapt this to your case: for the locale use something like en_US.utf8 (depending on the language that is currently selected), and for the charset use utf-8 instead of latin2/iso-8859-2, and it should work well.
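
Locale names differ between systems (en_US.utf8 on one box, en_US.UTF-8 on another; `locale -a` lists what is installed), so it may be safer to pass several candidates. A sketch, where the language codes, the candidate lists and the function name switch_locale are my own assumptions:

```php
<?php
// Hypothetical mapping from application language to candidate locale names.
$candidates = [
    'en' => ['en_US.UTF-8', 'en_US.utf8'],
    'nl' => ['nl_NL.UTF-8', 'nl_NL.utf8'],
    'ru' => ['ru_RU.UTF-8', 'ru_RU.utf8'],
];

// Returns the locale name that was actually set, or false if none worked.
function switch_locale(string $lang, array $candidates)
{
    if (!isset($candidates[$lang])) {
        return false;
    }
    // Given an array, setlocale() tries each name in order and returns
    // the first one the system accepts, or false if all of them fail.
    return setlocale(LC_ALL, $candidates[$lang]);
}

$active = switch_locale('nl', $candidates);
echo $active === false ? "no Dutch UTF-8 locale installed\n" : "using $active\n";
```

Checking the return value matters, because setlocale() fails silently when the locale is not installed on the server.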

Diorio answered 8/8, 2011 at 20:20 Comment(0)
E
0

Try one of the following:

 AddDefaultCharset UTF-8
 AddCharset UTF-8 .php
Emplane answered 8/8, 2011 at 20:19 Comment(0)
