My script works fine, but I'm confused about why I have to use utf8_decode()

Asked 22/3, 2012 at 19:6 Answered 25/3, 2012 at 15:48

I am confused about the behavior of utf8_decode() and just want a little clarification. I hope that's ok.

Here's a simple HTML form that I'm using to capture some text and save it to my MySQL database (which uses the utf8_general_ci collation):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<form action="update.php" method="post" accept-charset="utf-8"> 
<p> 
    Title: <input type="text" name="title" id="title" accept-charset="utf-8" size="75" value="" /> 
</p> 
<p> 
    <input type="submit" name="submit" value="Submit" /> 
</p> 
</form>
</body>
</html>

As you can see, I've got this coded up with charset=utf-8 in the appropriate places. We accept text that includes diacritics (e.g., ñ, ó, etc.). In the end, we run a little script on all text input to check for diacritics and change them to HTML entities (e.g., ñ becomes &ntilde;).

When input is received by my script, I first have to do utf8_decode($input) and then run my little script to check for and change diacritics as needed. Everything works fine. I'm curious as to why I have to run the decode on this input. I understand that utf8_decode converts a string encoded in UTF-8 to ISO-8859-1. I want to make sure - even though everything works fine (or so I think) - that I'm not doing something screwy that will catch up to me later. For instance, that I'm sending ISO-8859-1 encoded characters to be stored in my database that is set up to store/serve UTF-8 characters. Should I do something like run utf8_encode() on the string that my diacritics-to-entities script returns? E.g.:

$string = utf8_decode($string);
$search = explode(",","À,È,Ì,Ò,Ù,à,è,ì,ò,ù,Á,É,Í,Ó,Ú,Ý,á,é,í,ó,ú,ý,Â,Ê,Î,Ô,Û,â,ê,î,ô,û,Ã,Ñ,Õ,ã,ñ,õ,Ä,Ë,Ï,Ö,Ü,Ÿ,ä,ë,ï,ö,ü,ÿ,Å,å,Æ,æ,ß,Þ,þ,ç,Ç,Œ,œ,Ð,ð,Ø,ø,§,Š,š,µ,¢,£,¥,€,¤,ƒ,¡,¿");
$replace = explode(",","&Agrave;,&Egrave;,&Igrave;,&Ograve;,&Ugrave;,&agrave;,&egrave;,&igrave;,&ograve;,&ugrave;,&Aacute;,&Eacute;,&Iacute;,&Oacute;,&Uacute;,&Yacute;,&aacute;,&eacute;,&iacute;,&oacute;,&uacute;,&yacute;,&Acirc;,&Ecirc;,&Icirc;,&Ocirc;,&Ucirc;,&acirc;,&ecirc;,&icirc;,&ocirc;,&ucirc;,&Atilde;,&Ntilde;,&Otilde;,&atilde;,&ntilde;,&otilde;,&Auml;,&Euml;,&Iuml;,&Ouml;,&Uuml;,&Yuml;,&auml;,&euml;,&iuml;,&ouml;,&uuml;,&yuml;,&Aring;,&aring;,&AElig;,&aelig;,&szlig;,&THORN;,&thorn;,&ccedil;,&Ccedil;,&OElig;,&oelig;,&ETH;,&eth;,&Oslash;,&oslash;,&sect;,&Scaron;,&scaron;,&micro;,&cent;,&pound;,&yen;,&euro;,&curren;,&fnof;,&iexcl;,&iquest;");
$new_input = str_replace($search, $replace, $string);
return utf8_encode($new_input); // right now i just return $new_input.

Appreciate any insight anyone has to offer about this.

Tapir answered 22/3, 2012 at 19:6 Comment(1)

+1 for not letting "it works" be good enough – Sapphirine 22/3, 2012 at 19:9

Do not use "accept-charset". It's broken. Most browsers have stopped sending it in their own HTTP requests. Some browsers (IE) completely ignore this attribute when they parse a form, and others do a very limited job with it. In practice, the "accept-charset" attribute will do more harm than good.

The convention is that the browser will send the data in the same encoding as it received the form. So make sure your page is sent as UTF-8. Your meta-tag in the HTML's head isn't enough. For a PHP page, this setting can be set in 3 places:

An HTML tag <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the "head".
An AddDefaultCharset UTF-8 line in the Apache configuration (or anything similar in other web servers).
A PHP call to header("Content-Type: text/html; charset=utf-8"); (before anything is displayed on the page).

Each directive overrides the previous ones. So if your server already declares a charset, your meta tag will be ignored.
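
As a minimal sketch of the PHP option, the header can be sent at the very top of the script. The helper name content_type_header is made up for illustration; header() and headers_sent() are the real PHP functions.

```php
<?php
// Hypothetical helper: builds the header line so its format is easy to check.
function content_type_header(string $charset = 'utf-8'): string
{
    // Header name and value are separated by a colon.
    return 'Content-Type: text/html; charset=' . $charset;
}

// Must run before any output (even whitespace) is sent to the browser.
if (!headers_sent()) {
    header(content_type_header());
}
```

Since this header is sent by the script itself, it should take precedence over both the server default and the meta tag.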

So you should:

Make sure your source file is in UTF-8, of course.
Fix your HTML source so that it validates at W3C. For instance, your meta tag should be closed in XHTML.
Remove the "accept-charset" attributes.
If needed, force the encoding declaration in Apache or with PHP's header().
Ensure in your browser that the HTTP headers received from the server have the right encoding declared (or no encoding if you rely on your meta tag). On Linux, curl -I <URL> displays the HTTP headers only.
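
To check what the server actually declares, a small PHP sketch along these lines could pull the charset out of a Content-Type header value. The function name charset_from_content_type is an assumption for illustration, and it only handles the simple charset=... form.

```php
<?php
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns null when no charset is declared (the meta tag would then apply).
function charset_from_content_type(string $value): ?string
{
    if (preg_match('/charset\s*=\s*"?([^";\s]+)/i', $value, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

var_dump(charset_from_content_type('text/html; charset=UTF-8')); // string(5) "utf-8"
var_dump(charset_from_content_type('text/html'));                // NULL
```

The header value itself could come from the curl -I output or from PHP's get_headers().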

Taunton answered 25/3, 2012 at 15:48 Comment(0)

When you submit a form with accept-charset="utf-8", the browser sends the form data to the server encoded as UTF-8. utf8_decode turns the encoded data back into strict ISO-8859-1. For example, if you submit "ñ", the browser sends the two UTF-8 bytes "%C3%B1" to your form action, and utf8_decode converts them to the single ISO-8859-1 byte 0xF1, which is the "ñ" your script needs in order to work.
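
The byte values involved can be checked directly. This is only a sketch: it assumes the mbstring extension is available and uses mb_convert_encoding() as a stand-in for utf8_decode(), with the characters written as explicit bytes so the encoding of the source file does not matter.

```php
<?php
// "ñ" as explicit UTF-8 bytes.
$utf8 = "\xC3\xB1";

// What ends up on the wire for a UTF-8 form submission.
echo rawurlencode($utf8), "\n";   // %C3%B1

// Same conversion utf8_decode() performs: UTF-8 -> ISO-8859-1.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
echo rawurlencode($latin1), "\n"; // %F1

// "€" (U+20AC) does not exist in ISO-8859-1, so it cannot survive the
// conversion and is replaced by a substitute character.
$euro = mb_convert_encoding("\xE2\x82\xAC", 'ISO-8859-1', 'UTF-8');
echo strlen($euro), "\n";         // 1
```

The last step suggests where the decode could bite later: several characters in the question's search list (€, Œ, Š, Ÿ, ƒ) are not part of ISO-8859-1 and would be lost before the replacement runs.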

Olga answered 22/3, 2012 at 19:22 Comment(0)

That will get the page to display the text in UTF-8, but even if you switch the form to UTF-8 using accept-charset="utf-8", the server converts the input to ISO-8859-1. When it is displayed, it is converted from ISO-8859-1 to UTF-8 again, but a character that exists only in UTF-8 can't be converted, so it ends up displayed as a weird character, and every time you loop through this process it gets worse and worse. What I've found is that even if you do everything right on the HTML side, there isn't a way to switch the server so that it reads UTF-8, so you can't switch everything to UTF-8. That is on Apache, and if there is a way I'd love to know.
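
The "worse and worse" effect can be reproduced in a few lines. This is only a sketch of the mechanism (UTF-8 bytes misread as ISO-8859-1 and re-encoded as UTF-8), not a claim about what any particular server setup does, and it assumes the mbstring extension is available.

```php
<?php
// Start with "ñ" as UTF-8 (2 bytes), written as explicit bytes.
$text = "\xC3\xB1";

// Each pass wrongly treats the UTF-8 bytes as ISO-8859-1
// and encodes them to UTF-8 again, doubling the byte count.
for ($pass = 1; $pass <= 3; $pass++) {
    $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
    echo $pass, ': ', strlen($text), " bytes\n";
}
// 1: 4 bytes
// 2: 8 bytes
// 3: 16 bytes
```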

Warison answered 22/3, 2012 at 20:14 Comment(0)
