Using UTF-8 charset with PHP - are mb functions required?
A

8

6

These past few days I've been working toward converting my PHP code base from latin1 to UTF-8. I've read that the two main solutions are to either replace the single-byte functions with the built-in multibyte functions, or set the mbstring.func_overload value in the php.ini file.

But then I came across this thread on Stack Overflow, where the post by thomasrutter seems to indicate that the multibyte functions aren't actually necessary for UTF-8, as long as the script and string literals are encoded in UTF-8.

I haven't found any other evidence either way, and if it turns out I don't need to convert my code to the mb_* functions then that would be a real time saver! Anyone able to shed some light on this?

Ambrosia answered 16/11, 2009 at 19:55 Comment(0)
H
11

As far as I understand the issue, as long as all your data is 100% UTF-8 (user input, database, and the PHP files themselves if they contain special characters), this is true for search and comparison operations. However, as @ntd points out, a non-multibyte strlen() will produce wrong results when run on a string that contains multibyte characters.
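A minimal sketch of that distinction, assuming the file is saved as UTF-8 and the mbstring extension is loaded. The sample string and variable names are illustrative only.

```php
<?php
// Assumes this file is saved as UTF-8 and ext/mbstring is available.
$haystack = "naïve café";

// Byte-wise search and comparison are safe on valid UTF-8: the bytes of one
// character never appear starting in the middle of another character.
$found    = strpos($haystack, "café") !== false;   // true
$same     = ($haystack === "naïve café");          // true
$replaced = str_replace("café", "bar", $haystack); // "naïve bar"

// Offsets and lengths differ: the plain functions count bytes.
$byteOffset = strpos($haystack, "café");             // 7 ("ï" takes two bytes)
$charOffset = mb_strpos($haystack, "café", 0, "UTF-8"); // 6
$bytes      = strlen($haystack);                     // 12
$chars      = mb_strlen($haystack, "UTF-8");         // 10
```

So a byte-wise search finds the right thing, but any number it returns is a byte position, not a character position.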

This is a great article on the basics of encoding.

Hereditable answered 16/11, 2009 at 20:2 Comment(1)
Thanks everyone who responded, I understand now. Much appreciated! - Ambrosia
T
4

They aren't "necessary" unless you're using any of the functions they replace (and it's likely that you are using at least one of these) or otherwise explicitly need a feature of the extension such as HTTP handling.

When working towards UTF-8 compliance, I always fall back to the PHP UTF-8 Cheatsheet with one addition: PCRE patterns need to be updated to use the u modifier.
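To illustrate the point about the u modifier, here is a small sketch, assuming the file is saved as UTF-8. The sample strings are made up for the example.

```php
<?php
// Assumes this file is saved as UTF-8.
$char = "é"; // one character, two bytes in UTF-8

// Without /u, "." matches a single byte, so a two-byte character fails ^.$
$withoutU = preg_match('/^.$/', $char);  // 0

// With /u, pattern and subject are treated as UTF-8 and "." matches one code point.
$withU = preg_match('/^.$/u', $char);    // 1

// /u also makes it easy to split a string into characters.
$chars = preg_split('//u', "žđš", -1, PREG_SPLIT_NO_EMPTY); // ["ž", "đ", "š"]
```

Note that with /u the preg_* functions fail (returning false) when the subject is not valid UTF-8, so input validation matters more once the modifier is in place.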

Tortilla answered 16/11, 2009 at 20:17 Comment(0)
S
3

As soon as you're examining or modifying a multibyte string, you need to use a mb_* function. A very quick example which demonstrates why:

$str = "abcžđščćöçefg";
mb_internal_encoding("UTF-8");

echo "strlen: ".strlen($str)."\n";
echo "mb_strlen: ".mb_strlen($str)."\n";

This prints out:

strlen: 20
mb_strlen: 13
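The same applies to modifying a string. As a rough sketch (assuming a UTF-8 source file and the mbstring extension), cutting the same string with substr() and mb_substr() gives different results:

```php
<?php
// Assumes this file is saved as UTF-8 and ext/mbstring is available.
$str = "abcžđščćöçefg";

// substr() cuts after 4 bytes, which splits the two-byte "ž" in half.
$byteCut        = substr($str, 0, 4);
$byteCutIsValid = mb_check_encoding($byteCut, "UTF-8"); // false

// mb_substr() cuts after 4 characters.
$charCut = mb_substr($str, 0, 4, "UTF-8"); // "abcž"
```

The byte-cut result ends in a dangling lead byte, which typically shows up as a replacement character in the browser or as an error when inserting into a UTF-8 database column.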
Showiness answered 16/11, 2009 at 20:19 Comment(0)
T
2

thomasrutter only indicates that searching does not need special handling. Other operations do: for example, if you need to check the length of a UTF-8 string, I don't see how you can do that using plain strlen().

Tapley answered 16/11, 2009 at 20:10 Comment(0)
L
2

Functions such as mb_strtoupper may be necessary, too. strtoupper won't convert á to Á.
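A short sketch of the difference, assuming a UTF-8 source file and the mbstring extension; the sample word is arbitrary.

```php
<?php
// Assumes this file is saved as UTF-8 and ext/mbstring is available.
$word = "árbol";

// Passing the encoding explicitly avoids relying on mb_internal_encoding().
$upper = mb_strtoupper($word, "UTF-8"); // "ÁRBOL"

// strtoupper() works byte by byte. On PHP 8.2 and later it only changes
// ASCII letters; on earlier versions the result depends on the current locale.
$byteUpper = strtoupper($word);

echo $upper, "\n", $byteUpper, "\n";
```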

Lafrance answered 16/11, 2009 at 20:55 Comment(0)
E
1

There are a number of functions that expect strings to be single-byte (and some even presume that the encoding is ISO-8859-1). In these cases, you need to be aware of what you're doing and possibly use replacement functions. There is a fairly comprehensive list at: http://www.phpwact.org/php/i18n/utf-8
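strrev() is one example of such a byte-oriented function. The sketch below shows the problem and one possible replacement. utf8_strrev is a hypothetical helper name rather than a built-in function, and it assumes valid UTF-8 input (preg_split() returns false otherwise).

```php
<?php
// Assumes this file is saved as UTF-8 and ext/mbstring is available.
// utf8_strrev is a hypothetical helper, not part of PHP.
function utf8_strrev(string $s): string
{
    // Split into code points using the /u modifier, then reverse the pieces.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($chars));
}

$input = "añb";

// strrev() reverses bytes, so the two bytes of "ñ" end up swapped.
$byteReversed        = strrev($input);
$byteReversedIsValid = mb_check_encoding($byteReversed, "UTF-8"); // false

$charReversed = utf8_strrev($input); // "bña"
```

This reverses code points, not grapheme clusters, so combining marks would still end up attached to the wrong base character.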

Eddins answered 16/11, 2009 at 20:39 Comment(0)
C
0

You could use the mbfunctions library that extends the multibyte functions in PHP:

http://code.google.com/p/mbfunctions/

Concernment answered 22/12, 2009 at 10:12 Comment(0)
P
-1

You can use the mbstring.func_overload setting (http://php.net/manual/en/mbstring.overload.php) in the php.ini file, so you don't need to change your code.

But be careful, because not all string functions will be automatically replaced. substr_replace is one that is not: http://php.net/manual/en/function.substr-replace.php
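For a function that has no multibyte counterpart, a character-aware version has to be written by hand. The following is only a sketch built on mb_substr(). utf8_substr_replace is a hypothetical helper, and unlike substr_replace() it handles only non-negative offsets and lengths on a single string.

```php
<?php
// Assumes this file is saved as UTF-8 and ext/mbstring is available.
// utf8_substr_replace is a hypothetical helper, not part of PHP or mbstring.
// It only supports non-negative $start and $length.
function utf8_substr_replace(string $s, string $replacement, int $start, int $length): string
{
    // Everything before the replaced range, counted in characters.
    $head = mb_substr($s, 0, $start, "UTF-8");
    // Everything after the replaced range, up to the end of the string.
    $tail = mb_substr($s, $start + $length, null, "UTF-8");
    return $head . $replacement . $tail;
}

$result = utf8_substr_replace("žđšč", "X", 1, 2); // "žXč"
```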

Paton answered 22/12, 2009 at 10:19 Comment(1)
Not anymore: "This feature has been DEPRECATED as of PHP 7.2.0. Relying on this feature is highly discouraged." - Fredra
