UTF-8 problems in PHP: var_export() returns \0 null characters, and ucfirst(), strtoupper(), etc. behave strangely
We are dealing with a strange bug on a Joyent Solaris server that never happened before (it doesn't happen on localhost or on two other Solaris servers with an identical PHP configuration). Actually, I'm not sure whether we have to look at PHP or Solaris, or whether it is a software or a hardware problem...

I just want to post this in case somebody can point us in the right direction.

So, the problem seems to be in var_export() when dealing with non-ASCII characters. Executing this in the CLI, we get the expected result on our localhost machines and on two of the servers, but not on the third one. All of them are configured to work with UTF-8.

$ php -r "echo var_export('ñu', true);"

Gives this in older servers and localhost (expected):

'ñu'

But in the server we are having problems with (PHP Version => 5.3.6), it adds \0 null characters whenever it encounters an "uncommon" character: è, á, ç, ... you name it.

'' . "\0" . '' . "\0" . 'u'

Any idea where we should be looking? Thanks in advance.


More info:

  • PHP version 5.3.6.
  • setlocale() is not solving anything.
  • default_charset is UTF-8 in php.ini.
  • mbstring.internal_encoding is set to UTF-8 in php.ini.
  • mbstring.func_overload = 0.
  • this happens in both CLI (example) and web application (php-fpm + nginx).
  • iconv encoding is also UTF-8
  • all files utf-8 encoded.

system('locale') returns:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Some of the tests done so far (CLI):

Normal behaviour:

$ php -r "echo bin2hex('ñu');" => 'c3b175'
$ php -r "echo mb_strtoupper('ñu');" => 'ÑU'
$ php -r "echo serialize(\"\\xC3\\xB1\");" => 's:2:"ñ";'
$ php -r "echo bin2hex(addcslashes(b\"\\xC3\\xB1\", \"'\\\\\"));" => 'c3b1'
$ php -r "echo ucfirst('iñu');" => 'Iñu'

Not normal:

$ php -r "echo strtoupper('ñu');" => 'U' 
$ php -r "echo ucfirst('ñu');" => '?u' 
$ php -r "echo ucfirst(b\"\\xC3\\xB1u\");" => '?u' 
$ php -r "echo bin2hex(ucfirst('ñu'));" => '00b175'
$ php -r "echo bin2hex(var_export('ñ', 1));" => '2727202e20225c3022202e202727202e20225c3022202e202727'
$ php -r "echo bin2hex(var_export(b\"\\xC3\\xB1\", 1));" => '2727202e20225c3022202e202727202e20225c3022202e202727'
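The hex dumps in the last two tests are easier to read once decoded. A minimal sketch, assuming a hypothetical helper name decode_hex_dump(); it uses pack('H*', ...) rather than hex2bin(), because hex2bin() only arrived in PHP 5.4:

```php
<?php
// Hypothetical helper (not part of PHP): turn a bin2hex() dump back into raw bytes.
function decode_hex_dump($hex)
{
    return pack('H*', $hex);
}

// Hex dump reported above for var_export('ñ', 1) on the broken server.
$dump = '2727202e20225c3022202e202727202e20225c3022202e202727';
echo decode_hex_dump($dump), "\n";
```

Decoded, the dump reads '' . "\0" . '' . "\0" . '', so both UTF-8 bytes of ñ come out of var_export() as NUL escapes.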

So the problem seems to be in var_export() and in "string functions that use the current locale but operate byte-by-byte" (Docs; see @hakre's answer).

Swatter answered 16/3, 2012 at 16:47 Comment(18)
I'd start by checking the version of the software running on each server, specifically PHP. A function in one version might assume UTF-8 while the same function in a different version assumes ISO-8859-1. - Wooton
Also try comparing the output of locale(1) and/or checking the environment variables that start with LC. - Bart
Does this only happen on the CLI? That may be some special case of how Solaris' terminal handles Unicode. Or does this happen as well when running from source code files that are guaranteed not to contain NUL bytes? - Agnosia
Check two things. First, the php.ini that gets used on the CLI (it might differ from the one used by the web server): set default_charset to "utf-8" there. Second, check /etc/locale.gen to see whether that server even has en_US.UTF-8. - Wally
@Agnosia this happens in both the CLI (example) and the web application (php-fpm + nginx). - Swatter
@Christian php.ini's default_charset is UTF-8. - Swatter
Interspersing characters with NULs means that you're probably looking at UTF-16. - Pacifier
@DavidBélanger the default iconv encoding on the server is now ISO-8859-1. I changed it to UTF-8 some days ago in php.ini but the problem remained, so I reverted it to the original configuration... - Swatter
@DavidBélanger double checked :) Setting the iconv encoding to UTF-8 doesn't change a thing: neither var_export() nor ucfirst() works. - Swatter
If it's available on your server, try detecting the actual encoding with mb_detect_encoding(). - Giraldo
@Giraldo the mbstring extension is enabled in php.ini, and mbstring.internal_encoding is set to UTF-8. mb_detect_encoding('ñu') returns UTF-8. - Swatter
@Swatter And how about mb_detect_encoding(ucfirst('ñu'))? - Giraldo
@Giraldo mb_detect_encoding(ucfirst('ñu')) returns false. - Swatter
mb_detect_encoding is broken by design. Don't rely on that function; its outcome does not say much. Handle with care. - Roband
I think your best bet is to check whether the string has valid characters. Yesterday I was trying to run htmlentities on a string and the result was NULL, but my variable had some characters from Word (that's what I found). So try to encode something else manually. - Fakir
@DavidBélanger thanks for the comment, but the string is 'clean', and as you can see the problem happens both in the CLI and on the server (all tests included are from the CLI). - Swatter
@Swatter Okay, it's very weird... I'm trying to think why, but it's hard when I can't look under the hood myself... - Fakir
I'm sure this is related to Solaris and the system C libraries that are used by PHP. I'd say that the compiled packages have been messed up by the hoster, otherwise strtoupper would work. Get proper binaries. - Roband
R
5

I suggest you verify the PHP binary you've got problems with. Check the compiler flags and the libraries it makes use of.

Normally PHP internally uses binary strings, which means that functions like ucfirst work byte by byte and only support what your locale supports (if and as configured). See Details of the String Type (Docs).

$ php -r "echo ucfirst('ñu');" 

returns

?u

This makes sense, ñ is

LATIN SMALL LETTER N WITH TILDE (U+00F1)    UTF8: \xC3\xB1

You have some locale configured that makes PHP change \xC3 into something else, breaking the UTF-8 byte sequence and making your shell display the � replacement character (Wikipedia).
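One way to see which bytes the active locale rewrites is to push every high byte through strtoupper() and list the ones that come back changed. This is only a diagnostic sketch: changed_high_bytes() is a hypothetical name, and the result depends on the PHP version and on the locale in effect (newer PHP versions make strtoupper() ignore the locale altogether).

```php
<?php
// Hypothetical helper (not part of PHP): list the bytes 0x80-0xFF that
// strtoupper() alters, formatted as "0xc3 -> 0x00".
function changed_high_bytes()
{
    $changed = array();
    for ($i = 0x80; $i <= 0xFF; $i++) {
        $in  = chr($i);
        $out = strtoupper($in);
        if ($out !== $in) {
            $changed[] = sprintf('0x%02x -> 0x%s', $i, bin2hex($out));
        }
    }
    return $changed;
}

// Print one line per altered byte; no output means no high byte was touched.
foreach (changed_high_bytes() as $line) {
    echo $line, "\n";
}
```

Running it on a working server and on the broken one, and comparing the two lists, should show whether 0xC3 is the only byte affected.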

If you really want to analyze the issue, I suggest you start with hexdumps, placed next to how things get displayed in the shell and elsewhere. Know that you can explicitly define binary strings with b"string" (that's for forward compatibility; maybe some compile flag is enabled and you're on the experimental Unicode build?), and you can also write the bytes of a string literally, here as hex escapes for UTF-8:

 $ php -r "echo ucfirst(b\"\\xC3\\xB1u\");"
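A small sketch of that side-by-side comparison, assuming the mbstring extension is loaded; hex_bytes() and show_bytes() are hypothetical helper names introduced only for this illustration:

```php
<?php
// Hypothetical helper (not part of PHP): space-separated hex bytes of a string.
function hex_bytes($s)
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Hypothetical helper: print a label, the hex bytes, and the string as displayed.
function show_bytes($label, $s)
{
    printf("%-22s %-12s %s\n", $label, hex_bytes($s), $s);
}

$input = "\xC3\xB1u"; // 'ñu' written as explicit UTF-8 bytes
show_bytes('input', $input);
show_bytes('ucfirst', ucfirst($input));
show_bytes('mb_strtoupper (UTF-8)', mb_strtoupper($input, 'UTF-8'));
```

The hex column stays trustworthy even when the terminal mangles the displayed column, which is the point of comparing the two.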

And there are a lot more settings that can play a role, I started to list some points in an answer to Preparing PHP application to use with UTF-8.


Example of a multibyte ucfirst variant:

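/**
 * multibyte ucfirst
 *
 * PHP 8.4 and later ship a native mb_ucfirst(), hence the function_exists() guard.
 *
 * @param string $str
 * @param string|null $encoding (optional) defaults to mb_internal_encoding()
 * @return string
 */
if (!function_exists('mb_ucfirst')) {
    function mb_ucfirst($str, $encoding = NULL)
    {
        // Older PHP versions may reject a NULL encoding, so resolve it explicitly.
        if ($encoding === NULL) {
            $encoding = mb_internal_encoding();
        }
        $first = mb_substr($str, 0, 1, $encoding);
        // The length is counted in characters here, so use mb_strlen(), not strlen().
        $rest = mb_substr($str, 1, mb_strlen($str, $encoding), $encoding);
        return mb_strtoupper($first, $encoding) . $rest;
    }
}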

See mb_strtoupper (Docs) as well as mb_convert_case (Docs).

Roband answered 16/3, 2012 at 16:47 Comment(11)
I've made the 'hexadecimal test': all the servers, including the 'bad guy', return c3b175 when executing $ php -r "echo bin2hex('ñu');". Not sure how I should interpret this... - Swatter
And $ php -r "echo ucfirst(b\"\\xC3\\xB1u\");" returns ?u. - Swatter
And what does bin2hex(ucfirst('ñu')); give? (Your report shows that in both cases PHP uses the UTF-8 sequences inside the strings, so that is the same across those systems.) - Roband
bin2hex(ucfirst('ñu')); returns 00b175. - Swatter
On the system that's broken, I guess. Your locale is changing 0xC3 to 0x00. What does it give on a system that works? And keep in mind you need to use mb_ucfirst('ñu', 'UTF-8') (a multibyte replacement for ucfirst) anyway, because you are using a multibyte charset (UTF-8). - Roband
Yes, 00b175 is on the 'broken' server. The others just return c3b175. I can't use mb_ucfirst(); are you sure that function exists? It's not here: php.net/manual/en/ref.mbstring.php - Swatter
That mb_ucfirst was virtual; I'll add one to the answer as an example. - Roband
Yes, I see what you mean: I added a mb_strtoupper() test to the question. Results differ on the 'broken' server. - Swatter
@eillarra: Can you please give the output of uname -a? - Roband
SunOS 7aaefcc.local 5.11 joyent_20120126T071347Z i86pc i386 i86pc Solaris. As I say in the question, it is a Joyent machine; SmartOS they call it (based on Solaris), with PHP 5.3.6. - Swatter
Well, as even strtoupper is affected, I suspect this is related to the underlying C libs PHP was compiled against. You should check with Joyent support and ask them to provide a properly configured/compiled binary. I also suggest you get a more current PHP 5.3 release, like PHP 5.3.10. - Roband
F
0

phpunit tests for this are being added to https://gist.github.com/68f5781a83a8986b9d30 - can we build up a better unit test suite so that we can figure out what the expected output should be?

Father answered 16/3, 2012 at 16:47 Comment(0)
R
0

I normally use utf8_encode('ñu') for all the French characters.
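Note that utf8_encode() converts from ISO-8859-1 to UTF-8, so applying it to a string that is already UTF-8 (as in the question) encodes it a second time. A short sketch of the effect, with mb_convert_encoding() standing in for utf8_encode(), which is deprecated as of PHP 8.2; it assumes the mbstring extension is loaded:

```php
<?php
$utf8 = "\xC3\xB1"; // 'ñ' already encoded as UTF-8

// Treating those two bytes as ISO-8859-1 re-encodes each of them separately.
$double = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";   // c3b1
echo bin2hex($double), "\n"; // c383c2b1
```

The four resulting bytes display as "Ã±", so this approach only helps when the input really is ISO-8859-1.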

Report answered 16/3, 2012 at 16:47 Comment(1)
Thanks Vinay, but it seems to be an underlying C problem, maybe a compilation problem. Still trying to find it out, but PHP doesn't seem to be the source of the problem. - Swatter
S
0

Probably all your servers are in a good state. In one of the comments you said that you only have issues with ucfirst() and var_export(). Given those responses, you might be looking at the same problem as in this SO question. Most of the PHP string functions will not work properly with multibyte strings. That is why PHP has a separate set of functions to deal with them.

This might be helpful
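A minimal illustration of the difference between the byte-oriented functions and their mbstring counterparts, assuming the mbstring extension is loaded:

```php
<?php
$s = "\xC3\xB1u"; // 'ñu' as UTF-8 bytes

echo strlen($s), "\n";                            // 3: counts bytes
echo mb_strlen($s, 'UTF-8'), "\n";                // 2: counts characters
echo bin2hex(substr($s, 0, 1)), "\n";             // c3: half of 'ñ'
echo bin2hex(mb_substr($s, 0, 1, 'UTF-8')), "\n"; // c3b1: the whole 'ñ'
```

Cutting or case-mapping at byte boundaries is what splits a character like ñ in two.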

Schick answered 16/3, 2012 at 16:47 Comment(0)
W
0

Try forcing UTF-8 in PHP:

<?php ini_set('default_charset', 'UTF-8'); ?>

Put it at the very top (first line of code) of every page/template. It mostly helps me with my special characters. I'm not sure it will help you too, but try it.

Windbag answered 11/4, 2012 at 13:12 Comment(1)
default_charset is UTF-8 in php.ini. Thanks anyway. - Swatter
