json_encode() non utf-8 strings?
So I have an array of strings, and all of the strings are using the system default ANSI encoding and were pulled from a SQL database. So there are 256 different possible character byte values (single byte encoding).
Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like \u0082?

Or is that the standard for JSON?

Rollway answered 7/7, 2011 at 6:30 Comment(0)
S
37

Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like "\u0082"?

If you have an ANSI-encoded string, utf8_encode() is the wrong function for the job. You need to convert it properly from ANSI to UTF-8 first. That gets rid of bogus Unicode escape sequences like \u0082 in the JSON output, because each character is then escaped as the character you actually meant. Escape sequences as such are valid JSON, so you need not fear them.

Converting ANSI to UTF-8 with PHP

json_encode works with UTF-8 encoded strings only. If you need to create valid json successfully from an ANSI encoded string, you need to re-encode/convert it to UTF-8 first. Then json_encode will just work as documented.

To convert from ANSI to UTF-8 (more precisely, I assume you have a Windows-1252 encoded string, which is popularly but wrongly referred to as ANSI), you can use the mb_convert_encoding() function:

$str = mb_convert_encoding($str, "UTF-8", "Windows-1252");

Another PHP function that can convert the encoding/charset of a string is iconv(), which is based on libiconv. You can use it as well:

$str = iconv("CP1252", "UTF-8", $str);
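
To see the effect of converting first, here is a minimal sketch. The sample string is made up for illustration, and it assumes PHP 5.5 or later (where json_encode() returns false on failure) with the mbstring and iconv extensions loaded:

```php
<?php
// Hypothetical sample: "café" in Windows-1252, where é is the single byte 0xE9.
$ansi = "caf\xE9";

// The raw bytes are not valid UTF-8, so json_encode() fails.
$raw = json_encode($ansi);
var_dump($raw); // bool(false)

// Both conversion functions should yield the same UTF-8 bytes.
$utf8 = mb_convert_encoding($ansi, "UTF-8", "Windows-1252");
$viaIconv = iconv("CP1252", "UTF-8", $ansi);
var_dump($utf8 === $viaIconv); // bool(true)

// Once the string is UTF-8, encoding works; é is escaped by default.
echo json_encode($utf8), "\n"; // "caf\u00e9"
```

The escape \u00e9 is still an escape, but it is the right one: any JSON parser turns it back into é.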

Note on utf8_encode()

utf8_encode() only works for Latin-1 (ISO-8859-1), not for ANSI. Running an ANSI string through it will corrupt some of the characters in that string, specifically those in the byte range 0x80 to 0x9F, where Windows-1252 and Latin-1 differ.
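
A small sketch of that difference, assuming the \u0082 in the question came from a Windows-1252 byte 0x82. Since utf8_encode() is deprecated as of PHP 8.2, the sketch uses mb_convert_encoding() with ISO-8859-1 as the source encoding as a stand-in for it:

```php
<?php
// The byte 0x82: a low quotation mark (U+201A) in Windows-1252,
// but an invisible C1 control character (U+0082) in Latin-1.
$byte = "\x82";

// Treating the byte as ISO-8859-1 mirrors what utf8_encode() does.
$asLatin1 = mb_convert_encoding($byte, "UTF-8", "ISO-8859-1");

// Treating it as Windows-1252 yields the character that was meant.
$asCp1252 = mb_convert_encoding($byte, "UTF-8", "Windows-1252");

echo json_encode($asLatin1), "\n"; // "\u0082"
echo json_encode($asCp1252), "\n"; // "\u201a"
```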


Related: What is ANSI format?


For more fine-grained control over what json_encode() returns, see the list of predefined constants (PHP version dependent, incl. PHP 5.4; some constants remain undocumented and are so far available in the source code only).

Changing the encoding of an array/iteratively (PDO comment)

As you wrote in a comment that you have trouble applying the function to an array, here is a code example. You always need to change the encoding first, before using json_encode. That is just a standard array operation; for the simpler case of PDOStatement::fetch(), a foreach iteration:

$items = array();
while ($row = $q->fetch(PDO::FETCH_ASSOC))
{
  foreach ($row as &$value)
  {
    // convert each string column from Windows-1252 to UTF-8
    if (is_string($value))
    {
      $value = mb_convert_encoding($value, "UTF-8", "Windows-1252");
    }
  }
  unset($value); # safety: remove reference
  // $row is UTF-8 now; running utf8_encode() on it again would double-encode it
  $items[] = $row;
}
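
If you already have the whole result as a nested array, for example from PDOStatement::fetchAll(), a sketch along these lines might do. The $rows data is hypothetical and stands in for a real result set, and Windows-1252 is assumed as the source encoding:

```php
<?php
// Hypothetical rows, shaped like the result of fetchAll(PDO::FETCH_ASSOC).
$rows = array(
    array('id' => 1, 'name' => "Ren\xE9e"), // é as Windows-1252 byte 0xE9
    array('id' => 2, 'name' => "na\xEFve"), // ï as Windows-1252 byte 0xEF
);

// array_walk_recursive() visits every leaf value, however deeply nested.
array_walk_recursive($rows, function (&$value) {
    if (is_string($value)) {
        $value = mb_convert_encoding($value, "UTF-8", "Windows-1252");
    }
});

$json = json_encode($rows);
echo $json, "\n";
// [{"id":1,"name":"Ren\u00e9e"},{"id":2,"name":"na\u00efve"}]
```

The is_string() check leaves integers and NULL columns untouched.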
Synchrocyclotron answered 7/7, 2011 at 7:9 Comment(4)
This pretty much answers my question. However, when I use mb_convert_encoding() as you said, json_encode() still gives me null values for the strings that have odd characters in them. (Note that this array of values comes from PDOStatement->fetchAll().) If I do something like the following instead, I'm iterating through each item rather than just grabbing an array, and the characters show up escaped, such as \u0082: "while($row = $q->fetch(PDO::FETCH_ASSOC)) { $items[] = array_map( utf8_encode, $row ); }" Should I be using this or doing it differently? – Rollway
Actually, mb_convert_encoding() does work. My mistake. Now it's down to either looping through the array fetched from PDOStatement->fetchAll(), or fetching each item individually and using the code I put in my other comment. :\ (Each item in this array has a couple of strings inside it, by the way.) – Rollway
@Josh, I've edited the question. I'm not sure I totally understood your problem; I added an example of how you can change the encoding of each element in the $row array you get from PDO. – Synchrocyclotron
I understand why this is the case, but /MAN/ this is annoying to have to deal with :D Thanks for the explanation @Synchrocyclotron – Insociable
R
11

The JSON standard ENFORCES Unicode encoding. From RFC4627:

3.  Encoding

   JSON text SHALL be encoded in Unicode.  The default encoding is
   UTF-8.

   Since the first two characters of a JSON text will always be ASCII
   characters [RFC0020], it is possible to determine whether an octet
   stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
   at the pattern of nulls in the first four octets.

           00 00 00 xx  UTF-32BE
           00 xx 00 xx  UTF-16BE
           xx 00 00 00  UTF-32LE
           xx 00 xx 00  UTF-16LE
           xx xx xx xx  UTF-8

Therefore, in the strictest sense, ANSI-encoded JSON wouldn't be valid JSON; this is why PHP enforces Unicode encoding when using json_encode().

As for "default ANSI", I'm pretty sure that your strings are encoded in Windows-1252. It is incorrectly referred to as ANSI.
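
You can observe PHP enforcing this. A minimal sketch, assuming PHP 5.5 or later (where json_encode() returns false on failure) and made-up sample data:

```php
<?php
// Hypothetical input: "été" in Windows-1252 (é is the byte 0xE9).
$input = "\xE9t\xE9";

$json = json_encode($input);
$failedOnUtf8 = ($json === false && json_last_error() === JSON_ERROR_UTF8);

if ($failedOnUtf8) {
    // Assumption: the data is Windows-1252; adjust to the real source encoding.
    $input = mb_convert_encoding($input, "UTF-8", "Windows-1252");
    $json = json_encode($input);
}

echo $json, "\n"; // "\u00e9t\u00e9"
```

Checking json_last_error() tells you that the failure was an encoding problem rather than, say, a nesting depth limit.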

Ribwort answered 7/7, 2011 at 7:26 Comment(1)
Yes, a string containing JSON encoded data must be a valid UTF-8 encoded string, but this doesn't mean the data itself cannot contain non-UTF-8 strings. The string "\xff" for example is a valid UTF-8 string (valid ASCII even), and represents a non-UTF-8 encoded string. – Grief
C
9
<?php
$array = array('first word' => array('Слово','Кириллица'),'second word' => 'Кириллица','last word' => 'Кириллица');
// default: multibyte characters are escaped as \uXXXX
echo json_encode($array);
/*
outputs {"first word":["\u0421\u043b\u043e\u0432\u043e","\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"],"second word":"\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430","last word":"\u041a\u0438\u0440\u0438\u043b\u043b\u0438\u0446\u0430"}
*/
// JSON_UNESCAPED_UNICODE has the integer value 256
echo json_encode($array, JSON_UNESCAPED_UNICODE);
/*
outputs {"first word":["Слово","Кириллица"],"second word":"Кириллица","last word":"Кириллица"}
*/
?>

JSON_UNESCAPED_UNICODE (integer) Encode multibyte Unicode characters literally (default is to escape as \uXXXX). Available since PHP 5.4.0.

http://php.net/manual/en/json.constants.php#constant.json-unescaped-unicode

Copp answered 10/6, 2015 at 15:39 Comment(1)
For me this worked: json_encode($obj, JSON_UNESCAPED_UNICODE); – Maraca
L
-2

I found the following answer to an analogous problem, where I had to JSON-encode a nested array that was not UTF-8 encoded:

$inputArray = array(
    'a'=>'First item - à',
    'c'=>'Third item - é'
);
$inputArray['b']= array (
          'a'=>'First subitem - ù',
          'b'=>'Second subitem - ì'
    );
 if (!function_exists('recursive_utf8')) {
  function recursive_utf8 ($data) {
     if (!is_array($data)) {
        return utf8_encode($data);
     }
     $result = array();
     foreach ($data as $index=>$item) {
        if (is_array($item)) {
           $result[$index] = array();
           foreach($item as $key=>$value) {
              $result[$index][$key] = recursive_utf8($value);
           }
        }
        else if (is_object($item)) {
           $result[$index] = array();
           foreach(get_object_vars($item) as $key=>$value) {
              $result[$index][$key] = recursive_utf8($value);   
           }
        } 
        else {
           $result[$index] = recursive_utf8($item);
        }
     }
     return $result; 
   }
}
$outputArray =  json_encode(array_map('recursive_utf8', $inputArray ));
Lelahleland answered 25/6, 2014 at 9:24 Comment(0)
L
-3
json_encode($str,JSON_HEX_TAG|JSON_HEX_AMP|JSON_HEX_APOS|JSON_HEX_QUOT);

that will convert windows based ANSI to utf-8 and the error will be no more.

Largescale answered 18/6, 2014 at 3:16 Comment(1)
None of those flags has anything to do with encoding. JSON_HEX_TAG deals with "<" and ">"; JSON_HEX_AMP deals with "&"; JSON_HEX_APOS deals with the single quote "'"; JSON_HEX_QUOT deals with the double quote '"'. – Decameter
D
-4

Use this instead:

<?php 
//$return_arr = the array of data to json encode 
//$out = the output of the function 
//don't forget to escape the data before use it! 

$out = '["' . implode('","', $return_arr) . '"]'; 
?>

Copied from the comments on the PHP manual page for json_encode. Always read the comments. They are useful.

Dunagan answered 7/7, 2011 at 6:35 Comment(1)
That's not helpful because it doesn't deal with the encoding at all. – Synchrocyclotron
