Detect encoding and make everything UTF-8

26

332

I'm reading out lots of texts from various RSS feeds and inserting them into my database.

Of course, there are several different character encodings used in the feeds, e.g. UTF-8 and ISO 8859-1.

Unfortunately, there are sometimes problems with the encodings of the texts. Example:

  1. The "ß" in "Fußball" should look like this in my database: "ÃŸ". If it is a "ÃŸ", it is displayed correctly.

  2. Sometimes, the "ß" in "Fußball" looks like this in my database: "ÃƒÅ¸". Then it is displayed wrongly, of course.

  3. In other cases, the "ß" is saved as a "ß" - so without any change. Then it is also displayed wrongly.

What can I do to avoid the cases 2 and 3?

How can I make everything the same encoding, preferably UTF-8? When must I use utf8_encode(), when must I use utf8_decode() (it's clear what the effect is but when must I use the functions?) and when must I do nothing with the input?

How do I make everything the same encoding? Perhaps with the function mb_detect_encoding()? Can I write a function for this? So my problems are:

  1. How do I find out what encoding the text uses?
  2. How do I convert it to UTF-8 - whatever the old encoding is?

Would a function like this work?

function correct_encoding($text) {
    $current_encoding = mb_detect_encoding($text, 'auto');
    $text = iconv($current_encoding, 'UTF-8', $text);
    return $text;
}

I've tested it, but it doesn't work. What's wrong with it?

Moreover answered 26/5, 2009 at 13:50 Comment(3)
"The "ß" in "Fußball" should look like this in my database: "ÃŸ".". No, it should look like ß. Make sure your collation and connection are set up correctly. Otherwise sorting and searching will be broken for you.Estranged
Your database is badly set up. If you want to store Unicode content, just configure it for that. So instead of trying to work around the issue in your PHP code, you should first fix the database.Transfuse
USE: $from=mb_detect_encoding($text); $text=mb_convert_encoding($text,'UTF-8',$from);Mime
386

If you apply utf8_encode() to an already UTF-8 string, it will return garbled UTF-8 output.
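
To make the garbling concrete, here is a minimal sketch of what happens to the bytes. It assumes the mbstring extension and uses mb_convert_encoding() from ISO-8859-1 as a stand-in for utf8_encode(), which performs the same mapping; latin1ToUtf8() is a name made up for this illustration.

```php
<?php
// ISO-8859-1 -> UTF-8, the same conversion utf8_encode() performs
function latin1ToUtf8(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "Fu\xDFball";          // "Fußball" as ISO-8859-1 bytes
$once   = latin1ToUtf8($latin1); // correct UTF-8
$twice  = latin1ToUtf8($once);   // UTF-8 bytes misread as Latin-1 and encoded again

echo bin2hex($once), "\n";  // 4675c39f62616c6c
echo bin2hex($twice), "\n"; // 4675c383c29f62616c6c
```

The second result is still valid UTF-8, which is why the damage is easy to miss: the single "ß" has become two unrelated characters.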

I made a function that addresses all these issues. It's called Encoding::toUTF8().

You don't need to know what the encoding of your strings is. It can be Latin1 (ISO 8859-1), Windows-1252 or UTF-8, or the string can have a mix of them. Encoding::toUTF8() will convert everything to UTF-8.

I did it because a service was giving me a feed of data all messed up, mixing UTF-8 and Latin1 in the same string.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::toUTF8($utf8_or_latin1_or_mixed_string);

$latin1_string = Encoding::toLatin1($utf8_or_latin1_or_mixed_string);

Download:

https://github.com/neitanod/forceutf8

I've included another function, Encoding::fixUTF8(), which will fix every UTF-8 string that looks garbled.

Usage:

require_once('Encoding.php');
use \ForceUTF8\Encoding;  // It's namespaced now.

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃÂ©dération Camerounaise de Football");
echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂ©dÃÂ©ration Camerounaise de Football");
echo Encoding::fixUTF8("FÃÂÂÂÂ©dÃÂÂÂÂ©ration Camerounaise de Football");

will output:

Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football
Fédération Camerounaise de Football

I've transformed the function (forceUTF8) into a family of static functions on a class called Encoding. The new function is Encoding::toUTF8().

Stadler answered 13/8, 2010 at 18:49 Comment(25)
Thank you very much, this is exactly what I was looking for :) But it would be best to have only one single function which does everything. So forceUTF8() should include fixUTF8()'s skills.Moreover
Well, if you look at the code, fixUTF8 simply calls forceUTF8 again and again until the string is returned unchanged. One call to fixUTF8() takes at least twice the time of a call to forceUTF8(), so it's a lot less performant. I made fixUTF8() just to create a command line program that would fix "encode-corrupted" files, but in a live environment it is rarely needed.Jerryjerrybuild
How does this convert non-UTF8 characters to UTF8, without knowing what encoding the invalid characters are in to begin with?Mastodon
It assumes ISO-8859-1, the answer already says this. The only difference between forceUTF8() and utf8_encode() is that forceUTF8() recognizes UTF8 characters and keeps them unchanged.Jerryjerrybuild
i had to add $value = str_ireplace("�", "à", $value); before using fixUTF8Balkin
If you get a code 500 error it means that your php doesn't support namespaces. You can safely remove it in that case. (line 41)Hogtie
@SebastiánGrignoli would be nice if you could integrate fixUTF8 and toUTF8 into a single (additional?) function. Also an array_walk function with this would be nice :)Hogtie
These functions already walk arrays recursively if you provide them instead of strings. fixUTF8 is not really intended for production environments. See the second comment on this answer.Jerryjerrybuild
"You dont need to know what the encoding of your strings is." - I very much disagree. Guessing and trying may work, but you'll always sooner or later encounter edge cases where it doesn't.Sueannsuede
I totally agree. In fact, I didn't mean to state that as a general rule, just explain that this class might help you if that's the situation you happen to find yourself in.Jerryjerrybuild
By the way, if the encoding of your string is one of those that I listed, it will always work except for the cases that are mentioned in the comments of the class. Also, fixUTF8() (the second one) comes with a warning: don't use it in production. It will "fix" double-encoded strings, but sometimes you want them unfixed, just like in my answer depicting them.Jerryjerrybuild
My experience tells me that your code is probably slow. Checking UTF-8 with a regex like in this answer of mine (halfway down) is probably much fasterBay
I don't completely understand your regex, but what I wanted to achieve was to be sure that any UTF-8+Win1252/Latin1 mixed encoding strings would always be converted to UTF8, and it does that well. These are sanitization functions, not intended for the frontend tier.Jerryjerrybuild
If you sanitize the Win1252 string "…Gruß…" ("\x85Gru\xDF\x85") your way, you end up with "…Gru߅" ("\xE2\x80\xA6Gru\xDF\x85")Bay
My regex checks that a string consists of UTF-8 characters from start to end. This is different from what you do, but I don't think that allowing mixed encodings is a good idea.Bay
It depends on your needs. It's a tradeoff and it's fine as long as you know what's going on. The edge case you mention is noted in the comments in the source code of the class.Jerryjerrybuild
The original version did not support Win1252, just Latin1+UTF-8. There were less probable misses then. Latin1 does not have an ellipsis where Win1252 does.Jerryjerrybuild
I've used your function to fix a hacky problem. But I want to know, what exactly does your toWin1252 function do? Why do toWin1252, toISO8859 and toLatin1 all do the same thing?Unicellular
They are all aliases. Latin1 is a nickname for the ISO8859-1 encoding. Win1252 is almost the same encoding, but with some added characters. At first my function did not recognize those extra characters, but all software that claims to support Latin1 is in fact using Win1252, so it's better to support it here, I guess.Jerryjerrybuild
working with f*** polish letters and Encoding::toUTF8 doesnt work... i receive "?" everywhere. One file is in windows-1250, other one is mixed with something - both failsCindicindie
require_once('Encoding.php'); and use \ForceUTF8\Encoding; need to use before declare classSpiculate
@SebastiánGrignoli fixUTF8() has problems with German umlauts. Lowercase chars are converted correctly (Ã¤ => ä, Ã¶ => ö) but uppercase ones don't work: Ã„ => Ã?, which has to be Ä. Also ÃŸ does not get converted to ß. Is there a way to extend this list in the source code?Japha
This looks amazing. Does anyone know how to turn it into a script that can look at all (txt, md, php, css, js, html, htm, ...) files in a directory and sub-directories and run @SebastiánGrignoli above script on them ? Or can I somehow add it to a file explorer (like xyPlorer for windows), or the windows context menu to apply to an entire folder ?Belomancy
Here you go: gist.github.com/neitanod/a5eff5bc5b7b49449ea4c952e2a02d28 Replace every force with fix and ::toUTF8( with ::fixUTF8( to use with FIX function instead of FORCE. Always backup your files first!Jerryjerrybuild
I used it in a php script with around 1000s of emails using toUTF8() in a loop. And the script crashes. Then i used if condition as suggested by Christian and harpax. And the combination brought faster results, no crashes.Imogen
79

You first have to detect what encoding has been used. As you’re parsing RSS feeds (probably via HTTP), you should read the encoding from the charset parameter of the Content-Type HTTP header field. If it is not present, read the encoding from the encoding attribute of the XML processing instruction. If that’s missing too, use UTF-8 as defined in the specification.


Here is what I probably would do:

I’d use cURL to send and fetch the response. That allows you to set specific header fields and fetch the response header as well. After fetching the response, you have to parse the HTTP response and split it into header and body. The header should then contain the Content-Type header field that contains the MIME type and (hopefully) the charset parameter with the encoding/charset too. If not, we’ll analyse the XML PI for the presence of the encoding attribute and get the encoding from there. If that’s also missing, the XML specs define to use UTF-8 as encoding.

$url = 'http://www.lr-online.de/storage/rss/rss/sport.xml';

$accept = array(
    'type' => array('application/rss+xml', 'application/xml', 'application/rdf+xml', 'text/xml'),
    'charset' => array_diff(mb_list_encodings(), array('pass', 'auto', 'wchar', 'byte2be', 'byte2le', 'byte4be', 'byte4le', 'BASE64', 'UUENCODE', 'HTML-ENTITIES', 'Quoted-Printable', '7bit', '8bit'))
);
$header = array(
    'Accept: '.implode(', ', $accept['type']),
    'Accept-Charset: '.implode(', ', $accept['charset']),
);
$encoding = null;
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
$response = curl_exec($curl);
if (!$response) {
    // error fetching the response
} else {
    $offset = strpos($response, "\r\n\r\n");
    $header = substr($response, 0, $offset);
    // extract the body up front; it is needed even when the header declares a charset
    $body = substr($response, $offset + 4);
    if (!$header || !preg_match('/^Content-Type:\s+([^;\r\n]+)(?:;\s*charset=([^;\r\n]*))?/im', $header, $match)) {
        // error parsing the response
    } else {
        if (!in_array(strtolower(trim($match[1])), array_map('strtolower', $accept['type']))) {
            // type not accepted
        }
        // the charset parameter is optional, so $match[2] may not be set
        if (isset($match[2])) {
            $encoding = strtolower(trim($match[2], " \t\"'"));
        }
    }
    if (!$encoding) {
        if (preg_match('/^<\?xml\s+version=(?:"[^"]*"|\'[^\']*\')\s+encoding=("[^"]*"|\'[^\']*\')/s', $body, $match)) {
            $encoding = strtolower(trim($match[1], '"\''));
        }
    }
    if (!$encoding) {
        $encoding = 'utf-8';
    } elseif (!in_array($encoding, array_map('strtolower', $accept['charset']))) {
        // encoding not accepted; mb_convert_encoding() cannot handle it, so leave $body as is
    } elseif ($encoding != 'utf-8') {
        $body = mb_convert_encoding($body, 'utf-8', $encoding);
        // keep the XML declaration in sync with the converted body
        $body = preg_replace('/^(<\?xml[^>]*?encoding=)("[^"]*"|\'[^\']*\')/', '$1"utf-8"', $body, 1);
    }
    $simpleXML = simplexml_load_string($body, null, LIBXML_NOERROR);
    if (!$simpleXML) {
        // parse error
    } else {
        echo $simpleXML->asXML();
    }
}
Plight answered 26/5, 2009 at 19:52 Comment(14)
Thanks. This would be easy. But would it really work? There are often wrong encodings given in the HTTP headers or in the attributes of XML.Moreover
Again: That’s not your problem. Standards were established to avoid such troubles. If others don’t follow them, it’s their problem, not yours.Plight
Thanks for the code. But why not simply use this? paste.bradleygill.com/index.php?paste_id=9651 Your code is much more complex, what's better with it?Moreover
Well, firstly you’re making two requests, one for the HTTP header and one for the data. Secondly, you’re looking for any appearance of charset= and encoding= and not just at the appropriate positions. And thirdly, you’re not checking if the declared encoding is accepted.Plight
You’re not sending any encoding information. Thus the default in HTML (ISO 8859-1) is used.Plight
No, that's not the cause. In line 26 of your code there is an error: undefined offset 2: $encoding = trim($match[2], '"\''); Sometimes the characters are correct (ö instead of ö), sometimes they aren't (À instead of ä). So there must be something wrong in your code or in the feed I want to parse.Moreover
Well then add a line to check if $match[2] exists before using it.Plight
If $match[2] is set, it's clear that everything is going on as normal. But what to do if $match[2] is not set? Return false?Moreover
No, just do nothing. If there is no encoding declared in the HTTP header, the encoding in the XML declaration is used. And if that’s missing too, the default encoding is used.Plight
Yes, logical. :) My very last question: Why is the following line there? if (!in_array($encoding, array_map('strtolower', $accept['charset']))) { // encoding not accepted } Can't I just let it out?Moreover
That piece of code was intended to accept just the charsets/encodings mb_convert_encoding accepts (see mb_list_encodings). Otherwise mb_convert_encoding will probably throw an error.Plight
But it doesn't prevent block wrong encodings/charsets since the following line is no elseif but a normal if, right? So the line can be deleted without changing something, can't it?Moreover
Your code also gives this error message: Warning: mb_convert_encoding() [function.mb-convert-encoding]: Illegal character encoding specifiedMoreover
Then try to find out the cause of this error. It took me just ten minutes to write that code and I didn’t test it well. It might have more errors than this one.Plight
44

Detecting the encoding is hard.

mb_detect_encoding works by guessing, based on a number of candidates that you pass it. In some encodings, certain byte sequences are invalid, and therefore it can distinguish between various candidates. Unfortunately, there are a lot of encodings where the same bytes are valid (but mean different things). In these cases, there is no way to determine the encoding; you can implement your own logic to make guesses in these cases. For example, data coming from a Japanese site might be more likely to have a Japanese encoding.

As long as you only deal with Western European languages, the three major encodings to consider are utf-8, iso-8859-1 and cp-1252. Since these are defaults for many platforms, they are also the ones most likely to be reported wrongly. That is, people who use a different encoding are likely to declare it honestly, since otherwise their software would break very often. Therefore, a good strategy is to trust the provider, unless the encoding is reported as one of those three. You should still double-check that it is indeed valid, using mb_check_encoding (note that being valid is not the same as actually being in that encoding: the same input may be valid for many encodings). If it is one of those, you can then use mb_detect_encoding to distinguish between them. Luckily that is fairly deterministic; you just need to use the proper detect sequence, which is UTF-8,ISO-8859-1,WINDOWS-1252.

Once you've detected the encoding, you need to convert it to your internal representation (UTF-8 is the only sane choice). The function utf8_encode transforms ISO-8859-1 to UTF-8, so it can only be used for that particular input type. For other encodings, use mb_convert_encoding.
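
The strategy described above could be sketched roughly as follows. This is only an illustration under stated assumptions: toUtf8() is a hypothetical helper, not a library function; it needs the mbstring extension; and instead of mb_detect_encoding it uses explicit mb_check_encoding calls for the three ambiguous encodings, so that the order of the checks is visible. For bytes that are not valid UTF-8 it assumes Windows-1252.

```php
<?php
// Hypothetical helper for illustration; the name and signature are not from any library.
function toUtf8(string $text, ?string $declared = null): string
{
    // The three encodings that are most often declared wrongly
    $ambiguous = ['utf-8', 'iso-8859-1', 'windows-1252', 'cp1252'];
    $name = strtolower((string) $declared);

    // Trust any other declared encoding, provided mbstring knows it
    // and the bytes are at least valid in it.
    if ($name !== '' && !in_array($name, $ambiguous, true)) {
        $known = array_map('strtolower', mb_list_encodings());
        if (in_array($name, $known, true) && mb_check_encoding($text, $declared)) {
            return mb_convert_encoding($text, 'UTF-8', $declared);
        }
    }

    // Otherwise test UTF-8 first, because it is the candidate that can fail.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // Not valid UTF-8: assume Windows-1252.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

echo toUtf8("Fu\xDFball", 'ISO-8859-1'), "\n";     // Latin-1 bytes, converted to UTF-8
echo toUtf8("Fu\xC3\x9Fball", 'ISO-8859-1'), "\n"; // already UTF-8 despite the label, left alone
```

Checking UTF-8 before the single-byte encodings matters because almost any byte sequence is acceptable as ISO-8859-1, so that check alone tells you very little.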

Invagination answered 26/5, 2009 at 14:38 Comment(13)
Thank you very much! What's better: mb-convert-encoding() or iconv()? I don't know what the differences are. Yes, I will only have to parse Western European languages, especially English, German and French.Moreover
I've just seen: mb-detect-encoding() is useless. It only supports UTF-8, UTF-7, ASCII, EUC-JP, SJIS, eucJP-win, SJIS-win, JIS and ISO-2022-JP. The most important ones for me, ISO-8859-1 and WINDOWS-1252, aren't supported. So I can't use mb-detect-encoding().Moreover
My, you're right. It's been a while since I've used it. You'll have to write your own detection code then, or use an external utility. UTF-8 can be fairly reliably determined, because its escape sequences are quite characteristic. cp-1252 and iso-8859-1 can be distinguished because cp-1252 may contain bytes that are illegal in iso-8859-1. Use Wikipedia to get the details, or look in the comments section of php.net, under various charset-related functions.Invagination
I think you can distinguish the different encodings when you look at the forms in which the special signs emerge: The German "ß" emerges in different forms: Sometimes "ÃŸ", sometimes "ÃƒÅ¸" and sometimes "ß". Why?Moreover
Yes, but then you need to know the contents of the string before comparing it, and that kind of defeats the purpose in the first place. The German ß appears differently because it has different values in different encodings. Some characters happen to be represented in the same way in different encodings (e.g. all characters in the ASCII charset are encoded in the same way in utf-8, iso-8859-* and cp-1252), so as long as you use just those characters, they all look the same. That's why these encodings are sometimes called ASCII-compatible.Invagination
Ok, then it's quite easy, isn't it? Can't I just look for "Ã" in the texts? This only emerges if the text is double UTF-8 encoded, so too often encoded. So I must only decode it one time, right? The "Ã" wouldn't appear if the text is correct since the "Â" doesn't appear in German or English texts normally. Would this be a good approach? How could I code this in PHP? Would it work?Moreover
You cannot always tell just from looking for such oddities if some data is not proper encoded. There always might be the possibility that they are intended. Take your own question as an example.Plight
Yes, they might be intended. But I would be fine for me if 99% of the texts are displayed correctly and only 1% is displayed wrongly because the "strange" characters were intended. If there was a possibility to achieve this, I would like to use it.Moreover
@marco92w: Well then I’d suggest to try the standards way. I’d say the error rate is not much higher than with your guessing method. But even if it’s higher you would support the standards.Plight
Thank you for you help! You've definitely convinced me to use the standards way. Is this script correct? paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment but you shouldn't overlook it. One answer is enough for me. :)Moreover
looks like ISO-8859-* and Windows-1252 are supported by mb_detect_encoding php.net/manual/en/mbstring.supported-encodings.phpDecent
Unless you know better, test if your input is a valid UTF-8 string and if not, blindly convert from Windows-1252 to UTF-8. This usually works for Western European languages because if the input happens to be ISO-8859-1, it's a subset of Windows-1252 and the conversion will be correct. The only really problematic issue is ISO-8859-15, which has the EUR sign ("€") in position 0xA4 whereas Windows-1252 has the generic currency sign ("¤") in the same position. You can apply some heuristics to decide between ISO-8859-15 and Windows-1252 but you can never be sure.Tenerife
@MikkoRantalainen windows-1252 is not a subset of iso-8859-1 though. They are almost identical except for a few code points (Notably some quote characters).Invagination
14

This cheatsheet lists some common caveats related to UTF-8 handling in PHP: http://developer.loftdigital.com/blog/php-utf-8-cheatsheet

This function detecting multibyte characters in a string might also prove helpful (source):


function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )+%xs', 
    $string);
}
Legwork answered 9/6, 2009 at 14:54 Comment(1)
I think that doesn't work correctly: echo detectUTF8('3٣3'); # 1Tomi
11

A little heads up. You said that the "ß" should be displayed as "ÃŸ" in your database.

This is probably because you're using a database with Latin-1 character encoding, or possibly your PHP-MySQL connection is set up wrong. That is, PHP believes your MySQL is set to use UTF-8, so it sends data as UTF-8, but your MySQL believes PHP is sending data encoded as ISO 8859-1, so it may once again try to encode your sent data as UTF-8, causing this kind of trouble.

Take a look at mysql_set_charset. It may help you.
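
Note that mysql_set_charset belongs to the old mysql extension, which was removed in PHP 7. With mysqli the equivalent call is set_charset(). A minimal sketch, assuming the mysqli extension and a server that supports utf8mb4; openUtf8Connection() and its parameters are hypothetical names used only for illustration:

```php
<?php
// Hypothetical helper; pass in your own connection details.
function openUtf8Connection(string $host, string $user, string $password, string $database): mysqli
{
    $mysqli = new mysqli($host, $user, $password, $database);

    // Tell MySQL that this client sends and expects UTF-8
    // (utf8mb4 is MySQL's full 4-byte UTF-8 character set).
    if (!$mysqli->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set connection charset: ' . $mysqli->error);
    }

    return $mysqli;
}
```

Setting the charset on the connection object, rather than sending SET NAMES as a query, also lets the client library know which encoding is in use when it escapes strings.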

Rounce answered 27/6, 2011 at 16:12 Comment(1)
I had to run $mysqli->query("SET CHARACTER SET UTF8");Verily
6

Your encoding looks like you encoded into UTF-8 twice; that is, from some other encoding, into UTF-8, and again into UTF-8. As if you had ISO 8859-1, converted from ISO 8859-1 to UTF-8, and treated the new string as ISO 8859-1 for another conversion into UTF-8.

Here's some pseudocode of what you did:

$inputstring = getFromUser();
$utf8string = iconv($current_encoding, 'utf-8', $inputstring);
$flawedstring = iconv($current_encoding, 'utf-8', $utf8string);

You should try:

  1. detect encoding using mb_detect_encoding() or whatever you like to use
  2. if it's UTF-8, convert into ISO 8859-1, and repeat step 1
  3. finally, convert back into UTF-8

That is presuming that in the "middle" conversion you used ISO 8859-1. If you used Windows-1252, then convert into Windows-1252 (latin1). The original source encoding is not important; the one you used in the flawed second conversion is.
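
The steps above can be sketched as a small loop. This is a hedged illustration, not a drop-in fix: undoDoubleEncoding() is a hypothetical name, it assumes the mbstring extension, and it assumes the wrong intermediate encoding was ISO 8859-1. As a variant of the steps listed, it stops as soon as peeling off one more layer would no longer yield valid UTF-8, so no final conversion back is needed.

```php
<?php
// Hypothetical helper: strips extra ISO-8859-1 -> UTF-8 encoding layers.
function undoDoubleEncoding(string $text, int $maxRounds = 5): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return $text; // not UTF-8 at all, nothing to undo here
    }

    for ($i = 0; $i < $maxRounds; $i++) {
        // Peel off one layer: UTF-8 -> ISO-8859-1 bytes
        $decoded = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

        if ($decoded === $text                                                  // pure ASCII
            || !mb_check_encoding($decoded, 'UTF-8')                            // only one layer left
            || mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1') !== $text   // lossy step
        ) {
            break;
        }
        $text = $decoded;
    }

    return $text;
}

echo bin2hex(undoDoubleEncoding("Fu\xC3\x83\xC2\x9Fball")), "\n"; // 4675c39f62616c6c
```

Text that legitimately contains a sequence such as "Ã©" would be "repaired" as well, so this is only safe on data you already know to be double-encoded.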

This is my guess at what happened; there's very little else you could have done to get four bytes in place of one extended ASCII byte.

The German language also uses ISO 8859-2 and Windows-1250 (Latin-2).

Finegan answered 4/6, 2009 at 10:7 Comment(0)
5

A really nice way to implement an isUTF8-function can be found on php.net:

function isUTF8($string) {
    return (utf8_encode(utf8_decode($string)) == $string);
}
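
The round trip above only succeeds for text whose characters all exist in ISO 8859-1. A stricter check, assuming the mbstring extension is available, is to validate the bytes directly; isValidUtf8() is a name chosen here for illustration:

```php
<?php
// Validates the byte sequence itself instead of round-tripping through ISO-8859-1.
function isValidUtf8(string $string): bool
{
    return mb_check_encoding($string, 'UTF-8');
}

var_dump(isValidUtf8("\xE2\x82\xAC")); // bool(true): the euro sign, which ISO 8859-1 lacks
var_dump(isValidUtf8("Fu\xDFball"));   // bool(false): a lone 0xDF byte is not valid UTF-8
```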
Gader answered 13/8, 2010 at 18:23 Comment(7)
Unfortunately, this only works when the string only consists of characters that are included in ISO-8859-1. But this could work: @iconv('utf-8', 'utf-8//IGNORE', $str) == $strDingess
@Christian: Indeed, that's what the authors of High Performance MySQL recommend too.Donkey
Its doesn't work correctly: echo (int)isUTF8(' z'); # 1 echo (int)isUTF8(NULL); # 1Tomi
Though not perfect, I think this is a nice way to implement a sketchy UTF-8 check.Keeton
mb_check_encoding($string, 'UTF-8')Sueannsuede
Just to put into context how badly this will work: there are exactly 191 printable characters in ISO 8859-1; Unicode 13 defines about 140000. So if you pick a random Unicode character, encode it correctly as UTF-8, and pass it to this function, there is a more than 99% chance of this function incorrectly returning false. In case you think those are obscure characters, note that ISO 8859-1 has no Euro symbol, so isUTF8('€') will be among that 99%.Wightman
This function is deprecated in PHP 8.2 and will be removed in PHP 9.x, primarily because it is often misused like this! wiki.php.net/rfc/remove_utf8_decode_and_utf8_encodeOperator
4

The interesting thing about mb_detect_encoding and mb_convert_encoding is that the order of the encodings you suggest does matter:

// $input is actually UTF-8

mb_detect_encoding($input, "ISO-8859-9, UTF-8");
// ISO-8859-9 (WRONG!)

mb_detect_encoding($input, "UTF-8, ISO-8859-9");
// UTF-8 (OK)

So you might want to use a specific order when specifying expected encodings. Still, keep in mind that this is not foolproof.
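
Why the order matters can be shown without mb_detect_encoding at all. The sketch below, which assumes the mbstring extension and uses a made-up helper name, returns the first candidate in which the bytes are valid. Single-byte encodings such as ISO-8859-9 accept practically any input, so they always win if they are listed first.

```php
<?php
// Hypothetical helper: first candidate encoding in which $text is valid, or null.
function firstValidEncoding(string $text, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($text, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

$input = "Fu\xC3\x9Fball"; // "Fußball" in UTF-8

echo firstValidEncoding($input, ['ISO-8859-9', 'UTF-8']), "\n"; // ISO-8859-9
echo firstValidEncoding($input, ['UTF-8', 'ISO-8859-9']), "\n"; // UTF-8
```

Putting the encodings that can reject input (UTF-8 above all) at the front of the list is what makes the result meaningful.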

Resupine answered 11/3, 2012 at 17:58 Comment(3)
This happens because ISO-8859-9 will in practice accept any binary input. The same goes for Windows-1252 and friends. You have to first test for encodings that can fail to accept the input.Tenerife
@MikkoRantalainen, yeah, I guess this part of the docs says something similar: php.net/manual/en/function.mb-detect-order.php#example-2985Carry
Considering that WHATWG HTML spec defines Windows 1252 as the default encoding, it should be pretty safe to assume if ($input_is_not_UTF8) $input_is_windows1252 = true;. See also: html.spec.whatwg.org/multipage/…Tenerife
3

mb_detect_encoding:

echo mb_detect_encoding($str, "auto");

Or

echo mb_detect_encoding($str, "UTF-8, ASCII, ISO-8859-1");

I really don't know what the results are, but I'd suggest you just take some of your feeds with different encodings and try if mb_detect_encoding works or not.

auto is short for "ASCII,JIS,UTF-8,EUC-JP,SJIS". It returns the detected charset, which you can use to convert the string to UTF-8 with iconv.

<?php
function convertToUTF8($str) {
    $enc = mb_detect_encoding($str);

    if ($enc && $enc != 'UTF-8') {
        return iconv($enc, 'UTF-8', $str);
    } else {
        return $str;
    }
}
?>

I haven't tested it, so no guarantee. And maybe there's a simpler way.

Farlie answered 26/5, 2009 at 14:10 Comment(2)
Thank you. What's the difference between 'auto' and 'UTF-8, ASCII, ISO-8859-1' as the second argument? Does 'auto' feature more encodings? Then it would be better to use 'auto', wouldn't it? If it really works without any bugs then I must only change "ASCII" or "ISO-8859-1" to "UTF-8". How?Moreover
Your function doesn't work well in all cases. Sometimes I get an error: Notice: iconv(): Detected an illegal character in input string in ...Moreover
2

Working out the character encoding of RSS feeds seems to be complicated. Even normal web pages often omit, or lie about, their encoding.

So you could try to use the correct way to detect the encoding and then fall back to some form of auto-detection (guessing).

Sibship answered 26/5, 2009 at 14:2 Comment(8)
I don't want to read out the encoding from the feed information. So it's equal if the feed information are wrong. I would like to detect the encoding from the text.Moreover
@marco92w: It’s not your problem if the declared encoding is wrong. Standards have not been established for fun.Plight
@Gumbo: but if you're working in the real world you have to be able to deal with things like incorrect declared encodings. The problem is that it's very difficult to guess (correctly) the encoding just from some text. Standards are wonderful, but many (most?) of the pages/feeds out there doesn't comply with them.Sibship
@Kevin ORourke: Exactly, right. That's my problem. @Gumbo: Yes, it's my problem. I want to read out the feeds and aggregate them. So I must correct the wrong encodings.Moreover
@marco92w: But you cannot correct the encoding if you don’t know the correct encoding and the current encoding. And that’s what the charset/encoding declaration is for: to describe the encoding the data is encoded in.Plight
Oh, now I've understood it. I thought it would be possible because I can surely say that "Ã" can't appear but "Ÿ" does. Another method I had imagined was to utf8_decode() it and then look whether it is a normal text. If there is any "Ã" after utf8_decode() then it must be wrong.Moreover
@marco92w: Again, the character that’s shown to you depends on the character encoding/set that was used to interpret the data. If you interpret UTF-8 encoded data with something other than UTF-8 you will probably get some oddities (except if you’re just using ASCII characters).Plight
Thank you for you help! You've definitely convinced me to use the standards way. Is this script correct? paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment but you shouldn't overlook it. One answer is enough for me. :)Moreover
2

You need to test the character set on input since responses can come coded with different encodings.

I force all content being sent into UTF-8 by doing detection and translation using the following function:

function fixRequestCharset()
{
  $ref = array(&$_GET, &$_POST, &$_REQUEST);
  foreach ($ref as &$var)
  {
    foreach ($var as $key => $val)
    {
      // skip nested arrays (e.g. ?a[]=1): mb_detect_encoding() only accepts strings
      if (!is_string($val))
        continue;
      $encoding = mb_detect_encoding($val, mb_detect_order(), true);
      if (!$encoding)
        continue;
      if (strcasecmp($encoding, 'UTF-8') != 0)
      {
        $converted = iconv($encoding, 'UTF-8', $val);
        if ($converted === false)
          continue;
        $var[$key] = $converted;
      }
    }
  }
  unset($var); // break the reference left over from the foreach
}

That routine will turn all PHP variables that come from the remote host into UTF-8.

Or ignore the value if the encoding could not be detected or converted.

You can customize it to your needs.

Just invoke it before using the variables.

Occupational answered 16/12, 2011 at 16:46 Comment(2)
what is the purpose of using mb_detect_order() without a passed in encoding list?Reliquiae
The purpose is to return the system configured ordered array of encodings defined in php.ini used. This is required by mb_detect_encoding to fill third parameter.Occupational
1

It's simple: when you get something that's not UTF-8, you must encode that into UTF-8.

So, when you're fetching a certain feed that's ISO 8859-1 parse it through utf8_encode.

However, if you're fetching an UTF-8 feed, you don't need to do anything.

Apogeotropism answered 26/5, 2009 at 13:55 Comment(9)
Thanks! OK, I can find out how the feed is encoded by using mb-detect-encoding(), right? But what can I make if the feed is ASCII? utf8-encode() ist just for ISO-8859-1 to UTF-8, isn't it?Moreover
ASCII is a subset of ISO-8859-1 AND UTF-8, so using utf8-encode() should not make a change - IF it's actually just ASCIIWheatley
So I can always use utf8_encode if it's not UTF-8? This would be really easy. The text which was ASCII according to mb-detect-encoding() contained "&#228;". Is this a ASCII character? Or is it HTML?Moreover
That's HTML. Actually that's encoded so when you print it in a given page it shows ok. If you want you can first utf8_encode() then html_entity_decode().Apogeotropism
Yes, html_entity_decode() works in this case. But: The German "ß" emerges in different forms: Sometimes "ÃŸ", sometimes "ÃƒÅ¸" and sometimes "ß". Why?Moreover
The character ß is encoded in UTF-8 with the byte sequence 0xC39F. Interpreted with Windows-1252, that sequence represents the two characters Ã (0xC3) and Ÿ (0x9F). And if you encode this byte sequence again with UTF-8, you’ll get 0xC383 0xC29F, which represents ÃƒÂŸ in Windows-1252. So your mistake is to handle this UTF-8 encoded data as something with an encoding other than UTF-8. That this byte sequence is presented as the character you’re seeing is just a matter of interpretation. If you use another encoding/charset, you’ll probably see other characters.Plight
Thank you. First, I want to say that all UTF-8 characters are shown as interpreted with Windows-1252 in my PHPMyAdmin. I don't handle them wrong. "ÃŸ" is displayed correctly as "ß". I do the same things with all RSS feeds but some feeds are parsed as "ÃŸ" and some are parsed as "ÃƒÅ¸". That's the problem. Can't I do the following: Look for "Ã" in the text. If it is in the text, then it must be double UTF-8 encoded. So I simply decode it one time and everything is fine. Would this work? How could I code this?Moreover
That’s why you should take the declared encoding into account: not all data is encoded with the same encoding using the same character set. There are plenty of different character sets. Just by looking at the byte sequences you cannot determine what character set has been used. Take the ISO 8859 character set family as an example: 15 different character sets all use the same encoding.Plight
Thank you for your help! You've definitely convinced me to use the standards way. Is this script correct? paste.bradleygill.com/index.php?paste_id=9651 (Sorry for posting it several times as a comment but you shouldn't overlook it. One answer is enough for me. :)Moreover
O
1

I know this is an older question, but I figure a useful answer never hurts. I was having issues with my encoding between a desktop application, SQLite, and GET/POST variables. Some would be in UTF-8, some would be in ASCII, and basically everything would get screwed up when foreign characters got involved.

Here is my solution. It scrubs your GET/POST/REQUEST (I omitted cookies, but you could add them if desired) on each page load before processing. It works well in a header. PHP will throw warnings if it can't detect the source encoding automatically, so these warnings are suppressed with @'s.

//Convert everything in our vars to UTF-8 for playing nice with the database...
//Use some auto detection here to help us not double-encode...
//Suppress possible warnings with @'s for when encoding cannot be detected
try
{
    $process = array(&$_GET, &$_POST, &$_REQUEST);
    // each() was removed in PHP 8. An index loop with a live count()
    // still reaches the nested arrays appended to $process below.
    for ($key = 0; $key < count($process); $key++) {
        $val = $process[$key];
        foreach ($val as $k => $v) {
            unset($process[$key][$k]);
            if (is_array($v)) {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = $v;
                $process[] = &$process[$key][@mb_convert_encoding($k,'UTF-8','auto')];
            } else {
                $process[$key][@mb_convert_encoding($k,'UTF-8','auto')] = @mb_convert_encoding($v,'UTF-8','auto');
            }
        }
    }
    unset($process);
}
catch(Exception $ex){}
Osburn answered 23/5, 2010 at 5:52 Comment(1)
Thanks for the answer, jocull. The function mb_convert_encoding() is what we've already had here, right? ;) So the only new thing in your answer is the loops to change encoding in all variables.Moreover
Z
1

harpax' answer worked for me. In my case, this is good enough:

if (isUTF8($str)) {
    echo $str;
}
else
{
    echo iconv("ISO-8859-1", "UTF-8//TRANSLIT", $str);
}
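
isUTF8() is not a PHP built-in; it comes from harpax' answer, which is not reproduced here. A minimal stand-in might look like the following sketch. It assumes the mbstring and iconv extensions are loaded and that any input that is not valid UTF-8 really is ISO-8859-1; toUtf8OrLatin1 is a hypothetical name.

```php
<?php
// Hypothetical stand-in for the isUTF8() helper used above:
// true when $str is a well-formed UTF-8 byte sequence.
function isUTF8($str)
{
    return mb_check_encoding($str, 'UTF-8');
}

// Assumption: anything that is not valid UTF-8 is ISO-8859-1.
function toUtf8OrLatin1($str)
{
    return isUTF8($str) ? $str : iconv('ISO-8859-1', 'UTF-8', $str);
}

echo bin2hex(toUtf8OrLatin1("Fu\xDFball")), PHP_EOL;     // Latin-1 input is converted
echo bin2hex(toUtf8OrLatin1("Fu\xC3\x9Fball")), PHP_EOL; // UTF-8 input passes through
```

//TRANSLIT is left out here because every ISO-8859-1 character has a UTF-8 representation, so nothing needs transliterating.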
Zara answered 26/7, 2011 at 22:21 Comment(0)
M
1

I have been looking for solutions to encoding problems for ages, and this page is probably the conclusion of years of searching! I tested some of the suggestions mentioned here, and these are my notes:

This is my test string:

this is a "wròng wrìtten" string bùt I nèed to pù 'sòme' special chàrs to see thèm, convertèd by fùnctìon!! & that's it!

I do an INSERT to save this string in a database field whose collation is utf8_general_ci.

The character set of my page is UTF-8.

If I do an INSERT just like that, in my database, I have some characters probably coming from Mars...

So I need to convert them into some "sane" UTF-8. I tried utf8_encode(), but alien characters were still invading my database...

So I tried to use the function forceUTF8 posted on number 8, but in the database the string saved looks like this:

this is a "wrÃ²ng wrÃ¬tten" string bÃ¹t I nÃ¨ed to pÃ¹ 'sÃ²me' special chÃ rs to see thÃ¨m, convertÃ¨d by fÃ¹nctÃ¬on!! & that's it!

So collecting some more information on this page and merging them with other information on other pages I solved my problem with this solution:

$finallyIDidIt = mb_convert_encoding(
  $string,
  mysql_client_encoding($resourceID),
  mb_detect_encoding($string)
);

Now in my database I have my string with correct encoding.

NOTE:

The only thing to watch out for is the function mysql_client_encoding()! You need to be connected to the database, because this function takes a resource ID as a parameter.

But since I do the re-encoding right before my INSERT, that is not a problem for me.

Marxmarxian answered 1/12, 2011 at 0:15 Comment(1)
Why do you not just use UTF-8 client encoding for mysql in the first place? Would not need manual conversion this waySussi
S
0

After sorting out your PHP scripts, don't forget to tell MySQL what charset you are passing and would like to receive.

Example: set the character set to UTF-8
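
The example this answer pointed to did not survive. As a rough sketch of the idea, assuming PDO with the MySQL driver, the connection character set can be declared in the DSN. The host, database name and credentials below are placeholders, and utf8_dsn is a hypothetical helper. With mysqli, the equivalent is calling set_charset('utf8mb4') on the connection.

```php
<?php
// Build a PDO MySQL DSN that declares the connection character set.
// utf8mb4 is MySQL's full UTF-8; host and database names are placeholders.
function utf8_dsn($host, $dbname)
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

$dsn = utf8_dsn('localhost', 'feeds');
echo $dsn, PHP_EOL;

// Connecting needs a real server, so it is left commented out:
// $pdo = new PDO($dsn, 'user', 'secret');
```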

Passing UTF-8 data to a Latin-1 table in a Latin-1 I/O session produces those nasty bird-feet characters. I see this every other day in osCommerce shops. Going back and forth, it might seem right, but phpMyAdmin will show the truth. By telling MySQL what charset you are passing, it will handle the conversion of the data for you.

How to recover existing scrambled MySQL data is another question. :)

Storyteller answered 18/1, 2012 at 19:31 Comment(0)
R
0

Get the encoding from headers and convert it to UTF-8.

$post_url = 'http://website.domain';

/// Get headers ///////////////////////////////////////////////
function get_headers_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_URL,            $url);
    curl_setopt($ch, CURLOPT_HEADER,         true);
    curl_setopt($ch, CURLOPT_NOBODY,         true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT,        15);

    $r = curl_exec($ch);
    return $r;
}

$the_header = get_headers_curl($post_url);

/// Check for redirect ////////////////////////////////////////
// Header names are case-insensitive, so capture with a regex
if (preg_match('/^Location:\s*(\S+)/mi', $the_header, $m)) {
    $post_url   = $m[1];
    $the_header = get_headers_curl($post_url);
}

/// Get charset ///////////////////////////////////////////////
$charset = null; // stays null when no charset is declared
if (preg_match('/charset=["\']?([\w.:-]+)/i', $the_header, $m)) {
    $charset = $m[1];
}

///////////////////////////////////////////////////////////////////
// echo $charset;

// Assumption: the original never showed where $html comes from,
// so the body is fetched here from the same URL.
$html = file_get_contents($post_url);

if ($html !== false && $charset && strcasecmp($charset, 'UTF-8') !== 0) {
    $html = iconv($charset, "UTF-8", $html);
}
Rhombohedron answered 1/2, 2014 at 9:20 Comment(0)
I
0

ÃŸ is Mojibake for ß. In your database, you may have one of the following hex values (use SELECT HEX(col)... to find out):

  • DF if the column is "latin1",
  • C39F if the column is utf8 -- OR -- it is latin1, but "double-encoded"
  • C383C5B8 if double-encoded into a utf8 column
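
The three hex values can be reproduced in PHP to see where each one comes from. This is only an illustration, assuming the mbstring extension is loaded; it does not touch a database.

```php
<?php
$utf8 = "\xC3\x9F"; // "ß" encoded as UTF-8

// Stored correctly in a latin1 column: a single byte
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// Double encoding: the UTF-8 bytes are misread as Windows-1252
// ("Ã" and "Ÿ") and each of those characters is encoded as UTF-8 again
$double = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');

echo strtoupper(bin2hex($latin1)), PHP_EOL; // DF
echo strtoupper(bin2hex($utf8)), PHP_EOL;   // C39F
echo strtoupper(bin2hex($double)), PHP_EOL; // C383C5B8
```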

You should not use any encoding/decoding functions in PHP; instead, you should set up the database and the connection to it correctly.

If MySQL is involved, see: Trouble with UTF-8 characters; what I see is not what I stored

Incontestable answered 19/8, 2016 at 18:46 Comment(2)
What do you mean by "you may have hex"? Arbitrary binary data? Or something else? Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).Chari
@PeterMortensen - Yeah, my wording was rather cryptic. I hope my clarification helps. Do a SELECT HEX(col)... to see what is in the table.Incontestable
C
0
if(!mb_check_encoding($str)){
    $str = iconv("windows-1251", "UTF-8", $str);
}

It helped me.

Calomel answered 16/12, 2022 at 5:15 Comment(0)
D
-1

When you try to handle multiple languages, like Japanese and Korean, you might run into trouble.

mb_convert_encoding with the 'auto' parameter doesn't work well. Setting mb_detect_order('ASCII,UTF-8,JIS,EUC-JP,SJIS,EUC-KR,UHC') doesn't help since it will detect EUC-* wrongly.
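
One reason detection is unreliable here is that the same bytes can be valid in several East Asian encodings at once, so a detector can only guess. A small check, assuming the mbstring extension is loaded (the sample bytes are chosen purely for illustration):

```php
<?php
// The two bytes B0 A1 form a well-formed character in both encodings:
// a kanji in EUC-JP and a hangul syllable in EUC-KR.
$bytes = "\xB0\xA1";

foreach (['EUC-JP', 'EUC-KR'] as $encoding) {
    $valid  = mb_check_encoding($bytes, $encoding);
    $asUtf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    echo $encoding, ': ', $valid ? 'valid' : 'invalid', ' -> ', bin2hex($asUtf8), PHP_EOL;
}
```

Both encodings accept the input but decode it to different characters, which is why a declared charset is more trustworthy than a guess.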

I concluded that as long as the input strings come from HTML, the 'charset' declared in a meta element should be used. I use Simple HTML DOM Parser because it supports invalid HTML.

The below snippet extracts the title element from a web page. If you would like to convert the entire page, then you may want to remove some lines.

<?php
require_once 'simple_html_dom.php';

echo convert_title_to_utf8(file_get_contents($argv[1])), PHP_EOL;

function convert_title_to_utf8($contents)
{
    $dom = str_get_html($contents);
    $title = $dom->find('title', 0);
    if (empty($title)) {
        return null;
    }
    $title = $title->plaintext;
    $metas = $dom->find('meta');
    $charset = 'auto';
    foreach ($metas as $meta) {
        if (!empty($meta->charset)) { // HTML5
            $charset = $meta->charset;
        } else if (preg_match('@charset=(.+)@', $meta->content, $match)) {
            $charset = $match[1];
        }
    }
    if (!in_array(strtolower($charset), array_map('strtolower', mb_list_encodings()))) {
        $charset = 'auto';
    }
    return mb_convert_encoding($title, 'UTF-8', $charset);
}
Doubtful answered 14/9, 2011 at 23:29 Comment(0)
R
-1

This version is for the German language, but you can modify the $CHARSETS and the $TESTCHARS.

class CharsetDetector
{
    private static $CHARSETS = array(
        "ISO_8859-1",
        "ISO_8859-15",
        "CP850"
    );

    private static $TESTCHARS = array(
        "€",
        "ä",
        "Ä",
        "ö",
        "Ö",
        "ü",
        "Ü",
        "ß"
    );

    public static function convert($string)
    {
        return self::__iconv($string, self::getCharset($string));
    }

    public static function getCharset($string)
    {
        $normalized = self::__normalize($string);
        if(!strlen($normalized))
            return "UTF-8";
        $best = "UTF-8";
        $charcountbest = 0;
        foreach (self::$CHARSETS as $charset)
        {
            $str = self::__iconv($normalized, $charset);
            $charcount = 0;
            $stop = mb_strlen($str, "UTF-8");

            for($idx = 0; $idx < $stop; $idx++)
            {
                $char = mb_substr($str, $idx, 1, "UTF-8");
                foreach (self::$TESTCHARS as $testchar)
                {
                    if($char == $testchar)
                    {
                        $charcount++;
                        break;
                    }
                }
            }

            if($charcount > $charcountbest)
            {
                $charcountbest = $charcount;
                $best = $charset;
            }
            //echo $text . "<br />";
        }
        return $best;
    }

    private static function __normalize($str)
    {
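        $len = strlen($str);
        $ret = "";
        $bytes = 1; // initialised so the checks below never read an undefined variable
        for($i = 0; $i < $len; $i++)
        {
            $c = ord($str[$i]);
            if ($c >= 128) { // 0x80 is a non-ASCII byte too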
                if (($c > 247))
                    $ret .= $str[$i];
                elseif
                    ($c > 239) $bytes = 4;
                elseif
                    ($c > 223) $bytes = 3;
                elseif
                    ($c > 191) $bytes = 2;
                else
                    $ret .= $str[$i];

                if (($i + $bytes) > $len)
                    $ret .= $str[$i];
                $ret2 = $str[$i];
                while ($bytes > 1)
                {
                    $i++;
                    $b = ord($str[$i]);
                    if ($b < 128 || $b > 191)
                    {
                        $ret .= $ret2;
                        $ret2 = "";
                        $i += $bytes-1;
                        $bytes = 1;
                        break;
                    }
                    else
                        $ret2 .= $str[$i];
                    $bytes--;
                }
            }
        }
        return $ret;
    }

    private static function __iconv($string, $charset)
    {
        return iconv ($charset, "UTF-8", $string);
    }
}
Rivet answered 22/2, 2012 at 18:47 Comment(0)
F
-1

I had the same issue with phpQuery (ISO-8859-1 instead of UTF-8) and this hack helped me:

$html = '<?xml version="1.0" encoding="UTF-8" ?>' . $html;

mb_internal_encoding('UTF-8'), phpQuery::newDocumentHTML($html, 'utf-8'), mbstring.internal_encoding and other manipulations didn't take any effect.

Forbearance answered 15/7, 2013 at 20:19 Comment(0)
J
-1

I found a solution at http://deer.org.ua/2009/10/06/1/:

class Encoding
{
    /**
     * http://deer.org.ua/2009/10/06/1/
     * @param $string
     * @return null
     */
    public static function detect_encoding($string)
    {
        static $list = ['utf-8', 'windows-1251'];

        foreach ($list as $item) {
            try {
                $sample = iconv($item, $item, $string);
            } catch (\Exception $e) {
                continue;
            }
            if (md5($sample) == md5($string)) {
                return $item;
            }
        }
        return null;
    }
}

$content = file_get_contents($file['tmp_name']);
$encoding = Encoding::detect_encoding($content);
if ($encoding !== null && $encoding != 'utf-8') {
    $result = iconv($encoding, 'utf-8', $content);
} else {
    // Already UTF-8, or the encoding could not be detected: leave untouched
    $result = $content;
}

I think that @ is a bad decision and made some changes to the solution from deer.org.ua.

Justis answered 13/12, 2016 at 15:5 Comment(1)
The link is broken: "Not Found. The requested URL /2009/10/06/1/ was not found on this server."Chari
C
-1

Chinese text is commonly encoded in GBK. In addition, when I tested it, the most voted answer didn't work. Here is a simple fix that makes it work as well:

function toUTF8($raw) {
    try{
        return mb_convert_encoding($raw, "UTF-8", "auto"); 
    }catch(\Exception $e){
        return mb_convert_encoding($raw, "UTF-8", "GBK"); 
    }
}

Remark: This solution was written in 2017 and should fix problems for PHP in those days. I have not tested whether latest PHP already understands auto correctly.

Chorography answered 29/6, 2017 at 3:51 Comment(3)
Do you have any insight why, or how your files were different? What parts didn't work for you? For example: Uppercase German characters didn't convert correctly. Curious, what is "GBK" ?Belomancy
In what way doesn't the most voted answer work?Chari
An explanation would be in order. E.g., what is the idea/gist? From the Help Center: "...always explain why the solution you're presenting is appropriate and how it works". Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).Chari
M
-1

Try without 'auto'

That is:

mb_detect_encoding($text)

instead of:

mb_detect_encoding($text, 'auto')

More information can be found here: mb_detect_encoding

Metagalaxy answered 22/7, 2017 at 8:55 Comment(1)
An explanation would be in order. E.g., what is the idea/gist? What kind of input was it tested on? From the Help Center: "...always explain why the solution you're presenting is appropriate and how it works". Please respond by editing (changing) your answer, not here in comments (without "Edit:", "Update:", or similar - the answer should appear as if it was written today).Chari
T
-1

Try to use this... every text that is not UTF-8 will be translated.

function is_utf8($str) {
    return (bool) preg_match('//u', $str);
}

$myString = "Fußball";

if(!is_utf8($myString)){
    $myString = utf8_encode($myString);
}

// or 1 line version ;) 
$myString = !is_utf8($myString) ? utf8_encode($myString) : trim($myString);
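
utf8_encode() only ever converts from ISO-8859-1, and it is deprecated as of PHP 8.2. A sketch of the same idea using mb_convert_encoding() follows. It assumes that input which is not valid UTF-8 is ISO-8859-1; ensure_utf8 is a hypothetical name.

```php
<?php
function is_utf8($str)
{
    // The /u modifier makes preg_match() fail on malformed UTF-8
    return (bool) preg_match('//u', $str);
}

function ensure_utf8($str)
{
    // Assumption: input that is not valid UTF-8 is ISO-8859-1
    return is_utf8($str) ? $str : mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
}

echo ensure_utf8("Fu\xDFball"), PHP_EOL;
```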
Tune answered 22/4, 2021 at 23:9 Comment(0)
