How can I convert HTML character references (&#x5E3;) to regular UTF-8?

G

2

5

I have some hebrew websites that contains character references like: נוף

I can only view these letters if I save the file as .html and view in UTF-8 encoding.

If I try to open it as a regular text file then UTF-8 encoding does not show the proper output.

I noticed that if I open a text editor and write hebrew in UTF-8, each character takes two bytes not 4 bytes line in this example (ו)

Any ideas if this is UTF-16 or any other kind of UTF representation of letters?

How can I convert it to normal letters if possible?

Using latest PHP version.

Graaf answered 25/8, 2010 at 12:19 Comment(2)

Now what is in your file: נוף or נוף? – Breast 25/8, 2010 at 12:27

i want to know how to convert it to regular utf-8 and i wanna know what are these characters? is this the representation of utf-16 or is it something else ? – Graaf 25/8, 2010 at 12:31

B

6

Those are character references that refer to character in ISO 10646 by specifying the code point of that character in decimal (&#n;) or hexadecimal (&#xn;) notation.

You can use html_entity_decode that decodes such character references as well as the entity references for entities defined for HTML 4, so other references like <, >, & will also get decoded:

$str = html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');

If you just want to decode the numeric character references, you can use this:

function html_dereference($match) {
    if (strtolower($match[1][0]) === 'x') {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    return mb_convert_encoding(pack('N', $codepoint), 'UTF-8', 'UTF-32BE');
}
$str = preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', 'html_dereference', $str);

As YuriKolovsky and thirtydot have pointed out in another question, it seems that browser vendors did ‘silently’ agreed on something regarding character references mapping, that does differ from the specification and is quite undocumented.

There seem to be some character references that would normally be mapped onto the Latin 1 supplement but that are actually mapped onto different characters. This is due the mapping that would rather result from mapping the characters from Windows-1252 instead of ISO 8859-1, on which the Unicode character set is build on. Jukka Korpela wrote an extensive article on this topic.

Now here’s an extension to the function mentioned above that handles this quirk:

function html_character_reference_decode($string, $encoding='UTF-8', $fixMappingBug=true) {
    $deref = function($match) use ($encoding, $fixMappingBug) {
        if (strtolower($match[1][0]) === "x") {
            $codepoint = intval(substr($match[1], 1), 16);
        } else {
            $codepoint = intval($match[1], 10);
        }
        // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
            $mapping = array(
                8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
                338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
                8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
            $codepoint = $mapping[$codepoint-130];
        }
        return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
    };
    return preg_replace_callback('/&#(x[0-9a-f]+|[0-9]+);/i', $deref, $string);
}

If anonymous functions are not available (introduced with 5.3.0), you could also use create_function:

$deref = create_function('$match', '
    $encoding = '.var_export($encoding, true).';
    $fixMappingBug = '.var_export($fixMappingBug, true).';
    if (strtolower($match[1][0]) === "x") {
        $codepoint = intval(substr($match[1], 1), 16);
    } else {
        $codepoint = intval($match[1], 10);
    }
    // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
    if ($fixMappingBug && $codepoint >= 130 && $codepoint <= 159) {
        $mapping = array(
            8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249,
            338, 141, 142, 143, 144, 8216, 8217, 8220, 8221, 8226,
            8211, 8212, 732, 8482, 353, 8250, 339, 157, 158, 376);
        $codepoint = $mapping[$codepoint-130];
    }
    return mb_convert_encoding(pack("N", $codepoint), $encoding, "UTF-32BE");
');

Here’s another function that tries to comply to the behavior of HTML 5:

function html5_decode($string, $flags=ENT_COMPAT, $charset='UTF-8') {
    $deref = function($match) use ($flags, $charset) {
        if ($match[1][0] === '#') {
            if (strtolower($match[1][0]) === '#') {
                $codepoint = intval(substr($match[1], 2), 16);
            } else {
                $codepoint = intval(substr($match[1], 1), 10);
            }

            // HTML 5 specific behavior
            // @see http://dev.w3.org/html5/spec/tokenization.html#tokenizing-character-references

            // handle Windows-1252 mismapping
            // @see http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
            // @see http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides
            $overrides = array(
                0x00=>0xFFFD,0x80=>0x20AC,0x82=>0x201A,0x83=>0x0192,0x84=>0x201E,
                0x85=>0x2026,0x86=>0x2020,0x87=>0x2021,0x88=>0x02C6,0x89=>0x2030,
                0x8A=>0x0160,0x8B=>0x2039,0x8C=>0x0152,0x8E=>0x017D,0x91=>0x2018,
                0x92=>0x2019,0x93=>0x201C,0x94=>0x201D,0x95=>0x2022,0x96=>0x2013,
                0x97=>0x2014,0x98=>0x02DC,0x99=>0x2122,0x9A=>0x0161,0x9B=>0x203A,
                0x9C=>0x0153,0x9E=>0x017E,0x9F=>0x0178);
            if (isset($windows1252Mapping[$codepoint])) {
                $codepoint = $windows1252Mapping[$codepoint];
            }

            if (($codepoint >= 0xD800 && $codepoint <= 0xDFFF) || $codepoint > 0x10FFFF) {
                $codepoint = 0xFFFD;
            }
            if (($codepoint >= 0x0001 && $codepoint <= 0x0008) ||
                ($codepoint >= 0x000E && $codepoint <= 0x001F) ||
                ($codepoint >= 0x007F && $codepoint <= 0x009F) ||
                ($codepoint >= 0xFDD0 && $codepoint <= 0xFDEF) ||
                in_array($codepoint, array(
                    0x000B, 0xFFFE, 0xFFFF, 0x1FFFE, 0x1FFFF, 0x2FFFE, 0x2FFFF,
                    0x3FFFE, 0x3FFFF, 0x4FFFE, 0x4FFFF, 0x5FFFE, 0x5FFFF, 0x6FFFE,
                    0x6FFFF, 0x7FFFE, 0x7FFFF, 0x8FFFE, 0x8FFFF, 0x9FFFE, 0x9FFFF,
                    0xAFFFE, 0xAFFFF, 0xBFFFE, 0xBFFFF, 0xCFFFE, 0xCFFFF, 0xDFFFE,
                    0xDFFFF, 0xEFFFE, 0xEFFFF, 0xFFFFE, 0xFFFFF, 0x10FFFE, 0x10FFFF))) {
                $codepoint = 0xFFFD;
            }
            return mb_convert_encoding(pack("N", $codepoint), $charset, "UTF-32BE");
        } else {
            return html_entity_decode($match[0], $flags, $charset);
        }   
    };
    return preg_replace_callback('/&(#(?:x[0-9a-f]+|[0-9]+)|[A-Za-z0-9]+);/i', $deref, $string);
}

I’ve also noticed that in PHP 5.4.0 the html_entity_decode function was added another flag named ENT_HTML5 for HTML 5 behavior.

Breast answered 25/8, 2010 at 13:1 Comment(8)

Any particular reason for using mb_convert_encoding instead of iconv? – Requisite 25/8, 2010 at 13:5

ou can turn a string represented by a local character set into the one represented by another character set, which may be the Unicode character set. Supported character sets depend on the iconv implementation of your system. – Witling 25/8, 2010 at 13:23

fair enough. It's not bad, I was just more curious as to the choice... +1 – Requisite 25/8, 2010 at 13:29

what about the microsoft windows character references like ? ;p – Congdon 9/2, 2012 at 13:54

Alohci pointed out to me that "The character override mapping is formally specified in the HTML5 spec here: http://dev.w3.org/html5/spec/tokenization.html#table-charref-overrides". Does your updated function match that? – Muhammadan 9/2, 2012 at 14:47

Also, you need use ($fixMappingBug, $encoding). – Muhammadan 9/2, 2012 at 14:52

@Breast works now after I added the use ($fixMappingBug,$encoding) fix, thanks! – Congdon 9/2, 2012 at 18:2

@Breast thank you for keeping this updated, but does it support the full list of html4 references unlike html_entity_decode? (I think ´, —,– were not supported, not sure) – Congdon 17/11, 2013 at 10:33

R

5

Those are XML Character References. You want to decode them using html_entity_decode():

$string = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

For more information, you can search Google for the entity in question. See these few examples:

Requisite answered 25/8, 2010 at 12:42 Comment(2)

Those are not entities, not even entity references. Those are just character references. – Breast 25/8, 2010 at 12:44

@Gumbo: Fair enough. They are not using the named entity... But the concept is nearly identical (except that no map is needed). I'll edit the answer to reflect that... – Requisite 25/8, 2010 at 12:54

Recommended topics

Hot tags