How to convert a Unicode text-block to UTF-8 (HEX) code point?
I have a Unicode text-block, like this:

ụ
ư
ứ
Ỳ
Ỷ
Ỵ
Đ

Now, I want to convert this original Unicode text block into a block of hexadecimal UTF-8 byte sequences (see the "Hexadecimal UTF-8" column on this page: https://en.wikipedia.org/wiki/UTF-8), using PHP, like this:

\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90

Not like this:

0x1EE5
0x01B0
0x1EE9
0x1EF2
0x1EF6
0x1EF4
0x0110

Is there any way to do this in PHP?


I have read this topic (PHP: Convert unicode codepoint to UTF-8), but it is not the same as my question.


I am sorry, I don't know much about Unicode.

Cacoepy answered 19/7, 2015 at 13:27 Comment(3)
You have to know (or try to guess, but that only works some of the time) what encoding your input is in. If it's already in UTF-8 then it's probably already in the format you want, assuming that by 0xe1 you don't mean the 4 bytes representing 0, x, e, 1 but rather one byte representing the number 225. (Aeromechanics)
The second answer on the question you link to does convert a Unicode code point to UTF-8 bytes. (Beige)
Can you show what you have tried, so that we know exactly what you are trying to do? Currently, there are many ways to interpret your question, and we are left guessing at your purpose in doing such a conversion. (Juggernaut)
N
13

I think you're looking for the bin2hex() function:

Convert binary data into hexadecimal representation

And format by prepending \x to each byte (00-FF)

function str_hex_format ($bin) {
  return '\x'.implode('\x', str_split(bin2hex($bin), 2));
}

For your sample:

// utf8 encoded input
$arr = ["ụ","ư","ứ","Ỳ","Ỷ","Ỵ","Đ"];

foreach($arr AS $v)
  echo $v . " => " . str_hex_format($v) . "\n";

See test at eval.in (link expires)

ụ => \xe1\xbb\xa5
ư => \xc6\xb0
ứ => \xe1\xbb\xa9
Ỳ => \xe1\xbb\xb2
Ỷ => \xe1\xbb\xb6
Ỵ => \xe1\xbb\xb4
Đ => \xc4\x90

Decode example:

$str = str_hex_format("ụưứỲỶỴĐ");
echo $str;

\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90

echo hex2bin(str_replace('\x', "", $str));

ụưứỲỶỴĐ


For more information about the \x escape sequence in double-quoted strings, see the PHP manual.
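
To illustrate the difference the quoting style makes, here is a minimal sketch (the variable names are only illustrative, and it assumes nothing beyond core PHP). In double quotes PHP turns each \x escape into one byte; in single quotes the backslashes stay literal, which is the form str_hex_format() returns. stripcslashes() is shown as one possible way to turn the literal form back into bytes, alongside the hex2bin() approach above.

```php
<?php
// Double quotes: PHP interprets each \xHH escape as a single byte.
$bytes = "\xe1\xbb\xa5";

// Single quotes: the backslashes are kept, so this is 12 literal characters.
$literal = '\xe1\xbb\xa5';

echo strlen($bytes), "\n";    // 3
echo strlen($literal), "\n";  // 12

// stripcslashes() understands C-style hex escapes and restores the bytes.
$decoded = stripcslashes($literal);
echo bin2hex($decoded), "\n"; // e1bba5
```

If the text is going to be pasted into PHP source, the literal form only becomes real bytes again when it is placed inside double quotes.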

Nucleoplasm answered 22/7, 2015 at 6:1 Comment(1)
+1. That's exactly how I do it for codepoints.net: github.com/Codepoints/Codepoints.net/blob/… (Pyrrho)
D
3

PHP treats strings as arrays of bytes, regardless of encoding, so str_split() yields one byte per element. If you don't need to delimit the UTF-8 characters, then something like this works:

$str='ụưứỲỶỴĐ';
foreach(str_split($str) as $char)
  echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // str_pad(string, length, pad_string, pad_type)

Output:

\xe1\xbb\xa5\xc6\xb0\xe1\xbb\xa9\xe1\xbb\xb2\xe1\xbb\xb6\xe1\xbb\xb4\xc4\x90

If you need to delimit the UTF-8 characters (e.g. with a newline), then you'll need something like this:

$str='ụưứỲỶỴĐ';
foreach(array_slice(preg_split('~~u',$str),1,-1) as $UTF8char){ // split before/after every UTF8 character and remove first/last empty string
  foreach(str_split($UTF8char) as $char)
    echo '\x'.str_pad(dechex(ord($char)),2,'0',STR_PAD_LEFT); // str_pad(string, length, pad_string, pad_type)
  echo "\n"; // delimiter
}

Output:

\xe1\xbb\xa5
\xc6\xb0
\xe1\xbb\xa9
\xe1\xbb\xb2
\xe1\xbb\xb6
\xe1\xbb\xb4
\xc4\x90

This splits the string into UTF-8 characters using preg_split and the u flag. Since preg_split returns an empty string before the first character and another after the last character, we use array_slice to drop the first and last elements. This can be easily modified to return an array, for example.
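
As a sketch of that array-returning variant (the function name utf8_chars_to_hex is hypothetical, and the input is assumed to be valid UTF-8), the PREG_SPLIT_NO_EMPTY flag can be used in place of array_slice to discard the empty strings, and sprintf() can handle the zero padding:

```php
<?php
// Hypothetical helper: returns one '\xHH...' string per UTF-8 character.
function utf8_chars_to_hex(string $str): array
{
    // The u flag makes the empty pattern match between characters, not bytes.
    // PREG_SPLIT_NO_EMPTY drops the empty leading and trailing pieces.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $result = [];
    foreach ($chars as $char) {
        $hex = '';
        // str_split() works on bytes, so this walks the bytes of one character.
        foreach (str_split($char) as $byte) {
            $hex .= sprintf('\x%02x', ord($byte)); // %02x pads to two hex digits
        }
        $result[] = $hex;
    }
    return $result;
}

// "\xe1\xbb\xa5\xc6\xb0" is the UTF-8 encoding of "ụư".
echo implode("\n", utf8_chars_to_hex("\xe1\xbb\xa5\xc6\xb0")), "\n";
```

Joining the array with "\n" gives the same one-character-per-line layout as the loop above.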

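Edit: A shorter alternative is the line below, but note that it produces \u00XX escapes (\u00e1\u00bb\u00a5...) rather than \xXX ones, because utf8_encode treats each input byte as an ISO-8859-1 character; utf8_encode is also deprecated as of PHP 8.2: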

echo trim(json_encode(utf8_encode('ụưứỲỶỴĐ')),'"');
Dira answered 25/7, 2015 at 1:31 Comment(0)
G
1

The main thing you need to do is to tell PHP which encoding the incoming characters are in. Once you do that, you can convert them to UTF-8 and then to hex as needed.

This code fragment takes your example characters as 16-bit big-endian code units (UCS-2BE), converts them to UTF-8, and then dumps the hex representation of each character.

<?php
// "ụưứỲỶỴĐ" as UCS-2 big-endian bytes (two bytes per code point)
$unistr = "\x1E\xE5\x01\xB0\x1E\xE9\x1E\xF2\x1E\xF6\x1E\xF4\x01\x10";
echo " length=" . mb_strlen($unistr, 'UCS-2BE') . "\n";

// Here's the key statement: convert from 16-bit UCS-2BE to UTF-8
$utf8str = mb_convert_encoding($unistr, "UTF-8", 'UCS-2BE');
echo $utf8str . "\n";

for($i=0; $i < mb_strlen($utf8str, 'UTF-8'); $i++) {
    $c = mb_substr($utf8str, $i, 1, 'UTF-8');
    $hex = bin2hex($c);
    echo $c . "\t" . $hex . "\t" . preg_replace("/([0-9a-f]{2})/", '\\\\x\\1', $hex) . "\n";
}

?>

Produces

length=7
ụưứỲỶỴĐ
ụ   e1bba5  \xe1\xbb\xa5
ư   c6b0    \xc6\xb0
ứ   e1bba9  \xe1\xbb\xa9
Ỳ   e1bbb2  \xe1\xbb\xb2
Ỷ   e1bbb6  \xe1\xbb\xb6
Ỵ   e1bbb4  \xe1\xbb\xb4
Đ   c490    \xc4\x90
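
To make the relationship between the two notations in the question explicit, here is a small sketch that prints the code point (the 0x1EE5 form) next to the UTF-8 bytes (the \xe1\xbb\xa5 form) for one character. It assumes the mbstring extension and PHP 7.2 or later for mb_ord() and mb_chr(); the function name describe_char is hypothetical.

```php
<?php
// Hypothetical helper: shows a UTF-8 character's code point and its bytes.
function describe_char(string $char): string
{
    // mb_ord() returns the Unicode code point of the first character.
    $codePoint = mb_ord($char, 'UTF-8');

    // bin2hex() dumps the raw bytes; split into pairs and prefix each with \x.
    $bytes = '\x' . implode('\x', str_split(bin2hex($char), 2));

    return sprintf('U+%04X => %s', $codePoint, $bytes);
}

echo describe_char("\xe1\xbb\xa5"), "\n";           // U+1EE5 => \xe1\xbb\xa5
echo describe_char(mb_chr(0x0110, 'UTF-8')), "\n";  // U+0110 => \xc4\x90
```

mb_chr() goes the other way, from a code point to the UTF-8 encoded character, so no intermediate UCS-2 string is needed when the input is a list of code points.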
Gunshot answered 22/7, 2015 at 5:54 Comment(0)
