Working with GD ( imagettftext() ) and UTF-8 characters

Just for the record - my first question here but hopefully not my last input in the community. But that's not why I'm here.

I'm currently developing a simple system that has to generate an image with text on it. Everything went well until I realised that GD cannot handle UTF-8 characters like

ā, č, ž, ä, ø, é

and so on.

To clear things up - I'm using imagettftext()

Trying to solve my problem, I dug into the depths of Google and found some solutions, but sadly none of them solved my problem completely. Currently I'm using this script, which I found in this thread - PHP function imagettftext() and unicode

private function properText($text){

    // Convert UTF-8 string to HTML entities
    $text = mb_convert_encoding($text, 'HTML-ENTITIES',"UTF-8");
    // Convert HTML entities into ISO-8859-1
    $text = html_entity_decode($text,ENT_NOQUOTES, "ISO-8859-1");
    // Convert bytes > 127 into decimal numeric character references (&#NNN;)
    $out = "";
    for($i = 0; $i < strlen($text); $i++) {
        $letter = $text[$i];
        $num = ord($letter);
        if($num>127) {
          $out .= "&#$num;";
        } else {
          $out .=  $letter;
        }
    }

    return $out;

}

and it works fine for some characters but not all of them, for example, a with umlaut isn't converted correctly.

So at this point I'm not sure where and what to look for anymore as I cannot predict the user input. To be more precise, the system is pulling artist names from an xml feed and using the data for the image generation (I'm not planning to support hieroglyphs).

I've made sure that the data gathered from the feed is indeed UTF-8 by using PHP's mb_detect_encoding(), and I've made sure that all the characters that currently aren't displayed correctly are indeed in the font file I'm feeding to the imagettftext() function by checking it with the Windows Character Map tool.

Hopefully I can find my answer here and thank you for your help in advance!

edit

To clarify - the characters are not displayed correctly, or, to be more precise, are replaced by malformed characters. Here is a screenshot -

Malformed Characters

it should read "José González"

edit No2

Using bin2hex() function on data retrieved from the xml feed returns this.

José González -> 4a6f73c3a920476f6e7ac3a16c657a
// input -> bin2hex(input)
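
A check along these lines can confirm that a string is valid UTF-8 before suspecting GD. This is a minimal sketch, not code from the thread: `describeBytes()` is a hypothetical helper name, and it assumes the mbstring extension is loaded. The name is written with `\u{...}` escapes (PHP 7+) so the result does not depend on how the source file itself is saved.

```php
<?php
// Hypothetical helper: report whether a string is valid UTF-8 and show its raw bytes.
function describeBytes(string $text): array
{
    return [
        'valid_utf8' => mb_check_encoding($text, 'UTF-8'), // true only for well-formed UTF-8
        'hex'        => bin2hex($text),                    // raw bytes as hex
        'chars'      => mb_strlen($text, 'UTF-8'),         // number of characters
        'bytes'      => strlen($text),                     // number of bytes
    ];
}

// "José González" with the accented letters written as Unicode escapes.
$info = describeBytes("Jos\u{e9} Gonz\u{e1}lez");
print_r($info);
```

A byte count larger than the character count, together with `c3a9` for é and `c3a1` for á in the hex dump, is what correctly encoded UTF-8 looks like for this name.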

edit - fixed

As I continued my research I came up with an answer for my problem, this piece of code did it!

$text = mb_convert_encoding($text, "HTML-ENTITIES", "UTF-8");
$text = preg_replace('~^(&([a-zA-Z0-9]);)~',htmlentities('${1}'),$text);
return($text);

Now all the characters that troubled me are displayed correctly!

Sennight answered 26/2, 2012 at 23:29 Comment(7)

What doesn't work exactly? How is the output not what you expect? Are you using a font that actually contains the characters you want? I'm using imagettftext with Japanese, so Unicode characters aren't a problem in general. – Freewheeling 26/2, 2012 at 23:39

Yes, as I said in the original post, I've made sure that all the characters that currently aren't displayed correctly are indeed in the font file. What isn't working is the output - characters are not displayed correctly, or, to be more precise, are replaced by malformed characters. Here is a screenshot - imgur.com/B8RHa - it should read "José González" – Sennight 26/2, 2012 at 23:53

The error you get there (i.imgur.com/B8RHa.jpg) is definitely an encoding problem, like printing some UTF-8 characters as ANSI. – Oboe 27/2, 2012 at 11:40

Is your text really correctly encoded in UTF-8? Please show a bin2hex() of the string. – Freewheeling 27/2, 2012 at 12:23

I've added bin2hex() result of the string to the original post. – Sennight 27/2, 2012 at 12:38

You should add that as an answer and accept it. Can be useful to others in the future. Still weird, since the function is supposed to accept UTF-8 directly. – Freewheeling 2/3, 2012 at 8:58

Just did that! Thanks for all the help! :) – Sennight 2/3, 2012 at 9:2

As I continued my research I came up with an answer for my problem, this piece of code did it!

private function properText($text){
    $text = mb_convert_encoding($text, "HTML-ENTITIES", "UTF-8");
    $text = preg_replace('~^(&([a-zA-Z0-9]);)~',htmlentities('${1}'),$text);
    return($text); 
}

Now all the characters (and all the new ones I've seen) that troubled me are displayed correctly!
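
For comparison, here is a sketch of a more direct way to produce numeric entities. It is not the code above: `toNumericEntities()` is a hypothetical helper, and it assumes the mbstring extension is available. The PHP manual describes the `text` argument of imagettftext() as a UTF-8 string that may also contain decimal numeric character references such as `&#8364;`, so this conversion should only be needed when passing UTF-8 directly does not render as expected.

```php
<?php
// Hypothetical helper: turn every non-ASCII character into a decimal
// numeric character reference such as &#233;. ASCII is left untouched.
function toNumericEntities(string $utf8Text): string
{
    // Conversion map format: [start, end, offset, mask].
    // This one covers U+0080 through U+10FFFF with no offset.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8Text, $map, 'UTF-8');
}

$label = toNumericEntities("Jos\u{e9} Gonz\u{e1}lez");
echo $label, "\n"; // prints: Jos&#233; Gonz&#225;lez

// Placeholder call: $image, $color and $fontPath are assumed to exist already.
// imagettftext($image, 20, 0, 10, 40, $color, $fontPath, $label);
```

Unlike the `HTML-ENTITIES` conversion, this never emits named entities like `&eacute;`, so no second pass over the string is required.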

Sennight answered 2/3, 2012 at 9:2 Comment(2)

This particular preg_replace callback seems pretty nonsensical though. Sure this is working? – Freewheeling 2/3, 2012 at 11:8

I have the same problem. How did you actually fix it? Your code will not return the text with accents. #23552489 – Fedora 8/5, 2014 at 20:49

First, make sure your IDE is not saving the file in an encoding other than UTF-8. For example, the new IntelliJ IDEA 9 changed UTF-8 to WINDOWS-1250 on the Windows platform. If you don't notice that and you use constant strings for testing, it is pretty crazy to debug.
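
One way to catch this early is a guard around test strings. This is only a sketch under assumptions: `requireUtf8()` is a hypothetical name, and the mbstring extension must be loaded. The byte `\xE9` stands in for an é that was saved in a single-byte encoding such as WINDOWS-1250.

```php
<?php
// Hypothetical guard: fail fast when a test string is not valid UTF-8.
function requireUtf8(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        // Include the raw bytes so the wrong encoding is easy to spot.
        throw new InvalidArgumentException('Not valid UTF-8: ' . bin2hex($text));
    }
    return $text;
}

// é stored as one byte, as a single-byte encoding would save it.
$legacy = "Jos\xE9";
// é written as a Unicode escape, which PHP always emits as UTF-8.
$utf8 = "Jos\u{e9}";

try {
    requireUtf8($utf8);   // passes
    requireUtf8($legacy); // throws
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```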

Knop answered 2/8, 2015 at 20:32 Comment(0)
