Unicode unknown "�" character detection in PHP

Is there any way in PHP of detecting the following character: �?

I'm currently fixing a number of UTF-8 encoding issues with a few different algorithms and need to be able to detect whether � is present in a string. How do I do so with strpos?

Simply pasting the character into my codebase does not seem to work.

if (strpos($names['decode'], '?') !== false || strpos($names['decode'], '�') !== false)
Bertilla answered 27/12, 2010 at 6:33 Comment(5)
This is the wrong approach. You should add more info about what you're doing; there are probably better ways to do what you want. - Consternate
As a last try, what about 0x00? See fileformat.info/info/unicode/char/0000/index.htm - Transcaucasia
Eric: Nope. Pekka: Some troublesome strings are double encoded. If decoding returns ? or �, then the string isn't double encoded. Unsure how else to detect this. - Bertilla
@Bertilla I take it the 0x00 approach didn't work out? - Consternate
Even if the == (loose) comparison of the � character with 0x00 succeeds for someone, it can't be used to detect the � character, since the == comparison with 0x00 will also pass if compared to "" or "0". You must use the === (strict) comparison of the � character with 0x00, which will most probably fail. - Thermoelectrometer
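The double-encoding check described in the comments above can be sketched roughly as follows. This is only a heuristic under stated assumptions: the helper name `looks_double_encoded` is hypothetical, the mbstring extension is assumed to be available, and the second encoding pass is assumed to have gone through ISO-8859-1 (real data often went through Windows-1252 instead). The idea is to decode once and see whether the result is still valid multi-byte UTF-8.

```php
<?php
// Hypothetical helper: heuristic test for UTF-8 that was encoded twice.
// Assumes the accidental second pass treated the bytes as ISO-8859-1.
function looks_double_encoded(string $s): bool
{
    if (!mb_check_encoding($s, 'UTF-8')) {
        return false; // not valid UTF-8 at all, so not double encoded
    }

    // Undo one layer of encoding.
    $decoded = mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');

    // Double encoded only if the decoded bytes still contain non-ASCII
    // bytes AND those bytes again form valid UTF-8.
    return preg_match('/[\x80-\xFF]/', $decoded) === 1
        && mb_check_encoding($decoded, 'UTF-8');
}

var_dump(looks_double_encoded("D\xC3\x83\xC2\xBCsseldorf")); // double-encoded "Düsseldorf"
var_dump(looks_double_encoded("D\xC3\xBCsseldorf"));         // correctly encoded "Düsseldorf"
```

A correctly encoded string that happens to contain text such as "Ã¼" on purpose would be misreported, so the result is a hint rather than proof.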

Converting a string from UTF-8 to UTF-8 with iconv() and the //IGNORE flag produces a result in which invalid UTF-8 byte sequences are dropped.

Therefore, you can detect a broken character by comparing the length of the string before and after the iconv operation. If the lengths differ, the original string contained a broken character.

Test case (make sure you save the file as UTF-8):

<?php

header("Content-type: text/html; charset=utf-8");

$teststring = "Düsseldorf";

// Deliberately create broken string
// by encoding the original string as ISO-8859-1
// (note: utf8_decode() is deprecated as of PHP 8.2)
$teststring_broken = utf8_decode($teststring);

echo "Broken string: ".$teststring_broken ;

echo "<br>";

$teststring_converted = iconv("UTF-8", "UTF-8//IGNORE", $teststring_broken );

echo $teststring_converted;

echo "<br>";

if (strlen($teststring_converted) != strlen($teststring_broken)) {
    echo "The string contained an invalid character";
}

In theory, you could drop //IGNORE and simply test for a failed (empty) iconv operation, but there might be reasons for iconv to fail other than invalid characters; I don't know. I would use the comparison method.

Consternate answered 3/1, 2011 at 12:24 Comment(1)
Note that the "//IGNORE" option can fail on recent versions of the libiconv library, but you can use this workaround: ini_set('mbstring.substitute_character', "none"); $teststring_converted = mb_convert_encoding($string, 'UTF-8', 'UTF-8'); - Thermoelectrometer
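The mbstring workaround from the comment above could be wrapped up as in the following sketch. The helper name `has_invalid_utf8` is hypothetical, and the mbstring extension is assumed to be installed. It uses `mb_substitute_character()` at runtime rather than `ini_set()`, and restores the previous setting afterwards so other code is not affected.

```php
<?php
// Hypothetical helper: true when $s contains bytes that are not valid UTF-8.
function has_invalid_utf8(string $s): bool
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // drop invalid sequences instead of substituting
    $clean = mb_convert_encoding($s, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the setting

    // A shorter result means at least one invalid sequence was dropped.
    return strlen($clean) !== strlen($s);
}

var_dump(has_invalid_utf8("D\xFCsseldorf"));     // ISO-8859-1 bytes, not valid UTF-8
var_dump(has_invalid_utf8("D\xC3\xBCsseldorf")); // valid UTF-8
```

If only a yes/no answer is needed and the cleaned string is not, `mb_check_encoding($s, 'UTF-8')` may be a simpler way to ask the same question.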

Here is what I do to detect and correct the encoding of strings not encoded in UTF-8 when that is what I am expecting:

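    // With strict mode, mb_detect_encoding() returns false if no candidate matches.
    $encoding = mb_detect_encoding($str, 'utf-8, iso-8859-1, ascii', true);
    if ($encoding !== false && strcasecmp($encoding, 'UTF-8') !== 0) {
      // Convert anything that was not detected as UTF-8.
      $str = iconv($encoding, 'utf-8', $str);
    }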
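A variant of the same idea, sketched with `mb_check_encoding()` in place of `mb_detect_encoding()`. The helper name `to_utf8` is hypothetical, the mbstring extension is assumed, and the sketch assumes that any input which is not valid UTF-8 is ISO-8859-1; that assumption will not hold for every data source.

```php
<?php
// Hypothetical helper: return $s as UTF-8.
// Assumption: input that is not valid UTF-8 is ISO-8859-1.
function to_utf8(string $s): string
{
    if (mb_check_encoding($s, 'UTF-8')) {
        return $s; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

// "D\xFCsseldorf" is "Düsseldorf" in ISO-8859-1.
var_dump(to_utf8("D\xFCsseldorf") === "D\xC3\xBCsseldorf");
```

Checking validity first avoids converting strings that are already UTF-8, which is one way double encoding creeps in.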
Imperil answered 5/1, 2011 at 10:23 Comment(0)

As far as I know, that question mark symbol is not a single character. There are many different character codes in the standard font sets that are not mapped to a symbol, and that is the default symbol that is used. To do detection in PHP, you would first need to know what font it is that you're using. Then you need to look at the font implementation and see what ranges of codes map to the "?" symbol, and then see if the given character is in one of those ranges.

Secularism answered 27/12, 2010 at 6:35 Comment(2)
Actually, it is a particular character: it's U+FFFD, the Unicode "Replacement Character". It can occur when some system couldn't decode the data at that point (and replaced it with that character) or if you simply don't have the font. Better to look at the data, and see what data you actually have. - Ailing
I suppose that's what I meant by "ranges" of data that don't properly decode. - Secularism
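Since the comment above identifies the character as U+FFFD, a strpos() check can search for its UTF-8 byte sequence (EF BF BD) directly. This is a minimal sketch; the helper name `contains_replacement_char` is hypothetical, and it assumes the string is UTF-8 and really contains the character. If the data holds invalid bytes that a browser merely displays as �, this check will not find anything.

```php
<?php
// U+FFFD encoded as UTF-8 is the three bytes EF BF BD.
const REPLACEMENT_CHAR = "\xEF\xBF\xBD";

// Hypothetical helper: true when $s contains at least one U+FFFD.
function contains_replacement_char(string $s): bool
{
    return strpos($s, REPLACEMENT_CHAR) !== false;
}

var_dump(contains_replacement_char("D\xEF\xBF\xBDsseldorf")); // contains U+FFFD
var_dump(contains_replacement_char("D\xC3\xBCsseldorf"));     // valid "Düsseldorf"
```

Writing the bytes as escapes keeps the check independent of the encoding the source file is saved in, which may be why pasting the character into the codebase did not work.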

I use the CUSTOM method (using str_replace) to sanitize undefined characters:

    $text='a³';

    $text=str_replace("\n\n",  "sample000"        ,$text);
    $text=str_replace("\n",    "sample111"        ,$text);

    $text=filter_var($text,FILTER_SANITIZE_SPECIAL_CHARS, FILTER_FLAG_STRIP_LOW);

    $text=str_replace("sample000",  "<br/><br/>"  ,$text);
    $text=str_replace("sample111",  "<br/>"       ,$text);

    echo $text; //outputs ------------>   a3
Expectorate answered 14/6, 2015 at 19:4 Comment(0)
