PHP: How to encode U+FFFD in order to do a replace?

Asked 5/12, 2012 at 15:56 Answered 5/11, 2022 at 8:6

Solved php character-encoding escaping special-characters

I'm trying to display a data feed on a page. We're experiencing encoding issues with a weird character. For some reason, in the feed there's the U+FFFD character. And htmlentities() will not escape the character, so I need to replace it manually. (I'm using PHP 5.3)

I've tried the following:

$string = str_replace( "\xFFFD",  "_", $string );
$string = str_replace( "\XFFFD",  "_", $string );
$string = str_replace( "\uFFFD",  "_", $string );
$string = str_replace("\x{FFFD}", "_", $string );
$string = str_replace("\X{FFFD}", "_", $string );
$string = str_replace("\P{FFFD}", "_", $string );
$string = str_replace("\p{FFFD}", "_", $string );

None of the above work.

After reading this page - http://php.net/manual/en/regexp.reference.unicode.php - I'm not sure what I'm doing wrong. Do I need to compile UTF-8 support into PCRE?

Fuchs answered 5/12, 2012 at 15:56 Comment(4)

This may help: different language, but very similar result – Truncated 5/12, 2012 at 15:59

Also try using the preg_replace function as str_replace doesn't use regex – Truncated 5/12, 2012 at 16:7

@redolent, Guys, stop abusing the U+FFFD character for what it's not meant to be. – Spermiogenesis 27/1, 2015 at 11:5

@Spermiogenesis the character was given in the feed we had to parse, so there was no way around it. (Screenshot of the input: flickr.com/photos/90840058@N04/8249714661/in/photostream/…) – Fuchs 27/1, 2015 at 17:16

Use preg_replace instead like this:

$string = preg_replace('@\x{FFFD}@u', '_', $string);
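
A minimal sketch of why this works where the str_replace() attempts in the question did not. In a double-quoted PHP string, \x takes at most two hex digits, so "\xFFFD" is the single byte 0xFF followed by the literal letters "FD". With the u modifier, PCRE reads \x{FFFD} as the code point U+FFFD instead. The sample input below is hypothetical.

```php
<?php
// Hypothetical sample input: "bad" + U+FFFD + "char", written as raw UTF-8 bytes.
$string = "bad\xEF\xBF\xBDchar";

// "\xFFFD" is byte 0xFF plus the letters "FD", not U+FFFD,
// so str_replace() finds nothing to replace here.
$unchanged = str_replace("\xFFFD", '_', $string);

// With the u modifier, pattern and subject are treated as UTF-8 and
// \x{FFFD} matches the code point U+FFFD.
$fixed = preg_replace('@\x{FFFD}@u', '_', $string);

// preg_replace() returns null on error, e.g. when the subject is not valid UTF-8.
if ($fixed === null) {
    $fixed = $string;
}

echo $fixed, "\n"; // bad_char
```

The null check matters for feeds of uncertain quality: with the u modifier, an invalid UTF-8 subject makes the call fail rather than return the input unchanged.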

Portfolio answered 5/12, 2012 at 16:6 Comment(0)

You should try to fix the original problem. U+FFFD (the Unicode replacement character) is, in most cases, not meant to be a real text character. It is a sign that something was decoded as a UTF encoding when it was not actually encoded that way. Decoders emit it as an alternative to silently discarding invalid bytes or halting the decoding process completely. Either way, if you see it, there was an error.

There is no way to know what the original character was. Especially with your solution, since you replace the character with _, you cannot even know that the original source was decoded incorrectly. You should go back to the source and decode it properly.

Note: It's possible for a source text to use � as a literal, normal character, for instance when talking about it, and there is no error then. I am excluding this possibility in my answer.
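
If the feed cannot be fixed right away, one option is to at least record that damaged characters were seen before anything is replaced. This is only a sketch: the helper name countReplacementChars() and the sample field are hypothetical, and what to do with the warning is left to the application.

```php
<?php
// Hypothetical helper (not from the thread): count U+FFFD occurrences in a
// UTF-8 string by searching for its three-byte encoding.
function countReplacementChars($text)
{
    return substr_count($text, "\xEF\xBF\xBD");
}

// Hypothetical feed field with one damaged character.
$field = "caf\xEF\xBF\xBD au lait";

$count = countReplacementChars($field);
$warning = $count > 0
    ? "Field contains $count U+FFFD character(s); check the upstream encoding."
    : '';

if ($warning !== '') {
    echo $warning, "\n";
}
```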

Aegis answered 6/12, 2012 at 10:28 Comment(12)

Well, "�" is a "real" character in itself... :) But yes, I agree that there's some root problem the OP is ignoring. +1 – Patras 6/12, 2012 at 10:33

@Patras well, rather it's not a normal text character, but you know what I mean, right? – Aegis 6/12, 2012 at 10:34

Well, it is a real, normal text character. Just like 💩, ☃ and 風 are real characters. It is simply used in special cases. :) – Patras 6/12, 2012 at 10:36

@Patras but those are real text characters, if I see �, I know there was an error and can't know what the real text character behind it was meant to be unless I decode properly :P – Aegis 6/12, 2012 at 10:38

But I can type it as a normal character: �. ��. That doesn't mean what I typed here was wrongly decoded. I know what you mean, but it is a real character. :P – Patras 6/12, 2012 at 10:39

@Patras got me there, but at least I recall reading that it's recommended not to use � literally – Aegis 6/12, 2012 at 10:42

Maybe so, unless you want to talk about �. ;) – Patras 6/12, 2012 at 10:43

@Patras ok read this from unicode.org: unicode.org/charts/PDF/UFFF0.pdf :P I have confused it with FFFE :( – Aegis 6/12, 2012 at 10:43

Good discussion. The problem is that one of the fields (on several entries) I'm importing from an XML document has that character in it. I have tried adjusting the encoding, and I'm sure I'm doing it correctly since the other fields have UTF-8 characters appear correctly. – Fuchs 6/12, 2012 at 16:15

@Fuchs by "has that character in it", do you mean it literally appears in the file? That is, when you do a raw hex dump of the file, you see the bytes 0xEF 0xBF 0xBD? – Aegis 6/12, 2012 at 16:59

Yes, it has exactly EF BF BD (See screenshot). The strange part is, other similar parts of the file have a valid UTF-8 character in its place. – Fuchs 6/12, 2012 at 19:17

@Fuchs ah I see. There is then nothing you can do, but judging from the surrounding text it doesn't look like there was literally meant to be the character � so it must have been some kind of error in the history of that file. – Aegis 6/12, 2012 at 19:26
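
The raw hex check discussed in these comments can be sketched in PHP as follows. The field content is a hypothetical stand-in for bytes read from the XML file.

```php
<?php
// Hypothetical raw field bytes: "Caf" followed by the UTF-8 encoding of U+FFFD.
$raw = "Caf\xEF\xBF\xBD";

// Print every byte as two uppercase hex digits, separated by spaces.
$hex = implode(' ', str_split(strtoupper(bin2hex($raw)), 2));
echo $hex, "\n"; // 43 61 66 EF BF BD

// Search the raw bytes (not the hex text) for the encoded U+FFFD.
$pos = strpos($raw, "\xEF\xBF\xBD");
if ($pos !== false) {
    echo "U+FFFD found at byte offset $pos\n";
}
```

Seeing EF BF BD in the dump means the replacement character is stored in the file itself, so the damage happened before the file reached the importer.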

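'�' is U+FFFD. Encoded as UTF-8, it is the three bytes EF BF BD.

str_replace() works on bytes, so to replace the character you have to spell out the multi-byte sequence with one hex escape per byte:

\xEF \xBF \xBD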

$string = str_replace("\xEF\xBF\xBD",'X','My ��� some text');
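
A slightly fuller sketch of the same idea, with a hypothetical input string. The byte-escape form works on any PHP version, including the 5.3 mentioned in the question. From PHP 7.0 onward, the "\u{FFFD}" escape in a double-quoted string produces the same three bytes.

```php
<?php
// U+FFFD as its three UTF-8 bytes; works on any PHP version, including 5.3.
$replacementChar = "\xEF\xBF\xBD";

// Hypothetical input containing two replacement characters.
$input = "My {$replacementChar}{$replacementChar} some text";

$output = str_replace($replacementChar, 'X', $input);
echo $output, "\n"; // My XX some text

// PHP 7.0+ only: the "\u{...}" code point escape yields the same bytes.
var_dump("\u{FFFD}" === $replacementChar);
```

Because str_replace() compares raw bytes, this only matches when $input really is UTF-8; in another encoding the character would be stored as different bytes.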

Inquest answered 5/11, 2022 at 8:6 Comment(0)
