PHP: How to encode U+FFFD in order to do a replace?
F

3

7

I'm trying to display a data feed on a page. We're experiencing encoding issues with a weird character. For some reason, in the feed there's the U+FFFD character. And htmlentities() will not escape the character, so I need to replace it manually. (I'm using PHP 5.3)

I've tried the following:

$string = str_replace( "\xFFFD",  "_", $string );
$string = str_replace( "\XFFFD",  "_", $string );
$string = str_replace( "\uFFFD",  "_", $string );
$string = str_replace("\x{FFFD}", "_", $string );
$string = str_replace("\X{FFFD}", "_", $string );
$string = str_replace("\P{FFFD}", "_", $string );
$string = str_replace("\p{FFFD}", "_", $string );

None of the above work.

After reading this page - http://php.net/manual/en/regexp.reference.unicode.php - I'm not sure what I'm doing wrong. Do I need to compile UTF-8 support into PCRE?

Fuchs answered 5/12, 2012 at 15:56 Comment(4)
This may help: a different language, but a very similar result.Truncated
Also try the preg_replace function, since str_replace doesn't use regex.Truncated
@redolent, Guys, stop abusing the U+FFFD character for what it's not meant to be.Spermiogenesis
@Spermiogenesis the character was given in the feed we had to parse, so there was no way around it. (Screenshot of the input: flickr.com/photos/90840058@N04/8249714661/in/photostream/…)Fuchs
P
7

Use preg_replace instead like this:

$string = preg_replace('@\x{FFFD}@u', '_', $string);
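A note on why the str_replace attempts in the question fail: PHP string escapes are not regex escapes. In a double-quoted string, \x consumes at most two hex digits, so "\xFFFD" is the byte 0xFF followed by the letters "FD", and PHP 5.3 has no \u escape at all (the "\u{FFFD}" form only arrived in PHP 7.0). Below is a minimal sketch of the preg_replace approach with a guard for the null that preg_replace returns when the subject is not valid UTF-8 under the u modifier. The helper name replaceReplacementChar is hypothetical, not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): replace every U+FFFD
// in a UTF-8 string with $with, using PCRE's UTF-8 mode.
function replaceReplacementChar($string, $with = '_')
{
    // Single quotes keep \x{FFFD} intact so PCRE, not PHP, interprets it.
    $result = preg_replace('/\x{FFFD}/u', $with, $string);

    // With the u modifier, preg_replace returns null if $string is not
    // valid UTF-8; fall back to the untouched input in that case.
    return $result === null ? $string : $result;
}

// "\xEF\xBF\xBD" is the UTF-8 byte sequence for U+FFFD.
echo replaceReplacementChar("abc\xEF\xBF\xBDdef"), "\n"; // abc_def
```

Returning the input unchanged on failure is only one possible choice; checking preg_last_error() and logging the bad field may be more appropriate for a feed importer.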
Portfolio answered 5/12, 2012 at 16:6 Comment(0)
A
11

You should try to fix the underlying problem. U+FFFD (the Unicode replacement character) is, in most cases, not meant to be a real text character. It is a sign that something was decoded as a UTF encoding when it was not actually encoded that way. It is an alternative to silently discarding invalid bytes or halting the decoding process entirely. Either way, if you see it, there was an error.

There is no way to know what the original character was. Especially with your solution, since you replace the character with _, you cannot even know that the original source was decoded incorrectly. You should go back to the source and decode it properly.

Note: It's possible for a source text to use � as a literal, normal character, for instance when talking about it, and there is no error then. I am excluding this possibility in my answer.
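As a rough illustration of "decode it properly", here is a sketch that validates the raw bytes as UTF-8 before use and only converts when validation fails. The helper name toUtf8 and the ISO-8859-1 fallback are assumptions made for the example; the real source encoding has to come from whoever produces the feed. It also cannot recover anything when U+FFFD is already stored in the file itself, since those bytes are valid UTF-8.

```php
<?php
// Hypothetical helper: return $raw as UTF-8. The fallback charset
// (ISO-8859-1) is only an assumption for illustration; use whatever
// encoding the feed's producer actually documents.
function toUtf8($raw, $assumedCharset = 'ISO-8859-1')
{
    // An empty pattern with the u modifier matches only if the whole
    // subject is valid UTF-8; otherwise preg_match returns false.
    if (preg_match('//u', $raw) === 1) {
        return $raw; // already valid UTF-8, leave untouched
    }

    // Not valid UTF-8: reinterpret the bytes in the assumed charset.
    $converted = iconv($assumedCharset, 'UTF-8', $raw);

    return $converted === false ? $raw : $converted;
}

echo bin2hex(toUtf8("caf\xE9")), "\n"; // 636166c3a9
```

Converting before any decoder has had a chance to substitute U+FFFD is the point: once the substitution has happened, the original byte is gone.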

Aegis answered 6/12, 2012 at 10:28 Comment(12)
Well, "�" is a "real" character in itself... :) But yes, I agree that there's some root problem the OP is ignoring. +1Patras
@Patras well, rather, it's not a normal text character, but you know what I mean, right?Aegis
Well, it is a real, normal text character. Just like 💩, ☃ and 風 are real characters. It is simply used in special cases. :)Patras
@Patras but those are real text characters, if I see �, I know there was an error and can't know what the real text character behind it was meant to be unless I decode properly :PAegis
But I can type it as a normal character: �. ��������. That doesn't mean what I typed here was wrongly decoded. I know what you mean, but it is a real character. :PPatras
@Patras got me there, but at least I recall reading that it's recommended not to use � literallyAegis
Maybe so, unless you want to talk about �. ;)Patras
@Patras ok read this from unicode.org: unicode.org/charts/PDF/UFFF0.pdf :P I have confused it with FFFE :(Aegis
Good discussion. The problem is that one of the fields (on several entries) I'm importing from an XML document has that character in it. I have tried adjusting the encoding, and I'm sure I'm doing it correctly since the other fields have UTF-8 characters appear correctly.Fuchs
@Fuchs by "has that character in it", do you mean it literally appears in the file? That is, when you do a raw hex dump of the file, you see the bytes 0xEF 0xBF 0xBD?Aegis
Yes, it has exactly EF BF BD (See screenshot). The strange part is, other similar parts of the file have a valid UTF-8 character in its place.Fuchs
@Fuchs ah I see. There is then nothing you can do, but judging from the surrounding text it doesn't look like there was literally meant to be the character so it must have been some kind of error in the history of that file.Aegis
I
1

In UTF-8, '�' (U+FFFD) is encoded as the three bytes EF BF BD.

str_replace works on raw bytes, so to replace the character you have to write each byte as its own hex escape:

\xEF\xBF\xBD

$string = str_replace("\xEF\xBF\xBD",'X','My ��� some text');
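A small sketch of the same byte-level approach, with a check that the search string really is the three bytes EF BF BD and a count of how many occurrences were present. The variable names are made up for the example.

```php
<?php
// UTF-8 encoding of U+FFFD, written as three single-byte escapes.
$replacementChar = "\xEF\xBF\xBD";

$input = "My " . $replacementChar . " some text";

// substr_count shows how many occurrences are present before replacing.
$count = substr_count($input, $replacementChar);

// str_replace compares raw bytes, so no regex or u modifier is involved.
$clean = str_replace($replacementChar, 'X', $input);

echo bin2hex($replacementChar), "\n"; // efbfbd
echo $count, "\n";                    // 1
echo $clean, "\n";                    // My X some text
```

Because this never invokes PCRE, it should also work on input that contains other invalid byte sequences, where a pattern with the u modifier would fail.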
Inquest answered 5/11, 2022 at 8:6 Comment(0)
