&#xC3;&#xA9; and other codes - McMap

&#xC3;&#xA9; and other codes

Asked 14/11, 2010 at 13:55 Answered 14/11, 2010 at 14:2

Solved utf-8 utf8-decode

I have a file full of these codes, and I want to "translate" the whole file into normal characters. How can I do it?

Thank you very much in advance.

Vtol asked 14/11, 2010 at 13:55 Comment(2)

What exactly do you mean? What do you see when you open the file in a hex editor? – Milks 14/11, 2010 at 14:2

Sorry about my bad explanation. I mean, with PHP's utf8_decode() function I can show the real value, but I need to apply that to the whole file. How do I do it? – Vtol 14/11, 2010 at 14:3

It looks like you originally had a UTF-8 file that was interpreted as an 8-bit encoding (e.g. ISO-8859-15) and then entity-encoded. I say this because the sequence C3A9 looks like a pretty plausible UTF-8 two-byte sequence.

You will need to first entity-decode it, then you'll have a UTF-8 encoding again. You could then use something like iconv to convert to an encoding of your choosing.

To work through your example:

&#xC3;&#xA9; would be decoded as the byte sequence 0xC3A9
0xC3A9 = 11000011 10101001 in binary
The leading 110 in the first octet tells us this could be interpreted as a UTF-8 two-byte sequence. As the second octet starts with 10, we're looking at something we can interpret as UTF-8. To do that, we take the last 5 bits of the first octet (00011) and the last 6 bits of the second octet (101001)...
So, interpreted as UTF-8, it's 00011101001 = 0xE9 = é (LATIN SMALL LETTER E WITH ACUTE, U+00E9)
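
The bit arithmetic above can be sketched in PHP as follows. This is only an illustration: decodeTwoByteUtf8 is a hypothetical helper name, it handles two-byte sequences only, and it does not reject overlong forms.

```php
<?php
// Hypothetical helper: decode one two-byte UTF-8 sequence into its code point.
function decodeTwoByteUtf8(string $bytes): int
{
    if (strlen($bytes) !== 2) {
        throw new InvalidArgumentException('Expected exactly two bytes');
    }

    $first  = ord($bytes[0]);
    $second = ord($bytes[1]);

    // The lead byte must look like 110xxxxx, the continuation byte like 10xxxxxx.
    if (($first & 0xE0) !== 0xC0 || ($second & 0xC0) !== 0x80) {
        throw new InvalidArgumentException('Not a two-byte UTF-8 sequence');
    }

    // Keep the low 5 bits of the first byte and the low 6 bits of the second.
    return (($first & 0x1F) << 6) | ($second & 0x3F);
}

printf("U+%04X\n", decodeTwoByteUtf8("\xC3\xA9")); // prints U+00E9
```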

You mention wanting to handle this with PHP; something like this might do it for you:

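 //to load from a file, use
 //$file=file_get_contents("/path/to/filename.txt");
 //example below uses a literal string to demonstrate technique...

 $file="Pr&#xC3;&#xA9;c&#xC3;&#xA9;dent is a French word";
 //decode against a single-byte charset so &#xC3;&#xA9; becomes the raw bytes C3 A9;
 //since PHP 5.4 the default charset is UTF-8, which would encode each entity separately
 $utf8=html_entity_decode($file, ENT_QUOTES, 'ISO-8859-1');
 //utf8_decode() is deprecated as of PHP 8.2;
 //mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8') is an alternative
 $iso8859=utf8_decode($utf8);

 //$utf8 contains "Précédent is a French word" in UTF-8
 //$iso8859 contains "Précédent is a French word" in ISO-8859-1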
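
For a whole file, the same two steps can be wrapped in a small function. This is a minimal sketch, not a definitive implementation: decodeEntityFile is a hypothetical helper name, the temporary files are placeholders for your real paths, and it assumes the file only contains numeric entities that stand for single bytes of UTF-8 text (a named entity such as &eacute; would come out as a lone ISO-8859-1 byte and need separate handling).

```php
<?php
// Hypothetical helper: entity-decode a file and write it out in $target encoding.
function decodeEntityFile(string $inPath, string $outPath, string $target = 'UTF-8'): void
{
    $raw = file_get_contents($inPath);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $inPath");
    }

    // Decode against a single-byte charset so &#xC3;&#xA9; becomes bytes C3 A9.
    $utf8 = html_entity_decode($raw, ENT_QUOTES, 'ISO-8859-1');

    // Optionally convert the recovered UTF-8 text to another encoding.
    $out = ($target === 'UTF-8') ? $utf8 : iconv('UTF-8', $target, $utf8);
    if ($out === false) {
        throw new RuntimeException("Cannot convert to $target");
    }

    if (file_put_contents($outPath, $out) === false) {
        throw new RuntimeException("Cannot write $outPath");
    }
}

// Demo with temporary files standing in for the real input and output.
$in  = tempnam(sys_get_temp_dir(), 'ent');
$out = tempnam(sys_get_temp_dir(), 'utf');
file_put_contents($in, "Pr&#xC3;&#xA9;c&#xC3;&#xA9;dent is a French word");
decodeEntityFile($in, $out);
echo bin2hex(substr(file_get_contents($out), 0, 4)), "\n"; // 5072c3a9
```

Plain text without entities passes through unchanged, so a file that mixes ordinary words and encoded sequences can be processed in one pass.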

Taffy answered 14/11, 2010 at 14:2 Comment(6)

But how to change a whole file? I mean, it's a file with "common" text and encoded sequences... – Vtol 14/11, 2010 at 14:31

If this technique doesn't work for your file, I'd suggest including a small hex dump of a relevant sample of your file. – Taffy 14/11, 2010 at 14:33

For instance: Pr&#xC3;&#xA9;c&#xC3;&#xA9;dent (it's a French word). The file contains words without accents and others with them, and that's the issue: I need to convert those accented words into, at least, UTF-8, and then I'd likely use iconv or something like it. – Vtol 14/11, 2010 at 14:35

The result of html_entity_decode() on the string you provided is the UTF-8 encoding of Précédent - not sure I see the problem. – Taffy 14/11, 2010 at 14:37

Let me put it another way: decode the whole file, and update your question with exactly what's wrong when you decode with html_entity_decode. If you're not sure how to load a file into a string, try $str=file_get_contents($my_filename) – Taffy 14/11, 2010 at 14:44

Well, really that's a silly problem. Your question was very nice, thank you very much! – Vtol 14/11, 2010 at 14:45
