How to make a text file have more than one encoding?

Asked 14/2, 2010 at 12:7 Answered 14/2, 2010 at 12:41

I have a file which is ANSI encoded, yet it shows Arabic letters. This text file was generated by some program (which I have no information on), but it seems there is some kind of internal encoding (if I may say so, and if that is possible) that makes the Arabic letters appear.

Is there such a thing? If not, how can the ANSI file show the Arabic letters?

*If possible explain in Java code

Edit 01

When I open it in Notepad++ it shows that the page encoding is ANSI. Please check this photo:

http://www.4shared.com/file/221862075/e8705951/text-Windows.html

Edit 02

You can check the file here:

http://www.4shared.com/file/221853641/3fa1af8c/data.html

Stakeout answered 14/2, 2010 at 12:7 Comment(8)

Do you have access to Linux? If so, what does the file command say? – Stemma 14/2, 2010 at 12:17

No, I don't have access to linux...if you do however and would like to help, please download my file from the link I provided in Edition 02 and let me know what you get. Appreciate your cooperation. – Stakeout 14/2, 2010 at 12:23

@João file yields BS on this file. :-( – Vitia 14/2, 2010 at 12:30

What is BS? Sorry I'm very beginner with linux! – Stakeout 14/2, 2010 at 12:33

Wow, amazing... with such an easy trick you managed to find the encoding! It is Windows-1256, Klarth... you are the HERO :) Please write your text as an answer so I can mark it as accepted! – Stakeout 14/2, 2010 at 12:35

@MAK BS isn’t really Linux-specific. I meant bullsh*t. ;-) As in, file yields Latin-1 which is obviously wrong. – Vitia 14/2, 2010 at 12:39

@Konrad Rudolph> lol, well thanks for trying. Klarth managed with his simple trick to find that it is Windows-1256! – Stakeout 14/2, 2010 at 12:43

On my machine, file says ../data.txt: ISO-8859 text, with CRLF line terminators – Canvasback 14/2, 2010 at 12:44

I tried opening the file in both Firefox and Opera. I had to set the character encoding to Arabic Windows-1256 to get it to display correctly in both browsers, so the file's encoding is most likely to be that.
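Since the question asked for Java, here is a minimal sketch, assuming the file really is Windows-1256 as the browser test suggests. The class name `Cp1256ToUtf8` and the `.utf8.txt` output suffix are made up for illustration. `windows-1256` is the charset name Java uses for this code page; it ships with standard JDK builds, though a stripped-down runtime might not include it.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads a Windows-1256 file and writes a UTF-8 copy.
class Cp1256ToUtf8 {
    // Assumption: the input is Windows-1256 (the Arabic "ANSI" code page).
    static final Charset WINDOWS_1256 = Charset.forName("windows-1256");

    // Decode raw bytes into a Java String (which is Unicode internally).
    static String decode(byte[] raw) {
        return new String(raw, WINDOWS_1256);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java Cp1256ToUtf8.java <file>");
            return;
        }
        Path in = Paths.get(args[0]);
        Path out = Paths.get(args[0] + ".utf8.txt");
        String text = decode(Files.readAllBytes(in));
        // Re-encode explicitly as UTF-8 so the result no longer depends on a code page.
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + out);
    }
}
```

The important part is passing the charset explicitly; relying on the platform default is what makes such files look different from machine to machine.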

NOTE: I originally posted this as a comment, but was asked to make it an answer.

Sisile answered 14/2, 2010 at 12:40 Comment(1)

Thanks again...Your simple testing (which I never thought of doing) found the solution for my problem which wasted 8 hours of my time. Thank you 100000000 times :) – Stakeout 14/2, 2010 at 12:46

Short answer: Likely, your text file is not "ANSI"-encoded, but UTF-8.

Long answer:

First, the term "ANSI" (on Windows) doesn't mean a fixed encoding; its meaning depends on your language settings. For example, in Western Europe and the USA it will usually be Windows-1252 (a variant of ISO/IEC 8859-1, also known as Latin-1), in Japan it's Shift JIS, and in Arabic countries it's Windows-1256 (similar to, but not the same as, ISO/IEC 8859-6).

If you are using a non-Arabic version of Windows and have not changed your language settings, and you can see Arabic letters in the file when you open it in Notepad, then it is certainly not in any of these ANSI encodings. Instead, it is probably Unicode.

Note that I don't mean "UNICODE", which on Windows usually means UTF-16LE. It could be UTF-8 as well. Both are encodings that can encode all 100,000+ characters currently defined in Unicode, but they do it in different ways. Both are variable-length encodings, meaning that not all characters are encoded using the same number of bits.

In UTF-8, each character is encoded as one to four bytes. The encoding has been chosen such that ASCII characters are encoded in one byte.

In UTF-16, each character is encoded as either two or four bytes. This encoding was originally invented when Unicode had fewer than 64K characters, so that every character could be encoded in a single 16-bit word. Later, when it became clear that Unicode would have to grow beyond the 64K limit, a scheme was invented where pairs of words in the range 0xD800-0xDFFF are used to represent characters outside of the first 64K (minus 0x800) characters.
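To make the variable lengths concrete, here is a small Java sketch (the class name `EncodedLengths` is only for illustration) that counts the bytes each encoding needs for an ASCII letter, an Arabic letter, and a character outside the first 64K.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16LE does not prepend a byte order mark, so only the text is counted.
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        // 'A', Arabic letter sheen, and an emoji stored as a surrogate pair.
        String[] samples = {"A", "\u0634", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0), utf8Bytes(s), utf16Bytes(s));
        }
        // U+0041: UTF-8 = 1 bytes, UTF-16 = 2 bytes
        // U+0634: UTF-8 = 2 bytes, UTF-16 = 2 bytes
        // U+1F600: UTF-8 = 4 bytes, UTF-16 = 4 bytes
    }
}
```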

To see what's actually in the file, open it in a hex editor:

If the first two bytes are FF FE, then it is likely UTF-16LE (little endian)
If the first two bytes are FE FF, then it is likely UTF-16BE (big endian, unlikely on Windows)
If the first three bytes are EF BB BF, then it is likely UTF-8
If you see a lot of 00 bytes, it is likely UTF-16 (or UTF-32, if you see pairs of 00 bytes)
If Arabic characters occupy a single byte, it is likely ISO-8859-6 or Windows-1256 (e.g. ش would be D4 in both).
If Arabic characters occupy multiple bytes, it is likely UTF-8 (e.g. ش would be D8 B4).
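The checks above can also be roughed out in Java instead of a hex editor. This is only a sketch under simplifying assumptions: it looks at the byte order marks listed above and then tests whether the bytes are valid UTF-8. It does not handle UTF-32, and it cannot tell one single-byte code page from another. The class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingSniffer {
    // Returns a rough guess, not a reliable detection.
    static String guess(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8 (BOM)";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE (BOM)";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE (BOM)";
        if (isValidUtf8(data)) return "UTF-8 or plain ASCII (no BOM)";
        return "single-byte code page (for example Windows-1256 or ISO-8859-6)";
    }

    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    static boolean isValidUtf8(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(guess(new byte[] {(byte) 0xD8, (byte) 0xB4})); // sheen in UTF-8
        System.out.println(guess(new byte[] {(byte) 0xD4}));              // sheen as one byte
    }
}
```

A lone byte such as D4 is not valid UTF-8, which is why text in a single-byte Arabic code page usually fails the UTF-8 check quickly.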

Walkup answered 14/2, 2010 at 12:11 Comment(0)

How do you know that it's ANSI encoded? If it's not a multi-byte encoding like UTF-8, my guess would be it's encoded using an Arabic code page like this one: Windows-1256.

You could look at the file in a hex editor and find out what numbers the Arabic characters have and that way try to find out which encoding / code page it was created with.
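To know which numbers to look for, one option is to let Java print the bytes a known Arabic letter gets in each candidate encoding, and compare them with the hex editor view. A minimal sketch, with an invented class name, assuming the JDK in use provides the `windows-1256` and `ISO-8859-6` charsets (standard builds do):

```java
import java.nio.charset.Charset;

class HexByCharset {
    // Encodes the text in the named charset and returns the bytes as hex, e.g. "D8 B4".
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sheen = "\u0634"; // Arabic letter sheen
        for (String name : new String[] {"windows-1256", "ISO-8859-6", "UTF-8", "UTF-16LE"}) {
            System.out.println(name + ": " + hex(sheen, name));
        }
    }
}
```

Note that the two single-byte Arabic encodings agree on many letters, so a single letter may not be enough to tell them apart.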

Billat answered 14/2, 2010 at 12:11 Comment(0)

Is there such a thing?

No.

If not, how can the ANSI file show the Arabic letters?

~~It’s not a Windows-ANSI encoded file.~~ More likely, it uses a variable-width encoding, most likely UTF-8: many common character positions in UTF-8 are equivalent to their positions in US-ASCII (in fact, it was designed that way), and by inference also for Windows-ANSI.

EDIT: We have to thank Microsoft for this confusion. “ANSI” isn’t well-specified when it comes to encodings. Usually it’s meant to stand for the Windows default encoding with codepage 1252 (“Windows-1252”), which happens to correspond to “Western” alphabets derived from Latin.

However, in other countries the default encoding used by Windows (in older Windows versions … today, the default is UTF-8) is not Windows-1252 but rather a different encoding, which is then also called “ANSI”. In this case, codepage 1256.
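From the Java side, this locale dependence is visible in the platform default charset. The sketch below only prints what the JVM reports. It assumes a reasonably recent JDK: the `native.encoding` system property exists from JDK 17 on, and since JDK 18 `Charset.defaultCharset()` is UTF-8 regardless of the Windows code page. The class name is invented for the example.

```java
import java.nio.charset.Charset;

class ShowDefaultCharset {
    static String describe() {
        // May be null on JDKs older than 17, where the property does not exist.
        String nativeEncoding = System.getProperty("native.encoding");
        return "JVM default charset: " + Charset.defaultCharset()
                + ", native.encoding: " + (nativeEncoding == null ? "(not set)" : nativeEncoding);
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Because the answer differs between machines, code that reads such a file is safer when it names the charset explicitly.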

Vitia answered 14/2, 2010 at 12:12 Comment(6)

Please check this photo: 4shared.com/file/221862075/e8705951/text-Windows.html – Stakeout 14/2, 2010 at 12:16

@MAK: check it with a hex editor. In any case, Notepad++ must be lying to you. – Vitia 14/2, 2010 at 12:17

By the way, Notepad++ is correct, after all: “Windows ANSI” isn’t one encoding. Rather, it’s a different encoding depending on the Windows locale version. The usual encoding is Windows-1252, which is the western (central Europe?) encoding but this is a typical North-American/European-centric prejudice. Codepage 1256 and many others are also often called “ANSI”. – Vitia 14/2, 2010 at 12:47

Interesting! Thanks for the info :) – Stakeout 14/2, 2010 at 12:51

Konrad, what has Microsoft got to do with ANSI? ANSI is the American National Standards Institute. BTW, Windows codepages are not exactly the same as the character sets in ANSI/ISO standards. For example, Windows-1252 is a superset of ISO 8859-1 and contains some additional characters such as the Euro symbol. – Heady 14/2, 2010 at 13:57

@Pauli Microsoft introduced the misleading designation “ANSI” for their default encodings. Wikipedia sums it up nicely: “The term "ANSI code page" is also used to refer to code pages used in Windows, like Windows-1252. Even though Windows-1252 is considered an ANSI code page in Microsoft Windows parlance, the code page has never been standardized by ANSI.” Historical note via Raymond Chen: blogs.msdn.com/oldnewthing/archive/2004/05/31/144893.aspx – Vitia 14/2, 2010 at 16:15

ANSI character encoding allows for 217 characters and does not contain Arabic letters. I think perhaps the file uses an alternative encoding.

Answering your edit, it appears that the problem is with Notepad++, because what is being displayed is clearly beyond the capabilities of the ANSI charset.

Tagmemics answered 14/2, 2010 at 12:14 Comment(3)

How do you get to 217 characters? Are these the printable characters? – Vitia 14/2, 2010 at 12:15

Yes, they are printable. I suppose there are more non-printable. – Tagmemics 14/2, 2010 at 12:19

It is quite possible the file is UTF-8, but what about the possibility of code pages? There were non-English characters on computers before UTF-8. – Billat 14/2, 2010 at 12:40

First I downloaded your file and tried to use vim to check its encoding. It didn't seem to know, and on a second machine it said latin1, which could be similar to what happened in Notepad++ (it gave the generic answer).
So I ran file data.txt and the output was this:

data.txt: ISO-8859 text, with CRLF line terminators

Hope this helps.

EDIT:
Trying the browser trick showed that this answer is incorrect.

ISO-8859-4 and ISO-8859-13 could display the text without errors, but the characters were not Arabic.
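The same trial-and-error can be automated. The sketch below is a rough heuristic, not a real detector: it decodes the bytes with a few candidate charsets (the list is my own choice) and counts how many characters land in the Unicode Arabic block U+0600 to U+06FF. A single-byte decoding that "works without errors" but scores zero is the situation described above. The sample bytes are generated in the code; with the real file you would pass its contents instead.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class CandidateCharsets {
    // Hypothetical candidate list; extend it as needed.
    static final String[] CANDIDATES = {"windows-1256", "ISO-8859-6", "ISO-8859-4", "ISO-8859-1"};

    // Counts characters in the basic Arabic block.
    static int countArabic(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= 0x0600 && c <= 0x06FF) count++;
        }
        return count;
    }

    // Decodes the same bytes with every candidate and records the Arabic letter count.
    static Map<String, Integer> score(byte[] raw) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String name : CANDIDATES) {
            result.put(name, countArabic(new String(raw, Charset.forName(name))));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample: a four-letter Arabic word encoded as Windows-1256.
        byte[] raw = "\u0634\u0643\u0631\u0627".getBytes(Charset.forName("windows-1256"));
        score(raw).forEach((name, n) -> System.out.println(name + ": " + n + " Arabic letters"));
    }
}
```

Windows-1256 and ISO-8859-6 share many letter positions, so their scores can be close; a human look at the decoded text is still the final check.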

Politburo answered 14/2, 2010 at 12:41 Comment(0)
