Short answer: Your text file is most likely not "ANSI"-encoded, but UTF-8.
Long answer:
First, the term "ANSI" (on Windows) doesn't mean a fixed encoding; its meaning depends on your language settings. For example, in Western Europe and the USA, it will usually be Windows-1252 (a variant of ISO/IEC 8859-1, also known as Latin-1); in Japan, it's Shift JIS; and in Arabic countries, it's Windows-1256.
If you are using a non-Arabic version of Windows and have not changed your language settings, and you can see Arabic letters in the file when you open it in Notepad, then it is certainly not in any of these ANSI encodings. Instead, it is probably Unicode.
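To illustrate why "ANSI" is ambiguous, here is a minimal Python sketch that decodes the same single byte under three code pages. The codec names `cp1252`, `cp932` and `cp1256` are Python's names for the Western European, Japanese (Shift JIS variant) and Arabic Windows code pages; the byte value is an arbitrary pick for illustration.

```python
# One byte, three different "ANSI" interpretations.
raw = b"\xd4"

for codepage in ("cp1252", "cp932", "cp1256"):
    # Each Windows code page maps the byte 0xD4 to a different character.
    print(codepage, "->", raw.decode(codepage))
```

The bytes in the file never change; only the code page used to interpret them does, which is why a file written on one machine can look garbled on another.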
Note that I don't mean "UNICODE" in the Windows sense, where it usually means UTF-16LE. It could be UTF-8 as well. Both are encodings that can represent all 100,000+ characters currently defined in Unicode, but they do it in different ways. Both are variable-length encodings, meaning that not all characters are encoded using the same number of bytes.
In UTF-8, each character is encoded as one to four bytes. The encoding was designed so that ASCII characters are encoded in a single byte.
In UTF-16, each character is encoded as either two or four bytes. This encoding was originally invented when Unicode had fewer than 64K characters, so every character could be encoded in a single 16-bit word. Later, when it became clear that Unicode would have to grow beyond the 64K limit, a scheme was invented where pairs of words in the range 0xD800-0xDFFF (surrogate pairs) are used to represent characters outside of the first 64K (minus 0x800) characters.
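The variable lengths described above can be checked with a short Python sketch. The helper name `encoded_sizes` and the sample characters are chosen for illustration only.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# ASCII letter, Arabic letter, euro sign, and an emoji outside the first 64K.
for ch in ("A", "ش", "€", "😀"):
    utf8 = ch.encode("utf-8").hex(" ")
    utf16 = ch.encode("utf-16-be").hex(" ")
    print(f"U+{ord(ch):04X}  UTF-8: {utf8:<12} UTF-16BE: {utf16}")
```

The emoji is the only one of the four that needs two 16-bit words in UTF-16; both words fall in the 0xD800-0xDFFF range.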
To see what's actually in the file, open it in a hex editor:
- If the first two bytes are FF FE, then it is likely UTF-16LE (little endian)
- If the first two bytes are FE FF, then it is likely UTF-16BE (big endian, unlikely on Windows)
- If the first three bytes are EF BB BF, then it is likely UTF-8
- If you see a lot of 00 bytes, it is likely UTF-16 (or UTF-32, if you see pairs of 00 bytes)
- If Arabic characters occupy a single byte, it is likely ISO-8859-6 or Windows-1256 (e.g. ش would be D4 in both).
- If Arabic characters occupy multiple bytes, it is likely UTF-8 (e.g. ش would be D8 B4).
`file` command says? – Stemma
`file` yields BS on this file. :-( – Vitia
`file` yields Latin-1 which is obviously wrong. – Vitia
`file` says ../data.txt: ISO-8859 text, with CRLF line terminators – Canvasback