Your file is stored in UTF-16 (Unicode). The first character in your file is "L", which is code point 0x4C. The first 4 bytes of your file are FF FE 4C 00, which are a byte-order mark (BOM) and the letter L encoded in UTF-16 as two bytes.
fgets is not Unicode-aware, so it's looking for the newline character '\n', which is the byte 0x0A. Most likely this will happen on the first byte of a Unicode newline (the two bytes 0A 00), but it could also happen on plenty of other non-newline characters such as U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) or anything in the Gurmukhi or Gujarati scripts (U+0A00 to U+0AFF).
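To make the false-newline problem concrete, here is a minimal sketch that scans raw bytes for 0x0A the way a byte-oriented reader would. The helper name first_lf_byte is made up for illustration; it is not part of the C library and is not how fgets is actually implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first 0x0A byte in buf,
   or len if there is none. This approximates where a byte-oriented
   reader such as fgets would decide the line ends. */
size_t first_lf_byte(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == 0x0A)
            return i;
    }
    return len;
}
```

For the UTF-16LE bytes 4C 00 0A 01 0A 00 (L, U+010A, newline) this returns 2, which is the low byte of U+010A and not the real newline at index 4. For 4C 00 95 0A 0A 00 (L, U+0A95, newline) it returns 3, the high byte of the Gujarati letter.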
In any case, though, the data that's ending up in the buffer wah has lots of embedded nulls and looks something like FF FE 4C 00 47 00 4F 00 4F 00 0A 00. NUL (0x00) is the C string terminator, so when you attempt to print this out using printf, it stops at the first null, and all you see is \377\376L. \377\376 is the octal representation of the bytes FF FE.
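The truncation can be reproduced without any file at all. The sketch below hard-codes the example bytes from above into an array (utf16_line and visible_bytes are illustrative names, not anything from your program) and measures how much of it a C string function can see.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The example bytes: BOM, "LGOO" in UTF-16LE, the 0x0A byte that stops
   fgets, and the terminating NUL that fgets appends. */
static const char utf16_line[] = {
    '\xFF', '\xFE', 'L', 0, 'G', 0, 'O', 0, 'O', 0, '\n', 0
};

/* Number of bytes printf("%s", buf) would print: everything before the
   first NUL byte. */
size_t visible_bytes(const char *buf)
{
    assert(buf != NULL);
    return strlen(buf);
}
```

Here visible_bytes(utf16_line) is 3 even though the array holds 12 bytes: the two BOM bytes and the 'L', which matches the \377\376L output.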
The fix for this is to convert your text file to a byte-oriented encoding such as ISO 8859-1 or UTF-8. Note that most single-byte encodings cannot encode the full range of Unicode characters, so if you need Unicode, I strongly recommend using UTF-8, which uses one to four bytes per character but never produces a zero byte except for NUL itself. Alternatively, you can convert your program to be Unicode-aware, but then you can no longer use a lot of standard library functions (such as fgets and printf), and you need to use wchar_t everywhere in place of char.
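If you would rather do the conversion inside your program than re-save the file, something along these lines could work. This is only a sketch under stated assumptions: the input is little-endian UTF-16 already read into memory as raw bytes (for example with fread on a file opened in binary mode), and utf16le_to_utf8 is a hypothetical helper, not a standard library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert in_len bytes of UTF-16LE to a
   NUL-terminated UTF-8 string in out. A leading BOM (FF FE) is skipped.
   Returns the number of UTF-8 bytes written, not counting the
   terminator, or (size_t)-1 on malformed input or if out is too small. */
size_t utf16le_to_utf8(const unsigned char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    assert(in != NULL && out != NULL);
    if (out_cap == 0 || in_len % 2 != 0)
        return (size_t)-1;

    size_t i = 0, o = 0;
    if (in_len >= 2 && in[0] == 0xFF && in[1] == 0xFE)
        i = 2;                                  /* skip the BOM */

    while (i < in_len) {
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        i += 2;

        if (cp >= 0xD800 && cp <= 0xDBFF) {     /* high surrogate */
            if (i >= in_len)
                return (size_t)-1;              /* missing low surrogate */
            uint32_t lo = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i += 2;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;                  /* stray low surrogate */
        }

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need >= out_cap)
            return (size_t)-1;                  /* keep room for the NUL */

        if (need == 1) {
            out[o++] = (char)cp;
        } else if (need == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

Once the text is UTF-8, the result is an ordinary C string with no embedded nulls, so printf and the other char-based functions behave as expected. Note that you would still have to split the UTF-16 data into lines yourself (by looking for the two-byte sequence 0A 00 at an even offset) before or after converting, because fgets cannot do it on the raw UTF-16 bytes.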