Why can I not read a UTF-16 file longer than 4094 characters?

Some information:

  • I've only tried this on Linux
  • I've tried both with GCC (7.2.0) and Clang (3.8.1)
  • It requires C++11 or higher to my understanding

What happens when I run it

I get the expected string "abcd" repeated until the output reaches 4094 characters. After that, all it prints is "?" characters until the end of the file.

What do I think about this?

I think this is not the expected behavior and that it must be a bug somewhere.

Code you can test with:

#include <cstdint>   // uint16_t, uint64_t
#include <iostream>
#include <fstream>
#include <locale>
#include <codecvt>
#include <string>    // std::wstring, getline

void createTestFile() {
  std::ofstream file ("utf16le.txt", std::ofstream::binary);
  if (file.is_open()) {
    uint16_t bom = 0xFEFF; // UTF-16 little endian BOM
    uint64_t abcd = 0x0064006300620061; // UTF-16 "abcd" string
    file.write((char*)&bom,2);
    for (size_t i=0; i<2000; i++) {
      file.write((char*)&abcd,8);
    }
    file.close();
  }
}

int main() {
  //createTestFile(); // uncomment to make the test file

  std::wifstream file;
  std::wstring line;

  file.open("utf16le.txt");
  file.imbue(std::locale(file.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
  if (file.is_open()) {
    while (getline(file,line)) {
      std::wcout << line << std::endl;
    }
  }
}
Salvation answered 24/8, 2017 at 20:59 Comment(15)
The example worked fine for me. I get a file with "abcd" repeated 2000 times, and it's displayed properly. Tried it with Visual Studio 2015. - Connelly
How are you verifying the file? Perhaps your viewer has a bug? - Tiga
Thank you for answering! I am happy that it works correctly on Windows without any changes; I want to stay cross-platform compatible if I can. Then the bug must have something to do with Linux, I would guess. Maybe even something silly like the terminal I use...? - Salvation
What program are you viewing the file in? - Slurp
Reproduced with gcc 7.1.1. After setting a breakpoint with a debugger and examining the contents of the read string, all signs point to a libstdc++ bug. The read string is 8000 wide characters, as expected. But starting at line[4094], the endianness is rudely reversed. You should open a gcc bug and attach the output of the debugger, showing what's in line[4093] and line[4094]. - Peppery
@MichaelDorgan In my code above I output the contents to stdout and read them from there using my terminal. That's where it starts displaying ? signs after 4094 characters are written to it. The file is OK and verified, but my code to read the file does not seem to function properly on my system. - Salvation
@SamVarshavchik Thank you very much for your analysis! I've never submitted a bug report before and I am pretty much new to this whole open source scene. I might try to open a gcc bug, but I've never even used a C++ debugger before. - Salvation
Well, now this is a perfect learning opportunity. Knowing how to use a debugger effectively is a required skill for every C++ developer. Your first project is to reproduce my results. This will get mangled by stackoverflow.com due to the lack of line breaks in comments, but: (gdb) p line[4093] $19 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628244: 98 L'b' (gdb) p line[4094] $20 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628248: 25344 L'挀' - the C++ library obviously botched reading this string. - Peppery
@SamVarshavchik Maybe you could write your answer as an answer instead of a comment; then the OP can link to this thread in the bug report. - Cykana
Does a binary dump of the file confirm that the contents are correct? @SamVarshavchik that question is for you too. - Monotone
Yup, the binary dump shows that it's correct. Verified by dumping it with od -v. - Peppery
<codecvt> is deprecated. It was badly specified, and implementations never got around to implementing it even semi-correctly. Don't use it. - Kuster
@n.m. It's marked as deprecated in C++17 but with no std alternatives. What do you suggest I use instead? - Salvation
Has anyone tried libcxx? - Fingering
My first choice would be avoiding UTF-16 entirely. Failing that, use a platform or third-party Unicode conversion library like iconv. - Kuster

This looks like a library bug to me. Stepping through the sample program as compiled by gcc 7.1.1 using gdb:

(gdb) n
28      while (getline(file,line)) {
(gdb) n
29        std::wcout << line << std::endl;
(gdb) p line.size()
$1 = 8000

8000 characters read, as expected. But then:

(gdb) p line[4092]
$18 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628240: 97 L'a'
(gdb) p line[4093]
$19 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628244: 98 L'b'
(gdb) p line[4094]
$20 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628248: 25344 L'挀'
(gdb) p line[4095]
$21 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x62824c: 25600 L'搀'
(gdb) p line[4096]
$22 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628250: 24832 L'愀'

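line[4092] and line[4093] look OK. But then I see line[4094], line[4095], and line[4096] containing 0x6300, 0x6400 and 0x6100 (25344, 25600 and 24832 in decimal) instead of 0x0063, 0x0064 and 0x0061. Those are 'c', 'd' and 'a' with the two bytes of each code unit swapped.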
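
To make the byte swap concrete, here is a minimal sketch that checks the three decimal values gdb printed above. The name swapBytes is my own helper for illustration, not anything from libstdc++.

```cpp
#include <cstdint>

// Hypothetical helper: swap the two bytes of a single UTF-16 code unit.
constexpr std::uint16_t swapBytes(std::uint16_t unit) {
    return static_cast<std::uint16_t>((unit << 8) | (unit >> 8));
}

// The decimal values gdb printed for line[4094], line[4095] and line[4096].
static_assert(swapBytes(25344) == 0x0063, "25344 (0x6300) is 'c' byte-swapped");
static_assert(swapBytes(25600) == 0x0064, "25600 (0x6400) is 'd' byte-swapped");
static_assert(swapBytes(24832) == 0x0061, "24832 (0x6100) is 'a' byte-swapped");
```

Since the checks are static_assert declarations, the snippet only has to compile. Each garbled character is exactly the expected one with its bytes reversed, which is what you would see if the decoder switched from little-endian to big-endian partway through the line.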

So, this is getting messed up starting with character 4094, and not 4096, actually. I dumped the binary UTF-16 file, and it looks correct to me. The BOM marker is followed by consistent endian-ness for the entire contents of the file.
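
The same check of the file can be done in code rather than by eye. This is only a sketch under the assumption that the file holds nothing but ASCII text, as the test file does; looksLikeAsciiUtf16Le is a made-up name for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true when the bytes start with the
// little-endian BOM (FF FE) and every following code unit is an ASCII
// character stored low byte first, i.e. the pair (ascii, 0x00).
bool looksLikeAsciiUtf16Le(const std::string& bytes) {
    if (bytes.size() < 2 || bytes.size() % 2 != 0) return false;
    if (static_cast<unsigned char>(bytes[0]) != 0xFF ||
        static_cast<unsigned char>(bytes[1]) != 0xFE) return false;
    for (std::size_t i = 2; i < bytes.size(); i += 2) {
        const unsigned char low = static_cast<unsigned char>(bytes[i]);
        const unsigned char high = static_cast<unsigned char>(bytes[i + 1]);
        if (low > 0x7F || high != 0x00) return false;  // wrong byte order or non-ASCII
    }
    return true;
}
```

If this returns true for the raw contents of utf16le.txt, the file itself has a consistent byte order and the reversal must happen while reading.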

The only puzzling part is why both clang and gcc are affected, but a quick Google search indicates that clang also uses gcc's libstdc++ (on Linux that is what clang links against unless told otherwise). So, this looks like a libstdc++ bug to me.
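
Until the library is fixed, one way around it is to skip the codecvt facet altogether: read the file as raw bytes and decode the UTF-16 by hand. The following is a rough sketch rather than a vetted implementation. decodeUtf16 and readBytes are names I made up, the result is UTF-32 in a std::u32string instead of a std::wstring, and a file without a BOM is assumed to be little-endian.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: decode raw UTF-16 bytes into UTF-32 code points.
// A leading BOM selects the byte order; without one, little-endian is
// assumed (an assumption of this sketch, matching the test file).
std::u32string decodeUtf16(const std::string& bytes) {
    std::size_t pos = 0;
    bool littleEndian = true;
    if (bytes.size() >= 2) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        const unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE) {
            pos = 2;                       // little-endian BOM
        } else if (b0 == 0xFE && b1 == 0xFF) {
            littleEndian = false;          // big-endian BOM
            pos = 2;
        }
    }
    // Read one 16-bit code unit at byte offset i in the detected byte order.
    auto unitAt = [&](std::size_t i) -> std::uint32_t {
        const std::uint32_t first = static_cast<unsigned char>(bytes[i]);
        const std::uint32_t second = static_cast<unsigned char>(bytes[i + 1]);
        return littleEndian ? (first | (second << 8)) : ((first << 8) | second);
    };
    std::u32string out;
    while (pos + 1 < bytes.size()) {
        std::uint32_t cp = unitAt(pos);
        pos += 2;
        // Combine a high surrogate with the low surrogate that follows it.
        if (cp >= 0xD800 && cp <= 0xDBFF && pos + 1 < bytes.size()) {
            const std::uint32_t low = unitAt(pos);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                pos += 2;
            }
        }
        out.push_back(static_cast<char32_t>(cp));
    }
    return out;
}

// Hypothetical helper: read a whole file as raw bytes, no locale facet involved.
std::string readBytes(const char* path) {
    std::ifstream file(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(file),
                       std::istreambuf_iterator<char>());
}
```

For the test file from the question, decodeUtf16(readBytes("utf16le.txt")) should yield 8000 code points with U'c' at index 4094. The byte order is decided once, up front, so it cannot change in the middle of the data. Unpaired surrogates are passed through unchanged here; a real program would want to reject or replace them.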

Peppery answered 24/8, 2017 at 21:49 Comment(0)
