Why can I not read a UTF-16 file longer than 4094 characters? - McMap


Why can I not read a UTF-16 file longer than 4094 characters?

Asked 24/8, 2017 at 20:59 Answered 24/8, 2017 at 21:49

Solved c++ linux utf-16 wstring wifstream


Some information:

I've only tried this on Linux
I've tried both with GCC (7.2.0) and Clang (3.8.1)
To my understanding, it requires C++11 or later

What happens when I run it

I get the expected string "abcd" repeated up to character position 4094. After that, the only output is the "?" sign until the end of the file.

What do I think about this?

I think this is not the expected behavior and that it must be a bug somewhere.

Code you can test with:

#include <cstdint>   // uint16_t, uint64_t
#include <iostream>
#include <fstream>
#include <string>    // std::wstring, getline
#include <locale>
#include <codecvt>

void createTestFile() {
  std::ofstream file ("utf16le.txt", std::ofstream::binary);
  if (file.is_open()) {
    uint16_t bom = 0xFEFF; // UTF-16 little endian BOM
    uint64_t abcd = 0x0064006300620061; // UTF-16 "abcd" string
    file.write((char*)&bom,2);
    for (size_t i=0; i<2000; i++) {
      file.write((char*)&abcd,8);
    }
    file.close();
  }
}

int main() {
  //createTestFile(); // uncomment to make the test file

  std::wifstream file;
  std::wstring line;

  file.open("utf16le.txt");
  file.imbue(std::locale(file.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
  if (file.is_open()) {
    while (getline(file,line)) {
      std::wcout << line << std::endl;
    }
  }
}

Salvation answered 24/8, 2017 at 20:59 Comment(15)

The example worked fine for me. I get a file with 2000 times "abcd" and it's properly displayed. Tried it with Visual Studio 2015. – Connelly 24/8, 2017 at 21:4

How are you verifying the file? Perhaps your viewer has a bug? – Tiga 24/8, 2017 at 21:7

Thank you for answering! I am happy that it works correctly on Windows without any changes; I want to stay cross-platform compatible if I can. Then the bug must have something to do with Linux, I would guess. Maybe even something silly like the terminal I use...? – Salvation 24/8, 2017 at 21:7

What program are you viewing the file in? – Slurp 24/8, 2017 at 21:9

Reproduced with gcc 7.1.1. After setting a breakpoint with a debugger and examining the contents of the read string, all signs point to a libstdc++ bug. The read string is 8000 wide characters, as expected. But starting at line[4094], the endianness is rudely reversed. You should open a gcc bug and attach the output of the debugger, showing what's in line[4093] and line[4094]. – Peppery 24/8, 2017 at 21:15

@MichaelDorgan In my code above I write the contents to stdout and read them from there in my terminal. That's where it starts displaying ? signs after 4094 characters are written to it. The file is OK and verified, but my code to read the file does not seem to function properly on my system. – Salvation 24/8, 2017 at 21:20

@SamVarshavchik Thank you very much for your analysis! I've never submitted any bug reports before and I am pretty much new to this whole open source scene. I might try to open a gcc bug, but I've never even used a C++ debugger before. – Salvation 24/8, 2017 at 21:27

Well, now this is a perfect learning opportunity. Knowing how to effectively use a debugger is a required skill for every C++ developer. Your first project is to reproduce my results. This will get mangled by stackoverflow.com due to lack of linebreaks in comments, but:

(gdb) p line[4093] $19 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628244: 98 L'b' (gdb) p line[4094] $20 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628248: 25344 L'挀'

- the C++ library obviously botched reading this string. – Peppery 24/8, 2017 at 21:30

@SamVarshavchik maybe you could write your answer as an answer instead of comment, then OP can link to this thread in the bug report – Cykana 24/8, 2017 at 21:38

Does a binary dump of the file confirm that the contents are correct? @SamVarshavchik that question is for you too. – Monotone 24/8, 2017 at 21:49

Yup, the binary dump shows that it's correct. Verified by dumping it with od -v. – Peppery 24/8, 2017 at 21:50
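For anyone who would rather do the same check from C++ than with od, here is a minimal sketch. The helper names (hex_prefix, has_utf16le_bom, expect_reference_prefix) are made up for illustration, and the expected prefix assumes the test file was written by createTestFile() on a little-endian machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Format the first `count` bytes as space-separated two-digit hex.
// Hypothetical helper, roughly what a byte-wise dump would show.
std::string hex_prefix(const std::string& bytes, std::size_t count) {
    std::string out;
    for (std::size_t i = 0; i < bytes.size() && i < count; ++i) {
        char buf[4];
        std::snprintf(buf, sizeof buf, "%02x",
                      static_cast<unsigned>(static_cast<unsigned char>(bytes[i])));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}

// True when the buffer starts with the UTF-16 little-endian BOM (FF FE).
bool has_utf16le_bom(const std::string& bytes) {
    return bytes.size() >= 2 &&
           static_cast<unsigned char>(bytes[0]) == 0xFF &&
           static_cast<unsigned char>(bytes[1]) == 0xFE;
}

// Assumption: the file holds the BOM followed by "abcd" repeated,
// so the first six bytes should be FF FE 61 00 62 00.
void expect_reference_prefix(const std::string& bytes) {
    assert(has_utf16le_bom(bytes));
    assert(hex_prefix(bytes, 6) == "ff fe 61 00 62 00");
}
```

Read the file into a std::string through a std::ifstream opened with std::ios::binary and pass it to expect_reference_prefix. If that passes, the file itself is fine and the problem is on the reading side.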

<codecvt> is deprecated. It was badly specified and the implementations never got around to implementing it even semi-correctly. Don't use it. – Kuster 24/8, 2017 at 22:16

@n.m. It's marked as deprecated in C++17 but with no std alternatives. What do you suggest I use instead? – Salvation 25/8, 2017 at 17:33

Has anyone tried libcxx? – Fingering 26/8, 2017 at 6:50

My first choice would be avoiding UTF-16 entirely. Failing that, use a platform or third-party Unicode conversion library like iconv. – Kuster 26/8, 2017 at 15:9
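If pulling in a conversion library is not an option, one way to sidestep <codecvt> altogether is to read the file as raw bytes and decode the UTF-16LE code units by hand. This is only a sketch, not what iconv does internally: decode_utf16le and read_file_bytes are hypothetical helper names, only little-endian input is handled, and unpaired surrogates are replaced with U+FFFD as an arbitrary choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// Decode UTF-16LE bytes into UTF-32. A leading BOM (FF FE) is skipped.
// Hypothetical helper; a trailing odd byte is ignored.
std::u32string decode_utf16le(const std::string& bytes) {
    auto unit_at = [&bytes](std::size_t pos) -> std::uint32_t {
        // Low byte first, then high byte: that is what "little endian" means here.
        return static_cast<unsigned char>(bytes[pos]) |
               (static_cast<unsigned char>(bytes[pos + 1]) << 8);
    };

    std::u32string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit_at(0) == 0xFEFF) i = 2;  // skip the BOM

    while (i + 1 < bytes.size()) {
        std::uint32_t unit = unit_at(i);
        i += 2;
        // A high surrogate followed by a low surrogate forms one code point.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < bytes.size()) {
            std::uint32_t low = unit_at(i);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
                i += 2;
                continue;
            }
        }
        if (unit >= 0xD800 && unit <= 0xDFFF) unit = 0xFFFD;  // unpaired surrogate
        out.push_back(unit);
    }
    return out;
}

// Read a whole file as raw bytes (binary mode, no locale conversion).
std::string read_file_bytes(const char* path) {
    assert(path != nullptr);
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With the test file from the question, decode_utf16le(read_file_bytes("utf16le.txt")) should give 8000 code points with no change in byte order partway through, since nothing here depends on the stream's internal buffering.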


This looks like a library bug to me. Stepping through the sample program as compiled by gcc 7.1.1 using gdb:

(gdb) n
28      while (getline(file,line)) {
(gdb) n
29        std::wcout << line << std::endl;
(gdb) p line.size()
$1 = 8000

8000 characters read, as expected. But then:

(gdb) p line[4092]
$18 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628240: 97 L'a'
(gdb) p line[4093]
$19 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628244: 98 L'b'
(gdb) p line[4094]
$20 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628248: 25344 L'挀'
(gdb) p line[4095]
$21 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x62824c: 25600 L'搀'
(gdb) p line[4096]
$22 = (__gnu_cxx::__alloc_traits<std::allocator<wchar_t> >::value_type &) @0x628250: 24832 L'愀'

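line[4092] and line[4093] look ok. But then I see line[4094], line[4095], and line[4096] containing 0x6300, 0x6400, and 0x6100 (25344, 25600, and 24832 in decimal), instead of 0x0063, 0x0064, and 0x0061. In other words, the two bytes of each code unit are swapped.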
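To make the arithmetic explicit, here is a small sketch showing that the decimal values gdb printed are exactly the byte-swapped forms of 'c', 'd', and 'a'. The names swap_bytes and check_reported_values are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Swap the two bytes of a 16-bit code unit (hypothetical helper).
constexpr std::uint16_t swap_bytes(std::uint16_t unit) {
    return static_cast<std::uint16_t>((unit << 8) | (unit >> 8));
}

// Compare the values gdb reported for line[4094..4096] with the
// byte-swapped code units of 'c' (0x0063), 'd' (0x0064), and 'a' (0x0061).
void check_reported_values() {
    assert(swap_bytes(0x0063) == 25344);  // line[4094]
    assert(swap_bytes(0x0064) == 25600);  // line[4095]
    assert(swap_bytes(0x0061) == 24832);  // line[4096]
}
```

All three match, which is consistent with the characters being read correctly but interpreted as big endian from index 4094 onward.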

So, this is getting messed up starting with character 4094, and not 4096, actually. I dumped the binary UTF-16 file, and it looks correct to me. The BOM marker is followed by consistent endian-ness for the entire contents of the file.

The only thing that's puzzling is why both clang and gcc are supposedly affected, but a quick Google search indicates that clang also uses gcc's libstdc++, at least up until recently. So, this looks like a libstdc++ bug to me.

Peppery answered 24/8, 2017 at 21:49 Comment(0)
