How to open a file with a wchar_t* name containing a non-ASCII string in Linux?

Asked 13/1, 2011 at 2:49 Answered 13/1, 2011 at 4:11

Solved c++ c linux file wchar

Environment: GCC/G++ on Linux

I have a file with a non-ASCII name in the file system and I want to open it.

I have the name as a wchar_t*, but I don't know how to open the file with it. (My trusted fopen only accepts a char* file name.)

Please help. Thanks a lot.

Igenia answered 13/1, 2011 at 2:49 Comment(3)

Is the filename not ASCII, or are the contents non-ASCII, or both? – Lanfranc 13/1, 2011 at 2:50

Yeah, both. There is wfstream to read/write wchar_t content to a file, but wfstream also only accepts a char* file name. – Igenia 13/1, 2011 at 2:51

Convert wchar to utf8 char and try fopen() on that? – Comate 13/1, 2011 at 3:2

There are two possible answers:

If you want to make sure all Unicode filenames are representable, you can hard-code the assumption that the filesystem uses UTF-8 filenames. This is the "modern" Linux desktop-app approach. Just convert your strings from wchar_t (UTF-32) to UTF-8 with library functions (iconv would work well) or your own implementation (but look up the specs so you don't get it horribly wrong like Shelwien did), then use fopen.

If you want to do things the more standards-oriented way, you should use wcsrtombs to convert the wchar_t string to a multibyte char string in the locale's encoding (which hopefully is UTF-8 anyway on any modern system) and use fopen. Note that this requires that you previously set the locale with setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "").
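The locale-based route can be sketched roughly as follows. This is only an illustration under assumptions: the helper names wide_to_locale_name and fopen_wide are made up for this example, and the program is assumed to have called setlocale(LC_CTYPE, "") once at startup so that wcsrtombs targets the user's encoding.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert a wchar_t string to a malloc'd multibyte
   string in the current locale's encoding. Returns NULL if a character
   cannot be represented in that encoding or if memory runs out. */
char *wide_to_locale_name(const wchar_t *wname)
{
    mbstate_t state;
    const wchar_t *src = wname;
    size_t len;
    char *name;

    /* First pass: a NULL destination only measures the required length. */
    memset(&state, 0, sizeof state);
    len = wcsrtombs(NULL, &src, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    name = malloc(len + 1);
    if (name == NULL)
        return NULL;

    /* Second pass: do the conversion, including the terminating NUL. */
    src = wname;
    memset(&state, 0, sizeof state);
    wcsrtombs(name, &src, len + 1, &state);
    return name;
}

/* Hypothetical wrapper: fopen with a wchar_t file name. */
FILE *fopen_wide(const wchar_t *wname, const char *mode)
{
    char *name = wide_to_locale_name(wname);
    FILE *f;

    if (name == NULL)
        return NULL;
    f = fopen(name, mode);
    free(name);
    return f;
}
```

In the default "C" locale only ASCII names are guaranteed to convert, which is why the setlocale call matters before using a helper like this.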

And finally, not exactly an answer but a recommendation:

Storing filenames as wchar_t strings is probably a horrible mistake. You should instead store filenames as abstract byte strings, and only convert those to wchar_t just-in-time for displaying them in the user interface (if it's even necessary for that; many UI toolkits use plain byte strings themselves and do the interpretation as characters for you). This way you eliminate a lot of possible nasty corner cases, and you never encounter a situation where some files are inaccessible due to their names.
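A minimal sketch of that "convert only for display" idea, assuming the name is kept as the original char string (for example as returned by readdir) and that the locale has been set. The function name name_for_display is hypothetical, and replacing undecodable bytes with '?' is just one possible policy.

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: build a wide-character copy of a file name for
   display only. The byte string stays the real name used with fopen.
   Bytes that cannot be decoded in the current locale become L'?'.
   Writes at most cap wide characters (including the terminator) and
   returns the number written, excluding the terminator. */
size_t name_for_display(const char *name, wchar_t *out, size_t cap)
{
    mbstate_t st;
    size_t left = strlen(name);
    size_t n = 0;

    if (cap == 0)
        return 0;
    memset(&st, 0, sizeof st);

    while (left > 0 && n + 1 < cap) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, name, left, &st);

        if (r == 0)
            break;                       /* decoded a NUL: stop */
        if (r == (size_t)-1 || r == (size_t)-2) {
            wc = L'?';                   /* invalid or incomplete sequence */
            r = 1;                       /* skip one byte and resynchronise */
            memset(&st, 0, sizeof st);
        }
        out[n++] = wc;
        name += r;
        left -= r;
    }
    out[n] = L'\0';
    return n;
}
```

Because the conversion is one-way and lossy by design, a file whose name is not valid in the current encoding can still be opened through its original bytes; only its on-screen label is approximated.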

Syne answered 13/1, 2011 at 4:11 Comment(1)

Thanks. That's the very way I'm looking for. – Igenia 13/1, 2011 at 9:37

Linux is not UTF-8, but it's your only choice for filenames anyway

(Files can have anything you want inside them.)

With respect to filenames, Linux does not really have a string encoding to worry about. Filenames are byte strings that need to be null-terminated.

This doesn't mean that Linux is UTF-8, but it does mean that raw wide-character strings can't be used as filenames, because their in-memory form can contain a zero byte before the end of the string.

But UTF-8 preserves the no-nulls-except-at-the-end model, so I have to believe that the practical approach is "convert to UTF-8" for filenames.
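The zero-byte point can be checked with a small sketch. The helper name count_zero_bytes is made up for this illustration; it just inspects the raw bytes of a buffer.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the zero bytes in the first n bytes of buf. */
size_t count_zero_bytes(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    size_t zeros = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (p[i] == 0)
            zeros++;
    return zeros;
}
```

With a 4-byte wchar_t, count_zero_bytes(L"abc", 3 * sizeof(wchar_t)) reports 9 zero bytes out of 12, so any API that treats the buffer as a NUL-terminated byte string stops after at most one character. The UTF-8 bytes of a non-ASCII character such as "\xc3\xa9" contain no zero bytes at all.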

The content of files is a matter for standards above the Linux kernel level, so here there isn't anything Linux-y that you can or want to do. The content of files will be solely the concern of the programs that read and write them. Linux just stores and returns the byte stream, and it can have all the embedded nuls you want.

Grossman answered 13/1, 2011 at 3:40 Comment(1)

It shouldn't be frustrating. It's actually the simplest possible. Just use UTF-8 everywhere and you have nothing to worry about. – Syne 13/1, 2011 at 4:2

Convert wchar string to utf8 char string, then use fopen.

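#include <stddef.h>
#include <wchar.h>

typedef unsigned int  uint;
typedef unsigned char byte;

/* Encode a NUL-terminated wchar_t string as UTF-8. Returns the number of
   bytes written, not counting the terminator. Works for 32-bit wchar_t
   (UTF-32, as on Linux) and 16-bit wchar_t (UTF-16, as on Windows).
   No bounds checking: s needs room for 4 bytes per input unit plus 1. */
int UTF16to8( const wchar_t* w, char* s ) {
  byte* q = (byte*)s; byte* q0 = q;
  while( *w ) {
    uint c = (uint)*w++;
    /* combine a UTF-16 surrogate pair into a single code point */
    if( c>=0xD800 && c<0xDC00 && (uint)*w>=0xDC00 && (uint)*w<0xE000 )
      c = 0x10000 + ((c-0xD800)<<10) + ((uint)*w++ - 0xDC00);
    /* unpaired surrogates and out-of-range values become U+FFFD */
    if( (c>=0xD800 && c<0xE000) || c>0x10FFFF ) c = 0xFFFD;
    if( c<0x80 ) *q++ = c;
    else if( c<0x800 ) { *q++ = 0xC0|(c>>6); *q++ = 0x80|(c&63); }
    else if( c<0x10000 ) { *q++ = 0xE0|(c>>12); *q++ = 0x80|((c>>6)&63); *q++ = 0x80|(c&63); }
    else { *q++ = 0xF0|(c>>18); *q++ = 0x80|((c>>12)&63); *q++ = 0x80|((c>>6)&63); *q++ = 0x80|(c&63); }
  }
  *q = 0;
  return (int)(q-q0);
}

/* Decode NUL-terminated UTF-8 into a wchar_t string. Returns the number of
   units written, not counting the terminator. Malformed, overlong or
   out-of-range input yields U+FFFD. No bounds checking: w needs room for
   one unit per input byte plus 1. */
int UTF8to16( const char* s, wchar_t* w ) {
  const byte* p = (const byte*)s;
  wchar_t* q = w;
  while( *p ) {
    uint c = *p++, n = 0, lo = 0;   /* n = continuation bytes expected, lo = smallest valid value */
    if( c<0x80 ) {}
    else if( c>=0xC2 && c<0xE0 ) { c &= 31; n = 1; lo = 0x80; }
    else if( c>=0xE0 && c<0xF0 ) { c &= 15; n = 2; lo = 0x800; }
    else if( c>=0xF0 && c<0xF5 ) { c &= 7;  n = 3; lo = 0x10000; }
    else c = 0xFFFD;                /* stray continuation or invalid lead byte */
    for( ; n; n-- ) {
      if( (*p&0xC0)!=0x80 ) { c = 0xFFFD; lo = 0; break; }  /* truncated sequence */
      c = (c<<6) | (*p++ & 63);
    }
    if( c<lo || (c>=0xD800 && c<0xE000) || c>0x10FFFF ) c = 0xFFFD;
    if( c>=0x10000 && sizeof(wchar_t)==2 ) {  /* 16-bit wchar_t: emit a surrogate pair */
      c -= 0x10000;
      *q++ = (wchar_t)(0xD800 + (c>>10));
      c = 0xDC00 + (c&0x3FF);
    }
    *q++ = (wchar_t)c;
  }
  *q = 0;
  return (int)(q-w);
}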

Comate answered 13/1, 2011 at 3:14 Comment(3)

Don't bother with fopen, just use your normal stream constructor or member. – Avian 13/1, 2011 at 3:38

Thank you and I solved my problem. The only problem is that in Linux wchar_t is equal to uint32. I made a few modifications and it worked. – Igenia 13/1, 2011 at 3:40

The functions in this answer are horribly non-conformant and insecure. Look up the correct definitions of UTF-8 and UTF-16 if you want to use them. (And note that UTF-16 is irrelevant to OP's question since wchar_t is not UTF-16 except on Windows, and even there it's rather broken...) – Syne 13/1, 2011 at 4:4

Check out this document

http://www.firstobject.com/wchar_t-string-on-linux-osx-windows.htm

I think Linux follows the POSIX standard, under which file names are plain byte strings; in practice they are almost always encoded as UTF-8.

Aguayo answered 13/1, 2011 at 3:3 Comment(0)

I take it it's the name of the file that contains non-ascii characters, not the file itself, when you say "non-ascii file in file system". It doesn't really matter what the file contains.

You can do this with normal fopen, but you'll have to match the encoding the filesystem uses.

It depends on which Linux distribution and filesystem you're using and how you've set it up, but most likely the filesystem uses UTF-8. So take your wchar_t string (which on Linux is normally UTF-32 rather than UTF-16), convert it to a char string encoded in UTF-8, and pass that to fopen.
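One way to do that conversion with iconv is sketched below. Treat it as an assumption-laden example rather than the definitive approach: the helper names wide_to_utf8 and fopen_utf8_name are invented here, and the "WCHAR_T" encoding name is an extension accepted by glibc and GNU libiconv rather than something every iconv implementation is required to support.

```c
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert a wchar_t string to a malloc'd UTF-8
   string. Returns NULL on conversion or allocation failure. */
char *wide_to_utf8(const wchar_t *w)
{
    /* "WCHAR_T" means "whatever wchar_t holds on this platform"
       (assumption: supported by the local iconv implementation). */
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    size_t units = wcslen(w);
    size_t in_left = units * sizeof(wchar_t);
    size_t out_left = units * 4;          /* UTF-8 needs at most 4 bytes per code point */
    char *out, *out_ptr;
    char *in_ptr = (char *)w;             /* iconv only reads through this pointer */
    size_t rc;

    if (cd == (iconv_t)-1)
        return NULL;
    out = malloc(out_left + 1);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }
    out_ptr = out;
    rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == (size_t)-1) {
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}

/* Hypothetical wrapper: open a file by wchar_t name, assuming the
   filesystem stores names as UTF-8. */
FILE *fopen_utf8_name(const wchar_t *w, const char *mode)
{
    char *name = wide_to_utf8(w);
    FILE *f;

    if (name == NULL)
        return NULL;
    f = fopen(name, mode);
    free(name);
    return f;
}
```

A call such as fopen_utf8_name(L"\u00e9.txt", "r") would then look for a file whose name is the UTF-8 byte sequence C3 A9 followed by ".txt". On systems where iconv lives in a separate library, linking with -liconv may be needed.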

Salado answered 13/1, 2011 at 3:3 Comment(0)
