Accept non ASCII characters
Consider this program:

#include <stdio.h>

int main(int argc, char* argv[]) {
   if (argc < 2) {
      fprintf(stderr, "usage: alpha <string>\n");
      return 1;
   }
   printf("%s\n", argv[1]);
   return 0;
}

I compile it like this:

x86_64-w64-mingw32-gcc -o alpha alpha.c

The problem is that if I give it a non-ASCII argument:

$ ./alpha róisín
r�is�n

How can I write and/or compile this program such that it accepts non-ASCII characters? To respond to alk: no, the program is printing incorrectly. See this example:

$ echo Ω | od -t x1c
0000000  ce  a9  0a
        316 251  \n
0000003

$ ./alpha Ω | od -t x1c
0000000  4f  0d  0a
          O  \r  \n
0000003
Flabellum answered 14/6, 2015 at 18:4 Comment(2)
It depends on what MinGW does to create the argv array. Does it encode the command line using UTF-8 or ANSI? If it's ANSI then you should check whether MinGW supports wmain to use wchar_t * parameters. Otherwise just ignore the decrepit ANSI strings (IMO, the entire ANSI API is worthless garbage nowadays that so often leads to mojibake) and call CommandLineToArgvW and manually encode to UTF-8 via WideCharToMultiByte if you need char * strings.Ketone
Your update proves that MinGW is calling GetCommandLineA to get an ANSI encoded copy of the command line, and so you get the mojibake "Ω" => "O", since that's the closest mapping your ANSI character set (probably 1252) has for the Greek Omega character. This is worthless. Use GetCommandLineW, CommandLineToArgvW, and WideCharToMultiByte to get UTF-8 encoded command line arguments.Ketone

The easiest way to do this is with wmain:

#include <fcntl.h>
#include <io.h>
#include <stdio.h>

int wmain (int argc, wchar_t** argv) {
  _setmode(_fileno(stdout), _O_WTEXT);
  if (argc > 1)
    wprintf(L"%ls\n", argv[1]);
  return 0;
}
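Note that wmain is a Microsoft extension, not a standard entry point: with MinGW-w64 you must link with -municode so the wide-character startup code is used, otherwise the linker reports an undefined reference to main. Something like:

```shell
# -municode selects the wide-character CRT startup so wmain is called
x86_64-w64-mingw32-gcc -municode -o alpha alpha.c
```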

It can also be done with GetCommandLineW; here is a simple version of the code found at the HandBrake repo:

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

int get_argv_utf8(int* argc_ptr, char*** argv_ptr) {
  int argc;
  char** argv;
  wchar_t** argv_utf16 = CommandLineToArgvW(GetCommandLineW(), &argc);
  int i;
  int offset = (argc + 1) * sizeof(char*);
  int size = offset;
  if (argv_utf16 == NULL)
    return -1;
  /* First pass: measure the UTF-8 size of every argument. */
  for (i = 0; i < argc; i++)
    size += WideCharToMultiByte(CP_UTF8, 0, argv_utf16[i], -1, 0, 0, 0, 0);
  /* One allocation holds the pointer table followed by the strings. */
  argv = malloc(size);
  if (argv == NULL) {
    LocalFree(argv_utf16);
    return -1;
  }
  /* Second pass: convert each argument into the buffer. */
  for (i = 0; i < argc; i++) {
    argv[i] = (char*) argv + offset;
    offset += WideCharToMultiByte(CP_UTF8, 0, argv_utf16[i], -1,
      argv[i], size - offset, 0, 0);
  }
  argv[argc] = NULL;
  LocalFree(argv_utf16);
  *argc_ptr = argc;
  *argv_ptr = argv;
  return 0;
}

int main(int argc, char** argv) {
  if (get_argv_utf8(&argc, &argv) != 0 || argc < 2)
    return 1;
  printf("%s\n", argv[1]);
  return 0;
}
Flabellum answered 15/6, 2015 at 0:41 Comment(5)
Since you're using UTF-8 instead of the Windows native UTF-16, you'll have to convert back to UTF-16 whenever you call a Windows API. For example, say the user passed a filename on the command line, you have to call MultiByteToWideChar to convert to UTF-16 before you can open this file via CreateFileW (or _wfopen, _wopen, etc).Ketone
fopen calls _open, which calls the ANSI API CreateFileA. This will decode the filename to the native UTF-16 using the system's ANSI codepage, such as 1252. So if the string isn't all ASCII characters, you'll end up with mojibake and a file not found error. To work around this on Windows you'll have to instead convert via MultiByteToWideChar and then call _wfopen, which calls _wopen, which calls CreateFileW. You may want to create a helper function my_fopen or something like that to avoid preprocessor hell.Ketone
@eryksun I don't get it. Why do I have to convert to UTF-8 only to convert it back to UTF-16? Is this a case of "I'm doing it wrong" or another example of Windows being horrible?Flabellum
If you don't want to support full unicode on Windows, just stick to the ANSI API. Then if a user passes a filename that can't be represented with their ANSI codepage, tell them too bad it would be too much work to support unicode on Windows. If you don't like giving that answer then I'm afraid it really will be a lot of work to support unicode in a cross-platform way using C/C++. Almost every other operating system has opted to adapt char * APIs that predate unicode by using UTF-8. Windows is the odd duck using UTF-16 because it was an early adopter of wchar_t * and UCS-2 in the early 90s.Ketone
When I say Windows uses UTF-16, I mean all the way to the kernel. For example, the CreateFile API is a user-mode function that does preliminary work before making the system call NtCreateFile. In the kernel, object paths use an OBJECT_ATTRIBUTES record, which stores the path itself as a UNICODE_STRING. This is a counted wide-character string that can be up to 32768 characters.Ketone

Since you're using MinGW (actually MinGW-w64, but that shouldn't matter in this case), you have access to the Windows API, so the following should work for you. It could probably be cleaner and more thoroughly tested, but it should at least give you a good idea:

#define _WIN32_WINNT 0x0600
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#include <windows.h>

int main (void)
{
    int       argc;
    int       i;
    LPWSTR    *argv;
    LPSTR     error;

    argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv == NULL)
    {
        FormatMessageA(
            (
                FORMAT_MESSAGE_ALLOCATE_BUFFER |
                FORMAT_MESSAGE_FROM_SYSTEM |
                FORMAT_MESSAGE_IGNORE_INSERTS),
            NULL,
            GetLastError(),
            0,
            (LPSTR)&error, 0,
            NULL);

        fprintf(stderr, "%s\n", error);
        LocalFree(error);
        return EXIT_FAILURE;
    }

    for (i = 0; i < argc; ++i)
        wprintf(L"argv[%d]: %ls\n", i, argv[i]);

    // You must free argv using LocalFree!
    LocalFree(argv);

    return 0;
}

Bear in mind this one issue with it: Windows will not compose your strings for you. I use my own Windows keyboard layout that uses combining characters (I'm weird), so when I type

example -o àlf

in my Windows Command Prompt, I get the following output:

argv[0]: example
argv[1]: -o
argv[2]: a\u0300lf

The a\u0300 is U+0061 (LATIN SMALL LETTER A) followed by a representation of the Unicode code point U+0300 (COMBINING GRAVE ACCENT). If I instead use

example -o àlf

which uses the precomposed character U+00E0 (LATIN SMALL LETTER A WITH GRAVE), the output would have differed:

argv[0]: example
argv[1]: -o
argv[2]: \u00E0lf

where \u00E0 is a representation of the precomposed character à represented by Unicode code point U+00E0. However, while I may be an odd person for doing this, Vietnamese code page 1258 actually includes combining characters. This shouldn't ordinarily affect filename handling, but you may run into difficulty when strings that look identical compare as unequal.

For arguments that are just strings, you may want to look into normalization with the NormalizeString function. The documentation and examples linked in it should help you to understand how the function works. Normalization and a few other things in Unicode can be a long journey, but if this sort of thing excites you, it's also a fun journey.

Frit answered 14/6, 2015 at 22:48 Comment(1)
I couldn't use wmain because my compiler didn't know it exists, but your solution has worked for me. I had to modify it: I used an empty main like you did, then CommandLineToArgvW to read the arguments so they are not turned into garbage for my program, then I set chcp in my program to 852 and set the locale to Slovak, and now the characters work like a charm ;) Finally I can read files/folders with special characters on our company server.Monarch

Try compiling and running the following program:

#include <stdio.h>

int main()
{
    int i;

    for (i = 0; i < 256; i++) {
        printf("\nASCII Character #%d:%c ", i, i);
    }

    printf("\n");

    return 0;
}

In your output you should see those little question marks from number 128 onward. FYI I am using Ubuntu, and when I compile and run this program (with GNOME Terminal) this happens to me as well.

However, if I go to Terminal > Set character encoding... and select Western (WINDOWS-1252) as opposed to Unicode (UTF-8), and rerun the program, the extended ASCII characters display properly.

I don't know the exact steps for Windows/MinGW, but, in short, changing the character encoding should fix your problem.

Truncation answered 14/6, 2015 at 19:28 Comment(6)
UPDATE: Just tried running your program myself, and as it turns out, it works well with UTF-8 and prints wrong characters with WINDOWS-1252. Weird. Well, anyhow, you should still try out my suggestion above and see what happens. If anyone more experienced could provide more insight into these platform differences that would be great.Truncation
@Steve Penny How does my answer not address the issue? I propose changing the character encoding while trying to establish a connection between non-ASCII characters in command line arguments and the program output.Truncation
@Steve Penny I am only trying to help. Even though I am on a different platform, the same steps could fix OP's problem. If they don't, well, then someone else or perhaps YOU could be of better assistance? EDIT: completely missed the fact that you are the OP, sorryTruncation
@Steven Penny lol see the edit in my previous comment :) Anyway, have you actually tried my solution (change character encoding)? Let me know the result. If it doesn't work let's hope someone else can answer this. Good luckTruncation
@Truncation Unfortunately, Windows still relies on its legacy character sets. You can change the "code page", but all that does is change the characters that are available to you. UTF-8 is not one of those "code pages". And that's assuming that you can change the code page to the one desired. Windows is still stuck in the past in that regard unfortunately.Frit
@Chrono Kitsune Thanks for your input. To OP: Is programming and compiling on GNU/Linux not an option for this one? Would save you a lot of troubleTruncation
