How to Convert UTF-16 to UTF-32 and Print the Resulting wchar_t in C?

I'm trying to print a string of UTF-16 characters. I posted this question a while back, and the advice given was to convert to UTF-32 using iconv and print the result as a string of wchar_t.

I've done some research and managed to code the following:

// *c is the pointer to the characters (UTF-16) i'm trying to print
// sz is the size in bytes of the input i'm trying to print

iconv_t icv;
char in_buf[sz];
char* in;
size_t in_sz;
char out_buf[sz * 2];
char* out;
size_t out_sz;

icv = iconv_open("UTF-32", "UTF-16");

memcpy(in_buf, c, sz);

in = in_buf;
in_sz = sz;
out = out_buf;
out_sz = sz * 2;

size_t ret = iconv(icv, &in, &in_sz, &out, &out_sz);
printf("ret = %d\n", ret);
printf("*** %ls ***\n", ((wchar_t*) out_buf));

The iconv call always returns 0, so I guess the conversion should be OK?

However, printing seems to be hit and miss. At times the converted wchar_t string prints fine. Other times, printf seems to hit a problem while printing the wchar_t string and abandons the call altogether, so that even the trailing "***" does not get printed.

I also tried using

wprintf(((wchar_t*) "*** %ls ***\n"), out_buf);

but nothing ever gets printed.

Am I missing something here?

Reference: How to Print UTF-16 Characters in C?

UPDATE

I incorporated some of the suggestions from the comments.

Updated code:

// *c is the pointer to the characters (UTF-16) i'm trying to print
// sz is the size in bytes of the input i'm trying to print

iconv_t icv;
char in_buf[sz];
char* in;
size_t in_sz;
wchar_t out_buf[sz / 2];
char* out;
size_t out_sz;

icv = iconv_open("UTF-32", "UTF-16");

memcpy(in_buf, c, sz);

in = in_buf;
in_sz = sz;
out = (char*) out_buf;
out_sz = sz * 2;

size_t ret = iconv(icv, &in, &in_sz, &out, &out_sz);
printf("ret = %d\n", ret);
printf("*** %ls ***\n", out_buf);
wprintf(L"*** %ls ***\n", out_buf);

Still the same result: not all of the UTF-16 strings get printed (with either printf or wprintf).

What else could I be missing?

By the way, I'm using Linux, and I have verified that wchar_t is 4 bytes.

Amann answered 11/12, 2011 at 17:24 Comment(5)
wprintf() needs the format string to have the L prefix, e.g. wprintf(L"*** %ls ***\n", out_buf). - Paramount
Why are you copying the input to a local buffer in_buf? Just use c directly... - Hogg
Also, you cannot legally cast a pointer to a char array to a pointer to wchar_t. The output buffer needs to have type wchar_t [n]. - Hogg
Not all platforms use UTF-32 for wchar_t; Windows doesn't. - Mayce
On Linux you can't mix wide (wprintf) and narrow (printf) output on the same stream. The first call sets the orientation, and it can't be changed afterwards. "Once a stream has an orientation, it cannot be changed and persists until the stream is closed." See linux.about.com/library/cmd/blcmdl3_fwide.htm and bytes.com/topic/c/answers/… - Flaminius

Here is a short program that converts UTF-16 to a wide character array and then prints it out.

#include <endian.h>
#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define FROMCODE "UTF-16"

#if (BYTE_ORDER == LITTLE_ENDIAN)
#define TOCODE "UTF-32LE"
#elif (BYTE_ORDER == BIG_ENDIAN)
#define TOCODE "UTF-32BE"
#else
#error Unsupported byte order
#endif

int main(void)
{
    void *tmp;
    char *outbuf;
    const char *inbuf;
    long converted = 0;
    wchar_t *out = NULL;
    int status = EXIT_SUCCESS;
    size_t inbytesleft, outbytesleft, size, n; /* iconv() returns size_t */
    const char in[] = {
        0xff, 0xfe,
        'H', 0x0,
        'e', 0x0,
        'l', 0x0,
        'l', 0x0,
        'o', 0x0,
        ',', 0x0,
        ' ', 0x0,
        'W', 0x0,
        'o', 0x0,
        'r', 0x0,
        'l', 0x0,
        'd', 0x0,
        '!', 0x0
    };
    iconv_t cd = iconv_open(TOCODE, FROMCODE);
    if ((iconv_t)-1 == cd) {
        if (EINVAL == errno) {
            fprintf(stderr, "iconv: cannot convert from %s to %s\n",
                    FROMCODE, TOCODE);
        } else {
            fprintf(stderr, "iconv: %s\n", strerror(errno));
        }
        goto error;
    }
    size = sizeof(in) * sizeof(wchar_t);
    inbuf = in;
    inbytesleft = sizeof(in);
    while (1) {
        tmp = realloc(out, size + sizeof(wchar_t));
        if (!tmp) {
            fprintf(stderr, "realloc: %s\n", strerror(errno));
            goto error;
        }
        out = tmp;
        outbuf = (char *)out + converted;
        outbytesleft = size - converted;
        n = iconv(cd, (char **)&inbuf, &inbytesleft, &outbuf, &outbytesleft);
        if ((size_t)-1 == n) { /* iconv() signals failure with (size_t)-1 */
            if (EINVAL == errno) {
                /* junk at the end of the buffer, ignore it */
                break;
            } else if (E2BIG != errno) {
                /* unrecoverable error */
                fprintf(stderr, "iconv: %s\n", strerror(errno));
                goto error;
            }
            /* increase the size of the output buffer */
            converted = size - outbytesleft;
            size <<= 1;
        } else {
            /* done */
            break;
        }
    }
    converted = (size - outbytesleft) / sizeof(wchar_t);
    out[converted] = L'\0';
    fprintf(stdout, "%ls\n", out);
    /* flush the iconv buffer */
    iconv(cd, NULL, NULL, &outbuf, &outbytesleft);
exit:
    if (out) {
        free(out);
    }
    if ((iconv_t)-1 != cd) { /* iconv_open() fails with (iconv_t)-1, not NULL */
        iconv_close(cd);
    }
    exit(status);
error:
    status = EXIT_FAILURE;
    goto exit;
}

Since UTF-16 is a variable-length encoding, you can't know in advance exactly how big the output buffer needs to be; you're guessing. A correct program should handle the case where the output buffer isn't large enough to hold the converted data.
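If you'd rather start from a safe upper bound than grow the buffer, here is a minimal sketch of the arithmetic. It assumes a 4-byte wchar_t (as the asker verified on Linux) and UTF-16 code units already in host byte order. The names utf16_code_points and wide_capacity_bytes are made up for this example. The extra slot for a byte order mark is a precaution for the generic "UTF-32" target, where glibc, for one, writes a BOM; with "UTF-32LE" or "UTF-32BE" it should go unused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Count the code points in a run of host-order UTF-16 code units.
 * Every unit except a low surrogate (0xDC00..0xDFFF) starts a new code
 * point, so a surrogate pair counts once. Assumes well-formed input. */
static size_t utf16_code_points(const uint16_t *units, size_t n_units)
{
    assert(units != NULL || n_units == 0);
    size_t count = 0;
    for (size_t i = 0; i < n_units; i++) {
        if (units[i] < 0xDC00 || units[i] > 0xDFFF) {
            count++;
        }
    }
    return count;
}

/* Upper bound, in bytes, for the wide output of utf16_bytes of input:
 * at most one wide character per 16-bit unit, plus one slot for the
 * terminator and one for a possible BOM (an assumption, see above). */
static size_t wide_capacity_bytes(size_t utf16_bytes)
{
    return (utf16_bytes / 2 + 2) * sizeof(wchar_t);
}
```

Note that the question's sz * 2 bytes is the bound for the converted characters alone, so it leaves no room for a terminator.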

You should also note that iconv doesn't null-terminate your output buffer for you.
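A rough sketch of the termination step on its own, in case it helps. terminate_wide and wide_to_narrow are hypothetical helpers, not part of iconv. The first works out how many wide characters were written from the outbytesleft value that iconv leaves behind, and assumes the buffer has one spare wchar_t beyond the capacity handed to iconv. The second shows how to tell whether %ls could encode the string at all: a negative result means it could not in the current locale. As far as I know, glibc fails this way for non-ASCII characters in the default "C" locale, so calling setlocale(LC_ALL, "") from <locale.h> early in main is probably needed as well. That part is an assumption about the asker's setup, which the thread doesn't confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Add the terminator that iconv() leaves out.
 * capacity_bytes: the output space that was handed to iconv()
 * bytes_left:     the value of outbytesleft after the call
 * The buffer must hold one more wchar_t than capacity_bytes.
 * Returns the number of wide characters written by iconv(). */
static size_t terminate_wide(wchar_t *buf, size_t capacity_bytes,
                             size_t bytes_left)
{
    assert(bytes_left <= capacity_bytes);
    size_t written = (capacity_bytes - bytes_left) / sizeof(wchar_t);
    buf[written] = L'\0';
    return written;
}

/* Format a wide string into a narrow buffer with %ls.
 * Returns snprintf()'s result; a negative value means the string could
 * not be encoded in the current locale. */
static int wide_to_narrow(char *dst, size_t dst_size, const wchar_t *src)
{
    return snprintf(dst, dst_size, "%ls", src);
}
```

Checking the return value of printf in the same way would show whether the "hit and miss" output in the question is an encoding failure or a missing terminator.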

iconv is a stream-oriented processor, so you need to flush the iconv_t if you want to reuse it for another conversion (the sample code does this near the end). If you want to do stream processing, you would handle the EINVAL error by copying any bytes left in the input buffer to the beginning of the new input buffer before calling iconv again.
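The carry-over step for stream processing could look roughly like this. carry_over is a hypothetical helper, and the read loop in the comment is only an outline with made-up variable names, not tested code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* After iconv() fails with EINVAL, the input pointer it updated (rest)
 * points at an incomplete sequence still sitting in the input buffer,
 * and inbytesleft (rest_len) says how many bytes it has. Move those
 * bytes to the front of the buffer and return the offset at which the
 * next read should append.
 *
 * Outline of a read loop using it (hypothetical names):
 *
 *   size_t have = 0, got;
 *   while ((got = fread(buf + have, 1, sizeof buf - have, fp)) > 0) {
 *       char *in = buf;
 *       size_t left = have + got;
 *       ... call iconv(cd, &in, &left, &out, &out_left), handle E2BIG ...
 *       have = carry_over(buf, sizeof buf, in, left);
 *   }
 */
static size_t carry_over(char *buf, size_t buf_size,
                         const char *rest, size_t rest_len)
{
    assert(rest_len <= buf_size);
    memmove(buf, rest, rest_len); /* source and destination may overlap */
    return rest_len;
}
```

If the input ends while bytes are still carried over, the data was truncated mid-character; the sample program above simply ignores that tail.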

Hyperthermia answered 12/12, 2011 at 21:26 Comment(0)
