Is it actually possible to store and process individual UTF-8 characters in C? If so, how?
C

3

13

I've written a program in C that breaks words down into syllables, segments and letters. It's working well with ASCII characters but I want to make versions that work for the IPA and Arabic too.

I'm having massive problems saving and performing functions on individual characters. My editor and console are both set up to UTF-8 and can display Arabic text fine if I save it as a char*, but when I try to print wchars they display random punctuation marks.

My program needs to be able to recognise an individual UTF-8 character in order to work. For example, for the word 'though' it stores 't' as syllable[1]segment[1]letter[1], h as syllable[1]segment[1]letter[2] etc. I want to be able to do the same for non-ASCII characters.

I've spent basically the whole day researching unicode and trying out different methods and I can't get any of them to let me store an Arabic character as a character.

I'm not sure if I've just made some stupid syntax errors along the way, if I've completely misunderstood the whole concept, or if it actually just isn't possible to do what I want in C and I should just give up and try another language...

I would massively, massively, massively appreciate any help you can offer! I'm pretty new to programming, but unicode is completely instrumental to my work so I want to work out how to do it from the beginning.

My understanding of how unicode works (in case that's where I'm going wrong):

  1. I type some text into my editor. My editor encodes it according to the encoding I have set. So if I set it to UTF-8 it will encode the Arabic letter ب with the 2-byte sequence 0xd8 0xa8, which represents the code point U+0628.

  2. I compile it, breaking down 0xd8 0xa8 into the binary 11011000 10101000.

  3. I run it on the command prompt. The command prompt interprets the text according to the encoding I have set, so if I set it to UTF-8 it should interpret 11011000 10101000 as the code point U+0628. Unicode algorithms also tell it which version of U+0628 to display to me, as the character has different shapes depending on where it is in the word. As the character is alone it will show me the standalone version ب

My understanding of the ways I can process unicode in C:

Option A - Use single bytes encoded as UTF-8 (http://www.nubaria.com/en/blog/?p=289)

Use single bytes encoded as UTF-8. Leave all my datatypes as chars and char arrays and only type ASCII characters in my code. If I absolutely have to hard code a unicode character enter it as an array in the format:

    const char kChineseSampleText[] = "\xe4\xb8\xad\xe6\x96\x87";

My problems with this:

  1. I need to manipulate individual characters
  2. Having to type Arabic characters as code points is going to render my code completely unreadable and slow me down immensely.

Option B - Use wchar and friends (http://icu-project.org/docs/papers/unicode_wchar_t.html)

Swap using chars for wchars, which hold 2 to 4 bytes depending on the compiler. String functions like strlen will not work as they are expecting characters to be one byte, but there are w functions like wprintf I can use instead.

My problem with this:

I can’t get wchars to print Arabic characters at all! I can get them to print English letters fine, but Arabic characters just pull through as random punctuation marks.

I've tried inputting the Unicode code point as well as the actual Arabic character, and I've tried printing them both to the console and to a UTF-8 encoded text file. I get the same result each time, even though both the console and the text file display Arabic text if it is entered as a char*. I've included my code at the end.

(It’s worth saying here that I am aware that a lot of people think wchars are bad because they aren’t very portable and because they take up extra space for ASCII characters. But at this stage, neither of those things are really a worry for me - I’m just writing the program to run on my own computer and the program will only be processing short strings.)

Option C - Use external libraries

I've read in various comments that external libraries are the way to go so I've tried:

C programming library

http://www.cprogramming.com/tutorial/unicode.html suggests replacing all chars with unsigned long integers and using special functions for iterating through strings etc. The site even provides a sample library to download.

My problem:

While I can set the character to be an unsigned long integer I can’t print it out, because the printf and wprintf functions don’t work, and neither does the library provided on the website (I think maybe the library was designed for Linux? Some of the datatypes are invalid and amending them didn't work either)

ICU library

My problem:

I downloaded the ICU library, but when I was looking into how to use it I saw that functionality such as the characterIterator is not available for use in C (http://userguide.icu-project.org/strings). Being able to iterate through characters is completely fundamental to what I need to do, so I don't think the library will work for me.

My code

#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
#include <string.h>


int main ()
{
wchar_t unicode = L'\xd8ac';
wchar_t arabic = L'ب';
wchar_t number = 0x062c;


FILE* f;
f = fopen("unitest.txt","w");
char* string = "ايه الاخبار";


//printf - works 

printf("printf - literal arabic character is \"م\"\n");
fprintf(f,"printf - literal arabic character is \"م\"\n");

printf("printf - char* string is \"%s\"\n",string);
fprintf(f,"printf - char* string is \"%s\"\n",string);


//wprintf  - english - works

wprintf(L"wprintf - literal english char is \"%C\"\n\n", L't');
fwprintf(f,L"wprintf - literal english char is \"%C\"\n\n", L't');

//wprintf - arabic - doesnt work

wprintf(L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
fwprintf(f,L"wprintf - unicode wchar_t is \"%C\"\n", unicode);

wprintf(L"wprintf - unicode number wchar_t is \"%C\"\n", number);
fwprintf(f,L"wprintf - unicode number wchar_t is \"%C\"\n", number);

wprintf(L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
fwprintf(f,L"wprintf - arabic wchar_t is \"%C\"\n", arabic);


wprintf(L"wprintf - literal arabic character is \"%C\"\n",L'ت');
fwprintf(f,L"wprintf - literal arabic character is \"%C\"\n",L'ت');


wprintf(L"wprintf - literal arabic character in string is \"م\"\n\n");
fwprintf(f,L"wprintf - literal arabic character in string is \"م\"\n\n");

fclose(f);

return 0;
}

Output file

printf - literal arabic character is "م"
printf - char* string is "ايه الاخبار"
wprintf - literal english char is "t"

wprintf - unicode wchar_t is "�"
wprintf - unicode number wchar_t is ","
wprintf - arabic wchar_t is "("
wprintf - literal arabic character is "*"
wprintf - literal arabic character in string is ""

I'm using Windows 10, Notepad++ and MinGW.

Edit: This got marked as a duplicate of Light C Unicode Library but I don't think it really answers my question. I've downloaded the library and had a look at it, and you can call me stupid if you like, but I'm really new to programming and I don't understand most of the code in the library, so it's hard for me to work out how I can use it to achieve what I want. I searched the library for a print function and couldn't find one...

I just want to save a UTF-8 character and then print it out again! Do I really need to install an entire library to do that? I would just really appreciate someone taking pity on me and telling me in baby terms how I can do it... People keep saying I should use uint32_t or something instead of wchar_t, but how do I then print those datatypes? Can I do it with wprintf?!

Caerphilly answered 6/6, 2017 at 19:26 Comment(23)
A data type is not an encoding in and of itself. - Package
What font are you using in your console? Are you sure it supports the Arabic script? - Tutuila
yes, because I can type Arabic into the command line! - Caerphilly
Hmmm, I'd expect char* string = u8"ايه الاخبار"; to make string a UTF-8 encoded string. - Sigler
mm I don't want to convert from UTF-16 to UTF-8 though - my editor and console are both set to UTF-8 so I don't think that's the problem - Caerphilly
@chux the string is actually printing out fine! but my problem is I want to be able to store individual characters - Caerphilly
With this code, printing out fine does not indicate that the string is certainly encoded as UTF-8. What is the size of wchar_t? To store individual characters, use uint32_t. - Sigler
wide characters should not (cannot) handle UTF-8 - Trevatrevah
@chux i'm not completely sure. i read it's 2-4 bytes depending on the compiler and that arabic characters are 2 bytes so i thought i would be ok - Caerphilly
@chux how do i print with uint32_t? do i still use wprintf? - Caerphilly
I do not know if char* string = "ايه الاخبار"; is OK, as that allows various implementation issues. I am sure char* string = u8"ايه الاخبار"; is OK. - Sigler
Possible duplicate of Light C Unicode Library - Trevatrevah
@Trevatrevah if the Arabic letters are 2 bytes why can't I use wchar? - Caerphilly
@Caerphilly I didn't say that wchar_t can't handle Arabic letters. I said that wchar_t should not exist in C and is really not there to handle UTF-8. You will have a lot of unsolvable problems if you try. Believe me... I tried :p. Only Windows uses wide characters and this is not a good idea. You should read up on UTF-8 to understand that a type with more than 1 octet is not suitable for reading UTF-8. You must use a uint8_t, but in any case I strongly recommend you use a library; utf8proc should work for you. - Trevatrevah
"Is it actually possible to store and process individual UTF-8 characters on C ?" --> storing and processing individual UTF-8 characters is not what the code does. Instead it is printing strings (which might be UTF-8 encoded strings) and wchar_t (which might be wide enough and encoded as Unicode). "individual UTF-8 character" is unclear. I'd expect "individual Unicode character", which needs at least 21 bits, as with uint32_t. UTF-8 is a byte encoding, 1-4 bytes long, of a Unicode character. To store 1-4 bytes, use a char u[4] and zero the unused bytes. - Sigler
Does char* string = u8"ايه الاخبار"; for (char *s = string; *s; ) { printf("<"); char u[5]; char *p = u; *p++ = *s++; if ((*s & 0xC0) == 0x80) *p++ = *s++; if ((*s & 0xC0) == 0x80) *p++ = *s++; if ((*s & 0xC0) == 0x80) *p++ = *s++; *p = 0; printf("%s", u); printf(">\n"); } puts(""); work well for you? - Sigler
"you can call me stupid if you like", you are far away from stupidity, the documentation of this library is very bad for a beginner. - Trevatrevah
@chux OH MY GOD it works!!!!! thank you!!! i have no idea how, especially the for statement, ive never seen a for statement like that before... char*s = string is setting the pointer s to the first letter of the string right? but then i don't understand how the second condition of the for statement can also be *s... - Caerphilly
*s will be "false" when it is a binary zero. - Cari
"Unicode algorithms also tell it which version of U+0628 to display to me, as the character has different shapes depending on where it is in the word" - That's not how it works, if you want different displays you have to have code points indicating that (could be a different character, or combining characters). E.g. here is a table of codes for the contextual forms of Arabic - Sibship
@M.M. It is indeed how it works. Unicode renderers are expected to understand Arabic contextual rules and present the letters in the correct order and with the correct form. Here are the letters lam (U+0644): ل and here's ain (U+0639): ع and finally beh (U+0628): ب Now, let's put them together: لعب And now I'll add a marbuta (U+0629) ة at the end: لعبة You can see the browser rendering the contextual forms without my having to change the code point. (And my Linux terminal emulator does it too.) - Horizontal
@Horizontal OK, I guess the question is whether the Windows console is also able to do that - Sibship
@m.m: I'm pretty sure it is but of course you need to have the locale set up correctly. I'm less certain about cygwin. - Horizontal
S
10

C and UTF-8 are still getting to know each other. In other words, IMO, C's support for UTF-8 is scant.

Is it ... possible to store and process individual UTF-8 characters ...?

The first step is to make certain that "ايه الاخبار" is a UTF-8 encoded string. C11 supports this explicitly with the u8 prefix: u8"ايه الاخبار".

A UTF-8 string is a sequence of char. Each group of 1 to 4 char represents one Unicode character. A Unicode character needs at least 21 bits for encoding. Yet OP does not need to convert a portion of string[] into a Unicode character so much as segment that string on UTF-8 boundaries. These boundaries are readily found by looking for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.

The following loop copies one Unicode character at a time, still encoded as UTF-8, into a short string with the accompanying terminating null character. Then that short string is printed.

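#include <stdio.h>

int main(void) {
  char* string = u8"ايه الاخبار";
  /* *s is the current byte; the loop stops at the terminating null */
  for (char *s = string; *s; ) {
    printf("<");
    char u[5];      /* up to 4 UTF-8 bytes plus the terminating null */
    char *p = u;
    *p++ = *s++;    /* lead byte */
    /* copy up to 3 continuation bytes (bit pattern 10xxxxxx) */
    if ((*s & 0xC0) == 0x80) *p++ = *s++;
    if ((*s & 0xC0) == 0x80) *p++ = *s++;
    if ((*s & 0xC0) == 0x80) *p++ = *s++;
    *p = 0;
    printf("%s", u);
    printf(">\n");
  }
  return 0;
}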

With the output viewed with a UTF8 aware screen:

<ا>
<ي>
<ه>
< >
<ا>
<ل>
<ا>
<خ>
<ب>
<ا>
<ر>
Sigler answered 6/6, 2017 at 21:15 Comment(12)
as i said in the comments above it works, which is amazing!! i would be really interested to understand how it works though, especially the for statement. @Cari said "*s will be "false" when it is a binary zero." how/why? - Caerphilly
@Caerphilly Why does a for() loop stop? What is the condition? - Sigler
when it no longer meets the second condition? i mean with (i=0;i<5;i++) it continues until i<5 is no longer true... so i guess here it's continuing until *s no longer equals *s ? doesnt really make sense to me :/ - Caerphilly
@chux: What you mean by the first line, is "Microsoft is the only C compiler and C library provider that still has issues with UTF-8 and wide-character I/O streams, so if you are using Windows, you need to just make assumptions and do it yourself anyway, and hope that everything happens to work out." None of the other actively developed C libraries or compilers have any issues with UTF-8 anymore. Grr. - Frascati
@chux is the *s pointing to the first byte of the first unicode character, or is it pointing to the whole unicode character? if it's the second is that how the for loop works? like it's going through the bytes until it's moved past the first unicode character? or am i just making this up... - Caerphilly
@Caerphilly s is a char *. So s points to a single char, and *s is that char. *s is true as long as *s is not zero. At the end of the string, *s is zero, so the loop stops. - Sigler
@NominalAnimal What is the source of the "Microsoft is the only C compiler..." quote? I am unclear why that is in your comment. My answer does not reference MS. - Sigler
@chux: You wrote, "C support for UTF-8 is scant". That is incorrect. Microsoft is the only current C compiler and library provider that has a problem with UTF-8 or Unicode. Every other currently actively developed C compiler and C library implementation supports UTF-8 just fine. (Plus, u8"literal" is C++, not C. But then again, Windows rules, and none of the other OSes matter, eh?) - Frascati
@NominalAnimal You should be aware that C11 has UTF-8 literals, en.cppreference.com/w/c/language/string_literal. - Trevatrevah
@NominalAnimal What is the source of the quote? 2nd time request. C does not have any standard library functions for processing/segmenting UTF-8 encoded strings, nor a clear translation for UTF-8 encodings to/from Unicode code points, hence my assertion that support is scant. Many compilers in the embedded community do not support UTF-8 well, and MS is not alone in its shortcomings with Unicode/UTF-8. There is no MS tirade on my part - for or against. - Sigler
@NominalAnimal Disagree that "u8"literal" is ... not C.", as C11 defines the encoding prefix u8 in 6.4.5 String literals. - Sigler
@NominalAnimal To be clear, there is no implied MS consideration in this answer on my part. Additional C/UTF-8 shortcomings are commented here. Concerning C89 etc., that is far OT from OP's title question. Perhaps post it as a question on SO or some SE site? Bash shells etc. are also not specified by C and do not relate to this post as tagged. - Sigler
T
1

An example that uses the utf8proc library to iterate over the string:

#include <utf8proc.h>
#include <stdio.h>

int main(void) {
  utf8proc_uint8_t const string[] = u8"ايه الاخبار";
  utf8proc_ssize_t size = sizeof string / sizeof *string - 1;
  utf8proc_int32_t data;
  utf8proc_ssize_t n;

  utf8proc_uint8_t const *pstring = string;
  while ((n = utf8proc_iterate(pstring, size, &data)) > 0) {
    printf("<%.*s>\n", (int)n, pstring);
    pstring += n;
    size -= n;
  }
}

This is probably not the best way to use this library, but I opened an issue on GitHub asking for some examples, because I am unable to understand how this library works.

Trevatrevah answered 6/6, 2017 at 23:37 Comment(0)
H
0

You need to very clearly understand the difference between a Unicode code point and UTF-8. UTF-8 is a variable-length byte encoding of Unicode code points. The lower end, values 0-127, is stored as a single byte. That's the main point of UTF-8, and it makes UTF-8 backwards compatible with ASCII.

When bit 7 is set, for values over 127, a variable-length code of two bytes or more is used. The leading byte always has the bit pattern 11xxxxxx, and each following byte has the bit pattern 10xxxxxx.

Here's code to get the skip (the number of bytes the character uses), and also to read a code point and to write one.

static const unsigned int offsetsFromUTF8[6] = 
{
    0x00000000UL, 0x00003080UL, 0x000E2080UL,
    0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

static const unsigned char trailingBytesForUTF8[256] = {
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};



int bbx_utf8_skip(const char *utf8)
{
  return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

int bbx_utf8_getch(const char *utf8)
{
    int ch;
    int nb;

    nb = trailingBytesForUTF8[(unsigned char)*utf8];
    if (nb > 3)
        return -1; /* 5- and 6-byte forms are not valid UTF-8 */
    ch = 0;
    switch (nb) 
    {
            /* these fall through deliberately */
        case 3: ch += (unsigned char)*utf8++; ch <<= 6;
        case 2: ch += (unsigned char)*utf8++; ch <<= 6;
        case 1: ch += (unsigned char)*utf8++; ch <<= 6;
        case 0: ch += (unsigned char)*utf8++;
    }
    ch -= offsetsFromUTF8[nb];

    return ch;
}

int bbx_utf8_putch(char *out, int ch)
{
  char *dest = out;
  if (ch < 0x80) 
  {
     *dest++ = (char)ch;
  }
  else if (ch < 0x800) 
  {
    *dest++ = (ch>>6) | 0xC0;
    *dest++ = (ch & 0x3F) | 0x80;
  }
  else if (ch < 0x10000) 
  {
     *dest++ = (ch>>12) | 0xE0;
     *dest++ = ((ch>>6) & 0x3F) | 0x80;
     *dest++ = (ch & 0x3F) | 0x80;
  }
  else if (ch < 0x110000) 
  {
     *dest++ = (ch>>18) | 0xF0;
     *dest++ = ((ch>>12) & 0x3F) | 0x80;
     *dest++ = ((ch>>6) & 0x3F) | 0x80;
     *dest++ = (ch & 0x3F) | 0x80;
  }
  else
    return 0;
  return dest - out;
}

Using these functions or similar, you convert between code points and UTF-8 and back.

Windows currently uses UTF-16 for its APIs. To a first approximation, UTF-16 is the code points in 16-bit format (code points above U+FFFF take two 16-bit units, called a surrogate pair). So when writing a UTF-8 based program, you need to convert the UTF-8 to UTF-16 (using wide chars) immediately before calling Windows output functions.

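Support for UTF-8 via printf() is patchy on Windows. printf() passes the bytes through unchanged, so whether a UTF-8 encoded string displays correctly depends on the console's code page, and it may not do what you want.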

Haemin answered 6/6, 2017 at 22:25 Comment(0)
