I've written a program in C that breaks words down into syllables, segments and letters. It's working well with ASCII characters but I want to make versions that work for the IPA and Arabic too.
I'm having massive problems saving and performing functions on individual characters. My editor and console are both set up to UTF-8 and can display Arabic text fine if I save it as a char*, but when I try to print wchars they display random punctuation marks.
My program needs to be able to recognise an individual UTF-8 character in order to work. For example, for the word 'though' it stores 't' as syllable[1]segment[1]letter[1], h as syllable[1]segment[1]letter[2] etc. I want to be able to do the same for non-ASCII characters.
I've spent basically the whole day researching unicode and trying out different methods and I can't get any of them to let me store an Arabic character as a character.
I'm not sure if I've just made some stupid syntax errors along the way, if I've completely misunderstood the whole concept, or if it actually just isn't possible to do what I want in C and I should just give up and try another language...
I would massively, massively, massively appreciate any help you can offer! I'm pretty new to programming, but unicode is completely instrumental to my work so I want to work out how to do it from the beginning.
My understanding of how unicode works (in case that's where I'm going wrong):
I type some text into my editor. My editor encodes it according to the encoding I have set. So if I set it to UTF-8 it will encode the Arabic letter ب with the 2 byte sequence 0xd8 0xa8 which indicates the code point U+0628.
I compile it, breaking down 0xd8 0xa8 into the binary 11011000 10101000.
I run it on the command prompt. The command prompt interprets the text according to the encoding I have set, so if I set it to UTF-8 it should interpret 11011000 10101000 as the code point U+0628. Unicode algorithms also tell it which version of U+0628 to display to me, as the character has different shapes depending on where it is in the word. As the character is alone it will show me the standalone version ب
My understanding of the ways I can process unicode in C:
Option A - Use single bytes encoded as UTF-8 (http://www.nubaria.com/en/blog/?p=289)
Leave all my datatypes as chars and char arrays and only type ASCII characters in my code. If I absolutely have to hard-code a unicode character, enter it as an array in the format:
const char kChineseSampleText[] = "\xe4\xb8\xad\xe6\x96\x87";
My problems with this:
- I need to manipulate individual characters
- Having to type Arabic characters as escaped byte sequences is going to render my code completely unreadable and slow me down immensely.
Option B - Use wchar and friends (http://icu-project.org/docs/papers/unicode_wchar_t.html)
Swap chars for wchars, which are 2 or 4 bytes wide depending on the compiler and platform. String functions like strlen will not work because they expect one-byte characters, but there are wide-character equivalents such as wprintf that I can use instead.
My problem with this:
I can’t get wchars to print Arabic characters at all! I can get them to print English letters fine, but Arabic characters just pull through as random punctuation marks.
I've tried inputting the unicode code point as well as the actual Arabic character and I've tried printing them both to the console and to a UTF-8 encoded text file and I get the same result, even though both the console and the text file display Arabic text if entered as a char*. I've included my code at the end.
(It’s worth saying here that I am aware that a lot of people think wchars are bad because they aren’t very portable and because they take up extra space for ASCII characters. But at this stage, neither of those things is really a worry for me - I’m just writing the program to run on my own computer and the program will only be processing short strings.)
Option C - Use external libraries
I've read in various comments that external libraries are the way to go so I've tried:
C programming library
http://www.cprogramming.com/tutorial/unicode.html suggests replacing all chars with unsigned long integers and using special functions for iterating through strings etc. The site even provides a sample library to download.
My problem:
While I can set the character to be an unsigned long integer I can’t print it out, because the printf and wprintf functions don’t work, and neither does the library provided on the website (I think maybe the library was designed for Linux? Some of the datatypes are invalid and amending them didn't work either)
ICU library
My problem:
I downloaded the ICU library, but when I was looking into how to use it I saw that functionality such as the characterIterator is not available for use in C (http://userguide.icu-project.org/strings). Being able to iterate through characters is completely fundamental to what I need to do, so I don't think the library will work for me.
My code
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
#include <string.h>
int main ()
{
    wchar_t unicode = L'\xd8ac';
    wchar_t arabic = L'ب';
    wchar_t number = 0x062c;
    FILE* f;
    f = fopen("unitest.txt","w");
    char* string = "ايه الاخبار";

    //printf - works
    printf("printf - literal arabic character is \"م\"\n");
    fprintf(f,"printf - literal arabic character is \"م\"\n");
    printf("printf - char* string is \"%s\"\n",string);
    fprintf(f,"printf - char* string is \"%s\"\n",string);

    //wprintf - english - works
    wprintf(L"wprintf - literal english char is \"%C\"\n\n", L't');
    fwprintf(f,L"wprintf - literal english char is \"%C\"\n\n", L't');

    //wprintf - arabic - doesnt work
    wprintf(L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
    fwprintf(f,L"wprintf - unicode wchar_t is \"%C\"\n", unicode);
    wprintf(L"wprintf - unicode number wchar_t is \"%C\"\n", number);
    fwprintf(f,L"wprintf - unicode number wchar_t is \"%C\"\n", number);
    wprintf(L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
    fwprintf(f,L"wprintf - arabic wchar_t is \"%C\"\n", arabic);
    wprintf(L"wprintf - literal arabic character is \"%C\"\n",L'ت');
    fwprintf(f,L"wprintf - literal arabic character is \"%C\"\n",L'ت');
    wprintf(L"wprintf - literal arabic character in string is \"م\"\n\n");
    fwprintf(f,L"wprintf - literal arabic character in string is \"م\"\n\n");

    fclose(f);
    return 0;
}
Output file
printf - literal arabic character is "م"
printf - char* string is "ايه الاخبار"
wprintf - literal english char is "t"
wprintf - unicode wchar_t is "�"
wprintf - unicode number wchar_t is ","
wprintf - arabic wchar_t is "("
wprintf - literal arabic character is "*"
wprintf - literal arabic character in string is ""
I'm using Windows 10, Notepad++ and MinGW.
Edit: This got marked as a duplicate of Light C Unicode Library but I don't think it really answers my question. I've downloaded the library and had a look at it, and you can call me stupid if you like, but I'm really new to programming and I don't understand most of the code in the library, so it's hard for me to work out how I can use it to achieve what I want. I searched the library for a print function and couldn't find one...
I just want to save a UTF-8 character and then print it out again! Do I really need to install an entire library to do that? I would just really appreciate someone taking pity on me and telling me in baby terms how I can do it... People keep saying I should use uint32_t or something instead of wchar_t - but how do I then print those datatypes? Can I do it with wprintf?!
[...] char* string = u8"ايه الاخبار"; to make string a UTF8 encoded string. – Sigler
[...] wchar_t? To store individual characters, use uint32_t. – Sigler
[...] char* string = "ايه الاخبار"; is OK as that allows various implementation issues. I am sure char* string = u8"ايه الاخبار"; is OK. – Sigler
[...] wchar_t can't handle Arabic letters. I said that wchar_t should not exist in C and is really not there to handle UTF-8. You will have a lot of unsolvable problems if you try. Believe me... I tried :p. Only Windows uses wide characters and this is not a good idea. You should read up on UTF-8 to understand that a type with more than 1 octet is not suitable for reading UTF-8. You must use a uint8_t, but whatever, I strongly recommend you use a library; utf8proc should work for you. – Trevatrevah
[...] wchar_t (which might be wide enough and encoded as Unicode). "individual UTF-8 character" is unclear. I'd expect "individual Unicode character" which needs at least 21 bits, as with uint32_t. UTF8 is a byte encoding, 1-4 bytes long, of a Unicode character. To store 1-4 bytes, use a char u[4] and zero unused bytes. – Sigler
[...] char* string = u8"ايه الاخبار"; for (char *s = string; *s; ) { printf("<"); char u[5]; char *p = u; *p++ = *s++; if ((*s & 0xC0) == 0x80) *p++ = *s++; if ((*s & 0xC0) == 0x80) *p++ = *s++; if ((*s & 0xC0) == 0x80) *p++ = *s++; *p = 0; printf("%s", u); printf(">\n"); } puts(""); work well for you? – Sigler
[...] *s will be "false" when it is a binary zero. – Cari