Learn about UTF-8 (including its relationship to Unicode) and use a UTF-8 library: libunistring, utfcpp, GLib from GTK, ICU, ...
You need to understand which character encoding you are using.
I strongly recommend UTF-8 in all cases (which is the default on most Linux systems and nearly all the Internet and web servers; read locale(7) & utf8(7)).
Read the UTF-8 Everywhere manifesto (utf8everywhere.org).
I don't recommend wchar_t, whose width, range, and signedness are implementation-specific (you can't be sure that every Unicode code point fits in a wchar_t; on Windows wchar_t is 16 bits wide, so it does not fit). Also, converting UTF-8 input to Unicode/UCS-4 can be time-consuming, more so than handling the UTF-8 directly.
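To see what your own implementation does, a small check along these lines may help. This is only a sketch: fits_in_wchar is a hypothetical helper name, and the answer it gives depends on the compiler and OS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>  /* WCHAR_MAX, uintmax_t */
#include <wchar.h>

/* Hypothetical helper: true when the code point cp (at most U+10FFFF)
   is not larger than the biggest value a wchar_t can hold here. */
bool fits_in_wchar(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return (uintmax_t)cp <= (uintmax_t)WCHAR_MAX;
}
```

Calling fits_in_wchar(0x10FFFF) tells you whether the largest Unicode code point is representable; with a 16-bit wchar_t it is not.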
Do understand that in UTF-8 a character can be encoded in several bytes. For example ê (French lower-case e with circumflex) is encoded in two bytes 0xc3, 0xaa, and ы (Russian lower-case yery) is encoded in two bytes 0xd1, 0x8b. Both are considered vowels, but neither fits in one char (which is an 8-bit byte on your machine and mine).
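The difference between bytes and characters can be checked with plain C. The sketch below assumes its input is already valid UTF-8; utf8_code_point_count is a hypothetical helper, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts the code points in a string assumed to be
   valid UTF-8, by counting every byte that is not a continuation byte
   (continuation bytes have the bit pattern 10xxxxxx). */
size_t utf8_code_point_count(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With the ê from above written as "\xc3\xaa", strlen reports 2 bytes while utf8_code_point_count reports 1 code point.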
The notion of vowel is complicated (e.g. what are the vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ...?), so there might be no simple solution to your problem (all the more since Unicode has combining characters).
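Combining characters mean that the same visible letter can have more than one byte sequence. As a rough illustration (the constant and function names here are made up for the example), ê can be a single code point or a plain e followed by U+0302 COMBINING CIRCUMFLEX ACCENT:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two spellings of the same visible character, written with escapes. */
static const char E_CIRC_PRECOMPOSED[] = "\xc3\xaa";  /* U+00EA */
static const char E_CIRC_DECOMPOSED[]  = "e\xcc\x82"; /* U+0065 U+0302 */

/* Byte-wise comparison: it knows nothing about Unicode equivalence,
   so it treats the two spellings as different strings. */
bool same_bytes(const char *a, const char *b) {
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}
```

same_bytes(E_CIRC_PRECOMPOSED, E_CIRC_DECOMPOSED) is false, so a vowel counter that only tests for U+00EA would see just the plain e in the second spelling. Normalizing the input first (for instance with GLib's g_utf8_normalize) is one way to reduce the problem.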
Are you quite sure that æ and œ are letters or vowels? (FWIW, å, œ and æ are each classified as a lowercase letter in Unicode.) I was taught in French elementary school that they are ligatures (and French dictionaries don't treat them as letters, so œuf, which means egg, is in a dictionary at the place of oeuf). But I am not an expert on this. See strcoll(3).
On Linux, since UTF-8 is the default encoding (and it is increasingly hard to get any other one on recent distributions), I don't recommend using wchar_t; use UTF-8 char instead (so functions handling multi-byte encoded UTF-8), for example (using GLib UTF-8 & Unicode functions):
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <glib.h>

unsigned count_norvegian_lowercase_vowels(const char *s) {
  assert(s != NULL);
  // s should be a not-too-big string
  // (its `strlen` should be less than UINT_MAX)
  // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
  if (!g_utf8_validate(s, -1, NULL)) {
    fprintf(stderr, "invalid UTF-8 string %s\n", s);
    exit(EXIT_FAILURE);
  }
  unsigned count = 0;
  // walk the string one UTF-8 encoded character (not one byte) at a time
  for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
    gunichar u = g_utf8_get_char(pc); // decoded Unicode code point
    // comments from OP make me believe these are the only Norwegian vowels.
    if (u == 'a' || u == 'e' || u == 'i' || u == 'o' || u == 'u' || u == 'y'
        || u == (gunichar)0xe6 // æ U+00E6 LATIN SMALL LETTER AE
        || u == (gunichar)0xf8 // ø U+00F8 LATIN SMALL LETTER O WITH STROKE
        || u == (gunichar)0xe5 // å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
        /* notice that for me ы & ê are also vowels but œ is a ligature ... */
       )
      count++;
  }
  return count;
}
I'm not sure the name of my function is correct, but you told me in comments that Norwegian (which I don't know) has no more vowel characters than what my function counts.
It is on purpose that I did not put UTF-8 in literal strings or wide char literals (only in comments). There are other obsolete character encodings (read about EBCDIC or KOI8) and you might want to cross-compile the code.
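If you cannot use GLib, the same idea can be sketched with a hand-written decoder. Everything below is an illustration under assumptions: decode_utf8, is_norwegian_lowercase_vowel and count_vowels_plain_c are hypothetical names, the input must already be valid UTF-8 (nothing is validated here), and the vowel list is the one the OP gave in comments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from s (assumed to be
   valid UTF-8), stores it in *out and returns the number of bytes
   consumed (1 to 4). It performs no validation. */
static size_t decode_utf8(const unsigned char *s, uint32_t *out) {
    if (s[0] < 0x80) {                       /* 0xxxxxxx: ASCII */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {             /* 110xxxxx 10xxxxxx */
        *out = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {             /* 1110xxxx + 2 more bytes */
        *out = ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *out = ((uint32_t)(s[0] & 0x07) << 18)   /* 11110xxx + 3 more bytes */
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* The vowel set used in the answer: a e i o u y, plus U+00E6, U+00F8
   and U+00E5. */
static int is_norwegian_lowercase_vowel(uint32_t u) {
    switch (u) {
    case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
    case 0xE6: case 0xF8: case 0xE5:
        return 1;
    default:
        return 0;
    }
}

unsigned count_vowels_plain_c(const char *s) {
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    unsigned count = 0;
    while (*p != '\0') {
        uint32_t u;
        p += decode_utf8(p, &u);  /* advance by whole characters */
        if (is_norwegian_lowercase_vowel(u))
            count++;
    }
    return count;
}
```

A call can keep the source file pure ASCII by spelling the non-ASCII letters as byte escapes, e.g. count_vowels_plain_c("bl\xc3\xa5" "b\xc3\xa6" "rsyltet\xc3\xb8" "y"). The literal is split into pieces so that a following letter such as b is not swallowed by the preceding \x escape.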
UTF-8 (default for gcc) has variable length characters, so char will not be sufficient to hold anything else than ASCII in a single char variable. – Smith
char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"}; you should use char *extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"}; – Dhoti
-std=c11. How well that works with version 4.8.4, I don't know. – Cari
char extended[3][3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"}; – Parabolize
char. You must use wchar_t, char16_t or char32_t. Read more: Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Schroeder
wchar_t (whose width varies from one implementation or OS to another), but you should use UTF-8 multibyte char like I did in my answer. – Upanchor