How to do operations with 'æ', 'ø' and 'å' in C
I have made a program in C which can both replace and remove all vowels from a string. In addition, I would like it to work for these characters: 'æ', 'ø', 'å'.

I have tried to use strstr(), but I didn't manage to implement it without replacing all the characters on the line containing 'æ', 'ø' or 'å'. I have also read about wchar_t, but that only seems to complicate everything.

The program is working with this array of chars:

char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'};

I tried with this array:

char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'};

but it gives these warnings:

warning: multi-character character constant [-Wmultichar]

warning: overflow in implicit constant conversion [-Woverflow]

and if I want to replace each vowel with 'a' it replaces 'å' with "�a".

I have also tried with the UTF-8 hex values of 'æ', 'ø' and 'å':

char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};

but it gives this error:

excess elements in char array initializer

Is there a way to make this work without making it too complicated?

Poly answered 21/9, 2015 at 12:14 Comment(10)
Please state the standard version you are using, whether you tried with C11, and which source/target character encoding your compiler uses. Note that e.g. UTF-8 (the default for gcc) has variable-length characters, so a char will not be sufficient to hold anything other than ASCII in a single char variable.Smith
How can I find out which version I'm using? I haven't tried with C11, and I don't know how I would go about doing that. I use this line to compile: > gcc -Wall -g -o filename filename.cPoly
You have to specify yourself. Check the documentation which standard your gcc-version uses by default. (hint: this changed recently). Anyway, you have to use wide chars, but I cannot help you with that - sorry.Smith
Instead of char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"}; you should use char *extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};Dhoti
I'm using gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) I will try that Fernando.Poly
You can select the standard version with -std=c11. How well that works with version 4.8.4, I don't know.Cari
Try char extended[3][3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};Parabolize
@MartinJohansen I would really change the program to work with UTF8, because of the reasons stated in UTF-8 Everywhere, (the link that Basile Starynkevitch already posted in his answer).Egregious
those characters can't fit in a char. You must use wchar_t, char16_t or char32_t. Read more Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)Schroeder
@LuuVinhPhuc: No you don't have to use wchar_t (whose width vary from one implementation or OS to another), but you should use UTF_ multibyte char like I did in my answer.Upanchor
There are two approaches to getting those characters to be usable. The first is code pages, which would allow you to use extended ASCII characters (values 128-255), but the code page is system- and locale-dependent, so it's a bad idea in general.

The better alternative is to use Unicode. The typical approach with Unicode is to use wide character literals, like in this post:

wchar_t str[] = L"αγρω";

The key problem with your code is that you're mixing byte-oriented ASCII literals with multi-byte UTF-8 text. The solution to this is simple: convert all your literals to their wide-character equivalents, as well as your strings. You need to work with one common encoding rather than mixing encodings, unless you have conversion functions to help out.

Potoroo answered 21/9, 2015 at 12:37 Comment(16)
I made this work by doing these replacements in my code: char -> wchar_t, strcpy() -> wcscpy(), strlen() -> wcslen(), printf("%s", str) -> printf("%ls", str). I'm only missing a replacement for getline().Poly
There are no "extended ASCII characters". "Code pages" are specific to one family of operating systems. There is absolutely no problem whatsoever comparing ASCII with UTF8, as UTF8 is specifically designed to be ASCII-compatible.Winou
@n.m. I beg to differ. en.wikipedia.org/wiki/Extended_ASCII Extended ASCII (or high ASCII) is eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized,[1][2][3] because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.Potoroo
I believe that on Linux using UTF-8 char is much better than wchar_tUpanchor
Basile, how would you make that work with letters like 'æ', 'ø' or 'å'?Poly
@Dogbert Please note how Wikipedia says The use of the term is sometimes criticized (and lists the reasons why). Now you have encountered someone who criticizes the use of the term (myself). Where's a contradiction?Winou
@BasileStarynkevitch depends on what you are doing. For character-level work, like scanning words for vowels, wchar_t is much easier.Winou
Not much easier, and for scanning words that you have got in UTF-8, converting all the input to wchar_t is inefficient and error prone...Upanchor
@n.m. Because they do indeed exist. Just because Windows uses code pages and Linux uses locales, doesn't mean extended ASCII chars don't exist. The term is criticized because it seems to indicate ASCII supports char values above 127, or that the values on the range [128,255] are the same from system to system. The term itself is criticized for incorrect assumptions it causes readers to infer, not the validity of the existence of the term. As a counterexample to your point, æ maps to 145 in extended ASCII, but 230 in UTF8. Extended ASCII doesn't map to unicode equivalents.Potoroo
@BasileStarynkevitch Yes. What I'm getting at is that extended ASCII exists, just as a logical mapping to a specific set of 128 characters that change from platform to platform depending on locale settings, and that it doesn't map to UTF8 as nm noted before editing his/her comment.Potoroo
@Dogbert æ maps to 145 in a specific encoding called ISO8859-1. There are many encodings and charsets that can equally be called "extended ASCII" and æ is not in most of them. Which is exactly the reason why the term should never be used.Winou
I believe that using Unicode/UCS4 wchar_t is worse than UTF8everywhere char-s so I downvoted that answerUpanchor
@BasileStarynkevitch "inefficient and error prone" can't really see how either of this is true.Winou
@n.m. Extended ASCII's real meaning is simply the use of the upper/eighth bit to index into additional character maps, nothing more. The fact that a specific character exists in multiple character sets is irrelevant, and has no bearing on the actual definition of "extended ASCII".Potoroo
@Dogbert The term doesn't imply which map is to be used, only that there's an unspecified map from an unspecified set of characters to numbers 128-255. Why ever use such a vague term if you can just call the map by its name like ISO8859-15?Winou
@n.m. Because I'm trying to draw attention to the general concept with respect to TC's post, and make a distinction between extended ASCII and code pages, rather than treat the two synonymously, as they are distinct concepts, despite them working side-by-side typically.Potoroo
Learn about UTF-8 (including its relationship to Unicode) and use some UTF-8 library: libunistring, utfcpp, Glib from GTK, ICU ....

You need to understand which character encoding you are using.

I strongly recommend UTF-8 in all cases (which is the default on most Linux systems and nearly all the Internet and web servers; read locale(7) & utf8(7)). Read utf8everywhere....

I don't recommend wchar_t, whose width, range and signedness are implementation-specific (you can't be sure that all of Unicode fits in a wchar_t; on Windows wchar_t is 16 bits, so it does not). Also, converting UTF-8 input to Unicode/UCS-4 can be more time-consuming than handling UTF-8 directly...

Do understand that in UTF-8 a character can be encoded in several bytes. For example ê (French accented lower-case e with circumflex) is encoded in two bytes 0xc3, 0xaa, and ы (Russian lower-case yery) is encoded in two bytes 0xd1, 0x8b; both are considered vowels but neither fits in one char (which is an 8-bit byte on your machine and mine).

The notion of vowel is complicated (e.g. what are vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ....), so there might be no simple solution to your problem (since UTF-8 has combining characters).

Are you sure that æ and œ are letters or vowels? (FWIW, å, œ and æ are classified as lowercase letters in Unicode.) I was taught in French elementary school that they are ligatures (and French dictionaries don't list them as letters, so œuf appears in a dictionary at the place of oeuf, which means egg). But I am not an expert on this. See strcoll(3).

On Linux, since UTF-8 is the default encoding (and it is increasingly hard to get any other one on a recent distribution), I don't recommend using wchar_t; use UTF-8 in plain char (with functions that handle the multi-byte UTF-8 encoding), for example (using Glib's UTF-8 and Unicode functions):

 unsigned count_norvegian_lowercase_vowels(const char *s) {
   assert(s != NULL);
   // s should be a not-too-big string
   // (its `strlen` should be less than UINT_MAX)
   // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
   if (!g_utf8_validate(s, -1, NULL)) {
     fprintf(stderr, "invalid UTF-8 string %s\n", s);
     exit(EXIT_FAILURE);
   }
   unsigned count = 0;
   for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
     gunichar u = g_utf8_get_char(pc);
     // comments from the OP make me believe these are the only Norwegian vowels.
     if (u == 'a' || u == 'e' || u == 'i' || u == 'o' || u == 'u' || u == 'y'
         || u == (gunichar)0xe6 //æ U+00E6 LATIN SMALL LETTER AE
         || u == (gunichar)0xf8 //ø U+00F8 LATIN SMALL LETTER O WITH STROKE
         || u == (gunichar)0xe5 //å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
      /* notice that for me ы & ê are also vowels but œ is a ligature ... */
        )
       count++;
   }
   return count;
 }

I'm not sure the name of my function is correct; but you told me in the comments that Norwegian (which I don't speak) has no more vowel characters than what my function is counting.

It is on purpose that I did not put UTF-8 in literal strings or wide-char literals (only in comments). There are other, now largely obsolete character encodings (read about EBCDIC or KOI8) and you might want to cross-compile the code.
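If Glib is not available, the same idea works with plain char by matching the UTF-8 byte sequences directly. A sketch of the removal variant (the helper name remove_vowels_utf8 is mine; it assumes precomposed NFC input, i.e. 'å' stored as U+00E5 rather than 'a' plus a combining ring):

```c
#include <string.h>

/* Remove a e i o u y æ ø å from UTF-8 string `in`; `out` must hold at
   least strlen(in)+1 bytes.  In UTF-8, æ ø å are the two-byte sequences
   0xC3 0xA6, 0xC3 0xB8 and 0xC3 0xA5 respectively. */
static void remove_vowels_utf8(const char *in, char *out) {
    while (*in != '\0') {
        unsigned char c = (unsigned char)in[0];
        if (c == 'a' || c == 'e' || c == 'i' ||
            c == 'o' || c == 'u' || c == 'y') {
            in += 1;                        /* skip a one-byte ASCII vowel */
        } else if (c == 0xC3 && ((unsigned char)in[1] == 0xA6 ||
                                 (unsigned char)in[1] == 0xB8 ||
                                 (unsigned char)in[1] == 0xA5)) {
            in += 2;                        /* skip two-byte æ, ø or å */
        } else {
            *out++ = *in++;                 /* copy any other byte as-is */
        }
    }
    *out = '\0';
}
```

Because both bytes of each two-byte vowel are skipped together, this avoids the half-replaced "�a" artifact from the question; any other multi-byte character is copied through untouched, byte by byte.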

Upanchor answered 21/9, 2015 at 12:42 Comment(8)
I understand that UTF-8 can be several bytes, and I think that's the reason why 'å' was replaced with "�a". 'æ', 'ø', and 'å' are vowels in the Norwegian and Danish languages. 'æ' is the sound a sheep makes (baa) w/o the 'b', 'ø' sounds like "uhh" and 'å' sounds like "oh". But the program doesn't have to work for every language, only Norwegian :)Poly
It says in the title.Poly
Norwegian is not mentioned in the title or in the question. Languages have many more vowels than you think. ы & ê are obviously vowels, but you wrongly believe they are not. And I won't dare speak about vowels in Hebrew, Arabic, Japanese or Cherokee, but I do know it is a tricky subject.Upanchor
how-to-do-operations-with-æ-ø-and-å-in-c. Maybe the title is bad.Poly
@BasileStarynkevitch It's quite simple, really. None of these letters are vowels. Vowels are sounds. Letters relate to sounds in complex ways, there is often no 1:1 mapping.Winou
@n.m. you should also convince the OP, Martin Johansen. However, in elementary school, I (and my children and grandchildren) was taught that a e i o u y are all the vowels in French.Upanchor
@BasileStarynkevitch yeah, in the elementary school they tend to teach that. Not universally though. It's a simplified approach that works relatively well for some languages, not so well for others.Winou
My point (to the OP, not to you n.m) is that the notion of vowels (in Unicode) is probably very complex.Upanchor

© 2022 - 2024 — McMap. All rights reserved.