How to do operations with 'æ', 'ø' and 'å' in C
I have made a program in C which can both replace and remove all vowels from a string. In addition, I would like it to work for these characters: 'æ', 'ø', 'å'.

I have tried to use strstr(), but I didn't manage to implement it without replacing all the characters on the line containing 'æ', 'ø' or 'å'. I have also read about wchar_t, but that only seems to complicate everything.

The program is working with this array of chars:

char vowels[6] = {'a', 'e', 'i', 'o', 'u', 'y'};

I tried with this array:

char vowels[9] = {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'ø', 'å'};

but it gives these warnings:

warning: multi-character character constant [-Wmultichar]

warning: overflow in implicit constant conversion [-Woverflow]

and if I want to replace each vowel with 'a' it replaces 'å' with "�a".

I have also tried with the UTF-8 hex values of 'æ', 'ø' and 'å':

char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};

but it gives this error:

excess elements in char array initializer

Is there a way to make this work without making it too complicated?

Poly answered 21/9, 2015 at 12:14 Comment(10)
Please state the standard version you are using, whether you tried with C11, and which source/target character encoding your compiler uses. Note that e.g. UTF-8 (the default for gcc) has variable-length characters, so a char will not be sufficient to hold anything other than ASCII in a single char variable.Smith
How can I find out which version I'm using? I haven't tried with C11, and I don't know how I would go about doing that. I use this line to compile: > gcc -Wall -g -o filename filename.cPoly
You have to specify yourself. Check the documentation which standard your gcc-version uses by default. (hint: this changed recently). Anyway, you have to use wide chars, but I cannot help you with that - sorry.Smith
Instead of char extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"}; you should use char *extended[3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};Dhoti
I'm using gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) I will try that Fernando.Poly
You can select the standard version with -std=c11. How well that works with version 4.8.4, I don't know.Cari
Try char extended[3][3] = {"\xc3\xa6", "\xc3\xb8", "\xc3\xa5"};Parabolize
@MartinJohansen I would really change the program to work with UTF8, because of the reasons stated in UTF-8 Everywhere, (the link that Basile Starynkevitch already posted in his answer).Egregious
those characters can't fit in a char. You must use wchar_t, char16_t or char32_t. Read more Joel on Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)Schroeder
@LuuVinhPhuc: No you don't have to use wchar_t (whose width vary from one implementation or OS to another), but you should use UTF_ multibyte char like I did in my answer.Upanchor
There are two approaches to getting those characters to be usable. The first is code pages, which would allow you to use extended ASCII characters (values 128-255), but the code page is system- and locale-dependent, so it's a bad idea in general.

The better alternative is to use Unicode. The typical approach with Unicode is to use wide character literals, like in this post:

wchar_t str[] = L"αγρω";

The key problem with your code is that you're mixing byte-oriented ASCII literals with multi-byte UTF-8 text. The solution to this is simple: convert all your literals to their wide-character equivalents, as well as your strings. You need to work with one common encoding rather than mixing encodings, unless you have conversion functions to help out.

Potoroo answered 21/9, 2015 at 12:37 Comment(16)
I made this work by doing these replacements in my code: char -> wchar_t, strcpy() -> wcscpy(), strlen() -> wcslen(), printf("%s", str) -> printf("%ls", str). I'm only missing a replacement for getline().Poly
There are no "extended ASCII characters". "Code pages" are specific to one family of operating systems. There is absolutely no problem whatsoever comparing ASCII with UTF8, as UTF8 is specifically designed to be ASCII-compatible.Winou
@n.m. I beg to differ. en.wikipedia.org/wiki/Extended_ASCII Extended ASCII (or high ASCII) is eight-bit or larger character encodings that include the standard seven-bit ASCII characters as well as others. The use of the term is sometimes criticized,[1][2][3] because it can be mistakenly interpreted that the ASCII standard has been updated to include more than 128 characters or that the term unambiguously identifies a single encoding, both of which are untrue.Potoroo
I believe that on Linux using UTF-8 char is much better than wchar_tUpanchor
Basile, how would you make that work with letters like 'æ', 'ø' or 'å'?Poly
@Dogbert Please note how Wikipedia says The use of the term is sometimes criticized (and lists the reasons why). Now you have encountered someone who criticizes the use of the term (myself). Where's a contradiction?Winou
@BasileStarynkevitch depends on what you are doing. For character-level work, like scanning words for vowels, wchar_t is much easier.Winou
Not much easier, and for scanning words that you have got in UTF-8, converting all the input to wchar_t is inefficient and error prone...Upanchor
@n.m. Because they do indeed exist. Just because Windows uses code pages and Linux uses locales, doesn't mean extended ASCII chars don't exist. The term is criticized because it seems to indicate ASCII supports char values above 127, or that the values on the range [128,255] are the same from system to system. The term itself is criticized for incorrect assumptions it causes readers to infer, not the validity of the existence of the term. As a counterexample to your point, æ maps to 145 in extended ASCII, but 230 in UTF8. Extended ASCII doesn't map to unicode equivalents.Potoroo
@BasileStarynkevitch Yes. What I'm getting at is that extended ASCII exists, just as a logical mapping to a specific set of 128 characters that change from platform to platform depending on locale settings, and that it doesn't map to UTF8 as nm noted before editing his/her comment.Potoroo
@Dogbert æ maps to 145 in a specific encoding called ISO8859-1. There are many encodings and charsets that can equally be called "extended ASCII" and æ is not in most of them. Which is exactly the reason why the term should never be used.Winou
I believe that using Unicode/UCS4 wchar_t is worse than UTF8everywhere char-s so I downvoted that answerUpanchor
@BasileStarynkevitch "inefficient and error prone" can't really see how either of this is true.Winou
@n.m. Extended ASCII's real meaning is simply the use of the upper/eighth bit to index into additional character maps, nothing more. The fact that a specific character exists in multiple character sets is irrelevant, and has no bearing on the actual definition of "extended ASCII".Potoroo
@Dogbert The term doesn't imply which map is to be used, only that there's an unspecified map from an unspecified set of characters to numbers 128-255. Why ever use such a vague term if you can just call the map by its name like ISO8859-15?Winou
@n.m. Because I'm trying to draw attention to the general concept with respect to TC's post, and make a distinction between extended ASCII and code pages, rather than treat the two synonymously, as they are distinct concepts, despite them working side-by-side typically.Potoroo
Learn about UTF-8 (including its relationship to Unicode) and use some UTF-8 library: libunistring, utfcpp, Glib from GTK, ICU ....

You need to understand which character encoding you are using.

I strongly recommend UTF-8 in all cases (which is the default on most Linux systems and nearly all the Internet and web servers; read locale(7) & utf8(7)). Read utf8everywhere....

I don't recommend wchar_t, whose width, range and signedness are implementation-specific (you can't be sure that all of Unicode fits in a wchar_t; on Windows wchar_t is 16 bits, so it does not). Also, converting UTF-8 input to Unicode/UCS-4 can be more time-consuming than handling UTF-8 directly...

Do understand that in UTF-8 a character can be encoded in several bytes. For example ê (French accented lower-case e with circumflex) is encoded in two bytes 0xc3, 0xaa, and ы (Russian lower-case yery) is encoded in two bytes 0xd1, 0x8b; both are considered vowels but neither fits in one char (which is an 8-bit byte on your machine and mine).

The notion of vowel is complicated (e.g. what are vowels in Russian, Arabic, Japanese, Hebrew, Cherokee, Hindi, ....), so there might be no simple solution to your problem (since UTF-8 has combining characters).

Are you sure that æ and œ are letters or vowels? (FWIW, å, œ and æ are classified as lowercase letters in Unicode.) I was taught in French elementary school that they are ligatures (and French dictionaries don't list them as letters, so œuf appears in a dictionary at the place of oeuf, which means egg). But I am not an expert on this. See strcoll(3).

On Linux, since UTF-8 is the default encoding (and it is increasingly hard to get any other one on a recent distribution), I don't recommend using wchar_t; use UTF-8 in plain char (with functions that handle the multi-byte UTF-8 encoding), for example (using Glib's UTF-8 and Unicode functions):

 unsigned count_norvegian_lowercase_vowels(const char *s) {
   assert(s != NULL);
   // s should be a not-too-big string
   // (its `strlen` should be less than UINT_MAX)
   // s is assumed to be UTF-8 encoded, and should be valid UTF-8:
   if (!g_utf8_validate(s, -1, NULL)) {
     fprintf(stderr, "invalid UTF-8 string %s\n", s);
     exit(EXIT_FAILURE);
   }
   unsigned count = 0;
   for (const char *pc = s; *pc != '\0'; pc = g_utf8_next_char(pc)) {
     gunichar u = g_utf8_get_char(pc);
     // comments from the OP make me believe these are the only Norwegian vowels.
     if (u == 'a' || u == 'e' || u == 'i' || u == 'o' || u == 'u' || u == 'y'
         || u == (gunichar)0xe6 //æ U+00E6 LATIN SMALL LETTER AE
         || u == (gunichar)0xf8 //ø U+00F8 LATIN SMALL LETTER O WITH STROKE
         || u == (gunichar)0xe5 //å U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
      /* notice that for me ы & ê are also vowels but œ is a ligature ... */
        )
       count++;
   }
   return count;
 }

I'm not sure the name of my function is correct; but you told me in the comments that Norwegian (which I don't speak) has no more vowel characters than what my function is counting.

It is on purpose that I did not put UTF-8 in literal strings or wide-char literals (only in comments). There are other, now largely obsolete character encodings (read about EBCDIC or KOI8) and you might want to cross-compile the code.
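If Glib is not available, the same idea works with plain char by matching the UTF-8 byte sequences directly. A sketch of the removal variant (the helper name remove_vowels_utf8 is mine; it assumes precomposed NFC input, i.e. 'å' stored as U+00E5 rather than 'a' plus a combining ring):

```c
#include <string.h>

/* Remove a e i o u y æ ø å from UTF-8 string `in`; `out` must hold at
   least strlen(in)+1 bytes.  In UTF-8, æ ø å are the two-byte sequences
   0xC3 0xA6, 0xC3 0xB8 and 0xC3 0xA5 respectively. */
static void remove_vowels_utf8(const char *in, char *out) {
    while (*in != '\0') {
        unsigned char c = (unsigned char)in[0];
        if (c == 'a' || c == 'e' || c == 'i' ||
            c == 'o' || c == 'u' || c == 'y') {
            in += 1;                        /* skip a one-byte ASCII vowel */
        } else if (c == 0xC3 && ((unsigned char)in[1] == 0xA6 ||
                                 (unsigned char)in[1] == 0xB8 ||
                                 (unsigned char)in[1] == 0xA5)) {
            in += 2;                        /* skip two-byte æ, ø or å */
        } else {
            *out++ = *in++;                 /* copy any other byte as-is */
        }
    }
    *out = '\0';
}
```

Because both bytes of each two-byte vowel are skipped together, this avoids the half-replaced "�a" artifact from the question; any other multi-byte character is copied through untouched, byte by byte.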

Upanchor answered 21/9, 2015 at 12:42 Comment(8)
I understand that UTF-8 can be several bytes, and I think that's the reason why 'å' was replaced with "�a". 'æ', 'ø', and 'å' are vowels in the Norwegian and Danish languages. 'æ' is the sound a sheep makes (baa) w/o the 'b', 'ø' sounds like "uhh" and 'å' sounds like "oh". But the program doesn't have to work for every language, only Norwegian :)Poly
It says in the title.Poly
Norwegian is not mentioned in the title or in the question. Languages have many more vowels than you think. ы & ê are obviously vowels, but you wrongly believe they are not. And I won't dare speak about vowels in Hebrew, Arabic, Japanese or Cherokee, but I do know it is a tricky subject.Upanchor
how-to-do-operations-with-æ-ø-and-å-in-c. Maybe the title is bad.Poly
@BasileStarynkevitch It's quite simple, really. None of these letters are vowels. Vowels are sounds. Letters relate to sounds in complex ways, there is often no 1:1 mapping.Winou
@n.m. you should also convince the OP, Martin Johansen. However, in elementary school, I (and my children and grandchildren) was taught that a e i o u y are all the vowels in French.Upanchor
@BasileStarynkevitch yeah, in the elementary school they tend to teach that. Not universally though. It's a simplified approach that works relatively well for some languages, not so well for others.Winou
My point (to the OP, not to you n.m) is that the notion of vowels (in Unicode) is probably very complex.Upanchor

© 2022 - 2024 — McMap. All rights reserved.