How to match Unicode vowels?

Asked 5/8, 2016 at 15:24 Answered 5/3, 2021 at 18:43

What character class or Unicode property will match any Unicode vowel in Perl?

Wrong answer: [aeiouAEIOU]. (sermon here, item #24 in the laundry list)

perluniprops mentions vowels only for Hangul and Indic scripts.

Let's set aside the question of what a vowel is. Yes, i may not be a vowel in some contexts. So any character that can be a vowel will do.

Average answered 5/8, 2016 at 15:24 Comment(0)

There's no such property.

$ uniprops --all a
U+0061 <a> \N{LATIN SMALL LETTER A}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
       ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
       Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
       Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
       IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
       POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
       X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS
    Age=1.1 Age=V1_1 Block=Basic_Latin Bidi_Class=L Bidi_Class=Left_To_Right BC=L
       Bidi_Paired_Bracket_Type=None Block=ASCII BLK=ASCII Canonical_Combining_Class=0
       Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR
       Decomposition_Type=None DT=None East_Asian_Width=Na East_Asian_Width=Narrow EA=Na
       Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA
       Hangul_Syllable_Type=Not_Applicable HST=NA Indic_Positional_Category=NA InPC=NA
       Indic_Syllabic_Category=Other InSC=Other Joining_Group=No_Joining_Group JG=NoJoiningGroup
       Joining_Type=Non_Joining JT=U Joining_Type=U Script=Latin Line_Break=AL
       Line_Break=Alphabetic LB=AL Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN
       Present_In=1.1 IN=1.1 Present_In=2.0 IN=2.0 Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0
       Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1
       Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0
       Present_In=6.1 IN=6.1 Present_In=6.2 IN=6.2 Present_In=6.3 IN=6.3 Present_In=7.0 IN=7.0
       Present_In=8.0 IN=8.0 SC=Latn Script=Latn Script_Extensions=Latin Scx=Latn
       Script_Extensions=Latn Sentence_Break=LO Sentence_Break=Lower SB=LO Word_Break=ALetter WB=LE
       Word_Break=LE

The most important thing when dealing with i18n is to think about what you actually need, yet you didn't even mention what you are trying to accomplish.

Find vowels? That can't be what you are actually trying to do. I could see a use for identifying vowel sounds in a word, but those are often formed from multiple letters (such as "oo" in English, and "in", "an"/"en", "ou", "ai", "au"/"eau", "eu" in French), and it would be language-specific.

As it stands, you're asking for a global solution but you're defining the problem in local terms. You first need to start by defining the actual problem you are trying to solve.

Jeffreys answered 5/8, 2016 at 18:22 Comment(5)

For that matter: it's not clear what "vowel" would even mean for ideographic or syllabic characters. – Fanjet 5/8, 2016 at 18:36

@ikegami: Let's say the problem is how to substitute an apostrophe for the horizontal space between any word consisting solely of a single small Latin letter l and any vowel whatsoever, in order to achieve French elision not only among French words, but also in something like this: j'aime l'ἐπιστήμη. Like this: "j aime l ἐπιστήμη" =~ s/ \b l \b \K \h+ (?= \p{Vowel} ) /'/gxr; – Average 5/8, 2016 at 20:15

@duskwuff: as ideographic or syllabic characters don't involve vowels, the problem simply doesn't arise for them. – Average 5/8, 2016 at 20:16

You'd actually have to determine whether ἐπιστήμη starts with a vowel sound in the language to which the word belongs. In French, those all start with vowels, and everything that starts with a vowel counts as a vowel (incl. "oi", which isn't actually a vowel sound), but I don't know if that's the case for all other languages. In fact, I've just found an exception: the English word "You" does not start with a vowel sound. If there were something named "You", it would be "le You", not "l'You". This doesn't look like a problem that can be solved using regex. – Jeffreys 5/8, 2016 at 20:23

Also, consider l'homme, l'hôpital, l'Hospital, etc. – Eleanore 8/11, 2020 at 14:2

Setting aside the definition of a vowel and the obvious problem that different languages share symbols but use them differently, there's a way that you can define your own property for use in a Perl pattern.

Define a subroutine whose name starts with In or Is and have it return the code points that belong to the property. The simplest form is one hexadecimal code point per line, or a range given as two code points separated by horizontal whitespace:

#!perl
use v5.10;
use utf8;
use open qw(:std :utf8);

sub InSpecial {
    return <<"HERE";
00A7
00B6
2295\t229C
HERE
}


$_ = "ABC\x{00A7}";

say $_;
say /\p{InSpecial}/ ? 'Matched' : 'Missed';
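
Applied to the elision case raised in the comments, the same mechanism could look roughly like the sketch below. The property name IsBaseVowel, the helper elide, and the choice of letters (Latin aeiou plus the seven Greek vowel letters) are all assumptions made for illustration; this is an orthographic test only, and it knows nothing about cases such as "le You" or "l'homme". The text is decomposed first so that an accented vowel such as ἐ exposes its base letter to the lookahead.

```perl
#!perl
use v5.10;
use strict;
use warnings;
use utf8;
use open qw(:std :utf8);
use Unicode::Normalize qw(NFD NFC);

# Hypothetical user-defined property: base letters we choose to call vowels.
# It returns one hexadecimal code point per line, as perlunicode requires.
sub IsBaseVowel {
    return join "\n",
        (map { sprintf('%04X', ord($_)) } split //, 'aeiouAEIOU'),      # Latin
        (map { sprintf('%04X', ord($_)) } split //, 'αεηιουωΑΕΗΙΟΥΩ'),  # Greek
        '';
}

# Replace the space after a lone "l" with an apostrophe when the next
# word begins with one of the letters above (hypothetical helper).
sub elide {
    my ($text) = @_;
    my $nfd = NFD($text);    # accented vowels become base letter + marks
    $nfd =~ s/ \b l \b \K \h+ (?= \p{IsBaseVowel} ) /'/gx;
    return NFC($nfd);        # recompose before returning
}

say elide('j aime l ἐπιστήμη');
```

Note that elide returns NFC text, so input that was not already in NFC may come back in a different (canonically equivalent) form.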

Rendon answered 5/3, 2021 at 18:43 Comment(0)

First of all, not all written languages have "vowels". For one example, 中文 (Zhōngwén) (written Chinese) does not, as it is ideogrammatic instead of phonetic. For another example, Japanese mostly doesn't; it uses mostly consonant+vowel hiragana or katakana syllabics such as "ga", "wa", "tsu" instead.

And some written languages (for example, Hindi, Bangla, Greek, Russian) do have vowels, but use characters which are not easily mappable to aeiou. For such languages you'd have to find (search metacpan?) or make look-up tables specifying which letters are "vowels".
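
A hand-made look-up table of that kind might be sketched as follows. The %VOWELS hash, its language keys, and the count_vowels helper are hypothetical names, and the two tables (the seven Greek vowel letters, and the Welsh set a e i o u w y) are only illustrative. Stripping combining marks before the look-up suits these two alphabets, but it is an assumption that does not hold everywhere: in Russian, for instance, it would turn the consonant й into the vowel и.

```perl
#!perl
use v5.16;    # enables strict and the fc feature
use warnings;
use utf8;
use open qw(:std :utf8);
use Unicode::Normalize qw(NFD);

# Hypothetical per-language tables of base letters treated as vowels.
my %VOWELS = (
    el => 'αεηιουω',    # Greek
    cy => 'aeiouwy',    # Welsh
);

# Count the letters of $text that appear in the table for $lang.
sub count_vowels {
    my ($text, $lang) = @_;
    my $set = $VOWELS{$lang} // die "No vowel table for '$lang'\n";
    my %is_vowel = map { $_ => 1 } split //, $set;

    (my $base = NFD($text)) =~ s/\pM+//g;    # decompose, drop combining marks
    return scalar grep { $is_vowel{$_} } split //, fc($base);
}

say count_vowels('Cymraeg',  'cy');    # 3 (y, a, e)
say count_vowels('ἐπιστήμη', 'el');    # 4 (ε, ι, η, η)
```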

But if you're dealing with any written language based even loosely on the Latin alphabet (abcdefghijklmnopqrstuvwxyz), even if the language uses tons of diacritics (called "combining marks" in Perl and Unicode circles) (eg, Vietnamese), you can easily map those to "vowel" or "not-vowel", yes. The way is to "normalize-to-fully-decomposed-form", then strip-out all the combining marks, then fold-case, then compare each letter to regex /[aeiou]/. The following Perl script will find most-or-all "vowels" in any language using a Latin-based alphabet:

#!/usr/bin/perl -CSDA
# vowel-count.pl
use v5.20;
use Unicode::Normalize 'NFD';
my $vcount;
while (<>)
{
   $_ =~ s/[\r\n]+$//;                         # drop the line ending
   say "\nRaw string: $_";
   my $decomposed = NFD $_;                    # separate letters from combining marks
   my $stripped = ($decomposed =~ s/\pM//gr);  # remove the combining marks
   say "Stripped string: $stripped";
   my $folded = fc $stripped;                  # fold case so "A" and "a" compare equal
   my @base_letters = split //, $folded;       # split the folded string, not $stripped
   $vcount = 0;
   /[aeiou]/ and ++$vcount for @base_letters;
   say "# of vowels: $vcount";
}

Berlinda answered 22/2, 2021 at 0:13 Comment(2)

w and y are vowels in Welsh (Cymraeg). I18n is hard :-D – Eleanore 13/8, 2022 at 13:17

Hence my sentence "For such languages you'd have to find... or make look-up tables specifying which letters are 'vowels'." Really, to answer the question of whether a written symbol or spoken sound is a "vowel" or not, one needs to know what language is being used and what its phonetics and orthography are like. And have fun with the "edge cases" such as African bush languages that use clicking sounds not easily categorized as "consonant" OR "vowel". Human language is diverse indeed. – Berlinda 15/8, 2022 at 1:30
