How do I write regexes for German character classes like letters, vowels, and consonants?

Asked 19/4, 2013 at 9:39 Answered 29/3, 2014 at 13:19

ruby regex customization character-class metacharacters

For example, I set up these:

L = /[a-zA-ZßäüöÄÖÜ]/             # letters
V = /[äöüÄÖÜaeiouAEIOU]/          # vowels
K = /[ßb-zB-Z&&[^#{V.source}]]/   # consonants

So that /(#{K}#{V}{2})/ matches "ßäÜ" in "azAZßäÜ".

Are there any better ways of dealing with them?

Could I put those constants in a module in a file somewhere in my Ruby installation folder, so I can include/require them inside any new script I write on my computer? (I'm a newbie and I know I'm muddling this terminology; please correct me.)

Furthermore, could I get just the meta-characters \L, \V, and \K (or whatever isn't already set in Ruby) to stand for them in regexes, so I don't have to do that string interpolation thing all the time?

Marlowe asked 19/4, 2013 at 9:39 Comment(5)

Your approach seems pretty sound. You can shorten K like this if you like: /[ßb-zB-Z&&[^aeiouAEIOU]]/ – Wast 19/4, 2013 at 9:49

Oh thanks, good to know I can use that syntax! ^^ – Marlowe 19/4, 2013 at 10:20

Your "module in installation folder" is a gem. See guides.rubygems.org for more details. – Coburn 19/4, 2013 at 12:24

Oh, thanks, yes. I ended up just putting the constants in another file in the same folder and putting require './constants.rb' in any script in that folder that needs them. Works for now. – Marlowe 19/4, 2013 at 16:19

Be sure to look at the POSIX and Unicode script extensions to the standard Regexp character classes. They're already tested and battle-hardened. – Chang 18/11, 2013 at 14:0

You're off to a good start, but you need to look through the Regexp class code that is installed by Ruby. There are tricks for writing patterns that build themselves using String interpolation. You write the bricks as plain strings, let Ruby build the walls and the house with ordinary String operations, and then turn the resulting strings into true Regexp instances for use in your code.

For instance:

LOWER_CASE_CHARS = 'a-z'
UPPER_CASE_CHARS = 'A-Z'
CHARS = LOWER_CASE_CHARS + UPPER_CASE_CHARS
DIGITS = '0-9'

CHARS_REGEX = /[#{ CHARS }]/
DIGITS_REGEX = /[#{ DIGITS }]/

WORDS = "#{ CHARS }#{ DIGITS }_"
WORDS_REGEX = /[#{ WORDS }]/

You keep building from small atomic characters and character classes and soon you'll have big regular expressions. Try pasting those one by one into IRB and you'll quickly get the hang of it.
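Applied to the German classes from the question, the same approach might look like the sketch below. The module name GermanChars and its constant names are made up for illustration; the letter and vowel sets are the ones given in the question.

```ruby
# Hypothetical module: the name GermanChars and its constants are illustrative.
module GermanChars
  # The "bricks": character lists kept as plain Strings, without brackets.
  VOWEL_CHARS  = 'aeiouAEIOUäöüÄÖÜ'
  LETTER_CHARS = 'a-zA-ZäöüÄÖÜß'

  # The "walls": Regexp objects built by interpolating the Strings.
  LETTER    = /[#{LETTER_CHARS}]/
  VOWEL     = /[#{VOWEL_CHARS}]/
  # A consonant is any letter that is not a vowel (&& intersects two classes).
  CONSONANT = /[#{LETTER_CHARS}&&[^#{VOWEL_CHARS}]]/
end

# One consonant followed by two vowels, as in the question.
pattern = /(#{GermanChars::CONSONANT}#{GermanChars::VOWEL}{2})/
puts 'azAZßäÜ'[pattern, 1]   # prints ßäÜ
```

Keeping the bricks as Strings also sidesteps a pitfall: interpolating a Regexp object inside square brackets inserts its to_s form, something like (?-mix:[...]), and those extra characters then become part of the class. If you only have a Regexp at hand, interpolating its source is probably the safer choice. If the module lives in its own file, say german_chars.rb (again a made-up name), then require_relative 'german_chars' in a script in the same folder should load it.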

Chang answered 18/11, 2013 at 15:10 Comment(0)

A small improvement on what you do now would be to use the regex engine's Unicode support for categories or scripts.

If you mean L to be any letter, use \p{L}. Or use \p{Latin} if you want it to mean any character in the Latin script (which all German letters are).

I don't think there are built-ins for vowels and consonants.

See \p{L} match your example.
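A quick check along these lines can be pasted into IRB. It is only a sketch: the constant names are mine, and the vowel list is copied from the question, since there seems to be no built-in class for vowels or consonants.

```ruby
text = 'azAZßäÜ'

# \p{L} matches any Unicode letter, so ß and the umlauts are covered.
puts text.scan(/\p{L}/).join           # prints azAZßäÜ

# \p{Latin} only accepts characters of the Latin script; Cyrillic я is skipped.
puts 'aäßя'.scan(/\p{Latin}/).join     # prints aäß

# The POSIX bracket [[:alpha:]] is Unicode-aware for UTF-8 strings as well.
puts 'ä'.match?(/[[:alpha:]]/)         # prints true

# Vowels come from a hand-written list; consonants are the letters that are left.
UNICODE_VOWEL     = /[aeiouAEIOUäöüÄÖÜ]/
UNICODE_CONSONANT = /[\p{L}&&[^aeiouAEIOUäöüÄÖÜ]]/
puts text[/#{UNICODE_CONSONANT}#{UNICODE_VOWEL}{2}/]   # prints ßäÜ
```

As for custom metacharacters such as \L or \V: as far as I know, Ruby offers no way to define new backslash escapes, and \K already has a meaning of its own in recent versions, so interpolating named constants is probably as close as you can get.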

Lip answered 29/3, 2014 at 13:19 Comment(0)
