What is the regular expression for a Spanish word?

3

7

Regular expression languages use \w to match A..Z, a..z, 0..9, and _, and \b is defined as a word boundary.

How can I write a regular expression that matches all valid Spanish words, including characters such as: á, í, ó, é, ñ, etc.?

I'm using .NET.

Demount answered 22/5, 2009 at 4:40
6

Use a Spanish locale and make your regex locale-sensitive.

Kolnick answered 22/5, 2009 at 4:45
2

This depends heavily on the language (and regex engine) you're using.

In Perl, \w matches all word characters, regardless of language or alphabet, and something like /\b(\w+)\b/ would (probably) match Spanish words as well as English words or Russian words.

In languages using PCRE, \w (and therefore \b) does NOT match non-ASCII characters by default. You will probably need to build your own set. I suggest something like [\wáéíóúüñ] (all word characters plus the accented characters you want; add the uppercase forms if you need them). Note that the PCRE library has to be built with Unicode support before this will work at all.
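The explicit-set idea can be sketched as follows. This is a minimal illustration in Python rather than PCRE, and it rests on a few assumptions: the name SPANISH_WORD and the sample sentence are hypothetical, re.ASCII is used only to imitate an engine whose \w is ASCII-only, and the set is extended with ü and the uppercase forms.

```python
import re

# Hypothetical pattern name. re.ASCII restricts \w to [a-zA-Z0-9_], imitating
# an engine without Unicode-aware \w, so Spanish letters are listed explicitly.
SPANISH_WORD = re.compile(r"[\wáéíóúüñÁÉÍÓÚÜÑ]+", re.ASCII)

# Made-up sample sentence, for illustration only.
text = "El niño comió piña y vio un pingüino."

words = SPANISH_WORD.findall(text)
print(words)
```

Because \w also covers digits and the underscore, a stricter letters-only variant would replace \w with A-Za-z inside the brackets. The sketch also assumes the text uses precomposed characters (é as one code point); text with combining accents would probably need normalizing first.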

If you're using something else, good luck. Some regex engines don't even support Unicode.

Hobgoblin answered 22/5, 2009 at 4:51
1

Your regex system should have something equivalent to Python's re.L (aka re.LOCALE) to make a regex locale-dependent, so that what's a word-character and what isn't changes with locale, as do "word boundaries" etc. Are you instead asking for a way to compensate for some given regex system not supporting locale, trying to force the issue anyway...?
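As a hedged illustration of how the engine's notion of a "word character" changes the result, here is a small Python 3 sketch. It does not use re.LOCALE itself: in Python 3 that flag applies only to bytes patterns, and str patterns are already Unicode-aware by default. The sample text and variable names are made up for the example.

```python
import re

# Made-up sample text, for illustration only.
text = "Mañana será otro día"

# In Python 3, str patterns are Unicode-aware by default, so \w and \b
# already treat ñ, á and í as word characters without any locale flag.
unicode_words = re.findall(r"\b\w+\b", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters act as
# non-word characters and split the words around them.
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

The second result shows what an ASCII-only engine does to Spanish text: each accented letter acts as a word boundary, which is the behaviour a locale-aware or Unicode-aware mode is meant to avoid.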

Intent answered 22/5, 2009 at 4:45
