How can I make a regular expression which takes accented characters into account?

I have a JavaScript regular expression which basically finds two-letter words. The problem seems to be that it interprets accented characters as word boundaries. Indeed, it seems that

A word boundary ("\b") is a spot between two characters that has a "\w" on one side of it and a "\W" on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a "\W". (Source: "AS3 RegExp to match words with boundry type characters in them")

And since

\w matches any alphanumerical character (word characters) including underscore (short for [a-zA-Z0-9_]). \W matches any non-word characters (short for [^a-zA-Z0-9_]). (Source: http://www.javascriptkit.com/javatutors/redev2.shtml)

obviously accented characters are not taken into account. This becomes a problem with words like Montréal. If the é is treated as a non-word character, there is a word boundary on either side of it, and so al is matched as a two-letter word. I have tried making my own definition of a word boundary which would allow for accented characters, but seeing as a word boundary isn't even a character, I don't exactly know how to go about finding it.

Any help?

Here is the relevant JavaScript code, which searches userInput and finds two-letter words using the re_state regular expression:

var re_state = new RegExp("\\b([a-z]{2})[,]?\\b", "mi");
var match_state = re_state.exec(userInput);
document.getElementById("state").value = (match_state)?match_state[1]:"";
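The behaviour described above can be reproduced with a minimal sketch. It assumes the same pattern as re_state, written as a regex literal, and a hard-coded input in place of userInput; the names reState and state are only illustrative.

```javascript
// Same pattern as re_state above, written as a regex literal.
const reState = /\b([a-z]{2})[,]?\b/mi;

// "Montr\u00e9al" is "Montréal" with a precomposed é (U+00E9).
// \w is ASCII-only here, so é counts as a non-word character and
// \b finds a boundary between "é" and "a".
const match = reState.exec("Montr\u00e9al");
const state = match ? match[1] : "";

console.log(state); // "al"
```

The tail of the city name is reported as a two-letter word, which is the false positive the question is about.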
Calvary answered 12/9, 2010 at 4:28 Comment(0)

While JavaScript regexes recognize non-ASCII characters in some cases (like \s), they're hopelessly inadequate when it comes to \w and \b. If you want those to work with anything beyond the ASCII word characters, you'll have to either use a different language, or install Steve Levithan's XRegExp library with the Unicode plugin.

By the way, there's an error in your regex. You have a \b after the optional trailing comma, but it should be in front:

"\\b([a-z]{2})\\b,?"

I also removed the square brackets; you would only need those if the comma had a special meaning in regexes, which it doesn't. But I suspect you don't need to match the comma at all; \b should be sufficient to make sure you're at the end of the word. And if you don't need the comma, you don't need the capturing group either:

"\\b[a-z]{2}\\b"
Gonsalez answered 12/9, 2010 at 7:27 Comment(7)
@Alan Moore: What's the difference between using the literal and the constructor? The difference I found is that with the constructor I can build a regexp from the matches of previous regular expressions, for example: var re_address = new RegExp(match_buildingNumber[0] + match_street[0] + match_city[0] + "?", "mi"); That kind of thing is, to my knowledge, impossible with a regexp literal. - Calvary
Okay, if you've got a good reason for using the constructor, by all means use it. I just wanted to make sure you were aware of the regex-literal option. - Gonsalez
@Alan Moore: OK, thanks! But I'm still a bit curious. What IS the difference between the two? Why should one prefer using the literal when possible? Also, I downloaded XRegExp and the Unicode plugin, but I still don't see how to use it for what I want. I guess there would be an Lm (modifier letter) somewhere in there? - Calvary
It's just that, with the constructor, you're writing the regex in the form of a string literal, which has its own set of escaping rules. For example, if you forgot to escape the backslashes in your regex, you'd be looking for a word surrounded by backspaces, not a word surrounded by word boundaries. - Gonsalez
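The escaping difference described in the last comment can be illustrated with a short sketch. The variable names and the sample string are made up for the example; only the patterns come from the thread.

```javascript
// Literal form: each backslash is written once.
const literal = /\b[a-z]{2}\b/i;

// Constructor form: the pattern is a string, so each backslash is doubled.
const escaped = new RegExp("\\b[a-z]{2}\\b", "i");

// Forgetting to double them: "\b" inside a string literal is a
// backspace character (U+0008), not a word-boundary assertion.
const unescaped = new RegExp("\b[a-z]{2}\b", "i");

const sample = "Albany, NY 12207"; // hypothetical input

const literalHit = literal.test(sample);     // true
const escapedHit = escaped.test(sample);     // true
const unescapedHit = unescaped.test(sample); // false: no backspaces in sample
const sameSource = literal.source === escaped.source; // true
```

The first two forms compile to the same pattern, so the constructor is only needed when the pattern has to be assembled from strings at run time.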
@Alan, I’ve posted an answer here that shows how to do this properly. You have to create a UCA collator object whose comparison strength is set to primary only. You can do this in Perl, Python, or Java, although of those only Perl comes with the necessary classes in its base distribution. I don’t think JavaScript has any of the standards-compliant objects needed to do this, though. - Beckman
Does ES6 or any other update provide a fix for this? I'm running into it as well: \w stops at accented alphanumeric characters. FYI, I'm using JS in the browser, so switching to another language is not an option. - Walk
Update: I found a post that defines a set that finds word characters with accents (#20690999). It seems that [A-Za-zÀ-ÖØ-öø-ÿ0-9_] more closely matches what \w should cover. But comments indicate it matches Latin characters and not Cyrillic or others, so caveats apply. - Walk
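On the ES6 question above: one possible approach, assuming an engine that supports ES2018 Unicode property escapes and lookbehind (neither existed when this thread started), is to replace \b with explicit lookarounds over a Unicode-aware word class. The names WORD, reTwoLetter and findTwoLetterWord are hypothetical, and this is a sketch rather than a drop-in replacement for \b.

```javascript
// Characters treated as part of a word: any Unicode letter,
// combining mark, or number, plus the underscore.
const WORD = "[\\p{L}\\p{M}\\p{N}_]";

// Two letters that are neither preceded nor followed by a word character.
// The "u" flag is required for \p{...} to be recognized.
const reTwoLetter = new RegExp(`(?<!${WORD})(\\p{L}{2})(?!${WORD})`, "u");

function findTwoLetterWord(input) {
  // Normalize to NFC so that "e" + combining accent becomes a single letter.
  const m = reTwoLetter.exec(input.normalize("NFC"));
  return m ? m[1] : "";
}

console.log(findTwoLetterWord("Montr\u00e9al, QC")); // "QC"
console.log(findTwoLetterWord("Montr\u00e9al"));     // ""
```

Unlike the Latin-only character range quoted above, \p{L} also covers Cyrillic and other scripts, at the cost of requiring a reasonably recent engine.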

Have you set JavaScript to use non-ASCII? Here is a page that suggests setting JavaScript to use UTF-8: http://blogs.oracle.com/shankar/entry/how_to_handle_utf_8

It says:

add a charset attribute (charset="utf-8") to your script tags in the parent page:

<script type="text/javascript" src="[path]/myscript.js" charset="utf-8"></script>
Receiptor answered 12/9, 2010 at 5:10 Comment(3)
Yeah, the type attribute isn't even in HTML5 as it isn't supported by browsers, it's a mistake people made when interpreting the spec. The charset meta tag works, but charset in links isn't a real thing. - Corollaceous
@Rich Bradshaw: I do have <meta http-equiv="content-type" content="text/html; charset=utf-8" /> in my head section. Is that what you mean? - Calvary
That's wrong too. The speech marks people added for XHTML should define two attributes: content and charset, but popular wisdom put them in the same speech marks with a semicolon for some reason! Browsers do parse that and make it work though. Check the HTML5 version of this for the best/conforming way to do it. Charset on JS and CSS has never worked though and is pointless to add. - Corollaceous
