Why does \w match only English words in JavaScript regex?
F

10

11

I'm trying to find URLs in some text using JavaScript. The problem is that the regular expression I'm using relies on \w to match letters and digits inside the URL, but \w doesn't match non-English characters (in my case, Hebrew letters).

So what can I use instead of \w to match all letters in all languages?

Felly answered 29/12, 2008 at 14:17 Comment(0)
D
17

Because \w only matches the ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew characters and other non-ASCII characters (for example, umlaut-o or tilde-n) are outside of those ranges.
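A quick check makes the limitation visible. This is only a minimal sketch: the helper name `isAsciiWord` is made up for illustration, and it assumes a plain regex with no flags.

```javascript
// \w is shorthand for [A-Za-z0-9_], so only ASCII letters, digits and
// the underscore qualify.
function isAsciiWord(text) {
  return /^\w+$/.test(text);
}

console.log(isAsciiWord("hello_123")); // true
console.log(isAsciiWord("שלום"));      // false: Hebrew letters are not in [A-Za-z0-9_]
console.log(isAsciiWord("ñandú"));     // false: accented Latin letters are excluded too
```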

Instead of matching non-ASCII characters (there are so many of them, in many different Unicode ranges), you might be better off looking for the characters that delimit your words: spaces, quotation marks, and other punctuation.
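One way to apply that idea to URLs is a negated character class that stops at delimiters rather than listing every allowed letter. The sketch below is an assumption-laden example, not a full URL parser: the function name `findUrls`, the `http(s)://` prefix and the chosen delimiter set are all illustrative.

```javascript
// Assumption: a URL starts with http:// or https:// and runs until
// whitespace, a quotation mark, or an angle bracket.
// Real-world text may need a different delimiter set.
function findUrls(text) {
  return text.match(/https?:\/\/[^\s"'<>]+/g) || [];
}

const sample = 'go to http://example.com/שלום now, or "https://a.b/c"';
console.log(findUrls(sample));
```

Because the class says what ends a URL instead of what may appear inside it, Hebrew (or any other script) passes through without special handling.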

Dygall answered 29/12, 2008 at 14:22 Comment(2)
Thanks. For the inner parts of the URL I ended up matching everything except space, '.' and '/'. Anything else I might be missing? - Felly
Perhaps the colon, ':', which can separate the host from a port number. - Dygall
O
7

The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s on the other hand matches both ASCII and Unicode whitespace, according to the standard.
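The difference between the three shorthand classes can be checked directly. A small sketch, where the object name `checks` and the sample characters are chosen only for illustration:

```javascript
// \d and \w are ASCII-only; \s also covers Unicode whitespace.
const checks = {
  asciiDigit: /\d/.test("7"),            // ASCII digit
  arabicIndicDigit: /\d/.test("\u0663"), // ARABIC-INDIC DIGIT THREE
  underscore: /\w/.test("_"),            // underscore is part of \w
  noBreakSpace: /\s/.test("\u00A0"),     // NO-BREAK SPACE
  emSpace: /\s/.test("\u2003"),          // EM SPACE
};

console.log(checks);
// { asciiDigit: true, arabicIndicDigit: false, underscore: true,
//   noBreakSpace: true, emSpace: true }
```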

JavaScript (as defined by that version of the standard) does not support the \p syntax for matching Unicode properties either, so there isn't a good built-in way to do this. You could match all Hebrew characters with:

[\u0590-\u05FF]

This simply matches any code point in the Hebrew block.

You can match any ASCII word character or any Hebrew character with:

[\w\u0590-\u05FF]
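
As a usage sketch of the combined class, anchored so the whole string must match (the variable name `asciiOrHebrewWord` and the sample strings are illustrative only):

```javascript
// ASCII word characters plus any code point in the Hebrew block
// (U+0590 through U+05FF).
const asciiOrHebrewWord = /^[\w\u0590-\u05FF]+$/;

console.log(asciiOrHebrewWord.test("שלום"));             // true
console.log(asciiOrHebrewWord.test("shalom_שלום_2008")); // true
console.log(asciiOrHebrewWord.test("日本語"));            // false: outside both ranges
```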
Overstride answered 30/12, 2008 at 13:33 Comment(0)
S
5

I think you are looking for this regex:

^[אבגדהוזחטיכלמנסעפצקרשתץףןםa-zA-Z0-9\s\.\-_\\\/]+$
Shantay answered 16/9, 2010 at 6:33 Comment(1)
Welcome to Stack Overflow. I never tried, but א-ת may work as well, even including the final letters - en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet .Madeup
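The range suggested in the comment can be checked with a short sketch. The code points are written as escapes to avoid bidirectional-text confusion in the source; `hebrewLetters` and `finalForms` are illustrative names.

```javascript
// [\u05D0-\u05EA] is the same range as א-ת (ALEF through TAV).
const hebrewLetters = /^[\u05D0-\u05EA]+$/;

// The five final forms: final kaf, final mem, final nun, final pe, final tsadi.
const finalForms = "\u05DA\u05DD\u05DF\u05E3\u05E5";

console.log(hebrewLetters.test(finalForms)); // true: the finals sit inside the range
console.log(hebrewLetters.test("\u05B7"));   // false: vowel points (here PATAH) are outside it
```

Note that this range covers letters only; pointed (vocalized) text would need the wider Hebrew block.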
K
4

I've just found XRegExp, which has not been mentioned yet, and I'm quite impressed with it. It is an alternative regular expression implementation, has a Unicode plugin, and is released under the MIT license.

According to the website, to match Unicode letters you'd use code like this:

var unicodeWord = XRegExp("^\\p{L}+$");

unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true
Kaleb answered 16/9, 2011 at 9:26 Comment(1)
I've just integrated this tool in our project and it works well.Kaleb
T
2

Try \p{L}, the Unicode property escape that matches letters.
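
For reference, engines that implement ECMAScript 2018 or later accept `\p{L}` natively, but only when the regex carries the `u` flag; older engines do not treat it as a property escape. A minimal sketch, assuming such an engine (the variable name `unicodeLetters` is illustrative):

```javascript
// Requires an ES2018+ engine; the `u` flag enables Unicode property escapes.
const unicodeLetters = /^\p{L}+$/u;

console.log(unicodeLetters.test("שלום"));    // true
console.log(unicodeLetters.test("Русский")); // true
console.log(unicodeLetters.test("日本語"));   // true
console.log(unicodeLetters.test("abc123"));  // false: digits are not letters
```

For a closer analogue of \w, a class such as `[\p{L}\p{N}_]` adds digits and the underscore.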

Toniatonic answered 26/4, 2013 at 16:2 Comment(0)
E
1

Perhaps \S (non-whitespace).

Explain answered 29/12, 2008 at 14:21 Comment(0)
B
1

Have a look at http://www.regular-expressions.info/refunicode.html.

It looks like there is no \w equivalent for Unicode, but you can match individual Unicode letters, so you can build one yourself.

Bertsche answered 29/12, 2008 at 14:22 Comment(1)
This page has a more thorough explanation and listing of character patterns: regular-expressions.info/unicode.htmlDarton
T
1

Check out this SO question about JavaScript and Unicode. It looks like Jan Goyvaerts's answer there provides some hope for you.

Edit: But then it seems that no browser supports \p ... anyway. That question should still contain useful info.

Tollbooth answered 29/12, 2008 at 14:22 Comment(1)
Too bad. \p would have been just what the doctor ordered.Tollbooth
B
1

If you're the one generating URLs with non-English letters in them, you may want to reconsider.

If I'm interpreting the W3C correctly, URLs may only contain word characters within the latin alphabet.

Brierwood answered 29/12, 2008 at 15:36 Comment(2)
Sadly I can't control the url-creation, and they almost always will contain Hebrew Characters.Felly
That's not true - Russian symbols are permitted too, and also other symbols from other alphabetsTashia
S
-1

Note that URIs (a superset of URLs) are specified, in RFC 3986, to allow only US-ASCII characters. Normally all other characters should be represented in percent-encoded form:

In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI. // URI: Generic Syntax

This is what generally happens when you open a URL with non-ASCII characters in a browser: they get translated into %XX notation, which, in turn, is US-ASCII.

If it is possible to influence the way the material is created, the best option would be to pass URLs through a urlencode()-type function during their creation.
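
In JavaScript, the built-in counterpart to a urlencode()-style function is `encodeURIComponent` (or `encodeURI` for a whole URL). A sketch of encoding a single path segment, where the helper `buildUrl` and the host `example.com` are illustrative assumptions:

```javascript
// Percent-encode one path segment as UTF-8 octets so the resulting URL
// contains only US-ASCII characters.
function buildUrl(base, segment) {
  return base + "/" + encodeURIComponent(segment);
}

const encoded = buildUrl("http://example.com", "שלום");
console.log(encoded); // http://example.com/%D7%A9%D7%9C%D7%95%D7%9D

// Decoding restores the original text.
console.log(decodeURIComponent("%D7%A9%D7%9C%D7%95%D7%9D")); // שלום
```

Once URLs are encoded this way, a plain ASCII pattern is enough to find them in text.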

Sibel answered 30/12, 2008 at 14:50 Comment(0)
