Why is /[\w-+]/ a valid regex but /[\w-+]/u invalid?

If I type /[\w-+]/ in the Chrome console, it accepts it. I get a regex object I can use to test strings as usual. But if I type /[\w-+]/u, it says VM112:1 Uncaught SyntaxError: Invalid regular expression: /[\w-+]/: Invalid character class.

In Firefox, /[\w-+]/ works fine, but if I type /[\w-+]/u in the console, it just goes to the next line as if I typed an incomplete statement. If I try to force it to create the regex by running eval('/[\w-+]/u'), it tells me SyntaxError: invalid range in character class.

Why does the u flag make the regex invalid? The MDN RegExp documentation says u enables some Unicode features, but I don't see anything about how it affects ranges in character classes.

Spurge answered 15/1, 2019 at 18:55 Comment(16)
What is it supposed to mean to start with? [\w-+] doesn't make any kind of sense. It looks like one of the engines is too lenient.Carmina
As per Deny's comment, the actually unexpected thing is that [\w-+] gets accepted at all. The range "from any word character to the plus symbol" makes absolutely no sense, so if you want to match the minus symbol, escape it: [\w\-+], and that'll work in simple as well as unicode matching.Pooka
I don't have any immediate practical purpose by asking this question besides satisfying my curiosity. I can't go into too much detail, but we have a (somewhat badly designed IMO) internal library where I work, and I'm trying to figure out how it works and why I have problems when I try to use one of the regexes as the pattern attribute of an HTML input element. I can't figure out what the person who wrote the code was trying to do, and I am not implying that using character classes in ranges is a smart idea.Spurge
@Mike'Pomax'Kamermans I think many regexp engines have heuristics to determine whether - is a literal character or range delimiter inside []. Looks like JS doesn't treat it as a range delimiter when it's after an escape sequence, since it wouldn't make sense.Unpractical
@Barmar, I thought - was only treated as a literal character if it is the last character in the range. /[\w-+]/.test('-') returns true so you might be right, but that is not definite proof.Spurge
Usually, - is treated as a literal if it appears in a position where it cannot be interpreted as indicating a range. I'm searching in the normative document for confirmation. That doesn't explain why it doesn't work with u, though.Drambuie
The u modifier makes the regex engine parse the pattern more strictly. Characters that do not need to be escaped must not be escaped, and those that do must be escaped. All ambiguity must be avoided.Misinterpret
@EliasZamaria MDN actually says it's treated literally if it's first or last. I tried finding this in the W3C spec, but it's really confusing. Since this differs between Chrome and FF, it seems like an implementation-specific extension to be more lenient. To be conforming, the programmer should only use - in the allowed places, or escape it.Unpractical
@Barmar, what do you mean "this differs between Chrome and FF"? Both browsers accept the regex without the u flag and reject it with the u flag. The only real differences are the details of how they handle it and the error messages they produce.Spurge
@EliasZamaria FF does not support the latest regex features outlined in the ECMAScript 2018 standard, while Chrome does: named groups, the s modifier, infinite-length lookbehind. As for u, in Chrome, you may use Unicode property classes like \p{L}.Misinterpret
You're right, misread the question's description of the difference.Unpractical
Okay, so ECMA-262, page 570, note 3, says that "a - character can be treated literally or it can denote a range. It is treated literally if it is the first or last character of ClassRanges, the beginning or end limit of a range specification, or immediately follows a range specification".Drambuie
And: ClassRanges can expand into a single ClassAtom and/or ranges of two ClassAtom separated by dashes. In the latter case the ClassRanges includes all characters between the first ClassAtom and the second ClassAtom, inclusive; an error occurs if either ClassAtom does not represent a single character (for example, if one is \w) or if the first ClassAtom's character value is greater than the second ClassAtom's character value. (link)Misinterpret
This ECMA stuff seems useful, but is dense and complex and a bit over my head. What is a range specification? I searched that huge document for "range specification" and only got 2 matches, neither of which really explained what it means.Spurge
@WiktorStribiżew, your quote seems to explain why the regex causes an error. But I don't see anything about why the error only happens with a u flag.Spurge
I hope Mathias Bynens will drop in to share his thoughts.Misinterpret
M
8

Within a RegExp character set, a hyphen-minus character (your standard keyboard dash) denotes a range of character codes between the two characters it separates. The exceptions are when it is escaped (\-) or when it does not separate two characters because it is either the final character of the class or it is the first character (after the optional caret that inverts the class).
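
As a quick illustration of these rules, here is a minimal sketch (the variable names are only for this example) that checks when the hyphen acts as a range operator and when it is a literal:

```javascript
// Hyphen between two single characters: a range over code points 0x61..0x7a.
const range = /^[a-z]$/;
// Hyphen escaped, last, first, or first after the caret: a literal '-'.
const escaped = /^[a\-z]$/;
const last = /^[az-]$/;
const first = /^[-az]$/;
const negatedFirst = /^[^-az]$/;

const hyphenChecks = {
  rangeMatchesM: range.test('m'),              // 'm' lies between 'a' and 'z'
  rangeMatchesHyphen: range.test('-'),         // '-' is outside a..z
  escapedMatchesM: escaped.test('m'),          // only 'a', '-', 'z' are listed
  escapedMatchesHyphen: escaped.test('-'),
  lastMatchesHyphen: last.test('-'),
  firstMatchesHyphen: first.test('-'),
  negatedRejectsHyphen: !negatedFirst.test('-'),
  negatedMatchesM: negatedFirst.test('m'),
};

console.log(hyphenChecks);
```

Only the first pattern treats the hyphen as a range; in the other four it is just another member of the class.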

Three examples of character ranges: a simple example, an advanced example, and a bug:

  • [a-z] is pretty straightforward because it works the way we expect it to, though this is actually because the character codes happen to be sequential. Another way of writing this is [\x61-\x7a]
  • [!-~] is not at all straightforward, at least until you look at a character map and learn that ! is the first printable ASCII character and ~ is the last (of "lower ASCII"), so this is a way of saying "all printable lower ASCII characters" and it is the equivalent of [\x21-\x7e]
  • [A-z] has a switched case in it. You may dislike the fact that there are six non-letter characters accepted by this range (which is [\x41-\x7a])
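
If you would rather verify these three ranges than trust a character map, the following sketch enumerates the ASCII characters each class accepts. The helper name asciiMatches is made up for this example.

```javascript
// Collect every ASCII character (codes 0..127) accepted by a one-character class.
function asciiMatches(re) {
  let out = '';
  for (let code = 0; code < 128; code++) {
    const ch = String.fromCharCode(code);
    if (re.test(ch)) out += ch;
  }
  return out;
}

// The six non-letters that sneak into [A-z] (codes 0x5b..0x60).
const extras = asciiMatches(/^[A-z]$/).replace(/[A-Za-z]/g, '');
// Number of characters in [!-~], i.e. 0x21..0x7e inclusive.
const printableCount = asciiMatches(/^[!-~]$/).length;
// [a-z] and [\x61-\x7a] accept exactly the same characters.
const sameLowercase = asciiMatches(/^[a-z]$/) === asciiMatches(/^[\x61-\x7a]$/);

console.log(extras);          // [\]^_`
console.log(printableCount);  // 94
console.log(sameLowercase);   // true
```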

[Image: ASCII table]


Now let's examine your regex of /[\w-+]/u. Regex101 has a more informative error:

You can not create a range with shorthand escape sequences

Since \w is not itself a character (but rather a collection of characters), an abutting dash must either be taken literally or else trigger an error. When you invoke it with the /u flag to trigger fullUnicode, you enter a stricter mode and therefore get an error.
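
To see both behaviours side by side without the console aborting on the syntax error, you can build the pattern at runtime with the RegExp constructor. This is only a sketch; tryCompile is a hypothetical helper written for this example.

```javascript
// Compile at runtime so a SyntaxError can be caught instead of
// stopping the whole script when the literal is parsed.
function tryCompile(source, flags) {
  try {
    return { ok: true, regex: new RegExp(source, flags) };
  } catch (err) {
    return { ok: false, error: err };
  }
}

// String.raw keeps the backslash, so the pattern text is exactly [\w-+].
const source = String.raw`[\w-+]`;

const legacyResult = tryCompile(source, '');
const unicodeResult = tryCompile(source, 'u');

console.log(legacyResult.ok);                           // true
console.log(legacyResult.regex.test('-'));              // true: the dash is taken literally
console.log(unicodeResult.ok);                          // false
console.log(unicodeResult.error instanceof SyntaxError); // true
```

The exact error message differs between engines, so the sketch only checks the error type.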

The error I get from "foo".match(/[\w-+]/u) in Firefox 64.0 is:

SyntaxError: character class escape cannot be used in class range in regular expression

This is slightly more informative than the error you got since it actually tells you the problem is with the escape (though not why it's a problem).

According to ECMAScript 2015's RegExpBuiltinExec() logic:

  1. If fullUnicode is true, then
       a. e is an index into the Input character list, derived from S, matched by matcher. Let eUTF be the smallest index into S that corresponds to the character at element e of Input. If e is greater than or equal to the length of Input, then eUTF is the number of code units in S.
       b. Let e be eUTF.

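Strictly speaking, this step translates the end index of a match from code points back to UTF-16 code units rather than parsing ranges. It does show that fullUnicode sends the engine down its own code path; the range check itself belongs to the pattern semantics (the CharacterRange operation).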
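
As a side note, the observable effect of the quoted steps is on the indices a match reports. The sketch below assumes nothing beyond standard RegExp behaviour and uses a character outside the Basic Multilingual Plane, which occupies two UTF-16 code units:

```javascript
// U+1F600 is one code point but two code units, so text.length is 3.
const text = '\u{1F600}!';

const withU = /./gu;
const matchWithU = withU.exec(text);

const withoutU = /./g;
const matchWithoutU = withoutU.exec(text);

// With u, '.' consumes the whole code point and lastIndex is reported
// in code units (2). Without u, '.' consumes a single code unit.
console.log(matchWithU[0].length, withU.lastIndex);       // 2 2
console.log(matchWithoutU[0].length, withoutU.lastIndex); // 1 1
```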


The solution is to either escape your hyphen-minus or else put it last (or first):

/[\w\-+]/u or /[\w+-]/u or /[-\w+]/u. I personally always put it last.
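
A short sanity check, with sample characters picked arbitrarily for this example, shows that all three spellings compile under the u flag and accept the same characters:

```javascript
const fixes = [/[\w\-+]/u, /[\w+-]/u, /[-\w+]/u];
const samples = ['a', '_', '7', '-', '+', ' ', '*'];

// For each pattern, keep the sample characters it accepts.
const accepted = fixes.map((re) => samples.filter((s) => re.test(s)).join(''));

for (let i = 0; i < fixes.length; i++) {
  console.log(fixes[i].source, '->', accepted[i]);
}
// Each pattern accepts "a_7-+" and rejects the space and the asterisk.
```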

Managua answered 15/1, 2019 at 20:37 Comment(2)
The first solution, /[\w\-+]/u, triggers an error in regex101 with ECMAScript selected: This token has no special meaning and has thus been rendered erroneous (Chrome 102.0)Branca
@Branca – Interesting, regex101 does that for me on Firefox 100.0.2 as well. However, that's a bug in regex101, not your browser; ECMAScript's Unicode-mode grammar explicitly permits \- inside a character class. Pop open your developer console (F12 on Windows/Linux) and run "foo-bar baz".match(/[\w\-+]/ug) and you'll see it matches everything but the space. No errors. (Chrome and FF use the same regex interpreter, so I'm confident you'll see this replicate.)Managua
I
7

There is a report for this: V8 implementation: does unicode property escapes behavior in character classes range differ from other classes intentionally?.


I took a look at V8 source code (regexp-parser.cc) and found this:

if (is_class_1 || is_class_2) {
    // Either end is an escaped character class. Treat the '-' verbatim.
    if (unicode()) {
       // ES2015 21.2.2.15.1 step 1.
       return ReportError(CStrVector(kRangeInvalid));
    }

kRangeInvalid is a constant that holds Invalid character class.

21.2.2.15.1 step 1.

If A does not contain exactly one character or B does not contain exactly one character, throw a SyntaxError exception.
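
To make that step concrete, here is a rough model of the check in plain JavaScript. It is not V8's code or the spec's wording: character sets are modelled as Sets of code points, and the function name, error messages, and wordChars helper are invented for this sketch.

```javascript
// Hypothetical model of the range check: A and B are Sets of code points.
function characterRange(A, B) {
  // Step 1: both ends of a range must be exactly one character.
  if (A.size !== 1 || B.size !== 1) {
    throw new SyntaxError('range endpoint is not a single character');
  }
  const [i] = A;
  const [j] = B;
  // A later step: the range must not run backwards.
  if (i > j) {
    throw new SyntaxError('range out of order');
  }
  const result = new Set();
  for (let cp = i; cp <= j; cp++) result.add(cp);
  return result;
}

// \w modelled as the 63 code points of [A-Za-z0-9_].
const wordChars = new Set();
for (let cp = 0; cp < 128; cp++) {
  if (/\w/.test(String.fromCharCode(cp))) wordChars.add(cp);
}

const aToZ = characterRange(new Set([0x61]), new Set([0x7a]));
console.log(aToZ.size); // 26

let rangeError = null;
try {
  characterRange(wordChars, new Set([0x2b])); // models \w-+
} catch (err) {
  rangeError = err;
}
console.log(rangeError instanceof SyntaxError); // true
```

In this model \w-+ fails because the left endpoint holds 63 characters instead of one, which is the condition the V8 snippet reports as kRangeInvalid when the u flag is set.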

Isola answered 15/1, 2019 at 21:13 Comment(0)
