Java regex alternation operator "|" behavior seems broken
Asked Answered
G

2

12

Trying to write a regex matcher for roman numerals. In sed (which I think is considered 'standard' for regex?), if you have multiple options delimited by the alternation operator, it will match the longest. Namely, "I|II|III|IV" will match "IV" for "IV" and "III" for "III"

In Java, the same pattern matches "I" for "IV" and "I" for "III". Turns out Java chooses between alternation matches left-to-right; that is, because "I" appears before "III" in the regex, it matches. If I change the regex to "IV|III|II|I", the behavior is corrected, but this obviously isn't a solution in general.

Is there a way to make Java choose the longest match out of an alternation group, instead of choosing the 'first'?

A code sample for clarity:

public static void main(String[] args)
{
    Pattern p = Pattern.compile("six|sixty");
    Matcher m = p.matcher("The year was nineteen sixty five.");
    if (m.find())
    {
        System.out.println(m.group());
    }
    else
    {
        System.out.println("wtf?");
    }
}

This outputs "six"

Goulash answered 23/12, 2010 at 2:5 Comment(1)
Where there might be a (not-so-good) way. Separate the above pattern into separate patterns and perform the adjudication yourself.Spadework
F
20

No, it's behaving correctly. Java uses an NFA, or regex-directed flavor, like Perl, .NET, JavaScript, etc., and unlike sed, grep, or awk. An alternation is expected to quit as soon as one of the alternatives matches, not hold out for the longest match.

You can force it to continue by adding a condition after the alternation that can't be met until the whole token has been consumed. What that condition might be depends on the context; the simplest option would be an anchor ($) or a word boundary (\b).

"\\b(I|II|III|IV)\\b"

EDIT: I should mention that, while grep, sed, awk and others traditionally use text-directed (or DFA) engines, you can also find versions of some of them that use NFA engines, or even hybrids of the two.

Filth answered 23/12, 2010 at 2:35 Comment(7)
+1 for pointing out that this is an NFA vs DFA thing. What does "regex-directed flavor" mean, though? Another option for the OP would be to arrange the alternatives such that none are preceded by any of their prefixes. For literal patterns the easiest way to to this is just to sort by descending size.Inveigh
Thanks much. I'm disappointed I can't force Java to default to longest-match, but quite satisfied with the explanation of why, and the workarounds you both provided.Goulash
@Laurence: There's a brief description of regex-directed vs text-directed engines here: regular-expressions.info/engine.html ...and Friedl covers it very thoroughly in The Book: oreilly.com/catalog/regex3/index.htmlFilth
@Laurence: And yes, sorting alternatives by descending length is a good practice. I would also recommend not using alternation at all if there's another reasonable option. For example, using six(?:ty)? instead of six|sixty or sixty|six.Filth
@Alan: Thanks. I know the NFA versus DFA distinction, I'd just never heard these "regex-directed" and "text-directed" terms used for them before. Do you know where these terms originate?Inveigh
@Alan: Agreed that using other options is generally better than | with an NFA-based engine, especially if it's an engine that's prone to backtracking. For a quick-and-dirty "I've got a bunch of literal strings and I want a regex that finds the longest match" situation, though, the sorting by descending length approach is an easy way to get something that works, albeit sub-optimally. If you want to get really fancy, construct a trie out of your strings, and turn that into a regex.Inveigh
@Laurence: I think we're on the same page. As for "text-directed" and "regex-directed", the first place I remember seeing them is in Friedl's book.Filth
J
4

I think a pattern that will work is something like

IV|I{1,3}

See the "greedy quantifiers" section at http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html

Edit: in response to your comment, I think the general problem is that you keep using alternation when it is not the right thing to use. In your new example, you are trying to match "six" or "sixty"; the right pattern to use is six(ty)?, not six|sixty. In general, if you ever have two members of an alternation group such that one is a prefix of another, you should rewrite the regular expression to eliminate it. Otherwise, you can't really complain that the engine is doing the wrong thing, since the semantics of alternation don't say anything about a longest match.

Edit 2: the literal answer to your question is no, it can't be forced (and my commentary is that you shouldn't ever need this behavior).

Edit 3: thinking more about the subject, it occurred to me that an alternation pattern where one string is the prefix of another is undesirable for another reason; namely, it will be slower unless the underlying automaton is constructed to take prefixes into account (and given that Java picks the first match in the pattern, I would guess that this is not the case).

Jahn answered 23/12, 2010 at 2:8 Comment(9)
So that fixes this particular case, but not the behavior in general. See the code sample I just added.Goulash
This is because you keep using alternation when you don't really want alternation. I will update my answer to explain in more detail.Jahn
I understand the approach you're taking with your response, but these are just a couple of 'fake' examples synthesized to show the problem. If my actual problem was matching "six|sixty", you're right, that'd be the way to do it. But the real problem is that Java's behavior is inconsistent with the standard behavior for the last 30 years or so. I'm interested in seeing if I can fix that, not in making a smarter regex for the sample code.Goulash
The "standard behavior" you're talking about is a particular standard (POSIX) - to my knowledge this is the only standard that requires that alternation return the longest match. Given that there is nothing particular about alternation that suggests that the longest match is the correct one, there isn't any reason to expect that an arbitrary matching algorithm should choose the longest match.Jahn
I don't claim that Java made a claim to be compliant to a standard, I'm just shocked that "longest match" isn't what it does in alternation, because it should be the default in any lexing. It's actually orthogonal to the semantics of alternation. So my question is still: is there a way to force it?Goulash
Updated my answer with the actual answer to your question. I still don't understand where your shock comes from, or why you claim that longest match "should" be the default. It is not at all orthogonal to the semantics of alternation - alternation literally means "match any one of these things"; if you get that then the matcher is not broken.Jahn
Given the regexp "six|sixty", if the rule is not "longest match", it's hard to see what the rule would be. If the rule were "shorter match" or "first listed", "six|sixty" would be identical to "six", which is just foolish. What's left?Gluttonize
@Malvolio: it may be foolish to you, but not to an automaton.Jahn
@Jahn -- I hope it was obvious I was criticizing the creator not the creation.Gluttonize

© 2022 - 2024 — McMap. All rights reserved.