Unclosed character class near index nnn

I'm borrowing a rather complex regex from some PHP Textile implementations (open source, properly attributed) for textile4j, a simple, not quite feature-complete Java implementation that I'm porting to GitHub and syncing to Maven Central. (The original code was written as a plugin for blojsom, a Java blogging platform; this is part of a larger effort to make blojsom dependencies available in Maven Central.)

Unfortunately, the Textile regexes, which work with preg_replace_callback in PHP, fail in Java with the following exception:

java.util.regex.PatternSyntaxException: Unclosed character class near index 217

The statement is obvious; the solution is elusive.

Here's the raw, multiline regex from the PHP implementation:

return preg_replace_callback('/
    (^|(?<=[\s>.\(])|[{[]) # $pre
    "                      # start
    (' . $this->c . ')     # $atts
    ([^"]+?)               # $text
    (?:\(([^)]+?)\)(?="))? # $title
    ":
    ('.$this->urlch.'+?)   # $url
    (\/)?                  # $slash
    ([^\w\/;]*?)           # $post
    ([\]}]|(?=\s|$|\)))
    /x',callback,input);

Cleverly, I got the textile class to "show me the code" being used in this regex with a simple echo that resulted in the following, rather long, regular expression:

(^|(?<=[\s>.\(])|[{[])"((?:(?:\([^)]+\))|(?:\{[^}]+\})|(?:\[[^]]+\])|(?:\<(?!>)|(?<!<)\>|\<\>|\=|[()]+(?! )))*)([^"]+?)(?:\(([^)]+?)\)(?="))?":([\w"$\-_.+!*'(),";\/?:@=&%#{}|\^~\[\]`]+?)(\/)?([^\w\/;]*?)([\]}]|(?=\s|$|\)))

I've uncovered a couple of possible areas that could be resulting in parsing errors, using online tools such as RegExr by gskinner and RegexPlanet. However, none of those particulars fix the error.

I suspect that there is a range issue hidden in one of the character classes, or a Unicode order hiding somewhere, but I can't find it.

Any ideas?

I'm also curious why PHP doesn't throw a similar error. For example, RegExr flagged one poorly handled "passive subexpression", but changing it didn't fix the Java exception and didn't alter the behavior in PHP. The change is shown below.

In the $title line, swap the escaped paren and the opening of the capture group:

        (?:\(([^)]+?)\)(?="))? # $title
        ...^
        (?:(\([^)]+?)\)(?="))? # $title
        ....^

Thanks, Tim

edit: adding a Java String interpretation (with escapes) of the Textile regex, as determined by RegexPlanet ...

"(^|(?<=[\\s>.\\(])|[{[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:\\<(?!>)|(?<!<)\\>|\\<\\>|\\=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$\\-_.+!*'(),\";\\/?:@=&%#{}|\\^~\\[\\]`]+?)(\\/)?([^\\w\\/;]*?)([\\]}]|(?=\\s|$|\\)))"
Luanneluanni answered 14/11, 2011 at 18:32 Comment(1)
That #title line looks okay to me. It optionally matches something in parentheses (capturing everything except the parens themselves), but only if it's the last thing before the closing ".Pitanga

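@CodeJockey is correct: there's a square bracket in one of your character classes that needs to be escaped. []] or [^]] are okay because the ] is the first character other than the negating ^, but in Java an unescaped [ inside a character class opens a nested class, so a literal [ has to be escaped. Left unescaped, as in [{[], the nested class swallows the closing ] and the outer class is never closed, which is the error you're seeing.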
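A minimal reproduction may make this easier to see. The sketch below takes only the $pre character class from the question and checks whether Java accepts it with and without the escape; the class name BracketInClassDemo and the compiles helper are made up for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class BracketInClassDemo {
    /** Returns true if Java's regex engine accepts the given pattern. */
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Unescaped [ inside the class: rejected with a PatternSyntaxException.
        System.out.println(compiles("[{[]"));
        // Escaped [ (doubled backslash in a Java string literal): accepted.
        System.out.println(compiles("[{\\[]"));
    }
}
```

Running the same check on each character class of the long regex is one way to narrow down which class is at fault.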

Ironically, the original regex contains many backslashes that aren't needed even in PHP. It also escapes / because that's what it uses as the regex delimiter. After weeding all those out I came up with this Java regex:

"(^|(?<=[\\s>.(])|[{\\[])\"((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^]]+\\])|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)([^\"]+?)(?:\\(([^)]+?)\\)(?=\"))?\":([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)(/)?([^\\w/;]*?)([]}]|(?=\\s|$|\\)))"

Whether it's the best regex I have no idea, not knowing how it's being used.
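For what it's worth, here is a rough sketch of how the regex could be exercised from Java. It assumes the pattern is applied with Matcher.find() as a stand-in for preg_replace_callback, and that the groups line up with the $pre/$atts/$text/$title/$url comments in the PHP source. The class name TextileLinkDemo and the firstLink helper are invented for the example, and I have also escaped the leading ] in two classes, which is a conservative variation of my own rather than something Java requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextileLinkDemo {
    // Link regex from the answer, split up to mirror the PHP comments.
    // Assumption: [^\]] and [\]}] are written with escapes for safety.
    static final Pattern LINK = Pattern.compile(
          "(^|(?<=[\\s>.(])|[{\\[])"                           // 1: $pre
        + "\""                                                 // start
        + "((?:(?:\\([^)]+\\))|(?:\\{[^}]+\\})|(?:\\[[^\\]]+\\])"
        + "|(?:<(?!>)|(?<!<)>|<>|=|[()]+(?! )))*)"             // 2: $atts
        + "([^\"]+?)"                                          // 3: $text
        + "(?:\\(([^)]+?)\\)(?=\"))?"                          // 4: $title
        + "\":"
        + "([\\w\"$_.+!*'(),\";/?:@=&%#{}|^~\\[\\]`-]+?)"      // 5: $url
        + "(/)?"                                               // 6: $slash
        + "([^\\w/;]*?)"                                       // 7: $post
        + "([\\]}]|(?=\\s|$|\\)))");                           // 8: closer

    /** Returns {text, url} for the first Textile link found, or null. */
    static String[] firstLink(String input) {
        Matcher m = LINK.matcher(input);
        return m.find() ? new String[] { m.group(3), m.group(5) } : null;
    }

    public static void main(String[] args) {
        String[] link = firstLink("See \"Google\":http://google.com for more.");
        System.out.println(link[0] + " -> " + link[1]);
    }
}
```

This only checks the plain "text":url form; attributes, titles and bracketed links would need their own test inputs.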

Pitanga answered 15/11, 2011 at 9:0 Comment(5)
alan, many thanks for the investigation! Basically, textile is a "mediawiki-lite" text parser created many years ago by Dean Cameron Allen of FARVD fame. The syntax apparently is highly infectious, so much so, I believe but can't be sure, the TextPattern CMS is built around it. The RegEx in question, taken from PHP, is, as OP, used to parse the link syntax of Textile. Sometime c.2003, a Java port of Textile was undertaken. The Java port had a regex that could not handle all the textile features. See OP for my interest in getting it right. :)Luanneluanni
Ah, the joy of translating complicated regexes to a not-quite-compatible flavor--never a dull moment! ;) But you do understand that @FailedDev's answer is wrong, don't you? Wherever the original regex uses \< or \>, it's trying to match a literal angle bracket, not a word boundary. (I checked the docs just to be sure; they're part of Textile's text-alignment syntax.)Pitanga
alan, this indeed works and is much cleaner, fewer empty match groups (from 12 to 9). Textile is neat, but looking at the code, and thinking about TextPattern makes me recall this quote: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. --Jamie Zawinski Then there's Jeff Atwood on regex too.Luanneluanni
[rage]This? This is why I was getting an error? Because Java thinks I can by some fey necromancy embed a character class WITHIN ANOTHER CHARACTER CLASS? [/rage] Well, it's a good thing a well-spoken person thought to explain it on stackoverflow.Contradistinguish
Hehe, it has just happened to me as well - I wanted to check for backslashes in the input, but I forgot to double-escape them (first time for Java String, second time for regex, I had to write it as "\\\\")Assess

I'm not sure exactly where your problem lies, but this might help:

In Java (and I believe this is unique to Java), the [ symbol (not just the ] symbol) is reserved inside character classes and needs to be escaped.

The revised expression should probably be similar to the following, in order to be Java-compatible:

(^|(?<=[\s>.\(])|[{\[]) # $pre
"                       # start
(' . $this->c . ')      # $atts
([^"]+?)                # $text
(?:\(([^)]+?)\)(?="))?  # $title
":
('.$this->urlch.'+?)    # $url
(\/)?                   # $slash
([^\w\/;]*?)            # $post
([\]}]|(?=\s|$|\)))
/x

Basically, wherever most regex flavors allow a character class like [a-z,;[\]+-] (which matches "a letter a-z, or a comma, semicolon, open or close square bracket, plus or minus sign"), Java needs [a-z,;\[\]+-] instead: escape the [ with a \ character.

This escaping requirement is due to the Java union, intersection and subtraction character-class constructs.
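To illustrate those constructs, here is a small sketch using the union and subtraction forms described in the java.util.regex.Pattern documentation, followed by the escaped class from the example above. The class name ClassSetOpsDemo is just a placeholder.

```java
public class ClassSetOpsDemo {
    public static void main(String[] args) {
        // Union: a nested class adds its members to the outer one (a-c or x-z).
        System.out.println("y".matches("[a-c[x-z]]"));
        System.out.println("m".matches("[a-c[x-z]]"));

        // Subtraction via intersection: a-z except the vowels.
        System.out.println("b".matches("[a-z&&[^aeiou]]"));
        System.out.println("e".matches("[a-z&&[^aeiou]]"));

        // Literal brackets must be escaped so they are not read as nesting.
        System.out.println("[".matches("[a-z,;\\[\\]+-]"));
    }
}
```

Because [ starts a nested class in these forms, Java cannot also treat a bare [ as a literal inside a class.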

Mortonmortuary answered 14/11, 2011 at 18:49 Comment(2)
Actually since it's java you would need to escape it with double '\'.Superintendent
@Superintendent - yes, any \ character, when put into a string, would need to be escaped. The above example is in PHP, so all the \ characters, including the ones in \s and \( would need to be doubled, if put into a Java string.Mortonmortuary
