regex embedded {{ matching
I need to match the entire following statement:

{{CalendarCustom|year={{{year|{{#time:Y}}}}}|month=08|float=right}}

Basically whenever there is a { there needs to be a corresponding } with however many embedded { } are inside the original tag. So for example {{match}} or {{ma{{tch}}}} or {{m{{a{{t}}c}}h}}.

I have this right now:

(\{\{.+?(:?\}\}[^\{]+?\}\}))

This does not quite work.

Dudek answered 14/5, 2011 at 14:57 Comment(5)
What exactly are you trying to get out of the string? - Atalya
I just want to match the entire statement so I can remove it. There is other text surrounding it, and I want to match anything inside {} brackets and remove it. - Dudek
In general, regexps are not the right tool for matching brackets; see, e.g., here. - Talesman
@Howard: "Regular expressions" have come a long way from being regular. Modern regex flavors offer many new things, and a problem like this is perfectly suited for a recursive regex. - Wit
Can you just use JSON? This kind of sounds like you're outputting this string yourself, and then trying to parse it later. If you do in fact own both ends (and are just serializing and deserializing), you'll save yourself a lot of work if you just go with an existing solution ;) - Busty
W
16

The .NET regex engine can match nested constructs like this using balancing groups:

// Requires: using System.Text.RegularExpressions;
string result = Regex.Match(subject,
    @"\{                   # opening {
        (?>                # now match...
           [^{}]+          # any characters except braces
        |                  # or
           \{  (?<DEPTH>)  # a {, increasing the depth counter
        |                  # or
           \}  (?<-DEPTH>) # a }, decreasing the depth counter
        )*                 # any number of times
        (?(DEPTH)(?!))     # until the depth counter is zero again
      \}                   # then match the closing }",
    RegexOptions.IgnorePatternWhitespace).Value;
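
Since the goal stated in the comments is to remove these tags from surrounding text, the same pattern can also be handed to Regex.Replace. The following is only a minimal sketch under that assumption; the class name BraceStripper, the method name StripBraces, and the sample string are illustrative and not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

static class BraceStripper   // hypothetical helper name
{
    // Same balancing-group pattern as above, built once and reused.
    static readonly Regex Balanced = new Regex(
        @"\{                   # opening {
            (?>                # atomically match...
               [^{}]+          # any run of non-brace characters
            |
               \{  (?<DEPTH>)  # a {, pushing onto the DEPTH stack
            |
               \}  (?<-DEPTH>) # a }, popping from the DEPTH stack
            )*
            (?(DEPTH)(?!))     # fail if any { is still unclosed
          \}                   # closing }",
        RegexOptions.IgnorePatternWhitespace);

    // Returns the input with every balanced {...} span deleted.
    public static string StripBraces(string subject)
    {
        return Balanced.Replace(subject, "");
    }

    static void Main()
    {
        string text = "before {{CalendarCustom|year={{{year|{{#time:Y}}}}}|month=08|float=right}} after";
        Console.WriteLine(StripBraces(text));   // the tag is removed, surrounding text is kept
    }
}
```

Note that the whitespace on either side of a removed tag is left untouched, so a tag sitting between two spaces leaves both spaces behind.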
Wit answered 14/5, 2011 at 15:11 Comment(3)
Thanks for pointing this out. Learnt something today... Do you have a link that documents <DEPTH>? - Atalya
@Oded: DEPTH is an arbitrary name. (?<id>) is just an empty named capturing group, which in .NET counts the number of matches; (?<-id>) is the same, only decreasing the counter. And (?(id)(?!)) only matches if the id counter is zero. This is documented on page 436 of Friedl's "Mastering Regular Expressions". - Wit
I tried using a basic regex solution that I found but it was crazy slow. Like 2+ minutes to run. This one is like instantaneous. - Dudek
A
4

I suggest writing a simple parser/tokenizer for this.

Basically, you loop over all the characters and keep a count of open braces, incrementing for { and decrementing for }. Record the index of the { that takes the count from 0 to 1 and the index of the } that brings it back to 0, and you will have the start and end indexes of each embedded expression.

At this point you can use substring to get these and remove/replace them from the original string.
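
The counting approach described above could look roughly like this in C#. This is a sketch, not a definitive implementation: the names BraceScanner and RemoveBraced are hypothetical, it drops the braced spans while scanning instead of collecting indexes and calling substring afterwards, and it assumes that unmatched braces should simply be left in the text.

```csharp
using System;
using System.Text;

static class BraceScanner   // hypothetical helper name
{
    // Returns the input with every balanced top-level {...} span removed.
    // Assumption: a stray } or an unclosed { is kept rather than reported as an error.
    public static string RemoveBraced(string input)
    {
        var output = new StringBuilder(input.Length);
        int depth = 0;   // number of currently open {
        int start = 0;   // index of the { that opened the current top-level span

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '{')
            {
                if (depth == 0) start = i;   // first { of a new span
                depth++;
            }
            else if (c == '}' && depth > 0)
            {
                depth--;                     // reaching 0 means i is the last } of the span
            }
            else if (depth == 0)
            {
                output.Append(c);            // text outside any braces (or a stray })
            }
            // Any other character while depth > 0 is inside a span and is skipped.
        }

        // Unclosed span at the end of the input: keep it verbatim.
        if (depth > 0) output.Append(input, start, input.Length - start);
        return output.ToString();
    }

    static void Main()
    {
        Console.WriteLine(RemoveBraced("keep {{m{{a{{t}}c}}h}} this"));
    }
}
```

This makes a single pass over the string and uses only managed code, so it needs nothing beyond the base class library.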

See this question and answers for why RegEx is not suitable.

Atalya answered 14/5, 2011 at 15:08 Comment(5)
I second this. I've seen a company I used to work for go down the road of parsing via regex, and it only seems like it's going to be easier. Writing a parser has a big learning curve, but it'll be worth it in the long run. Check out ANTLR for a starting point.... - Busty
Here's a very simple example of using ANTLR to parse and evaluate expressions. Notice how simple it is to just define what the valid 'tokens' are and then sprinkle in inline Java source code (it works with C# as well), and then ANTLR does the rest. antlr.org/wiki/display/ANTLR3/Expression+evaluator - Busty
I'm making something that runs on an Xbox, so no unmanaged code allowed. - Dudek
@Paul - you can write this in C#. - Atalya
@Paul - What? Looping through each char in a string? I described a simple algorithm. Where do you think unmanaged code comes into this? I do not mean ANTLR. - Atalya
