Regex/token/rule to match nested curly braces?

Asked 17/7, 2020 at 3:9 Answered 19/7, 2020 at 14:15

I need to match the values of key = value pairs in BibTeX files, which can contain arbitrarily nested braces, delimited by braces. I've got as far as matching at most two deep nested curly braces, like {some {stuff} like {this}} with the kludgey:

    token brace-value {
    '{' <-[{}]>* ['{' <-[}]>* '}' <-[{}]>* ]* '}'
    }

I shudder at the idea of going one level further down... but proper parsing of my BibTeX stuff needs at least three levels deep.

Yes, I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile. My *.bib files are rather tame (and I wouldn't mind to handle a few stray entries by hand), the problem is that I have a lot of them, with much overlap. But some of the "same" entries have different keys, or extra data. I want to consolidate them into a few master files (the whole idea behind BibTeX, right?). Not fun by hand if bibtool gives a file with no duplicates (ha!) of some 20 thousand lines...

Brandiebrandise answered 17/7, 2020 at 3:9 Comment(7)

See docs.raku.org/language/regexes#Tilde_for_nesting_structures. That looks like what you need here – Stertorous 17/7, 2020 at 3:40

Can you please provide a string to check? – Km 17/7, 2020 at 6:53

@HåkonHægland, that handles one level of braces. I need at least three, – Brandiebrandise 17/7, 2020 at 22:1

@jjmerelo, {This {has {nested {braces} {inside {several {times} even} nested} again}} end} – Brandiebrandise 17/7, 2020 at 22:23

@michaeldel, '{' ~ '}' <regex> is (more or less) another way to write '{' <regex> '}', no further nesting. – Brandiebrandise 17/7, 2020 at 22:25

@Brandiebrandise Fwiw I'm working on a new version of my answer. I think it'll be illuminating but it's taking its sweet time to properly come together. Hopefully I'll get it in the bag tomorrow. – Exaggerate 18/7, 2020 at 23:26

OK, just (re) checked. The documentation of '{' ~ '}' leaves a whole lot to desire, it is not at all clear it is meant to handle balanced, correctly nested delimiters.

So my final solution is really just along the lines:

my regex nested-braces {
   :ratchet
   '{' ~ '}' .*
}

Thanks everyone! Lerned quite a bit today.

Brandiebrandise answered 19/7, 2020 at 14:15 Comment(3)

my regex nested-braces { :ratchet '{' ~ '}' .* } will match a single pair, without regard for nesting/balance. For example, matched against '{}}}}}nnn}' it returns '｢{}}}}}nnn}｣'. – Exaggerate 19/7, 2020 at 22:46

"it is not at all clear ['{' ~ '}'] is meant to handle balanced, correctly nested delimiters". It isn't meant to. It's for matching a single pair. The doc section title ("Tilde for nesting structures") is unfortunate. The doc section body is spot on -- but it's verbiage Larry wrote for compiler implementors, not users. The "nested" in "nested subrules" is about nested subrules matching what goes between a single occurrence of a pair of delimiters. – Exaggerate 19/7, 2020 at 22:50

The reason ~ exists in the regex syntax is to put the delimiters syntactically near each other, instead of on opposite sides of the middle. That is the only thing it does. '{' ~ '}' .* works exactly the same as '{' .* '}'. – Lavena 21/7, 2020 at 0:36

After perusing Lenz' "Parsing with Perl 6 Regexes and Grammars" (Apress, 2017), I realized the "regex" machinery (based on backtracking) might actually be a lot more capable than officially admitted, as a regex can call another, and nowhere do I see a prohibition on recursive calls.

Before digging in, a bit of context free grammars: A way to describing nested braces (and nothing else) is with the grammar:

S -> { S } S | <nothing>

I.e., nested braces are either an opening brace, nested braces, a closing brace, more nested braces; or nothing whatsoever. This translates more or less directly to Raku (there is no empty regex, fake it by making the construction optional):

my regex nb {
   [ '{' <nb> '}' <nb> ]?
}

Lo and behold, this works. Need to fix up to avoid captures, kill backtracking (if it doesn't match on the first try, it won't ever match), and decorate with "anything else" fillers.

my regex nested-braces {
    :ratchet 
     <-[{}]>*
     [ '{' <.nested-braces> '}' <.nested-braces> ]?
     <-[{}]>*
};

This checks out with my test cases.

For not-so-adventurous souls, there is the Text::Balanced module for Perl (formerly Perl 5, callable from Raku using Inline::Perl5). Not directly useful to me inside a grammar, unfortunately.

Brandiebrandise answered 17/7, 2020 at 13:59 Comment(3)

To make it avoid backtracking, you generally just need to call it a token instead of a regex (or use the modifier :r [for :ratcheting] somewhere in the regex) – Roorback 17/7, 2020 at 14:4

@user0721090601, yes, tthat was what I had in mind. Just was laying out proof-of-concept. Final version in now. – Brandiebrandise 17/7, 2020 at 14:34

"context free grammar" The What are Raku Rules / grammars? section of my answer to SO What's the real difference between a token and a rule? may be of interest. – Exaggerate 18/7, 2020 at 0:27

Solution

A way to describe nested braces (and nothing else)

Presuming a rule named &R, I'd likely write the following pattern if I was writing a quick small one-off script:

\{ <&R>* \}

If I was writing a larger program that should be maintainable I'd likely be writing a grammar and, using a rule named R the pattern would be:

'{' ~ '}' <R>*

This latter avoids leaning toothpick syndrome and uses the regex ~ operator.

These will both parse arbitrarily deeply nested paired braces, eg:

say '{{{{}}}}' ~~ token { \{ <&?ROUTINE>* \} } # ｢{{{{}}}}｣

(&?ROUTINE refers to the routine in which it appears. A regex is a routine. (Though you can't use <&?ROUTINE> in a regex declared with / ... / syntax.)

`regex` vs `token`

kill backtracking

my regex nested-braces {
    :ratchet

The only difference between patterns declared with regex and token is that the former turns ratcheting off. So using it and then immediately turning ratcheting on is notably unidiomatic. Instead:

my token nested-braces {

Backtracking

the "regex" machinery (based on backtracking)

The grammar/regex engine does include backtracking as an optional feature because that's occasionally exactly what one wants.

But the engine is not "based on backtracking", and many grammars/parsers make little or no use of backtracking.

Recursion

a regex can call another, and nowhere do I see a prohibition on recursive calls.

This alone is nothing special for contemporary regex engines.

PCRE has supported recursion since 2000, and named regexes since 2003. Perl's default regex engine has supported both since 2007.

Their support for deeper levels of recursion and more named regexes being stored at once has been increasing over time.

Damian Conway's PPR uses these features of regexes to build non-trivial (but still small) parse trees.

Capabilities

a lot more capable

Raku "regexes" can be viewed as a cleaned up take on the unfolding regex evolution. To the degree this helps someone understand them, great.

But really, it's a whole new deal. For example, they're turing complete, in a sensible way, and thus able to parse anything.

than officially admitted

Well that's an odd thing to say! Raku's Grammars are frequently touted as one of Raku's most innovative features.

There are three major caveats:

Performance The primary current caveat is that a well written C parser will blow the socks off a well written Raku Grammar based parser.
Pay off It's often not worth the effort it takes to write a fully correct parser for a non-trivial format if there's an existing parser.
Left recursion Raku does not automatically rewrite left recursion (infinite loops).

Using existing parsers

I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile.

Using a foreign module in Raku can be a bit of a revelation. It is not necessarily like anything you'll have experienced before. Raku's foreign language adaptors can do smart marshaling for you so it can be like you're using native Raku features.

Two of the available foreign language adaptors are already sufficiently polished to be amazing -- the ones for Perl and for C.

I'm pretty sure there's a BibTeX package for Perl that wraps a C BibTeX parser. If you used that you'd hopefully get parsing results all nicely wrapped up into Raku objects as if it was all Raku in the first place, but retaining much of the high performance of the C code.

A Raku BibTeX Grammar?

Perhaps your needs do call for creating and using a small Raku Grammar.

(Maybe you're doing this partly as an exercise to familiarize yourself with Raku, or the regex/grammar aspect of Raku. For that it sounds pretty ideal.)

As soon as you begin to use multiple regexes together -- even just two -- you are closing in on grammar territory. After all, they're just an easy-to-use construct for using multiple regexes together.

So if you decide you want to stick with writing parsing code in Raku, expect to write it something like this:

grammar BiBTeX {
  token TOP { ... }
  token ...
  token ...
}
BiBTeX.parse: my-bib-file

For more details, see the official doc's Grammar tutorial or read Moritz's book.

Exaggerate answered 18/7, 2020 at 1:6 Comment(10)

"Though you can't use <&?ROUTINE> in a /…/" … you can't use <&ROUTINE> but there is a syntax, but I'll be damned if I remember off hand. Someone posted it on something here though. i think it was <~~> – Roorback 18/7, 2020 at 5:48

Yup. Just double checked. /'{' ~ '}' <~~>*/ will match any random sequence of balanced {{{{{ }}}}} – Roorback 18/7, 2020 at 5:53

hey @Roorback - that’s cool - how does it look with capturing the contents and maybe actions?* – Leibniz 18/7, 2020 at 20:26

@p6steve Only <...> assertions that start with an alpha, capture. So <~~> won't separately capture recursive calls. (I just tested to confirm.) So for independently capturing recursive calls you'd need to use a rule in a method and explicitly name the rule in a recursive call. I've also tested actions, and they are called once for each time a particular call successfully completes its matching, as one would expect. – Exaggerate 18/7, 2020 at 23:11

@Exaggerate that's true, but it's fairly easy to simply add in $<inner>=<~~> which turns into an easily accessibly named capture. – Roorback 19/7, 2020 at 0:52

Sorry, but the Raku grammar documentation (and much else) is intelligible only if you know exactly what they are talking about. No decent examples, nothing. – Brandiebrandise 19/7, 2020 at 3:2

@user0721090601, all I get is "unrecognized regex character ~" – Brandiebrandise 19/7, 2020 at 3:7

thanks @Roorback and raiph ... that’s the incantation i was hoping for. – Leibniz 19/7, 2020 at 9:23

@Brandiebrandise you make a good point about the docs BUT i really like the current terse and searchable doc site as a concise tool for easy reference - however, i do think that the explication of Grammars for new users could use some worked examples - fwiw there is a lot of material in the raku advent posts... – Leibniz 19/7, 2020 at 9:23

@Brandiebrandise I didn't mean to suggest it was documented. I looked for it and didn't find it. It works for me. "unrecognized regex character ~" suggests you didn't write the four characters <~~>. – Exaggerate 19/7, 2020 at 9:25