Are Ruby 1.9 regular expressions equally powerful to a context free grammar?
Asked Answered
E

1

14

I have this regular expression:

regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x

When I test it against several strings, it appears to be as powerful as a context free grammar because it handles the recursion properly.

regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
regex.match("aaacaa")
# => nil

"Fun with Ruby 1.9 Regular Expressions" has an example where he actually arranges all the parts of a regex so that it looks like a context-free grammar as follows:

sentence = %r{ 
    (?<subject>   cat   | dog   | gerbil    ){0} 
    (?<verb>      eats  | drinks| generates ){0} 
    (?<object>    water | bones | PDFs      ){0} 
    (?<adjective> big   | small | smelly    ){0} 

    (?<opt_adj>   (\g<adjective>\s)?     ){0} 

    The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object> 
}x

Between his technique for rearranging the parts of the regex, and my example of recursive named capturing groups, does this mean Ruby 1.9 regular expressions have the power equivalent to a context-free grammar?

Exuberance answered 22/1, 2012 at 5:49 Comment(1)
This is a follow-on to answers I posted at #2627105Exuberance
U
7

This is one of the awesome things about the Oniguruma regexp engine used in Ruby 1.9 – it has the power of a parser, and is not restricted to recognizing regular languages. It has positive and negative lookahead/lookbehind, which even can be used to recognize some languages which are not context-free! Take the following as an example:

regexp = /\A(?<AB>a\g<AB>b|){0}(?=\g<AB>c)a*(?<BC>b\g<BC>c|){1}\Z/

This regexp recognizes strings like “abc”, “aabbcc”, “aaabbbccc”, and so on – the number of “a”, “b”, and “c” must be equal, or it will not match.

(One limitation: you can’t use named groups in the lookahead and lookbehind.)

Although I haven’t peeked under the hood, Oniguruma seems to deal with named groups by simple recursive descent, backing up when something doesn’t match. I’ve observed that it can’t deal with left recursion. For example:

irb(main):013:0> regexp = /(?<A>\g<A>a|)/
SyntaxError: (irb):13: never ending recursion: /(?<A>\g<A>a|)/
    from C:/Ruby192/bin/irb:12:in `<main>'

I don’t remember my parsing theory very clearly, but I think that a non-deterministic top-down parser like this should be able to parse any context-free language. (“language”, not “grammar”; if your grammar has left recursion, you will have to convert it to right recursion.) If that is incorrect, please edit this post.

Unlucky answered 22/1, 2012 at 21:20 Comment(6)
Do you have a link to a proof that they are context-free? I'd like to see that. Otherwise, do you have the specification of Oniguruma regex syntax? Doing a proof would be pretty cool. From what Ken Bloom posted, it looks like it supports defining CFGs... but I guess that depends on the full syntax, right? Maybe it can do more?Marcellemarcellina
It's a little more complicated than that. For instance, deterministic context-free languages also allow for recursion, but represent a proper superset of the context-free languages. Similarly, context-sensitive languages are a proper superset (although I somewhat doubt that, given the syntax used in the example, it would be possible to represent any non-CFL languages, but then again, I don't know the whole syntax). For instance, can you match {ww | w in E*} using this syntax? Can you match the language of all (including non-simple) palindromes?Marcellemarcellina
@Patrick87, thanks for pushing me to look into things more. I have edited my answer to make it way more informative. I also deleted my comments since they are now redundant. If you like the new answer, please upvote!Unlucky
How do you "convert it to right recursion" in this example? Specifically this is my case that does want to work: "aa"[/^(?<a>(a|\g<a>\g<a>)){0}\g<a>$/].Brocade
@Nakilon, that should probably be posted as a new question. If you create a question and link to it here, I'll answer it when I have a bit of time.Unlucky
@AlexD hi there. I've got this error here: https://mcmap.net/q/901991/-regex-for-integer-with-spaces-every-three-digits-throws-quot-never-ending-recursion-quot/322020Brocade

© 2022 - 2024 — McMap. All rights reserved.