Aren't modern regular expression dialects regular?

Asked 30/9, 2010 at 5:55 Answered 30/9, 2010 at 6:46

I've seen a few comments here that mention that modern regular expressions go beyond what can be represented in a regular language. How is this so?

What features of modern regular expressions are not regular? Examples would be helpful.

Pugging answered 30/9, 2010 at 5:55 Comment(4)

This should probably be a community wiki – Greatest 30/9, 2010 at 5:57

@webdestroya: I can understand CW, but why not on SO? – Selfreliant 30/9, 2010 at 6:0

@NullUser - Isn't this a pretty subjective question? – Greatest 30/9, 2010 at 6:0

@web Nope. Regular languages have a formal definition. This allows us to answer objectively what regex features can make regex irregular. – Cupellation 30/9, 2010 at 6:5

The first thing that comes to mind is backreferences:

(\w*)\s\1

(matches a group of word characters, followed by a space character and then the same group previously matched) eg: hello hello matches, hello world doesn't.

This construct is not regular (ie: can't be generated by a regular grammar).

Another feature supported by Perl Compatible RegExp (PCRE) that is not regular are recursive patterns:

\((a*|(?R))*\)

This can be used to match any combination of balanced parentheses and "a"s (from wikipedia)

Cupellation answered 30/9, 2010 at 5:58 Comment(4)

Some backreferences can be made in a regular language. For example (.)x\1 defines a regular language: "axa", "bxb", etc. I believe it's only when combined with Kleene closures that backreferences make the language irregular. – Cryptanalysis 30/9, 2010 at 6:12

You don't need the space in there. (.*)\1 will do. – Aquilar 30/9, 2010 at 6:31

@Nabb: . matches a much larger range of characters than just \w*\s – Selfreliant 30/9, 2010 at 6:31

@Aquilar It's more to demonstrate real-world usage/for clarity's sake. – Cupellation 30/9, 2010 at 6:36

A couple of examples:

Regular expressions support grouping. E.g. in Ruby: /my (group)/.match("my group")[1] will output "group". storing something in a group requires an external storage, which a finite automaton does not have.
Many languages, e.g. C#, support captures, i.e. that each match will be captured on a stack - for example the pattern (?<MYGROUP>.)* could perform multiple captures of "." in the same group.
Grouping are used for backreferencing as pointed out by the user NullUserException above. Backreferencing requires one or more external stacks with the power of a push-down-automaton (you have to be able to push something on a stack and peek or pop it afterwards.
Some engines have the possibility of seperately pushing and popping external stacks and checking whether the stack is empty. In .NET, actually (?<MYGROUP>test) pushes a stack, while (?<-MYGROUP>) pops a stack.
Some engines like the .NET engine have a balanced grouping concept - where an external stack can be both pushed and popped at the same time. Balanced grouping syntax is (?<FIRSTGROUP-LASTGROUP>) which pops the LASTGROUP and pushes the capture since the LASTGROUP index on the FIRSTGROUP stack. This can actually be used to match infinitely nested constructions which is definitely beyond the power of a finite automaton.

Probably other good examples exist :-) If you are further interessted in some of the implementation details of external stacks in combination with Regex's and balanced grouping and thus higher order automata than finite automata, I once wrote two short articles on this (http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx and http://www.codeproject.com/KB/recipes/RegEx_Balanced_Grouping.aspx).

Anyway - finitieness or not - I blieve that the power that this extra stuff brings to the regular languages is great :-)

Br. Morten

Rancor answered 30/9, 2010 at 6:46 Comment(1)

Grouping and capturing are not features that make the language irregular -- all they do is provide metadata, not change the expressiveness of the language. Obviously anything that involves a stack (like backreferences) do make for irregular languages though. – Cryptanalysis 30/9, 2010 at 7:28

A deterministic or nondeterministic finite automaton recognizes just the regular languages, which are described by regular expressions. The definition of a regular expression is simple. Let S be an alphabet. Then the empty set, the empty string, and every element of S are regular expressions (over S). Let u and v be regular expressions. Then the union (u | v), concatenation (uv), and closure (u*) of u and v are regular expressions over S. This definition is easily extended to the regular languages. No other expression is a regular expression. As pointed out, some back-references are an example. The Wikipedia pages on regular languages and expressions are good references.

In essence, certain "regular expressions" are not regular because no automaton of a particular type can be constructed to recognize them. For example, the the language

{ a^ i b^ i : i <= 0 }

is not regular. This is because the accepting automaton would require infinitely many states, but an automaton accepting regular languages must have a finite number of states.

Limicoline answered 30/9, 2010 at 6:23 Comment(4)

Judging from the original question, I'm pretty sure he understands the distinction between regular and non-regular languages. His question is, which features of modern "regular expression" implementations define languages that are not regular, and therefore cannot be expressed in some way using the operations you listed. – Becket 30/9, 2010 at 6:32

Maybe I should read more closely, then! In any case, I don't think I caused any harm. – Limicoline 30/9, 2010 at 6:45

a^i b^i is certainly nonregular (it's a DCFG), but can we actually express this using the "regular expressions" of programming languages? – Aquilar 30/9, 2010 at 6:52

@Aquilar /^( |a(?1)b)$/ – Erigena 10/3, 2015 at 11:13

Recommended topics

Hot tags