Capture groups using DFA-based (linear-time) regular expressions: possible?

Asked 9/3, 2015 at 11:55 Answered 13/3, 2015 at 22:12

Is it possible to implement capture groups with DFA-based regular expressions while maintaining a linear time complexity with respect to the input length?

Intuitively I think not, because the subset construction procedure does not know which capture group it may have landed inside, but this is the first time I've realized this may be a potential problem, so I don't know.

Godfearing answered 9/3, 2015 at 11:55 Comment(2)

Do you mean just capturing groups, or also matching backreferences? – Taryntaryne 9/3, 2015 at 12:12

@Bergi: Just capture. – Godfearing 9/3, 2015 at 18:21

Is it possible to implement capture groups with DFA-based regular expressions while maintaining a linear time complexity with respect to the input length?

Yes - at least when the capture groups are deterministic. Consider the example regex /a|(a)/; matching that against the input "a" could either produce a captured group or none.

I think that capture groups could be based on a theoretical foundation using finite state transducers, which are like automatons but also may output strings while changing states. You may echo the input, but surround each capture group with parenthesis for example.

Intuitively I think not, because the subset construction procedure does not know which capture group it may have landed inside, but this is the first time I've realized this may be a potential problem, so I don't know.

Indeed, this is a problem. I think you can solve it by tagging your sets with the capture state, and similarly distinguish the states of your result DFA. You may fail to produce a fully deterministic automaton for regular expression like the above, as Wikipedia writes: "some non-deterministic transducers do not admit equivalent deterministic transducers".

However, a modification of the subset construction procedure is possible, see Determinization of Transducers. Their algorithm seems to revolve around the following:

local ambiguities […] are solved by delaying the outputs as far as needed, until these symbols can be written out deterministically.

For example, the regexes /ab|(a)c/ and even /(a[bc])|ad/ can be resolved into deterministic transducers. Notice that their memory representation may be much larger than if they had no capture groups.

Taryntaryne answered 9/3, 2015 at 19:8 Comment(4)

I'm trying to understand if being deterministic is enough. Would it be possible to make a DFA for a pattern like x*(a*b)c|x*(a?)(a?)bd? If you delay the capture until you see a c or a d, then you've solved the ambiguity problem, but then how do you know which capture was supposed to begin and end where? Is there a way to do this in constant time (and hopefully constant space) with respect to the input length? Perhaps using a stack or something? – Godfearing 9/3, 2015 at 19:29

Uh, good question. I haven't delved to deeply into finite state transducers. I think you can solve this by distinguishing capture groups by suffixes, i.e /x*(a*b)c|x*(a*b)d/ on abc should output (ab0)c and on xabd should output x(ab1)d. Those transducers would be determininisable to a rather simple form. Notice that you will always need linear time wrt to the input length; the output (space) would be linear as well but if you optimize and use pointers to the input, you should get linear space wrt the number of capture groups. – Taryntaryne 9/3, 2015 at 19:49

I see. Note that by constant time I meant constant with respect to the input length at each state transition (as there are n state transitions). Regarding the captures, it's something the user specifies, so if I can just change them and recover what the user wanted later quickly. – Godfearing 9/3, 2015 at 19:59

Right - but as the output takes linear time, I guess the the time spent for a state transition is amortized to be constant. – Taryntaryne 9/3, 2015 at 20:1

My http://github.com/hoehrmann/demo-parselov does this. I do not currently explain the construction on the web page, but suppose you have a grammar like

X = "a" B "c"
B = "b"

You can turn this regular grammar into a graph with labeled vertices

start X
"a"
start B
"b"
final B
"c"
final X

DFA states correspond to sets of these vertices. The first one would consist of vertices 1 and 2, the second one of vertices 3 and 4, then 5 and 6, and finally 7. If you parse the string "abc", you have

{ offset: 0, vertices: [1, 2] }
{ offset: 1, vertices: [3, 4] }
{ offset: 2, vertices: [5, 6] }
{ offset: EOF, vertices: [7] }

That is also a graph. You can write out the edges using (offset, vertex) pairs as vertices:

(o0, v1) -> (o0, v2)
(o0, v2) -> (o1, v3)
(o1, v3) -> (o1, v4)
...

Such a graph might contain vertices that do not ultimately reach the final vertex (EOF, v7), but such vertices can be eliminated in O(n) time. If the grammar is ambiguous, a match would be a path through the resulting graph. There may be many possible paths.

Goodson answered 13/3, 2015 at 22:12 Comment(0)

Recommended topics

Hot tags