Derive minimal regular expression from input

Asked 28/9, 2011 at 23:46 Answered 29/9, 2011 at 0:1

I have a remote "agent" that returns "yes" or "no" when handed a string. Communicating with this agent is expensive, so I'm hoping to find a library that will allow me to iteratively build a regular expression given positive and negative feedback, while being intelligent about its construction. This would allow me to cache answers on the sending side.

For example, suppose we query the agent with "good" and receive a "yes". The initial derived regular expression ought to be "good".

Suppose I query then with "goop" and receive a "yes". I would expect the derived regular expression to be "goo[dp]", not "good|goop".

And so forth.

I do not need backtracking or any other fancy non-linear time operations in my derived regex. Presumably the generated regex would be a DFA under the hood. Is anyone aware of any c/c++ regular expression libraries capable of doing this? Alternatively, reasons why this is a dumb idea and better solutions to my real problem would also be useful.

Kone answered 28/9, 2011 at 23:46 Comment(3)

Can we simplify this question to "How to find the minimal regex matching a given set of strings"? – Meza 28/9, 2011 at 23:58

@Kerrek: I think roughly, but it seems desirable to have it be efficient to add new strings, constructing it incrementally. – Highup 29/9, 2011 at 0:0

@R This is correct. Adding new strings on the fly, instead of a batch model, is important. – Kone 29/9, 2011 at 0:16

Rather than a regular expression, you could use a Trie.

Then for each new string you walk the trie one node for each character. I suspect that you would also want a marker character for end of string - once you reach this character, if the node exists, it holds the yes/no answer.

Predikant answered 29/9, 2011 at 0:1 Comment(3)

This is one option on the table. It would be nice to factor out repetitions, though. – Kone 29/9, 2011 at 0:15

@Kone By factoring out repetitions, you mean instead of having consecutive nodes for the same character only have one with a count attached or for whole parts of the trie? – Apple 29/9, 2011 at 0:39

@Apple Whole parts. Ideally the regex generated would be something like "abc{2,3}" instead of "abcabc(?:abc)?". – Kone 29/9, 2011 at 6:53

Well, unless I'm missing something in your situation, I think that memory is cheap enough to just straight up implement a dumb cache - say, an unordered_map of <std::string, bool>. Not only will this be much easier to build, it will probably be faster too, since you're building a hash map. The only downside to this is if you were going to query the remote service with a bazillion different keys, then this might not be the best approach.

Charters answered 28/9, 2011 at 23:54 Comment(0)

Recommended topics

Hot tags