Efficient (basic) regular expression implementation for streaming data

H

4

9

I'm looking for an implementation of regular expression matching that operates on a stream of data -- i.e., it has an API that allows a user to pass in one character at a time and report when a match is found on the stream of characters seen so far. Only very basic (classic) regular expressions are needed, so a DFA/NFA based implementation seems like it would be well-suited to the problem.

Based on the fact that it's possible to do regular expression matching using a DFA/NFA in a single linear sweep, it seems like a streaming implementation should be possible.

Requirements:

The library should not try to wait until the full string has been read before performing the match. The data I have really is streaming; there is no way to know how much data will arrive, it's not possible to seek forward or backward.
Implementing specific stream matching for a couple special cases is not an option, as I don't know in advance what patterns a user might want to look for.
Language: usable from C/C++

For the curious, my use case is the following: I have a system which intercepts memory writes inside a full system emulator, and I would like to have a way to identify memory writes that match a regular expression (e.g., one could use this to find the point in the system where a URL is written to memory).

I have found:

Apply a Regex on Stream?

Applying a regular expression to a Java I/O Stream

Code Guru - Building a Regular Expression Stream Search with the .NET Framework

But all of these attempt to convert the stream to a string first and then use a stock regular expression library.

Another thought I had was to modify the RE2 library, but according to the author it is architected around the assumption that the entire string is in memory at the same time.

If nothing's available, then I can start down the unhappy path of reinventing this wheel to fit my own needs, but I'd really rather not if I can avoid it. Any help would be greatly appreciated!

Harod answered 12/10, 2012 at 3:34 Comment(3)

You said that the full string must be read, then you can read the whole stream and store for processing? – Douche 12/10, 2012 at 3:38

Sorry, that was a bad typo. The data is streaming and we can't wait until the end to process it (the total amount of data passing through is more than 100 gigabytes). – Harod 12/10, 2012 at 3:46

This is a naive solution, but you could always keep a buffer of the last n bytes of the stream and apply the regex against that. Depending on your requirements, it might work for you. – Cunha 30/10, 2012 at 17:7

H

0

The "answer" to this unfortunately turns out to be that there is no prebuilt library for doing this. Instead I opted for the following compromise: I implemented a simple string matcher (no RE support), which keeps state using a single counter per stream and per search string, which tracks how many characters of the search string have been matched in that stream. It gets incremented with each correct character, and reset to zero when a non-matching character is found. This is fast and doesn't require too much memory overhead.

For more complicated searches, I instead just dump all streams out to disk, and then search through them using traditional tools. It's extremely slow, but luckily most of our use cases can get by with just simple string matching.

Harod answered 6/3, 2013 at 3:22 Comment(0)

N

5

I have exactly what you're looking for: https://github.com/agentzh/sregex http://agentzh.org/misc/slides/yapc-na-2013-sregex.pdf‎

And if you know javascript (or want a javascript version), I have an exercise that lets you easily do that using an existing RE implementation: https://github.com/dhruvbird/regexp-js

Nolin answered 28/10, 2013 at 18:32 Comment(7)

Thanks for posting this! :) Is your library meant to be usable in a browser? I'm not sure what to do with the requires. – Cataclysmic 10/1, 2017 at 1:18

@Mehrdad You should be able to get it to work in the browser with minor changes. I tested it in the browser to start with. I'm not sure how the requires pan out in the browser, but underscore.js is available in the browser. – Nolin 10/1, 2017 at 5:44

It seems DFA search is not implemented by the way. Am I missing it or is it really not there? – Cataclysmic 10/1, 2017 at 9:9

That's strange, then why is this statement empty? – Cataclysmic 10/1, 2017 at 20:21

Ah! I remember. I don't think I had quite figured out how to implement sub-match captures using DFA matching, so I didn't flesh it out. The part w/o capture should work if you implement it in the straight-forward manner. – Nolin 11/1, 2017 at 7:15

Ahh!! Funny you mention that... check this out and look who posted it: https://mcmap.net/q/1027447/-capture-groups-using-dfa-based-linear-time-regular-expressions-possible/541686 – Cataclysmic 11/1, 2017 at 7:31

By the way -- one other problem I ran into was that you did everything recursively rather than iteratively, which limits its usability. The reason I tried using your library was that I was trying to search for thousands of patterns at once (OR'ing them together), and JS's regular backtracking regexes obviously don't scale on that. But then when I tried it on yours it crashed after around ~2k patterns, and it took me a while to figure out that it was running out of stack space. I'm not really pursuing this anymore so don't worry about me, but for future users you may want to keep this in mind! – Cataclysmic 11/1, 2017 at 7:36

A

2

Intel has released hyperscan library that does exactly what is needed. Here it is a brief overview of it.

Algonquin answered 25/5, 2016 at 9:11 Comment(0)

K

0

Take a look at an old grep implementation like this one from Brian W. Kernighan and Rob Pike. They process the stream of text in a buffer, and apply on simple regex rules, not even BRE.

Kurd answered 15/11, 2012 at 23:31 Comment(0)

H

0

The "answer" to this unfortunately turns out to be that there is no prebuilt library for doing this. Instead I opted for the following compromise: I implemented a simple string matcher (no RE support), which keeps state using a single counter per stream and per search string, which tracks how many characters of the search string have been matched in that stream. It gets incremented with each correct character, and reset to zero when a non-matching character is found. This is fast and doesn't require too much memory overhead.

For more complicated searches, I instead just dump all streams out to disk, and then search through them using traditional tools. It's extremely slow, but luckily most of our use cases can get by with just simple string matching.

Harod answered 6/3, 2013 at 3:22 Comment(0)

Recommended topics

Hot tags