Apply a Regex on Stream?

3

47

I'm looking for a fast and safe way to apply regular expressions to streams.

I found some examples on the internet that convert each buffer to a string and then apply the Regex to that string.

This approach has two problems:

  • Performance: converting to strings and GC'ing those strings wastes time and CPU, and could surely be avoided if there were a more native way to apply a Regex to streams.
  • Full Regex support: a pattern can sometimes match only when two buffers are combined (buffer 1 ends with the first part of the match and buffer 2 starts with the second part). The convert-to-string approach cannot handle this kind of match natively. I have to supply extra information, such as the maximum length the pattern can match, which means the + and * quantifiers (unbounded match length) are not supported and never can be.
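
To make the second problem concrete, here is a minimal sketch of both variants. The class and method names (`ChunkedRegexDemo`, `MatchPerChunk`, `MatchWithOverlap`) are made up for illustration, and the overlap variant assumes the caller knows an upper bound `maxMatchLength` and that the pattern uses no anchors or lookarounds. It still converts every chunk to a string, so it does nothing for the performance problem.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class ChunkedRegexDemo
{
    // Naive variant: each chunk is matched in isolation, so a match that
    // straddles a chunk boundary is split or lost.
    public static List<string> MatchPerChunk(TextReader reader, Regex regex, int chunkSize)
    {
        var results = new List<string>();
        var buffer = new char[chunkSize];
        int read;
        while ((read = reader.ReadBlock(buffer, 0, chunkSize)) > 0)
        {
            foreach (Match m in regex.Matches(new string(buffer, 0, read)))
                results.Add(m.Value);
        }
        return results;
    }

    // Overlap variant: keeps a tail of unconsumed text between chunks.
    // Assumption: no match is longer than maxMatchLength (>= 1).
    public static List<string> MatchWithOverlap(TextReader reader, Regex regex,
                                                int chunkSize, int maxMatchLength)
    {
        var results = new List<string>();
        var window = new StringBuilder();
        var buffer = new char[chunkSize];
        int read;
        while ((read = reader.ReadBlock(buffer, 0, chunkSize)) > 0)
        {
            window.Append(buffer, 0, read);
            string text = window.ToString();
            int consumed = 0;
            foreach (Match m in regex.Matches(text))
            {
                // A match that starts too close to the end of the window
                // could still change once more data arrives, so defer it.
                if (m.Index + maxMatchLength > text.Length)
                    break;
                results.Add(m.Value);
                consumed = m.Index + m.Length;
            }
            // Keep everything after the last accepted match, but never more
            // than the last (maxMatchLength - 1) chars.
            int cut = Math.Max(consumed, text.Length - (maxMatchLength - 1));
            window.Remove(0, cut);
        }
        // End of input: whatever is left can no longer grow.
        foreach (Match m in regex.Matches(window.ToString()))
            results.Add(m.Value);
        return results;
    }
}
```

With the input "foo123bar4567", the pattern \d{1,4} and a chunk size of 5, `MatchPerChunk` yields 12, 3, 4, 567, while `MatchWithOverlap` with `maxMatchLength` set to 4 yields 123 and 4567. Replacing the pattern with \d+ removes the bound that the overlap variant depends on, which is exactly the limitation described above.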

So, the convert-to-string way is not fast, and doesn't fully support Regex.

Is there any way / Library that can be used to apply Regex on Streams without converting to strings and with full Regex support?

Jann answered 25/12, 2009 at 23:26 Comment(5)
So why can't you wait until you receive all of the data? - Papuan
In my experience it's often the regex which will hinder performance, not converting and GC'ing strings. Unless your matching is very complex, I suggest creating your own stream scanner for matches instead of using Regex. But you should benchmark it against using a regex to ensure you're on the right track. - Voltameter
@ChaosPandion: If the stream is a big file, I'm not going to load all its gigabytes into memory, especially not as UTF-16 (the in-memory string encoding in .NET). If the stream comes from the internet, I want to be able to scan it before all the data has been received (e.g. an HTML parser that displays the downloaded parts before the rest of the page has arrived). - Jann
You could take the Mono implementation of Regex and extend it to support streams. The only thing that would not work on streams is scanning backwards, since that requires the whole stream to be read. The latest Mono source can be found at: ftp.novell.com/pub/mono/sources-stable - Voltameter
I was looking for the same thing and also agree about the two problems you point out. It seems you need a regex search engine that allows you to supply an iterator. It looks like the C++ STL std::regex_search supports iterators. - Aleut
11

Intel has recently open-sourced the Hyperscan library under a BSD license. It's a high-performance, non-backtracking, NFA-based regex engine.

Features: the ability to work on streams of input data, and simultaneous matching of multiple patterns. The latter differs from the (pattern1|pattern2|...) approach: it actually matches the patterns concurrently.

It also takes advantage of Intel instruction set extensions such as SSE4.2, AVX2 and BMI. The summary of the design and an explanation of how it works can be found here. It also has a great developer's reference guide with a lot of explanations as well as performance and usage considerations. There is also a small article about using it in the wild (in Russian).

Merganser answered 25/5, 2016 at 9:5 Comment(2)
github.com/intel/hyperscan -- very cool, but not a .NET library as the original question asked for. - Sanious
@DrewNoakes And yet, it is interesting. I came here, not finding something for Common Lisp. So the .NET part is not really concerning me. - Pryce
1

It seems that you know the start and end delimiters of the matches you are trying to get, correct? (e.g. [,] or START,END etc.) If so, would it make sense to search for these delimiters as data comes in from your stream, create a substring between the delimiters, and do further processing on that?

I know it's pretty much the same thing as rolling your own, but it will be with a more specific purpose and even be able to process it as it comes in.

The problem with regular expressions in this case is that they work on matches, so you can only match against the input you already have. With a stream, you would have to read in all the data to get all the matches (a space/time constraint issue), try to match one character at a time as it arrives (pretty useless), match in chunks (where a match can easily be missed), or generate strings of interest which, if they match your criteria, can be shipped off elsewhere for further processing.
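
As a rough sketch of the last option, something along these lines could pull out the text between two known delimiters while the data is still arriving. `DelimitedScanner` and `ReadBetween` are hypothetical names, and the sketch assumes non-empty, non-nested delimiters. Each extracted substring could then be handed to a Regex or any other processing.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class DelimitedScanner
{
    // Lazily yields the text found between each start/end delimiter pair.
    // Assumption: both delimiters are non-empty and blocks are not nested.
    public static IEnumerable<string> ReadBetween(TextReader reader, string start, string end)
    {
        var pending = new StringBuilder();
        bool inside = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            pending.Append((char)c);
            string delimiter = inside ? end : start;
            if (!EndsWith(pending, delimiter))
            {
                // Outside a block only the last (start.Length - 1) chars can
                // still become part of a start delimiter, so drop the rest.
                if (!inside && pending.Length >= start.Length)
                    pending.Remove(0, pending.Length - (start.Length - 1));
                continue;
            }
            if (inside)
                yield return pending.ToString(0, pending.Length - end.Length);
            pending.Clear();
            inside = !inside;
        }
    }

    // True when the builder's content ends with the given suffix.
    private static bool EndsWith(StringBuilder sb, string suffix)
    {
        if (sb.Length < suffix.Length)
            return false;
        int offset = sb.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (sb[offset + i] != suffix[i])
                return false;
        }
        return true;
    }
}
```

For example, `DelimitedScanner.ReadBetween(new StringReader("xx[a,b]yy[c]zz"), "[", "]")` yields "a,b" and then "c". Memory use is bounded by the longest block rather than by the size of the stream, but a block that never closes is silently dropped at end of input.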

Trapeze answered 31/12, 2009 at 22:41 Comment(0)
1

You could add an extra method to StreamReader (the source code of e.g. Mono could be used for that purpose):

    private StringBuilder lineBuilder;
    public int RegexBufferSize
    {
        set { lastRegexMatchedLength = value; }
        get { return lastRegexMatchedLength; }
    }
    private int lastRegexMatchedLength = 0;

    public virtual string ReadRegex(Regex regex)
    {
        if (base_stream == null)
            throw new ObjectDisposedException("StreamReader", "Cannot read from a closed RegexStreamReader");

        if (pos >= decoded_count && ReadBuffer() == 0)
            return null; // EOF Reached

        if (lineBuilder == null)
            lineBuilder = new StringBuilder();
        else
            lineBuilder.Length = 0;

        lineBuilder.Append(decoded_buffer, pos, decoded_count - pos);
        int bytesRead = ReadBuffer();

        bool dataTested = false;
        while (bytesRead > 0)
        {
            var lineBuilderStartLen = lineBuilder.Length;
            dataTested = false;
            lineBuilder.Append(decoded_buffer, 0, bytesRead);

            if (lineBuilder.Length >= lastRegexMatchedLength)
            {
                var currentBuf = lineBuilder.ToString();
                var match = regex.Match(currentBuf, 0, currentBuf.Length);
                if (match.Success)
                {
                    var offset = match.Index + match.Length;
                    pos = 0;
                    decoded_count = lineBuilder.Length - offset;
                    ensureMinDecodedBufLen(decoded_count);
                    lineBuilder.CopyTo(offset, decoded_buffer, 0, decoded_count);
                    var matchedString = currentBuf.Substring(match.Index, match.Length);
                    return matchedString;
                }
                else
                {
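                    // Raise the threshold by about 10% (and by at least 1) so that more data is buffered before the next attempt
                    lastRegexMatchedLength = Math.Max(lastRegexMatchedLength + 1, (int)(lastRegexMatchedLength * 1.1));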
                    dataTested = true;
                }
            }

            bytesRead = ReadBuffer();
        }

        // EOF reached

        if (!dataTested)
        {
            var currentBuf = lineBuilder.ToString();
            var match = regex.Match(currentBuf, 0, currentBuf.Length);
            if (match.Success)
            {
                var offset = match.Index + match.Length;
                pos = 0;
                decoded_count = lineBuilder.Length - offset;
                ensureMinDecodedBufLen(decoded_count);
                lineBuilder.CopyTo(offset, decoded_buffer, 0, decoded_count);
                var matchedString = currentBuf.Substring(match.Index, match.Length);
                return matchedString;

            }
        }
        pos = decoded_count;

        return null;
    }

The method above uses the following members:

  1. decoded_buffer: the char buffer that contains (or will contain) the data read
  2. pos: the offset in decoded_buffer at which the unhandled data starts
  3. decoded_count: the number of valid chars in decoded_buffer, i.e. one past the index of the last char read
  4. RegexBufferSize: the minimum number of buffered chars required before any matching is attempted

The method ReadBuffer() needs to read data from the stream. The method ensureMinDecodedBufLen() needs to make sure that the decoded_buffer is large enough.

When calling the method, pass the Regex that needs to be matched against.
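
If modifying StreamReader is not an option, roughly the same idea can be sketched on top of any TextReader. This is not the code above: `RegexReader` is a hypothetical wrapper and the details are assumptions. It keeps appending data until the pattern matches, and it defers any match that touches the end of the buffered text, because more input could still extend it. It assumes the pattern never matches the empty string.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public sealed class RegexReader
{
    private readonly TextReader reader;
    private readonly StringBuilder pending = new StringBuilder();
    private readonly char[] buffer;
    private bool eof;

    public RegexReader(TextReader reader, int bufferSize = 4096)
    {
        if (reader == null) throw new ArgumentNullException("reader");
        if (bufferSize <= 0) throw new ArgumentOutOfRangeException("bufferSize");
        this.reader = reader;
        this.buffer = new char[bufferSize];
    }

    // Returns the next match, or null once the input is exhausted.
    public string ReadRegex(Regex regex)
    {
        while (true)
        {
            Match match = regex.Match(pending.ToString());
            // Accept a match only if it cannot grow any more: either the
            // input has ended or the match stops before the buffered end.
            if (match.Success && (eof || match.Index + match.Length < pending.Length))
            {
                pending.Remove(0, match.Index + match.Length);
                return match.Value;
            }
            if (eof)
            {
                pending.Clear();
                return null;
            }
            int read = reader.Read(buffer, 0, buffer.Length);
            if (read == 0)
                eof = true;
            else
                pending.Append(buffer, 0, read);
        }
    }
}
```

Calling `ReadRegex(new Regex(@"\d+"))` repeatedly on a `RegexReader` over "foo123bar4567" with a buffer size of 5 returns "123", then "4567", then null. The trade-offs are the same as before: everything up to the next match is held in memory and rescanned after every read, so input with long stretches without a match gets expensive.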

Mountie answered 18/2, 2016 at 12:41 Comment(0)
