Python regex parse stream
Is there any way to use regex to match on a stream in Python? Like

reg = re.compile(r'\w+')
reg.match(StringIO.StringIO('aa aaa aa'))

And I don't want to do this by getting the value of the whole string. I want to know if there's any way to match a regex on a stream (on-the-fly).

Iila answered 8/1, 2011 at 14:41 Comment(7)
that's against the idea of regex.Decipher
@SlientGhost: Not necessarily. You could want to parse some (infinite) stream using regexes, always matching at the current beginning of the stream and return the matches as an iterator (and consuming just the characters matched from the stream).Carce
@MartinStettner: Well, you could if it were an automata-theoretic matcher without backrefs (and a few other things too, such as lookahead constraints). As long as the RE can compile to a single finite automaton (either NFA or DFA), it can match things in one pass and so can spot matches in an infinite stream. (But Python uses a PCRE-style engine, which is not automata-theoretic and which needs all the bytes in memory up front.)Ballon
@DonalFellows I looked at pcre.org/pcre.txt and found no indication that the PCRE algorithm was not based on automata theory. For implementing backrefs and lookaheads it would of course need to maintain an internal buffer, but this wouldn't prevent a mechanism like, say, some kind of needmore callback from working (and in many cases the buffer would not need to be very large compared to the possibly infinite stream size).Carce
@MartinStettner: It's one of these things that some people “just know”. Stack-based matchers can support a richer language — that's how you really tell — but need a token stream they can back up within. (I guess it comes of studying these things way back when I was a CS undergraduate.)Ballon
It's too bad this isn't readily available. Clearly there are some (serious) edge cases that would need to be addressed, such as backreferences, but it's certainly conceptually a reasonable thing to want. I imagine a dedicated utility that provides a subset of re's functionality might make sense.Labourer
It might be nice to have true regular expressions, as defined in parser theory, to use in some places - guaranteed O(n) time (which Python's re isn't), guaranteed O(1) memory, single iteration over the input. And sorry, backreferences aren't a "corner case" we should somehow "address". It's a feature that blows the whole thing up, just as "balanced groups" do in .NET Regex. An FSA just won't do in such cases. Just get a textbook if you still can't believe it. That said, I think it should be possible to implement these "regular" expressions in a way to be able to parse seekable streams...Cello
I had the same problem. My first thought was to implement a LazyString class, which acts like a string but reads only as much data from the stream as currently needed (I did this by reimplementing __getitem__ and __iter__ to fetch and buffer characters up to the highest position accessed...).
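
A minimal sketch of that idea (my reconstruction, not the original code), assuming a read()-based text stream:

class LazyString:
    # Acts like a string, fetching stream data lazily on demand.
    def __init__(self, stream):
        self._stream = stream
        self._buf = ''

    def _fill(self, n):
        # Buffer at least n characters, stopping early at EOF.
        while len(self._buf) < n:
            c = self._stream.read(1)
            if not c:
                break
            self._buf += c

    def __getitem__(self, i):
        self._fill(i + 1)
        return self._buf[i]  # raises IndexError past EOF

    def __iter__(self):
        i = 0
        while True:
            try:
                yield self[i]
            except IndexError:
                return
            i += 1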

This didn't work out (I got a "TypeError: expected string or buffer" from re.match), so I looked a bit into the implementation of the re module in the standard library.

Unfortunately, using regexes on a stream does not seem to be possible. The core of the module is implemented in C, and this implementation expects the whole input to be in memory at once (I guess mainly for performance reasons). There seems to be no easy way to fix this.

I also had a look at PLY (Python Lex-Yacc), but its lexer uses re internally, so this wouldn't solve the issue.

A possibility could be to use ANTLR, which supports a Python backend. It constructs the lexer using pure Python code and seems able to operate on input streams. Since for me the problem is not that important (I do not expect my input to be excessively large...), I will probably not investigate it further, but it might be worth a look.

Carce answered 3/1, 2012 at 13:54 Comment(4)
Well researched, interesting. Perhaps acooke.org/rxpy is a reasonable alternative?Smackdab
I just found another solution: pexpect (pexpect.readthedocs.org/en/latest/api/pexpect.html)Smallsword
(many years after my comment above) There is now a way to do this https://mcmap.net/q/422630/-python-regex-parse-streamSmallsword
@GiovanniFunchal this is missing the point. The OP's question is about streaming input which needs to be matched in a rolling fashion, as in this other answer, https://mcmap.net/q/422630/-python-regex-parse-stream, using regex partial matching.Injunction
In the specific case of a file, if you can memory-map the file with mmap and if you're working with bytestrings instead of Unicode, you can feed a memory-mapped file to re as if it were a bytestring and it'll just work. This is limited by your address space, not your RAM, so a 64-bit machine with 8 GB of RAM can memory-map a 32 GB file just fine.
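
A minimal sketch of that approach (the file name and pattern here are placeholders):

import mmap
import re

with open('big.log', 'rb') as f:
    # Length 0 means "map the whole file". re accepts the mmap object
    # because it exposes the buffer interface; the pattern must be a
    # bytes pattern, not a str pattern.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in re.finditer(rb'\w+', mm):
            print(m.group())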

If you can do this, it's a really nice option. If you can't, you have to turn to messier options.


The 3rd-party regex module (not re) offers partial match support, which can be used to build streaming support... but it's messy and has plenty of caveats. Things like lookbehinds and ^ won't work, zero-width matches would be tricky to get right, and I don't know if it'd interact correctly with other advanced features regex offers and re doesn't. Still, it seems to be the closest thing to a complete solution available.

If you pass partial=True to regex.match, regex.fullmatch, regex.search, or regex.finditer, then in addition to reporting complete matches, regex will also report things that could be a match if the data was extended:

In [10]: regex.search(r'1234', '12', partial=True)
Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>

It'll report a partial match instead of a complete match if more data could change the match result, so for example, regex.search(r'[\s\S]*', anything, partial=True) will always be a partial match.
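
For instance (illustrative output, continuing the session above):

In [11]: regex.search(r'[\s\S]*', 'abc', partial=True)
Out[11]: <regex.Match object; span=(0, 3), match='abc', partial=True>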

With this, you can keep a sliding window of data to match, extending it when you hit the end of the window and discarding consumed data from the beginning. Unfortunately, anything that would get confused by data disappearing from the start of the string won't work, so lookbehinds, ^, \b, and \B are out. Zero-width matches would also need careful handling. Here's a proof of concept that uses a sliding window over a file or file-like object:

import regex

def findall_over_file_with_caveats(pattern, file):
    # Caveats:
    # - doesn't support ^ or backreferences, and might not play well with
    #   advanced features I'm not aware of that regex provides and re doesn't.
    # - Doesn't do the careful handling that zero-width matches would need,
    #   so consider behavior undefined in case of zero-width matches.
    # - I have not bothered to implement findall's behavior of returning groups
    #   when the pattern has groups.
    # Unlike findall, produces an iterator instead of a list.

    # bytes window for bytes pattern, unicode window for unicode pattern
    # We assume the file provides data of the same type.
    window = pattern[:0]
    chunksize = 8192
    sentinel = object()

    last_chunk = False

    while not last_chunk:
        chunk = file.read(chunksize)
        if not chunk:
            last_chunk = True
        window += chunk

        match = sentinel
        for match in regex.finditer(pattern, window, partial=not last_chunk):
            if not match.partial:
                yield match.group()

        if match is sentinel or not match.partial:
            # No partial match at the end (maybe even no matches at all).
            # Discard the window. We don't need that data.
            # The only cases I can find where we do this are if the pattern
            # uses unsupported features or if we're on the last chunk, but
            # there might be some important case I haven't thought of.
            window = window[:0]
        else:
            # Partial match at the end.
            # Discard all data not involved in the match.
            window = window[match.start():]
            if match.start() == 0:
                # Our chunks are too small. Make them bigger.
                chunksize *= 2
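
A hypothetical usage, feeding the generator an in-memory byte stream for demonstration:

import io

f = io.BytesIO(b'foo bar baz')
print(list(findall_over_file_with_caveats(rb'\w+', f)))
# [b'foo', b'bar', b'baz']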
Ackerley answered 28/3, 2017 at 4:42 Comment(0)
This seems to be an old problem. As I have posted to a similar question, you may want to subclass the Matcher class of my solution streamsearch-py and perform regex matching in the buffer. Check out kmp_example.py for a template. If it turns out classic Knuth-Morris-Pratt matching is all you need, then your problem would be solved right now with this little open-source library :-)
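
For illustration only (this is not the streamsearch-py API), a minimal self-contained sketch of what streaming KMP matching looks like, consuming the input chunk by chunk:

def kmp_stream(pattern, chunks):
    # Build the standard KMP failure table for the pattern.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    state = 0   # number of pattern characters currently matched
    offset = 0  # absolute position in the stream so far
    for chunk in chunks:
        for ch in chunk:
            while state and ch != pattern[state]:
                state = fail[state - 1]
            if ch == pattern[state]:
                state += 1
            if state == len(pattern):
                yield offset - len(pattern) + 1  # start offset of the match
                state = fail[state - 1]
            offset += 1

# e.g. list(kmp_stream('aba', iter(['ab', 'aba']))) == [0, 2]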

Goffer answered 11/1, 2015 at 11:43 Comment(0)
The answers here are now outdated. The modern Python re package supports bytes-like objects, which have an API you can implement yourself to get streaming behaviour.

Smallsword answered 20/10, 2021 at 19:43 Comment(1)
A bytes-like object is still a static contiguous blob in memory. The OP is asking for a stream regex matcher, an online algorithm waiting for data. Currently you have to chop the incoming stream into blocks yourself and feed them chunk by chunk into separate calls of the regex matcher.Injunction
Yes - using the getvalue method:

import cStringIO
import re

data = cStringIO.StringIO("some text")
regex = re.compile(r"\w+")
regex.match(data.getvalue())
Humour answered 8/1, 2011 at 14:51 Comment(1)
Well, that's the same thing as feeding it a string; I was wondering if there's any way to parse a stream.Iila
