In the specific case of a file, if you can memory-map the file with `mmap` and if you're working with bytestrings instead of Unicode, you can feed a memory-mapped file to `re` as if it were a bytestring and it'll just work. This is limited by your address space, not your RAM, so a 64-bit machine with 8 GB of RAM can memory-map a 32 GB file just fine.

If you can do this, it's a really nice option. If you can't, you have to turn to messier options.
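For instance, a minimal sketch of the `mmap` approach (the filename and pattern here are placeholders):

```python
import mmap
import re

# re accepts any bytes-like object, and a read-only mmap is one, so the
# regex engine scans the file through the OS page cache rather than
# requiring the whole file to be read into memory first.
with open("big.log", "rb") as f:  # placeholder filename
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in re.finditer(rb"\d{4}-\d{2}-\d{2}", mm):  # placeholder pattern
            print(m.group())
```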
The 3rd-party `regex` module (not `re`) offers partial match support, which can be used to build streaming support... but it's messy and has plenty of caveats. Things like lookbehinds and `^` won't work, zero-width matches would be tricky to get right, and I don't know if it'd interact correctly with other advanced features `regex` offers and `re` doesn't. Still, it seems to be the closest thing to a complete solution available.
If you pass `partial=True` to `regex.match`, `regex.fullmatch`, `regex.search`, or `regex.finditer`, then in addition to reporting complete matches, `regex` will also report things that could be a match if the data was extended:
```
In [10]: regex.search(r'1234', '12', partial=True)
Out[10]: <regex.Match object; span=(0, 2), match='12', partial=True>
```
It'll report a partial match instead of a complete match if more data could change the match result, so for example, `regex.search(r'[\s\S]*', anything, partial=True)` will always be a partial match.
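For instance (an illustrative session; the output here is reconstructed by hand from the repr format shown above, so treat it as approximate):

```
In [11]: regex.search(r'[\s\S]*', 'anything', partial=True)
Out[11]: <regex.Match object; span=(0, 8), match='anything', partial=True>
```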
With this, you can keep a sliding window of data to match, extending it when you hit the end of the window and discarding consumed data from the beginning. Unfortunately, anything that would get confused by data disappearing from the start of the string won't work, so lookbehinds, `^`, `\b`, and `\B` are out. Zero-width matches would also need careful handling. Here's a proof of concept that uses a sliding window over a file or file-like object:
```python
import regex

def findall_over_file_with_caveats(pattern, file):
    # Caveats:
    # - Doesn't support ^ or backreferences, and might not play well with
    #   advanced features I'm not aware of that regex provides and re doesn't.
    # - Doesn't do the careful handling that zero-width matches would need,
    #   so consider behavior undefined in case of zero-width matches.
    # - I have not bothered to implement findall's behavior of returning groups
    #   when the pattern has groups.
    # Unlike findall, produces an iterator instead of a list.

    # bytes window for bytes pattern, unicode window for unicode pattern.
    # We assume the file provides data of the same type.
    window = pattern[:0]
    chunksize = 8192
    sentinel = object()
    last_chunk = False

    while not last_chunk:
        chunk = file.read(chunksize)
        if not chunk:
            last_chunk = True
        window += chunk

        match = sentinel
        for match in regex.finditer(pattern, window, partial=not last_chunk):
            if not match.partial:
                yield match.group()

        if match is sentinel or not match.partial:
            # No partial match at the end (maybe even no matches at all).
            # Discard the window. We don't need that data.
            # The only cases I can find where we do this are if the pattern
            # uses unsupported features or if we're on the last chunk, but
            # there might be some important case I haven't thought of.
            window = window[:0]
        else:
            # Partial match at the end.
            # Discard all data not involved in the match.
            window = window[match.start():]
            if match.start() == 0:
                # Our chunks are too small. Make them bigger.
                chunksize *= 2
```
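A hypothetical usage (the filename is made up; note the bytes pattern to match the binary file):

```python
with open("haystack.bin", "rb") as f:  # made-up filename
    for hit in findall_over_file_with_caveats(rb"\d+", f):
        print(hit)  # each hit is a bytes object, e.g. b'12345'
```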