How to read records terminated by a custom separator from a file in Python?

I would like a way to do for line in file in Python, where the end of line is redefined to be any string that I want. Another way of saying that is that I want to read records from a file rather than lines; I want it to be as fast and convenient as reading lines.

This is the Python equivalent of setting Perl's $/ input record separator, or of using Scanner in Java. It doesn't necessarily have to use for line in file (in particular, the iterator may not be a file object); I just want something equivalent that avoids reading too much data into memory.

See also: Add support for reading records with arbitrary separators to the standard IO stack

Mingy answered 25/10, 2013 at 22:39

There is nothing in the Python 2.x file object, or the Python 3.3 io classes, that lets you specify a custom delimiter for readline. (Iterating with for line in file ultimately uses the same code as readline.)

But it's pretty easy to build it yourself. For example:

def delimited(file, delimiter='\n', bufsize=4096):
    buf = ''
    while True:
        newbuf = file.read(bufsize)
        if not newbuf:
            # EOF: whatever is left in the buffer is the final record
            yield buf
            return
        buf += newbuf
        lines = buf.split(delimiter)
        # every piece but the last is a complete record; the last piece
        # may have been cut off mid-record, so keep it for the next read
        for line in lines[:-1]:
            yield line
        buf = lines[-1]

Here's a stupid example of it in action:

>>> s = io.StringIO('abcZZZdefZZZghiZZZjklZZZmnoZZZpqr')
>>> d = delimited(s, 'ZZZ', bufsize=2)
>>> list(d)
['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr']

If you want to get it right for both binary and text files, especially in 3.x, it's a bit trickier. But if it only has to work for one or the other (and one language or the other), you can ignore that.
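
Here's a minimal sketch of one way to handle both at once: take the empty starting buffer from the delimiter itself, so the same generator works with str delimiters on text files and bytes delimiters on binary files (the name delimited_any is just for illustration, and it assumes the delimiter's type matches the file's mode):

def delimited_any(file, delimiter, bufsize=4096):
    # delimiter[:0] is '' for a str delimiter and b'' for bytes, so buf
    # always has the same type as whatever file.read() returns
    buf = delimiter[:0]
    while True:
        newbuf = file.read(bufsize)
        if not newbuf:
            yield buf
            return
        buf += newbuf
        pieces = buf.split(delimiter)
        for piece in pieces[:-1]:
            yield piece
        buf = pieces[-1]

A mismatch (say, a bytes delimiter on a text-mode file) fails loudly at the first buf += newbuf, which is probably what you want.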

Likewise, if you're using Python 3.x (or using io objects in Python 2.x), and want to make use of the buffers that are already being maintained in a BufferedIOBase instead of just putting a buffer on top of the buffer, that's trickier. The io docs do explain how to do everything… but I don't know of any simple examples, so you're really going to have to read at least half of that page and skim the rest. (Of course, you could just use the raw files directly… but not if you want to find unicode delimiters…)
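
Here's a rough sketch of that idea for binary files, using BufferedReader.peek(), which returns bytes already sitting in the internal buffer without consuming them (the name delimited_peek and all the details are mine; treat it as a starting point, not a vetted implementation):

def delimited_peek(buffered, delimiter=b'\n'):
    # buffered is an io.BufferedReader; peek() exposes its internal
    # buffer (refilling from the raw stream at most once), and read(n)
    # then consumes exactly n of those already-buffered bytes
    record = b''
    while True:
        chunk = buffered.peek()
        if not chunk:
            # EOF: emit whatever partial record is left
            if record:
                yield record
            return
        # search across the boundary between bytes we've already taken
        # and bytes still in the buffer, in case the delimiter straddles
        # it (rescanning record each pass is fine for a sketch)
        window = record + chunk
        i = window.find(delimiter)
        if i >= 0:
            # consume only the part of the match still in the buffer
            buffered.read(i + len(delimiter) - len(record))
            yield window[:i]
            record = b''
        else:
            record += buffered.read(len(chunk))

Usage is the same as above, e.g. for rec in delimited_peek(open('data.bin', 'rb'), b'ZZZ'). Unlike the first version, this one skips a trailing empty record at EOF, so adjust to taste.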

Valorie answered 25/10, 2013 at 22:48
After reading through the whole tracker issue the OP linked, it looks like Douglas Alan already posted a very similar recipe 5 years into the discussion. I like his better because it allows you to transform the input newline into an output newline instead of just discarding it… but rather than edit mine to match, I'll just leave the link.Valorie
Another advantage of the one linked is that it returns the remainder of the buffer when the stream is closed.Ahab
@jozxyqk: I'm not sure what you mean by that. This version yields the remainder of the buffer at EOF. (If the file is actually closed out from under you and raises an exception, I assume you want that exception; after all, the whole point is to work like "for line in file:" but with a different delimiter.)Valorie
Aah, my mistake, I should have read more closely. I was testing by reading sys.stdin directly and printing the output, still using \n, and for some reason the remaining characters weren't printing when I hit ctrl-D. Looking at the code again, I'm not sure why and assume I've done something wrong.Ahab
@jozxyqk: Reading from line-buffered stdin has some oddities with ^D which depend on your platform, terminal, and Python version, which can get in the way of testing other things. (See if "for line in sys.stdin:" and "for line in iter(input, ''):" do different things for you...)Valorie

The issue discussion the OP linked contains yet another solution for reading data rows terminated by a custom separator from a file, posted by Alan Barnet. It works for both text and binary files and is a big improvement on the fileLineIter recipe by Douglas Alan.

Here's my polished version of Alan Barnet's resplit. I have replaced the string addition += with the allegedly faster "".join string concatenation, and I added type hints for readability (they don't change runtime performance). My version is tuned to work with binary files. I have to use a regex pattern for splitting because my delimiter in its plain form also occurs inside the data rows in a non-delimiting role, so I need to consider its context. However, you can retune it for text files and replace the regex pattern with a plain str if you have a simple, unique delimiter that isn't used elsewhere (a text-mode sketch follows the listing below).
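
For instance, a hypothetical lookahead pattern along these lines (the 0x02 record-start byte is invented for illustration) splits on the delimiter only in the right context:

import re

# Hypothetical: split on b"ABC" only when it is immediately followed by
# a record-start byte (0x02 here), so a b"ABC" that occurs inside a
# data row is not treated as a delimiter.
data_rows_delimiter = re.compile(rb"ABC(?=\x02)")

With that said, here's the full binary-mode version: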

import pathlib
import functools
import re
from typing import Iterator
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)


def resplit(chunks_of_a_file: Iterator[bytes], split_pattern: re.Pattern) -> Iterator[bytes]:
    """
    Reads chunks of a file one chunk at a time, 
    splits them into data rows by `split_pattern` 
    and joins partial data rows across chunk boundaries.
    borrowed from https://bugs.python.org/issue1152248#msg223491
    """
    partial_line = None
    for chunk in chunks_of_a_file:
        if partial_line:
            partial_line = b"".join((partial_line, chunk))
        else:
            partial_line = chunk
        if not chunk:
            break
        lines = split_pattern.split(partial_line)
        partial_line = lines.pop()
        yield from lines
    if partial_line:
        yield partial_line


if __name__ == "__main__":
    path_to_source_file = pathlib.Path("source.bin")
    with open(path_to_source_file, mode="rb") as file_descriptor:
        buffer_size = 8192
        sentinel = b""
        # iter(callable, sentinel) calls file_descriptor.read(buffer_size)
        # repeatedly until it returns the sentinel b"" at EOF
        chunks = iter(functools.partial(file_descriptor.read, buffer_size), sentinel)
        data_rows_delimiter = re.compile(b"ABC")
        lines = resplit(chunks, data_rows_delimiter)
        for line in lines:
            logger.debug(line)
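
And here's a minimal text-mode retuning of the same idea, sketched with a plain str delimiter (the file name source.txt and the delimiter ZZZ are invented for the example):

import functools

def resplit_text(chunks, delimiter):
    """Text-mode variant of resplit: plain str.split instead of a
    compiled regex, for a delimiter that never occurs inside a row."""
    partial_line = ""
    for chunk in chunks:
        partial_line += chunk
        rows = partial_line.split(delimiter)
        # the last piece may be an incomplete row; hold it back
        partial_line = rows.pop()
        yield from rows
    if partial_line:
        yield partial_line

with open("source.txt", mode="r", encoding="utf-8") as f:
    chunks = iter(functools.partial(f.read, 8192), "")
    for row in resplit_text(chunks, "ZZZ"):
        print(row)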
Linhliniment answered 18/9, 2019 at 2:47
