Python: how to read N lines at a time

I am writing code to take an enormous text file (several GB), read it N lines at a time, process that batch, and move on to the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size.)

I have been reading about using itertools.islice for this operation. I think I am halfway there:

from itertools import islice
N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)

for lines in lines_gen:
     ...process my lines...

The trouble is that I would like to process the next batch of 16 lines, but I am missing something.

Vestal answered 13/6, 2011 at 20:20 Comment(4)
Possible duplicate of Lazy Method for Reading Big File in Python? – Watt
@ken - OP is asking about how to do this using islice; in that post the OP asks how to do this with yield. – Universalize
Possible duplicate of How to read file N lines at a time in Python? – Arlberg
@JonathanH I determined that this is a better version of the question, mainly on the strength of the top answer. The top/accepted answer there only gets the first N lines, and includes a variation that reads the whole file into memory first (clearly undesirable). – Cavitation

islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines. At the end of the file, the list might be shorter, and eventually the call will return an empty list.

from itertools import islice
with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines

An alternative is to use the grouper pattern; note that with this approach the last chunk is padded with None (the default fillvalue) if the number of lines is not a multiple of n:

from itertools import zip_longest
with open(...) as f:
    for next_n_lines in zip_longest(*[f] * n):
        pass  # process next_n_lines
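
As an aside (not part of the original answer), the *[f] * n part may look cryptic: [f] * n builds a list holding the same file iterator n times, and * unpacks it, so the call is equivalent to zip_longest(f, f, ..., f) with n arguments. Because every position draws from the same iterator, each output tuple contains n consecutive lines. A minimal sketch, with a plain list standing in for the file:

from itertools import zip_longest

lines = ["a\n", "b\n", "c\n", "d\n", "e\n"]
it = iter(lines)
# zip_longest(*[it] * 2) is the same call as zip_longest(it, it); both slots
# pull from the single shared iterator, so each tuple holds consecutive items.
for chunk in zip_longest(*[it] * 2):
    print(chunk)
# ('a\n', 'b\n')
# ('c\n', 'd\n')
# ('e\n', None)    <- final chunk padded with the default fillvalue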
Buntline answered 13/6, 2011 at 20:24 Comment(10)
I am learning Python these days and have a question: ideally, if you were reading a database or a file of records, you would need to mark the records as read (another column needed), and in the next batch you would start processing the next unmarked records. How is that achieved here, especially in next_n_lines = list(islice(infile, n))? – Mythology
@zengr: I don't understand your question. list(islice(infile, n)) will get the next chunk of n lines from the file. Files know what you have already read; you can simply continue reading. – Buntline
@Sven Say my batch job runs once every day. I have a huge text file of 1M lines, but I only want to read the first 1000 lines on day 1. The job stops. On day 2, I should start processing the same file from the 1001st line. So how do you maintain that, other than storing the line number count somewhere else? – Mythology
@zengr: You have to store the counter somewhere. That's a completely unrelated question -- use the "Ask Question" button in the upper right corner. – Buntline
@Sven Marnach Thank you! Your islice code snippet worked awesomely! – Vestal
I wanted to read a file n lines by n lines. izip_longest() is exactly what I was looking for! However, I really don't understand the *[f] * n arguments ... any ideas? – Frentz
@Stéphane: This idiom is mentioned in the docs of itertools.izip(). Have a look at the equivalent Python code there, and keep in mind that e.g. izip(*[f] * 5) is equivalent to izip(f, f, f, f, f). – Buntline
Thx. Doing some tests with test(*[f]*2), test([f]*2) and test(f*2) helped me, along with def test(*args): print args – Frentz
(Admittedly not the same question as the original) What if n, the number of lines to group next, changes and is specified in the input file itself? For example: 2 a b 3 a b c => we want to group the input as (a,b) (a,b,c). – Benge
@dhfromkorea: I would suggest using a custom generator function for this, see gist.github.com/smarnach/75146be0088e7b5c503f. – Buntline

The question appears to presume that there is efficiency to be gained by reading an "enormous textfile" in blocks of N lines at a time. This adds an application layer of buffering over the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing.

Thus:

with open('my_very_large_text_file') as f:
    for line in f:
        process(line)

is probably superior to any alternative in time, space, complexity and readability.

See also Rob Pike's first two rules, Jackson's Two Rules, and PEP 20, The Zen of Python. If you really just wanted to play with islice, you should have left out the large-file stuff.

Wamsley answered 13/6, 2011 at 22:22 Comment(4)
Hi! The reason I have to process my enormous text file in blocks of N lines is that I am choosing one random line out of each group of N. This is for a bioinformatics analysis, and I want to make a smaller file that has equal representation from the entire dataset. Not all data is created equally in biology! There may be a different (perhaps better?) way to choose X random lines equally distributed across a huge data set, but this is the first thing that I thought of. Thanks for the links! – Vestal
@Vestal that's a hugely different question for which there are far more statistically useful samplings. I shall look for something off the shelf and turn it into a new question here. I'll put a link here when I do. Auto-correlation is a sad artifact to introduce. – Wamsley
I answered it in this question instead: #6336339 – Wamsley
@Wamsley What if I need to read, let's say, 10 lines of a very huge file to send them to multiprocessing.Pool? Your one-by-one line read won't be useful then, would it? – Danford

Here is another way using groupby (this snippet is written for Python 2):

from itertools import count, groupby

N = 16
with open('test') as f:
    for g, group in groupby(f, key=lambda _, c=count(): c.next()/N):
        print list(group)

How it works:

Basically, groupby() groups the lines by the return value of the key parameter, which here is the lambda function lambda _, c=count(): c.next()/N. It uses the fact that the default argument c is bound to a single count() iterator when the function is defined, so every time groupby() calls the lambda it advances the counter and integer-divides the result by N; the return value therefore changes once every N lines, and that change is what starts a new group:

# 1 iteration.
c.next() => 0
0 / 16 => 0
# 2 iteration.
c.next() => 1
1 / 16 => 0
...
# Start of the second grouper.
c.next() => 16
16/16 => 1   
...
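
Since the snippet above targets Python 2 (c.next(), integer division with /, the print statement), a rough Python 3 equivalent of the same idea looks like this (a sketch, not part of the original answer):

from itertools import count, groupby

N = 16
with open('test') as f:
    counter = count()
    # The key function ignores the line itself; it just advances the counter,
    # so the key value changes once every N lines and starts a new group.
    for key, group in groupby(f, key=lambda line: next(counter) // N):
        print(list(group))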
Mandamandaean answered 13/6, 2011 at 20:37 Comment(0)

Since the requirement was added that there be a statistically uniform distribution of the lines selected from the file, I offer this simple approach.

"""randsamp - extract a random subset of n lines from a large file"""

import random

def scan_linepos(path):
    """return a list of seek offsets of the beginning of each line"""
    linepos = []
    offset = 0
    with open(path) as inf:     
        # WARNING: CPython 2.7 file.tell() is not accurate on file.next()
        for line in inf:
            linepos.append(offset)
            offset += len(line)
    return linepos

def sample_lines(path, linepos, nsamp):
    """return nsamp lines from path where line offsets are in linepos"""
    offsets = random.sample(linepos, nsamp)
    offsets.sort()  # this may make file reads more efficient

    lines = []
    with open(path) as inf:
        for offset in offsets:
            inf.seek(offset)
            lines.append(inf.readline())
    return lines

dataset = 'big_data.txt'
nsamp = 5
linepos = scan_linepos(dataset) # the scan only need be done once

lines = sample_lines(dataset, linepos, nsamp)
print('selecting %d lines from a file of %d' % (nsamp, len(linepos)))
print(''.join(lines))

I tested it on a mock data file of 3 million lines comprising 1.7 GB on disk. scan_linepos dominated the runtime, taking about 20 seconds on my not-so-hot desktop.

Just to check the performance of sample_lines I used the timeit module like so:

import timeit
t = timeit.Timer('sample_lines(dataset, linepos, nsamp)', 
        'from __main__ import sample_lines, dataset, linepos, nsamp')
trials = 10 ** 4
elapsed = t.timeit(number=trials)
print(u'%dk trials in %.2f seconds, %.2fµs per trial' % (trials/1000,
        elapsed, (elapsed/trials) * (10 ** 6)))

I ran this for various values of nsamp: when nsamp was 100, a single sample_lines call completed in 460µs, and it scaled linearly up to 10k samples at 47ms per call.

The natural next question is "Random is barely random at all?", and the answer is "sub-cryptographic but certainly fine for bioinformatics".

Wamsley answered 14/6, 2011 at 16:52 Comment(5)
@Vestal - thanks for the pleasant diversion from my real work o.O – Wamsley
@Wamsley Awesome solution. It runs very fast, and I love that random.sample takes a sample without replacement. The only problem is that I get a memory error when writing my output files... but I can probably fix it myself. (The first thing that I will try is writing the output file one line at a time, instead of all the lines joined together.) Thanks for a great solution! I have 9 million lines, sampling them 11 times in a loop, so time-saving measures are great! Manipulating lists and loading all the lines into lists was just taking way too long to run. – Vestal
@Wamsley I have fixed it to write each line to the outfile one at a time to avoid memory issues. Everything runs great! It takes 4 min 25 sec to run, which is way better than the 2+ hours the previous version took (iterating over lists). I really like that this solution only loads into memory the lines that are sampled, via their offsets. It's a neat and efficient trick. I can say I learned something new today! – Vestal
@Vestal - glad to be of assistance; however, the credit for the approach goes to Kernighan and Plauger's "Software Tools in Pascal" (1981), where they use this index method for implementing ed(1) in a language without a native character type! Some tricks just never get old. – Wamsley
@brokentypewriter, msw: scan_linepos() doesn't include the offset 0 in the list, but it does include the offset past the last line. This means the sample never includes the first line, but might include an empty line if the offset past the last line is hit. The easiest fix is to swap the two lines in the for-loop (note that the code as shown above already appends the offset before advancing it, so it incorporates this fix). – Buntline

I used the grouper function from What is the most “pythonic” way to iterate over a list in chunks?:

from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)


with open(filename) as f:
    for lines in grouper(f, chunk_size, ""): #for every chunk_sized chunk
        """process lines like 
        lines[0], lines[1] , ... , lines[chunk_size-1]"""
Getaway answered 13/6, 2011 at 20:32 Comment(4)
@Sven Marnach; sorry, that "grouper" must be "chunker". But I think (I don't really understand yours) it does the same as your grouper function. Edit: no, it doesn't. – Getaway
Still confusing. 1. chunker() is defined with two parameters and called with three. 2. Passing f as seq will try to slice the file object, which simply doesn't work. You can only slice sequences. – Buntline
@Sven Marnach; actually, I first took the first answer from that question for my answer and created the code for it, then thought the second answer was better and changed the function, but I forgot to change the function call. And you are right about slicing; my mistake, trying to correct it. Thanks. – Getaway
@Getaway izip_longest ---> zip_longest – Danford

Assuming "batch" means to want to process all 16 recs at one time instead of individually, read the file one record at a time and update a counter; when the counter hits 16, process that group.

interim_list = []
infile = open("my_very_large_text_file", "r")
ctr = 0
for rec in infile:
    interim_list.append(rec)
    ctr += 1
    if ctr > 15:
        process_list(interim_list)
        interim_list = []
        ctr = 0

Then process the final, possibly short, group (guarding against an empty list when the line count is an exact multiple of 16):

if interim_list:
    process_list(interim_list)
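
The same batching idea can also be written without a manual counter by letting enumerate number the lines; a minimal sketch, reusing the hypothetical process_list from above:

interim_list = []
with open("my_very_large_text_file", "r") as infile:
    for lineno, rec in enumerate(infile, start=1):
        interim_list.append(rec)
        if lineno % 16 == 0:      # a full batch of 16 lines collected
            process_list(interim_list)
            interim_list = []

if interim_list:                  # leftover partial batch, if any
    process_list(interim_list)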

Tarra answered 13/6, 2011 at 22:46 Comment(0)

Another solution might be to create an iterator that yields lists of n elements:

def n_elements(n, it):
    try:
        while True:
            yield [next(it) for j in range(0, n)]
    except StopIteration:
        return

with open(filename, 'rt') as f:
    for n_lines in n_elements(n, f):
        do_stuff(n_lines)
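
One caveat: if the number of lines is not a multiple of n, the final short chunk is silently dropped, because the StopIteration raised inside the list comprehension discards the partially built list. A small sketch of a variant that also yields the last partial chunk:

def n_elements(n, it):
    chunk = []
    for item in it:
        chunk.append(item)
        if len(chunk) == n:       # a full chunk of n items
            yield chunk
            chunk = []
    if chunk:                     # final chunk shorter than n
        yield chunk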

Wyeth answered 3/12, 2022 at 20:48 Comment(0)
