Lazy Method for Reading Big File in Python?

381

I have a very big file (4 GB), and when I try to read it my computer hangs. So I want to read it piece by piece, and after processing each piece, store the processed piece in another file and read the next piece.

Is there any method to yield these pieces?

I would love to have a lazy method.

Chickadee answered 6/2, 2009 at 9:11 Comment(0)
554

To write a lazy function, just use yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)

Another option would be to use iter and a helper function:

f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)

If the file is line-based, the file object is already a lazy generator of lines:

for line in open('really_big_file.dat'):
    process_data(line)
Elater answered 6/2, 2009 at 9:20 Comment(27)
So the line f = open('really_big_file.dat') is just a pointer without any memory consumption? (I mean the memory consumed is the same regardless of the file size?) How will it affect performance if I use urllib.readline() instead of f.readline()?Moua
Good practice to use open('really_big_file.dat', 'rb') for compatibility with our Posix-challenged Windows using colleagues.Griselgriselda
to make even shorter, use functools.partial.Derekderelict
As this question was answered in '09 I had one follow-up question on this: is reading lines from a big file using file_handle.readline() more efficient than this? (Python 2.7)Bowne
@Bowne no, iterating is a bit more efficient, as it caches a bit more in python 2.7;Elater
@Elater You mean readline() is better? Thanks for the responseBowne
@Bowne readline() buffers less data. If that's a good thing or not is another matter, it depends on the problem. When reading a file descriptor connected to another process, maybe it can be better, so you can get results as they appear in the pipe (no buffering). On the other hand, buffering gives more performance if you're going to process all data anyway. It's a tradeoff.Elater
Missing rb as @Tal Weiss mentioned; and missing a file.close() statement (could use with open('really_big_file.dat', 'rb') as f: to accomplish the same); see here for another concise implementationCollotype
Ran into an issue where I was writing to a log file within the file read loop, as such when I read the log file it never finished reading.Formic
@cod3monk3y: text and binary files are different things. Both types are useful but in different cases. The default (text) mode may be useful here i.e., 'rb' is not missing.Indemnification
@j-f-sebastian: true, the OP did not specify whether he was reading textual or binary data. But if he's using python 2.7 on Windows and is reading binary data, it is certainly worth noting that if he forgets the 'b' his data will very likely be corrupted. From the docs - Python on Windows makes a distinction between text and binary files; [...] it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files.Collotype
Can we use this method to upload large chunks of file over UDP to a remote server?Almetaalmighty
How will the method read_in_chunks() know that it has reached the end of the file?Adallard
Once it has yielded, the control goes back to the line data = file_object.read(chunk_size), wouldn't it read the same chunk again?Adallard
Here's a generator that returns 1k chunks: buf_iter = (x for x in iter(lambda: buf.read(1024), '')). Then for chunk in buf_iter: to loop through the chunks.Mentor
@Mentor your generator comprehension is useless, you can just use for chunk in iter(lambda: buf.read(1024), ''): directlyElater
@nikhilvj huh, read the whole answer? There is already a method for that listed!Elater
If I am reading a text file in chunks... how can I make sure my chunk doesn't end up splitting a word in pieces? I want the chunk breaks to happen on whitespace or some other delimiter (basically anything that is not an ASCII letter)Becka
@Becka you can use yield to write a function that reads a chunk, splits it into words, and then yields each word separately. You can store the split word piece in a buffer variable, to join with the next chunk iteration. It all depends on what your code needs (see the first sketch after this comment thread).Elater
@nosklo, would this approach work with hierarchical structured JSON data which contains list of records, but where each record is not line-oriented, and spans multiple (possibly different) key-value JSON structures.Chemotaxis
@Elater I want to return this iter to some other method. How to ensure file handle is closed when iteration is completed. To take an example, for piece in iter(read1k, ''): process_data(piece). I want this file pointer to be closedCassandra
Use a with statement inside the generator function, @DeepanshuArora - search for existing questions about it, there are many already (see the second sketch after this comment thread)Elater
Is there any way to use this with threading or multiprocessing if I am also doing: tokens = nltk.word_tokenize(raw); af = nltk.Text(tokens) right after reading the file into raw. Don't care if some words get lost.Whetstone
In your last code example, you use open. Then you also have to close the file, otherwise it will remain open. Using with will auto-close it. Can I use with open() as f: for line in f... having a lazy generator?Ensor
@Ensor sure, it just is beyond the scope of the actual question; That was just an example. You can use the file object as lazy generator regardless of how you open/close it. And since there's no reference to the file object it will be immediately closed as well, in CPython.Elater
@nosklo, thanks, you wrote that when looping over a file you do not need yield because of the lazy generator. But how is the example in the link interpreted then, since it uses yield?Ensor
@Ensor The comment section is not a good place to ask new questions; use the "Ask a question" button instead! The code in the link is a lazy generator that itself uses a lazy generator internally - like a function that calls another function - it lazily consumes the line generator from the file object and yields the lines to its caller.Elater
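
Two minimal sketches prompted by the comments above (the names read_words and read_in_chunks_from_path are illustrative, not part of the original answer; read_in_chunks is the generator defined above):

Splitting chunks on word boundaries instead of at arbitrary byte offsets - the trailing partial word is carried over to the next chunk:

def read_words(file_object, chunk_size=1024):
    """Yield whole words from a text file read in fixed-size chunks."""
    leftover = ''
    for chunk in read_in_chunks(file_object, chunk_size):
        chunk = leftover + chunk
        words = chunk.split()
        if chunk and not chunk[-1].isspace():
            # the chunk ended in the middle of a word; keep the fragment
            # and prepend it to the next chunk
            leftover = words.pop() if words else ''
        else:
            leftover = ''
        yield from words
    if leftover:
        yield leftover

Closing the file handle from inside the generator, so it is released as soon as iteration finishes:

def read_in_chunks_from_path(path, chunk_size=1024):
    """Own the file object inside the generator; the with statement
    closes it when the generator is exhausted or garbage-collected."""
    with open(path, 'rb') as file_object:
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data
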
49

file.readlines() takes an optional size hint argument: it reads whole lines until the total amount of data read approximately reaches that size, and returns them as a list.

bigfile = open('bigfilename','r')
tmp_lines = bigfile.readlines(BUF_SIZE)
while tmp_lines:
    process([line for line in tmp_lines])
    tmp_lines = bigfile.readlines(BUF_SIZE)
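
A sketch of the same idea using a with block so the file is closed automatically (BUF_SIZE is just a hypothetical hint, in characters, for how much to read per batch; process is the caller's own function):

BUF_SIZE = 65536

with open('bigfilename', 'r') as bigfile:
    while True:
        tmp_lines = bigfile.readlines(BUF_SIZE)
        if not tmp_lines:
            break
        for line in tmp_lines:
            process(line)
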
Corenecoreopsis answered 21/1, 2010 at 18:27 Comment(6)
it's a really great idea, especially when it is combined with the defaultdict to split big data into smaller ones.Neckcloth
I would recommend to use .read() not .readlines(). If the file is binary it's not going to have line breaks.Faunia
What if the file is one huge string?Yance
This solution is buggy. If one of the lines is larger than your BUF_SIZE, you are going to process an incomplete line. @Yance is correct.Lass
@MyersCarpenter Will that line be repeated twice? tmp_lines = bigfile.readlines(BUF_SIZE)Raab
Messy solution. You're breaking the file up into BUF_SIZE chunks, then splitting those BUF_SIZE chunks into a list unnecessarily. Why not just use file.readline(BUF_SIZE) instead? (Obviously that, too, would be hideous -- it just wouldn't be as hideous...)Dispassionate
48

There are already many good answers, but if your entire file is on a single line and you still want to process "rows" (as opposed to fixed-size blocks), these answers will not help you.

99% of the time, it is possible to process files line by line. Then, as suggested in this answer, you can use the file object itself as a lazy generator:

with open('big.csv') as f:
    for line in f:
        process(line)

However, one may run into very big files where the row separator is not '\n' (a common case is '|').

  • Converting '|' to '\n' before processing may not be an option because it can mess up fields which may legitimately contain '\n' (e.g. free text user input).
  • Using the csv library is also ruled out because, at least in early versions of the library, it is hardcoded to read the input line by line.

For these kinds of situations, I created the following snippet [Updated in May 2021 for Python 3.8+]:

def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    row = ''
    while (chunk := f.read(chunksize)) != '':   # stop at end of file
        while (i := chunk.find(sep)) != -1:     # stop when no separator is left in the chunk
            yield row + chunk[:i]
            chunk = chunk[i+1:]
            row = ''
        row += chunk
    yield row

[For older versions of python]:

def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '': # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i+1:]
        curr_row += chunk

I was able to use it successfully to solve various problems. It has been extensively tested, with various chunk sizes. Here is the test suite I am using, for those who need to convince themselves:

import os

test_file = 'test_file'

def cleanup(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        os.unlink(test_file)
    return wrapper

@cleanup
def test_empty(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1_char_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_1_char(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1025_chars_1_row(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1

@cleanup
def test_1024_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1023):
            f.write('a')
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_1025_chars_1026_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1026

@cleanup
def test_2048_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

@cleanup
def test_2049_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2

if __name__ == '__main__':
    for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        test_empty(chunksize)
        test_1_char_2_rows(chunksize)
        test_1_char(chunksize)
        test_1025_chars_1_row(chunksize)
        test_1024_chars_2_rows(chunksize)
        test_1025_chars_1026_rows(chunksize)
        test_2048_chars_2_rows(chunksize)
        test_2049_chars_2_rows(chunksize)
Lass answered 11/6, 2015 at 8:23 Comment(0)
41

If your computer, OS and Python are all 64-bit, then you can use the mmap module to map the contents of the file into memory and access it with indices and slices. Here is an example from the documentation:

import mmap
with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

If your computer, OS or Python is 32-bit, then mmap-ing large files can reserve large parts of your address space and starve your program of memory.
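
For the chunked-processing use case in the question, a memory map can also be walked lazily by slicing fixed-size windows out of it - a minimal sketch (process_data and the chunk size are placeholders):

import mmap

def process_mmap_in_chunks(path, chunk_size=1024 * 1024):
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for offset in range(0, len(mm), chunk_size):
                # each slice copies only this window into memory
                process_data(mm[offset:offset + chunk_size])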

Putative answered 6/2, 2009 at 9:41 Comment(6)
How is this supposed to work? What if I have a 32GB file? What if I'm on a VM with 256MB RAM? Mmapping such a huge file is really never a good thing.Uella
This answer deserves a -12 vote. This will kill anyone using it for big files.Towage
This can work on a 64-bit Python even for big files. Even though the file is memory-mapped, it's not read to memory, so the amount of physical memory can be much smaller than the file size.Perth
@SavinoSguera does the size of physical memory matter with mmaping a file?Wilda
@V3ss0n: I've tried to mmap 32GB file on 64-bit Python. It works (I have RAM less than 32GB): I can access the start, the middle, and the end of the file using both Sequence and file interfaces.Indemnification
I don't think it's very practical for big files, also because depending on the data structure you may end up lost seeking the correct item to retrieve. Just my gut feeling.Fess
14
f = ...  # file-like object, i.e. supporting a read(size) method and
         # returning the empty string '' when there is nothing left to read

def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    ...  # process the data

UPDATE: The approach is best explained in https://mcmap.net/q/88193/-what-is-the-idiomatic-way-to-iterate-over-a-binary-file
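
Note that the '' sentinel matches a file opened in text mode; if the file is opened in binary mode ('rb'), read() returns bytes, so the sentinel has to be b''. A minimal sketch (chunked_bytes and the filename are illustrative):

def chunked_bytes(file, chunk_size):
    return iter(lambda: file.read(chunk_size), b'')

with open('really_big_file.dat', 'rb') as f:
    for data in chunked_bytes(f, 65536):
        ...  # process the data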

Intersexual answered 31/3, 2012 at 1:50 Comment(3)
This works well for blobs, but may not be good for line separated content (like CSV, HTML, etc where processing needs to be handled line by line)Jonquil
excuse me. what is the value of f ?Raab
@user1, it can be open('filename')Intersexual
14

In Python 3.8+ you can use .read() in a while loop:

with open("somefile.txt") as f:
    while chunk := f.read(8192):
        do_something(chunk)

Of course, you can use any chunk size you want; you don't have to use 8192 (2**13) bytes. Unless your file's size happens to be a multiple of your chunk size, the last chunk will be smaller than your chunk size.
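
Since the question also wants each processed piece written to another file, the same loop can read from one file and write to another - a minimal sketch where transform() and both filenames are placeholders:

with open("somefile.txt", "rb") as src, open("processed.txt", "wb") as dst:
    while chunk := src.read(8192):
        # transform() stands in for whatever per-chunk processing is needed
        dst.write(transform(chunk))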

Aiaia answered 20/7, 2020 at 19:17 Comment(2)
Do you have any idea why your code gived this error: Traceback (most recent call last): File "C:\Users\DKMK01256\AppData\Local\Programs\Python\Python310\lib\tkinter_init_.py", line 1921, in call return self.func(*args) File "C:\Users\DKMK01256\OneDrive - WSP O365\Python\coordinatesystem changer gui.py", line 66, in convert while data := f.readlines(): ValueError: I/O operation on closed file.Saar
while data := f.readlines() is pretty different from the code in this answer you're commenting on, it uses a different function. readlines() reads the entire file, so the fact that you're doing it many times in a while loop is almost certainly a mistake.Aiaia
7

Refer to python's official documentation https://docs.python.org/3/library/functions.html#iter

Maybe this method is more pythonic:

"""A file object returned by open() is a iterator with
read method which could specify current read's block size
"""
with open('mydata.db', 'r') as f_in:
    block_read = partial(f_in.read, 1024 * 1024)
    block_iterator = iter(block_read, '')

    for index, block in enumerate(block_iterator, start=1):
        block = process_block(block)  # process your block data

        with open(f'{index}.txt', 'w') as f_out:
            f_out.write(block)
Inness answered 23/6, 2019 at 7:49 Comment(1)
Bruce is correct. I use functools.partial to parse video streams. With py;py3, I can parse over 1GB a second . ` for pkt in iter(partial(vid.read, PACKET_SIZE ), b""):`Spinthariscope
5

I think we can write it like this:

def read_file(path, block_size=1024): 
    with open(path, 'rb') as f: 
        while True: 
            piece = f.read(block_size) 
            if piece: 
                yield piece 
            else: 
                return

for piece in read_file(path):
    process_piece(piece)
Exultation answered 6/11, 2013 at 2:15 Comment(0)
2

I am not allowed to comment due to my low reputation, but SilentGhost's solution should be much easier with file.readlines([sizehint])

python file methods

edit: SilentGhost is right, but this should be better than:

s = "" 
for i in xrange(100): 
   s += file.next()
Foxy answered 6/2, 2009 at 10:37 Comment(5)
ok, sorry, you are absolutely right. but maybe this solution will make you happier ;) : s = "" for i in xrange(100): s += file.next()Foxy
-1: Terrible solution, this would mean creating a new string in memory each line, and copying the entire file data read to the new string. The worst performance and memory.Elater
why would it copy the entire file data into a new string? from the python documentation: In order to make a for loop the most efficient way of looping over the lines of a file (a very common operation), the next() method uses a hidden read-ahead buffer.Foxy
@sinzi: "s +=" or concatenating strings makes a new copy of the string each time, since the string is immutable, so you are creating a new string.Elater
@nosklo: these are details of implementation, list comprehension can be used in it's placeSydelle
1

I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:

def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]

Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line 'between' chunks.

chunk = [next(gen) for i in range(lines_required)]

Does the trick w/o losing any lines, but it doesn't look very nice.
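
itertools.islice gives the same behaviour without losing a line between chunks - a minimal sketch (line_chunks and process are placeholder names; '4gb_file' is the file from above):

from itertools import islice

def line_chunks(path, lines_required=100):
    with open(path) as file:
        while True:
            chunk = list(islice(file, lines_required))
            if not chunk:
                break
            yield chunk

for chunk in line_chunks('4gb_file'):
    process(chunk)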

Sydelle answered 6/2, 2009 at 10:12 Comment(1)
Is this pseudocode? It won't work. It is also needlessly confusing; you should make the number of lines an optional parameter of the get_line function.Elater
0

Update: you can also use file_object.readlines if you want each chunk to contain only complete lines - that is, no unfinished lines will be present in the result.

For example:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.readlines(chunk_size)
        if not data:
            break
        yield data

-- Adding on to the answer given --

When I was reading a file in chunks (say a text file named split.txt), the issue I faced was that my use case processed the data line by line, and because I was reading the file in chunks, a chunk would sometimes end with a partial line, which broke my code (it expected a complete line to process).

After reading around, I learned I could overcome this issue by keeping track of the last bit of each chunk: if the chunk does not end with a '\n', its last piece is a partial line, so I store that partial line in a variable and concatenate it with the first (unfinished) line of the next chunk. With this I was able to get over the issue and process only complete lines.
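
A minimal, generic sketch of that idea (read_complete_lines is an illustrative name; read_in_chunks is the generator from the accepted answer, repeated below; the full, use-case-specific sample follows):

def read_complete_lines(file_object, chunk_size=1024):
    """Yield only complete lines, carrying any trailing partial
    line over to the next chunk."""
    leftover = ''
    for piece in read_in_chunks(file_object, chunk_size):
        piece = leftover + piece
        lines = piece.split('\n')
        leftover = lines.pop()   # the last element may be an unfinished line
        for line in lines:
            yield line
    if leftover:
        yield leftover           # final line if the file doesn't end with '\n'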

Sample code:

# in this function i am reading the file in chunks
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# file where i am writing my final output
write_file=open('split.txt','w')

# variable i am using to store the last partial line from the chunk
placeholder= ''
file_count=1

try:
    with open('/Users/rahulkumarmandal/Desktop/combined.txt') as f:
        for piece in read_in_chunks(f):
            #print('---->>>',piece,'<<<--')
            line_by_line = piece.split('\n')

            for one_line in line_by_line:
                # if placeholder is non-empty, the previous chunk ended with a partial line that we need to concatenate with the current one
                if placeholder:
                    # print('----->',placeholder)
                    # concatenate the previous partial line with the current one
                    one_line=placeholder+one_line
                    # then reset the placeholder so that the next partial line (if any) can be stored in it and joined with the next chunk
                    placeholder=''

                # further logic specific to my use case
                segregated_data= one_line.split('~')
                #print(len(segregated_data),type(segregated_data), one_line)
                if len(segregated_data) < 18:
                    placeholder=one_line
                    continue
                else:
                    placeholder=''
                #print('--------',segregated_data)
                if segregated_data[2]=='2020' and segregated_data[3]=='2021':
                    #write this
                    data=str("~".join(segregated_data))
                    #print('data',data)
                    #f.write(data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
                elif segregated_data[2]=='2021' and segregated_data[3]=='2022':
                    #write this
                    data=str("-".join(segregated_data))
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
except Exception as e:
    print('error is', e)                
Tarragona answered 8/11, 2022 at 12:29 Comment(0)
-2

You can use the following code:

file_obj = open('big_file')

open() returns a file object.

Then use os.stat to get the size (this needs import os):

import os
file_size = os.stat('big_file').st_size

for i in range(file_size // 1024):
    print(file_obj.read(1024))
Batch answered 18/6, 2015 at 13:20 Comment(1)
wouldn't read the whole file if the size isn't a multiple of 1024Artwork
