Python fastest way to read a large text file (several GB) [duplicate]

I have a large text file (~7 GB) and I am looking for the fastest way to read it. I have been reading about several approaches, such as reading the file chunk by chunk, in order to speed up the process.

For example, effbot suggests:

# File: readline-example-3.py

file = open("sample.txt")

while 1:
    lines = file.readlines(100000)  # read at most ~100000 bytes' worth of lines
    if not lines:
        break
    for line in lines:
        pass  # do something with each line

file.close()

which the author reports processes 96,900 lines of text per second. Other authors suggest using islice():

from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines

list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines.
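
For concreteness, here is a minimal runnable sketch of the same idea; the filename sample.txt and the chunk size n are hypothetical, chosen only for illustration:

from itertools import islice

n = 100000  # hypothetical number of lines per chunk

with open("sample.txt") as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        for line in next_n_lines:
            pass  # do something with each line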

Allonym answered 18/2, 2013 at 19:50 Comment(5)
Why won't you check yourself what's fastest for you? – Jennet
Check out the suggestions here: #14863724 – Splice
@Nix I don't wish to read line by line, but chunk by chunk. – Allonym
If you look through the answers, someone shows how to do it in chunks. – Weatherwise
Dear @Nix, I read on effbot.org/zone/readline-performance.htm, under "Speeding up line reading", that the author suggests "if you're processing really large files, it would be nice if you could limit the chunk size to something reasonable". The page is quite old ("June 09, 2000") and I am looking for a newer (and faster) approach. – Allonym
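
One way to cap the chunk size, as the comment above suggests, is to read a fixed amount of data per call instead of a fixed number of lines. A minimal sketch, assuming a hypothetical file sample.txt and a roughly 1 MB chunk size:

from functools import partial

chunk_size = 1024 * 1024  # hypothetical chunk size: about 1 MB of text per read

with open("sample.txt") as f:
    # iter() keeps calling f.read(chunk_size) until it returns '' at end of file
    for chunk in iter(partial(f.read, chunk_size), ''):
        pass  # do something with each chunk
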
with open(<FILE>) as FileObj:
    for line in FileObj:
        print(line)  # or do some other thing with the line...

will read one line at a time into memory, and close the file when done...

Francophobe answered 18/2, 2013 at 19:57 Comment(8)
Morten, line-by-line became too slow. – Allonym
Aay, read too fast... – Francophobe
It looks like the result of the loop over FileObj is a single character, not a line. – Danley
The large 7 GB file could contain only one line, and in that case your solution would be as inefficient as just reading the whole file with FileObj.read(). It would be better to try chunks of several MB here (for example 5 MB chunks), which can be accomplished by calling FileObj.read(5 * 1024 * 1024) multiple times. – Kimberlite
@DemianWolf Thanks for the comment, I have a question: what happens if the given chunk size truncates half of a word? For example, if the last word is Responsibility and you hit the chunk limit at Respon, how would you handle it? Is there a way not to break words, or should we follow some other approach? Thanks! – Wolcott
@Sunny, if the file is comparably small, you can just get all the words from the whole file content (with open("my_file.txt") as fp: print(fp.read().split())). But in your case it seems you are trying to read a large file (otherwise why would you split it into chunks?). In that case you can use the same chunking approach with one difference: after you read a chunk, read the next characters one by one until you get a space (or a similar character such as \n, \r etc.), and then append the newly read part of the file to that chunk (see the sketch after these comments). – Kimberlite
@DemianWolf, I had a similar approach in mind but was hoping there might be a better way to handle it. Thanks anyway! – Wolcott
I think this is the slowest method. It would be faster if it loaded the data into memory in portions rather than the complete file content. – Understand
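
Following the comments above, a minimal sketch of chunked reading that avoids splitting words: read a fixed-size chunk, then keep reading one character at a time until whitespace is reached. The filename, chunk size, and helper name are hypothetical.

def read_chunks_on_word_boundaries(path, chunk_size=5 * 1024 * 1024):
    # Yield chunks of roughly chunk_size characters, never cutting a word in half.
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Extend the chunk one character at a time until whitespace (or EOF),
            # so the last word in the chunk stays whole.
            while not chunk[-1].isspace():
                ch = f.read(1)
                if not ch:
                    break
                chunk += ch
            yield chunk

for chunk in read_chunks_on_word_boundaries("sample.txt"):
    words = chunk.split()  # do something with the words of this chunk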
