Python fastest way to read a large text file (several GB) [duplicate]

I have a large text file (~7 GB) and I am looking for the fastest way to read it. I have been reading about several approaches, such as reading the file chunk by chunk, in order to speed up the process.

For example, effbot suggests:

# File: readline-example-3.py

file = open("sample.txt")

while 1:
    lines = file.readlines(100000)  # read at most ~100000 bytes' worth of lines
    if not lines:
        break
    for line in lines:
        pass  # do something with each line

file.close()

which the author reports processes 96,900 lines of text per second. Other authors suggest using islice():

from itertools import islice

with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines

list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines.
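
For concreteness, here is a minimal runnable sketch of the same idea; the filename sample.txt and the chunk size n are hypothetical, chosen only for illustration:

from itertools import islice

n = 100000  # hypothetical number of lines per chunk

with open("sample.txt") as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        for line in next_n_lines:
            pass  # do something with each line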

Allonym answered 18/2, 2013 at 19:50 Comment(5)
Why won't you check yourself what's fastest for you? – Jennet
Check out the suggestions here: #14863724 – Splice
@Nix I don't wish to read line by line, but chunk by chunk. – Allonym
If you look through the answers, someone shows how to do it in chunks. – Weatherwise
Dear @Nix, I read on effbot.org/zone/readline-performance.htm, under "Speeding up line reading", that the author suggests "if you're processing really large files, it would be nice if you could limit the chunk size to something reasonable". The page is quite old ("June 09, 2000") and I am looking for a newer (and faster) approach. – Allonym
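
One way to cap the chunk size, as the comment above suggests, is to read a fixed amount of data per call instead of a fixed number of lines. A minimal sketch, assuming a hypothetical file sample.txt and a roughly 1 MB chunk size:

from functools import partial

chunk_size = 1024 * 1024  # hypothetical chunk size: about 1 MB of text per read

with open("sample.txt") as f:
    # iter() keeps calling f.read(chunk_size) until it returns '' at end of file
    for chunk in iter(partial(f.read, chunk_size), ''):
        pass  # do something with each chunk
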
with open(<FILE>) as FileObj:
    for line in FileObj:
        print(line)  # or do some other thing with the line...

will read one line at a time into memory, and close the file when done...

Francophobe answered 18/2, 2013 at 19:57 Comment(8)
Morten, line-by-line became too slow. – Allonym
Aay, read too fast... – Francophobe
It looks like the result of the loop over FileObj is a single character, not a line. – Danley
The large 7 GB file could contain only one line, and in that case your solution would be as inefficient as just reading the whole file with FileObj.read(). It would be better to try chunks of several MB here (for example 5 MB chunks), which can be accomplished by calling FileObj.read(5 * 1024 * 1024) multiple times. – Kimberlite
@DemianWolf Thanks for the comment, I have a question: what happens if the given chunk size truncates half of a word? For example, if the last word is Responsibility and you hit the chunk limit at Respon, how would you handle it? Is there a way not to break words, or should we follow some other approach? Thanks! – Wolcott
@Sunny, if the file is comparably small, you can just get all the words from the whole file content (with open("my_file.txt") as fp: print(fp.read().split())). But in your case it seems you are trying to read a large file (otherwise why would you split it into chunks?). In that case you can use the same chunking approach with one difference: after you read a chunk, read the next characters one by one until you get a space (or a similar character such as \n, \r etc.), and then append the newly read part of the file to that chunk (see the sketch after these comments). – Kimberlite
@DemianWolf, I had a similar approach in mind but was hoping there might be a better way to handle it. Thanks anyway! – Wolcott
I think this is the slowest method. It would be faster if it loaded the data into memory in portions rather than the complete file content. – Understand
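
Following the comments above, a minimal sketch of chunked reading that avoids splitting words: read a fixed-size chunk, then keep reading one character at a time until whitespace is reached. The filename, chunk size, and helper name are hypothetical.

def read_chunks_on_word_boundaries(path, chunk_size=5 * 1024 * 1024):
    # Yield chunks of roughly chunk_size characters, never cutting a word in half.
    with open(path) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Extend the chunk one character at a time until whitespace (or EOF),
            # so the last word in the chunk stays whole.
            while not chunk[-1].isspace():
                ch = f.read(1)
                if not ch:
                    break
                chunk += ch
            yield chunk

for chunk in read_chunks_on_word_boundaries("sample.txt"):
    words = chunk.split()  # do something with the words of this chunk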
