(Python) Counting lines in a huge (>10GB) file as fast as possible [duplicate]

I have a really simple script right now that counts lines in a text file using enumerate():

i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()

This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great to get this under two minutes, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB each, possibly more than an hour and a half in total, and we'd like to minimise the time and memory load on the server.

I would also settle for a good approximation/estimation method, but it would need to be accurate to about four significant figures.

Thank you!

Forge answered 9/3, 2012 at 5:5 Comment(4)
In general it would probably be faster to treat the file as binary data, read through it in reasonably-sized chunks (say, 4KB at a time), and count the \n characters in each chunk as you go. – Flavius
This is not better performing than your naive solution, but FYI the Pythonic way to write what you have here would be simply with open(fname) as f: print sum(1 for line in f) – Pewee
aroth: Thanks for the tip, I should look into that. wim: great, thanks, that's much shorter... – Forge
Take a look at rawbigcount in Michael Bacon's answer. It may be helpful to you! – Unmake
49

Ignacio's answer is correct, but it might fail if you have a 32-bit process.

An alternative is to read the file block-wise and count the \n characters in each block:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))

will do the job.

Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.

For Python 3, and to make it more robust when reading files with all kinds of characters:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r",encoding="utf-8",errors='ignore') as f:
    print (sum(bl.count("\n") for bl in blocks(f)))
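
A binary-mode variant of the same block counter is also an option (a sketch of my own, not part of the original answer): it skips decoding entirely and counts newline bytes directly, which still counts each \r\n ending exactly once, assuming Unix or Windows line endings:

def blocks(files, size=65536):
    # Read the file in fixed-size chunks and yield each one until EOF.
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "rb") as f:
    print(sum(bl.count(b"\n") for bl in blocks(f)))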
Monecious answered 9/3, 2012 at 9:24 Comment(6)
Just as one data point, a read of a large file of about 51 MB went from about a minute using the naive approach to under one second using this approach. – Serpent
@MKatz What now, "a large file" or "a file of about 51 MB"? ;-) – Monecious
This solution might miss the last line, but that might not matter for a huge file. – Chokebore
@ngọcminh.oss Only if the last line is incomplete. A text file is defined to end with a line break, see pubs.opengroup.org/onlinepubs/9699919799/basedefs/… and https://mcmap.net/q/41675/-why-should-text-files-end-with-a-newline. – Monecious
Not that people care about the definition; when you work with real data, everything is messy. But it doesn't matter anyway. – Chokebore
Re missing lines (i.e. lines that don't end in a "line break"): relatively unimportant, I suppose, if the file is large. But I have files that vary from huge to one-liners, and unfortunately some of the one-liners lack a trailing newline, and the program that uses this function assumes a return value of 0 means an empty file, which may not be true. So I had to do some other checking. – Cystocarp
24

I know it's a bit unfair, but you could do this:

import subprocess

int(subprocess.check_output("wc -l C:\\alarm.bat").split()[0])

If you're on Windows, check out Coreutils.
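
As a hedged side note (not part of the original answer), the same idea can be written with the arguments passed as a list, which avoids shell parsing and works unchanged on Python 3; it assumes a wc binary on the PATH, and wc_count is just a made-up name:

import subprocess

def wc_count(path):
    # Arguments passed as a list: no shell involved, no quoting issues.
    out = subprocess.check_output(["wc", "-l", path])
    return int(out.split()[0])

print(wc_count("C:/Users/guest/Desktop/file.log"))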

Hamon answered 9/3, 2012 at 9:31 Comment(4)
My solution takes only 1m37s of real time. – Hamon
This is far faster. – Cacology
Seems like you need to do int(subprocess.check_output("/usr/bin/wc -l cred", shell=True).split()[0]) for Python 3. – Acquaintance
If you have large files or a lot of files and you are looking for pure performance without resorting to another language, consider this approach. – Resonate
17

A fast, 1-line solution is:

sum(1 for i in open(file_path, 'rb'))

It should work on files of arbitrary size.
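
If you want the file handle closed deterministically (see the comment below about the file not being closed), one possible wrapper, offered as a sketch rather than as part of the original answer:

def count_lines(file_path):
    # Same idea, wrapped in a context manager so the file is closed promptly.
    with open(file_path, 'rb') as f:
        return sum(1 for _ in f)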

Granddaughter answered 2/6, 2016 at 19:56 Comment(5)
I confirm that this is the fastest one (except the wc -l hack). Using text mode gives a small drop in performance, but it is insignificant in comparison with the other solutions. – Eventual
There is an unneeded extra pair of generator parentheses, btw. – Eventual
Without the unneeded extra generator parentheses, it appears to be slightly faster (per timeit) and consumes about 3MB less memory (per memit, for a file of 100,000 lines). – Granduncle
Doesn't seem to work if the file is a text file with newlines. My problem is large txt files that need character counting. – Rocaille
The file is not closed. – Calkins
6

mmap the file, and count up the newlines.

import mmap

def mapcount(filename):
    with open(filename, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)  # map the whole file into memory
        lines = 0
        readline = buf.readline  # bind the method once to avoid lookups in the loop
        while readline():
            lines += 1
        return lines
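
Note that "r+" needs write permission on the file; a read-only variation (my sketch, not the author's code) maps the file with ACCESS_READ instead:

import mmap

def mapcount_readonly(filename):
    # "rb" plus ACCESS_READ requires only read permission on the file.
    with open(filename, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lines = 0
        while buf.readline():
            lines += 1
        buf.close()
        return lines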
Korten answered 9/3, 2012 at 5:9 Comment(3)
Please consider adding a short example to demonstrate this, thanks! – Nickels
A short example might be a good idea, I agree. – Upcountry
Why is this faster when we'd expect the speed to be I/O bound? Does this load/read the file faster from disk? If so, why? – Presentationism
2

I'd extend gl's answer and run their code using the multiprocessing Python module for a faster count:

import multiprocessing

def blocks(f, cut, size=64 * 1024):  # 65536
    # Yield blocks from f, stopping at the end of this worker's byte range.
    start, chunk = cut
    read_size = size
    last_block = False
    while not last_block:
        # Shrink the final read so we do not run past start + chunk.
        if f.tell() + size > start + chunk:
            read_size = start + chunk - f.tell()
            last_block = True
        b = f.read(read_size)
        if not b:
            break
        yield b


def get_chunk_line_count(data):
    # data is (filename, chunk_id, (start, chunk_size)).
    fn, chunk_id, cut = data
    start, chunk = cut
    cnt = 0
    last_bl = None

    with open(fn, "r") as f:
        f.seek(start)
        for bl in blocks(f, cut):
            cnt += bl.count('\n')
            last_bl = bl

    # Adjust for a chunk whose last block does not end on a line break.
    if last_bl is not None and not last_bl.endswith('\n'):
        cnt -= 1

    return cnt

....
# pool_size, start_process and inputs come from the elided part of the script.
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process)
pool_outputs = pool.map(get_chunk_line_count, inputs)
pool.close()  # no more tasks
pool.join()

This improves counting performance roughly 20-fold. I wrapped it into a script and put it on GitHub.
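
The elided part of the script presumably builds pool_size, start_process, and the inputs list of (filename, chunk_id, (start, size)) tuples. Purely as an illustration of that plumbing, and not the author's code, a hypothetical make_inputs helper could split the file into byte ranges like this; whether the per-chunk counts add up exactly depends on how the chunk-boundary adjustment above interacts with where the cuts fall:

import multiprocessing
import os

def make_inputs(fn, pool_size):
    # Hypothetical helper: split the file into pool_size byte ranges of
    # roughly equal size; the last range absorbs any remainder.
    file_size = os.path.getsize(fn)
    chunk = file_size // pool_size
    inputs = []
    for chunk_id in range(pool_size):
        start = chunk_id * chunk
        size = chunk if chunk_id < pool_size - 1 else file_size - start
        inputs.append((fn, chunk_id, (start, size)))
    return inputs

if __name__ == "__main__":
    pool_size = multiprocessing.cpu_count()
    inputs = make_inputs("myfile.txt", pool_size)
    # The author's initializer=start_process is omitted in this sketch.
    pool = multiprocessing.Pool(processes=pool_size)
    print(sum(pool.map(get_chunk_line_count, inputs)))
    pool.close()
    pool.join()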

Koralle answered 6/12, 2016 at 21:21 Comment(1)
@Koralle Thank you for sharing the multiprocessing approach. Quick question as a newbie: how do we run this code to count the lines in a big file (say, 'myfile.txt')? I tried pool = multiprocessing.Pool(4); pool_outputs = pool.map(get_chunk_line_count, 'myfile.txt'), but that causes an error. Thanks in advance for your answer! – Murage
