(Python) Counting lines in a huge (>10GB) file as fast as possible [duplicate]

I have a really simple script right now that counts lines in a text file using enumerate():

i = 0
f = open("C:/Users/guest/Desktop/file.log", "r")
for i, line in enumerate(f):
    pass
print i + 1
f.close()

This takes around three and a half minutes to go through a 15GB log file with ~30 million lines. It would be great to get this under two minutes, because these are daily logs and we want to do a monthly analysis, so the code will have to process 30 logs of ~15GB each, possibly more than an hour and a half in total, and we'd like to minimise the time and memory load on the server.

I would also settle for a good approximation/estimation method, but it would need to be accurate to about four significant figures.

Thank you!

Forge answered 9/3, 2012 at 5:5 Comment(4)
In general it would probably be faster to treat the file as binary data, read through it in reasonably-sized chunks (say, 4KB at a time), and count the \n characters in each chunk as you go. – Flavius
This is not better performing than your naive solution, but FYI the Pythonic way to write what you have here would be simply with open(fname) as f: print sum(1 for line in f) – Pewee
aroth: Thanks for the tip, I should look into that. wim: great, thanks, that's much shorter... – Forge
Take a look at rawbigcount in Michael Bacon's answer. It may be helpful to you! – Unmake
49

Ignacio's answer is correct, but it might fail if you have a 32-bit process.

An alternative is to read the file block-wise and count the \n characters in each block:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r") as f:
    print sum(bl.count("\n") for bl in blocks(f))

will do the job.

Note that I don't open the file as binary, so the \r\n will be converted to \n, making the counting more reliable.

For Python 3, and to make it more robust when reading files with all kinds of characters:

def blocks(files, size=65536):
    while True:
        b = files.read(size)
        if not b: break
        yield b

with open("file", "r",encoding="utf-8",errors='ignore') as f:
    print (sum(bl.count("\n") for bl in blocks(f)))
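
A binary-mode variant of the same block counter is also an option (a sketch of my own, not part of the original answer): it skips decoding entirely and counts newline bytes directly, which still counts each \r\n ending exactly once, assuming Unix or Windows line endings:

def blocks(files, size=65536):
    # Read the file in fixed-size chunks and yield each one until EOF.
    while True:
        b = files.read(size)
        if not b:
            break
        yield b

with open("file", "rb") as f:
    print(sum(bl.count(b"\n") for bl in blocks(f)))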
Monecious answered 9/3, 2012 at 9:24 Comment(6)
Just as one data point, a read of a large file of about 51 MB went from about a minute using the naive approach to under one second using this approach. – Serpent
@MKatz What now, "a large file" or "a file of about 51 MB"? ;-) – Monecious
This solution might miss the last line, but that might not matter for a huge file. – Chokebore
@ngọcminh.oss Only if the last line is incomplete. A text file is defined to end with a line break, see pubs.opengroup.org/onlinepubs/9699919799/basedefs/… and https://mcmap.net/q/41675/-why-should-text-files-end-with-a-newline. – Monecious
Not that people care about the definition; when you work with real data, everything is messy. But it doesn't matter anyway. – Chokebore
Re missing lines (i.e. lines that don't end in a "line break"): relatively unimportant, I suppose, if the file is large. But I have files that vary from huge to one-liners, and unfortunately some of the one-liners lack a trailing newline, and the program that uses this function assumes a return value of 0 means an empty file, which may not be true. So I had to do some other checking. – Cystocarp
24

I know it's a bit unfair, but you could do this:

import subprocess

int(subprocess.check_output("wc -l C:\\alarm.bat").split()[0])

If you're on Windows, check out Coreutils.
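
As a hedged side note (not part of the original answer), the same idea can be written with the arguments passed as a list, which avoids shell parsing and works unchanged on Python 3; it assumes a wc binary on the PATH, and wc_count is just a made-up name:

import subprocess

def wc_count(path):
    # Arguments passed as a list: no shell involved, no quoting issues.
    out = subprocess.check_output(["wc", "-l", path])
    return int(out.split()[0])

print(wc_count("C:/Users/guest/Desktop/file.log"))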

Hamon answered 9/3, 2012 at 9:31 Comment(4)
My solution takes only 1m37s of real time. – Hamon
This is far faster. – Cacology
Seems like you need to do int(subprocess.check_output("/usr/bin/wc -l cred", shell=True).split()[0]) for Python 3. – Acquaintance
If you have large files or a lot of files and you are looking for pure performance without resorting to another language, consider this approach. – Resonate
17

A fast, 1-line solution is:

sum(1 for i in open(file_path, 'rb'))

It should work on files of arbitrary size.
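
If you want the file handle closed deterministically (see the comment below about the file not being closed), one possible wrapper, offered as a sketch rather than as part of the original answer:

def count_lines(file_path):
    # Same idea, wrapped in a context manager so the file is closed promptly.
    with open(file_path, 'rb') as f:
        return sum(1 for _ in f)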

Granddaughter answered 2/6, 2016 at 19:56 Comment(5)
I confirm that this is the fastest one (except the wc -l hack). Using text mode gives a small drop in performance, but it is insignificant in comparison with the other solutions. – Eventual
There is an unneeded extra pair of generator parentheses, btw. – Eventual
Without the unneeded extra generator parentheses, it appears to be slightly faster (per timeit) and consumes about 3MB less memory (per memit, for a file of 100,000 lines). – Granduncle
Doesn't seem to work if the file is a text file with newlines. My problem is large txt files that need character counting. – Rocaille
The file is not closed. – Calkins
6

mmap the file, and count up the newlines.

import mmap

def mapcount(filename):
    with open(filename, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)  # map the whole file into memory
        lines = 0
        readline = buf.readline  # bind the method once to avoid lookups in the loop
        while readline():
            lines += 1
        return lines
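
Note that "r+" needs write permission on the file; a read-only variation (my sketch, not the author's code) maps the file with ACCESS_READ instead:

import mmap

def mapcount_readonly(filename):
    # "rb" plus ACCESS_READ requires only read permission on the file.
    with open(filename, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lines = 0
        while buf.readline():
            lines += 1
        buf.close()
        return lines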
Korten answered 9/3, 2012 at 5:9 Comment(3)
Please consider adding a short example to demonstrate this, thanks! – Nickels
A short example might be a good idea, I agree. – Upcountry
Why is this faster when we'd expect the speed to be I/O bound? Does this load/read the file faster from disk? If so, why? – Presentationism
2

I'd extend gl's answer and run their code using the multiprocessing Python module for a faster count:

import multiprocessing

def blocks(f, cut, size=64 * 1024):  # 65536
    # Yield blocks from f, stopping at the end of this worker's byte range.
    start, chunk = cut
    read_size = size
    last_block = False
    while not last_block:
        # Shrink the final read so we do not run past start + chunk.
        if f.tell() + size > start + chunk:
            read_size = start + chunk - f.tell()
            last_block = True
        b = f.read(read_size)
        if not b:
            break
        yield b


def get_chunk_line_count(data):
    # data is (filename, chunk_id, (start, chunk_size)).
    fn, chunk_id, cut = data
    start, chunk = cut
    cnt = 0
    last_bl = None

    with open(fn, "r") as f:
        f.seek(start)
        for bl in blocks(f, cut):
            cnt += bl.count('\n')
            last_bl = bl

    # Adjust for a chunk whose last block does not end on a line break.
    if last_bl is not None and not last_bl.endswith('\n'):
        cnt -= 1

    return cnt

....
# pool_size, start_process and inputs come from the elided part of the script.
pool = multiprocessing.Pool(processes=pool_size,
                            initializer=start_process)
pool_outputs = pool.map(get_chunk_line_count, inputs)
pool.close()  # no more tasks
pool.join()

This improves counting performance roughly 20-fold. I wrapped it into a script and put it on GitHub.
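
The elided part of the script presumably builds pool_size, start_process, and the inputs list of (filename, chunk_id, (start, size)) tuples. Purely as an illustration of that plumbing, and not the author's code, a hypothetical make_inputs helper could split the file into byte ranges like this; whether the per-chunk counts add up exactly depends on how the chunk-boundary adjustment above interacts with where the cuts fall:

import multiprocessing
import os

def make_inputs(fn, pool_size):
    # Hypothetical helper: split the file into pool_size byte ranges of
    # roughly equal size; the last range absorbs any remainder.
    file_size = os.path.getsize(fn)
    chunk = file_size // pool_size
    inputs = []
    for chunk_id in range(pool_size):
        start = chunk_id * chunk
        size = chunk if chunk_id < pool_size - 1 else file_size - start
        inputs.append((fn, chunk_id, (start, size)))
    return inputs

if __name__ == "__main__":
    pool_size = multiprocessing.cpu_count()
    inputs = make_inputs("myfile.txt", pool_size)
    # The author's initializer=start_process is omitted in this sketch.
    pool = multiprocessing.Pool(processes=pool_size)
    print(sum(pool.map(get_chunk_line_count, inputs)))
    pool.close()
    pool.join()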

Koralle answered 6/12, 2016 at 21:21 Comment(1)
@Koralle Thank you for sharing the multiprocessing approach. Quick question as a newbie: how do we run this code to count the lines in a big file (say, 'myfile.txt')? I tried pool = multiprocessing.Pool(4); pool_outputs = pool.map(get_chunk_line_count, 'myfile.txt'), but that causes an error. Thanks in advance for your answer! – Murage
