Fastest way to find the number of lines in a text (C++)

9

28

I need to read the number of lines in a file before doing some operations on that file. I tried reading the file and incrementing a line_count variable at each iteration until I reached EOF, but it was not fast enough in my case. I used both ifstream and fgets; they were both slow. Is there a hacky way to do this, which is also used, for instance, by BSD, the Linux kernel or Berkeley DB (maybe by using bitwise operations)?

The file has millions of lines and it keeps getting larger; each line is about 40 or 50 characters. I'm using Linux.

Note:
I'm sure there will be people who will say "use a DB, idiot", but briefly, in my case I can't use a database.

Heroism answered 9/5, 2009 at 11:28 Comment(0)
20

The only way to find the line count is to read the whole file and count the number of line-end characters. The fastest way to do this is probably to read the whole file into a large buffer with one read operation and then go through the buffer counting the '\n' characters.

As your current file size appears to be about 60 MB, this is not an attractive option. You can get some of the speed by not reading the whole file, but reading it in chunks, say of size 1MB. You also say that a database is out of the question, but it really does look to be the best long-term solution.

Edit: I just ran a small benchmark on this and using the buffered approach (buffer size 1024K) seems to be a bit more than twice as fast as reading a line at a time with getline(). Here's the code - my tests were done with g++ using -O2 optimisation level:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <ctime>
using namespace std;

// Read as many bytes as fit in the buffer; return the number actually read.
unsigned int FileRead( istream & is, vector <char> & buff ) {
    is.read( &buff[0], buff.size() );
    return is.gcount();
}

// Count the '\n' characters in the first sz bytes of the buffer.
unsigned int CountLines( const vector <char> & buff, int sz ) {
    int newlines = 0;
    const char * p = &buff[0];
    for ( int i = 0; i < sz; i++ ) {
        if ( p[i] == '\n' ) {
            newlines++;
        }
    }
    return newlines;
}

int main( int argc, char * argv[] ) {
    time_t now = time(0);
    if ( argc == 1  ) {
        // No arguments: read a line at a time with getline().
        cout << "lines\n";
        ifstream ifs( "lines.dat" );
        int n = 0;
        string s;
        while( getline( ifs, s ) ) {
            n++;
        }
        cout << n << endl;
    }
    else {
        // Any argument: read the file in 1MB chunks and count '\n' bytes.
        cout << "buffer\n";
        const int SZ = 1024 * 1024;
        std::vector <char> buff( SZ );
        ifstream ifs( "lines.dat" );
        int n = 0;
        while( int cc = FileRead( ifs, buff ) ) {
            n += CountLines( buff, cc );
        }
        cout << n << endl;
    }
    cout << time(0) - now << endl;
}
Switchback answered 9/5, 2009 at 11:37 Comment(11)
I'd like to add that 1MB here should be 1,048,576 bytes, not 1,000,000 bytes. More precisely, it should be a multiple of the "block size" of the device which is storing the file. In my experience this is 512 bytes for hard disk drives and 2048 bytes for CD/DVD drives. I don't know if that is in some specification or not.Pereira
Also, you should not read one block at a time. There is a constant overhead for every read operation, so the more data you read at once, the faster your process will be. However, I've also found out that there is little difference in overall speed (for a HDD) if you read 400 blocks at once or if you read 800 blocks at once. So I guess that 1,048,576 bytes will be more than enough.Pereira
Last time I played this game, I tested a lot of buffer sizes to find the sweet spot where we got most bang per buck (both RAM and speed were at a premium). This was output to flash, so the result isn't applicable to this problem, but I found that you got 90% or so of the max possible speed with 25% or so of the max beneficial memory. So it's all down to how the two trade off. Of course 1MB is nothing on a PC.Keverne
Actually, the code posted is a bit flawed - the big buffer is zero initialised (because it is a vector). You can remove a startup hit by using new char[SZ] instead. My excuse is it is some code I had lying around (with a much smaller buffer) & I did a quick copy & paste.Switchback
Above a certain point, using a larger buffer will actually result in slower performance, since less of the RAM will fit in L2 cache. And for really large buffers, the OS may need to page to disk in order to provide you with the requested amount of RAM.Martine
Little bug: if the file does not end with the character \n, your second loop will report a line count one lower than your first.Martine
True - I wasn't offering this as an actual solution to the question, just as a rough benchmark for performance of two possible approaches.Switchback
Creating your own buffer is overkill. The fstream objects do their own buffering.Northumberland
The buffered version is faster than using the built-in stream buffering. See my comments to your answer.Switchback
You can replace the fstream's own buffers by calling the relevant methods on ifs.rdbuf(). You'd still have more call overhead scanning smaller chunks of data, but it should help with the overhead from blocking I/O.Keverne
You can replace the loop in CountLines with std::count(buff.begin(), buff.begin() + sz, '\n')Acrocarpous
11

Don't use C++ STL strings and getline (or C's fgets), just C-style raw pointers, and either block-read in page-size chunks or mmap the file.

Then scan the block at the native word size of your system (i.e. either uint32_t or uint64_t) using one of the magic 'SIMD Within A Register' (SWAR) algorithms for testing the bytes within the word. An example is here; the loop with the 0x0a0a0a0a0a0a0a0aLL in it scans for line breaks. (That code gets to around 5 cycles per input byte matching a regex on each line of a file.)
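For illustration, here is a minimal sketch of that kind of SWAR newline count. This is my own code, not the example linked above; it assumes 64-bit words and the GCC/Clang __builtin_popcountll builtin.

#include <cstdint>
#include <cstring>
#include <cstddef>

// Count '\n' bytes eight at a time: the XOR turns matching bytes into 0x00,
// the next two lines build an exact "this byte is zero" mask, and a popcount
// of that mask gives the number of newlines in the word.
std::size_t CountNewlinesSwar( const char * data, std::size_t len )
{
    const std::uint64_t NL   = 0x0A0A0A0A0A0A0A0AULL;
    const std::uint64_t LOW7 = 0x7F7F7F7F7F7F7F7FULL;
    const std::uint64_t HIGH = 0x8080808080808080ULL;

    std::size_t count = 0;
    std::size_t i = 0;
    for ( ; i + 8 <= len; i += 8 ) {
        std::uint64_t w;
        std::memcpy( &w, data + i, 8 );                       // safe unaligned load
        std::uint64_t x = w ^ NL;                             // '\n' bytes become 0x00
        std::uint64_t nonzero = ( ( x & LOW7 ) + LOW7 ) | x;  // high bit set iff byte != 0
        count += __builtin_popcountll( ~nonzero & HIGH );
    }
    for ( ; i < len; ++i )                                    // leftover tail bytes
        if ( data[i] == '\n' ) ++count;
    return count;
}

You can feed it the chunks (or the mmapped region) from whichever read mechanism you choose.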

If the file is only a few tens or a hundred or so megabytes, and it keeps growing (i.e. something keeps writing to it), then there's a good likelihood that Linux has it cached in memory, so it won't be disk-IO limited, but memory-bandwidth limited.

If the file is only ever being appended to, you could also remember the number of lines and previous length, and start from there.


It has been pointed out that you could use mmap with C++ STL algorithms, and create a functor to pass to std::for_each. I suggested that you shouldn't do it, not because you can't do it that way, but because there is no gain in writing the extra code to do so. Or you can use Boost's mmapped iterator, which handles it all for you; but for the problem the code I linked to was written for, this was much, much slower, and the question was about speed, not style.

Simpkins answered 9/5, 2009 at 11:38 Comment(8)
There is no reason you can't use mmap or reading blocks in C++.Switchback
You have to jump through a lot of hoops to convert between stl containers and strings and raw mmapped memory, which often involves copying and indirection, rather than just calling the functions and using the memory directly using the C subset of C++.Simpkins
There is of course no "C subset" of C++. Just because you don't use a std container or string does not make it "not C++" in some way.Switchback
And to add to what Neil said, std algorithms work perfectly fine on raw memory, using pointers as iterators. You could trivially write a functor performing the SIMD tricks specified, and run that over the file data with std::for_each. STL is more than just the containersConsidering
@Neil you know full well that there is a subset of C++ which is close to idiomatic C. I've posted this exact response before and been criticised for it being C rather than C++; you can't win.Simpkins
Thanks Pete, this is actually the technique that I tried to imply in the question. I'm going to try to implement it.Heroism
You can write a custom allocator and then you can use std::vector.Photomultiplier
@rightfold you can do a lot of things in C++. I didn't try that approach when I was measuring and optimising code for a similar problem. Have you any evidence that the overhead of your suggested technique is negligible in performance critical subsystems, and does it gain anything in maintainability?Simpkins
10

You wrote that it keeps getting larger.

This sounds like it is a log file or something similar where new lines are appended but existing lines are not changed. If this is the case you could try an incremental approach:

  • Parse to the end of file.
  • Remember the line count and the offset of EOF.
  • When the file grows, fseek to the offset, parse to EOF, and update the line count and the offset.
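
A minimal sketch of that incremental approach, assuming the file is only ever appended to. CountAppendedLines is a hypothetical helper name; the caller persists offset and lines between runs (in a small side file, say).

#include <fstream>
#include <string>
#include <vector>

// Scan from `offset` to EOF, add the newlines found to `lines`, and advance
// `offset` so the next call only reads freshly appended data.
void CountAppendedLines( const std::string & path,
                         long long & offset, long long & lines )
{
    std::ifstream ifs( path.c_str(), std::ios::binary );
    if ( !ifs )
        return;
    ifs.seekg( offset );
    std::vector<char> buff( 1024 * 1024 );
    for ( ;; ) {
        ifs.read( &buff[0], buff.size() );
        std::streamsize got = ifs.gcount();
        if ( got <= 0 )
            break;
        for ( std::streamsize i = 0; i < got; ++i )
            if ( buff[i] == '\n' )
                ++lines;
        offset += got;
    }
}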
Highminded answered 9/5, 2009 at 12:42 Comment(1)
Yes, that's true, it is like a log file (but not a log file: an aggregate of statistics from several log files from several developers), and this program is developed by another person and he maintains it. Yours is a feasible and local solution, but I just wondered about a globally optimal solution which can calculate it very fast without having a priori knowledge about the file. This is just out of curiosity :).Heroism
7

There's a difference between counting lines and counting line separators. Some common gotchas to watch out for if getting an exact line count is important:

  1. What's the file encoding? The byte-by-byte solutions will work for ASCII and UTF-8, but watch out if you have UTF-16 or some multibyte encoding that doesn't guarantee that a byte with the value of a line feed necessarily encodes a line feed.

  2. Many text files don't have a line separator at the end of the last line. So if your file says "Hello, World!", you could end up with a count of 0 instead of 1. Rather than just counting the line separators, you'll need a simple state machine to keep track.

  3. Some very obscure files use Unicode U+2028 LINE SEPARATOR (or even U+2029 PARAGRAPH SEPARATOR) as line separators instead of the more common carriage return and/or line feed. You might also want to watch out for U+0085 NEXT LINE (NEL).

  4. You'll have to consider whether you want to count some other control characters as line breakers. For example, should a U+000C FORM FEED or U+000B LINE TABULATION (a.k.a. vertical tab) be considered going to a new line?

  5. Text files from older versions of Mac OS (before OS X) use carriage returns (U+000D) rather than line feeds (U+000A) to separate lines. If you're reading the raw bytes into a buffer (e.g., with your stream in binary mode) and scanning them, you'll come up with a count of 0 on these files. You can't count both carriage returns and line feeds, because PC files generally end a line with both. Again, you'll need a simple state machine. (Alternatively, you can read the file in text mode rather than binary mode. The text interfaces will normalize line separators to '\n' for files that conform to the convention used on your platform. If you're reading files from other platforms, you'll be back to binary mode with a state machine.)

  6. If you ever have a super long line in the file, the getline() approach can throw an exception causing your simple line counter to fail on a small number of files. (This is particularly true if you're reading an old Mac file on a non-Mac platform, causing getline() to see the entire file as one gigantic line.) By reading chunks into a fixed-size buffer and using a state machine, you can make it bullet proof.

The code in the accepted answer suffers from most of these traps. Make it right before you make it fast.
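
As a rough sketch of such a state machine (my own code, assuming a single-byte encoding such as ASCII or UTF-8), this handles "\n", "\r" and "\r\n" endings, works across chunk boundaries, and counts an unterminated last line:

#include <cstddef>

// Minimal line-counting state machine for a buffer of single-byte text.
// Call Feed() for each chunk read in binary mode, then Lines().
class LineCounter {
public:
    LineCounter() : count_(0), prev_('\0'), pending_(false) {}

    void Feed( const char * p, std::size_t n ) {
        for ( std::size_t i = 0; i < n; ++i ) {
            char c = p[i];
            if ( c == '\n' ) {
                if ( prev_ != '\r' ) ++count_;   // lone LF ends a line
                pending_ = false;                // CRLF was already counted at the CR
            } else if ( c == '\r' ) {
                ++count_;                        // CR (or the CR of CRLF) ends a line
                pending_ = false;
            } else {
                pending_ = true;                 // partial line in progress
            }
            prev_ = c;
        }
    }

    std::size_t Lines() const {
        return count_ + ( pending_ ? 1 : 0 );    // count an unterminated last line
    }

private:
    std::size_t count_;
    char prev_;
    bool pending_;
};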

Alcoholize answered 9/5, 2009 at 15:18 Comment(6)
+1, excellent points. #3 scared me though -- first time I've heard of that! <shiver>Martine
@Martine Yeah, me too! :ORuralize
Worth noting in regards to pt 2: if a line does not end in a newline it is by definition not a line by POSIX standards, it is an incomplete line, and there is a very good reason for this. A line is zero or more bytes ending with a 0x0a. Text files are expected to end in a newline. Some people go to extreme lengths to be able to write, for example, code files that do not end in a newline. These are likely to bug or break various tools / chains. So by that, "Hello, World!" is zero lines and not one ;)Worldling
@user3342816: Posix is but a subset of the systems that deal with text files. In the world of the internet, where people share files from all sorts of systems, (and where network protocols typically use "\x0D\x0A" as line separators), one would be wise to consider that there are many other standards.Alcoholize
Sure. But I think you might misinterpret my intention with the comment. One cannot count lines without knowing the "format/origin" of the file. OP is on Linux, where text files typically consist of lines ending in 0x0a. If one works with Windows text files on Linux, one typically converts them first. As for network protocols, sure, they mostly use 0x0d0a, but that is not for the payload. They (most?) do not deal with lines, but bytes. I'm in no way saying it's a bad answer, but I thought it was worth a mention at least in the comments.Worldling
@Worldling It took ages for some programmers to figure out how to parse "\r", "\n", or "\r\n" as one end of line mark. Now most tools understand either one (finally). Frankly, it's not that hard and that way you can avoid the conversion. Now, with the Internet, the file could receive data from any source, including an old Mac, a MS-Windows system, and Unices, meaning that you'd get varying newlines in your file. I think Adrian tries to address that issue in his answer.Dugger
4

Remember that all fstreams are buffered, so they in effect already read in chunks and you do not have to recreate this functionality. All you need to do is scan the buffer. Don't use getline() though, as this will force you to size a string. So I would just use the STL std::count_if and stream iterators.

#include <iostream>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <functional>
#include <cstddef>


struct TestEOL
{
    TestEOL() : last('\n') {}   // so an empty file is counted as zero lines
    bool operator()(char c)
    {
        last    = c;
        return last == '\n';
    }
    char    last;
};

int main()
{
    std::fstream  file("Plop.txt");

    TestEOL       test;
    // std::count_if takes its predicate by value, so pass a reference
    // wrapper to make sure test.last reflects the last character read.
    std::size_t   count   = std::count_if(std::istreambuf_iterator<char>(file),
                                          std::istreambuf_iterator<char>(),
                                          std::ref(test));

    if (test.last != '\n')  // If the last character checked is not '\n'
    {                       // then the last line in the file has not been
        ++count;            // counted. So increment the count so we count
    }                       // the last line even if it is not '\n' terminated.

    std::cout << count << "\n";
}
Northumberland answered 9/5, 2009 at 19:16 Comment(4)
I ran your code against the two possible implementations I posted. Yours runs at the same speed as the getline() version (i.e. half the speed of the explicitly buffered one) but unfortunately also prints out the wrong number of lines - 1000001 versus 1000000 for the implementations I posted, on my million-line text file, which being in Windows format is terminated by CR/LF.Switchback
PS It seems to think that test.last contains the zero byte.Switchback
This is because the file is opened in text mode. This forces an EOL translation which requires processing. If the file is opened in binary mode there is no EOL translation and it will be faster. Whether you need EOL translation or not will depend on what you are doing.Northumberland
@Neil: Windows format is terminated by CR/LF: this sequence should be auto-translated by the stream into '\n' because it is the EOL marker. If you move files from one OS to another then all bets are off and you should transform your files before use; see dos2unix.Northumberland
3

It isn't slow because of your algorithm, it is slow because IO operations are slow. I suppose you are using a simple O(n) algorithm that simply goes over the file sequentially. In that case, there is no faster algorithm that can optimize your program.

However, while there is no faster algorithm, there is a faster mechanism called a "memory-mapped file". There are some drawbacks to mapped files and they might not be appropriate for your case, so you'll have to read about it and figure that out yourself.

Memory-mapped files won't let you implement an algorithm better than O(n), but they may reduce IO access time.
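
A minimal sketch of the memory-mapped variant on Linux, assuming the POSIX mmap API; "lines.dat" is just an example file name and error handling is kept to a minimum.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

// Map the whole file read-only and count '\n' bytes with memchr.
int main()
{
    int fd = open( "lines.dat", O_RDONLY );
    if ( fd < 0 ) { perror( "open" ); return 1; }

    struct stat st;
    if ( fstat( fd, &st ) != 0 ) { perror( "fstat" ); close( fd ); return 1; }
    if ( st.st_size == 0 ) { std::printf( "0\n" ); close( fd ); return 0; }

    void * mem = mmap( 0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    if ( mem == MAP_FAILED ) { perror( "mmap" ); close( fd ); return 1; }

    const char * p   = static_cast<const char *>( mem );
    const char * end = p + st.st_size;
    long lines = 0;
    while ( ( p = static_cast<const char *>( std::memchr( p, '\n', end - p ) ) ) ) {
        ++lines;
        ++p;   // continue the search after the newline just found
    }

    std::printf( "%ld\n", lines );
    munmap( mem, st.st_size );
    close( fd );
}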

Leralerch answered 9/5, 2009 at 11:37 Comment(1)
The question was about making the runtime faster, not how the algorithm scales. An O(n) algorithm which takes a couple of cycles per n is faster than one which takes a year for each n.Simpkins
3

You can only get a definitive answer by scanning the entire file looking for newline characters. There's no way around that.

However, there are a couple of possibilities which you may want to consider.

1/ If you're using a simplistic loop, reading one character at a time checking for newlines, don't. Even though the I/O may be buffered, function calls themselves are expensive, time-wise.

A better option is to read large chunks of the file (say 5M) into memory with a single I/O operation, then process that. You probably don't need to worry too much about special assembly instructions since the C runtime library will be optimized anyway - a simple strchr() should do it.

2/ If you're saying that the general line length is about 40-50 characters and you don't need an exact line count, just grab the file size and divide by 45 (or whatever average you deem to use).
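
For instance, a rough estimate along those lines (assuming C++17 std::filesystem and an average of 45 bytes per line, newline included):

#include <cstdint>
#include <filesystem>
#include <iostream>

int main()
{
    // Rough line-count estimate: file size divided by an assumed average
    // line length of 45 bytes (including the line terminator).
    std::uintmax_t size = std::filesystem::file_size( "lines.dat" );
    std::cout << "approx. " << size / 45 << " lines\n";
}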

3/ If this is something like a log file and you don't have to keep it in one file (may require rework on other parts of the system), consider splitting the file periodically.

For example, when it gets to 5M, move it (e.g., x.log) to a dated file name (e.g., x_20090101_1022.log), work out how many lines there are at that point (storing that in x_20090101_1022.count), then start a new x.log log file. The characteristics of log files mean that this dated section will never change, so you will never have to recalculate its number of lines.

To process the log "file", you'd just cat x_*.log through some process pipe rather than cat x.log. To get the line count of the "file", do a wc -l on the current x.log (relatively fast) and add it to the sum of all the values in the x_*.count files.

Millais answered 9/5, 2009 at 12:10 Comment(0)
1

The thing that takes time is loading 40+ MB into memory. The fastest way to do that is to either memory-map it, or load it in one go into a big buffer. Once you have it in memory, one way or another, a loop traversing the data looking for \n characters is almost instantaneous, no matter how it is implemented.

So really, the most important trick is to load the file into memory as fast as possible. And the fastest way to do that is to do it as a single operation.

Otherwise, plenty of tricks may exist to speed up the algorithm. If lines are only added, never modified or removed, and if you're reading the file repeatedly, you can cache the lines read previously, and the next time you have to read the file, only read the newly added lines.

Or perhaps you can maintain a separate index file showing the location of known '\n' characters, so those parts of the file can be skipped over.

Reading large amounts of data from the hard drive is slow. There's no way around that.

Considering answered 9/5, 2009 at 12:16 Comment(1)
Above a certain point, using a larger buffer will actually result in slower performance, since less of the RAM will fit in L2 cache. And for really large buffers, the OS may need to page to disk in order to provide you with the requested amount of RAMMartine
0

If your file only grows, then Ludwig Weinzierl's is the best solution if you do not have control of the writers. Otherwise, you can make it even faster: increment the counter by one each time a line is written to the file. If multiple writers may try to write to the file simultaneously, then make sure to use a lock; locking your existing file is enough. The counter can be 4 or 8 bytes written in binary in a file under /run/<your-prog-name>/counter (which is in RAM, so dead fast).

Ludwig Algorithm

  • Initialize offset to 0
  • Read file from offset to EOF counting '\n' (as mentioned by others, make sure to use buffered I/O and count the '\n' inside that buffer)
  • Update offset with position at EOF
  • Save counter & offset to a file or in a variable if you only need it in your software
  • Repeat from "Read file ..." on a change

This is actually how various pieces of software that process log files work (fail2ban comes to mind).

The first time, it has to process a huge file. Afterward, only the newly appended part has to be processed, which is very small and thus goes very fast.

Proactive Algorithm

When creating the files, reset counter to 0.

Then each time you receive a new line to add to the file:

  • Lock file
  • Write one line
  • Load counter
  • Add one to counter
  • Save counter
  • Unlock file
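
A minimal sketch of that lock/write/count step on Linux, assuming the POSIX flock() advisory lock and a small binary counter file; the file names ("x.log", "/run/myprog/counter") are illustrative, not part of any existing API.

#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <string>

// Append one line to the log and bump a binary counter file while holding
// an advisory lock on the log, so concurrent writers stay consistent.
bool AppendLineAndCount( const std::string & line )
{
    int log = open( "x.log", O_WRONLY | O_APPEND | O_CREAT, 0644 );
    if ( log < 0 ) return false;
    if ( flock( log, LOCK_EX ) != 0 ) { close( log ); return false; }

    std::string out = line + "\n";
    bool ok = write( log, out.data(), out.size() ) == (ssize_t) out.size();

    if ( ok ) {
        int cnt = open( "/run/myprog/counter", O_RDWR | O_CREAT, 0644 );
        if ( cnt >= 0 ) {
            std::uint64_t n = 0;
            if ( read( cnt, &n, sizeof n ) != (ssize_t) sizeof n )
                n = 0;                          // new or truncated counter file
            ++n;
            lseek( cnt, 0, SEEK_SET );
            ok = write( cnt, &n, sizeof n ) == (ssize_t) sizeof n && ok;
            close( cnt );
        }
    }

    flock( log, LOCK_UN );
    close( log );
    return ok;
}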

This is very close to what database systems do, so a SELECT COUNT(*) FROM table on a table with millions of rows returns instantly. Databases also do that per index, so if you add a WHERE clause which matches a specific index, you also get the total instantly. Same principle as above.


Personal note: I see a huge amount of Internet software which is backward. A watchdog makes sense for various things in a software environment. However, in most cases, when something of importance happens, you should send a message at the time it happens, not use the backward concept of checking logs to detect that something bad just happened.

For example, you detect that a user tried to access a website and entered the wrong password 5 times in a row. You want to send an instant message to the admin to make sure there wasn't a 6th time which was successful and the hacker can now see all your users' data... If you use logs, the "instant message" is going to be late by seconds if not minutes.

Don't do processing backward.

Dugger answered 9/7, 2022 at 21:40 Comment(0)
