Is StreamReader.ReadLine() really the fastest method to count lines in a file?
Looking around, I found quite a few discussions on how to determine the number of lines in a file.

For example these three:
c# how do I count lines in a textfile
Determine the number of lines within a text file
How to count lines fast?

So, I went ahead and ended up using what seems to be the most efficient (at least memory-wise?) method that I could find:

private static int countFileLines(string filePath)
{
    using (StreamReader r = new StreamReader(filePath))
    {
        int i = 0;
        while (r.ReadLine() != null)
        {
            i++;
        }
        return i;
    }
}

But this takes forever when the lines in the file are very long. Is there really no faster solution to this?

I've been trying to use StreamReader.Read() or StreamReader.Peek() but I can't (or don't know how to) make either of them move on to the next line as soon as there's 'stuff' (chars? text?).

Any ideas please?


CONCLUSION/RESULTS (After running some tests based on the answers provided):

I tested the 5 methods below on two different files and I got consistent results that seem to indicate that plain old StreamReader.ReadLine() is still one of the fastest ways... To be honest, I'm perplexed after all the comments and discussion in the answers.

File #1:
Size: 3,631 KB
Lines: 56,870

Results in seconds for File #1:
0.02 --> ReadLine method.
0.04 --> Read method.
0.29 --> ReadByte method.
0.25 --> ReadLines.Count method.
0.04 --> ReadWithBufferSize method.

File #2:
Size: 14,499 KB
Lines: 213,424

Results in seconds for File #2:
0.08 --> ReadLine method.
0.19 --> Read method.
1.15 --> ReadByte method.
1.02 --> ReadLines.Count method.
0.08 --> ReadWithBufferSize method.

Here are the 5 methods I tested based on all the feedback I received:

private static int countWithReadLine(string filePath)
{
    using (StreamReader r = new StreamReader(filePath))
    {
        int i = 0;
        while (r.ReadLine() != null)
        {
            i++;
        }
        return i;
    }
}

private static int countWithRead(string filePath)
{
    using (StreamReader _reader = new StreamReader(filePath))
    {
        int c = 0, count = 0;
        while ((c = _reader.Read()) != -1)
        {
            if (c == 10)
            {
                count++;
            }
        }
        return count;
    }
}

private static int countWithReadByte(string filePath)
{
    using (Stream s = new FileStream(filePath, FileMode.Open))
    {
        int i = 0;
        int b;

        b = s.ReadByte();
        while (b >= 0)
        {
            if (b == 10)
            {
                i++;
            }
            b = s.ReadByte();
        }
        return i;
    }
}

private static int countWithReadLinesCount(string filePath)
{
    return File.ReadLines(filePath).Count();
}

private static int countWithReadAndBufferSize(string filePath)
{
    int bufferSize = 512;

    using (Stream s = new FileStream(filePath, FileMode.Open))
    {
        int i = 0;
        byte[] b = new byte[bufferSize];
        int n = 0;

        n = s.Read(b, 0, bufferSize);
        while (n > 0)
        {
            i += countByteLines(b, n);
            n = s.Read(b, 0, bufferSize);
        }
        return i;
    }
}

private static int countByteLines(byte[] b, int n)
{
    int i = 0;
    for (int j = 0; j < n; j++)
    {
        if (b[j] == 10)
        {
            i++;
        }
    }

    return i;
}
Stockman answered 9/1, 2013 at 17:45 Comment(7)
how would read() or peek() know where the next line is in the stream?Perceptible
@Perceptible By looking for \n and \r characters.Selfsame
Does each line have exactly the same number of bytes, or almost exactly? If they're exact you could just count the file size, and if they're close you could come up with a close approximation based on the average line length.Selfsame
I was referring to the comment regarding having the stream jump forwardPerceptible
@John: Thanks, John. Your answer(?) helps me realize I'm looking in the wrong spot, even if you meant it to be sarcastic.Stockman
@Servy: No, the lengths vary in size and are almost never identical...Stockman
This only works correctly for single-byte encodings and UTF-8! For UTF-16/UTF-32 you will get false line breaks, because the byte value 10 can also appear inside a multi-byte character.Rye

No, it is not. The point is that it materializes the strings, which is not needed.

To COUNT lines you are much better off ignoring the "string" part and going with the "line" part.

A LINE is a series of bytes ending with \r\n (13, 10 - CR LF) or another marker.

Just run along the bytes, in a buffered stream, counting the number of appearances of your end-of-line marker.
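A minimal sketch of the approach described above: count LF bytes through a BufferedStream, never materializing strings. The method name, buffer size, and test file are illustrative choices, not code from this answer.

```csharp
using System;
using System.IO;

class LineCounter
{
    // Count '\n' (byte 10) by running along the raw bytes of a buffered
    // stream. The 128 KB buffer size is an illustrative choice.
    static int CountLinesBuffered(string filePath)
    {
        using (var s = new BufferedStream(
            new FileStream(filePath, FileMode.Open, FileAccess.Read), 128 * 1024))
        {
            int count = 0, b;
            while ((b = s.ReadByte()) >= 0)
                if (b == 10) count++;
            return count;
        }
    }

    static void Main()
    {
        File.WriteAllText("sample.txt", "a\nb\nc\n");
        Console.WriteLine(CountLinesBuffered("sample.txt")); // prints 3
    }
}
```

Note that, as the comments point out, counting byte 10 is only correct for single-byte encodings and UTF-8.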

Etherify answered 9/1, 2013 at 17:51 Comment(8)
Could you elaborate a bit more with some sample code? Thank you!Stockman
If it is not school or other trainee homework, you are wrong in programming. This is the level of an "introduction to programming" test.Etherify
I guess I'm at that level in this regard, and I don't take offense. I'm not looking for 'free lunch' here, I just don't like it when people just judge you without knowing the full story. I do appreciate your 'answer' and will absolutely use it to my advantage while I keep trying to figure this out... with or without a code example. Thanks.Stockman
Looking at Brian's code example and re-reading your answer makes me realize what's going on. I also suspect that I may have sounded "non-appreciative" when I said "Could you elaborate a bit more...?" Since your answer actually explains things I will likely end up selecting it as the answer for this question. Thank you, TomTom.Stockman
TomTom, I ran some tests and apparently my original method, Readline() seems to be the fastest. I added my notes into my original question, in case you had some more feedback on this. Thanks!Stockman
Try not reading byte by byte. Allocate a 128 KB buffer, read into it, then run along it, for performance possibly with unsafe pointer code. This is relatively trivial and should provide a significant boost.Etherify
Allocating buffer size helps, although still close to ReadLine(). I updated my entry again to add the code I utilized there. Thanks!Stockman
This only works correctly for single-byte encodings and UTF-8! For UTF-16/UTF-32 you will get false line breaks, because the byte value 10 can also appear inside a multi-byte character.Rye

The best way to know how to do this fast is to think about the fastest way to do it without using C/C++.

In assembly there is a CPU-level instruction that scans memory for a character, so in assembly you would do the following:

  • Read big part (or all) of the file into memory
  • Execute the SCASB command
  • Repeat as needed

So, in C# you want the compiler to get as close to that as possible.
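As a hedged illustration of getting close to that in C#: on modern .NET, `ReadOnlySpan<byte>.IndexOf` is vectorized, so a chunked read plus a span scan approximates the SCASB idea without unsafe code. The method name and chunk size below are my own, not from this answer.

```csharp
using System;
using System.IO;

class ScanCounter
{
    // Read a big chunk of the file, then let Span<byte>.IndexOf do the scan.
    // On modern .NET this search is vectorized, which is about as close to a
    // SCASB-style memory scan as safe C# gets. The 1 MB chunk is illustrative.
    static int CountLines(string filePath)
    {
        using (var fs = File.OpenRead(filePath))
        {
            var buffer = new byte[1024 * 1024];
            int count = 0, read;
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                var span = new ReadOnlySpan<byte>(buffer, 0, read);
                int idx;
                while ((idx = span.IndexOf((byte)'\n')) >= 0)
                {
                    count++;
                    span = span.Slice(idx + 1);
                }
            }
            return count;
        }
    }

    static void Main()
    {
        File.WriteAllText("scan_sample.txt", "one\ntwo\nthree\n");
        Console.WriteLine(CountLines("scan_sample.txt")); // prints 3
    }
}
```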

Partizan answered 9/1, 2013 at 17:57 Comment(0)
public static int CountLines(Stream stm)
{
    StreamReader _reader = new StreamReader(stm);
    int c = 0, count = 0;
    while ((c = _reader.Read()) != -1)
    {
        if (c == '\n')
        {
            count++;
        }
    }
    return count;
}
Avivah answered 9/1, 2013 at 17:52 Comment(8)
This is likely to have horrible performance, calling a method for each character in the file.Bankruptcy
@500-InternalServerError How would you be able to do any better? I don't see any possible way around it; the best you could do would just be hiding the fact that some API is doing the same thing. It would have a smaller memory footprint than the OP with similar performance, given that each line is rather large.Selfsame
@Selfsame - I would at least use a buffer and read into that a chunk at a time. Another approach would be to use a file mapping but I have yet to try that out from .NET.Bankruptcy
@500-InternalServerError The OS and/or the hard drive will be buffering it internally, you're unlikely to see any benefits out of buffering it again.Selfsame
@Selfsame - I respectfully disagree but grant that the difference may not matter to the OP.Bankruptcy
@500-InternalServerError And respectfully, you know little about programming then. Open a file stream - parameter for buffer size. Or put a buffered stream around it.Etherify
@Etherify try it for yourself if you don't believe me: a version using an explicit buffer and a for loop is going to be at least twice as fast for larger files due to the sheer call overhead of reader.read.Bankruptcy
@TomTom: I see that this code example is basically the answer you posted. But your response explains what's going on. Thanks.Stockman

Yes, reading lines like that is the fastest and easiest way in any practical sense.

There are no shortcuts here. Files are not line based, so you have to read every single byte from the file to determine how many lines there are.

As TomTom pointed out, creating the strings is not strictly needed to count the lines, but a vast majority of the time spent will be waiting for the data to be read from the disk. Writing a much more complicated algorithm would perhaps shave off a percent of the execution time, and it would dramatically increase the time for writing and testing the code.

Unifoliolate answered 9/1, 2013 at 18:18 Comment(3)
Note that the change wouldn't be so much in speed, but in the memory footprint. If the lines are large it's the difference between storing each line in memory vs only storing one character at a time in memory (although with buffering, that won't quite be the case, but it means the memory footprint will be almost exactly the size of the buffer, no more).Selfsame
@Servy: Yes, but that has very little impact on the speed.Unifoliolate
Yep, which is why I opened saying it wouldn't impact the speed.Selfsame

I tried multiple methods and tested their performance:

The one that reads a single byte at a time is about 50% slower than the other methods. The others all take around the same amount of time. You could try creating threads and doing this asynchronously, so that while you are waiting for a read you can start processing a previous one, but that sounds like a headache to me.

I would go with the one liner: File.ReadLines(filePath).Count(); it performs as well as the other methods I tested.

        private static int countFileLines(string filePath)
        {
            using (StreamReader r = new StreamReader(filePath))
            {
                int i = 0;
                while (r.ReadLine() != null)
                {
                    i++;
                }
                return i;
            }
        }

        private static int countFileLines2(string filePath)
        {
            using (Stream s = new FileStream(filePath, FileMode.Open))
            {
                int i = 0;
                int b;

                b = s.ReadByte();
                while (b >= 0)
                {
                    if (b == 10)
                    {
                        i++;
                    }
                    b = s.ReadByte();
                }
                return i + 1;
            }
        }

        private static int countFileLines3(string filePath)
        {
            // bufferSize was not declared in the original; any reasonable size works
            const int bufferSize = 64 * 1024;

            using (Stream s = new FileStream(filePath, FileMode.Open))
            {
                int i = 0;
                byte[] b = new byte[bufferSize];
                int n = 0;

                n = s.Read(b, 0, bufferSize);
                while (n > 0)
                {
                    i += countByteLines(b, n);
                    n = s.Read(b, 0, bufferSize);
                }
                return i + 1;
            }
        }

        private static int countByteLines(byte[] b, int n)
        {
            int i = 0;
            for (int j = 0; j < n; j++)
            {
                if (b[j] == 10)
                {
                    i++;
                }
            }

            return i;
        }

        private static int countFileLines4(string filePath)
        {
            return File.ReadLines(filePath).Count();
        }
Renz answered 9/1, 2013 at 21:20 Comment(4)
Thank you very much for taking the time to test this out, Nick! I will run some tests with the examples you gave and definitely use whichever seems more efficient/fast. Thanks, again!Stockman
Nick, I added some conclusion comments to my original question based on some tests I ran, using most of your methods proposed, in case you wanted to take a look and had more feedback. Thanks, again!Stockman
Yeah, the file I used was much larger, 1 GB.Renz
This only works correctly for single-byte encodings and UTF-8! For UTF-16/UTF-32 you will get false line breaks, because the byte value 10 can also appear inside a multi-byte character.Rye

There are numerous ways to read a file. Usually, the fastest way is the simplest:

using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        // do what you gotta do here
    }
}

This page does a great performance comparison between several different techniques including using BufferedReaders, reading into StringBuilder objects, and into an entire array.

Proboscis answered 6/9, 2014 at 2:16 Comment(1)
K.I.S.S. principle at its finest.Blancablanch

StreamReader is not the fastest way to read files in general because of the small overhead of decoding the bytes to characters, so reading the file into a byte array is faster.
The results I get vary a bit each run due to caching and other processes, but here is one of the results I got (in milliseconds) with a 16 MB file:

75 ReadLines 
82 ReadLine 
22 ReadAllBytes 
23 Read 32K 
21 Read 64K 
27 Read 128K 

In general File.ReadLines should be a little bit slower than a StreamReader.ReadLine loop. File.ReadAllBytes is slower with bigger files and will throw out of memory exception with huge files. The default buffer size for FileStream is 4K, but on my machine 64K seemed the fastest.

    private static int countWithReadLines(string filePath)
    {
        int count = 0;
        var lines = File.ReadLines(filePath);

        foreach (var line in lines) count++;
        return count;
    }

    private static int countWithReadLine(string filePath)
    {
        int count = 0;
        using (var sr = new StreamReader(filePath))      
            while (sr.ReadLine() != null)
                count++;
        return count;
    }

    private static int countWithFileStream(string filePath, int bufferSize = 1024 * 4)
    {
        using (var fs = new FileStream(filePath, FileMode.Open, FileAccess.Read))
        {
            int count = 0;
            byte[] array = new byte[bufferSize];

            while (true)
            {
                int length = fs.Read(array, 0, bufferSize);

                for (int i = 0; i < length; i++)
                    if(array[i] == 10)
                        count++;

                if (length < bufferSize) return count;
            }
        } // end of using
    }

and tested with:

var path = "1234567890.txt"; Stopwatch sw; string s = "";
File.WriteAllLines(path, Enumerable.Repeat("1234567890abcd", 1024 * 1024 )); // 16MB (16 bytes per line)

sw = Stopwatch.StartNew(); countWithReadLines(path)   ; sw.Stop(); s += sw.ElapsedMilliseconds + " ReadLines \n";
sw = Stopwatch.StartNew(); countWithReadLine(path)    ; sw.Stop(); s += sw.ElapsedMilliseconds + " ReadLine \n";
sw = Stopwatch.StartNew(); countWithReadAllBytes(path); sw.Stop(); s += sw.ElapsedMilliseconds + " ReadAllBytes \n";

sw = Stopwatch.StartNew(); countWithFileStream(path, 1024 * 32); sw.Stop(); s += sw.ElapsedMilliseconds + " Read 32K \n";
sw = Stopwatch.StartNew(); countWithFileStream(path, 1024 * 64); sw.Stop(); s += sw.ElapsedMilliseconds + " Read 64K \n";
sw = Stopwatch.StartNew(); countWithFileStream(path, 1024 *128); sw.Stop(); s += sw.ElapsedMilliseconds + " Read 128K \n";

MessageBox.Show(s);
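The harness above calls countWithReadAllBytes, whose body is not shown. A version consistent with the other methods might look like this (an assumption on my part, not the author's original code):

```csharp
// Assumed implementation (not shown in the answer): load the whole file
// into memory with File.ReadAllBytes and count LF (10) bytes. As noted
// above, this is slower for big files and can throw OutOfMemoryException
// for huge ones.
private static int countWithReadAllBytes(string filePath)
{
    byte[] bytes = File.ReadAllBytes(filePath);
    int count = 0;

    for (int i = 0; i < bytes.Length; i++)
        if (bytes[i] == 10)
            count++;
    return count;
}
```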
Frederik answered 29/7, 2016 at 22:14 Comment(1)
This only works correctly for single-byte encodings and UTF-8! For UTF-16/UTF-32 you will get false line breaks, because the byte value 10 can also appear inside a multi-byte character.Rye
