Reading large text files with streams in C#

I've got the lovely task of working out how to handle large files being loaded into our application's script editor (it's like VBA for our internal product, for quick macros). Most files are about 300-400 KB, which load fine. But when they go beyond 100 MB the process has a hard time (as you'd expect).

What happens is that the file is read and shoved into a RichTextBox which is then navigated - don't worry too much about this part.

The developer who wrote the initial code is simply using a StreamReader and doing

[Reader].ReadToEnd()

which could take quite a while to complete.

My task is to break this bit of code up, read it in chunks into a buffer and show a progressbar with an option to cancel it.

Some assumptions:

  • Most files will be 30-40 MB
  • The contents of the file are text (not binary); some are Unix format, some are DOS.
  • Once the contents are retrieved, we work out which line terminator is used.
  • No one's concerned about the time it takes to render in the RichTextBox once it's loaded. It's just the initial load of the text.

Now for the questions:

  • Can I simply use StreamReader, check the Length property (for ProgressMax), and issue a Read for a set buffer size, iterating in a while loop inside a BackgroundWorker so it doesn't block the main UI thread, then return the StringBuilder to the main thread once it's completed?
  • The contents will be going into a StringBuilder. Can I initialise the StringBuilder with the size of the stream, if the length is available?

Are these (in your professional opinions) good ideas? I've had a few issues in the past with reading content from Streams, where it always misses the last few bytes or something, but I'll ask another question if that turns out to be the case.
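For reference, here's a rough sketch of the approach I'm picturing (a minimal sketch only; the buffer size, event wiring and names are illustrative, and it assumes WorkerReportsProgress and WorkerSupportsCancellation are enabled on the BackgroundWorker):

private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
    var worker = (BackgroundWorker)sender;
    var path = (string)e.Argument;
    var buffer = new char[8192];

    using (var reader = new StreamReader(path))
    {
        long length = reader.BaseStream.Length;   // ProgressMax (bytes, so only approximate)
        var sb = new StringBuilder((int)length);  // pre-size from the stream length (fine for < 2 GB files)
        long totalRead = 0;
        int count;

        while ((count = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            if (worker.CancellationPending) { e.Cancel = true; return; }

            sb.Append(buffer, 0, count);
            totalRead += count;                   // chars read, compared against a byte length above
            worker.ReportProgress((int)(100 * totalRead / length));
        }

        e.Result = sb;                            // picked up on the UI thread in RunWorkerCompleted
    }
}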

Ean answered 29/1, 2010 at 12:36 Comment(4)
30-40MB script files? Holy mackerel! I'd hate to have to code review that...Sporadic
I know this question is rather old, but I found it the other day and have tested the recommendation for MemoryMappedFile, and it is hands down the fastest method. For comparison: reading a 7,616,939-line, 345 MB file via a ReadLine loop takes 12+ hours on my machine, while performing the same load and read via MemoryMappedFile took 3 seconds.Expressivity
It's just a few lines of code. See this library I am using to read 25 GB and larger files as well. github.com/Agenty/FileReaderSavoie
@VikashRathee That library uses foreach (string line in File.ReadLines(path).Skip(skip)). That's horrible.Drye
207

You can improve read speed by using a BufferedStream, like this:

using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (BufferedStream bs = new BufferedStream(fs))
using (StreamReader sr = new StreamReader(bs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // process the line here (e.g. append it to a StringBuilder)
    }
}

March 2013 UPDATE

I recently wrote code for reading and processing (searching for text in) 1 GB-ish text files (much larger than the files involved here) and achieved a significant performance gain by using a producer/consumer pattern. The producer task read in lines of text using the BufferedStream and handed them off to a separate consumer task that did the searching.

I used this as an opportunity to learn TPL Dataflow, which is very well suited for quickly coding this pattern.
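A minimal sketch of that producer/consumer shape, assuming the System.Threading.Tasks.Dataflow NuGet package and an enclosing async method; the search term, bounded capacity and file path are illustrative, not the code I actually shipped:

var consumer = new ActionBlock<string>(line =>
{
    if (line.Contains("needle"))               // illustrative search
        Console.WriteLine(line);
},
new ExecutionDataflowBlockOptions { BoundedCapacity = 10000 });

using (var fs = File.Open(@"C:\Temp\big.txt", FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var bs = new BufferedStream(fs))
using (var sr = new StreamReader(bs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
        await consumer.SendAsync(line);        // BoundedCapacity gives back-pressure
}

consumer.Complete();
await consumer.Completion;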

Why BufferedStream is faster

A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance. A buffer can be used for either reading or writing, but never both simultaneously. The Read and Write methods of BufferedStream automatically maintain the buffer.

December 2014 UPDATE: Your Mileage May Vary

Based on the comments, FileStream should be using a BufferedStream internally. At the time this answer was first provided, I measured a significant performance boost by adding a BufferedStream. At the time I was targeting .NET 3.x on a 32-bit platform. Today, targeting .NET 4.5 on a 64-bit platform, I do not see any improvement.
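If you want to experiment rather than wrap a BufferedStream, one knob worth trying is FileStream's own internal buffer size, set via its constructor; a sketch (the 64 KB figure is arbitrary, not a recommendation):

using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, bufferSize: 64 * 1024))
using (var sr = new StreamReader(fs))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // process the line here
    }
}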

Related

I came across a case where streaming a large, generated CSV file to the Response stream from an ASP.NET MVC action was very slow. Adding a BufferedStream improved performance by 100x in this instance. For more, see Unbuffered Output Very Slow.
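Roughly what that looked like, as a sketch (classic ASP.NET MVC assumed; WriteCsvRows is a hypothetical helper that writes one row at a time):

public void ExportCsv()
{
    Response.ContentType = "text/csv";

    using (var bs = new BufferedStream(Response.OutputStream, 64 * 1024))
    using (var writer = new StreamWriter(bs))
    {
        WriteCsvRows(writer);   // hypothetical: streams rows instead of building one huge string
    }
}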

Interlaken answered 10/3, 2012 at 1:22 Comment(26)
Dude, BufferedStream makes all the difference. +1 :)Stagemanage
Much, much faster than streamReader.ReadLine alone... thanks a lot Eric. Can you also explain why it is so much faster, or point me to a resource where I can read about it? Thanks in advance.Parous
There is a cost to requesting data from an IO subsystem. In the case of rotating disks, you might have to wait for the platter to spin into position to read the next chunk of data, or worse, wait for the disk head to move. While SSD's don't have mechanical parts to slow things down, there is still a per-IO-operation cost to access them. Buffered streams read more than just what the StreamReader requests, reducing the number of calls to the OS and ultimately the number of separate IO requests.Interlaken
Is it as fast as calling windows native ReadFile & WriteFile functions? https://mcmap.net/q/189826/-how-to-write-super-fast-file-streaming-code-in-cCalyces
Performance is probably very similar. Unfortunately the author of that blog post did not use a BufferedStream in his tests.Interlaken
Really? This makes no difference in my test scenario. According to Brad Abrams there is no benefit to using BufferedStream over a FileStream.Highflier
@NickCox: Your results may vary based on your underlying IO subsystem. On a rotating disk and a disk controller that does not have the data in its cache (and also data not cached by Windows), the speedup is huge. Brad's column was written in 2004. I measured actual, drastic improvements recently.Interlaken
Thanks, Eric. Any chance you could post your test scripts somewhere?Highflier
@NickCox: I did not have test scripts, I had a poor performing IO issue in an application that read text files in the low GB range that was solved using a BufferedStream.Interlaken
Thanks for the reply Eric. If anyone does come up with a good benchmark please post back here!Highflier
Do I need to close the file I have previously opened by saying: File.Open()?Burin
@Sebastian: No, not if you wrapped the call to File.Open() with the using keyword as in my example. When execution exits the using block's scope, it will call Dispose() (via IDisposable) on the FileStream instance, which calls Close() for you.Interlaken
This is useless according to: #492783 FileStream already uses a buffer internally.Gesner
That issue was previously discussed in the chat above.Interlaken
@Marcus: Could you share the specific case where this made a large difference for you? In some cases this makes a huge difference, and in others none at all. I'm trying to update my answer to differentiate those cases. (I also suspect the version of .NET may play a role as well).Interlaken
According to this article the BufferedStream is not always faster and even if it is the difference is negligible. Reading line by line is always faster than reading the whole file.Malcah
@TimSchmelter: It seems that may depend on the version of the .NET platform in use. I measured very substantial gains at the time I posted my answer. However, I cannot reproduce them targeting .NET 4.5 (64 bit). I know that the .NET team says they use a buffer internally. Perhaps there was some bug or omission in some .NET releases. Clearly this helps some people, including me in the past, while others measure no difference.Interlaken
@NickCox the Brad Abrams link is dated 15 Apr 2004 so may be a bit out of date?Hazelwood
@Redeemed1: Maybe. I guess you'll just have to suck it and see. Like I said, it didn't make any difference in my scenario.Highflier
Hi Eric, I am reading a .sql file which is 100 MB. Both fileInfo.OpenText().ReadToEnd() and File.ReadAllText crash Visual Studio because the file is large. Any solution?Jephum
@stom: Both of those methods attempt to read the entire file into memory before you do anything to process it. Generally speaking .NET should be able to read 100MB into memory at once, but if your program has already allocated a lot of memory you could be up against the memory limit for a 32-bit process or it might not be able to find contiguous 100MB to hold the result (you are far less likely to see this problem if targeting 64 bit). Suggest you create a new question including a minimal code example that reproduces the issue.Interlaken
Thank you @Eric for responding, I posted the question hereJephum
It's not working with a file around 215 MB. Any solution? I am using StringBuilder to append the result.Toolmaker
If you use StringBuilder to append the result, you still need RAM for the entire file, all at once. If it fails at that point, the .NET runtime is unable to allocate 215 MB of contiguous memory for your StringBuilder. If you really must have all of the data in RAM at once, try checking how large the file is in bytes before you start reading data, and then using the overload of StringBuilder that allows you to specify an initial size for the buffer.Interlaken
Man, this totally changed things for my MVC application. Thanks!Stilla
An example where BufferedStreams make a significant difference (in ML.Net) github.com/dotnet/machinelearning/pull/5924Interlaken
33

If you read the performance and benchmark stats on this website, you'll see that the fastest way to read (because reading, writing, and processing are all different) a text file is the following snippet of code:

using (StreamReader sr = File.OpenText(fileName))
{
    string s = String.Empty;
    while ((s = sr.ReadLine()) != null)
    {
        //do your stuff here
    }
}

In all, about nine different methods were benchmarked, but that one seemed to come out ahead the majority of the time, even outperforming the buffered reader that other answers have mentioned.

Colfin answered 19/9, 2014 at 14:21 Comment(3)
This worked well for stripping apart a 19GB postgres file to translate it into sql syntax in multiple files. Thanks postgres guy who never executed my parameters correctly. /sighOvermantel
The performance difference here seems to pay off for really big files, like bigger than 150MB (also you really should use a StringBuilder for loading them into memory, loads faster as it doesn't make a new string every time you add chars)Cline
The benchmark on the website is so flawed it's useless. It does not even permute the order of the different techniques to test.Drye
16

You say you have been asked to show a progress bar while a large file is loading. Is that because the users genuinely want to see the exact % of file loading, or just because they want visual feedback that something is happening?

If the latter is true, then the solution becomes much simpler. Just do reader.ReadToEnd() on a background thread, and display a marquee-type progress bar instead of a proper one.

I raise this point because in my experience this is often the case. When you are writing a data processing program, then users will definitely be interested in a % complete figure, but for simple-but-slow UI updates, they are more likely to just want to know that the computer hasn't crashed. :-)
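In today's terms that could look something like this (a minimal WinForms sketch using Task.Run/await; progressBar1, richTextBox1 and the path handling are illustrative):

private async void LoadFile(string path)
{
    progressBar1.Style = ProgressBarStyle.Marquee;   // "something is happening" feedback
    progressBar1.Visible = true;

    string text = await Task.Run(() =>
    {
        using (var reader = new StreamReader(path))
            return reader.ReadToEnd();               // heavy work off the UI thread
    });

    richTextBox1.Text = text;                        // back on the UI thread here
    progressBar1.Visible = false;
}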

Wideman answered 29/1, 2010 at 13:3 Comment(2)
But can the user cancel out of the ReadToEnd call?Otto
@Tim, well spotted. In that case, we're back to the StreamReader loop. However, it will still be simpler because there's no need to read ahead to calculate the progress indicator.Wideman
9

Use a background worker and read only a limited number of lines. Read more only when the user scrolls.
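A minimal sketch of that idea (the page size, field names and the scroll trigger are illustrative):

private StreamReader _reader;            // opened when the document is first loaded
private const int LinesPerPage = 200;    // arbitrary page size

private void LoadNextPage()              // call this when the user scrolls near the end
{
    var sb = new StringBuilder();
    string line;
    for (int i = 0; i < LinesPerPage && (line = _reader.ReadLine()) != null; i++)
        sb.AppendLine(line);

    richTextBox1.AppendText(sb.ToString());
}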

And try to never use ReadToEnd(). It's one of those functions that makes you think "why did they make it?"; it's a script kiddies' helper that works fine for small things, but as you see, it sucks for large files...

Those guys telling you to use StringBuilder need to read the MSDN more often:

Performance Considerations
The Concat and AppendFormat methods both concatenate new data to an existing String or StringBuilder object. A String object concatenation operation always creates a new object from the existing string and the new data. A StringBuilder object maintains a buffer to accommodate the concatenation of new data. New data is appended to the end of the buffer if room is available; otherwise, a new, larger buffer is allocated, data from the original buffer is copied to the new buffer, then the new data is appended to the new buffer. The performance of a concatenation operation for a String or StringBuilder object depends on how often a memory allocation occurs.
A String concatenation operation always allocates memory, whereas a StringBuilder concatenation operation only allocates memory if the StringBuilder object buffer is too small to accommodate the new data. Consequently, the String class is preferable for a concatenation operation if a fixed number of String objects are concatenated. In that case, the individual concatenation operations might even be combined into a single operation by the compiler. A StringBuilder object is preferable for a concatenation operation if an arbitrary number of strings are concatenated; for example, if a loop concatenates a random number of strings of user input.

That means huge memory allocations, which can push the system into heavy use of the swap file; swapping makes sections of your hard disk drive act like RAM, but a hard disk drive is very slow.

The StringBuilder option looks fine if you are the only user of the system, but when you have two or more users reading large files at the same time, you have a problem.

Cockchafer answered 29/1, 2010 at 12:42 Comment(2)
far out you guys are super quick! Unfortunately, because of the way the macros work, the entire stream needs to be loaded. As I mentioned, don't worry about the richtext part. It's the initial loading we're wanting to improve.Ean
So you can work in parts: read the first X lines, apply the macro, read the second X lines, apply the macro, and so on... If you explain what this macro does, we can help you with more precision.Cockchafer
8

For binary files, the fastest way of reading them that I have found is this:

using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(file))
using (MemoryMappedViewStream mms = mmf.CreateViewStream())
using (BinaryReader b = new BinaryReader(mms))
{
    // read from b here
}

In my tests it's hundreds of times faster.

Meade answered 30/9, 2014 at 12:38 Comment(2)
Do you have any hard evidence of this? Why should OP use this over any other answer? Please dig a bit deeper and give a bit more detailTalebearer
It's slower than FileStream by around 10-20 milliseconds.Peat
6

This should be enough to get you started.

class Program
{
    static void Main(String[] args)
    {
        const int bufferSize = 1024;

        var sb = new StringBuilder();
        var buffer = new Char[bufferSize];
        var length = 0L;
        var totalRead = 0L;
        var count = bufferSize;

        using (var sr = new StreamReader(@"C:\Temp\file.txt"))
        {
            // Length is in bytes, while Read returns chars, so treat the ratio as approximate.
            length = sr.BaseStream.Length;
            while (count > 0)
            {
                count = sr.Read(buffer, 0, bufferSize);
                sb.Append(buffer, 0, count);
                totalRead += count; // report totalRead / length here to drive a progress bar
            }
        }

        Console.ReadKey();
    }
}
Delastre answered 29/1, 2010 at 12:56 Comment(1)
I would move the "var buffer = new char[1024]" out of the loop: it's not necessary to create a new buffer each time. Just put it before "while (count > 0)".Caning
5

Have a look at the following code snippet. You mentioned that most files will be 30-40 MB. This claims to read 180 MB in 1.4 seconds on an Intel Quad Core:

private int _bufferSize = 16384;

private void ReadFile(string filename)
{
    StringBuilder stringBuilder = new StringBuilder();
    FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read);

    using (StreamReader streamReader = new StreamReader(fileStream))
    {
        char[] fileContents = new char[_bufferSize];
        int charsRead = streamReader.Read(fileContents, 0, _bufferSize);

        // Can't do much with 0 bytes
        if (charsRead == 0)
            throw new Exception("File is 0 bytes");

        while (charsRead > 0)
        {
            stringBuilder.Append(fileContents, 0, charsRead); // append only the chars actually read
            charsRead = streamReader.Read(fileContents, 0, _bufferSize);
        }
    }
}

Original Article

Dade answered 29/1, 2010 at 12:52 Comment(3)
These kind of tests are notoriously unreliable. You'll read data from the file system cache when you repeat the test. That's at least one order of magnitude faster than a real test that reads the data off the disk. A 180 MB file cannot possibly take less than 3 seconds. Reboot your machine, run the test once for the real number.Claudy
the line stringBuilder.Append is potentially dangerous, you need to replace it with stringBuilder.Append( fileContents, 0, charsRead ); to ensure you are not adding a full 1024 chars even when the stream has ended earlier.Showdown
@JohannesRudolph, your comment just solved me a bug. How did you come up with the 1024 number?Nanete
5

All excellent answers! However, for someone looking for an answer, these appear to be somewhat incomplete.

As a standard String can only be of a limited size (2 GB to 4 GB depending on your configuration), these answers do not really fulfil the OP's question. One method is to work with a List of Strings:

List<string> Words = new List<string>();

using (StreamReader sr = new StreamReader(@"C:\Temp\file.txt"))
{
    string line = string.Empty;

    while ((line = sr.ReadLine()) != null)
    {
        Words.Add(line);
    }
}

Some may want to tokenise and split the line when processing. The list of strings can now contain very large volumes of text.

Cajeput answered 22/1, 2018 at 5:58 Comment(0)
5

Whilst the most upvoted answer is correct, it lacks usage of multi-core processing. In my case, having 12 cores, I use Parallel.ForEach:

Parallel.ForEach(
    File.ReadLines(filename), // returns IEnumerable<string>: lazy loading
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, state, index) =>
    {
        //process line value
    }
);

Worth mentioning, I got this as an interview question asking to return the top 10 most frequent occurrences:

var result = new ConcurrentDictionary<string, int>(StringComparer.InvariantCultureIgnoreCase);
Parallel.ForEach(
    File.ReadLines(filename),
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    (line, state, index) =>
    {
        result.AddOrUpdate(line, 1, (key, val) => val + 1);        
    }
);

return result
    .OrderByDescending(x => x.Value)
    .Take(10)
    .Select(x => x.Key); // the 10 most frequent lines

Benchmarking: BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-8700K CPU 3.70GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
[Host]     : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT
DefaultJob : .NET Framework 4.8 (4.8.4250.0), X64 RyuJIT

Method              | Mean    | Error   | StdDev  | Gen 0   | Gen 1  | Gen 2 | Allocated
GetTopWordsSync     | 33.03 s | 0.175 s | 0.155 s | 1194000 | 314000 | 7000  | 7.06 GB
GetTopWordsParallel | 10.89 s | 0.121 s | 0.113 s | 1225000 | 354000 | 8000  | 7.18 GB

And as you can see, that's roughly a 3x speedup (the parallel version takes about a third of the time).

But please note that about 7 GB is allocated along the way, and that puts a lot of pressure on the GC.

Ewaewald answered 21/1, 2021 at 15:59 Comment(0)
4

You might be better off using memory-mapped file handling here. Memory-mapped file support will be in .NET 4 (I think... I heard that through someone else talking about it); hence this wrapper, which uses P/Invoke to do the same job.

Edit: See here on MSDN for how it works, and here's the blog entry indicating how it will be done in .NET 4 when it comes out as a release. The link I gave earlier is a wrapper around the P/Invoke to achieve this. You can map the entire file into memory and view it like a sliding window as you scroll through the file.
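On .NET 4 and later, a minimal sketch of that sliding-window idea with System.IO.MemoryMappedFiles (the path, window size and UTF-8 decoding are illustrative; note a window boundary can split a multi-byte character):

long fileLength = new FileInfo(path).Length;
const int windowSize = 1024 * 1024;                  // 1 MB window, arbitrary

using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
{
    var buffer = new byte[windowSize];
    for (long offset = 0; offset < fileLength; offset += windowSize)
    {
        int size = (int)Math.Min(windowSize, fileLength - offset);
        using (var view = mmf.CreateViewStream(offset, size, MemoryMappedFileAccess.Read))
        {
            view.Read(buffer, 0, size);
            string chunk = Encoding.UTF8.GetString(buffer, 0, size);
            // hand this window to the editor, search it, etc.
        }
    }
}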

Chancellor answered 29/1, 2010 at 12:52 Comment(0)
2

An iterator might be perfect for this type of work:

public static IEnumerable<int> LoadFileWithProgress(string filename, StringBuilder stringData)
{
    const int charBufferSize = 4096;
    using (FileStream fs = File.OpenRead(filename))
    {
        using (BinaryReader br = new BinaryReader(fs))
        {
            long length = fs.Length;
            int numberOfChunks = Convert.ToInt32((length / charBufferSize)) + 1;
            double iter = 100 / Convert.ToDouble(numberOfChunks);
            double currentIter = 0;
            yield return Convert.ToInt32(currentIter);
            while (true)
            {
                char[] buffer = br.ReadChars(charBufferSize);
                if (buffer.Length == 0) break;
                stringData.Append(buffer);
                currentIter += iter;
                yield return Convert.ToInt32(currentIter);
            }
        }
    }
}

You can call it using the following:

string filename = "C:\\myfile.txt";
StringBuilder sb = new StringBuilder();
foreach (int progress in LoadFileWithProgress(filename, sb))
{
    // Update your progress counter here!
}
string fileData = sb.ToString();

As the file is loaded, the iterator will return the progress number from 0 to 100, which you can use to update your progress bar. Once the loop has finished, the StringBuilder will contain the contents of the text file.

Also, because you want text, we can just use BinaryReader to read in characters, which will ensure that your buffers line up correctly when reading any multi-byte characters (UTF-8, UTF-16, etc.).

This is all done without using background tasks, threads, or complex custom state machines.

Parietal answered 9/7, 2010 at 18:35 Comment(0)
2

It's been more than 10 years since the last answers. This is my solution for reading text files of more than 10 GB and returning the first N lines you ask for. Putting it here in case anyone is seeking help :)

public static List<string> ReadFileNGetLine(string filepath, int lineCount)
{
    List<string> listString = new List<string>();

    FileInfo info = new FileInfo(filepath);
    if (info.Length < 10)
    {
        return listString;
    }

    StringBuilder resultAsString = new StringBuilder();

    using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(filepath))
    using (MemoryMappedViewStream memoryMappedViewStream = memoryMappedFile.CreateViewStream(0, info.Length))
    {
        for (long i = 0; i < info.Length; i++)
        {
            if (listString.Count >= lineCount)
            {
                break;
            }

            // Reads a byte from the stream and advances the position by one byte,
            // or returns -1 if at the end of the stream.
            int result = memoryMappedViewStream.ReadByte();
            if (result == -1)
            {
                break;
            }

            char letter = (char)result;

            // Treat '\n' as the end of a line; skip the '\r' of a "\r\n" pair.
            if (letter == '\n')
            {
                listString.Add(resultAsString.ToString());
                resultAsString.Clear();
            }
            else if (letter != '\r')
            {
                resultAsString.Append(letter);
            }
        }
    }

    return listString;
}
Modla answered 1/4, 2022 at 12:10 Comment(1)
MemoryMapped is better for random access, StreamReader is 10x faster for sequential readingCottonwood
0

My file is over 13 GB.

The link below contains code that reads a piece of the file easily:

Read a large text file

More information

Bimetallism answered 18/8, 2018 at 18:40 Comment(0)
