Best way to read a large file into a byte array in C#?

12

485

I have a web server which will read large binary files (several megabytes) into byte arrays. The server could be reading several files at the same time (different page requests), so I am looking for the most efficient way to do this without taxing the CPU too much. Is the code below good enough?

public byte[] FileToByteArray(string fileName)
{
    byte[] buff = null;
    FileStream fs = new FileStream(fileName, 
                                   FileMode.Open, 
                                   FileAccess.Read);
    BinaryReader br = new BinaryReader(fs);
    long numBytes = new FileInfo(fileName).Length;
    buff = br.ReadBytes((int) numBytes);
    return buff;
}
Lipps answered 8/1, 2010 at 21:24 Comment(4)
Your example can be abbreviated to byte[] buff = File.ReadAllBytes(fileName).Eustasius
Why does it being a third party webservice imply the file needs to be fully in RAM before being sent to the webservice, rather than streamed? The webservice won't know the difference.Verger
@Brian, some clients don't know how to handle a .NET stream, Java for instance. When that is the case, all that can be done is to read the entire file into a byte array.Bentonbentonite
@sjeffrey: I said the data should be streamed, not passed as a .NET stream. The clients won't know the difference either way.Verger
919

Simply replace the whole thing with:

return File.ReadAllBytes(fileName);

However, if you are concerned about memory consumption, you should not read the whole file into memory all at once; read it in chunks instead.
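For illustration, here is a minimal sketch of chunked reading; the 80 KB buffer size and the processChunk callback are placeholders of my own, not part of this answer:

using System;
using System.IO;

static void ReadInChunks(string fileName, Action<byte[], int> processChunk)
{
    byte[] buffer = new byte[80 * 1024]; // stays below the 85 KB large-object-heap threshold
    using (FileStream fs = File.OpenRead(fileName))
    {
        int bytesRead;
        // Read may return fewer bytes than requested; it returns 0 only at end of file.
        while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
        {
            processChunk(buffer, bytesRead);
        }
    }
}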

Ectosarc answered 8/1, 2010 at 21:27 Comment(11)
this method is limited to files smaller than 2 GB (the Int32.MaxValue limit)Tarr
File.ReadAllBytes throws OutOfMemoryException with big files (tested with 630 MB file and it failed)Cortney
@juanjo.arana Yeah, well... of course there'll always be something that doesn't fit in memory, in which case, there's no answer to the question. Generally, you should stream the file rather than holding it all in memory. You might want to look at this for a stopgap measure: msdn.microsoft.com/en-us/library/hh285054%28v=vs.110%29.aspxEctosarc
There is a limit for array size in .NET, but in .NET 4.5 you can turn on support for large arrays ( > 2GB) using special config option see msdn.microsoft.com/en-us/library/hh285054.aspxFaggot
@LeakyCode If I only want to read the first x bytes, e.g. 10, then is the read all bytes still the best way?Preterhuman
@harag No, and that's not what the question asks.Ectosarc
Make sure that you enable your web project, and your IIS web site, to run under 64-bit in order to support large objectsPerales
OutOfMemoryException on a large file > 600 mb and thus not a solution for large filesHungarian
This should not be the accepted or top-rated answer for a large file read, at least the code given. The statement "you should not read the whole file into memory all at once at all. You should do that in chunks" is correct and should have been backed by code. Downvoting until that part is rectified, as this answer's code is very misleading and contradictory to that very correct statement.Kylie
What about the encoding of the file?Lawley
I was not able to use ReadAllBytes() because it doesn't let you specify FileShare and the like, so it is not great.Balakirev
85

I might argue that the answer here generally is "don't". Unless you absolutely need all the data at once, consider using a Stream-based API (or some variant of reader / iterator). That is especially important when you have multiple parallel operations (as suggested by the question) to minimise system load and maximise throughput.

For example, if you are streaming data to a caller:

Stream dest = ...
using(Stream source = File.OpenRead(path)) {
    byte[] buffer = new byte[2048];
    int bytesRead;
    while((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0) {
        dest.Write(buffer, 0, bytesRead);
    }
}
Ethelred answered 8/1, 2010 at 21:44 Comment(10)
To add to your statement, I even suggest considering async ASP.NET handlers if you have an I/O bound operation like streaming a file to the client. However, if you have to read the whole file to a byte[] for some reason, I suggest avoiding streams or anything else and just using the system provided API.Ectosarc
@Mehrdad - agreed; but the full context isn't clear. Likewise MVC has action-results for this.Ethelred
Yes I need all the data at once. It's going to a third party webservice.Lipps
What is the system provided API?Lipps
@Tony: I stated in my answer: File.ReadAllBytes.Ectosarc
How can I use this to write to another byte arrayDorset
@iGod by changing the offset each time to advance by however many bytes you read, and decrementing the amount to read each time by the same amount (start with toRead = target.Length); so: int offset = 0; int toRead = target.Length; while((bytesRead = source.Read(target, offset, toRead)) > 0) { offset += bytesRead; toRead -= bytesRead; }Ethelred
but when I try to get the length from the source in advance, the code breaks and gives a System.OutOfMemoryExceptionDorset
@iGod do you have code to show what you're doing there? how big is the data stream? yes: if you try to load everything into memory at once, it may well explode; especially for things over about 800MiBEthelred
@mmx, what does "system provided API" mean in your comment?Army
45

I would think this:

byte[] file = System.IO.File.ReadAllBytes(fileName);
Vitrain answered 8/1, 2010 at 21:28 Comment(1)
Note that this can stall when getting really large files.Kylie
38

Your code can be factored to this (in lieu of File.ReadAllBytes):

public byte[] ReadAllBytes(string fileName)
{
    byte[] buffer = null;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        buffer = new byte[fs.Length];
        fs.Read(buffer, 0, (int)fs.Length);
    }
    return buffer;
} 

Note the Int32.MaxValue file-size limitation imposed by the Read method; in other words, you can only read a 2 GB chunk at once.

Also note that the FileStream constructor has overloads that take a buffer size as an argument.

I would also suggest reading about FileStream and BufferedStream.

As always, a simple sample program that profiles which approach is fastest will be most beneficial.

Also your underlying hardware will have a large effect on performance. Are you using server based hard disk drives with large caches and a RAID card with onboard memory cache? Or are you using a standard drive connected to the IDE port?
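As a side note on the code above: Stream.Read is only guaranteed to return at least one byte per call, so a more defensive variant (a sketch of my own, not this answer's code) loops until the buffer is full and uses a checked cast:

using System.IO;

public byte[] ReadAllBytesChecked(string fileName)
{
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    {
        // checked() throws if the file is larger than int.MaxValue instead of silently truncating.
        byte[] buffer = new byte[checked((int)fs.Length)];
        int offset = 0;
        while (offset < buffer.Length)
        {
            // Read returns how many bytes were actually read, which can be fewer than requested.
            int read = fs.Read(buffer, offset, buffer.Length - offset);
            if (read == 0)
                throw new EndOfStreamException("File ended before the expected number of bytes was read.");
            offset += read;
        }
        return buffer;
    }
}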

Discontent answered 8/1, 2010 at 21:36 Comment(5)
Why would type of hardware make a difference? So if it's IDE you use some .NET method and if it's RAID you use another?Lipps
@Lipps - It has nothing to do with what calls you make from your programming language. There are different types of hard disk drives. For example, Seagate drives are classified as "AS" or "NS" with NS being the server based, large cache drive where-as the "AS" drive is the consumer - home computer based drive. Seek speeds and internal transfer rates also affect how fast you can read something from disk. RAID arrays can vastly improve read/write performance through caching. So you might be able to read the file all at once, but the underlying hardware is still the deciding factor.Discontent
This code contains a critical bug. Read is only required to return at least 1 byte.Twigg
I would make sure to wrap the long to int cast with the checked construct like this: checked((int)fs.Length)Carnage
I would just do var binaryReader = new BinaryReader(fs); fileData = binaryReader.ReadBytes((int)fs.Length); in that using statement. But that's effectively like what the OP did, just I cut out a line of code by casting fs.Length to int instead of getting the long value of the FileInfo length and converting that.Kylie
13

I'd say BinaryReader is fine, but can be refactored to this, instead of all those lines of code for getting the length of the buffer:

public byte[] FileToByteArray(string fileName)
{
    byte[] fileData = null;

    using (FileStream fs = File.OpenRead(fileName)) 
    { 
        using (BinaryReader binaryReader = new BinaryReader(fs))
        {
            fileData = binaryReader.ReadBytes((int)fs.Length); 
        }
    }
    return fileData;
}

This should be better than using .ReadAllBytes(): in the comments on the top answer (which uses .ReadAllBytes()), one commenter had problems with files > 600 MB, and a BinaryReader is meant for this sort of thing. Also, putting it in a using statement ensures the FileStream and BinaryReader are closed and disposed.

Kylie answered 12/10, 2016 at 0:18 Comment(2)
For C#, need to use "using (FileStream fs = File.OpenRead(fileName)) " instead of "using (FileStream fs = new File.OpenRead(fileName)) " as given above. Just removed new keyword before File.OpenRead()Nakamura
@Syed The code above WAS written for C#, but you're right that new wasn't needed there. Removed.Kylie
10

Depending on the frequency of operations, the size of the files, and the number of files you're looking at, there are other performance issues to take into consideration. One thing to remember is that each of your byte arrays will be released at the mercy of the garbage collector. If you're not caching any of that data, you could end up creating a lot of garbage and losing most of your performance to % Time in GC.

If the chunks are larger than 85 KB, you'll be allocating on the Large Object Heap (LOH), which requires a collection of all generations to free up (this is very expensive, and on a server will stop all execution while it's going on). Additionally, if you have a ton of objects on the LOH, you can end up with LOH fragmentation (the LOH is never compacted), which leads to poor performance and out-of-memory exceptions. You can recycle the process once you hit a certain point, but I don't know whether that's a best practice.

The point is, you should consider the full life cycle of your app before necessarily just reading all the bytes into memory the fastest way possible or you might be trading short term performance for overall performance.
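To illustrate the point, here is a rough sketch of renting a reusable buffer from ArrayPool (available on modern .NET via System.Buffers, which did not exist when this answer was written; the 64 KB size and processChunk callback are assumptions), so each request does not allocate and discard a fresh large array:

using System;
using System.Buffers;
using System.IO;

static void ProcessFilePooled(string fileName, Action<byte[], int> processChunk)
{
    // Rent a buffer from the shared pool instead of allocating a new array per request,
    // so repeated requests do not keep creating garbage for the collector.
    byte[] buffer = ArrayPool<byte>.Shared.Rent(64 * 1024);
    try
    {
        using (FileStream fs = File.OpenRead(fileName))
        {
            int bytesRead;
            while ((bytesRead = fs.Read(buffer, 0, buffer.Length)) > 0)
            {
                processChunk(buffer, bytesRead);
            }
        }
    }
    finally
    {
        // Always return the buffer so other requests can reuse it.
        ArrayPool<byte>.Shared.Return(buffer);
    }
}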

Glyptics answered 8/1, 2010 at 22:25 Comment(1)
There is C# source code about this, covering garbage collector management, chunks, performance, event counters, ...Lawley
2

In case 'a large file' means something beyond the 4 GB limit, the following code logic is appropriate. The key issue to notice is the long data type used with the Seek method, since a long can address offsets beyond the 2^32 boundary. In this example, the code first processes the large file in chunks of 1 GB; after all the whole 1 GB chunks are processed, the leftover (< 1 GB) bytes are processed. I use this code for calculating the CRC of files beyond 4 GB in size. (This example uses https://crc32c.machinezoo.com/ for the CRC-32C calculation.)

private uint Crc32CAlgorithmBigCrc(string fileName)
{
    uint hash = 0;
    byte[] buffer = null;
    FileInfo fileInfo = new FileInfo(fileName);
    long fileLength = fileInfo.Length;
    int blockSize = 1024000000;                                      // ~1 GB per block
    int blocks = (int)(fileLength / blockSize);                      // number of whole blocks
    int restBytes = (int)(fileLength - ((long)blocks * blockSize));  // leftover bytes (< 1 block)
    long offsetFile = 0;
    bool firstBlock = true;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    using (BinaryReader br = new BinaryReader(fs))
    {
        while (blocks > 0)
        {
            blocks -= 1;
            // Seek takes a long offset, so it can address positions beyond 2^32.
            fs.Seek(offsetFile, SeekOrigin.Begin);
            buffer = br.ReadBytes(blockSize);
            if (firstBlock)
            {
                firstBlock = false;
                hash = Crc32CAlgorithm.Compute(buffer);
            }
            else
            {
                // Chain the running CRC so every block contributes to the final value.
                hash = Crc32CAlgorithm.Append(hash, buffer);
            }
            offsetFile += blockSize;
        }
        if (restBytes > 0)
        {
            fs.Seek(offsetFile, SeekOrigin.Begin);
            buffer = br.ReadBytes(restBytes);
            hash = firstBlock ? Crc32CAlgorithm.Compute(buffer)
                              : Crc32CAlgorithm.Append(hash, buffer);
        }
        buffer = null;
    }
    return hash;
}
Kansu answered 26/4, 2019 at 4:16 Comment(0)
2

Overview: if your image is added as an embedded resource (Build Action = Embedded Resource), use Assembly.GetExecutingAssembly to retrieve the jpg resource as a stream, then read the binary data from the stream into a byte array.

public byte[] GetAImage()
{
    byte[] bytes = null;
    var assembly = Assembly.GetExecutingAssembly();
    var resourceName = "MYWebApi.Images.X_my_image.jpg";

    using (Stream stream = assembly.GetManifestResourceStream(resourceName))
    using (var memoryStream = new MemoryStream())
    {
        // CopyTo keeps reading until the resource stream is exhausted, so it is not
        // affected by Stream.Read returning fewer bytes than requested.
        stream.CopyTo(memoryStream);
        bytes = memoryStream.ToArray();
    }
    return bytes;
}
Cajeput answered 15/6, 2020 at 21:45 Comment(0)
0

Use the BufferedStream class in C# to improve performance. A buffer is a block of bytes in memory used to cache data, thereby reducing the number of calls to the operating system. Buffers improve read and write performance.

See the following for a code example and additional explanation: http://msdn.microsoft.com/en-us/library/system.io.bufferedstream.aspx
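As a rough sketch of the idea (the file name, buffer sizes, and reading into a MemoryStream are placeholders of my own, not taken from the linked article):

using System.IO;

static byte[] ReadWithBufferedStream(string fileName)
{
    using (FileStream fs = File.OpenRead(fileName))
    // BufferedStream batches many small reads into fewer, larger reads against the file.
    using (BufferedStream bs = new BufferedStream(fs, 64 * 1024))
    using (MemoryStream ms = new MemoryStream())
    {
        byte[] chunk = new byte[4096];
        int bytesRead;
        while ((bytesRead = bs.Read(chunk, 0, chunk.Length)) > 0)
        {
            ms.Write(chunk, 0, bytesRead);
        }
        return ms.ToArray();
    }
}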

Reyreyes answered 8/1, 2010 at 21:37 Comment(3)
What's the point of using a BufferedStream when you're reading the whole thing at once?Ectosarc
He asked for the best performance not to read the file at once.Reyreyes
Performance is measurable in the context of an operation. Additional buffering for a stream that you're reading sequentially, all at once, to memory is not likely to benefit from an extra buffer.Ectosarc
0

use this:

 bytesRead = responseStream.ReadAsync(buffer, 0, buffer.Length).Result;
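For context, a minimal sketch of how that call might be used inside an async method to read a response stream into a byte array; the stream parameter and 4 KB buffer are assumptions, and awaiting avoids blocking on .Result:

using System.IO;
using System.Threading.Tasks;

static async Task<byte[]> ReadStreamAsync(Stream responseStream)
{
    using (var ms = new MemoryStream())
    {
        byte[] buffer = new byte[4096];
        int bytesRead;
        // ReadAsync may return fewer bytes than requested; 0 means the stream is finished.
        while ((bytesRead = await responseStream.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            ms.Write(buffer, 0, bytesRead);
        }
        return ms.ToArray();
    }
}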
Renaerenaissance answered 13/4, 2019 at 7:39 Comment(1)
Welcome to Stack Overflow! As explanations are an important part of answers on this platform, please explain your code and how it solves the problem in the question and why it might be better than other answers. Our guide How to write a good answer might be helpful for you. ThanksImago
-5

I would recommend trying the Response.TransmitFile() method, then Response.Flush() and Response.End(), for serving your large files.
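A rough sketch of that sequence inside an ASP.NET (System.Web) page or handler; the file path and content type are placeholders:

// Inside an ASP.NET page or handler; the path below is a placeholder.
Response.ContentType = "application/octet-stream";
Response.TransmitFile(@"C:\files\large-file.bin"); // streams the file to the client without loading it into memory
Response.Flush();
Response.End();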

Nacre answered 19/1, 2010 at 23:37 Comment(0)
-8

If you're dealing with files above 2 GB, you'll find that the above methods fail.

It's much easier just to hand the stream off to MD5 and allow that to chunk your file for you:

private byte[] computeFileHash(string filename)
{
    // Dispose both the hash algorithm and the stream when done.
    using (MD5 md5 = MD5.Create())
    using (FileStream fs = new FileStream(filename, FileMode.Open))
    {
        byte[] hash = md5.ComputeHash(fs);
        return hash;
    }
}
Gloxinia answered 20/10, 2014 at 9:56 Comment(1)
I don't see how the code is relevant to the question (or what you suggest in the written text)Infusionism
