How to write super-fast file-streaming code in C#?

I have to split a huge file into many smaller files. Each destination file is defined by an offset and a length, both given in bytes. I'm using the following code:

private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);

    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}

Considering that I have to call this function about 100,000 times, it is remarkably slow.

Is there a way to connect the Writer directly to the Reader? (That is, without actually loading the contents into a buffer in memory.)
Doubleness answered 5/6, 2009 at 13:41 Comment(1)
Are you splitting the file perfectly, i.e. could you rebuild the large file by just joining all the small files together? If so, there are savings to be had there. If not, do the ranges of the small files overlap? Are they sorted in order of offset?Zygodactyl

I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:

public static void CopySection(Stream input, string targetFile, int length)
{
    byte[] buffer = new byte[8192];

    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:

public static void CopySection(Stream input, string targetFile,
                               int length, byte[] buffer)
{
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

Note that this also closes the output stream (due to the using statement) which your original code didn't.

The important point is that this will use the operating system file buffering more efficiently, because you reuse the same input stream, instead of reopening the file at the beginning and then seeking.

I think it'll be significantly faster, but obviously you'll need to try it to see...

This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.
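Putting those pieces together, the calling side might look something like the sketch below: the input file is opened once, wrapped in a BufferedStream, and a single buffer is reused across all sections. The file names and section lengths here are hypothetical, and the section list is assumed to be contiguous and in order.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FileSplitter
{
    // Split srcFile into consecutive pieces, opening the input only once
    // and reusing a single buffer across every section.
    public static void Split(string srcFile,
                             IList<(string File, int Length)> sections)
    {
        byte[] buffer = new byte[8192];
        using (Stream input = new BufferedStream(File.OpenRead(srcFile)))
        {
            foreach (var (file, length) in sections)
                CopySection(input, file, length, buffer);
        }
    }

    static void CopySection(Stream input, string targetFile,
                            int length, byte[] buffer)
    {
        using (Stream output = File.OpenWrite(targetFile))
        {
            int bytesRead = 1;
            // Finishes silently if we couldn't read "length" bytes.
            while (length > 0 && bytesRead > 0)
            {
                bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
                output.Write(buffer, 0, bytesRead);
                length -= bytesRead;
            }
        }
    }
}
```

Called as, say, `FileSplitter.Split("huge.bin", sections)`, this avoids both the reopen-and-seek cost and the per-call buffer allocation.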

Colly answered 5/6, 2009 at 13:49 Comment(7)
I know this is a two year old post, just wondered... is this still the fastest way? (i.e. Nothing new in .Net to be aware of?). Also, would it be faster to perform the Math.Min prior to entering the loop? Or better yet, to remove the length parameter as it can be calculated by means of the buffer? Sorry to be picky and necro this! Thanks in advance.Leschen
@Smudge202: Given that this is performing IO, the call to Math.Min is certainly not going to be relevant in terms of performance. The point of having both the length parameter and the buffer length is to allow you to reuse a possibly-oversized buffer.Colly
Gotcha, and thanks for getting back to me. I'd hate to start a new question when there is likely a good enough answer right here, but would you say, that if you wanted to read the first x bytes of a large number of files (for the purpose of grabbing the XMP metadata from a large number of files), the above approach (with some tweaking) would still be recommended?Leschen
@Smudge202: Well the code above is for copying. If you only want to read the first x bytes, I'd still loop round, but just read into a right-sized buffer, incrementing the index at which the read will write into the buffer appropriately on each iteration.Colly
Yup, I'm less interested in the writing part, I just wanted to confirm that the fastest way to read one file is also the fastest way to read many files. I imagined being able to P/Invoke for file pointers/offsets and from there being able to scan across multiple files with the same/less streams/buffers, which in my imaginary world of make believe, would possibly be even faster for what I want to achieve (though not applicable to the OP). If I'm not barking mad, probably best I start a new question. If I am, could you let me know so I don't waste even more peoples' time? :-)Leschen
@Smudge202: Do you actually have a performance problem right now? Have you written the simplest code that works and found that it's too slow? Bear in mind that a lot can depend on context - reading in parallel may help if you're using solid state, but not on a normal hard disk, for example.Colly
SO is advising I take this to a chat, I think I'll actually start a new question. Thank you thus far!Leschen

The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability, as well as a benchmarking program that looks at different I/O methods, including BinaryReader and BinaryWriter. See my blog post at:

http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/

Bayadere answered 3/3, 2011 at 22:55 Comment(1)
Thanks for the detailed blog information. Have a 'Nice Answer' badge!Discriminate

How large is length? You may do better to re-use a fixed-size (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.

(edit) something like:

private static void copy(string srcFile, string dstFile, int offset,
     int length, byte[] buffer)
{
    using(Stream inStream = File.OpenRead(srcFile))
    using (Stream outStream = File.OpenWrite(dstFile))
    {
        inStream.Seek(offset, SeekOrigin.Begin);
        int bufferLength = buffer.Length, bytesRead;
        while (length > bufferLength &&
            (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
        while (length > 0 &&
            (bytesRead = inStream.Read(buffer, 0, length)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }        
}
Lenin answered 5/6, 2009 at 13:48 Comment(2)
Any reason for the flush at the end? Closing it should do that. Also, I think you want to subtract from length in the first loop :)Colly
Good eyes Jon! The Flush was force of habit; from a lot of code when I pass streams in rather than open/close them in the method - it is convenient (if writing a non-trivial amount of data) to flush it before returning.Lenin

You shouldn't re-open the source file each time you do a copy; open it once and pass the resulting BinaryReader to the copy function instead. Also, it might help to sort your seeks, so you don't make big jumps inside the file.

If the lengths aren't too big, you can also batch several copy calls by grouping offsets that are near to each other and reading the whole block you need for them in one go. For example:

offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000

can be grouped to one read:

offset = 1234, length = 1116 (spanning from offset 1234 to the end of the last range at 2350)

Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.
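That grouping step can be sketched as below. The `maxGap` threshold, the tuple shape, and the class name are assumptions for illustration, not from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class RangeGrouper
{
    // Merge byte ranges whose gap to the previous merged range is at most
    // maxGap, so each group can be satisfied by a single read.
    public static List<(long Offset, int Length)> Group(
        IEnumerable<(long Offset, int Length)> ranges, int maxGap)
    {
        var sorted = ranges.OrderBy(r => r.Offset).ToList();
        var groups = new List<(long Offset, int Length)>();
        foreach (var r in sorted)
        {
            if (groups.Count > 0)
            {
                var last = groups[groups.Count - 1];
                long lastEnd = last.Offset + last.Length;
                if (r.Offset - lastEnd <= maxGap)
                {
                    // Extend the current group to cover this range too.
                    long newEnd = Math.Max(lastEnd, r.Offset + r.Length);
                    groups[groups.Count - 1] =
                        (last.Offset, (int)(newEnd - last.Offset));
                    continue;
                }
            }
            groups.Add(r);
        }
        return groups;
    }
}
```

With the three ranges above and a gap threshold of 64 bytes, this collapses them into the single read starting at offset 1234.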

Waters answered 5/6, 2009 at 13:49 Comment(0)

Have you considered using the CCR (Concurrency and Coordination Runtime)? Since you are writing to separate files, you can do everything in parallel (read and write), and the CCR makes it very easy to do this.

static class Program
{
    // file_path and split_size are assumed to be defined elsewhere in the
    // original; declared here so the sample compiles.
    const string file_path = "c:\\in.txt";
    const long split_size = 8192;

    static void Main(string[] args)
    {
        Dispatcher dp = new Dispatcher();
        DispatcherQueue dq = new DispatcherQueue("DQ", dp);

        Port<long> offsetPort = new Port<long>();

        Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
            new Handler<long>(Split)));

        FileStream fs = File.Open(file_path, FileMode.Open);
        long size = fs.Length;
        fs.Dispose();

        for (long i = 0; i < size; i += split_size)
        {
            offsetPort.Post(i);
        }
    }

    private static void Split(long offset)
    {
        FileStream reader = new FileStream(file_path, FileMode.Open,
            FileAccess.Read);
        reader.Seek(offset, SeekOrigin.Begin);
        long toRead = 0;
        if (offset + split_size <= reader.Length)
            toRead = split_size;
        else
            toRead = reader.Length - offset;

        byte[] buff = new byte[toRead];
        // Stream.Read may return fewer bytes than requested, so read in a loop.
        int total = 0;
        while (total < toRead)
        {
            int n = reader.Read(buff, total, (int)(toRead - total));
            if (n == 0) break;
            total += n;
        }
        reader.Dispose();
        File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
    }
}

This code posts offsets to a CCR port, which causes a thread to be created to execute the code in the Split method. This causes you to open the file multiple times, but gets rid of the need for synchronization. You can make it more memory-efficient, but you'll have to sacrifice speed.

Rodie answered 5/6, 2009 at 14:57 Comment(1)
Remember that with this (or any threading solution) you can hit a point where you max out your I/O: you will have reached your best throughput (e.g. when attempting to write hundreds or thousands of small files at the same time, or several large files). I have always found that if I can make a single file read/write efficient, there is little I can do to improve on that by parallelising. (Assembly can help a lot; reads/writes written in assembler can be spectacular, up to the I/O limits, but it is a pain to write, and you need to be sure you want direct hardware or BIOS-level access to your devices.)Towne

The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?

Over 100,000 accesses (sum the times): How much time is spent allocating the buffer array? How much time is spent opening the file for read (is it the same file every time?) How much time is spent in read and write operations?

If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a filestream for writes? (try it, do you get identical output? does it save time?)
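One hedged way to take those measurements is to time the read and write phases separately with a Stopwatch; the class and method names below are illustrative, not from the answer.

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class CopyProfiler
{
    // Copy `length` bytes from srcFile to dstFile, returning the total
    // milliseconds spent reading versus writing so the two can be compared.
    public static (long ReadMs, long WriteMs) TimeCopy(
        string srcFile, string dstFile, int length)
    {
        var readTime = new Stopwatch();
        var writeTime = new Stopwatch();
        byte[] buffer = new byte[8192];

        using (Stream input = File.OpenRead(srcFile))
        using (Stream output = File.OpenWrite(dstFile))
        {
            while (length > 0)
            {
                readTime.Start();
                int n = input.Read(buffer, 0, Math.Min(length, buffer.Length));
                readTime.Stop();
                if (n == 0) break; // ran out of input early

                writeTime.Start();
                output.Write(buffer, 0, n);
                writeTime.Stop();
                length -= n;
            }
        }
        return (readTime.ElapsedMilliseconds, writeTime.ElapsedMilliseconds);
    }
}
```

Summing the two figures over all 100,000 calls would show which side dominates before any optimisation effort is spent.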

Beggar answered 5/6, 2009 at 13:52 Comment(0)

Using FileStream + StreamWriter, I know it's possible to create massive files in little time (less than 1 minute 30 seconds). Using that technique, I generate three files totalling 700+ megabytes from one source file.

Your primary problem with the code you're using is that you are opening a file every time. That is creating file I/O overhead.

If you knew the names of the files you would be generating ahead of time, you could extract the File.OpenWrite into a separate method; it will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.

Sexagenarian answered 5/6, 2009 at 15:31 Comment(0)

Has no one suggested threading? Writing the smaller files looks like a textbook example of where threads are useful. Set up a bunch of threads to create the smaller files; this way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files (a disk operation) will take far longer than splitting up the data. And of course you should verify first that a sequential approach is not adequate.
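As a sketch of that idea, each worker below opens its own read-only handle on the source file, so no locking is needed between them. Parallel.ForEach is a later .NET addition used here for brevity; the class name and section tuple are hypothetical.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class ParallelSplitter
{
    // Write each (file, offset, length) section on the thread pool.
    // Every worker gets its own FileStream, so no synchronization is needed.
    public static void SplitAll(string srcFile,
        (string File, long Offset, int Length)[] sections)
    {
        Parallel.ForEach(sections, section =>
        {
            using (var input = new FileStream(srcFile, FileMode.Open,
                                              FileAccess.Read, FileShare.Read))
            {
                input.Seek(section.Offset, SeekOrigin.Begin);
                byte[] buffer = new byte[section.Length];
                int read = 0;
                // Stream.Read may return fewer bytes than asked for.
                while (read < section.Length)
                {
                    int n = input.Read(buffer, read, section.Length - read);
                    if (n == 0) break; // unexpected end of file
                    read += n;
                }
                File.WriteAllBytes(section.File, buffer);
            }
        });
    }
}
```

Whether this beats the sequential version depends entirely on the disk, as the comment below points out.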

Maiga answered 5/6, 2009 at 14:21 Comment(1)
Threading may help, but his bottleneck is surely on the I/O -- the CPU is probably spending a lot of time waiting on the disk. That's not to say that threading wouldn't make any difference (for example, if the writes are to different spindles, then he might get a better performance boost than he would if it were all on one disk)Beggar

(For future reference.)

Quite possibly the fastest way to do this would be to use memory mapped files (so primarily copying memory, and the OS handling the file reads/writes via its paging/memory management).

Memory Mapped files are supported in managed code in .NET 4.0.

But as noted, you need to profile, and expect to switch to native code for maximum performance.
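On .NET 4 and later, a section copy via a memory-mapped view might look roughly like this. It is a sketch, not a benchmarked implementation, and the file names are placeholders.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedCopy
{
    public static void CopySection(string srcFile, string dstFile,
                                   long offset, long length)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(
                   srcFile, FileMode.Open, null, 0,
                   MemoryMappedFileAccess.Read))
        // A view over just the requested section; the OS pages it in on
        // demand rather than reading it all up front.
        using (var view = mmf.CreateViewStream(
                   offset, length, MemoryMappedFileAccess.Read))
        using (var output = File.Create(dstFile))
        {
            view.CopyTo(output);
        }
    }
}
```

Note that .NET handles the page-alignment of the view offset internally, though as the comment below observes, mapping may not help if disk access time is the real bottleneck.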

Alroi answered 5/6, 2009 at 14:8 Comment(1)
Memory-mapped files are page-aligned, so they are out. The problem here is more likely disc access time, and memory-mapped files wouldn't help with that anyway. The OS is going to manage caching files whether they are memory-mapped or not.Zygodactyl

© 2022 - 2024 — McMap. All rights reserved.