Improve speed of splitting file

I am using this code to extract a chunk from a file:

// info is FileInfo object pointing to file
var percentSplit = info.Length * 50 / 100; // extract 50% of file
var bytes = new byte[percentSplit];
var fileStream = File.OpenRead(fileName);
fileStream.Read(bytes, 0, bytes.Length);
fileStream.Dispose();
File.WriteAllBytes(splitName, bytes);

Is there any way to speed up this process?

Currently for a 530 MB file it takes around 4 - 5 seconds. Can this time be improved?

Vociferation answered 10/2, 2013 at 4:13 Comment(12)
Extracting 50% of the file into one big array isn't efficient; why not use a 4 KB to 8 KB buffer? If you have .NET 4 or greater you can use Memory Mapped Files (see the sketch after these comments).Friseur
What is the performance of your disk system? 100 MB/s sounds pretty reasonable.Govea
Can I ask what you are splitting the file for? Is splitting the file your end result, or is this an intermediate step to get around another issue?Church
Arrays larger than 85 KB will end up on the seldom-collected, never-compacted large object heap. So, if this is something called very often from a long-running process, you could wind up with memory problems reading 200+ MB into an array.Matilda
I'm interested in your question and have started a bounty for it. I'd like to get an answer better than mine.Tomato
Although I don't have experience with this, the new ReFS file system might help you out. From what I've read, it's implemented as allocate-on-write, so if you just copy the file to 2 files and change the size of the first file (using SetLength), it should save you half the time.Kickshaw
FYI: using PInvoke you can use memory mapped files in pre-.NET 4 environments as well.Boutique
@devgeezer: if allocating larger than 85k objects is a problem, then they should've capped arrays to 85k. It is not a problem. In fact, he should allocate much larger than 85k and reuse that array as much as possible.Temporary
5 seconds for writing 530/2 MB is adequate performance for a regular disk subsystem. The program's algorithm does not seem to be the bottleneck.Jeremiah
@HermanSchoenfeld for more detail regarding the 85k threshold and GC behaviors of the LOH, read the following msdn article. msdn.microsoft.com/en-us/magazine/cc534993.aspxMatilda
@Matilda I read that article before I wrote here. The only real problem with LOH allocations is the possibility of memory fragmentation. It doesn't mean such arrays shouldn't be allocated. The case here warrants it, he would only need to allocate a single array (much) larger than 85k (perhaps 1MB) and simply reuse it. It would be much slower (software & hardware wise) to use a smaller array.Temporary
To be clear, I never said not to allocate a large array; I offered a warning that there are consequences of frequent LOH allocation. There are several memory concerns for large-object use detailed in that article: LOH objects are collected in Gen 2 (the least frequently GC'd generation), the LOH is never compacted (so fragmentation can lead to memory bloat), and the CLR zeroes memory before returning from allocation (which can degrade performance).Matilda
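
A minimal sketch of the memory-mapped-file idea mentioned in the comments above, assuming .NET 4 or later; the file names are placeholders and error handling is omitted.

using System.IO;
using System.IO.MemoryMappedFiles;

class MmfSplit
{
    static void Main()
    {
        const string sourceName = "source.bin"; // hypothetical input file
        const string splitName = "split.bin";   // hypothetical output file

        long half = new FileInfo(sourceName).Length * 50 / 100;

        using (var mmf = MemoryMappedFile.CreateFromFile(sourceName, FileMode.Open))
        using (var view = mmf.CreateViewStream(0, half, MemoryMappedFileAccess.Read))
        using (var output = File.Create(splitName))
        {
            // CopyTo streams the first 50% of the mapped file to the new file
            // using an internal buffer, so no manual chunking is needed.
            view.CopyTo(output);
        }
    }
}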

There are several cases to your question, but none of them is language-specific.

The following are some things to consider:

  • What is the file system of source/destination file?
  • Do you want to keep original source file?
  • Do they lie on the same drive?

In C#, you will hardly find a method faster than File.Copy, which invokes the Win32 CopyFile function internally. Because the percentage is fifty, however, the following code might not be faster: it copies the whole file and then sets the length of the destination file.

var info = new FileInfo(fileName);
var percentSplit = info.Length * 50 / 100; // extract 50% of file

File.Copy(info.FullName, splitName);
using (var outStream = File.OpenWrite(splitName))
    outStream.SetLength(percentSplit); // truncate the copy to the first 50%

Further, if

  1. you don't need to keep the original source after the file is split
  2. the destination drive is the same as the source
  3. you are not using a crypto/compression-enabled file system

then the best thing you can do is not copy the file data at all; a managed-code sketch of the simplest such case follows the steps below. For example, if your source file lies on a FAT or FAT32 file system, what you can do is

  1. create a new directory entry (or entries) for the newly split part(s) of the file
  2. let the entry (or entries) point to the cluster(s) of the target part(s)
  3. set the correct file size for each entry
  4. check for cross-links and avoid them
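
Not the raw FAT-entry manipulation above, but a minimal managed-code sketch of the "don't copy at all" idea for the simplest case only: you need just the first 50% and the original file does not have to survive, so you truncate the source in place. The file name below is a placeholder.

using System.IO;

class TruncateInPlace
{
    static void Main()
    {
        const string fileName = "source.bin"; // hypothetical file to split in place
        long half = new FileInfo(fileName).Length * 50 / 100;

        // SetLength only moves the end-of-file marker; no data is read or copied.
        using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Write))
            stream.SetLength(half);
    }
}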

If your file system is NTFS, you might need to spend a long time studying the spec.

Good luck!

Tomato answered 10/2, 2013 at 11:56 Comment(2)
+1: Ken, I have deleted my answer as I found a fairly serious bug which meant my approach did not perform reliably, and once fixed was actually much slower than yours. I will be really interested to see if anything can actually beat the performance of File.Copy.Zumwalt
This is actually a good benchmark for any suggested solution, which should run about twice as fast. Assuming File.Copy() runs at a given system's max, copying only half of it should take about half that time.Barry

var percentSplit = (int)(info.Length * 50 / 100); // extract 50% of file
var buffer = new byte[8192];                      // small reusable buffer, well below the 85 KB LOH threshold
using (Stream input = File.OpenRead(info.FullName))
using (Stream output = File.OpenWrite(splitName))
{
    int bytesRead = 1;
    while (percentSplit > 0 && bytesRead > 0)
    {
        // never request more than the bytes remaining in the 50% share
        bytesRead = input.Read(buffer, 0, Math.Min(percentSplit, buffer.Length));
        output.Write(buffer, 0, bytesRead);
        percentSplit -= bytesRead;
    }
    output.Flush();
}

The flush may not be needed, but it doesn't hurt. This was quite interesting: changing the loop to a do-while rather than a while had a big hit on performance. I suppose the IL is not as fast. My PC was running the original code in 4-6 seconds; the attached code seemed to run in about 1 second.

Contralto answered 25/2, 2013 at 18:27 Comment(0)

I get better results when reading/writing in chunks of a few megabytes. The performance also changes depending on the size of the chunk.

FileInfo info = new FileInfo(@"C:\source.bin");

long count = 0;
long split = info.Length * 50 / 100;
long chunk = 8000000; // ~8 MB per read/write

DateTime start = DateTime.Now;

using (FileStream f = File.OpenRead(info.FullName))
using (BinaryReader br = new BinaryReader(f))
using (FileStream t = File.OpenWrite(@"C:\split.bin"))
using (BinaryWriter bw = new BinaryWriter(t))
{
    while (count < split)
    {
        // shrink the last chunk so we stop exactly at 50%
        if (count + chunk > split)
        {
            chunk = split - count;
        }

        bw.Write(br.ReadBytes((int)chunk));
        count += chunk;
    }
}

Console.WriteLine(DateTime.Now - start);
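
Since the best chunk size varies by machine, here is a minimal sketch (hypothetical paths, same 50% split, one reusable buffer per run) that times a few buffer sizes so you can measure it yourself:

using System;
using System.Diagnostics;
using System.IO;

class ChunkSizeBenchmark
{
    static void Main()
    {
        const string source = @"C:\source.bin"; // hypothetical input
        const string target = @"C:\split.bin";  // hypothetical output
        long split = new FileInfo(source).Length * 50 / 100;

        foreach (int chunkSize in new[] { 64 * 1024, 1024 * 1024, 8 * 1024 * 1024 })
        {
            var buffer = new byte[chunkSize];
            var watch = Stopwatch.StartNew();

            using (var input = File.OpenRead(source))
            using (var output = File.Create(target)) // Create truncates any previous result
            {
                long remaining = split;
                while (remaining > 0)
                {
                    int read = input.Read(buffer, 0, (int)Math.Min(remaining, buffer.Length));
                    if (read == 0) break; // unexpected end of file
                    output.Write(buffer, 0, read);
                    remaining -= read;
                }
            }

            Console.WriteLine("{0,8} KB buffer: {1}", chunkSize / 1024, watch.Elapsed);
        }
    }
}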
Spectra answered 11/2, 2013 at 12:21 Comment(2)
You shouldn't allocate chunks bigger than 85 KB; see devgeezer's remark in the question.Rawlings
Allocating chunks larger than 85k is fine. In fact, the larger the better, so long as you reuse that chunk as much as possible. The only problem is fragmentation of the Large Object Heap which can result in an out of memory exception. Reusing the large buffer will prevent that, and when the buffer is no longer used (and memory is needed), it will be collected. No problem.Temporary
