How to write a large buffer into a binary file in C++, fast?

Asked 19/7, 2012 at 15:18 Answered 23/8, 2016 at 9:43

Solved c++performance optimization file-io io

288

I'm trying to write huge amounts of data onto my SSD (solid state drive). By huge amounts I mean 80GB.

I browsed the web for solutions, but the best I came up with was this:

#include <fstream>
const unsigned long long size = 64ULL*1024ULL*1024ULL;
unsigned long long a[size];
int main()
{
    std::fstream myfile;
    myfile = std::fstream("file.binary", std::ios::out | std::ios::binary);
    //Here would be some error handling
    for(int i = 0; i < 32; ++i){
        //Some calculations to fill a[]
        myfile.write((char*)&a,size*sizeof(unsigned long long));
    }
    myfile.close();
}

Compiled with Visual Studio 2010 with full optimizations and run under Windows 7, this program maxes out around 20MB/s. What really bothers me is that Windows can copy files from another SSD to this SSD at somewhere between 150MB/s and 200MB/s, so at least 7 times faster. That's why I think I should be able to go faster.

Any ideas how I can speed up my writing?

Presumptuous answered 19/7, 2012 at 15:18 Comment(22)

Have you tried playing with your disk buffering settings? You can set that through Device Manager -> Disk drives -> right click on a drive. – Hemihydrate 19/7, 2012 at 15:23

Did your timing results exclude the time it takes to do your computations to fill a[] ? – Amputate 19/7, 2012 at 15:24

@philippe That kinda defeats the purpose of writing to disk. – Hemihydrate 19/7, 2012 at 15:24

I've actually done this task before. Using simple fwrite() I could get around 80% of peak write speeds. Only with FILE_FLAG_NO_BUFFERING was I ever able to get max speed. – Hemihydrate 19/7, 2012 at 15:26

I'm talking about doing it in chunks of memory – Vogt 19/7, 2012 at 15:26

Get velocity using the win32 API! msdn.microsoft.com/en-us/library/windows/desktop/… – Stallworth 19/7, 2012 at 15:27

That is not quite how one programs a fast IO app on Windows. Read Designing Applications for High Performance - Part III – Cohere 19/7, 2012 at 15:30

Try maximizing the output buffer size and make writes of exactly the same size. – Ettie 19/7, 2012 at 15:32

I just tested the code and indeed it does only achieve a small fraction of my 100+ MB/s bandwidth on my HD. Hmm... I have disk cache enabled in Windows. – Hemihydrate 19/7, 2012 at 15:32

I'm not sure it's fair to compare your file writing to SSD-to-SSD copying. It might well be that SSD-to-SSD works on a lower level, avoiding the C++ libraries, or using direct memory access (DMA). Copying something is not the same as writing arbitrary values to a random access file. – Barrera 19/7, 2012 at 15:36

I just wrote a FILE* / fwrite() equivalent of this and it gets 90 MB/s on my machine. Using C++ streams gets only 20 MB/s... go figure... – Hemihydrate 19/7, 2012 at 15:40

@IgorF.: That's just wrong speculation; it's a perfectly fair comparison (if nothing else, in favor of file writing). Copying across a drive in Windows is just read-and-write; nothing fancy/complicated/different going on underneath. – Circumfluent 19/7, 2012 at 15:46

I think it was discussed a few times before: use memory mapped files. – Gibun 19/7, 2012 at 15:52

@MaximYegorushkin: Link or it didn't happen. :P – Circumfluent 19/7, 2012 at 15:55

iostreams are known to be terribly slow. See #4340896 – Unlay 19/7, 2012 at 15:56

Have you tried the C++ fast file copy method? https://mcmap.net/q/17090/-copy-a-file-in-a-sane-safe-and-efficient-way – Downandout 19/7, 2012 at 17:55

@BenVoigt: iostreams are slow when using the formatted stream operations (usually via operator<<). When it is a binary file and you are using chunks of this size (512M) and using write() there is no difference in performance between std::ofstream and FILE*: see my answer. – Downandout 19/7, 2012 at 17:57

@Loki: Look at my question (I linked it in an earlier comment). Overhead is different between glibc and Visual C++ runtime library. So your conclusions based on Linux benchmarking don't really apply to this question. – Unlay 19/7, 2012 at 18:22

If possible, unroll your loop manually, that can help with the speed too depending on how/if the compiler unrolls the code for you. The looping means that the processor has to branch to the start of the loop again, and branches are relatively expensive. – Hemeralopia 19/7, 2012 at 20:38

I wonder why nobody commented on this line: myfile = fstream("file.binary", ios::out | ios::binary); which will NOT even compile, because copy semantics of stream classes are disabled in the stdlib. – Begin 20/7, 2012 at 10:9

Is there no low-level system routine for that? For instance, on Windows you have the CopyFileEx link – Mays 24/7, 2012 at 18:48

@Mysticial, such a great difference (20MB/90MB) could be explained by flushing/updating directory metadata etc, during the writing. I've not been doing anything C level on windows for ages, but that would be my 1st guess. – Espresso 25/7, 2012 at 23:37

289

This did the job (in the year 2012):

#include <stdio.h>
const unsigned long long size = 8ULL*1024ULL*1024ULL;
unsigned long long a[size];

int main()
{
    FILE* pFile;
    pFile = fopen("file.binary", "wb");
    for (unsigned long long j = 0; j < 1024; ++j){
        //Some calculations to fill a[]
        fwrite(a, 1, size*sizeof(unsigned long long), pFile);
    }
    fclose(pFile);
    return 0;
}

I just timed 8GB in 36sec, which is about 220MB/s, and I think that maxes out my SSD. Also worth noting: the code in the question used 100% of one core, whereas this code only uses 2-5%.

Thanks a lot to everyone.

Update: 5 years have passed; it's 2017 now. Compilers, hardware, libraries and my requirements have changed. That's why I made some changes to the code and did some new measurements.

First up the code:

#include <fstream>
#include <chrono>
#include <vector>
#include <cstdint>
#include <numeric>
#include <random>
#include <algorithm>
#include <iostream>
#include <cassert>

std::vector<uint64_t> GenerateData(std::size_t bytes)
{
    assert(bytes % sizeof(uint64_t) == 0);
    std::vector<uint64_t> data(bytes / sizeof(uint64_t));
    std::iota(data.begin(), data.end(), 0);
    std::shuffle(data.begin(), data.end(), std::mt19937{ std::random_device{}() });
    return data;
}

long long option_1(std::size_t bytes)
{
    std::vector<uint64_t> data = GenerateData(bytes);

    auto startTime = std::chrono::high_resolution_clock::now();
    auto myfile = std::fstream("file.binary", std::ios::out | std::ios::binary);
    myfile.write((char*)&data[0], bytes);
    myfile.close();
    auto endTime = std::chrono::high_resolution_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count();
}

long long option_2(std::size_t bytes)
{
    std::vector<uint64_t> data = GenerateData(bytes);

    auto startTime = std::chrono::high_resolution_clock::now();
    FILE* file = fopen("file.binary", "wb");
    fwrite(&data[0], 1, bytes, file);
    fclose(file);
    auto endTime = std::chrono::high_resolution_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count();
}

long long option_3(std::size_t bytes)
{
    std::vector<uint64_t> data = GenerateData(bytes);

    std::ios_base::sync_with_stdio(false);
    auto startTime = std::chrono::high_resolution_clock::now();
    auto myfile = std::fstream("file.binary", std::ios::out | std::ios::binary);
    myfile.write((char*)&data[0], bytes);
    myfile.close();
    auto endTime = std::chrono::high_resolution_clock::now();

    return std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count();
}

int main()
{
    const std::size_t kB = 1024;
    const std::size_t MB = 1024 * kB;
    const std::size_t GB = 1024 * MB;

    for (std::size_t size = 1 * MB; size <= 4 * GB; size *= 2) std::cout << "option1, " << size / MB << "MB: " << option_1(size) << "ms" << std::endl;
    for (std::size_t size = 1 * MB; size <= 4 * GB; size *= 2) std::cout << "option2, " << size / MB << "MB: " << option_2(size) << "ms" << std::endl;
    for (std::size_t size = 1 * MB; size <= 4 * GB; size *= 2) std::cout << "option3, " << size / MB << "MB: " << option_3(size) << "ms" << std::endl;

    return 0;
}

This code compiles with Visual Studio 2017 and g++ 7.2.0 (a new requirement). I ran the code with two setups:

Laptop, Core i7, SSD, Ubuntu 16.04, g++ Version 7.2.0 with -std=c++11 -march=native -O3
Desktop, Core i7, SSD, Windows 10, Visual Studio 2017 Version 15.3.1 with /Ox /Ob2 /Oi /Ot /GT /GL /Gy

The measurements (after discarding the values for 1MB, because they were obvious outliers) show that in both setups option1 and option3 max out my SSD. I didn't expect to see this, because option2 used to be the fastest code on my old machine back then.

TL;DR: My measurements suggest using std::fstream over FILE*.

Presumptuous answered 19/7, 2012 at 16:11 Comment(15)

+1 Yeah, this was the first thing I tried. FILE* is faster than streams. I wouldn't have expected such a difference since it "should've" been I/O bound anyway. – Hemihydrate 19/7, 2012 at 16:13

Can we conclude that C-style I/O is (strangely) much faster than C++ streams? – Morra 19/7, 2012 at 16:26

@SChepurin: If you're being pedantic, probably not. If you're being practical, probably yes. :) – Circumfluent 19/7, 2012 at 16:34

Could you please explain (for a C++ dunce like me) the difference between the two approaches, and why this one works so much faster than the original? – Lector 25/7, 2012 at 14:0

Does prepending ios::sync_with_stdio(false); make any difference for the code with stream? I'm just curious how big difference there is between using this line and not, but I don't have the fast enough disk to check the corner case. And if there is any real difference. – Dortheydorthy 25/7, 2012 at 23:29

Yes, C FILE is several times faster. But why? Shouldn't there be some optimized C++ stream that can compete with it? I too write large binary files and it would be nice to not have to call C routines in a C++ class. – Hexameter 16/10, 2012 at 19:2

You are not writing 8GB but 8*1024^3*sizeof(long long) – Keynesianism 31/1, 2013 at 18:44

Why write size*sizeof(unsigned long long) at a time? whats the logic? Whats your hdd block, sector and memory page size? – Supernova 9/8, 2013 at 10:52

Be aware that even specifically addressing buffering, etc., significant performance differences can be had with fwrite - moving items from the size and count parameters makes a difference, as does breaking up calls to fwrite with loops. I ran into this while working out another issue. – Blackguard 28/2, 2014 at 22:6

This program copies 64GB, and there's no timing code (so your "fill a[]" apparently is being timed), so any conclusion is worthless. – Pluviometer 8/11, 2014 at 18:38

@JimBalter - he's apparently checking disk I/O rate, outside the application, as did many others who commented on the question. – Laaspere 22/8, 2016 at 13:44

I see absolutely no difference between option 1 and option 3. Also, how many times did you run each microbenchmark for each size? I find it hard to believe that FILE* is slower than streams. If it is slower, then that means that whatever standard library you're using is either using the FILE* better, or it has a lower-level API that it's using. – Resolved 30/11, 2017 at 1:3

Maybe the performance improvement can come from using write instead of fwrite (unbuffered vs buffered io) – Resolved 30/11, 2017 at 1:10

I have done more tests with my modified program, which can generate binary and ASCII data as well. Here it is: ReadWriteTest.cpp, and here are my Binary-data-set, ascii-data-set and real-apps-test. – Yetty 29/7, 2018 at 15:17

@ArturCzajka I believe the std::ios_base::sync_with_stdio(false); has no effect at all for reading and writing to file streams, as it applies only to the default streams like std::cin and std::cout. So, if one writes a lot of data to std::cout, setting sync_with_stdio() to false probably has a measurable effect. Here, it doesn't. – Curable 26/4, 2021 at 19:3

Try the following, in order:

Smaller buffer size. Writing ~2 MiB at a time might be a good start. On my last laptop, ~512 KiB was the sweet spot, but I haven't tested on my SSD yet.

Note: I've noticed that very large buffers tend to decrease performance. I've noticed speed losses with using 16-MiB buffers instead of 512-KiB buffers before.
Use _open (or _topen if you want to be Windows-correct) to open the file, then use _write. This will probably avoid a lot of buffering, but it's not certain to.
Use Windows-specific functions like CreateFile and WriteFile. That will avoid any buffering in the standard library.
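The first suggestion (smaller writes) can be sketched in portable C++ as follows. This is only an illustration, not the answerer's code: write_in_chunks is a hypothetical helper name, and the chunk size is passed in by the caller (for example 512 KiB or 2 MiB, as suggested above), so the best value still has to be measured on the target machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: writes `total` bytes from `data` to `file`
// in pieces of at most `chunk` bytes instead of one huge fwrite().
// Returns the number of bytes actually written (== total on success).
std::size_t write_in_chunks(std::FILE* file, const char* data,
                            std::size_t total, std::size_t chunk)
{
    assert(chunk > 0);
    std::size_t written = 0;
    while (written < total) {
        const std::size_t n = std::min(chunk, total - written);
        const std::size_t done = std::fwrite(data + written, 1, n, file);
        written += done;
        if (done != n) break;  // short write: caller can inspect ferror(file)
    }
    return written;
}
```

A caller would open the file with fopen(path, "wb"), call write_in_chunks(f, buffer, bytes, 512 * 1024), and compare the return value with bytes to detect a failed write.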

Circumfluent answered 19/7, 2012 at 15:53 Comment(3)

Check any benchmark results posted online. You need either 4kB writes with a queue depth of 32 or more, or else 512K or higher writes, to get any sort of decent throughput. – Unlay 19/7, 2012 at 16:6

@BenVoigt: Yup, that correlates with me saying 512 KiB was the sweet spot for me. :) – Circumfluent 19/7, 2012 at 16:7

Yes. From my experience, smaller buffer sizes are usually optimal. The exception is when you're using FILE_FLAG_NO_BUFFERING - in which larger buffers tend to be better. Since I think FILE_FLAG_NO_BUFFERING is pretty much DMA. – Hemihydrate 19/7, 2012 at 16:12

I see no difference between std::fstream/FILE/device, or between buffering and non-buffering.

Also note:

SSD drives "tend" to slow down (lower transfer rates) as they fill up.
SSD drives "tend" to slow down (lower transfer rates) as they get older (because of non working bits).

I am seeing the code run in 63 seconds.
Thus a transfer rate of 260M/s (my SSD looks slightly faster than yours).

64 * 1024 * 1024 * 8 /*sizeof(unsigned long long) */ * 32 /*Chunks*/

= 16G
= 16G/63 = 260M/s

I get no increase by moving to FILE* from std::fstream.

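#include <stdio.h>

// Same buffer as in the question
const unsigned long long size = 64ULL*1024ULL*1024ULL;
unsigned long long a[size];

int main()
{
    // "wb": binary mode, so no newline translation happens on Windows
    FILE* stream = fopen("binary", "wb");
    if (!stream) return 1;

    for(int loop=0;loop < 32;++loop)
    {
         fwrite(a, sizeof(unsigned long long), size, stream);
    }
    fclose(stream);

}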

So the C++ streams are working as fast as the underlying library will allow.

But I think it is unfair comparing the OS to an application that is built on-top of the OS. The application can make no assumptions (it does not know the drives are SSD) and thus uses the file mechanisms of the OS for transfer.

While the OS does not need to make any assumptions. It can tell the types of the drives involved and use the optimal technique for transferring the data. In this case a direct memory to memory transfer. Try writing a program that copies 80G from 1 location in memory to another and see how fast that is.

Edit

I changed my code to use the lower level calls:
ie no buffering.

#include <fcntl.h>
#include <unistd.h>


const unsigned long long size = 64ULL*1024ULL*1024ULL;
unsigned long long a[size];
int main()
{
    int data = open("test", O_WRONLY | O_CREAT, 0777);
    for(int loop = 0; loop < 32; ++loop)
    {   
        write(data, a, size * sizeof(unsigned long long));
    }   
    close(data);
}

This made no difference.

NOTE: My drive is an SSD; if you have a normal drive you may see a difference between the two techniques above. But as I expected, non-buffering and buffering (when writing large chunks greater than the buffer size) make no difference.
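One caveat with the raw write() call: POSIX allows it to write fewer bytes than requested, and to fail with EINTR when a signal arrives, so a robust version loops until everything is written. The following is a minimal sketch under that assumption; write_all and write_file are hypothetical helper names, not part of the code above, and the example is POSIX-only.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: keeps calling write() until all `count` bytes are out.
// Returns true on success, false on error (errno is left as set by write()).
bool write_all(int fd, const void* buffer, std::size_t count)
{
    assert(fd >= 0);
    const char* p = static_cast<const char*>(buffer);
    while (count > 0) {
        const ssize_t n = ::write(fd, p, count);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            return false;
        }
        p += n;                            // short write: advance and go on
        count -= static_cast<std::size_t>(n);
    }
    return true;
}

// Hypothetical wrapper: creates/truncates `path` and writes the whole buffer.
bool write_file(const char* path, const void* buffer, std::size_t count)
{
    const int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    const bool ok = write_all(fd, buffer, count);
    return (::close(fd) == 0) && ok;
}
```

In the loop above, write(data, a, size * sizeof(unsigned long long)) would become write_all(data, a, size * sizeof(unsigned long long)), with the return value checked.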

Edit 2:

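Have you tried the fastest method of copying files in C++?

#include <fstream>

int main()
{
    // Binary mode so the bytes are copied unchanged on every platform
    std::ifstream  input("input", std::ios::binary);
    std::ofstream  output("output", std::ios::binary);

    output << input.rdbuf();
}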

Downandout answered 19/7, 2012 at 16:4 Comment(13)

I didn't downvote, but your buffer size is too small. I did it with the same 512 MB buffer the OP is using and I get 20 MB/s with streams vs. 90 MB/s with FILE*. – Hemihydrate 19/7, 2012 at 16:5

Also your way with fwrite(a, sizeof(unsigned long long), size, stream); instead of fwrite(a, 1, size*sizeof(unsigned long long), pFile); gives me 220MB/s with chunks of 64MB per write. – Presumptuous 19/7, 2012 at 16:14

@Mysticial: It surprises me that buffer size makes a difference (though I believe you). The buffer is useful when you have lots of small writes, so that the underlying device is not bothered with many requests. But when you are writing huge chunks there is no need for a buffer when writing/reading (on a blocking device). As such the data should be passed directly to the underlying device (thus bypassing the buffer). If you see a difference, though, that contradicts this and makes me wonder why the write is actually using a buffer at all. – Downandout 19/7, 2012 at 17:22

The best solution is NOT to increase the buffer size but to remove the buffer and make write pass the data directly to the underlying device. – Downandout 19/7, 2012 at 17:22

But this does not change my thought that it is an unfair comparison. – Downandout 19/7, 2012 at 17:25

Well, each time you call fwrite(), you have the usual function call overhead as well as other error-checking/buffering overhead inside it. So you need the block to be "large enough" to where this overhead becomes insignificant. From my experience, it's usually about several hundred bytes to a few KB. Without internal buffering, it's easily more than 1MB since you may need to offset disk seek latency as well. (Internal buffering will coalesce multiple small writes.) – Hemihydrate 19/7, 2012 at 17:27

@Mysticial: 1) There are no small chunks => It is always large enough (in this example). In this case chunks are 512M 2) This is an SSD drive (both mine and the OP) so none of that is relevant. I have updated my answer. – Downandout 19/7, 2012 at 17:38

It might be an OS issue. Are you on Linux? All my results are on Windows. Perhaps Windows has a heavier I/O interface underneath. The OP reports 100% CPU using C++ streams and 2-5% with FILE*. – Hemihydrate 19/7, 2012 at 17:41

@BSD Linux (ie Mac) using an SSD drive. – Downandout 19/7, 2012 at 17:51

Ah ok. +1 for the edit. I suppose you don't get 100% CPU using C++ streams? I just re-ran the test with TM open, and it definitely hogs an entire core. If that's the case, then we can conclude it's the OS or the implementation. – Hemihydrate 19/7, 2012 at 17:58

@Mysticial: Are you using SSD or spinning drive? – Downandout 19/7, 2012 at 18:1

Normal HD. Though I doubt it matters much because of OS write-coalescing. – Hemihydrate 19/7, 2012 at 18:2

@PanicSheep: I am not sure what the difference between the two calls is. The overall size is the same. It's not as if underneath it spins up a loop and makes size calls, each writing sizeof(unsigned long long) bytes. The interface is there to make it easy to write code; underneath the interface the only difference is the total size of the buffer. – Downandout 19/7, 2012 at 18:5

The best solution is to implement an async writing with double buffering.

Look at the time line:

------------------------------------------------>
FF|WWWWWWWW|FF|WWWWWWWW|FF|WWWWWWWW|FF|WWWWWWWW|

The 'F' represents time for filling the buffer, and 'W' represents time for writing the buffer to disk. So the problem is the time wasted between writing buffers to the file. However, by implementing writing on a separate thread, you can start filling the next buffer right away, like this:

------------------------------------------------> (main thread, fills buffers)
FF|ff______|FF______|ff______|________|
------------------------------------------------> (writer thread)
  |WWWWWWWW|wwwwwwww|WWWWWWWW|wwwwwwww|

F - filling 1st buffer
f - filling 2nd buffer
W - writing 1st buffer to file
w - writing 2nd buffer to file
_ - wait while operation is completed

This approach with buffer swaps is very useful when filling a buffer requires more complex computation (hence, more time). I always implement a CSequentialStreamWriter class that hides asynchronous writing inside, so for the end-user the interface has just Write function(s).

And the buffer size must be a multiple of disk cluster size. Otherwise, you'll end up with poor performance by writing a single buffer to 2 adjacent disk clusters.

Writing the last buffer.
When you call the Write function for the last time, you have to make sure that the buffer currently being filled is written to disk as well. Thus CSequentialStreamWriter should have a separate method, let's say Finalize (final buffer flush), which writes the last portion of data to disk.

Error handling.
If the code has started filling the 2nd buffer while the 1st one is being written on a separate thread, and that write fails for some reason, the main thread should be made aware of the failure.

------------------------------------------------> (main thread, fills buffers)
FF|fX|
------------------------------------------------> (writer thread)
__|X|

Let's assume the Write function of CSequentialStreamWriter returns bool or throws an exception. When an error occurs on the separate thread, you have to remember that state, so that the next time you call Write or Finalize on the main thread, the method returns false or throws an exception. It does not really matter at which point you stopped filling a buffer, even if you wrote some data ahead after the failure: most likely the file would be corrupted and useless.
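The scheme described above can be sketched roughly like this. It is an illustrative guess at the shape of such a class, not the answerer's CSequentialStreamWriter: the class name SequentialWriter, its members, and the use of std::thread with fwrite are all assumptions, and it does not align the buffer size to the disk cluster size (the caller would have to pick a suitable bufSize). Errors from the writer thread are remembered and reported at the next buffer hand-over or in Finalize.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical double-buffered writer: the caller fills one buffer while
// a background thread writes the other one to disk.
class SequentialWriter {
public:
    SequentialWriter(const char* path, std::size_t bufSize)
        : file_(std::fopen(path, "wb")), failed_(file_ == nullptr)
    {
        assert(bufSize > 0);
        buf_[0].resize(bufSize);
        buf_[1].resize(bufSize);
        worker_ = std::thread([this] { run(); });
    }
    ~SequentialWriter() { Finalize(); }

    // Copies data into the current buffer; full buffers go to the writer thread.
    // Returns false once an earlier background write is known to have failed.
    bool Write(const void* src, std::size_t n)
    {
        const char* p = static_cast<const char*>(src);
        while (n > 0) {
            const std::size_t k = std::min(n, buf_[fill_].size() - used_);
            std::memcpy(buf_[fill_].data() + used_, p, k);
            used_ += k; p += k; n -= k;
            if (used_ == buf_[fill_].size() && !submit()) return false;
        }
        return true;
    }

    // Flushes the partially filled buffer, stops the thread, closes the file.
    bool Finalize()
    {
        if (used_ > 0) submit();
        { std::lock_guard<std::mutex> lock(m_); done_ = true; }
        cv_.notify_all();
        if (worker_.joinable()) worker_.join();
        if (file_) { if (std::fclose(file_) != 0) failed_ = true; file_ = nullptr; }
        return !failed_;
    }

private:
    // Hands the filled buffer to the writer and switches to the other buffer.
    bool submit()
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return pending_ < 0; });  // '_' in the timeline
        if (failed_) return false;
        pending_ = fill_; pendingSize_ = used_;
        fill_ = 1 - fill_; used_ = 0;
        cv_.notify_all();
        return true;
    }

    // Writer thread: waits for a pending buffer and writes it outside the lock.
    void run()
    {
        std::unique_lock<std::mutex> lock(m_);
        for (;;) {
            cv_.wait(lock, [this] { return pending_ >= 0 || done_; });
            if (pending_ < 0) return;  // finished and nothing left to write
            const int idx = pending_;
            const std::size_t size = pendingSize_;
            lock.unlock();
            const bool ok = file_ &&
                std::fwrite(buf_[idx].data(), 1, size, file_) == size;
            lock.lock();
            if (!ok) failed_ = true;
            pending_ = -1;             // buffer idx may be refilled now
            cv_.notify_all();
        }
    }

    std::FILE* file_;
    bool failed_;
    std::vector<char> buf_[2];
    int fill_ = 0;                 // index of the buffer being filled
    std::size_t used_ = 0;         // bytes already in that buffer
    int pending_ = -1;             // buffer owned by the writer thread, or -1
    std::size_t pendingSize_ = 0;
    bool done_ = false;
    std::mutex m_;
    std::condition_variable cv_;
    std::thread worker_;
};
```

Typical use would be to construct the writer, call Write for each computed block, and call Finalize once at the end, checking its return value. On some toolchains the program has to be linked with thread support (for example -pthread with g++).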

Firewater answered 28/8, 2014 at 0:56 Comment(1)

Performing I/O in parallel with computations is a very good idea, but on Windows you shouldn't use threads to accomplish it. Instead, use "Overlapped I/O", which doesn't block one of your threads during the I/O call. It means you barely have to worry about thread synchronization (just don't access a buffer that has an active I/O operation using it). – Unlay 3/5, 2015 at 15:46

I'd suggest trying file mapping. I used mmap in the past, in a UNIX environment, and I was impressed by the high performance I could achieve.
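For readers who have not used mmap for writing, a minimal POSIX sketch might look like the following. It is an assumption-laden illustration rather than the answerer's code: write_via_mmap is a hypothetical helper, it maps the whole file at once (for something like 80GB you would more likely map and fill one window at a time), and whether it beats plain write() has to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper: writes `count` bytes to `path` through a shared
// memory mapping (POSIX only).
bool write_via_mmap(const char* path, const void* src, std::size_t count)
{
    assert(count > 0);  // mmap() rejects zero-length mappings
    const int fd = ::open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    // The file must have its final size before it is mapped.
    if (::ftruncate(fd, static_cast<off_t>(count)) != 0) {
        ::close(fd);
        return false;
    }
    void* map = ::mmap(nullptr, count, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (map == MAP_FAILED) return false;
    std::memcpy(map, src, count);  // "writing" is just a memory copy
    const bool synced = ::msync(map, count, MS_SYNC) == 0;  // flush to disk
    const bool unmapped = ::munmap(map, count) == 0;
    return synced && unmapped;
}
```

In a real program the computation could fill the mapped region directly instead of copying from a separate array, which is where the approach tends to pay off.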

Generable answered 19/7, 2012 at 21:35 Comment(2)

@nalply It's still a working, efficient and interesting solution to keep in mind. – Margenemargent 24/7, 2012 at 20:13

https://mcmap.net/q/92639/-when-should-i-use-mmap-for-file-access about the pros and cons of mmap. Especially note "For pure sequential accesses to the file, it is also not always the better solution [...]" Also stackoverflow.com/questions/726471, it effectively says that on a 32-bit system you are limited to 2 or 3 GB. - by the way, it's not me who downvoted that answer. – Bolide 25/7, 2012 at 8:53

Could you use FILE* instead, and then measure the performance you've gained? One option is to use fwrite/write instead of fstream:

#include <stdio.h>

int main ()
{
  FILE * pFile;
  char buffer[] = { 'x' , 'y' , 'z' };
  pFile = fopen ( "myfile.bin" , "w+b" );
  fwrite (buffer , 1 , sizeof(buffer) , pFile );
  fclose (pFile);
  return 0;
}

If you decide to use write, try something similar:

#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    int filedesc = open("testfile.txt", O_WRONLY | O_APPEND);

    if (filedesc < 0) {
        return -1;
    }

    if (write(filedesc, "This will be output to testfile.txt\n", 36) != 36) {
        write(2, "There was an error writing to testfile.txt\n", 43);
        return -1;
    }

    return 0;
}

I would also advise you to look into memory mapping. That may be your answer. Once I had to process a 20GB file in order to store it in a database, and the file would not even open. The solution was to use a memory map. I did that in Python, though.

Vogt answered 19/7, 2012 at 15:50 Comment(3)

Actually, a straight-forward FILE* equivalent of the original code using the same 512 MB buffer gets full speed. Your current buffer is too small. – Hemihydrate 19/7, 2012 at 15:53

@Hemihydrate But that's just an example. – Vogt 19/7, 2012 at 15:54

On most systems, 2 corresponds to standard error, but it's still recommended that you use STDERR_FILENO instead of 2. Another important issue is that one possible errno you can get is EINTR, for when you receive an interrupt signal; this is not a real error and you should simply try again. – Anacreontic 27/4, 2019 at 2:5

fstreams are not slower than C streams, per se, but they use more CPU (especially if buffering is not properly configured). When a CPU saturates, it limits the I/O rate.

At least the MSVC 2015 implementation copies 1 char at a time to the output buffer when a stream buffer is not set (see streambuf::xsputn). So make sure to set a stream buffer (>0).

I can get a write speed of 1500MB/s (the full speed of my M.2 SSD) with fstream using this code:

#include <iostream>
#include <fstream>
#include <chrono>
#include <memory>
#include <cstring>  // memcpy, strcmp
#include <stdio.h>
#ifdef __linux__
#include <unistd.h>
#endif
using namespace std;
using namespace std::chrono;
const size_t sz = 512 * 1024 * 1024;
const int numiter = 20;
const size_t bufsize = 1024 * 1024;
int main(int argc, char**argv)
{
  unique_ptr<char[]> data(new char[sz]);
  unique_ptr<char[]> buf(new char[bufsize]);
  for (size_t p = 0; p < sz; p += 16) {
    memcpy(&data[p], "BINARY.DATA.....", 16);
  }
  unlink("file.binary");
  int64_t total = 0;
  if (argc < 2 || strcmp(argv[1], "fopen") != 0) {
    cout << "fstream mode\n";
    ofstream myfile("file.binary", ios::out | ios::binary);
    if (!myfile) {
      cerr << "open failed\n"; return 1;
    }
    myfile.rdbuf()->pubsetbuf(buf.get(), bufsize); // IMPORTANT
    for (int i = 0; i < numiter; ++i) {
      auto tm1 = high_resolution_clock::now();
      myfile.write(data.get(), sz);
      if (!myfile)
        cerr << "write failed\n";
      auto tm = (duration_cast<milliseconds>(high_resolution_clock::now() - tm1).count());
      cout << tm << " ms\n";
      total += tm;
    }
    myfile.close();
  }
  else {
    cout << "fopen mode\n";
    FILE* pFile = fopen("file.binary", "wb");
    if (!pFile) {
      cerr << "open failed\n"; return 1;
    }
    setvbuf(pFile, buf.get(), _IOFBF, bufsize); // NOT important
    auto tm1 = high_resolution_clock::now();
    for (int i = 0; i < numiter; ++i) {
      auto tm1 = high_resolution_clock::now();
      if (fwrite(data.get(), sz, 1, pFile) != 1)
        cerr << "write failed\n";
      auto tm = (duration_cast<milliseconds>(high_resolution_clock::now() - tm1).count());
      cout << tm << " ms\n";
      total += tm;
    }
    fclose(pFile);
    auto tm2 = high_resolution_clock::now();
  }
  cout << "Total: " << total << " ms, " << (sz*numiter * 1000 / (1024.0 * 1024 * total)) << " MB/s\n";
}

I tried this code on other platforms (Ubuntu, FreeBSD) and noticed no I/O rate differences, but a CPU usage difference of about 8:1 (fstream used 8 times more CPU). So one can imagine, had I a faster disk, the fstream write would slow down sooner than the stdio version.
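One portability note on the pubsetbuf call: the code above installs the buffer after the file is already open, and whether that has any effect is implementation-defined. libstdc++ (g++), for instance, ignores setbuf once the file is open. A variant that sets the buffer before opening behaves the same way across implementations. The sketch below assumes nothing beyond the standard library; write_with_stream_buffer is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical helper: writes `count` bytes through an ofstream whose
// buffer is installed before the file is opened.
bool write_with_stream_buffer(const char* path, const char* data,
                              std::size_t count, std::size_t bufSize)
{
    assert(bufSize > 0);
    std::vector<char> buf(bufSize);  // declared first: must outlive the stream
    std::ofstream out;
    out.rdbuf()->pubsetbuf(buf.data(),
                           static_cast<std::streamsize>(buf.size()));
    out.open(path, std::ios::out | std::ios::binary);
    if (!out) return false;
    out.write(data, static_cast<std::streamsize>(count));
    out.close();                     // sets failbit if the final flush fails
    return !out.fail();
}
```

With a 1 MiB buffer this corresponds to the bufsize used in the benchmark above; the measured effect will depend on the standard library in use.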

Laaspere answered 23/8, 2016 at 9:43 Comment(0)

Try using open()/write()/close() API calls and experiment with the output buffer size. I mean, do not pass the whole "many-many-bytes" buffer at once; do several writes (i.e., TotalNumBytes / OutBufferSize of them). OutBufferSize can be anywhere from 4096 bytes to a megabyte.

Another try - use WinAPI OpenFile/CreateFile and use this MSDN article to turn off buffering (FILE_FLAG_NO_BUFFERING). And this MSDN article on WriteFile() shows how to get the block size for the drive to know the optimal buffer size.

Anyway, std::ofstream is a wrapper and there might be blocking on I/O operations. Keep in mind that traversing the entire N-gigabyte array also takes some time. While you are writing a small buffer, it gets to the cache and works faster.

Callaghan answered 19/7, 2012 at 15:25 Comment(0)

If you copy something from disk A to disk B in explorer, Windows employs DMA. That means for most of the copy process, the CPU will basically do nothing other than telling the disk controller where to put, and get data from, eliminating a whole step in the chain, and one that is not at all optimized for moving large amounts of data - and I mean hardware.

What you do involves the CPU a lot. I want to point you to the "Some calculations to fill a[]" part, which I think is essential. You generate a[], then you copy from a[] to an output buffer (that's what fstream::write does), then you generate again, etc.

What to do? Multithreading! (I hope you have a multi-core processor)

fork.
Use one thread to generate a[] data
Use the other to write data from a[] to disk
You will need two arrays a1[] and a2[] and switch between them
You will need some sort of synchronization between your threads (semaphores, message queue, etc.)
Use lower level, unbuffered functions, like the WriteFile function mentioned by Mehrdad

Tyche answered 19/7, 2012 at 16:33 Comment(0)

Try to use memory-mapped files.

Pym answered 19/7, 2012 at 15:43 Comment(18)

@Mehrdad but why? Because it's a platform dependent solution? – Pym 19/7, 2012 at 15:45

No... it's because in order to do fast sequential file writing, you need to write large amounts of data at once. (Say, 2-MiB chunks is probably a good starting point.) Memory mapped files don't let you control the granularity, so you're at the mercy of whatever the memory manager decides to prefetch/buffer for you. In general, I've never seen them be as effective as normal reading/writing with ReadFile and such for sequential access, although for random access they may well be better. – Circumfluent 19/7, 2012 at 15:48

But memory-mapped files are used by OS for paging, for example. I think it's a highly optimized (in terms of speed) way to read/write data. – Pym 19/7, 2012 at 15:51

@Mysticial: People 'know" a lot of things that are just plain wrong. – Unlay 19/7, 2012 at 15:53

@qehgt: If anything, paging is much more optimized for random access than sequential access. Reading 1 page of data is much slower than reading 1 megabyte of data in a single operation. – Circumfluent 19/7, 2012 at 15:54

@BenVoigt I'm speaking from experience. As soon as you put a workstation load that relies paging. Automatic > 1000x slowdown. Not only that, it usually hangs the computer to where a reset is required. – Hemihydrate 19/7, 2012 at 15:54

@Mysticial: You're confusing two opposite things. Having stuff that needs to be in memory paged out, is totally different from having data stored on disk paged into cache. The pagefile manager is reponsible for both (both are "memory-mapped files"). – Unlay 19/7, 2012 at 15:58

@BenVoigt Maybe I am. Which two things? – Hemihydrate 19/7, 2012 at 15:59

@Mysticial: He means (1) Having stuff that needs to be in memory paged out, and (2) having data stored on disk cached in*. Stuff can get paged out without being stored on the disk. (I typically turn off my pagefiles too, though, but for different reasons.) – Circumfluent 19/7, 2012 at 16:0

@Mehrdad: You better run a benchmark rather than assuming the truth. I have successfully used memory-mapped files with sequential access patterns with huge speed-ups compared to C (i.e. FILE) and C++ (i.e. iostream). – Defeat 19/7, 2012 at 16:1

@Mehrdad: Huh? Where does stuff get paged out to, except a larger slower memory? That's how cache hierarchies work. – Unlay 19/7, 2012 at 16:1

@AlefSin: Yes, but notice I was recommending ReadFile in Windows, not C or C++'s standard functions. Memory-mapped files are faster but not the fastest out there. – Circumfluent 19/7, 2012 at 16:2

@BenVoigt: Er, maybe "page out" wasn't the right term? I meant that the page can get invalidated, and the system has to fetch the page from the disk again. (e.g. executables) – Circumfluent 19/7, 2012 at 16:2

@Mehrdad: I think you're confusing reading with writing. Prefetch and so forth that you're making a big deal of, doesn't apply to writes. – Unlay 19/7, 2012 at 16:3

@BenVoigt: I wasn't talking about writing, my bad. I was talking about paging in general, since that's the topic Mysticial brought up. – Circumfluent 19/7, 2012 at 16:3

@Mehrdad: That's still being paged out to disk. The file is the actual executable file, not the pagefile, and it won't incur a disk write if the page wasn't dirty, but it's the same mechanism. (Although if the page was modified, e.g. relocations / non-preferred load address, it will be dirty and written out to the pagefile) – Unlay 19/7, 2012 at 16:4

@BenVoigt: If by "page out" you mean "marked as needing to be re-read" then sure. I don't use that term for this because I feel it implies the page must be written to the disk, which it doesn't. (If the term doesn't imply that then it's probably just me.) – Circumfluent 19/7, 2012 at 16:5

I see why the mods hate us for these long comment threads. XD @Mehrdad Yes, I see where I got confused. thx – Hemihydrate 19/7, 2012 at 16:10

If you want to write fast to file streams then you could make the stream's buffer larger:

wfstream f;
const size_t nBufferSize = 16184;
wchar_t buffer[nBufferSize];
f.rdbuf()->pubsetbuf(buffer, nBufferSize);

Also, when writing lots of data to files it is sometimes faster to logically extend the file size instead of physically, because when logically extending a file the file system does not zero out the new space before writing to it. It is also smart to logically extend the file by more than you actually need, to prevent lots of file extensions. Logical file extension is supported on Windows by calling SetFileValidData, or on XFS systems by calling xfsctl with XFS_IOC_RESVSP64.
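As a related illustration of reserving space up front, here is a sketch using posix_fallocate. Note that this is a different, POSIX-level call swapped in for the example, not SetFileValidData or the XFS ioctl named above, and it does not promise the same "no zeroing" behavior on every file system (some C libraries emulate it by writing to the file). open_preallocated is a hypothetical helper name.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical helper: creates `path` and reserves `bytes` of disk space
// before any data is written. Returns an open descriptor, or -1 on failure.
int open_preallocated(const char* path, off_t bytes)
{
    assert(bytes > 0);
    const int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    // posix_fallocate returns 0 on success and an error number otherwise.
    if (::posix_fallocate(fd, 0, bytes) != 0) {
        ::close(fd);
        return -1;
    }
    return fd;
}
```

After this call the file already has its final size, so subsequent sequential writes do not have to extend it; if less data is written than reserved, the file would need to be truncated to the real length at the end.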

Soul answered 2/3, 2013 at 18:17 Comment(0)

-1

I compiled my program with gcc on GNU/Linux and with MinGW on Windows 7 and Windows XP, and it worked well.

you can use my program and to create a 80 GB file just change the line 33 to