Improving/optimizing file write speed in C++

I've been running into some issues with writing to a file - namely, not being able to write fast enough.

To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.

The raw data is coming in at a rate of 10 MS/s, and it's then saved to a buffer and subsequently written to a file.

Below is the relevant section of code:

    std::string path = "Stream/raw.dat";
    ofstream outFile(path, ios::out | ios::app | ios::binary);

    if (outFile.is_open())
        cout << "Yes" << endl;

    while (1)
    {
        rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
        switch (metaData.error_code)
        {
            // Irrelevant error checking...

            // Write data to a file
            std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
        }
    }

The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports that its buffer has overflowed. Some quick profiling of the code shows that nearly all of the execution time is spent on std::copy(...) (99.96% of the time, to be exact). If I remove this line, I can run the program for hours without encountering any overflow.

That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.

If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.

Etienne answered 5/8, 2015 at 17:8 Comment(9)
This is about a USRP, isn't it?Significative
Interesting... a pure C pointer-style approach might do you better. If you know the structure of your operating system, you might be able to access the memory faster.Assuming
Yep...I'm using a USRP N210.Etienne
Does std::copy copy element-wise? This is a common mistake when doing IO. Super slow.Henandchickens
added the USRP and the software-defined-radio tags, since they apply here. Not getting the overall system performance needed for real-time processing is a very common problem.Significative
@A.Abramov UHD, the device interface Etienne uses, is C++ (it has a brand new C wrapper, but that's not as practical as the original/underlying C++, and also not faster).Significative
Writing to disk is slow. Don't expect to be able to write more than 50 MB/s. There is not much you can do to improve this situation; consider saving the data to a RAM disk (e.g. a tmpfs) or buying a faster mass storage device (e.g. an SSD).Biestings
@FUZxxl yes, but please also be aware that not every SSD is up to these write rates -- you see, you need the write rate the USRP enforces as an absolute minimum rate for short term averages, not as an "over the whole disk" average. So often, not even SSDs are up to the task. There's actually been a lot of discussion about how to make things work for > 100MS/s .Significative
you probably will want to compress the data a bit so you are moving fewer bytes overall.Dissonant

So the main problem here is that you try to write in the same thread as you receive, which means that your recv() can only be called again after the copy is complete. A few observations:

  • Move the writing to a different thread (see the sketch after this list). This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
  • Your output iterator is probably not the most performant solution. Simply write()ing to a file descriptor might be better, but that's a performance measurement you'll have to make yourself.
  • If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even after decoupling the receiving from the writing thread-wise, then there's nothing you can do -- get a faster system.
  • Try writing to a RAM disk instead.
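
A minimal sketch of that thread decoupling -- not UHD- or GNU-Radio-specific; the queue, buffer type, and file name are assumptions, and a real implementation would bound the queue and recycle buffers rather than allocate in the hot path:

    // Hypothetical producer/consumer hand-off: recv() never waits on the disk.
    #include <complex>
    #include <condition_variable>
    #include <deque>
    #include <fstream>
    #include <mutex>
    #include <vector>

    std::deque<std::vector<std::complex<float>>> pending;  // filled by the receive thread
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void writerThread()
    {
        std::ofstream out("Stream/raw.dat", std::ios::binary);
        for (;;)
        {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !pending.empty() || done; });
            if (pending.empty() && done)
                break;
            std::vector<std::complex<float>> chunk = std::move(pending.front());
            pending.pop_front();
            lock.unlock();  // don't hold the lock during the slow disk write
            out.write(reinterpret_cast<const char*>(chunk.data()),
                      static_cast<std::streamsize>(chunk.size() * sizeof(chunk[0])));
        }
    }

    // In the receive loop (the other thread):
    //     { std::lock_guard<std::mutex> lock(m); pending.push_back(rxBuffer); }
    //     cv.notify_one();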

In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.
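
For reference, a sketch of that kind of bulk write, assuming rxBuffer is a contiguous container of std::complex<float> such as std::vector<std::complex<float>> (which the recv() call suggests):

    // Sketch: one bulk write of the raw bytes instead of per-element
    // formatted output. Assumes rxBuffer is contiguous (e.g. std::vector).
    outFile.write(reinterpret_cast<const char*>(rxBuffer.data()),
                  static_cast<std::streamsize>(rxBuffer.size() * sizeof(rxBuffer[0])));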

Galata answered 5/8, 2015 at 17:13 Comment(10)
Agreed, adding more: write the incoming data to one or more huge buffers (depending on lag time between receiving data and writing to file). Create a thread that reads from this buffer and writes to the file (in huge blocks). Also, use as much hardware assistance as possible, such as DMA.Cahan
@ThomasMatthews Agreeing in general, larger chunks of data = better performance, but this also has downsides: if the chunks get too large, the OS can be busy for overly long processing them, and on systems where CPU cores are sparse, the time the OS spends busy with file I/O can get critical if it can't keep up with getting data through the network simultaneously. Linux scales pretty well on multiple cores, so this is really just a problem on single-core CPUs.Significative
@ThomasMatthews I switched to write like you suggested, and it's a tremendous improvement - it has yet to overflow. I've also increased the size of my buffer.Etienne
@Etienne good to hear that! As soon as your application does more than just writing to a file (which you could also just do with the mentioned example program), I still recommend going multi-threaded.Significative
@ThomasMatthews I'm currently working on a multi-threaded application. In practice, I wouldn't write the file in this function; for the time being, it's just a proof of concept. I recently finished a multi-threaded buffer for passing the samples along to different portions of the program, such as demodulation and whatnot.Etienne
The two thread thing is not necessary because both input and output are buffered by the OS. One thread is enough to drive both devices to the max (until bottlenecked).Henandchickens
@usr: in a perfect world, that might be true, but having a pool of buffers exchanging data between receiving and writing hardware is absolutely necessary here, as many examples of USRP usage showed.Significative
@usr: to explain a little better: If buffering was arbitrarily flexible and throughput would be the only measure important here, your comment would be true. However, the problem here is that neither assumption is true: latency matters, because after a while of buffering away packets coming in over gigabit ethernet (or USB3, or 10GigE, or PCIe, to name the interfaces of the current non-embedded USRPs), the buffers are just full, and the OS and hardware are forced to drop data. That's a catastrophe! So, no, you cannot rely on your OS to "guess" what architecture your application needs.Significative
@MarcusMüllerꕺꕺ if we assume that the streaming destination can handle all data then even a small amount of buffering should be enough. Just enough so that reading and writing is overlapped. That being said if this is UDP (which it does not look like; looks like TCP) then custom buffering can make sense to increase the probability that it will work. On the other hand simply increasing socket buffer size also should be enough.Henandchickens
@usr: real-world experience shows that this assumption is wrong, even with striped RAIDs of SSDs for high rates. Don't get me wrong, if you're doing data processing in general, your assumptions are fine, but this is the real-time world, where PCs just don't work with "average rates", but with short-term latencies that can easily get too large. Btw, UHD is UDP (TCP doesn't make sense; a packet a few ms too late is simply not useful anymore). OP already increased buffer sizes. Really, your assumptions do not meet reality here.Significative

Let's do a bit of math.

Your samples are (apparently) of type std::complex<float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100-150 megabytes per second or so).

Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).

This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
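
To see the effect on a single sample, here's a throwaway check (not part of the original code; the sample value is arbitrary):

    // Compare the size of one sample formatted as text (what
    // std::ostream_iterator produces via operator<<) against its binary size.
    #include <complex>
    #include <iostream>
    #include <sstream>

    int main()
    {
        std::complex<float> s(-0.12345678f, 0.87654321f);
        std::ostringstream text;
        text << s;  // formatted as "(re,im)"
        std::cout << "text bytes:   " << text.str().size() << '\n'   // ~20
                  << "binary bytes: " << sizeof(s) << '\n';          // 8
    }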

The obvious starting point for an improvement would be to write the data in binary format instead:

uint32_t nItems = std::end(rxBuffer)-std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));

For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).

I'd also note that as it stands right now, your code has an even more serious problem: since it hasn't specified a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output will add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.

Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.

The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
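
A hedged sketch of what that looks like on Windows; the function name is made up, and with FILE_FLAG_NO_BUFFERING the buffer address, write size, and file offset must all be multiples of the volume's sector size (query it with GetDiskFreeSpace in real code):

    // Windows-only sketch: unbuffered writes that bypass the OS file cache.
    // The caller must supply a sector-aligned buffer and a sector-multiple size.
    #include <windows.h>

    bool writeUnbuffered(const char* path, const void* alignedBuffer, DWORD alignedBytes)
    {
        HANDLE hFile = CreateFileA(path,
                                   GENERIC_WRITE,
                                   0,                       // no sharing
                                   NULL,                    // default security
                                   CREATE_ALWAYS,
                                   FILE_FLAG_NO_BUFFERING,  // bypass the cache
                                   NULL);
        if (hFile == INVALID_HANDLE_VALUE)
            return false;

        DWORD bytesWritten = 0;
        BOOL ok = WriteFile(hFile, alignedBuffer, alignedBytes, &bytesWritten, NULL);
        CloseHandle(hFile);
        return ok && bytesWritten == alignedBytes;
    }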

In your case, that's probably a win--at 10 MS/s, you're going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you time to copy the data into the cache and then, somewhat later, copy it back out to the disk. Worse, it's likely to pollute the cache with all this data, so it's no longer storing other data that's a lot more likely to benefit from caching.

Trinitrotoluene answered 5/8, 2015 at 17:59 Comment(4)
Don't bypass buffering. Do your IO asynchronously or at least on a separate thread. Wasting CPU is bad, but whatever you do, keep the OS buffer populated so it can keep the drive efficient.Valerle
@EliotGillum: it's comments like this that convince SO's best contributors that they might as well quit and raise flowers instead of trying to help others. You have a lot of company, but you are personally responsible for making the world a worse place. Reread the last paragraph of the answer. Continue rereading it as often as necessary to realize that your comment is thoroughly and utterly wrong.Trinitrotoluene
cruelty never makes a community better, no matter how much rep you haveValerle
I don't think it's cruelty. I think it's an observation of fact. "...my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file." That shows no indication that he will benefit from polluting the cache with the content of the file he's receiving.Trinitrotoluene
