Truly asynchronous file IO in C++
I have a super fast M.2 drive. How fast is it? It doesn’t matter because I cannot utilize this speed anyway. That’s why I’m asking this question.

I have an app that needs a lot of memory, so much that it won't fit in RAM. Fortunately, it is not all needed at once; it is used to save intermediate results from computations.

Unfortunately the application is not able to write and read this data fast enough. I tried using multiple reader and writer threads, but it only made it worse (later I read that it is because of this).

So my question is: Is it possible to have truly asynchronous file IO in C++ to fully exploit those advertised gigabytes per second? If so, how (in a cross-platform way)?

You could also recommend a library that is good at tasks like this, if you know one; I believe there is no point in reinventing the wheel.

Edit:

Here is code that shows how I do file I/O in my program. It isn't taken from the mentioned program, because that wouldn't be minimal, but it illustrates the problem nevertheless. Do not mind Windows.h; it is used only to set thread affinity. In the actual program I also set affinity, so that's why I included it.

#include <fstream>
#include <thread>
#include <memory>
#include <string>

#include <Windows.h> // for SetThreadAffinityMask()

void stress_write(unsigned bytes, int num)
{
    std::ofstream out("temp" + std::to_string(num));
    for (unsigned i = 0; i < bytes; ++i)
    {
        out << char(i);
    }
}

void lock_thread(unsigned core_idx)
{
    SetThreadAffinityMask(GetCurrentThread(), 1LL << core_idx);
}

int main()
{
    std::ios_base::sync_with_stdio(false);
    lock_thread(0);

    auto worker_count = std::thread::hardware_concurrency() - 1;

    std::unique_ptr<std::thread[]> threads = std::make_unique<std::thread[]>(worker_count); // faster than std::vector

    for (unsigned i = 0; i < worker_count; ++i)
    {
        threads[i] = std::thread(
            [](unsigned idx) {
                lock_thread(idx);
                stress_write(1'000'000'000, idx);
            },
            i + 1
        );
    }
    stress_write(1'000'000'000, 0);

    for (unsigned i = 0; i < worker_count; ++i)
    {
        threads[i].join();
    }
}

As you can see, it's just plain old fstream. On my machine this uses 100% CPU, but only 7-9% disk (around 190 MB/s). I am wondering whether it could be increased.

Tuyettv answered 9/1, 2020 at 15:56 Comment(15)
Post the code you are using; perhaps we can spot a performance bug? – Gabel
Have you thought about other ways than explicitly reading and writing files? How about memory-mapping the files? – Dybbuk
It's just fstreams. I don't think there is a point in showing it, but I will add it to the answers (in a few minutes, because it has to be "minimal"). – Tuyettv
Add the code to the question instead, please. (Click on 'edit' at the bottom.) – And
"To fully exploit those advertised gigabytes per second?" For best performance you need to read or even write a GB-sized file sequentially. Write performance should be somewhat lower than read for many drives. – Pitarys
(a) It does matter how fast the drive is and how much of it you are using. It is really easy to misread performance information (bits vs bytes, random vs sequential, read vs write). (b) Many reader/writer threads making it worse may be a symptom of you actually saturating the drive. (c) A solution to your problem may involve understanding your business logic. Depending on whether you are using 1%, 10%, 50% or 80% of your bandwidth, the kinds of things you should do next to improve bandwidth use will be very different. – Thickwitted
You will get the best results from an SO question if you can produce a concrete, complete, minimal example of your bandwidth problem (one that someone else can copy/paste and reproduce!), together with benchmarks showing how close you are to saturating your bandwidth. I (and others) can give piles of advice on how to make things faster, but which thing you should do depends on details you haven't shared, so we'd be shooting in the dark. (As an example: find or extend a future with a .then method, attach said futures to a pseudo-executor, and queue up a pile of work.) – Thickwitted
And perhaps there are other solutions to your problem than using (temporary) files to store results? Are you sure you're using the best and most appropriate data structures for the use case? Are there other algorithms that don't need as much memory? And perhaps the cheapest and easiest solution is to just add more memory? Without knowing what you do, and why, we can't really help you with your original problem. – Dybbuk
The only way to come close to the listed maximum transfer rates for any drive is by using OS-specific unbuffered I/O routines. If you use the normal C++ libraries you get a lot of memory copying: possibly the C++ routine has a buffer, and the OS I/O call (which will be called by the C++ library) has a buffer (disk cache). If you know ahead of time what data you need, you can also take advantage of asynchronous I/O calls, if supported by the OS. – Tolly
@enthusiastic_3d_graphics_pr... I suggest avoiding iostreams; they are notoriously hard to use correctly, and even if you do, hand-written code that deals with memory buffers and OS primitives will typically outperform them. – Toadinthehole
fstream really and truly sucks performance-wise. Before jumping to async I/O (which is a good idea) you should first measure with stdio.h and your OS's platform-specific synchronous I/O functions. (Not saying you need to deploy platform-specific code, but writing some during performance testing can be extraordinarily valuable for determining which layer is adding inefficiencies.) – Sayles
"On my machine this uses 100% CPU, but only 7-9% disk (around 190 MB/s)." Are you testing a release build? – Pitarys
out << char(i); may be part of the problem. – Pitarys
@Pitarys Yes, I am using a release build. What do you mean about out << char(i); being the problem? It's the only I/O line in this program, so it will always be the bottleneck, won't it? – Tuyettv
Have you tried something as simple as boost.org/doc/libs/1_72_0/doc/html/boost_asio/overview/… -- or even sending data to the stream in 4k chunks? Also see #12997631. – Thickwitted

The easiest thing to get (up to) a 10x speed up is to change this:

void stress_write(unsigned bytes, int num)
{
  std::ofstream out("temp" + std::to_string(num));
  for (unsigned i = 0; i < bytes; ++i)
  {
    out << char(i);
  }
}

to this:

#include <algorithm> // for std::min

void stress_write(unsigned bytes, int num)
{
  constexpr auto chunk_size = (1u << 12u); // 4 KiB; tune as needed
  std::ofstream out("temp" + std::to_string(num));
  for (unsigned chunk = 0; chunk < (bytes+chunk_size-1)/chunk_size; ++chunk)
  {
    char chunk_buff[chunk_size];
    auto count = (std::min)( bytes - chunk_size*chunk, chunk_size );
    for (unsigned j = 0; j < count; ++j)
    {
      unsigned i = j + chunk_size*chunk;
      chunk_buff[j] = char(i); // processing
    }
    out.write( chunk_buff, count );
  }
}

where we group the output into chunks of up to 4096 bytes before handing them to the std::ofstream.

The stream insertion operators involve a number of virtual calls that compilers find hard to elide, and these dominate performance when you are writing only a handful of bytes at a time.

By chunking data into larger pieces we make the vtable lookups rare enough that they no longer dominate.

See this SO post for more details as to why.


To get the last iota of performance, you may have to use something like Boost.Asio or access your platform's raw asynchronous file I/O libraries.

But when you are using less than 10% of the drive bandwidth while pegging your CPU at 100%, aim at the low-hanging fruit first.

Thickwitted answered 10/1, 2020 at 19:28 Comment(0)

Chunking the I/O is indeed the most important optimization here and should suffice in most cases. However, the direct answer to the exact question asked, about asynchronous I/O, is the following.

Boost.Asio added support for file operations in Asio 1.21.0, which shipped with Boost 1.78. The interface is similar to the rest of Asio.

First, we need to create an object representing a file. The most common use cases are covered by either a random_access_file or a stream_file; for code like the example in the question, a stream_file is enough.

Reading is done through async_read_some, but the usual async_read helper function can be used to read a specific number of bytes at once.

If the operating system supports it, these operations do indeed run in the background and use little processor time. Both Windows and Linux (the latter via io_uring) support this.
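
For illustration, here is a minimal sketch of asynchronously writing one buffer with stream_file. It assumes Boost 1.78 or newer; on Linux, Asio's file support additionally requires building with -DBOOST_ASIO_HAS_IO_URING and linking liburing. The file name and buffer size are arbitrary placeholders.

#include <boost/asio.hpp>
#include <iostream>
#include <vector>

int main()
{
    boost::asio::io_context ctx;

    // Open (or create) a file for writing.
    boost::asio::stream_file file(
        ctx, "temp0",
        boost::asio::stream_file::write_only
            | boost::asio::stream_file::create
            | boost::asio::stream_file::truncate);

    std::vector<char> buffer(1 << 16, 'x'); // one 64 KiB chunk

    // async_write returns immediately; the completion handler runs
    // once the whole buffer has been handed off.
    boost::asio::async_write(
        file, boost::asio::buffer(buffer),
        [](boost::system::error_code ec, std::size_t n) {
            if (ec)
                std::cerr << "write failed: " << ec.message() << '\n';
            else
                std::cout << "wrote " << n << " bytes\n";
        });

    ctx.run(); // drives the asynchronous operation to completion
}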

Tuyettv answered 11/1, 2023 at 16:53 Comment(0)
V
1

Stop thinking about C++ stream I/O if you want to push up disk throughput; it has long been known to be among the slowest-performing options. Instead, try low-level C I/O, e.g. FILE* (fopen, fread, fwrite). You will notice the performance increase right away. Moreover, as others have already suggested here, use a dedicated thread for I/O and read and write in chunks, ideally with the chunk size equal to the sector size; in the case of an SSD you will have to experiment to find the best value. If that is still not sufficient, try low-level OS-specific calls, e.g. overlapped I/O or completion ports on Windows, while on Linux you would most probably end up with epoll.
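
As a rough illustration of the chunked FILE* approach, here is a minimal sketch that mirrors the stress_write function from the question; the file name and the 64 KiB chunk size are placeholders to tune against your drive.

#include <cstdio>
#include <vector>

int main()
{
    std::FILE* f = std::fopen("temp0", "wb");
    if (!f)
        return 1;

    std::vector<char> chunk(1 << 16); // 64 KiB per fwrite call
    for (unsigned long long written = 0; written < 1'000'000'000ull;
         written += chunk.size())
    {
        // Same byte "processing" as the question's stress_write.
        for (std::size_t j = 0; j < chunk.size(); ++j)
            chunk[j] = char(written + j);
        std::fwrite(chunk.data(), 1, chunk.size(), f);
    }
    std::fclose(f); // writes ~1 GB, rounded up to whole chunks
}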

Vanhoose answered 31/5, 2023 at 21:41 Comment(0)
