Loop Around File Mapping Kills Performance

I have a circular buffer backed by file-mapped memory (the buffer ranges in size from 8GB to 512GB).

I am writing to this memory (8 instances of it) sequentially from beginning to end, at which point it loops around back to the beginning.

It works fine until it reaches the end, where it needs to perform two file mappings and loop around the memory; at that point IO performance is totally trashed and doesn't recover (even after several minutes). I can't quite figure it out.

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>

using namespace boost::interprocess;

// page_size(), page_ceil() and page_floor() are helper functions (not shown)
// returning the system page size and a value rounded up/down to a page boundary.

class mapping
{
public:

  mapping()
  {
  }

  mapping(file_mapping& file, mode_t mode, std::size_t file_size, std::size_t offset, std::size_t size)
    : offset_(offset)
    , mode_(mode)
  {     
    const auto aligned_size         = page_ceil(size + page_size());
    const auto aligned_file_size    = page_floor(file_size);
    const auto aligned_file_offset  = page_floor(offset % aligned_file_size);
    const auto region1_size         = std::min(aligned_size, aligned_file_size - aligned_file_offset);
    const auto region2_size         = aligned_size - region1_size;

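    // If the requested window crosses the end of the file, reserve one
    // contiguous address range and map the file's tail and head back to back
    // into it, so callers see the wrap-around as plain contiguous memory.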
    if (region2_size)
    {
      const auto region1_address  = mapped_region(file, read_only, 0, (region1_size + region2_size) * 2).get_address(); 
      const auto region2_address  = reinterpret_cast<char*>(region1_address) + region1_size;  

      region1_ = mapped_region(file, mode, aligned_file_offset, region1_size, region1_address);
      region2_ = mapped_region(file, mode, 0,                   region2_size, region2_address);
    }
    else
    {
      region1_ = mapped_region(file, mode, aligned_file_offset, region1_size);
      region2_ = mapped_region();
    }

    size_ = region1_.get_size() + region2_.get_size();
    offset_ = aligned_file_offset;
  }

  auto offset() const   -> std::size_t  { return offset_; }
  auto size() const     -> std::size_t  { return size_; }
  auto data() const     -> const void*  { return region1_.get_address(); }
  auto data()           -> void*        { return region1_.get_address(); }
  auto flush(bool async = true) -> void
  {
    // mapped_region::flush takes (mapping_offset, numbytes, async); numbytes
    // == 0 means "flush to the end of the region". Passing the bool directly,
    // as flush(async), would silently treat it as the offset argument.
    region1_.flush(0, 0, async);
    region2_.flush(0, 0, async);
  }
  auto mode() const -> mode_t { return mode_; }

private:
  std::size_t   offset_ = 0;
  std::size_t   size_ = 0;
  mode_t        mode_;
  mapped_region region1_;
  mapped_region region2_;
};

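// impl is the private implementation of a loop_mapping class (not shown here).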
struct loop_mapping::impl final
{     
  std::tr2::sys::path         file_path_;
  file_mapping                file_mapping_;    
  std::size_t                 file_size_;
  std::size_t                 map_size_     = page_floor(256000000ULL);

  std::shared_ptr<mapping>    mapping_ = std::shared_ptr<mapping>(new mapping());
  std::shared_ptr<mapping>    prev_mapping_;

  bool                        write_;

public:
  impl(std::tr2::sys::path path, bool write)
    : file_path_(std::move(path))
    , file_mapping_(file_path_.string().c_str(), write ? read_write : read_only)
    , file_size_(page_floor(std::tr2::sys::file_size(file_path_)))
    , write_(write)
  {     
    REQUIRE(file_size_ >= map_size_ * 3);
  }

  ~impl()
  {
    prev_mapping_.reset();
    mapping_.reset();
  }

  auto data(std::size_t offset, std::size_t size, boost::optional<bool> write_opt) -> void*
  { 
    offset = offset % page_floor(file_size_);

    REQUIRE(size < file_size_ - map_size_ * 3);

    const auto write = write_opt.get_value_or(write_);

    REQUIRE(!write || write_);          

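    // Remap when the access mode needs upgrading from read-only to read-write,
    // or when the requested window falls outside the current mapping.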
    if ((write && mapping_->mode() == read_only) || offset < mapping_->offset() || offset + size >= mapping_->offset() + mapping_->size())
    {
      auto new_mapping = std::make_shared<mapping>(file_mapping_, write ? read_write : read_only, file_size_, page_floor(offset), std::max(size + page_size(), map_size_));

      if (mapping_)
        mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));

      if (prev_mapping_)
        prev_mapping_->flush(false);

      prev_mapping_ = std::move(mapping_);
      mapping_    = std::move(new_mapping);
    }

    return reinterpret_cast<char*>(mapping_->data()) + offset - mapping_->offset();
  }
};


// 8 processes writing to 8 different files, 128GB each.
loop_mapping loop(...);
for (auto n = 0ULL; true; ++n) // unsigned 64-bit: an int would overflow long before 128GB
{
    auto src = get_new_data(5000000/8);
    auto dst = loop.data(n * 5000000/8, 5000000/8, true);
    std::memcpy(dst, src, 5000000/8); // This becomes very slow after the loop-around.
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

Any ideas?

Target System:

  • 1x 3TB Seagate Constellation ES.3
  • 2x Xeon E5-2400 (6-core, 2.6 GHz)
  • 6x 8GB DDR3 1600 MHz ECC
  • Windows Server 2012
Fabio answered 2/9, 2014 at 19:59 Comment(24)
Could you add more explanation to the code you posted? Is the slow section within the posted block of code, or is the provided code the slow part itself?Winegrower
None of the code is slow per se. It is when I try to write to the mapped memory it gets slow. I'll add a simple example.Fabio
You might need to pre-allocate the disk space for the file, either by writing at least one byte at the end or using SetFileValidData (requires admin privilege).Stirring
What do you mean by "pre-allocate"? The file is created with its final size before it is memory mapped. Not sure why that would make a difference? I have already written to the entire file before/when the problem starts occurring.Fabio
It sounds like you're thrashing on swapping into your process address space. Just because the file is mapped doesn't mean it's committed to physical RAM; it just means it has a mapped logical address. At 244.14 MB per "item" (256000000 bytes) I could easily see that happening. And it would be compounded if the target of the read is also on pages that have to likewise be swapped into physical storage. Have you done a process eval to see how many page faults (misses triggering reads from physical storage into your address space) are being generated by this?Visby
It triggers a lot of page faults *after* doing the loop around. I don't quite see how your explanation accounts for the problem only occurring when starting over at the beginning? Note that I have tried this with 8x128GB files and the problem always occurs when looping around but otherwise works fine. Interestingly enough, I don't get the problem when only running 2x8GB files (the computer has 24 GB of RAM).Fabio
It doesn't explain it entirely, but it does promote a substantial performance hit. There should be a series of faults as pages need to be committed to physical RAM for access. The disk (hopefully a contiguous sector list) backing that mapped address space is being hit sequentially during the first pass. But nearly all of it will have to be thrown out when you "rewind" to access the logical memory at the beginning of the buffer again. As you continue, prior pages from the opposite end of the file will need to be committed to disk before unloading, effectively introducing a butterfly.Visby
Ok. And how would I avoid that? Also what do you mean by "a butterfly"?Fabio
Btw, what kind of rig is this on (memory, proc, disks, OS, etc) ?Visby
I don't think you're doing yourself any favours by ignoring exceptions and trying again. You should at least log the errors somewhere. Your code that attempts to map the file twice, back to back, looks suspect. You're temporarily trying to map the file into a region that's twice the size of the file. So you're not actually pre-allocating the file in this case.Nuncia
It's a single enterprise level 3TB disk with 24GB memory and 2x6 core xeon running windows server 2012.Fabio
@RossRidge: The exception never happens. It's just a fail safe in case it doesn't find a contiguous block of memory. I will remove it.Fabio
@Fabio butterfly, as in when you start adding content back at the beginning of the queue eventually the dirty pages at the end of the queue need to be committed (your current activity needs the physical RAM pages), but those dirty pages are at the opposite end of a very large file. That has a very real potential of reducing your page load times to the seek+write time of your disk, as each time you need another page, the oldest page to commit and make available is for data on the other side of the galaxy. What happens when you emulate this with a much smaller file (as in 1/10th size)?Visby
@WhozCraig: Smaller files do not suffer from the issue. I will try to do a forced synchronous flush of the memory mapped file before looping around and see if that helps.Fabio
I've added a more complete code sample.Fabio
Thanks. Interesting question, btw. While updating your question info include the stats (mem, disk, mach, etc) you mentioned in-comment and the OS you're using. There are a lot of very savvy people on this board, and the more info like that they have to work with the better.Visby
I have added system information.Fabio
Honestly, it just sounds like your program requires a lot of I/O, so as soon as it can't make further progress without doing I/O, it runs at I/O speed.Traprock
@DavidSchwartz: Well it should be able to run smoothly since it works just fine UNTIL it loops around, i.e. if I use a file size of 1TB it works fine for several hours. And I am not doing that much IO, in my test bed I am writing at a speed of 50MB/s (which is basically half of what the disk can handle) and flushing to disk in chunks of 256MB. I think it should be possible to start writing at the beginning in a way that doesn't significantly impact performance relative to sequential write performance.Fabio
File mappings sound like the wrong solution. You seem to need control over IO. Use manual synchronous IO.Kerseymere
@usr: That won't work since I need interprocess communication (not part of the example). Which is also why I need to be able to write to contiguous memory. That part is too complicated to add here; let's just assume that I need to use memory-mapped IO. And why would manual synchronous IO work any differently?Fabio
Maybe a better and more expensive SATA/IO controller would handle this better?Fabio
Comments are not for extended discussion; this conversation has been moved to chat.Challenge
@Fabio this IO perf degradation sounded to me like the kernel flushes modified pages inefficiently. I have seen that. You expect sequential IO yet you (partially) get random IO at up to 100x perf loss. That will never happen if you do it manually. AFAIK the OS synchronizes file buffers. Maybe you can synchronize with other processes in some other way and transfer the data using file IO, or using an in-memory shared section.Kerseymere

8 buffers of 8 to 512GiB each on a system with 48GiB of physical memory mean that your mappings will have to be swapped. No surprise there.
The issue, as you have already remarked yourself, is that prior to being able to write to a page, you encounter a fault and the page is read in. That doesn't happen on the first run, since a fresh zero page is simply supplied. To make matters worse, reading pages back in competes with the write-behind of dirty pages.

Now, unfortunately, there is no way of telling Windows "I'm going to overwrite this anyway", nor is there any way of making the disk load your stuff faster. However, you can start the transfer earlier (say, when you're 3/4 of the way through the buffer).

Windows Server 2012 (which you're using) supports PrefetchVirtualMemory, which is a somewhat half-assed substitute for POSIX madvise(MADV_WILLNEED).

That is, of course, not exactly what you want to do when you already know that you will overwrite the complete memory page (or several of them) anyway, but it is as good as you can get. It's worth a try in any case.
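
A minimal sketch of how PrefetchVirtualMemory could be used here, assuming the writer knows its current position within the mapped view (the function and parameter names are hypothetical, not part of the original code):

#include <windows.h>   // PrefetchVirtualMemory: Windows 8 / Server 2012 and later
#include <algorithm>
#include <cstddef>

// Ask the OS to start reading the next chunk of the mapped view into RAM
// while the current chunk is still being written.
void prefetch_next_chunk(char* view, std::size_t view_size,
                         std::size_t current_offset, std::size_t chunk_size)
{
    const std::size_t next_offset = (current_offset + chunk_size) % view_size;

    WIN32_MEMORY_RANGE_ENTRY range;
    range.VirtualAddress = view + next_offset;
    range.NumberOfBytes  = std::min(chunk_size, view_size - next_offset);

    // Purely advisory: a failed prefetch is harmless, so the result is ignored.
    PrefetchVirtualMemory(GetCurrentProcess(), 1, &range, 0);
}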

Ideally, you would want to do something like a destructive madvise(MADV_DONTNEED) as implemented e.g. under Linux (and I believe FreeBSD, too) immediately before you overwrite a page, but I am not aware of any way of doing this under Windows (...short of destroying the view and the mapping and mapping from scratch, but then you throw away all data, so that's a bit useless).
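
For comparison, the Linux idiom referred to above (shown only to illustrate the semantics; there is no direct Windows equivalent):

#include <sys/mman.h>
#include <cstddef>

// Linux/FreeBSD sketch: hint that [addr, addr + len) need not be preserved.
// For anonymous/private mappings the contents are discarded outright; for
// shared file mappings the pages are merely dropped from RAM and re-read on
// next access. addr must be page-aligned.
void discard_pages(void* addr, std::size_t len)
{
    madvise(addr, len, MADV_DONTNEED);
}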

Even with prefetching early you will still be limited by disk I/O bandwidth, but at least you can hide the latency.

Another "obvious" (but probably not that easy) solution would be to make the consumer faster. That would allow for a smaller buffer to begin with, and even on a huge buffer it would keep the working set smaller (both producer and consumer force pages into RAM while accessing them, so if the consumer accesses data with less delay after the producer has written them, they will both be using mostly the same set of pages.) Smaller working sets fit into RAM more easily.
But I realize that you probably didn't choose a several-gigabyte buffer for no reason.

Armful answered 27/7, 2015 at 13:17 Comment(5)
Note that VirtualAlloc does allow you to discard memory-mapped pages, but only if they are backed by the page file, which I'm guessing isn't possible for such a large mapping.Stirring
@HarryJohnston: True, that would be the MEM_RESET flag. Unfortunately, unlike mmap, you cannot use VirtualAlloc on just any address. It must be the base address of the block (not just the address of some page). So you would throw everything away, which is almost certainly not what is desired. Or one would have to do thousands of little allocations...Armful
I don't believe that's true. Using MEM_RESET on a non-base address doesn't return an error. I can't think of any straightforward way to tell whether it actually worked or not, but it claimed that it had succeeded. Similarly, you can commit just part of an existing reservation, and that definitely works.Stirring
OK, I've now been able to confirm that MEM_RESET on a non-base address works as expected. The pages that were reset, and only those pages, lost their contents once the system memory was stressed.Stirring
@HarryJohnston: That's surprising, but it's good news! You should post that as alternative answer, since if this really works, it's exactly what the OP wants.Armful

Since your code is devoid of any comments, filled with auto variables, not compilable as-is, and I don't have 512GB available on my PC to test it anyway, this will remain a passing thought off the top of my head.

Each of your processes only writes a few hundred KB/s, so there should be ample time to flush that to disk in the background.

However, it seems you are asking the boost mapping system to flush the previous chunk either synchronously or asynchronously, depending on your mysterious offset computation:

mapping_->flush((new_mapping->offset() % file_size_) < (mapping_->offset() % file_size_));

I guess the rollover triggers a synchronous flush, which is a likely culprit for the sudden slowdown.

What the operating system does at this point depends on the boost implementation, which is not described (or at least not in a way obvious enough for me to get after a cursory look at the documentation). If boost stuffed your 48GB of memory with unflushed pages, you could certainly experience a sudden and prolonged deceleration.

At the very least, this mysterious line is worth a comment in your code, in case it does something clever and completely different that I missed entirely.
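
For reference, mapped_region::flush in boost::interprocess takes (mapping_offset, numbytes, async), where numbytes == 0 means "to the end of the region". A small illustration of an explicitly synchronous versus asynchronous flush (the helper name is a placeholder):

#include <boost/interprocess/mapped_region.hpp>

// Flush an entire region: async == false blocks until the dirty pages have
// been written to the backing file; async == true merely schedules the write.
void flush_whole_region(boost::interprocess::mapped_region& region, bool async)
{
    region.flush(0, 0, async);
}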

Systematics answered 1/11, 2014 at 3:23 Comment(0)

If you are able to back the memory mapping with the page file rather than a specific file, you can use the MEM_RESET flag with VirtualAlloc to prevent Windows from paging in the old contents.

The main issue I would anticipate in using this approach is that you can't easily recover the disk space when you are done. It may also require the system's page file settings to be changed; I believe it will work with the default settings, but not if a maximum page file size has been set.
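
A minimal sketch of that approach, assuming a pagefile-backed section (created by passing INVALID_HANDLE_VALUE instead of a file handle) and a page-aligned span that is reset just before being overwritten; sizes and error handling are illustrative only:

#include <windows.h>
#include <cstddef>

// Create a section backed by the system paging file instead of a disk file,
// and map a view of it. (The section handle is leaked here for brevity.)
void* map_pagefile_backed(std::size_t size)
{
    HANDLE section = CreateFileMappingW(INVALID_HANDLE_VALUE, nullptr,
                                        PAGE_READWRITE,
                                        static_cast<DWORD>(size >> 32),
                                        static_cast<DWORD>(size & 0xFFFFFFFFu),
                                        nullptr);
    if (!section)
        return nullptr;
    return MapViewOfFile(section, FILE_MAP_ALL_ACCESS, 0, 0, size);
}

// Just before overwriting [addr, addr + len), tell the memory manager that
// the old contents are garbage so they are never paged back in.
void reset_before_overwrite(void* addr, std::size_t len)
{
    VirtualAlloc(addr, len, MEM_RESET, PAGE_READWRITE);
}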

Stirring answered 28/7, 2015 at 20:18 Comment(2)
Not sure how to "back the memory mapping with the page file rather than a specific file" but I will look into it. Seems like a possible solution.Fabio
Pass NULL instead of a file handle.Stirring

I am going to assume that by "loop around" you mean that the RAM got full. Until the RAM is full, all you have to do is allocate a page and write to it (RAM speed); once the RAM is full, every page allocation turns into two actions:

1. writing a dirty page back to disk (disk speed), and
2. allocating a page (RAM speed).

In the worst case you also have to bring the page in from the file on disk (disk speed) if you are reading something from it. So instead of working at RAM speed only (page allocation), every page allocation runs at disk speed. This doesn't happen with 2x8GB because the files are small enough for all of the memory of both files to remain in RAM.

Sorus answered 15/10, 2014 at 8:48 Comment(0)

The problem, it turns out, is that when you overwrite a valid page in memory, the page first has to be read from the drive before being overwritten. As far as I know, there is no way to get around this when using memory-mapped files.

The reason it doesn't happen during the first pass is that the pages being overwritten are not "valid" and thus they do not need to be read back.

Fabio answered 27/7, 2015 at 12:42 Comment(0)
