Array size and copy performance

I'm sure this has been answered before, but I can't find a good explanation.

I'm writing a graphics program where a part of the pipeline is copying voxel data to OpenCL page-locked (pinned) memory. I found that this copy procedure is a bottleneck and made some measurements on the performance of a simple std::copy. The data is floats, and every chunk of data that I want to copy is around 64 MB in size.

This is my original code, before any attempts at benchmarking:

std::copy(data, data+numVoxels, pinnedPointer_[_index]);

Here, data is a float pointer, numVoxels is an unsigned int, and pinnedPointer_[_index] is a float pointer referencing a pinned OpenCL buffer.

Since that gave me slow performance, I decided to try copying smaller parts of the data instead and see what kind of bandwidth I got. I used boost::timer::cpu_timer for timing. I've tried both letting it run for some time and averaging over a couple of hundred runs, with similar results. Here is the relevant code along with the results:

boost::timer::cpu_timer t;                                                    
unsigned int testNum = numVoxels;                                             
while (testNum > 2) {                                                         
  t.start();                                                                  
  std::copy(data, data+testNum, pinnedPointer_[_index]);                      
  t.stop();                                                                   
  boost::timer::cpu_times result = t.elapsed();                               
  double time = (double)result.wall / 1.0e9;   // wall time in ns -> s
  int size = testNum*sizeof(float);            // bytes copied this run
  double GB = (double)size / 1073741824.0;     // bytes -> GiB (2^30)
  // Print results  
  testNum /= 2;                                                               
}

Copied 67108864 bytes in 0.032683s, 1.912315 GB/s
Copied 33554432 bytes in 0.017193s, 1.817568 GB/s
Copied 16777216 bytes in 0.008586s, 1.819749 GB/s
Copied 8388608 bytes in 0.004227s, 1.848218 GB/s
Copied 4194304 bytes in 0.001886s, 2.071705 GB/s
Copied 2097152 bytes in 0.000819s, 2.383543 GB/s
Copied 1048576 bytes in 0.000290s, 3.366923 GB/s
Copied 524288 bytes in 0.000063s, 7.776913 GB/s
Copied 262144 bytes in 0.000016s, 15.741867 GB/s
Copied 131072 bytes in 0.000008s, 15.213149 GB/s
Copied 65536 bytes in 0.000004s, 14.374742 GB/s
Copied 32768 bytes in 0.000003s, 10.209962 GB/s
Copied 16384 bytes in 0.000001s, 10.344942 GB/s
Copied 8192 bytes in 0.000001s, 6.476566 GB/s
Copied 4096 bytes in 0.000001s, 4.999603 GB/s
Copied 2048 bytes in 0.000001s, 1.592111 GB/s
Copied 1024 bytes in 0.000001s, 1.600125 GB/s
Copied 512 bytes in 0.000001s, 0.843960 GB/s
Copied 256 bytes in 0.000001s, 0.210990 GB/s
Copied 128 bytes in 0.000001s, 0.098439 GB/s
Copied 64 bytes in 0.000001s, 0.049795 GB/s
Copied 32 bytes in 0.000001s, 0.049837 GB/s
Copied 16 bytes in 0.000001s, 0.023728 GB/s

There is a clear bandwidth peak when copying chunks of 65536-262144 bytes, and the bandwidth there is much higher than when copying the full array (about 15 vs 2 GB/s).

Knowing this, I tried another approach: copying the full array, but using repeated calls to std::copy where each call handled only a part of the array. Trying different chunk sizes, these are my results:

unsigned int testNum = numVoxels;                                             
unsigned int parts = 1;                                                       
while (sizeof(float)*testNum > 256) {                                         
  t.start();                                                                  
  for (unsigned int i=0; i<parts; ++i) {                                      
    std::copy(data+i*testNum, 
              data+(i+1)*testNum, 
              pinnedPointer_[_index]+i*testNum);
  }                                                                           
  t.stop();                                                                   
  boost::timer::cpu_times result = t.elapsed();                               
  double time = (double)result.wall / 1.0e9;       // wall time in ns -> s
  int size = testNum*sizeof(float);                // bytes per part
  double GB = parts*(double)size / 1073741824.0;   // total bytes -> GiB
  // Print results
  parts *= 2;                                                                 
  testNum /= 2;                                                               
}      

Part size 67108864 bytes, copied 0.0625 GB in 0.0331298s, 1.88652 GB/s
Part size 33554432 bytes, copied 0.0625 GB in 0.0339876s, 1.83891 GB/s
Part size 16777216 bytes, copied 0.0625 GB in 0.0342558s, 1.82451 GB/s
Part size 8388608 bytes, copied 0.0625 GB in 0.0334264s, 1.86978 GB/s
Part size 4194304 bytes, copied 0.0625 GB in 0.0287896s, 2.17092 GB/s
Part size 2097152 bytes, copied 0.0625 GB in 0.0289941s, 2.15561 GB/s
Part size 1048576 bytes, copied 0.0625 GB in 0.0240215s, 2.60184 GB/s
Part size 524288 bytes, copied 0.0625 GB in 0.0184499s, 3.38756 GB/s
Part size 262144 bytes, copied 0.0625 GB in 0.0186002s, 3.36018 GB/s
Part size 131072 bytes, copied 0.0625 GB in 0.0185958s, 3.36097 GB/s
Part size 65536 bytes, copied 0.0625 GB in 0.0185735s, 3.365 GB/s
Part size 32768 bytes, copied 0.0625 GB in 0.0186523s, 3.35079 GB/s
Part size 16384 bytes, copied 0.0625 GB in 0.0187756s, 3.32879 GB/s
Part size 8192 bytes, copied 0.0625 GB in 0.0182212s, 3.43007 GB/s
Part size 4096 bytes, copied 0.0625 GB in 0.01825s, 3.42465 GB/s
Part size 2048 bytes, copied 0.0625 GB in 0.0181881s, 3.43631 GB/s
Part size 1024 bytes, copied 0.0625 GB in 0.0180842s, 3.45605 GB/s
Part size 512 bytes, copied 0.0625 GB in 0.0186669s, 3.34817 GB/s

Decreasing the chunk size does seem to have a significant effect, but I still can't get anywhere near 15 GB/s.

I'm running 64-bit Ubuntu, and the GCC optimization level doesn't make much of a difference.

  1. Why does the array size affect the bandwidth in this way?
  2. Does the OpenCL pinned memory play a part?
  3. What are the strategies for optimizing a large array copy?
Jon answered 20/5, 2013 at 20:48 Comment(8)
You could be running into your OS page fault system. It may be swapping memory in 64k chunks.Warhorse
Can you pass the array by pointer or reference instead of copying it?Warhorse
Make sure you run the tests repeatedly.Grimalkin
On my box, writing 128MB of RAM gives a throughput of about 5.5GB/s. I'm sure some faster boxes will do better than that, but unless you have a 256-bit wide bus and are doing everything right, much more is unlikely. That's without reading from the memory bus at the same time, which will slow it down. I haven't got a benchmark ready for that, but I expect around 2-3GB/s is reasonable to expect without a lot of effort (and a lot of effort will only give slightly better figures)Oys
@ThomasMatthews: If I understand what you're saying, I can't simply dereference since I need the data in pinned memory for a PCIe transfer later. That bandwidth drops from ~11 GB/s to about 2 GB/s when not using pinned memory.Jon
@KerrekSB: I did, tried both averaging over a couple of hundred runs as well as letting it run for a while. Added note to question. Similar results.Jon
@VictorSand: You will need to ask the operating system for a "locked" area of memory, one that it won't move around. You may not be able to get the total amount, but at least you won't have to worry about the RTOS paging your memory to disk in order to run another task in the system.Warhorse
@ThomasMatthews: So basically pin the memory I'm reading from in the same way as the memory I'm writing into is pinned and mapped by OpenCL? I'm thinking that would just result in the same situation again but with an extra layer of pinned memory, since my data has to change for every frame.Jon

I'm pretty sure you are running into cache thrashing. If you fill the cache with the data you've written, then the next time some data is needed, the cache has to read that data from memory, but FIRST it needs to find space in the cache. Because all the data [or at least a lot of it] is "dirty" from having been written to, it has to be written back out to RAM before its cache line can be reused. Then we write a new bit of data to the cache, which throws out another bit of dirty data (or something we read in earlier).

In assembler, we can overcome this by using a "non-temporal" move instruction, for example the SSE instruction movntps. This instruction avoids storing things in the cache.
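
As a rough sketch (my own example, not your code), the intrinsic form of such a copy could look like the following. It assumes both pointers are 16-byte aligned and the element count is a multiple of 4 floats:

#include <cstddef>      // std::size_t
#include <xmmintrin.h>  // _mm_load_ps, _mm_stream_ps, _mm_sfence

// Copy 'count' floats from src to dst using non-temporal (streaming) stores.
// Requires 16-byte aligned pointers and a count divisible by 4.
void copy_streaming(const float* src, float* dst, std::size_t count)
{
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(src + i);  // normal cached load of 4 floats
        _mm_stream_ps(dst + i, v);        // store that bypasses the cache
    }
    _mm_sfence();  // make the streaming stores visible before dst is used
}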

Edit: You can also get better performance by not mixing reads and writes: use a small buffer [a fixed-size array] of, say, 4-16KB, copy data into that buffer, then write the buffer out to the new place where you want it. Again, ideally use non-temporal writes, as that will improve the throughput even in this case; but just using "blocks" to read and then write, rather than read one, write one, will go much faster.

Something like this:

   float temp[2048];                        // small staging buffer (8 KB)
   int left_to_do = numVoxels;
   int offset = 0;

   while(left_to_do > 0)
   {
      int block = std::min(left_to_do, (int)(sizeof(temp)/sizeof(temp[0])));
      std::copy(data+offset, data+offset+block, temp);             // read a block into the buffer
      std::copy(temp, temp+block, pinnedPointer_[_index]+offset);  // write the block out to pinned memory
      offset += block;
      left_to_do -= block;
   }

Try that, and see if it improves things. It may not...

Edit2: I should explain that this is faster because you are re-using the same bit of cache to load data into every time, and by not mixing the reading and writing, we get better performance from the memory itself.
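
Putting the two ideas together, a sketch of the blocked copy using non-temporal stores might look like this (again my own untested example; it assumes the destination is 16-byte aligned and numVoxels is a multiple of 4):

#include <algorithm>
#include <cstddef>
#include <xmmintrin.h>

// Blocked copy: cached reads into a small staging buffer, streaming writes out.
// Assumes dst is 16-byte aligned and numVoxels is a multiple of 4.
void blocked_streaming_copy(const float* data, float* dst, std::size_t numVoxels)
{
    alignas(16) float temp[2048];   // 8 KB staging buffer, fits easily in L1
    const std::size_t blockMax = sizeof(temp) / sizeof(temp[0]);

    std::size_t offset = 0;
    while (offset < numVoxels) {
        std::size_t block = std::min(blockMax, numVoxels - offset);
        std::copy(data + offset, data + offset + block, temp);      // cached reads
        for (std::size_t i = 0; i < block; i += 4)
            _mm_stream_ps(dst + offset + i, _mm_load_ps(temp + i)); // uncached writes
        offset += block;
    }
    _mm_sfence();   // flush the write-combining buffers
}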

Oys answered 20/5, 2013 at 21:15 Comment(5)
Thanks, that's helpful. It's a bit over my head but I'll definitely look into it and try your suggestions.Jon
Another follow-up: Does the fact that the memory I'm copying into is page-locked by OpenCL play a role in this?Jon
@VictorSand No, it shouldn't matter. Page-locking just means that the memory can't be paged out or moved about, so unless you are very short on memory there should be no issue. Note that the 15GB/s is about the peak performance of a processor writing to cache; you can't sustain that for any length of time.Oys
That makes sense. Thanks again! :)Jon
So I tried your suggestion with a large number of buffer sizes. The results were very similar to those from copying in parts, but slightly worse; it peaked at about 0.5-1 GB/s less. Still better than the straightforward copy, though! I'm going to go ahead and accept your very helpful answer, and come back if I learn something new.Jon
