Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()

I'm working on some SocketChannel-to-SocketChannel code which will do best with a direct byte buffer--long lived and large (tens to hundreds of megabytes per connection.) While hashing out the exact loop structure with FileChannels, I ran some micro-benchmarks on ByteBuffer.allocate() vs. ByteBuffer.allocateDirect() performance.

There was a surprise in the results that I can't really explain. In the graph below, there is a very pronounced cliff at the 256KB and 512KB marks for the ByteBuffer.allocate() transfer implementation--the performance drops by ~50%! There also seems to be a smaller performance cliff for ByteBuffer.allocateDirect(). (The %-gain series helps to visualize these changes.)

[Graph: Buffer Size (bytes) versus Time (ms)--"The Pony Gap"]

Why the odd performance curve differential between ByteBuffer.allocate() and ByteBuffer.allocateDirect()? What exactly is going on behind the curtain?

It may very well be hardware and OS dependent, so here are those details:

  • MacBook Pro w/ Dual-core Core 2 CPU
  • Intel X25M SSD drive
  • OSX 10.6.4

Source code, by request:

package ch.dietpizza.bench;

import static java.lang.String.format;
import static java.lang.System.out;
import static java.nio.ByteBuffer.*;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.UnknownHostException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class SocketChannelByteBufferExample {
    private static WritableByteChannel target;
    private static ReadableByteChannel source;
    private static ByteBuffer          buffer;

    public static void main(String[] args) throws IOException, InterruptedException {
        long timeDirect;
        long normal;
        out.println("start");

        for (int i = 512; i <= 1024 * 1024 * 64; i *= 2) {
            buffer = allocateDirect(i);
            timeDirect = copyShortest();

            buffer = allocate(i);
            normal = copyShortest();

            out.println(format("%d, %d, %d", i, normal, timeDirect));
        }

        out.println("stop");
    }

    private static long copyShortest() throws IOException, InterruptedException {
        int result = 0;
        for (int i = 0; i < 100; i++) {
            int single = copyOnce();
            result = (i == 0) ? single : Math.min(result, single);
        }
        return result;
    }


    private static int copyOnce() throws IOException, InterruptedException {
        initialize();

        long start = System.currentTimeMillis();

        while (source.read(buffer) != -1) {
            buffer.flip();  
            target.write(buffer);
            buffer.clear();  //pos = 0, limit = capacity
        }

        long time = System.currentTimeMillis() - start;

        rest();

        return (int)time;
    }   


    private static void initialize() throws UnknownHostException, IOException {
        InputStream  is = new FileInputStream(new File("/Users/stu/temp/robyn.in"));//315 MB file
        OutputStream os = new FileOutputStream(new File("/dev/null"));

        target = Channels.newChannel(os);
        source = Channels.newChannel(is);
    }

    private static void rest() throws InterruptedException {
        System.gc();
        Thread.sleep(200);      
    }
}
Photocopy answered 6/9, 2010 at 13:15 Comment(10)
Have you got the code hosted somewhere? I'd be interested to see if I can recreate your results.Lotz
@gid: Source code added. Looking forward to your results.Photocopy
sorry about the delay, have tested on windows 7 x64 & java 1.6.20 and the results are nearly the same. Only difference is that the drop off happens at 256k rather than 512k.Lotz
My machine: Ubuntu 10.10 32-bit, OpenJDK 1.6.0_20. I have tested it too; on my machine the drop-off happens at 1024k for normal and at 2048k for direct. I suppose the effect may be caused by something on the OS/CPU boundary (CPU cache).Canoe
@bartosz.r: What exact model is your CPU? I can run some tests too.Photocopy
@bartosz.r: Also, I would assume the cliffs for each buffer would be at the same place, but they seem to be at two different points.Photocopy
It is an Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz. Maybe they are in two places because the indirect version needs to copy the data twice (using an intermediate buffer). Well, I do not know, but this is really interesting. It's so hardware dependent--the whole cache-line-alignment, cache-size thing, or even implicit thread synchronization by the CPU cache.Canoe
You should really try to use the Java Microbenchmark Harness (jmh) to test this. This benchmark still looks flawed. For example, you're measuring the time to read from the source and write to the target, but there is no baseline to test how long that takes by itself; there is no warmup period for the benchmark, so you're also measuring JIT compilation times; and the allocate and allocateDirect cases are not properly separated (class loading + JIT might actually revert some speculative inlining). Not sure any of that affects your results, but the JIT might be one reason for the bumps.Root
@Root jmh didn't exist in 2010. (Neat project, but I'm not going to run this again 11 years later.) The warmup is in the code--JIT is not a factor. But all that is irrelevant; the above discussion is not about raw numbers, but about why there is a strange curve between two different IO models.Photocopy
@StuThompson It's just taking a minimum runtime while warming up and running in the same loop, and that makes JIT a factor, because the JIT might also deoptimize code. And yes, all this is relevant, because the problem with flawed benchmarks is that a strange curve in the numbers might appear because the benchmark itself is flawed, which is what I was trying to express. I did not look at the date when this was posted, but my comment is still relevant for those who find this today, 11 years after your question.Root

How ByteBuffer works, and why direct (byte) buffers are the only ones truly useful now.

First, I am a bit surprised this is not common knowledge, but bear with me.

Direct byte buffers allocate memory at an address outside the Java heap.

This is of utmost importance: all OS (and native C) functions can use that address without locking the object on the heap and copying the data. A short example on copying: in order to send any data via Socket.getOutputStream().write(byte[]), the native code has to "lock" the byte[], copy it outside the Java heap, and then call the OS function, e.g. send. The copy is performed either on the stack (for smaller byte[]) or via malloc/free for larger ones. DatagramSockets are no different and they also copy--except they are limited to 64KB and allocated on the stack, which can even kill the process if the thread stack is not large enough or the call is deep in recursion. Note: locking prevents the JVM/GC from moving/reallocating the object around the heap.

So with the introduction of NIO, the idea was to avoid the copy and the multitude of stream pipelining/indirection. Often there are 3-4 buffered types of streams before the data reaches its destination. (Yay, Poland equalizes(!) with a beautiful shot.) By introducing direct buffers, Java can communicate straight with native C code without any locking/copying necessary. Hence the send function can take the address of the buffer plus the position, and the performance is much the same as native C. That's about the direct buffer.
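To make that concrete, here is a minimal sketch of the direct-buffer path (the host, port, and class name are hypothetical, and error handling is omitted): the native write can work straight from the buffer's off-heap address, with no lock-and-copy of a byte[].

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class DirectWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; replace with a real host/port.
        SocketChannel channel = SocketChannel.open(new InetSocketAddress("localhost", 9000));

        // Lives outside the Java heap, so native code can use its address directly.
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
        buffer.put("hello".getBytes());
        buffer.flip();

        while (buffer.hasRemaining()) {
            channel.write(buffer);  // native write uses the buffer's address + position
        }
        channel.close();
    }
}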

The main issue with direct buffers: they are expensive to allocate, expensive to deallocate, and quite cumbersome to use--nothing like byte[].

Non-direct buffers do not offer the true essence that direct buffers do--i.e. a direct bridge to the native/OS code. Instead they are lightweight and share exactly the same API; even more, they can wrap a byte[], and their backing array is available for direct manipulation--what's not to love? Well, they have to be copied!
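For example, a trivial sketch of that wrapping API (nothing specific to this benchmark):

import java.nio.ByteBuffer;

public class WrapSketch {
    public static void main(String[] args) {
        byte[] raw = new byte[1024];
        ByteBuffer heapBuffer = ByteBuffer.wrap(raw);  // non-direct: backed by the byte[]
        heapBuffer.putInt(42);                         // writes straight into raw[0..3]
        System.out.println(heapBuffer.array() == raw); // true: the very same array
    }
}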

So how does Sun/Oracle handle non-direct buffers, since the OS/native code can't use them? Well, naively. When a non-direct buffer is used, a direct counterpart has to be created. The implementation is smart enough to use a ThreadLocal and cache a few direct buffers via SoftReference* to avoid the hefty cost of creation. The naive part comes when copying them: it attempts to copy the entire buffer (remaining()) each time.

Now imagine: a 512 KB non-direct buffer going to a 64 KB socket buffer; the socket buffer won't take more than its size. So the 1st time 512 KB will be copied from the non-direct buffer to the thread-local direct one, but only 64 KB of it will be used. The next time 512-64 KB will be copied but only 64 KB used, the third time 512-64*2 KB will be copied but only 64 KB used, and so on... and that's optimistic in assuming the socket buffer will always be entirely empty. So in total you are copying not n KB, but n × n ÷ m KB (n = 512, m = 16, the average space the socket buffer has left).

The copying part is a common/abstract path for all non-direct buffers, so the implementation never knows the target capacity. Copying trashes the caches and what not, reduces memory bandwidth, etc.
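A back-of-the-envelope sketch of that quadratic copy cost, using the numbers from the example above and assuming the optimistic case where the socket drains exactly 64 KB per write:

public class CopyCostSketch {
    public static void main(String[] args) {
        final int bufferKB = 512;  // non-direct buffer size
        final int socketKB = 64;   // space the socket buffer accepts per write

        long copiedKB  = 0;
        long writtenKB = 0;
        for (int remaining = bufferKB; remaining > 0; remaining -= socketKB) {
            copiedKB  += remaining;                      // remaining() KB copied to the thread-local direct buffer
            writtenKB += Math.min(socketKB, remaining);  // but only one socket buffer's worth is written
        }
        System.out.println("actually written:  " + writtenKB + " KB");  // 512 KB
        System.out.println("copied on the way: " + copiedKB + " KB");   // 2304 KB
    }
}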

*A note on SoftReference caching: it depends on the GC implementation and the experience can vary. Sun's GC uses the free heap memory to determine the lifespan of the SoftReferences, which leads to some awkward behavior when they are freed: the application needs to allocate the previously cached objects again--i.e. more allocation. (Direct ByteBuffers take only a minor part in the heap, so at least they do not add to the extra cache trashing, but they do get affected by it.)

My rule of thumb: a pooled direct buffer, sized to match the socket read/write buffer. The OS never copies more than necessary.
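A minimal sketch of such a pool (the class name, pool size, and 64 KB buffer size below are illustrative assumptions, not a prescription):

import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class DirectBufferPool {
    private final BlockingQueue<ByteBuffer> pool;

    public DirectBufferPool(int buffers, int bufferSize) {
        pool = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            pool.add(ByteBuffer.allocateDirect(bufferSize));  // pay the allocation cost once, up front
        }
    }

    public ByteBuffer acquire() throws InterruptedException {
        return pool.take();
    }

    public void release(ByteBuffer buffer) {
        buffer.clear();
        pool.offer(buffer);
    }
}

A connection would acquire() a buffer before its read/write loop and release() it afterwards, e.g. new DirectBufferPool(16, 64 * 1024), so the expensive allocateDirect() happens only once per buffer.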

This micro-benchmark is mostly a memory throughput test; the OS will have the file entirely in cache, so it mostly tests memcpy. Once the buffers grow beyond the L2 cache, the drop in performance becomes noticeable. Also, running the benchmark like that imposes increasing and accumulating GC collection costs. (rest() will not collect the soft-referenced ByteBuffers.)

Editorialize answered 12/6, 2012 at 20:20 Comment(0)

Thread Local Allocation Buffers (TLAB)

I wonder if the thread local allocation buffer (TLAB) during the test is around 256K. Use of TLABs optimizes allocations from the heap so that the non-direct allocations of <=256K are fast.

What is commonly done is to give each thread a buffer that is used exclusively by that thread to do allocations. You have to use some synchronization to allocate the buffer from the heap, but after that the thread can allocate from the buffer without synchronization. In the hotspot JVM we refer to these as thread local allocation buffers (TLAB's). They work well.

Large allocations bypassing the TLAB

If my hypothesis about a 256K TLAB is correct, then information later in the article suggests that perhaps the >256K allocations for the larger non-direct buffers bypass the TLAB. These allocations go straight to the heap, requiring thread synchronization and thus incurring the performance hits.

An allocation that can not be made from a TLAB does not always mean that the thread has to get a new TLAB. Depending on the size of the allocation and the unused space remaining in the TLAB, the VM could decide to just do the allocation from the heap. That allocation from the heap would require synchronization but so would getting a new TLAB. If the allocation was considered large (some significant fraction of the current TLAB size), the allocation would always be done out of the heap. This cut down on wastage and gracefully handled the much-larger-than-average allocation.

Tweaking the TLAB parameters

This hypothesis could be tested using information from a later article indicating how to tweak the TLAB and get diagnostic info:

To experiment with a specific TLAB size, two -XX flags need to be set, one to define the initial size, and one to disable the resizing:

-XX:TLABSize= -XX:-ResizeTLAB

The minimum size of a tlab is set with -XX:MinTLABSize which defaults to 2K bytes. The maximum size is the maximum size of an integer Java array, which is used to fill the unallocated portion of a TLAB when a GC scavenge occurs.

Diagnostic Printing Options

-XX:+PrintTLAB

Prints at each scavenge one line for each thread (starts with "TLAB: gc thread: " without the "'s) and one summary line.
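For example, a hypothetical invocation against the benchmark class above, fixing the TLAB at the 256K size under suspicion and printing the diagnostics:

java -XX:TLABSize=256k -XX:-ResizeTLAB -XX:+PrintTLAB ch.dietpizza.bench.SocketChannelByteBufferExample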

Boodle answered 6/9, 2010 at 17:35 Comment(4)
+1 Wow. Thanks. I've never even heard of this stuff. Will experiment and report back.Photocopy
Alas, no joy. :( I tried with values both larger (10MB) and smaller (2KB) and there was no change in the performance curves. But thanks for the educational trip into JVM options.Photocopy
Awww - darn. I guess that's why hypotheses need experiments to confirm them. Thanks for checking it out and reporting back. As you say, even a wrong hypothesis can be educational and useful. I learned a lot just confirming my understanding of TLABs and writing up the answer.Boodle
The heap buffer is allocated once per capacity test, and it will be moved to the "tenured" heap after the first GC; in that respect the TLAB doesn't matter at all. The TLAB may matter only in heavily multithreaded code (with enough allocation); otherwise it costs a CASed pointer bump. Trouble arises if you have more threads doing CAS on the same location; if you have only one it's not that big of a cost, esp. if it hits L1 and the cache line is 'owned'.Editorialize

I suspect that these knees are due to tripping across a CPU cache boundary. The "non-direct" buffer read()/write() implementation "cache misses" earlier due to the additional memory buffer copy compared to the "direct" buffer read()/write() implementation.
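A rough way to see why the non-direct knee would come earlier, under the assumption that the extra copy roughly doubles the working set (the 3 MB L2 figure from the comment below is used as the budget; both numbers are illustrative):

public class CacheBudgetSketch {
    public static void main(String[] args) {
        long l2Bytes = 3L * 1024 * 1024;  // L2 cache size budget (3 MB per the comment below)

        for (int size = 64 * 1024; size <= 8 * 1024 * 1024; size *= 2) {
            long directWorkingSet    = size;       // just the direct buffer itself
            long nonDirectWorkingSet = 2L * size;  // heap buffer + thread-local direct copy
            System.out.println(String.format("%7d KB  direct fits L2: %-5b  non-direct fits L2: %b",
                    size / 1024, directWorkingSet <= l2Bytes, nonDirectWorkingSet <= l2Bytes));
        }
    }
}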

Pyrites answered 1/10, 2010 at 21:44 Comment(2)
I applied Zach Smith's memory bandwidth "benchmark" (home.comcast.net/~fbui/bandwidth.html) on my MBP Core Duo that likewise has a 4MB L2 cache. The tool shows a knee at 1MB. The direct byte buffer does not enable DMA. The direct byte buffer allocates process memory (i.e., malloc()) in the JVM. The JVM file system read()/write() are copying memory to/from system memory into the direct buffer's process memory.Pyrites
FWIW, my MBP actually only has a 3MB L2 cache (not 4MB as I previously stated).Pyrites

There are many reasons why this could happen. Without code and/or more details about the data, we can only guess what is happening.

Some Guesses:

  • Maybe you hit the maximum bytes that can be read at a time, so IO waits get higher or memory consumption goes up without a decrease in loops.
  • Maybe you hit a critical memory limit, or the JVM is trying to free memory before a new allocation. Try playing around with the -Xmx and -Xms parameters
  • Maybe HotSpot can't/won't optimize, because the number of calls to some methods is too low.
  • Maybe there are OS or Hardware conditions that cause this kind of delay
  • Maybe the implementation of the JVM is just buggy ;-)
Berbera answered 6/9, 2010 at 13:51 Comment(4)
Hehe... Many of these I've speculated on myself, but none really makes total sense to me. "Max bytes?" 256KB is not much, and it behaves differently for direct and non-direct buffers. "256KB and the JVM memory settings?" Again, 256KB is small. The discrepancy is fairly consistent no matter how many loops it runs through. "No HotSpot optimizations?" I've tried different configurations and the results are still consistent. "OS/HW conditions?" Like what? And why different for direct versus non-direct buffers? Sigh...Photocopy
The JVM may use different OS calls for direct and non-direct buffers, thus resulting in different runtime behaviour. Non-direct buffers may be slightly larger than direct ones. But the TLAB stuff from Bert sounds more like the source of your problem.Berbera
It's not a "problem". Merely an unexpected benchmark result that I would like to accurately understand.Photocopy
BTW: After the above TLAB changes didn't work, I tried -Xmx and -Xms ...no joy :( The mystery remains.Photocopy
