Java TCP/IP Socket Latency - stuck at 50 μs (microseconds)? (used for Java IPC)

Asked 9/5, 2011 at 22:47 Answered 10/9, 2013 at 20:52

We have been profiling and profiling our application to get reduce latency as much as possible. Our application consists of 3 separate Java processes, all running on the same server, which are passing messages to each other over TCP/IP sockets.

We have reduced processing time in first component to 25 μs, but we see that the TCP/IP socket write (on localhost) to the next component invariably takes about 50 μs. We see one other anomalous behavior, in that the component which is accepting the connection can write faster (i.e. < 50 μs). Right now, all the components are running < 100 μs with the exception of the socket communications.

Not being a TCP/IP expert, I don't know what could be done to speed this up. Would Unix Domain Sockets be faster? MemoryMappedFiles? what other mechanisms could possibly be a faster way to pass the data from one Java Process to another?

UPDATE 6/21/2011 We created 2 benchmark applications, one in Java and one in C++ to benchmark TCP/IP more tightly and to compare. The Java app used NIO (blocking mode), and the C++ used Boost ASIO tcp library. The results were more or less equivalent, with the C++ app about 4 μs faster than Java (but in one of the tests Java beat C++). Also, both versions showed a lot of variability in the time per message.

I think we are agreeing with the basic conclusion that a shared memory implementation is going to be the fastest. (Although we would also like to evaluate the Informatica product, provided it fits the budget.)

Counterblow answered 9/5, 2011 at 22:47 Comment(8)

The SI shorthand for microseconds is μs, not μ (and you should have a space between the quantity and the unit). I fixed it for you. – Unveil 9/5, 2011 at 22:52

Not being an expert either, I'll hazard a guess that UDP might get your latencies down, due to being a more lightweight protocol. It is, of course, much more painful to program against, and might not yield any benefits if your app has to manually implement the same reliability guarantees that TCP provides out-of-the-box. – Unveil 9/5, 2011 at 23:2

How about stdin/stdout/stderr (e.g. the first process starts the other 2, and comms only happens between this 'master' and the 2 slaves)? is this an option? – Multiangular 9/5, 2011 at 23:8

Is there any reason why these three processes cannot share the same JVM? – Tubby 10/5, 2011 at 0:6

@Marcelo: Thanks! pointers for better grammar and syntax are always appreciated! – Counterblow 10/5, 2011 at 2:37

Hi Olaf: we have modified the apps to allow them to be configured to run in same JVM. But due to their very different functions, it is often beneficial to have them as separate processes to allow them to be stopped and restarted at different times which is helpful for several scenarios. Also - we had some experience with Garbage Collection delays being magnified in one of the faster components when run in same JVM with another component that was less efficient. So there is some (slightly irrational) concern about the increase of GC latency by combining them together. – Counterblow 10/5, 2011 at 2:51

Just to check - you have switched off Nagle? – Omora 21/6, 2011 at 14:37

Not sure if it still matters but see this post.. #15726211 – Surcharge 31/3, 2013 at 18:43

If using native libraries via JNI is an option, I'd consider implementing IPC as usual (search for IPC, mmap, shm_open, etc.).

There's a lot of overhead associated with using JNI, but at least it's a little less than the full system calls needed to do anything with sockets or pipes. You'll likely be able to get down to about 3 microseconds one-way latency using a polling shared memory IPC implementation via JNI. (Make sure to use the -Xcomp JVM option or adjust the compilation threshold, too; otherwise your first 10,000 samples will be terrible. It makes a big difference.)

I'm a little surprised that a TCP socket write is taking 50 microseconds - most operating systems optimize TCP loopback to some extent. Solaris actually does a pretty good job of it with something called TCP Fusion. And if there has been any optimization for loopback communication at all, it's usually been for TCP. UDP tends to get neglected - so I wouldn't bother with it in this case. I also wouldn't bother with pipes (stdin/stdout or your own named pipes, etc.), because they're going to be even slower.

And generally, a lot of the latency you're seeing is likely coming from signaling - either waiting on an IO selector like select() in the case of sockets, or waiting on a semaphore, or waiting on something. If you want the lowest latency possible, you'll have to burn a core sitting in a tight loop polling for new data.

Of course, there's always the commercial off-the-shelf route - which I happen to know for a certainty would solve your problem in a hurry - but of course it does cost money. And in the interest of full disclosure: I do work for Informatica on their low-latency messaging software. (And my honest opinion, as an engineer, is that it's pretty fantastic software - certainly worth checking out for this project.)

Propertius answered 10/5, 2011 at 0:24 Comment(2)

I checked out your website and saw your ultra messaging product. I saw that on CISCO UCS it is showing latency less than 1 μs. What do you think it would be on a standard Linux server? (e.g. 2 dual core intel Xeon)? – Counterblow 12/5, 2011 at 0:16

I've actually tested here with a dinky Core 2 Quad Q6600 machine I bought for about $400, and I can get under 1 microsecond on that in C, too (just a little under, though - it's still not nearly as impressive as the fancier Cisco server machine could do). The under-one-microsecond numbers are all from benchmarks run with plain C apps; for Java, add a couple microseconds as a basic floor due to the JNI overhead. And that's also with a receiving thread that's polling in a tight loop; you can run it non-polling too, but then you get a few more micros of signaling/thread wakeup latency. – Propertius 13/5, 2011 at 16:49

"The O'Reilly book on NIO (Java NIO, page 84), seems to be vague about whether the memory mapping stays in memory. Maybe it is just saying that like other memory, if you run out of physical, this gets swapped back to disk, but otherwise not?"

Linux. mmap() call allocates pages in OS page cache area (which are periodically get flushed to disk and can be evicted based on Clock-PRO which is approximation of LRU algorithm?) So the answer on your question is - yes. Memory mapped buffer can be evicted (in theory) from memory unless it is mlocke'd (mlock()). This is in theory. In practice, I think it is hardly possible if your system is not swapping In this case, first victims are page buffers.

Firman answered 26/10, 2011 at 0:22 Comment(0)

See my answer to fastest (low latency) method for Inter Process Communication between Java and C/C++ - with memory mapped files (shared memory) java-to-java latency can be reduced to 0.3 microsecond

Stutzman answered 21/6, 2011 at 9:9 Comment(1)

We came to the same conclusion as you more or less. The question I have about Memory Mapped Files though, whether that memory would stay in RAM, or whether it would some times page back to disk (not sure what the rules that govern that would be). The O'Reilly book on NIO (Java NIO, page 84), seems to be vague about whether the memory mapping stays in memory. Maybe it is just saying that like other memory, if you run out of physical, this gets swapped back to disk, but otherwise not? – Counterblow 21/6, 2011 at 14:7

MemoryMappedFiles is not viable solution for low latency IPC at all - if mapped segment of memory gets updated it is eventually will be synced to disk thus introducing unpredictable delay which measures in milliseconds at least. For low latency one can try combinations of either Shared Memory + message queues (notifications), or shared memory + semaphores. This works on all Unixes especially System V version (not POSIX) but if you run application on Linux you pretty safe with POSIX IPC (most features are available in 2.6 kernel) Yes, you will need JNI to get this done.

UPD: I forgot that this is JVM - JVM IPC and we have already GCs which we can not control fully, so introducing additional several ms pauses due to OS file buffers flash to disk may be acceptable.

Firman answered 9/10, 2011 at 21:26 Comment(2)

"if mapped segment of memory gets updated it is eventually will be synced to disk". Is it only paged to disk as a "swap" (i.e. when O/S doesn't have enough physical RAM to keep it only as memory)? – Counterblow 12/10, 2011 at 16:5

pdflush knows when. on Linux. – Firman 26/10, 2011 at 0:5

Check out https://github.com/pcdv/jocket

It's a low-latency replacement for local Java sockets that uses shared memory.

RTT latency between 2 processes is well below 1us on a modern CPU.

Hardship answered 10/9, 2013 at 20:52 Comment(3)

I started to look at Jocket. It seems to be using MappedByteBuffer which was suggested by several other answers heres. I've tested using MappedByteBuffer and seen it to be very fast for IPC. But I still have the unanswered question about when disk I/O takes place (which introduces very large pauses, depending on the size of the buffer flushed to disk). – Counterblow 12/9, 2013 at 15:11

That's right, it uses MappedByteBuffer. I was also worried by the I/O latency so I decided to create the files under "/dev/shm" when possible (under linux, it is mounted as tmpfs so there is no I/O). However, in my benchmarks I was not able to observe any noticeable difference in performance... – Hardship 12/9, 2013 at 15:34

Thanks for the tip about /dev/shm. This is definitely worth us trying. – Counterblow 12/9, 2013 at 17:25

Recommended topics

Hot tags