Fastest (low-latency) method for inter-process communication between Java and C/C++

I have a Java app connecting through a TCP socket to a "server" developed in C/C++.

Both app and server run on the same machine, a Solaris box (but we're considering migrating to Linux eventually). The data exchanged consists of simple messages (login, login ACK, then the client asks for something and the server replies). Each message is around 300 bytes long.

Currently we're using sockets, and all is OK; however, I'm looking for a faster way to exchange data (lower latency) using IPC methods.

I've been researching the net and came up with references to the following technologies:

  • shared memory
  • pipes
  • queues
  • as well as what's referred to as DMA (Direct Memory Access)

But I couldn't find a proper analysis of their respective performance, nor how to implement them in both Java and C/C++ (so that they can talk to each other), except maybe for pipes, which I can imagine how to do.

Can anyone comment on the performance and feasibility of each method in this context? Any pointers/links to useful implementation information?


EDIT / UPDATE

Following the comments and answers I got here, I found info about Unix Domain Sockets, which seem to be built just over pipes and would save me the whole TCP stack. They're platform-specific, so I plan on testing them with JNI or with either juds or junixsocket.
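
For illustration, a sketch of what such a client could look like using the Unix domain socket support that the JDK itself later added in Java 16 (junixsocket provides a similar Socket-style API for older JVMs; the socket path is a made-up example, and IOException handling is omitted):

import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

// connect to a C/C++ server listening on a Unix domain socket
SocketChannel ch = SocketChannel.open(StandardProtocolFamily.UNIX);
ch.connect(UnixDomainSocketAddress.of("/tmp/ipc.sock"));

ByteBuffer msg = ByteBuffer.wrap("login".getBytes());
ch.write(msg);                            // send the ~300-byte request
ByteBuffer reply = ByteBuffer.allocate(300);
ch.read(reply);                           // block for the server's reply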

Next possible steps would be a direct implementation of pipes, then shared memory, although I've been warned about the extra level of complexity...


Thanks for your help.

Keynesianism answered 14/4, 2010 at 6:23 Comment(9)
It might be overkill in your case but consider zeromq.orgGeographical
that's interesting, however the idea would be to use "generic" (as in OS-provided or language-provided) methods first, that's why I mentioned queues & shared memory.Keynesianism
See also stackoverflow.com/questions/904492Enriquetaenriquez
Don't forget mapped files or just UDP.Greenstone
UDP is too slow (comparable to TCP), and not reliable. what do you mean by mapped files ?Keynesianism
UDP slower than TCP??? hmmm... proof pleaseGoulash
@J.F.Sebastian Zeromq is NOT the fastest. Its underlying implementation uses TCP sockets. This will be slower than methods such as POSIX Message Queues, pipes or shared memory.Peekaboo
@user289882 if your use-case requires the functionality that zeromq provides then you should compare the time performance of zeromq with the time performance of your custom solution on top of "POSIX Message Queues, pipes or shared memory" (it is similar to comparing hand-written assembly vs. code generated by an optimizing compiler: it is true that it is possible to write faster assembly by hand. Whether it is worth it in practice in most cases is another question).Geographical
@J.F.Sebastian Well OP is not asking for the IPC with the most functionality, he is asking for what is the fastest (lowest latency) IPC.Peekaboo

Just tested latency from Java on my Core i5 2.8 GHz, with only a single byte sent/received, between 2 freshly spawned Java processes, without assigning specific CPU cores with taskset:

TCP         - 25 microseconds
Named pipes - 15 microseconds

Now explicitly specifying core masks, like taskset 1 java Srv or taskset 2 java Cli:

TCP, same cores:                      30 microseconds
TCP, explicit different cores:        22 microseconds
Named pipes, same core:               4-5 microseconds !!!!
Named pipes, taskset different cores: 7-8 microseconds !!!!

So:

  • TCP overhead is visible
  • scheduling overhead (or core caches?) is also a culprit

At the same time, Thread.sleep(0) (which, as strace shows, causes a single sched_yield() Linux kernel call to be executed) takes 0.3 microseconds, so named pipes scheduled to a single core still have a lot of overhead.
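
A rough sketch of one way that sleep(0) cost can be measured (InterruptedException handling omitted):

long t0 = System.nanoTime();
for (int i = 0; i < 1000000; i++) Thread.sleep(0); // each call -> one sched_yield()
System.out.println((System.nanoTime() - t0) / 1000000 + " ns per Thread.sleep(0)");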

Some shared memory measurements: on September 14, 2009, Solace Systems announced that its Unified Messaging Platform API can achieve an average latency of less than 700 nanoseconds using a shared memory transport. http://solacesystems.com/news/fastest-ipc-messaging/

P.S. - I tried shared memory the next day in the form of memory-mapped files. If busy waiting is acceptable, we can reduce the latency to 0.3 microseconds for passing a single byte with code like this:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// map a single shared byte of /tmp/mapped.txt into both processes
// (IOException/InterruptedException handling omitted for brevity)
MappedByteBuffer mem =
  new RandomAccessFile("/tmp/mapped.txt", "rw").getChannel()
  .map(FileChannel.MapMode.READ_WRITE, 0, 1);

while(true){
  while(mem.get(0)!=5) Thread.sleep(0); // waiting for client request
  mem.put(0, (byte)10);                 // sending the reply
}

Notes: Thread.sleep(0) is needed so the 2 processes can see each other's changes (I don't know of another way yet). If the 2 processes are forced onto the same core with taskset, the latency becomes 1.5 microseconds - that's a context-switch delay.
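
For completeness, a sketch of what the matching client side could look like (a reconstruction, not the original code; mem is mapped over the same file as above, with byte 5 = request, byte 10 = reply):

long start = System.nanoTime();
mem.put(0, (byte)5);                    // post the request byte
while(mem.get(0)!=10) Thread.sleep(0);  // busy-wait for the server's reply
System.out.println("RTT: " + (System.nanoTime() - start) + " ns");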

P.P.S. - and 0.3 microseconds is a good number! The following code takes exactly 0.1 microseconds while doing only a primitive string concatenation:

int j=123456789;
String ret = "my-record-key-" + j  + "-in-db";

P.P.P.S. - I hope this is not too much off-topic, but finally I tried replacing Thread.sleep(0) with incrementing a static volatile int variable (the JVM happens to flush CPU caches when doing so) and obtained - a record! - 72 nanoseconds latency for Java-to-Java process communication!

When forced onto the same CPU core, however, the volatile-incrementing JVMs never yield control to each other, thus producing exactly 10 milliseconds of latency - the Linux time quantum seems to be 5 ms... So this should be used only if there is a spare core - otherwise sleep(0) is safer.
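
A sketch of what that variant of the server loop could look like (the field name is hypothetical; as just noted, only with a spare core per process):

// each volatile write acts as a memory barrier, which - per the
// measurement above - keeps the mapped byte's changes visible between
// the two JVMs without any syscall
static volatile int spins;

while(true){
  while(mem.get(0)!=5) spins++; // pure busy-spin: no sched_yield()
  mem.put(0, (byte)10);         // reply as before
}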

Pasteurism answered 20/6, 2011 at 13:55 Comment(5)
Thanks Andriy, very informative study, and it matches more or less my measurements for TCP, so that's a good reference. I guess I'll look into named pipes.Keynesianism
So replacing the Thread.sleep(0) with incrementing the volatile static int should only be done if you can pin the processes to different cores? Also, I didn't realise you could do this? I thought the OS decides?Keramics
Try LockSupport.parkNanos(1), should do the same thing.Demagoguery
Very nice. You can do better (as in 5-7us RTT latency) for TCP ping though. See here: psy-lob-saw.blogspot.com/2012/12/…Cloudburst
Further exploration of using memory mapped file as shared memory to support IPC queue in Java: psy-lob-saw.blogspot.com/2013/04/lock-free-ipc-queue.html achieving 135M messages a second. Also see my answer below for comparative study of latency by method.Cloudburst

The question was asked some time ago, but you might be interested in https://github.com/peter-lawrey/Java-Chronicle, which supports typical latencies of 200 ns and throughputs of 20 M messages/second. It uses memory-mapped files shared between processes (it also persists the data, which makes it the fastest way to persist data).

Otilia answered 15/7, 2012 at 6:48 Comment(0)

DMA is a method by which hardware devices can access physical RAM without interrupting the CPU. A common example is a hard disk controller that can copy bytes straight from disk to RAM. As such, it's not applicable to IPC.

Shared memory and pipes are both supported directly by modern OSes. As such, they're quite fast. Queues are typically abstractions, e.g. implemented on top of sockets, pipes and/or shared memory. This may look like a slower mechanism, but the alternative is that you create such an abstraction.

Enriquetaenriquez answered 14/4, 2010 at 8:58 Comment(2)
For DMA, why is it then that I can read a lot of things related to RDMA (Remote Direct Memory Access) that apply across the network (especially with InfiniBand) and do this same thing? I'm actually trying to achieve the equivalent WITHOUT the network (as all is on the same box).Keynesianism
RDMA is the same concept: copying bytes across a network without interrupting CPUs on either side. It still doesn't operate at the process level.Enriquetaenriquez

Here's a project containing performance tests for various IPC transports:

http://github.com/rigtorp/ipc-bench

Unicycle answered 15/4, 2010 at 4:54 Comment(2)
It doesn't include the 'Java factor', but it does look interesting.Greenstone
Just found goldsborough's benchmark with descriptive readme. github.com/goldsborough/ipc-benchEmboly

A late arrival, but I wanted to point out an open-source project dedicated to measuring ping latency using Java NIO.

Further explored/explained in this blog post. The results are (RTT in nanos):

Implementation, Min,   50%,   90%,   99%,   99.9%, 99.99%, Max
IPC busy-spin,  89,    127,   168,   3326,  6501,  11555, 25131
UDP busy-spin,  4597,  5224,  5391,  5958,  8466,  10918, 18396
TCP busy-spin,  6244,  6784,  7475,  8697,  11070, 16791, 27265
TCP select-now, 8858,  9617,  9845,  12173, 13845, 19417, 26171
TCP block,      10696, 13103, 13299, 14428, 15629, 20373, 32149
TCP select,     13425, 15426, 15743, 18035, 20719, 24793, 37877

This is along the lines of the accepted answer. The System.nanoTime() error (estimated by measuring nothing) comes to around 40 nanos, so for the IPC case the actual result might be lower. Enjoy.

Cloudburst answered 3/7, 2013 at 14:8 Comment(0)

If you ever consider using native access (since both your application and the "server" are on the same machine), consider JNA; it has less boilerplate code for you to deal with.
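
For illustration, a minimal JNA sketch binding a hypothetical C function (the library and function names are made up; JNA generates the binding at runtime, so no hand-written glue code is needed):

import com.sun.jna.Library;
import com.sun.jna.Native;

public interface IpcLib extends Library {
    // binds to libipcbridge.so / ipcbridge.dll (hypothetical)
    IpcLib INSTANCE = Native.load("ipcbridge", IpcLib.class);

    // hypothetical C function: int send_message(const char *buf, int len);
    int send_message(byte[] buf, int len);
}

// usage: IpcLib.INSTANCE.send_message(payload, payload.length);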

Whiteeye answered 14/4, 2010 at 7:10 Comment(0)

I don't know much about native inter-process communication, but I would guess that you need to communicate using native code, which you can access using JNI mechanisms. So, from Java you would call a native function that talks to the other process.
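
A sketch of what that could look like on the Java side (the library and method names are hypothetical; the matching C/C++ implementation would be compiled against the generated JNI header):

public class NativeIpc {
    static { System.loadLibrary("ipcbridge"); } // loads libipcbridge.so (hypothetical)

    // implemented in C/C++; e.g. it could write the request to a pipe or
    // shared-memory segment and block for the server's reply
    public static native byte[] sendAndReceive(byte[] request);
}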

Crankcase answered 14/4, 2010 at 6:52 Comment(0)

In my former company we used to work with this project, http://remotetea.sourceforge.net/; it's very easy to understand and integrate.

Flavio answered 14/4, 2010 at 6:33 Comment(0)

Have you considered keeping the sockets open, so the connections can be reused?

Grecian answered 15/4, 2010 at 5:22 Comment(3)
The sockets do stay open. The connection is alive for the whole time the application is running (around 7 hours). Messages are exchanged more or less continuously (let's say around 5 to 10 per second). Current latency is around 200 microseconds; the goal is to shave off 1 or 2 orders of magnitude.Keynesianism
A 2 ms latency? Ambitious. Would it be feasible to rewrite the C-stuff to a shared library that you can interface to using JNI?Disconsolate
2ms is 2000 microseconds, not 200. this makes 2ms far less ambitious.Prelatism

Oracle bug report on JNI performance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4096069

JNI is a slow interface, so Java TCP sockets are the fastest method for notification between applications; however, that doesn't mean you have to send the payload over a socket. Use LDMA to transfer the payload, but as previous questions have pointed out, Java support for memory mapping is not ideal, and so you will want to implement a JNI library to run mmap.

Feminine answered 30/11, 2010 at 8:20 Comment(2)
Why is JNI slow? Consider how the low-level TCP layer in Java works: it's not written in Java byte-code! (E.g. it has to funnel through the native host.) Thus, I reject the assertion that Java TCP sockets are any faster than JNI. (JNI, however, is not IPC.)Greenstone
A single JNI call costs you 9 ns (on an Intel i5) if you only use primitives. So it is not that slow.Nielsen