Multicore Java Program with Native Code

Asked 20/8, 2012 at 13:0 Answered 20/8, 2012 at 19:36

Solved java parallel-processing scalability multicore native-code

I am using a native C++ library inside a Java program. The Java program is written to make use of many-core systems, but it does not scale: the best speed is with around 6 cores, i.e., adding more cores slows it down. My tests show that the call to the native code itself causes the problem, so I want to make sure that different threads access different instances of the native library, and therefore remove any hidden (memory) dependency between the parallel tasks. In other words, instead of the static block

static {
    System.loadLibrary("theNativeLib");
}

I want multiple instances of the library to be loaded dynamically, one per thread. The main question is whether that is possible at all, and if so, how to do it.

Notes:
- I have implementations in Java 7 fork/join as well as Scala/Akka, so help for either platform is appreciated.
- The parallel tasks are completely independent. In fact, each task may create a couple of new tasks and then terminate; there is no further dependency.

Here is the test program in fork/join style, in which processNatively is basically a bunch of native calls:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Each task makes the native calls, then forks (9 - n) child tasks at level n + 1.
class Repeater extends RecursiveTask<Long> {
    final int n;
    final processor mol;

    public Repeater(final int m, final processor o) {
        n = m;
        mol = o;
    }

    @Override
    protected Long compute() {
        processNatively(mol);
        final List<RecursiveTask<Long>> tasks = new ArrayList<>();
        for (int i = n; i < 9; i++) {
            tasks.add(new Repeater(n + 1, mol));
        }

        // Count this task plus every task spawned below it.
        long count = 1;
        for (final RecursiveTask<Long> task : invokeAll(tasks)) {
            count += task.join();
        }
        return count;
    }
}

// The field and method below are members of the enclosing class (not shown).
private final static ForkJoinPool forkJoinPool = new ForkJoinPool();

public void repeat(processor mol)
{
    final long middle = System.currentTimeMillis();
    final long count = forkJoinPool.invoke(new Repeater(0, mol));
    System.out.println("Count is " + count);
    final long after = System.currentTimeMillis();
    System.out.println("Time elapsed: " + (after - middle));
}

Putting it differently: if I have N threads that use a native library, what happens if each of them calls System.loadLibrary("theNativeLib"); dynamically, instead of calling it once in a static block? Will they share the library anyway? If yes, how can I fool the JVM into seeing it as N different libraries loaded independently? (The value of N is not known statically.)

Ruffian answered 20/8, 2012 at 13:0 Comment(9)

I'm sorry, I missed the question. Mind clarifying it? – Joachim 20/8, 2012 at 13:2

PS: As I said, the code is just a test. So don't look for any logic in how the tasks are generated! The point is just calling the native code many many times. – Ruffian 20/8, 2012 at 13:3

@David: I updated the question. Is it clearer now? – Ruffian 20/8, 2012 at 13:9

The question really should be: why does the native call slow it down? Surely it should be possible to write completely re-entrant native code with JNI. – Polysepalous 20/8, 2012 at 13:28

@biziclop: That's a good question, but I don't have access to the code of the native library. – Ruffian 20/8, 2012 at 13:32

@Ruffian Ah, that is a problem then indeed. I thought you wrote the native bit as well. – Polysepalous 20/8, 2012 at 13:57

Do you need to use the output of the native call in the rest of the task? – Polysepalous 20/8, 2012 at 13:59

Spawn a new JVM for every multiple of 6 and use Remote Akka Actors to join the results? – Marilla 20/8, 2012 at 18:56

Sounds like a memory-bound problem to me. How much data are the native call and the program as a whole processing? – Sinuous 20/8, 2012 at 19:45

The javadoc for System.loadLibrary states that it is the same as calling Runtime.getRuntime().loadLibrary(name). The javadoc for that loadLibrary (http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#loadLibrary(java.lang.String) ) states: "If this method is called more than once with the same library name, the second and subsequent calls are ignored." So it seems you can't load the same library more than once. As for fooling the JVM into thinking there are multiple instances, I can't help you there.

Electromagnetic answered 20/8, 2012 at 14:11 Comment(3)

Would this mean the OP can make N copies of the library with different names (e.g., hard-link in *nix) to circumvent this single-loading optimization by Java? – Trustful 20/8, 2012 at 14:22

In theory yes, but only if the JVM uses the library's name as the check and doesn't examine the actual contents of the library. – Electromagnetic 20/8, 2012 at 14:41

But then, how can I refer to the different instances? I mean the loaded classes will have the same names and packages, right? What happens in general if you load two dynamic libraries containing classes with the same names and packages? – Ruffian 20/8, 2012 at 14:52
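Following up on the hard-link idea in the comments above, here is a minimal Java sketch of making N differently named copies of the library file and loading one of them by absolute path with System.load. The names LibraryCopies, copyFor and loadCopy are made up for illustration. It has not been tried against a real native library, and an important caveat applies: as far as I know, the JVM resolves a class's native methods through the class loader that defined the class, so several copies loaded by one class loader would probably still bind to a single implementation. Getting truly separate instances would likely also need one class loader per copy, which this sketch does not attempt.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

class LibraryCopies {
    // Copy the library file n times into a fresh temp directory,
    // giving every copy a distinct file name.
    static List<Path> copyFor(Path original, int n) {
        try {
            Path dir = Files.createTempDirectory("nativecopies");
            List<Path> copies = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                Path target = dir.resolve(i + "_" + original.getFileName());
                copies.add(Files.copy(original, target,
                        StandardCopyOption.REPLACE_EXISTING));
            }
            return copies;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // System.load takes an absolute file path, unlike System.loadLibrary,
    // which takes a bare library name. Whether the result behaves as an
    // independent instance is an assumption to verify, not a guarantee.
    static void loadCopy(Path copy) {
        System.load(copy.toAbsolutePath().toString());
    }
}
```

If the copies do not behave independently, the separate-JVM suggestion from the question's comments sidesteps the problem entirely, at the price of inter-process communication.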

You need to ensure you don't have a bottleneck on any shared resource. For example, say you have 6 hyper-threaded cores: you may find that 12 threads is optimal, or you might find that 6 threads is optimal (with each thread having a dedicated core).

If you have a heavy floating point routine, it is likely that hyperthreading will be slower rather than faster.

If you are already using all the cache, trying to use more can slow your system down. If you are already at the limit of CPU-to-main-memory bandwidth, attempting to use more bandwidth can slow your machine down.
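One way to find the optimal thread count is to run the same fork/join workload in pools of different sizes and compare timings. The sketch below is a simplified version of the question's Repeater; the work method is a hypothetical integer-only stand-in for the native call, and the class name PoolSizeSweep is made up. Timings from such a loop are only indicative, since JIT warm-up and other load on the machine affect them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class PoolSizeSweep {
    // Written to so the JIT cannot discard the busy work.
    static volatile long blackhole;

    // Hypothetical stand-in for the native call: integer arithmetic only.
    static long work(int seed) {
        long x = seed + 1;
        for (int i = 0; i < 20_000; i++) {
            x = x * 6364136223846793005L + 1442695040888963407L;
        }
        return x;
    }

    static class Repeater extends RecursiveTask<Long> {
        final int n;
        final int limit;

        Repeater(int n, int limit) {
            this.n = n;
            this.limit = limit;
        }

        @Override
        protected Long compute() {
            blackhole = work(n);
            List<Repeater> tasks = new ArrayList<>();
            for (int i = n; i < limit; i++) {
                tasks.add(new Repeater(n + 1, limit));
            }
            long count = 1;
            for (Repeater task : invokeAll(tasks)) {
                count += task.join();
            }
            return count;
        }
    }

    // Run the whole task tree in a pool with the given number of workers.
    static long run(int parallelism, int limit) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            return pool.invoke(new Repeater(0, limit));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int max = Runtime.getRuntime().availableProcessors();
        for (int p = 1; p <= max; p++) {
            long start = System.nanoTime();
            long count = run(p, 7);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(p + " threads: " + count + " tasks in " + ms + " ms");
        }
    }
}
```

If the stand-in scales well but the version with the real native call does not, that points at the library or the memory system rather than at the fork/join structure.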

"But then, how can I refer to the different instances? I mean the loaded classes will have the same names and packages, right? What happens in general if you load two dynamic libraries containing classes with the same names and packages?"

There is only one instance; you cannot load a DLL more than once. If you want a different data set for each thread, you need to construct it outside the library and pass it in, so that each thread works on different data.
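As a minimal sketch of the "construct the data outside the library and pass it in" idea, a ThreadLocal can give every worker thread its own buffer. Everything here is an assumption for illustration: processNatively is a pure-Java stand-in with a made-up signature, not the real library's entry point, and this only helps if the library accepts caller-owned state in the first place. ThreadLocal.withInitial needs Java 8 or later.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadData {
    // One direct buffer per thread, created lazily on first use.
    static final ThreadLocal<ByteBuffer> SCRATCH =
            ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(4096));

    // Hypothetical stand-in for a native method that works on
    // caller-supplied state instead of hidden shared state.
    static int processNatively(ByteBuffer scratch, int input) {
        scratch.putInt(0, input);
        return scratch.getInt(0) * 2;
    }

    static long runAll(int threads, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                final int input = i;
                // SCRATCH.get() runs on the worker thread, so each
                // worker only ever touches its own buffer.
                Callable<Integer> call = () -> processNatively(SCRATCH.get(), input);
                futures.add(pool.submit(call));
            }
            long sum = 0;
            for (Future<Integer> f : futures) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A direct ByteBuffer is used because JNI code can address its memory without copying; a plain array or object would serve the same purpose in pure Java.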

Amalita answered 20/8, 2012 at 19:36 Comment(5)

Any suggestions what I actually can do to improve the scalability? – Ruffian 21/8, 2012 at 9:43

You need to work out what your bottleneck is. I suggest trying some simple micro-benchmarks to work out the usable cache size of your system and your CPU-memory bandwidth. You have to work within the limitations of your hardware, and it's handy to know what they are. – Amalita 21/8, 2012 at 9:45

I would write a simple program which doesn't use floating point, cache or memory. Ensure you can get this to scale across all your CPUs as you might expect. Then introduce floating point (if you are using that) etc until you see the limitation you do here. – Amalita 21/8, 2012 at 9:46

The native library is actually "bliss", which is a graph canonization algorithm. So I don't expect any floating point operations there. I have a simple toy example that scales with no problem. I'll try to make it use more memory to see how the scalability is affected. Nevertheless, I don't have access to the native code, so I'm not quite sure how this experiment will help me solve the problem... Thanks for your help anyway... – Ruffian 21/8, 2012 at 10:14

You may need to reduce the available memory per thread, i.e., if your combined memory usage exceeds your cache size, you are likely to be limited by the rate at which data can be shuffled in and out of the cache. – Amalita 21/8, 2012 at 10:16
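The cache-size micro-benchmark suggested in the comments above could look roughly like the following Java sketch: it walks int arrays of growing size and reports the time per access, which tends to jump once the working set no longer fits in a cache level. The class name WorkingSetProbe, the 64-byte stride and the size range are assumptions. Hardware prefetching can hide much of the effect for a regular stride like this one, so treat the numbers as a rough indication only.

```java
import java.util.Arrays;

class WorkingSetProbe {
    // Written to so the JIT cannot discard the loop.
    static volatile long blackhole;

    static int[] ones(int length) {
        int[] data = new int[length];
        Arrays.fill(data, 1);
        return data;
    }

    // Read `accesses` elements, stepping 16 ints (64 bytes, a common
    // cache line size; an assumption) and wrapping around at the end.
    // The array must hold at least 16 ints.
    static long walk(int[] data, long accesses) {
        long sum = 0;
        int idx = 0;
        for (long i = 0; i < accesses; i++) {
            sum += data[idx];
            idx += 16;
            if (idx >= data.length) {
                idx -= data.length;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        final long accesses = 20_000_000L;
        // Working sets from 16 KB to 64 MB, growing by a factor of 4.
        for (int kb = 16; kb <= 64 * 1024; kb *= 4) {
            int[] data = ones(kb * 256);          // 256 ints per KB
            blackhole = walk(data, accesses);     // warm-up pass
            long start = System.nanoTime();
            blackhole = walk(data, accesses);
            double ns = (double) (System.nanoTime() - start) / accesses;
            System.out.printf("%6d KB: %.2f ns/access%n", kb, ns);
        }
    }
}
```

Running several copies of the walk in parallel threads, each on its own array, would show how the combined working set interacts with the shared cache levels.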
