Java is scaling much worse than C# over many cores?
Asked Answered
A

4

16

I am testing spawning off many threads running the same function on a 32 core server for Java and C#. I run the application with 1000 iterations of the function, which is batched across either 1,2,4,8, 16 or 32 threads using a threadpool.

At 1, 2, 4, 8 and 16 concurrent threads Java is at least twice as fast as C#. However, as the number of threads increases, the gap closes and by 32 threads C# has nearly the same average run-time, but Java occasionally takes 2000ms (whereas both languages are usually running about 400ms). Java is starting to get worse with massive spikes in the time taken per thread iteration.

EDIT This is Windows Server 2008

EDIT2 I have changed the code below to show using the Executor Service threadpool. I have also installed Java 7.

I have set the following optimisations in the hotspot VM:

-XX:+UseConcMarkSweepGC -Xmx 6000

but it still hasnt made things any better. The only difference between the code is that im using the below threadpool and for the C# version we use:

http://www.codeproject.com/Articles/7933/Smart-Thread-Pool

Is there a way to make the Java more optimised? Perhaos you could explain why I am seeing this massive degradation in performance?

Is there a more efficient Java threadpool?

(Please note, I do not mean by changing the test function)

import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class PoolDemo {

    static long FastestMemory = 2000000;
    static long SlowestMemory = 0;
    static long TotalTime;
    static int[] FileArray;
    static DataOutputStream outs;
    static FileOutputStream fout;
    static Byte myByte = 0;

  public static void main(String[] args) throws InterruptedException, FileNotFoundException {

        int Iterations = Integer.parseInt(args[0]);
        int ThreadSize = Integer.parseInt(args[1]);

        FileArray = new int[Iterations];
        fout = new FileOutputStream("server_testing.csv");

        // fixed pool, unlimited queue
        ExecutorService service = Executors.newFixedThreadPool(ThreadSize);
        ThreadPoolExecutor executor = (ThreadPoolExecutor) service;

        for(int i = 0; i<Iterations; i++) {
          Task t = new Task(i);
          executor.execute(t);
        }

        for(int j=0; j<FileArray.length; j++){
            new PrintStream(fout).println(FileArray[j] + ",");
        }
      }

  private static class Task implements Runnable {

    private int ID;

    public Task(int index) {
      this.ID = index;
    }

    public void run() {
        long Start = System.currentTimeMillis();

        int Size1 = 100000;
        int Size2 = 2 * Size1;
        int Size3 = Size1;

        byte[] list1 = new byte[Size1];
        byte[] list2 = new byte[Size2];
        byte[] list3 = new byte[Size3];

        for(int i=0; i<Size1; i++){
            list1[i] = myByte;
        }

        for (int i = 0; i < Size2; i=i+2)
        {
            list2[i] = myByte;
        }

        for (int i = 0; i < Size3; i++)
        {
            byte temp = list1[i];
            byte temp2 = list2[i];
            list3[i] = temp;
            list2[i] = temp;
            list1[i] = temp2;
        }

        long Finish = System.currentTimeMillis();
        long Duration = Finish - Start;
        TotalTime += Duration;
        FileArray[this.ID] = (int)Duration;
        System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");


        if(Duration < FastestMemory){
            FastestMemory = Duration;
        }
        if (Duration > SlowestMemory)
        {
            SlowestMemory = Duration;
        }
    }
  }
}
Arcadia answered 4/4, 2012 at 12:20 Comment(13)
What OS? Windows? I figure it probably is, if you were using Mono you'd probably say, but still...Omora
Why do you try to compare the performance on these two languages? The main target of the java language are the enterprise applications, if you need performance just use c++.Mountford
Did you use the client or server VM for Java?Warplane
How many cores is your application actually using?Eleazar
-Xmx 6000 is an illegal option (there shouldn't be a space, and the size without suffix is in bytes and should be in multiples of 1024, and be larger than 2 MB), so what are you really using?Heinrick
@Warplane I did not realise there were different versions. I got the vm with eclipse 64 bit?Arcadia
You have serious thread safety issues in your tasks (writing to static variables, potential overwriting values and or having interleaved writes as you are using long).You may want to process the results of individual tasks after they are done on the main thread.Heinrick
@Mark, the function was just something to test the memory capability on our server. We believe we may have memory bandwidth issues. Our library is written in C# so we're first benchmarking it against Java to work eliminate.Arcadia
@Eleazar 4 physical cores, 4x8 overall coresArcadia
where did you get that java threadpool from? it might make sense to try another one or create your threads yourself to eliminate any possible problems with it.Untidy
@Christian, I have changed the code to above (found another example online): bobah.net/d4d/source-code/misc/…Arcadia
@Mark's issue directly affects your testing, because your thread safety issues are around the actual variables you are using to measure performance. I updated my answer below to address this.Refute
You need to put executor.shutdown() at the end of your code or else your program will never terminate.Refute
R
19

Summary

Below are the original response, update 1, and update 2. Update 1 talks about dealing with the race conditions around the test statistic variables by using concurrency structures. Update 2 is a much simpler way of dealing with the race condition issue. Hopefully no more updates from me - sorry for the length of the response but multithreaded programming is complicated!

Original Response

The only difference between the code is that im using the below threadpool

I would say that is an absolutely huge difference. It's difficult to compare the performance of the two languages when their thread pool implementations are completely different blocks of code, written in user space. The thread pool implementation could have enormous impact on performance.

You should consider using Java's own built-in thread pools. See ThreadPoolExecutor and the entire java.util.concurrent package of which it is part. The Executors class has convenient static factory methods for pools and is a good higher level interface. All you need is JDK 1.5+, though the newer, the better. The fork/join solutions mentioned by other posters are also part of this package - as mentioned, they require 1.7+.

Update 1 - Addressing race conditions by using concurrency structures

You have race conditions around the setting of FastestMemory, SlowestMemory, and TotalTime. For the first two, you are doing the < and > testing and then the setting in more than one step. This is not atomic; there is certainly the chance that another thread will update these values in between the testing and the setting. The += setting of TotalTime is also non-atomic: a test and set in disguise.

Here are some suggested fixes.

TotalTime

The goal here is a threadsafe, atomic += of TotalTime.

// At the top of everything
import java.util.concurrent.atomic.AtomicLong;  

...    

// In PoolDemo
static AtomicLong TotalTime = new AtomicLong();    

...    

// In Task, where you currently do the TotalTime += piece
TotalTime.addAndGet (Duration); 

FastestMemory / SlowestMemory

The goal here is testing and updating FastestMemory and SlowestMemory each in an atomic step, so no thread can slip in between the test and update steps to cause a race condition.

Simplest approach:

Protect the testing and setting of the variables using the class itself as a monitor. We need a monitor that contains the variables in order to guarantee synchronized visibility (thanks @A.H. for catching this.) We have to use the class itself because everything is static.

// In Task
synchronized (PoolDemo.class) {
    if (Duration < FastestMemory) {
        FastestMemory = Duration;
    }

    if (Duration > SlowestMemory) {
        SlowestMemory = Duration;
    }
}

Intermediate approach:

You may not like taking the whole class for the monitor, or exposing the monitor by using the class, etc. You could do a separate monitor that does not itself contain FastestMemory and SlowestMemory, but you will then run into synchronization visibility issues. You get around this by using the volatile keyword.

// In PoolDemo
static Integer _monitor = new Integer(1);
static volatile long FastestMemory = 2000000;
static volatile long SlowestMemory = 0;

...

// In Task
synchronized (PoolDemo._monitor) {
    if (Duration < FastestMemory) {
        FastestMemory = Duration;
    }

    if (Duration > SlowestMemory) {
        SlowestMemory = Duration;
    }
}

Advanced approach:

Here we use the java.util.concurrent.atomic classes instead of monitors. Under heavy contention, this should perform better than the synchronized approach. Try it and see.

// At the top of everything
import java.util.concurrent.atomic.AtomicLong;    

. . . . 

// In PoolDemo
static AtomicLong FastestMemory = new AtomicLong(2000000);
static AtomicLong SlowestMemory = new AtomicLong(0);

. . . . .

// In Task
long temp = FastestMemory.get();       
while (Duration < temp) {
    if (!FastestMemory.compareAndSet (temp, Duration)) {
        temp = FastestMemory.get();       
    }
}

temp = SlowestMemory.get();
while (Duration > temp) {
    if (!SlowestMemory.compareAndSet (temp, Duration)) {
        temp = SlowestMemory.get();
    }
}

Let me know what happens after this. It may not fix your problem, but the race condition around the very variables that track your performance is too dangerous to ignore.

I originally posted this update as a comment but moved it here so that I would have room to show code. This update has been through a few iterations - thanks to A.H. for catching a bug I had in an earlier version. Anything in this update supersedes anything in the comment.

Last but not least, an excellent source covering all this material is Java Concurrency in Practice, the best book on Java concurrency, and one of the best Java books overall.

Update 2 - Addressing race conditions in a much simpler way

I recently noticed that your current code will never terminate unless you add executorService.shutdown(). That is, the non-daemon threads living in that pool must be terminated or else the main thread will never exit. This got me to thinking that since we have to wait for all threads to exit, why not compare their durations after they finished, and thus bypass the concurrent updating of FastestMemory, etc. altogether? This is simpler and could be faster; there's no more locking or CAS overhead, and you are already doing an iteration of FileArray at the end of things anyway.

The other thing we can take advantage of is that your concurrent updating of FileArray is perfectly safe, since each thread is writing to a separate cell, and since there is no reading of FileArray during the writing of it.

With that, you make the following changes:

// In PoolDemo
// This part is the same, just so you know where we are
for(int i = 0; i<Iterations; i++) {
    Task t = new Task(i);
    executor.execute(t);
}

// CHANGES BEGIN HERE
// Will block till all tasks finish. Required regardless.
executor.shutdown();
executor.awaitTermination(10, TimeUnit.SECONDS);

for(int j=0; j<FileArray.length; j++){
    long duration = FileArray[j];
    TotalTime += duration;

    if (duration < FastestMemory) {
        FastestMemory = duration;
    }

    if (duration > SlowestMemory) {
        SlowestMemory = duration;
    }

    new PrintStream(fout).println(FileArray[j] + ",");
}

. . . 

// In Task
// Ending of Task.run() now looks like this
long Finish = System.currentTimeMillis();
long Duration = Finish - Start;
FileArray[this.ID] = (int)Duration;
System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");

Give this approach a shot as well.

You should definitely be checking your C# code for similar race conditions.

Refute answered 4/4, 2012 at 12:41 Comment(5)
Just used an Executor service and i'm still getting the same timings. I based my code on this: bobah.net/d4d/source-code/misc/…Arcadia
You have race conditions around the setting of FastestMemory, SlowestMemory, and TotalTime. For the first two, you are doing < and > testing and then setting in more than one step. This is not atomic, there is certainly the chance that another thread will update these values in between the testing and the setting. So wrap the < test and setting in a synchronized block - one for fastest and one for slowest. Meanwhile, the += setting of TotalTime is also non-atomic. Either synchronize that operation or try an AtomicLong.Refute
Using synchronized (PoolDemo.monitor) will do two things: First: You serialize threads - that's OK and intended. Second: The JVM guarantees that leaving the synchronized block the new state of the monitor will be visible to other CPUs. There is no guarantee that other variables are visible to other CPUs. In your case there is no guarantee for FastestMemory & co. You either must synchronize on them too or stuff them into the monitor object or use AtomicXYZ for them.Madge
Argh, you're absolutely right. This was all because I wanted avoid synchronizing on the class itself (since everything is static). The synchronization solution could also work if I made them volatile - that would solve the visibility issue. Anyway, time to edit my post - thanks for catching this!Refute
@sparc_spread: Regarding "Update 2": A even simpler solution would be to abandon Runnable, use a Callable instead which returns all relevant data. This eliminates all synchronization in the user's code.Madge
M
5

...but Java occasionally takes 2000ms...

And

    byte[] list1 = new byte[Size1];
    byte[] list2 = new byte[Size2];
    byte[] list3 = new byte[Size3];

The hickups will be the garbage collector cleaning up your arrays. If you really want to tune that I suggest you use some kind of cache for the arrays.

Edit

This one

   System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");

does one or more synchronized internally. So your highly "concurrent" code will be serialized quite good at this point. Just remove it and retest.

Madge answered 8/4, 2012 at 12:43 Comment(3)
+1 I agree that this could be a factor. One interesting experiment would be to change the list* variables from local variables of run() into instance variables of Task. Then, keep each Task t in a List when it is instantiated. The goal here is to keep the list* blocks in the reference graph until the very end of the program. This would mean no gc of them till the end. It would be interesting to see if the times became more deterministic at that point. Of course you'd also have to make sure the heap is big enough to do this.Refute
@Refute I would expect, that the GC tries to free some memory, constantly fails and then allocates new RAM from the OS. This is for sure not the fastest way. So what value would such a datapoint provide? IMO none.Madge
It would be interesting to distinguish between the cost of actual successful collection vs. the cost of the thrashing you describe, which I agree probably would start happening in the scenario I set up.Refute
S
4

While @sparc_spread's answer is great, another thing I've noticed is this:

I run the application with 1000 iterations of the function

Notice that the HotSpot JVM is working on interpreted mode for the first 1.5k iterations of any function on client mode, and for 10k iterations on server mode. Computers with that many cores are automatically considered "servers" by the HotSpot JVM.

That would mean that C# would do JIT (and run in machine code) before Java does, and has a chance for better performance at the function runtime. Try increasing the iterations to 20,000 and start counting from 10k iteration.

The rationale here is that the JVM collects statistical data for how to do JIT best. It trusts that your function is going to be run a lot through time, so it takes a "slow bootstrapping" mechanism for a faster runtime overall. Or in their words "20% of the functions run 80% of the time", so why JIT them all?

Shaff answered 8/4, 2012 at 4:48 Comment(5)
Does that mean a function that only runs once, but has a tight loop with millions of iterations doesn't get JITed? That sounds like a strange choice.Partake
+1 I think you are onto something here. It's always good to have a "warm-up" time with these sorts of tests and with the JIT behavior you describe it seems essential. I'm pretty sure your answer + @A.H.'s answer re gc behavior account for most of what's going on. I'm just trying to fix the race conditions but I'm not sure they're the cause of what the original poster is describing.Refute
@CodeInChaos that's actually a very deliberate choice. JIT is costly; therefore, they do it for the 20% of the code that does run (all that bootstrapping code? No need to JIT it). Also, because they take their time, they get the chance to check for statistics on how the function is used, and JIT it even better.Shaff
@AviadBenDov I didn't criticize the basic idea, but I think that method granularity is too coarse. It would make more sense to JIT the method once a piece of code in has been executed n times, and not once the method has been called n times. Else methods which contain a tight loop(which takes long to complete) but are called only a few times won't get JITed.Partake
@CodeInChaos I see what you mean now. You're right in that, but as much as I understand it, JIT can only be performed on whole methods - therefore the restriction. In my opinion though, I can only imagine badly formed code when I try to think of a situation where 80% of the application time it's running one method with a really long loop and no in-loop method calls.. :-)Shaff
M
2

Are you using java6? Java 7 comes with features to improve performance in parallel programing:

http://www.oracle.com/technetwork/articles/java/fork-join-422606.html

Mindimindless answered 4/4, 2012 at 12:26 Comment(3)
Im just going to quickly install version 7Arcadia
With java 7 you can use the fork framework to improve the performance. Here there is a stackoverflow thread that talks about it #7927364Mindimindless
Current version on the website is 1.7.0_3 but that might have been pretty recent(I think the earlier versions of java 7 had a few problems, so they may not have put them up for direct download).Tutt

© 2022 - 2024 — McMap. All rights reserved.