Java concurrency: executing many "infinite" tasks with few threads

Asked 5/8, 2013 at 13:16 Answered 12/2, 2014 at 16:3

I'm building a (concurrent) simulator for a set of N particles that are moving in a space according to the Newton's laws. My idea is model each particle as a task, which interacts with other particles (tasks) in order to get their positions and masses in order to calculate the net force it is subject to. Each particle-task is something as

while(true){
   force = thisParticle.calculateNetForce(allTheParticles);
   thisParticle.waitForAllTheParticlesToCalculateNetForce(); // synchronization
   thisParticle.updatePosition(force);
   thisParticle.waitForAllTheParticlesToUpdateTheirState(); // synchronization
}

I can have a lot of particles (100 or more), so I can't create such a number of Java threads (which are mapped to physical threads). My idea is to use Runtime.getRuntime().availableProcessors()+1 threads onto which the many tasks can be executed.

However, I can't use a FixedThreadExecutor because the particle-tasks does not end. I would like to use a FixedThreadExecutor which must be also able to perform a sort of scheduling internally. Do you know something for this purpose?

Or, could you suggest me better approaches for modelling such a system by a point of view of concurrency (e.g. a different task decomposition) ?

P.s.: I am limited to "classical" concurrency mechanisms, not including actors or similar architectures.

Dabbs answered 5/8, 2013 at 13:16 Comment(10)

If thisParticle.waitForAllTheParticlesToCalculateNetForce(); effectively waits for something (via an actual wait or a CountdownLatch/CyclicBarrier/Phaser etc.), the thread in which that method is run will be returned to the pool and be available to other tasks. Not sure I understand why your FixedThreadPool approach would not work. – Zobias 5/8, 2013 at 13:19

What your trying to do is "Agent based Modelling", could be worth adding the tag. – Taco 5/8, 2013 at 13:22

You also can create that many threads since thread is an abstraction beyond processor cores and work even on single core machines. It's a bad idea to have 100 or more threads like that though. – Concerto 5/8, 2013 at 13:25

@Zobias Such method actually performs a await() on a CyclicBarrier instance. However, the system gets stuck as no other particle-tasks are executed. Only the first PROCESSORS+1 tasks get executed. – Dabbs 5/8, 2013 at 13:36

Then you probably have a problem in your code (e.g. you are holding a lock that prevents other threads from progressing when you should not). – Zobias 5/8, 2013 at 14:11

@Zobias I don't think so. Take a look at: pastebin.com/rpVDZtML – Dabbs 5/8, 2013 at 14:42

@RobertoCasadei await is a non blocking call: the thread becomes idle until all parties have called await(). So unless you hold a lock when calling await, the other tasks should be able to use that idle thread. It might be worth posting some of your code. – Zobias 5/8, 2013 at 14:44

@RobertoCasadei Your posted code needs at least 10 (10 passed as the constructor arg) threads to work. await will wait/block until all parties are awaiting. In your case, only two threads will trip the barrier and wait indefinitely. Once all ten threads invoke await then you the rest of the method will continue. – Asbestos 5/8, 2013 at 15:54

@JohnVint That was exactly what I wanted to show, i.e. that waiting on a barrier does not result in the current task being suspended and another task being executed on the freed thread. My issue is that I want to have few physical threads but a lot of long-running tasks; an Executor with scheduling facilities would do the job. – Dabbs 5/8, 2013 at 18:25

@RobertoCasadei I figure that is what you were implying just wanted to clarify. By means of divide and conquer the fork join is the best way of having Java handle these tasks for you (in my opinion). – Asbestos 5/8, 2013 at 18:29

The biggest killer for performance is likely to be the thread safety checks you perform to ensure all the particles are interacting in a thread safe manner. I suggest you use one thread per core and try to minimise the interaction between threads. This can be done by dividing your space into threads e.g. half X, half Y, half Z divides the space into 8. You cal look at all the interactions in each space concurrently and independently and you only need to worry when a particle passed from one space/thread to another.

Kleist answered 5/8, 2013 at 13:21 Comment(7)

I don't think such partitioning is an option with gravity modeling. – Delorenzo 5/8, 2013 at 13:31

I am pretty sure the net mass and direction for particle outside a region is the sum of the masses and locations in any given space. It gets more complicated for particles inside a region. – Kleist 5/8, 2013 at 13:37

it is not necessary to divide particles by their location, better just split then initially into constant sets of equal size, each set processed by one thread. Indeed, the greatest optimization is to sum gravity from each set, replacing each set with one particle with summarized mass, located at the center of gravity. – Nimbus 23/8, 2013 at 16:14

@AlexeiKaigorodov The reason for splitting the particle is if you want them to have collisions. If you don't you can divide them as you suggest. – Kleist 24/8, 2013 at 7:27

Since, verifying collisions normally take n^2 calculations. Dividing the space in 8 is a good idea. However, essentially you have still O(n^2) problem. This problem claims badly for a matrix approaching. – Infinite 12/2, 2014 at 15:34

@Infinite This is true, however the constant is much lower making it 8 - 80x faster. (The extra 10x comes from not needing locking around the accesses and working in the L1 cache instead of a shared L3 cache) – Kleist 12/2, 2014 at 17:6

@peterlawrey, your idea (or using things like BSP) is specially good with few objects (dozens or maybe until few hundred). When you have a huge numbers of object, I should use ADDITIONALLY some other techniques. 10m objects simulation: youtube.com/watch?v=Qve54Z71VYU – Infinite 12/2, 2014 at 17:53

I would assume you are storing all your particles in maybe an array of 2-dimensiional array? This would be a great candidate for the Fork-Join Framework.

You would split the calculation of portions of the array into smaller portions. You keep splitting until at a certain size. Finally you calculate and return. The returned value will then be joined and calculated with other the other side of the tree.

Asbestos answered 5/8, 2013 at 13:20 Comment(0)

Rather than a thread per particle, I would create an ExecutorService with an appropriate number of threads. I would keep the particles in a list (or some other type of collection). I would create separate pieces of work to be executed (either as Runnable or Callable) for each particle's calculate and update steps. When you submit a piece of work to the executor, you get back a Future. Put these futures in a collection. After you have submitted all the pieces of work that you want to run in parallel, you iterate over your list of futures and call get() on each one to implement your synchronization steps.

You might end up creating a little POJO to associate a particle and it's calculated force (or stash the calculated force in the particle instance).

Croix answered 5/8, 2013 at 13:53 Comment(2)

It is an approach I've considered. However, it seemed to me less effective by a conceptual point of view (as it's quite direct to think at a mapping between the particles and concurrent activities -- not threads, definitely), and I noticed that its implementation requires the creation of a big number of objects and a lot of method invocations. – Dabbs 5/8, 2013 at 14:11

Yes, you have to balance the advantages of a controllable number of threads, adequate concurrency to keep multiple cores busy, garbage collection overhead, etc. In my experience, the number of method invocations is a performance concern. Given the n-squared nature of your calculation and the fact that the garbage overhead of the "extra" objects grows linearly with the number of particles, this seems like a good trade-off in this case. – Croix 5/8, 2013 at 14:44

Why don't you do the calculations in discrete steps ?

while(true){


for(Particle p : allParticles){
   force = p.calculateNetForce(allParticles);   
   p.setNextPosition(force); //Remembers, but doesn't change the current position
}

for(Particle p : allParticles){
    p.nextState(); //Change the position
}

}

First calculate the force for each particle, but don't change its current state. After you've calculated it for every particle, then update its internal state according to your previous calculations. In this way even a single thread will be enough, and of course you can split up the calculations across multiple threads but you'll need additional synchronization

JAVA 8 UPDATE

Using Java 8 you can take advantage of multi-core systems, while not having to take care of threads, synchronization etc.

 while(true){
       allParticles.parallelStream().forEach(p -> {
           double force = p.calculateNetForce(allParticles);
           p.setNextPosition(force)
       });

       allParticles.parallelStream().forEach(p ->   p.nextState());      
 }

Wernher answered 5/8, 2013 at 14:17 Comment(7)

Indeed, my issues come from the need to identify exploitable concurrency. – Dabbs 5/8, 2013 at 14:26

Then just create N threads and split the work between them! Put all the Particle objects in a ConcurrentQueue and for each thread: code while(true){ while(!queue.isEmpty()){ 1.get a particle 2.calculate force } 3.update state //sync, or swap queues }` – Wernher 5/8, 2013 at 15:4

The question, in fact, is about how to split the work by a technical/conceptual point of view. – Dabbs 5/8, 2013 at 15:9

Well, my idea is to have 2 concurrent queues (CQ) and a Set containing all your Particle objects (PO) and N threads. At first the first CQ (CQ1) also contains all the PO. Then each thread starts taking POs from CQ1, calculates the net force, and sets the next state (position) without modifying the current PO's state, then pushes the PO on the CQ2. When CQ1 becomes empty, the thread which last accessed it sets a flag, and then all the threads stat taking POs from CQ2 and invoke po.nextState() and then push it on CQ1. When CQ2 becomes empty, everything starts again from the beginning... – Wernher 5/8, 2013 at 18:48

another more efficient approach will be not to move the POs from CQ1 to CQ2 and then from CQ2 to CQ1, but swap CQ1 and CQ2's places, but then you will have to have an external sync mechanism (a simple boolean flag will do) and a mechanism to choose which state to access (using the external boolean flag) – Wernher 5/8, 2013 at 18:55

@svetlin-zarev, you are right. The main problem is how to split the problem in small task. Using OpenCL, Fork n Join, etc it doesn't matter because you need in fact a smart way to split the problem. Agent or Actor model it is a horrible idea in this situation. – Infinite 12/2, 2014 at 15:18

It's been a long time since I answered that question. Now in Java 8 there are lambdas and stream-API and it's very easy to take advantage om multi-core systems. For example he can use allParticles.paralellSream().forEach(p -> p.calculateNetForce(allTheParticles)) which will take advantage of all CPUs, while he won't need to manually take care of threads, sychronization, etc – Wernher 12/2, 2014 at 17:58

For each particle, you call calculateNetForce(allTheParticles), which I suppose makes your computations proportional to O(N^2) (the square of the number of all particles). This is the main performance killer, and you better find an algorithm with complexity of O(N), and only then try to parallelize. Off the top of my head, I can propose to first calculate the sum mass and the center of gravity for all particles. Then, for each particle, calculate mass and the center of the rest of particles. This can be done with taking the total mass and center, and adding a "hole" with negative mass instead of the current particle. Then calculate the force between the particle and the rest of them. The calculations for each particle are independent and can be parallelized with any of ways proposed by other commenters.

Nimbus answered 23/8, 2013 at 16:30 Comment(0)

The particle should be itself Runnable and Callable, this will allow you to avoid creating a lot of extra objects and synchronize different steps. Here is a SSCCEE:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Particle implements Callable<Void> {

  private enum ParticleState {
    POSITION_UPDATED, FORCE_CALCULATED
  }

  private int id;
  private int calculatedForce;
  private ParticleState particleState = ParticleState.POSITION_UPDATED;
  private List<Particle> allTheParticles;

  public Particle(int id, List<Particle> allTheParticles) {
    this.id = id;
    this.allTheParticles = allTheParticles;
  }

  private void calculateNetForce() {
    System.out.println("calculation in " + id);
    String someIntenseOperation = "";
    for (int i = 0; i < 10000; i++) {
      someIntenseOperation += allTheParticles.size();
    }
    calculatedForce = 0;
    particleState = ParticleState.FORCE_CALCULATED;
  }

  private void updatePosition() {
    System.out.println("updating position of " + id);
    particleState = ParticleState.POSITION_UPDATED;
  }

  @Override
  public Void call() throws Exception {
    switch (particleState) {
      case FORCE_CALCULATED:
        updatePosition();
        break;
      case POSITION_UPDATED:
        calculateNetForce();
        break;
    }
    return null;
  }

  public static void main(String[] args) throws InterruptedException {
    final ExecutorService executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() + 1);
    final List<Particle> allTheParticles = new ArrayList<>();
    for (int i = 0; i < 20; i++) {
      allTheParticles.add(new Particle(i, allTheParticles));
    }
    while (true) {
      executor.invokeAll(allTheParticles);
      executor.invokeAll(allTheParticles);
    }
  }
}

Pilocarpine answered 5/8, 2013 at 14:43 Comment(2)

How do you ensure that all the particles are synchronized with respect to their state? – Dabbs 5/8, 2013 at 14:48

executor.invokeAll does this for you. Try running this example. – Pilocarpine 5/8, 2013 at 14:50

Since, verifying collisions normally take n^2 calculations, dividing the space is a good idea. although, it will be essentially a O(n^2) problem.

This problem claims badly for a matrix approaching (but take a look in Parallel computing to know the best ideas to deal with it) You could use some techniques pointed here: An efficient way to simulate many particle collisions?

Please, note that should be a bad idea to use Actor model because threads will be problematic after certain number.

There's now Java OpenCL lib (ex: Aparapi) and Java 9 should bring openCL natively with Sumatra project. So, you could use Fork and Join lib and JVM will use OpenCL under the hood.

Infinite answered 12/2, 2014 at 16:3 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

JAVA 8 UPDATE

Recommended topics

Hot tags