What causes this performance drop?

I'm using the Disruptor framework for performing fast Reed-Solomon error correction on some data. This is my setup:

          RS Decoder 1
        /             \
Producer-     ...     - Consumer
        \             /
          RS Decoder 8 
  • The producer reads blocks of 2064 bytes from disk into a byte buffer.
  • The 8 RS decoder consumers perform Reed-Solomon error correction in parallel.
  • The consumer writes files to disk.

In Disruptor DSL terms, the setup looks like this:

        RsFrameEventHandler[] rsWorkers = new RsFrameEventHandler[numRsWorkers];
        for (int i = 0; i < numRsWorkers; i++) {
            rsWorkers[i] = new RsFrameEventHandler(numRsWorkers, i);
        }
        disruptor.handleEventsWith(rsWorkers)
                .then(writerHandler);

When I don't have a disk output consumer (no .then(writerHandler) part), the measured throughput is 80 M/s. As soon as I add a consumer that is declared as a dependent consumer, performance drops to 50-65 M/s, even if it only writes to /dev/null or doesn't write at all.

I've profiled it with Oracle Mission Control, and this is what the CPU usage graph shows:

Without an additional consumer: [CPU usage graph from Mission Control]

With an additional consumer: [CPU usage graph from Mission Control]

What is this gray part in the graph and where is it coming from? I suppose it has to do with thread synchronisation, but I can't find any other statistic in Mission Control that would indicate any such latency or contention.

Kuhlman answered 20/2, 2015 at 13:53 Comment(3)
That depends on which tool you are using. It says application + kernel; I'm assuming open file descriptors or something along those lines.Cuomo
@Cuomo could you elaborate please? Are you saying that open file descriptors are using up to 20% CPU?Celery
I don't know this framework but couldn't that be due to the .then() method polling to see whether workers are done?Asante

Your hypothesis is correct: it is a thread synchronization issue.

From the API Documentation for EventHandlerGroup<T>.then (Emphasis mine)

Set up batch handlers to consume events from the ring buffer. These handlers will only process events after every EventProcessor in this group has processed the event.

This method is generally used as part of a chain. For example if the handler A must process events before handler B:
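A minimal sketch of that chain (handlerA and handlerB are illustrative placeholders, not names from your code):

    // handlerB will only process an event once handlerA has finished with it.
    disruptor.handleEventsWith(handlerA)
             .then(handlerB);

In your setup, writerHandler plays the role of B and is gated on all eight rsWorkers.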

This will necessarily decrease throughput. Think of it like a funnel:

[Event funnel diagram: all EventProcessors feed into the single dependent consumer]

The consumer has to wait for every EventProcessor to be finished before it can proceed through the bottleneck.

Vasti answered 20/2, 2015 at 15:17 Comment(3)
The thing is that, although it may seem that I have 8 RS decoder handlers, only one of them actually processes any given event; the others just pass it through. This is how I achieve parallel processing. I do this as described in the answer to "How do you arrange a Disruptor with multiple consumers so that each event is only consumed once?" here github.com/LMAX-Exchange/disruptor/wiki/…Celery
@Kuhlman It doesn't matter; there's still going to be a blocking wait as it passes the event from one Processor to the other.Vasti
My suspicion is that the Consumer is slow, so the RB is filling up and causing waits for the Producer writing to the RB. Although the consumer waits for each EventProcessor to finish, I'm not clear from your explanation how the consumer waiting would slow down the EventProcessors.Mandiemandingo

I can see two possibilities here, based on what you've shown; you might be affected by one or both, so I'd recommend testing for both: 1) an IO processing bottleneck, and 2) contention from multiple threads writing to the same buffer.

IO processing

From the data shown, you have stated that as soon as you enable the IO component, your throughput decreases and kernel time increases. This could quite easily be IO wait time while your consumer thread is writing: the context switch to perform a write() call is significantly more expensive than doing nothing, and your decoders are now capped at the maximum speed of the consumer. To test this hypothesis, you could remove the write() call; in other words, open the output file and prepare the output for writing, but never actually issue the write call.

Suggestions

  • Try removing the write() call in the Consumer and see if it reduces kernel time.
  • Are you writing to a single flat file sequentially? If not, try this.
  • Are you using smart batching (i.e. buffering until the endOfBatch flag and then writing in a single batch) to ensure that the IO is bundled up as efficiently as possible? A minimal sketch follows this list.
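To illustrate that smart-batching suggestion, here is a minimal sketch of a writer that defers the actual IO until the end of each batch. FrameEvent and its data field are hypothetical stand-ins, not types from your code:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    import com.lmax.disruptor.EventHandler;

    // Hypothetical event type carrying one decoded frame.
    class FrameEvent {
        byte[] data;
    }

    // Writer that accumulates frames in memory and only touches the file once
    // per Disruptor batch, signalled by the endOfBatch flag.
    class BatchingWriterHandler implements EventHandler<FrameEvent> {
        private final ByteArrayOutputStream pending = new ByteArrayOutputStream();
        private final OutputStream out;

        BatchingWriterHandler(OutputStream out) {
            this.out = out;
        }

        @Override
        public void onEvent(FrameEvent event, long sequence, boolean endOfBatch) throws IOException {
            pending.write(event.data);        // buffer in memory
            if (endOfBatch) {                 // one write() per batch, not per event
                pending.writeTo(out);
                out.flush();
                pending.reset();
            }
        }
    }

This keeps the number of kernel transitions proportional to batches rather than to events.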

Contention on multiple writers

Based on your description I suspect your Decoders are reading from the disruptor and then writing back to the very same buffer. This is going to cause issues with multiple writers, i.e. contention on the CPUs writing to memory. One thing I would suggest is to have two disruptor rings:

  1. Producer writes to #1
  2. Decoder reads from #1, performs RS decode and writes the result to #2
  3. Consumer reads from #2, and writes to disk

Assuming your RBs are sufficiently large, this should result in good clean walking through memory.

The key here is not having the Decoder threads (which may be running on a different core) write to the same memory that was just owned by the Producer. With only 2 cores doing this, you will probably see improved throughput unless the disk speed is the bottleneck.
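A rough sketch of that two-ring arrangement, with hypothetical event and handler names (the real setup would also shard the decode work across your 8 handlers, and older Disruptor versions take an Executor rather than a ThreadFactory in the constructor):

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public class TwoRingSketch {
        // Minimal placeholder event carrying one frame.
        public static class FrameEvent {
            byte[] data;
        }

        public static void main(String[] args) {
            // Ring #1: producer -> decoders (raw 2064-byte blocks from disk).
            Disruptor<FrameEvent> inputRing = new Disruptor<>(
                    FrameEvent::new, 1 << 14, DaemonThreadFactory.INSTANCE);
            // Ring #2: decoders -> disk writer (corrected frames).
            Disruptor<FrameEvent> outputRing = new Disruptor<>(
                    FrameEvent::new, 1 << 14, DaemonThreadFactory.INSTANCE);

            // Decoders consume from ring #1 and publish results into ring #2,
            // so they never write back into memory the producer just owned.
            EventHandler<FrameEvent> decoder = (event, sequence, endOfBatch) ->
                    outputRing.publishEvent((out, seq) -> out.data = rsDecode(event.data));
            inputRing.handleEventsWith(decoder);

            // The writer is the sole consumer of ring #2.
            EventHandler<FrameEvent> writer = (event, sequence, endOfBatch) ->
                    writeToDisk(event.data);
            outputRing.handleEventsWith(writer);

            outputRing.start();
            inputRing.start();
        }

        private static byte[] rsDecode(byte[] raw) { return raw; }        // stand-in for RS decode
        private static void writeToDisk(byte[] corrected) { /* stand-in for file IO */ }
    }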

I have a blog article which describes in more detail how to achieve this, including sample code: http://fasterjava.blogspot.com.au/2013/04/disruptor-example-udp-echo-service-with.html

Other thoughts

  • It would also be helpful to know what WaitStrategy you are using, how many physical CPUs are in the machine, etc.
  • You should be able to significantly reduce CPU utilisation by moving to a different WaitStrategy, given that your biggest latency will be IO writes (a sketch follows this list).
  • Assuming you are using reasonably new hardware, you should be able to saturate the IO devices with only this setup.
  • You will also need to make sure the files are on different physical devices to achieve reasonable performance.
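For the WaitStrategy point above, here is a minimal sketch of plugging one in at construction time (event type and ring size are placeholders, and older Disruptor versions take an Executor instead of a ThreadFactory):

    import com.lmax.disruptor.BlockingWaitStrategy;
    import com.lmax.disruptor.YieldingWaitStrategy;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;
    import com.lmax.disruptor.util.DaemonThreadFactory;

    public class WaitStrategyChoice {
        public static class FrameEvent { byte[] data; }   // placeholder event

        public static void main(String[] args) {
            // BlockingWaitStrategy parks waiting threads, trading a little latency
            // for much lower CPU burn -- a reasonable fit when IO dominates.
            Disruptor<FrameEvent> lowCpu = new Disruptor<>(
                    FrameEvent::new, 1 << 14, DaemonThreadFactory.INSTANCE,
                    ProducerType.SINGLE, new BlockingWaitStrategy());

            // YieldingWaitStrategy spins and yields for minimum latency, at the
            // cost of keeping the waiting cores busy.
            Disruptor<FrameEvent> lowLatency = new Disruptor<>(
                    FrameEvent::new, 1 << 14, DaemonThreadFactory.INSTANCE,
                    ProducerType.SINGLE, new YieldingWaitStrategy());
        }
    }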
Mandiemandingo answered 24/2, 2015 at 12:1 Comment(0)
