Optimizing the Akka.NET message dispatcher
I'm currently trying to find the bottlenecks in our message dispatcher for Akka.NET, a port of the Java/Scala actor model framework. For those interested, it can be found here: https://github.com/akkadotnet/akka.net

We scale well up to 8 cores; so far everything seems fine. However, when running on larger machines, everything eventually falls apart. We have tested this on a 16-core machine: it scales nicely up to a certain point, and then message throughput is suddenly halved.

This image is from profiling on my laptop; see https://i.sstatic.net/DxboR.png for the full image.

Is the bottleneck Task.Factory.StartNew, or ConcurrentQueue.Enqueue? I'm not sure I'm reading those numbers right.

Here is a brief description of how our message dispatcher and mailbox work:

Once a message is posted to a mailbox, the mailbox checks whether it is currently processing messages. If it is, it simply lets the currently running Task consume the new message.

So essentially, we post a message to a ConcurrentQueue, and the current mailbox run will find it.

If the mailbox is idle when the message is posted, we schedule a mailbox run using Task.Factory.StartNew(mailboxAction).

To ensure that only one task is running at any given time for a specific mailbox, we use Interlocked checks to see whether the mailbox is busy or idle. The Interlocked checks work; this has been tested extensively, so we know we never start multiple tasks for the same mailbox.
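The mechanism described above can be sketched as follows. Since the verifier here is Java rather than C#, this is a JVM analogue of the described design, not the actual Akka.NET code: ConcurrentLinkedQueue stands in for ConcurrentQueue, AtomicBoolean.compareAndSet for the Interlocked busy/idle check, and an ExecutorService for Task.Factory.StartNew.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Minimal mailbox sketch: at most one consumer task at a time, guarded by CAS.
class Mailbox<T> {
    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false); // false = idle, true = busy
    private final ExecutorService executor;
    private final Consumer<T> handler;

    Mailbox(ExecutorService executor, Consumer<T> handler) {
        this.executor = executor;
        this.handler = handler;
    }

    void post(T message) {
        queue.offer(message);                       // enqueue first...
        if (scheduled.compareAndSet(false, true)) { // ...then schedule only if idle
            executor.execute(this::run);
        }
        // If the CAS fails, a run is already active and will see the new message.
    }

    private void run() {
        T message;
        while ((message = queue.poll()) != null) {
            handler.accept(message);
        }
        scheduled.set(false); // back to idle
        // Close the race: a message may have arrived after poll() returned null
        // but before the flag was cleared; reschedule if so.
        if (!queue.isEmpty() && scheduled.compareAndSet(false, true)) {
            executor.execute(this::run);
        }
    }
}
```

The re-check after clearing the flag is what makes the "failed CAS means the running task will see my message" assumption safe: without it, a message posted in that window could sit in the queue with no run scheduled.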

Any ideas what could cause throughput to collapse on the 16-core machine? The same effect does not occur on smaller machines; those stay stable at max throughput once they cannot scale any further.

One thing I have verified on the 16-core machine is that the mailbox processes messages too fast, depleting the actor's message queue, which forces a fresh scheduling of the mailbox as soon as the next message arrives. That is, the consumer is faster than the producer.
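One hypothetical way to reduce that reschedule churn (this is not something the post describes; SpinningMailbox and its MAX_EMPTY_POLLS knob are invented for illustration) is to let a run spin briefly on an empty queue before flipping back to idle, so that a message arriving a moment later is handled by the existing run instead of paying for a fresh Task.Factory.StartNew. A self-contained Java sketch of that idea:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical mitigation sketch: spin a bounded number of times on an empty
// queue before going idle, trading a little CPU for fewer reschedules when
// the consumer briefly outruns the producer.
class SpinningMailbox<T> {
    private static final int MAX_EMPTY_POLLS = 64; // assumed tuning knob, not from the post

    private final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final ExecutorService executor;
    private final Consumer<T> handler;

    SpinningMailbox(ExecutorService executor, Consumer<T> handler) {
        this.executor = executor;
        this.handler = handler;
    }

    void post(T message) {
        queue.offer(message);
        if (scheduled.compareAndSet(false, true)) {
            executor.execute(this::run);
        }
    }

    private void run() {
        int emptyPolls = 0;
        while (emptyPolls < MAX_EMPTY_POLLS) {
            T message = queue.poll();
            if (message != null) {
                handler.accept(message);
                emptyPolls = 0;      // reset the spin budget after real work
            } else {
                emptyPolls++;
                Thread.onSpinWait(); // Java 9+: hint to the CPU that we're busy-waiting
            }
        }
        scheduled.set(false);
        // Close the race: a message may have landed after the last empty poll.
        if (!queue.isEmpty() && scheduled.compareAndSet(false, true)) {
            executor.execute(this::run);
        }
    }
}
```

Whether the spin budget helps depends on how bursty the producers are; if messages really do arrive slower than one per spin window, it only burns CPU.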

Roband answered 23/3, 2014 at 21:15 Comment(8)
I would not be surprised to find ConcurrentQueue has an implementation optimized for a low core count. In such algorithms, there's often a trade-off between scalability and performance.Valedictorian
I suspect that the task scheduling is not the problem here. I've modified the code so the currently running mailbox never ends its run; it processes messages for as long as the app runs (hogging the current task thread), and exactly the same behavior occurs: it scales well up to a point, then throughput is halved.Roband
So could it be that when many different threads try to post to the same ConcurrentQueue, things start to clog up, even though ConcurrentQueue is supposed to be lock-free?Roband
Can you bundle messages together to reduce the amount of stress on the queue?Valedictorian
Machines with a large number of cores often have a NUMA architecture, like multiple processor chips, each with its own memory bus. Perf starts to tank when the processor interconnect needs to shovel data from one memory bus to another. Such machines really only work well when the cores execute disjoint jobs that don't have a data inter-dependency.Saleswoman
@HansPassant So if an actor running on one core tries to pass messages to another actor (inactive or running), we could get this kind of perf penalty? Assuming a NUMA architecture, that is.Roband
What do you mean it's not lock-free? The impl relies only on Interlocked and no Monitors.Roband
@ThomasJungblut You can look at the implementation here for yourself. I've only skimmed it, but it does look lockless to me.Roslyn
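To make the contention question from the comments concrete: even a lock-free queue serializes producers on its shared tail, because every enqueue is a CAS that may have to retry when another producer got there first. "Lock-free" guarantees system-wide progress, not the absence of contention. A rough Java illustration (not a rigorous benchmark; the printed timing is only indicative, and the helper name is invented):

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Many producers hammering one lock-free queue still contend on the shared
// tail pointer; comparing the printed timings for 1 vs. many producers on a
// many-core box can make the clogging effect visible.
class QueueContentionDemo {
    static long enqueueAll(int threads, int perThread) {
        ConcurrentLinkedQueue<Integer> queue = new ConcurrentLinkedQueue<>();
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    queue.offer(i); // every producer CASes the same tail pointer
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(threads + " producers took " + elapsedMs + " ms");
        return queue.size();
    }
}
```

This also motivates the batching suggestion above: fewer, larger posts mean fewer CAS operations fighting over the same cache line.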
