I/O performance - async vs TPL vs Dataflow vs RX
I have a piece of C# 5.0 code that generates a ton of network and disk I/O. I need to run multiple copies of this code in parallel. Which of the following technologies is likely to give me the best performance:

  • async methods with await

  • directly use Task from TPL

  • the TPL Dataflow nuget

  • Reactive Extensions

I'm not very good at this parallel stuff, but if using something lower level, like say Thread, can give me much better performance, I'd consider that too.

Eu answered 17/4, 2013 at 1:5 Comment(1)
I did not grasp the nuget context; why is it used only with TPL Dataflow? Are you using the .NET 4.0 Async CTP or .NET 4.5? – Hollington

Any performance difference between these options would be inconsequential in the face of "a ton of network and disk I/O".

A better question to ask is "which option is easiest to learn and develop with?" or "which option would be easiest to maintain this code with five years from now?" And for that I would suggest async/await first, or Dataflow or Rx if your logic is better represented as a stream.
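To illustrate the async-first recommendation, here is a minimal sketch of running many I/O-bound operations concurrently with async/await (the URLs and the `DownloadAllAsync` name are placeholders, not from the question):

```csharp
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

class Downloader
{
    static readonly HttpClient client = new HttpClient();

    // All downloads are in flight at once; no threads are blocked
    // while waiting on the network.
    static async Task<string[]> DownloadAllAsync(string[] urls)
    {
        var tasks = urls.Select(url => client.GetStringAsync(url));
        return await Task.WhenAll(tasks);
    }
}
```

The point is readability: this composes like sequential code but overlaps all the I/O, which is usually all the "parallelism" an I/O-bound workload needs.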

Dower answered 17/4, 2013 at 1:35 Comment(0)

This is like trying to optimize the length of your transatlantic flight by asking the quickest method to remove your seatbelt.

OK, here's some real advice, since I was kind of a jerk above.

Let's give a helpful answer. Think of performance as in "Classes" of activities - each one is an order of magnitude slower (at least!):

  1. Only accessing the CPU, very little memory usage (i.e. rendering very simple graphics to a very fast GPU, or calculating digits of Pi)
  2. Only accessing CPU and in-memory things, nothing on disk (i.e. a well-written game)
  3. Accessing the disk
  4. Accessing the network.

If you do even one thing from class #3, there's no point in doing the optimizations typical of classes #1 and #2, like tuning your threading library - they're completely overshadowed by the disk hit. Same for CPU tricks - if you're constantly incurring L2/L3 cache misses, sparing a few CPU cycles by hand-writing assembly isn't worth it (which is why things like loop unrolling are usually a bad idea these days).

So, what can we derive from this? There are two ways to make your program faster: either move up from #3 to #2 (which isn't often possible, depending on what you're doing), or do less I/O. I/O and network speed are the rate-limiting factors in most modern applications, and that's what you should be trying to optimize.
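One concrete way to "do less I/O" is to batch many small logical reads into fewer, larger physical reads. A sketch, assuming a local file named `data.bin` (a placeholder): wrapping the stream in a `BufferedStream` means the tight byte-by-byte loop below still only hits the disk in large chunks.

```csharp
using System.IO;

class BatchedReads
{
    static long CountBytes(string path)
    {
        long count = 0;
        using (var file = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var buffered = new BufferedStream(file, 64 * 1024))
        {
            // Thousands of ReadByte calls, but only a handful of
            // 64 KB reads actually reach the disk.
            while (buffered.ReadByte() != -1)
                count++;
        }
        return count;
    }
}
```

The same principle applies to the network: one request for a batch of records beats a hundred requests for one record each.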

Cowpox answered 17/4, 2013 at 4:25 Comment(5)
Another way to make the program faster is to perform the I/O in a smarter way. For example, with classical HDDs it's often faster not to perform I/O in parallel, because parallelism leads to more seeking (which is slow). – Excrescent
^^ This is a good idea too. It's fairly easy to detect an SSD: measure the speed of a sequential read, measure the speed of reading random sectors on disk, then compare the variance. If they're similar, you've got an SSD. – Cowpox
@Excrescent I think this is wrong; parallel I/O is faster, because SCAN and elevator algorithms can schedule the reads better. – Virulence
I think you are incorrect, because two explicit sets of sequential reads to disparate places will always have fewer seeks than parallel access to both places, even if you're rearranging the I/O. – Cowpox
Brilliant answer. I can also add that hardware has a big effect on #3. A standard spinning HD might do 100 IOPS (I/O operations per second), a mid-range SSD might do 6,000 IOPS, and a high-end PCI Express SSD might do somewhere between 100,000 and 10 million IOPS. And readers 20 years in the future will laugh patronisingly at these numbers. – Shopper

It's an older question, but for anyone reading this...

It depends. If you try to saturate a 1 Gbps link with 50-byte messages, you will be CPU-bound even with a simple non-blocking send over raw sockets. If, on the other hand, you are happy with 1 Mbps throughput or your messages are larger than 10 KB, any of these frameworks will do the job.

For low-bandwidth situations, I would recommend prioritizing by ease of use: async/await, Dataflow, Rx, then raw TPL, in that order. Note that a high-bandwidth application should be prototyped as if it were low-bandwidth, and optimized later.

For true high-bandwidth application, I can recommend Dataflow over Rx, because Rx is not designed for high concurrency. Raw TPL is the bottom layer, which guarantees the lowest overhead if you can handle the complexity. If you can make efficient use of dedicated threads, then that would be even faster. Async/await vs. Dataflow IMO doesn't make any performance difference. The overhead seems comparable, so choose one that's a better fit.
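As a sketch of the Dataflow recommendation, assuming the `System.Threading.Tasks.Dataflow` NuGet package is referenced (the URLs are placeholders): a two-stage pipeline where the download stage caps concurrency and the processing stage consumes results as they arrive.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class Pipeline
{
    static void Main()
    {
        RunAsync().Wait();
    }

    static async Task RunAsync()
    {
        var client = new HttpClient();

        // Download stage: at most 8 requests in flight at a time.
        var download = new TransformBlock<string, string>(
            url => client.GetStringAsync(url),
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

        // Process stage: handle each response as it completes.
        var process = new ActionBlock<string>(
            body => Console.WriteLine(body.Length));

        download.LinkTo(process,
            new DataflowLinkOptions { PropagateCompletion = true });

        foreach (var url in new[] { "http://example.com/a", "http://example.com/b" })
            download.Post(url);

        download.Complete();
        await process.Completion;
    }
}
```

The `MaxDegreeOfParallelism` knob is what makes Dataflow attractive for high-bandwidth work: back-pressure and concurrency limits come built in, rather than being hand-rolled on top of raw tasks.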

Meghanmeghann answered 19/12, 2013 at 16:41 Comment(0)