TPL Dataflow vs plain Semaphore
I have a requirement to build a scalable process. The process consists mainly of I/O operations with some minor CPU work (mainly deserializing strings). The process queries a database for a list of urls, fetches data from those urls, deserializes the downloaded data into objects, then persists some of the data into CRM Dynamics and also into another database. Afterwards I need to update the first database with which urls were processed. Part of the requirement is to make the degree of parallelism configurable.

Initially I thought to implement it as a sequence of tasks with await, limiting the parallelism using a semaphore - quite simple. Then I read a few posts and answers here by @Stephen Cleary recommending TPL Dataflow, and I thought it could be a good candidate. However, I want to make sure that if I'm "complicating" the code by using Dataflow, it's for a worthy cause. I also got a suggestion to use a ForEachAsync extension method, which is also simple to use; however, I'm not sure it won't cause memory overhead because of the way it partitions the collection.

Is TPL Dataflow a good option for such a scenario? How is it better than a semaphore or the ForEachAsync method - what benefits will I actually gain if I implement it via TPL Dataflow over each of the other options (semaphore/ForEachAsync)?

Harmonize answered 31/7, 2018 at 14:41 Comment(5)
TPL Dataflow is better for CPU work. For async I/O calls I would use Task.WhenAll with a range of tasks.Pertinent
@PeterBons - My scenario has mainly I/O calls, but also a bit of CPU work (e.g. deserializing the files' content). I could just implement it with a semaphore, but I got the impression that I would gain performance by using TPL Dataflow. I'm still not sure I fully understand Dataflow's benefits, so I can't determine whether they're worth it, because it would probably make my code more complex than just using a semaphore.Harmonize
I'm really interested in getting some expert opinion on this one. I'm doing exactly the same thing as you and can't decide between a semaphore and TPL Dataflow. I'm leaning towards using an ActionBlock with a configurable MaxDegreeOfParallelism. From what I understand, TPL manages the thread pool for you in an efficient manner, but there are some other issues. I want to keep it simple and just limit the number of tasks running at one time - is that what you are doing too?Zephan
Oh, by the way, check out this answer from @Stephen Cleary. TPL Dataflow is great, especially if you're looking to limit work in one part of a larger pipeline. However, if there's just one action to throttle, then a semaphore is enough.Zephan
@TheUknown - Good news, we've got an answer from the expert :) My goal is not just to limit the number of tasks but also to make sure the entire process finishes as quickly as possible, knowing that the part which writes to CRM is the main bottleneck. Thank you for providing the reference in your comment to the other answer; it's also informative and fits my situation.Harmonize

The process has mainly I/O operations with some minor CPU operations (mainly deserializing strings).

That's pretty much just I/O. Unless those strings are huge, the deserialization won't be worth parallelizing. The kind of CPU work you're doing will be lost in the noise.

So, you'll want to focus on concurrent asynchrony.

  • SemaphoreSlim is the standard pattern for this, as you've found.
  • TPL Dataflow can also do concurrency (both asynchronous and parallel forms).

ForEachAsync can take several forms; note that in the blog post you referenced, there are 5 different implementations of this method, each of which is valid. "[T]here are many different semantics possible for iteration, and each will result in different design choices and implementations." For your purposes (not wanting CPU parallelization), you shouldn't consider the ones using Task.Run or partitioning. In an asynchronous-concurrency world, any ForEachAsync implementation is just syntactic sugar that hides which semantics it implements, which is why I tend to avoid it.
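For illustration only (this is a sketch, not one of the blog post's five implementations): a ForEachAsync restricted to the asynchronous-concurrency semantics, with no Task.Run and no partitioning, can be built directly on SemaphoreSlim:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class AsyncEnumerableExtensions
{
    // One possible ForEachAsync: purely asynchronous concurrency,
    // throttled by a semaphore - no thread-pool work, no partitioning.
    public static Task ForEachAsync<T>(this IEnumerable<T> source,
        int degreeOfConcurrency, Func<T, Task> body)
    {
        var throttle = new SemaphoreSlim(degreeOfConcurrency);
        var tasks = source.Select(async item =>
        {
            await throttle.WaitAsync(); // acquire outside the try, release in the finally
            try { await body(item); }
            finally { throttle.Release(); }
        }).ToList(); // materialize so every task is started
        return Task.WhenAll(tasks);
    }
}
```

Usage would then be `await urls.ForEachAsync(10, url => ProcessUrlAsync(url));`, where `ProcessUrlAsync` is a hypothetical per-url handler.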

This leaves you with SemaphoreSlim vs. ActionBlock. I generally recommend people start with SemaphoreSlim first, and consider moving to TPL Dataflow if their needs become more complex (in a way that seems like they would benefit from a dataflow pipeline).

E.g., "Part of the requirement is to make the parallelism degree configurable."

You may start off with allowing a single degree of concurrency, where the thing being throttled is one whole operation (fetch data from the url, deserialize the downloaded data into objects, persist into CRM Dynamics and the other database, and update the first database). This is where SemaphoreSlim would be a perfect solution.

But you may decide you want to have multiple knobs: say, one degree of concurrency for how many urls you're downloading, and a separate degree of concurrency for persisting, and a separate degree of concurrency for updating the original database. And then you'd also need to limit the "queues" in-between these points: only so many deserialized objects in-memory, etc. - to ensure that fast urls with slow databases don't cause problems with your app using too much memory. If these are useful semantics, then you have started approaching the problem from a dataflow perspective, and that's the point that you may be better served with a library like TPL Dataflow.
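Such a multi-knob pipeline might be sketched as below. This is an illustration under assumptions, not code from the answer: the stage bodies use `Task.Delay` and string munging as stand-ins for the real download, deserialization, and persistence work, and the `System.Threading.Tasks.Dataflow` NuGet package is required.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // requires the System.Threading.Tasks.Dataflow package

var processed = new ConcurrentBag<string>();

// Each stage gets its own concurrency knob, and BoundedCapacity caps the
// in-memory queue between stages, so fast urls can't flood a slow database.
var download = new TransformBlock<string, string>(
    async url => { await Task.Delay(10); return "raw:" + url; },  // stand-in for the HTTP fetch
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10, BoundedCapacity = 20 });

var deserialize = new TransformBlock<string, string>(
    raw => raw.Substring(4),                                      // stand-in for deserialization
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 2, BoundedCapacity = 10 });

var persist = new ActionBlock<string>(
    async item => { await Task.Delay(20); processed.Add(item); }, // stand-in for the CRM/database writes
    new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4, BoundedCapacity = 10 });

var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };
download.LinkTo(deserialize, linkOptions);
deserialize.LinkTo(persist, linkOptions);

foreach (var url in Enumerable.Range(0, 50).Select(i => "url" + i))
    await download.SendAsync(url); // SendAsync waits while the pipeline is full

download.Complete();       // completion propagates through the links
await persist.Completion;
Console.WriteLine(processed.Count); // prints 50
```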

Xerophthalmia answered 31/7, 2018 at 22:8 Comment(1)
Thank you so much for this answer, it is detailed and clear and you even referred to all the options I mentioned including the ForEachAsync! +100 :)Harmonize
L
1

Here are the selling points of the Semaphore approach:

  1. Simplicity

And here are the selling points of the TPL Dataflow approach:

  1. Task-parallelism on top of data-parallelism
  2. Optimal utilization of resources (bandwidth, CPU, database connections)
  3. Configurable degree of parallelism for each of the heterogeneous operations
  4. Reduced memory footprint

Let's review the following Semaphore implementation for example:

string[] urls = FetchUrlsFromDB();
var cts = new CancellationTokenSource();
var semaphore = new SemaphoreSlim(10); // Degree of parallelism (DOP)
Task[] tasks = urls.Select(url => Task.Run(async () =>
{
    await semaphore.WaitAsync(cts.Token); // acquire outside the try, so a failed wait never triggers a Release
    try
    {
        string rawData = DownloadData(url);
        var data = Deserialize(rawData);
        PersistToCRM(data);
        MarkAsCompleted(url);
    }
    finally
    {
        semaphore.Release();
    }
})).ToArray();
Task.WaitAll(tasks);

The above implementation ensures that at most 10 urls will be processed concurrently at any given moment, but there is no coordination between these parallel workflows. So, for example, it is entirely possible that at one moment all 10 workflows are downloading data, at another moment all 10 are deserializing raw data, and at yet another all 10 are persisting data to the CRM. This is far from ideal. Ideally you would like the bottleneck of the whole operation, be it the network adapter, the CPU or the database server, to work non-stop all the time, and not be underutilized (or completely idle) at various random moments.

Another consideration is how much parallelization is optimal for each of the heterogeneous operations. A DOP of 10 may be optimal for the communication with the web, but too low or too high for the communication with the database. The semaphore approach does not allow for that level of fine-tuning. Your only option is to compromise by selecting a DOP value somewhere between these optimal values.

If the number of urls is very large, let's say 1,000,000, then the semaphore approach above also raises serious memory usage considerations. A url may have a size of 50 bytes on average, while a Task that is connected to a CancellationToken may be 10 times heavier or more. Of course you could change the implementation and use the SemaphoreSlim in a more clever way that doesn't generate so many tasks, but this would go against the primary (and only) selling point of this approach: its simplicity.
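To make the trade-off concrete, here is one such "more clever" shape (my sketch, with `Task.Delay` standing in for the real per-url work): a fixed set of worker loops draining a shared queue keeps memory proportional to the DOP instead of the url count, at the cost of the one-liner simplicity above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Instead of creating one throttled Task per url, start only DOP worker
// loops that drain a shared queue; memory scales with DOP, not url count.
int dop = 10;
var queue = new ConcurrentQueue<string>(Enumerable.Range(0, 1000).Select(i => "url" + i));
int processed = 0;

Task[] workers = Enumerable.Range(0, dop).Select(_ => Task.Run(async () =>
{
    while (queue.TryDequeue(out var url))
    {
        await Task.Delay(1);                  // stand-in for the real per-url work
        Interlocked.Increment(ref processed);
    }
})).ToArray();

await Task.WhenAll(workers);
Console.WriteLine(processed); // prints 1000
```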

The TPL Dataflow library solves all of these problems, at the cost of the (smallish) learning curve required in order to be able to tame this powerful tool.
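For comparison with the semaphore code above, here is a sketch (again with `Task.Delay` standing in for the real per-url work) of the simplest Dataflow equivalent: a single bounded ActionBlock, which addresses both the configurable DOP and the memory footprint, since only a bounded buffer of urls is ever held in memory.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // requires the System.Threading.Tasks.Dataflow package

int processed = 0;

// At most 10 urls processed at once and at most 100 buffered in memory,
// no matter how many urls are fed in overall.
var worker = new ActionBlock<string>(async url =>
{
    await Task.Delay(1);                      // stand-in for the per-url work
    Interlocked.Increment(ref processed);
}, new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 10, BoundedCapacity = 100 });

foreach (var url in Enumerable.Range(0, 1000).Select(i => "url" + i))
    await worker.SendAsync(url);              // waits while the buffer is full

worker.Complete();
await worker.Completion;
Console.WriteLine(processed); // prints 1000
```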

Levinson answered 11/6, 2020 at 14:59 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.