I have a requirement to build a scalable process. The process is mostly I/O-bound, with some minor CPU work (mainly deserializing strings). It queries a database for a list of URLs, fetches data from those URLs, deserializes the downloaded data into objects, then persists some of the data into CRM Dynamics and also into another database. Afterwards I need to update the first database to mark which URLs were processed. Part of the requirement is to make the degree of parallelism configurable.
Initially I thought to implement it as a sequence of awaited tasks, limiting the parallelism with a semaphore - quite simple. Then I read a few posts and answers here by @Stephen Cleary recommending TPL Dataflow, and I thought it could be a good candidate. However, I want to make sure that if I "complicate" the code by using Dataflow, it's for a worthy cause. I also got a suggestion to use a ForEachAsync extension method, which is also simple to use, but I'm not sure it won't cause memory overhead because of the way it partitions the collection.
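For reference, here is roughly what I mean by the semaphore approach - a minimal sketch, assuming a hypothetical ProcessUrlAsync that fetches, deserializes, and persists a single URL:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class SemaphoreThrottling
{
    static async Task ProcessAllAsync(IEnumerable<string> urls, int maxDegreeOfParallelism)
    {
        using var semaphore = new SemaphoreSlim(maxDegreeOfParallelism);

        // One task per URL; the semaphore caps how many run concurrently.
        var tasks = urls.Select(async url =>
        {
            await semaphore.WaitAsync();
            try
            {
                await ProcessUrlAsync(url); // fetch, deserialize, persist, mark processed
            }
            finally
            {
                semaphore.Release();
            }
        });

        await Task.WhenAll(tasks);
    }

    // Hypothetical placeholder for the real per-URL work.
    static Task ProcessUrlAsync(string url) => Task.CompletedTask;
}
```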
Is TPL Dataflow a good option for such a scenario? How is it better than a semaphore or the ForEachAsync method - what benefits will I actually gain if I implement it via TPL Dataflow over each of the other options (semaphore/ForEachAsync)?
Comments:

ActionBlock with MaxDegreeOfParallelism as configurable.

From what I understand, TPL manages the thread pool for you in an efficient manner, but there are some other issues. I want to keep it simple, just limit the number of tasks running at one time; is that what you are doing too? – Zephan

TPL Dataflow is great, especially if you're looking to limit work in one part of a larger pipeline. However, if there's just one action to throttle, then a semaphore is enough. – Zephan
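A minimal sketch of what that ActionBlock suggestion might look like, assuming the same hypothetical ProcessUrlAsync as above (the BoundedCapacity value is an illustrative choice, not something from the thread):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

class DataflowThrottling
{
    static async Task ProcessAllAsync(IEnumerable<string> urls, int maxDegreeOfParallelism)
    {
        var block = new ActionBlock<string>(
            url => ProcessUrlAsync(url),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = maxDegreeOfParallelism,
                // Illustrative: bound the input buffer so the whole URL list
                // is never queued in memory at once.
                BoundedCapacity = maxDegreeOfParallelism * 2
            });

        foreach (var url in urls)
            await block.SendAsync(url); // awaits while the block is at capacity

        block.Complete();
        await block.Completion; // propagates any exception from ProcessUrlAsync
    }

    // Hypothetical placeholder for the real per-URL work.
    static Task ProcessUrlAsync(string url) => Task.CompletedTask;
}
```

The BoundedCapacity plus SendAsync pairing is the main practical difference from the semaphore sketch: the semaphore version still creates one task object per URL up front, while here the producer awaits whenever the block is full, so memory use stays flat regardless of how many URLs the query returns.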