Pipelining vs Batching in Stackexchange.Redis

2

27

I am trying to insert a large(-ish) number of elements in the shortest time possible and I tried these two alternatives:

1) Pipelining:

List<Task> addTasks = new List<Task>();
for (int i = 0; i < table.Rows.Count; i++)
{
    DataRow row = table.Rows[i];
    Task<bool> addAsync = redisDB.SetAddAsync(string.Format(keyFormat, row.Field<int>("Id")), row.Field<int>("Value"));
    addTasks.Add(addAsync);
}
Task[] tasks = addTasks.ToArray();
Task.WaitAll(tasks);

2) Batching:

List<Task> addTasks = new List<Task>();
IBatch batch = redisDB.CreateBatch();
for (int i = 0; i < table.Rows.Count; i++)
{
    DataRow row = table.Rows[i];
    Task<bool> addAsync = batch.SetAddAsync(string.Format(keyFormat, row.Field<int>("Id")), row.Field<int>("Value"));
    addTasks.Add(addAsync);
}
batch.Execute();
Task[] tasks = addTasks.ToArray();
Task.WaitAll(tasks);

I am not noticing any significant time difference (actually I expected the batch method to be faster): for approx 250K inserts I get approx 7 sec for pipelining vs approx 8 sec for batching.

Reading from the documentation on pipelining,

"Using pipelining allows us to get both requests onto the network immediately, eliminating most of the latency. Additionally, it also helps reduce packet fragmentation: 20 requests sent individually (waiting for each response) will require at least 20 packets, but 20 requests sent in a pipeline could fit into much fewer packets (perhaps even just one)."

To me, this sounds a lot like batching behaviour. I wonder whether there is any big difference between the two behind the scenes, because a simple check with procmon shows almost the same number of TCP Sends for both versions.

Agnew answered 6/1, 2015 at 9:51 Comment(0)
39

Behind the scenes, SE.Redis does quite a bit of work to try to avoid packet fragmentation, so it isn't surprising that the two are quite similar in your case. The main differences between batching and flat pipelining are:

  • a batch will never be interleaved with competing operations on the same multiplexer (although it may be interleaved at the server; to avoid that you need to use a multi/exec transaction or a Lua script)
  • a batch will always avoid the chance of undersized packets, because it knows about all the data ahead of time
  • but at the same time, the entire batch must be completed before anything can be sent, so this requires more in-memory buffering and may artificially introduce latency
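
The first bullet mentions a multi/exec transaction as the way to prevent interleaving at the server. A minimal sketch of that, assuming an already-connected `IDatabase`; the class name, method name, and parameters are hypothetical and only illustrate the shape of the call:

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public static class TransactionSketch
{
    // Hypothetical helper: queues two SADD commands inside MULTI/EXEC
    // so that the server runs them as one uninterrupted unit.
    public static async Task<bool> AddPairAtomicallyAsync(
        IDatabase db, RedisKey key, int first, int second)
    {
        ITransaction tran = db.CreateTransaction();

        // Commands queued on a transaction are not sent until Execute is called,
        // so keep the tasks and await them afterwards, not before.
        Task<bool> addFirst = tran.SetAddAsync(key, first);
        Task<bool> addSecond = tran.SetAddAsync(key, second);

        bool committed = await tran.ExecuteAsync();
        if (committed)
        {
            // The per-command results only become available after a commit.
            await Task.WhenAll(addFirst, addSecond);
        }
        return committed;
    }
}
```

Like a batch, a transaction buffers its commands until it is executed, so the same buffering and latency trade-off described in the third bullet applies.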

In most cases, you will do better by avoiding batching, since SE.Redis achieves most of what batching does automatically when you simply add work.

As a final note: if you want to avoid local overhead, another approach might be:

redisDB.SetAdd(string.Format(keyFormat, row.Field<int>("Id")),
    row.Field<int>("Value"), flags: CommandFlags.FireAndForget);

This sends everything down the wire, neither waiting for responses nor allocating incomplete Tasks to represent future values. You might want to do something like a Ping at the end without fire-and-forget, to check the server is still talking to you. Note that using fire-and-forget does mean that you won't notice any server errors that get reported.
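
Put together with the loop from the question, this might look roughly like the sketch below. The class and method names are hypothetical; `redisDB`, `table` and `keyFormat` are taken from the question and assumed to be set up elsewhere.

```csharp
using System;
using System.Data;
using StackExchange.Redis;

public static class FireAndForgetSketch
{
    // Hypothetical helper: sends one SADD per row without waiting for replies,
    // then does a single ordinary round trip to confirm the server still answers.
    public static TimeSpan AddAll(IDatabase redisDB, DataTable table, string keyFormat)
    {
        foreach (DataRow row in table.Rows)
        {
            // FireAndForget: the reply is discarded, so server-side errors
            // for this command will not be observed here.
            redisDB.SetAdd(
                string.Format(keyFormat, row.Field<int>("Id")),
                row.Field<int>("Value"),
                flags: CommandFlags.FireAndForget);
        }

        // Not fire-and-forget: this waits for the reply and throws
        // if the connection has failed or the call times out.
        return redisDB.Ping();
    }
}
```

The returned `TimeSpan` is the measured latency of the Ping itself, not the time taken by the inserts.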

Stroup answered 18/3, 2015 at 13:30 Comment(2)
Re: the final approach. If using SetAddAsync + FireAndForget + a final Ping, and excluding unknown transient errors, would the Sets be guaranteed to have been added by the time the Ping completes? Or could they arrive out of order? - Isotonic
@Isotonic assuming we aren't talking about "cluster", order should currently be guaranteed either way; however, I want to introduce a new opt-in "pooled" mode, whereby we use more connections to avoid big pile-ups when something goes wrong. In that usage, batching would guarantee order, but non-batching could use multiple connections with no order guarantees. That mode would be opt-in because of this semantic change. - Stroup
-1

I can't speak to the batching, but I would recommend against creating n Task items, as you are doing, based on the dynamic row count of a table. Creating, say, 100 Tasks and expecting them to provide decent performance can put a lot of overhead on the ThreadPool.

Madigan answered 14/11, 2022 at 15:34 Comment(3)
Tasks are much lighter than threads. While the overhead of a Task isn't 0, it's considerably less than the wait time for a network operation. I've done an async Task.WhenAll over 1000+ tasks calling Redis with no problems at all, and seen it execute more than 50x faster than calling them serially (over a network with 80 ms RTT to the Redis server). - Murvyn
@PeterDrier have you tested this "at scale" in a proper load testing environment or just a test harness? You would need to have other non-test-related "real-world" traffic running at the same time as your test code, in a NON-local environment, to have good confidence that this will scale well. Creating so many tasks puts a huge amount of stress on the .NET ThreadPool and is not recommended by the .NET BCL teams. - Madigan
Yep, in production: huge calculation engine, multiple data pipelines, thousands of tasks in multiple pods on a few k8 clusters. You might want to check whether your BCL recommendations are dated for '23. - Murvyn
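
The serial-versus-concurrent comparison discussed in these comments can be tried locally with a small sketch. It uses `Task.Delay` as a stand-in for a network round trip, which is an assumption: no Redis is involved, and all names here are hypothetical. A pending `Task.Delay` does not hold a thread while it waits, which is also how awaited I/O behaves, but the sketch says nothing about behaviour under real production load.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

public static class TaskOverheadSketch
{
    // Stand-in for one network round trip (assumption: no real I/O happens).
    private static Task FakeCallAsync(int latencyMs) => Task.Delay(latencyMs);

    // Awaits each call before starting the next, so the latencies add up.
    public static async Task<TimeSpan> RunSeriallyAsync(int calls, int latencyMs)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < calls; i++)
        {
            await FakeCallAsync(latencyMs);
        }
        return sw.Elapsed;
    }

    // Starts every call first, then awaits them all together,
    // so the waits overlap instead of adding up.
    public static async Task<TimeSpan> RunConcurrentlyAsync(int calls, int latencyMs)
    {
        var sw = Stopwatch.StartNew();
        Task[] pending = Enumerable.Range(0, calls)
            .Select(_ => FakeCallAsync(latencyMs))
            .ToArray();
        await Task.WhenAll(pending);
        return sw.Elapsed;
    }
}
```

Calling `RunSeriallyAsync(20, 20)` and `RunConcurrentlyAsync(1000, 20)` and printing both results gives a rough feel for the difference; the exact timings depend on the machine and timer resolution.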
