This is an interesting question.
Only performing the subtraction after testing that the result won't be negative (as suggested by TTat and Maximum Cookie) has negligible impact, as this optimization can already be performed by the JIT compiler.
Parallelizing the task (as suggested by Selman22) is a good idea, but when the loop is as fast as it is in this case, the overhead ends up outweighing the gains, so Selman22's implementation actually runs slower in my testing. I suspect that nick_w's benchmarks were produced with the debugger attached, hiding this fact.
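For reference, the per-element version presumably looks something along these lines (this is just a sketch of the idea, not necessarily Selman22's exact code; like the other methods in this answer it assumes a static ushort[] Buffer field):

public static void SubtractBackgroundFromBufferParallelFor(ushort[] backgroundBuffer)
{
    // One delegate invocation per element: the scheduling overhead dwarfs
    // the cost of the comparison and subtraction themselves.
    Parallel.For(0, Buffer.Length, i =>
    {
        if (Buffer[i] < backgroundBuffer[i])
        {
            Buffer[i] = 0;
        }
        else
        {
            Buffer[i] -= backgroundBuffer[i];
        }
    });
}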
Parallelizing the task in larger chunks (as suggested by nick_w) deals with the overhead problem, and can actually produce faster performance, but you don't have to calculate the chunks yourself - you can use Partitioner to do this for you:
public static void SubtractBackgroundFromBufferPartitionedParallelForEach(
    ushort[] backgroundBuffer)
{
    // Partitioner (from System.Collections.Concurrent) hands each task a
    // contiguous range of indices, so the delegate overhead is paid per
    // chunk rather than per element.
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
    {
        for (int i = range.Item1; i < range.Item2; ++i)
        {
            if (Buffer[i] < backgroundBuffer[i])
            {
                Buffer[i] = 0;
            }
            else
            {
                Buffer[i] -= backgroundBuffer[i];
            }
        }
    });
}
The above method consistently outperforms nick_w's hand-rolled chunking in my testing.
But wait! There's more to it than that.
The real culprit in slowing down your code is not the assignment or arithmetic. It's the if statement. How much it affects performance depends heavily on the nature of the data you are processing.
nick_w's benchmarking generates random data of the same magnitude for both buffers. However, I suspect that in reality your background buffer holds data of lower average magnitude. This detail can be significant due to branch prediction (as explained in this classic SO answer).
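To make that distinction concrete, here is a rough sketch of how the two distributions might be generated in a test harness (the names and size are my own assumptions, not nick_w's actual code):

const int Length = 10000000;                // hypothetical buffer size
var rng = new Random(42);
var buffer = new ushort[Length];
var backgroundBuffer = new ushort[Length];

for (int i = 0; i < Length; ++i)
{
    buffer[i] = (ushort)rng.Next(ushort.MaxValue);

    // Same-magnitude data (what nick_w benchmarks): the comparison is
    // essentially a coin flip, so the branch cannot be predicted well.
    backgroundBuffer[i] = (ushort)rng.Next(ushort.MaxValue);

    // Lower-magnitude background data (what I suspect you actually have):
    // the subtraction branch is taken almost every time and predicts well.
    // backgroundBuffer[i] = (ushort)rng.Next(6500);
}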
When the value in the background buffer is usually smaller than the one in the buffer, the processor's branch predictor can learn this pattern and predict the branch accordingly. When the data in each buffer comes from the same random population, there is no way to guess the outcome of the if statement with greater than 50% accuracy. It is this latter scenario that nick_w is benchmarking, and under those conditions we could potentially further optimize your method by using unsafe code to convert a bool to an integer and avoid branching altogether. (Note that the following code relies on an implementation detail of how bools are represented in memory; while it works for your scenario in .NET 4.5, it is not necessarily a good idea, and is shown here for illustrative purposes.)
public static void SubtractBackgroundFromBufferPartitionedParallelForEachHack(
    ushort[] backgroundBuffer)
{
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
    {
        for (int i = range.Item1; i < range.Item2; ++i)
        {
            unsafe
            {
                // Reinterpret the bool as an integer (1 when true, 0 when false)
                // so the difference is multiplied by 0 instead of taking a branch.
                var nonNegative = Buffer[i] > backgroundBuffer[i];
                Buffer[i] = (ushort)((Buffer[i] - backgroundBuffer[i]) *
                    *((int*)(&nonNegative)));
            }
        }
    });
}
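(Incidentally, if you want to remove the branch while staying in safe C#, the same effect can be had with a sign-extending shift instead of pointer tricks. This is only a sketch of an alternative and is not included in the benchmark figures below:)

public static void SubtractBackgroundFromBufferBranchlessShift(ushort[] backgroundBuffer)
{
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
    {
        for (int i = range.Item1; i < range.Item2; ++i)
        {
            // diff is negative exactly when the background value is larger.
            int diff = Buffer[i] - backgroundBuffer[i];

            // diff >> 31 is -1 (all bits set) when diff is negative and 0 otherwise,
            // so the mask zeroes the result precisely when it would have underflowed.
            Buffer[i] = (ushort)(diff & ~(diff >> 31));
        }
    });
}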
If you are really looking to shave a bit more time off, then you can follow this approach in a safer manner by switching languages to C++/CLI, which will let you use a boolean in an arithmetic expression without resorting to unsafe code:
UInt16 MyCppLib::Maths::SafeSubtraction(UInt16 minuend, UInt16 subtrahend)
{
    // In C++/CLI a bool participates in arithmetic as 1 or 0, so no branch is needed.
    return (UInt16)((minuend - subtrahend) * (minuend > subtrahend));
}
You can create a purely managed DLL using C++/CLI exposing the above static method, and then use it in your C# code:
public static void SubtractBackgroundFromBufferPartitionedParallelForEachCpp(
    ushort[] backgroundBuffer)
{
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
    {
        for (int i = range.Item1; i < range.Item2; ++i)
        {
            Buffer[i] =
                MyCppLib.Maths.SafeSubtraction(Buffer[i], backgroundBuffer[i]);
        }
    });
}
This outperforms the hacky unsafe C# code above. In fact, it is so fast that you could write the whole method in C++/CLI, forgoing parallelization entirely, and it would still outperform the other techniques.
Using nick_w's test harness, the above method will outperform any of the other suggestions published here so far. Here are the results I get (1-4 are the cases he tried, and 5-7 are the ones outlined in this answer):
1. SubtractBackgroundFromBuffer(ms): 2,021.37
2. SubtractBackgroundFromBufferWithCalcOpt(ms): 2,125.80
3. SubtractBackgroundFromBufferParallelFor(ms): 3,431.58
4. SubtractBackgroundFromBufferBlockParallelFor(ms): 1,401.36
5. SubtractBackgroundFromBufferPartitionedParallelForEach(ms): 1,197.76
6. SubtractBackgroundFromBufferPartitionedParallelForEachHack(ms): 742.72
7. SubtractBackgroundFromBufferPartitionedParallelForEachCpp(ms): 499.27
However, in the scenario I expect you actually have, where background values are typically smaller, successful branch prediction improves results across the board, and the 'hack' to avoid the if statement is actually slower.
Here are the results I get using nick_w's test harness when I restrict values in the background buffer to the range 0-6500 (roughly 10% of the ushort range):
1. SubtractBackgroundFromBuffer(ms): 773.50
2. SubtractBackgroundFromBufferWithCalcOpt(ms): 915.91
3. SubtractBackgroundFromBufferParallelFor(ms): 2,458.36
4. SubtractBackgroundFromBufferBlockParallelFor(ms): 663.76
5. SubtractBackgroundFromBufferPartitionedParallelForEach(ms): 658.05
6. SubtractBackgroundFromBufferPartitionedParallelForEachHack(ms): 762.11
7. SubtractBackgroundFromBufferPartitionedParallelForEachCpp(ms): 494.12
You can see that results 1-5 have all dramatically improved since they are now benefiting from better branch prediction. Results 6 & 7 haven't changed much, since they have avoided branching.
This change in data completely changes things. In this scenario, even the fastest all-C# solution is now only 15% faster than your original code.
Bottom line: be sure to test any method you pick with representative data, or your results will be meaningless.