I've read this topic: C# Thread safe fast(est) counter, and have implemented this feature in my parallel code. As far as I can see it all works fine, but it has measurably increased the processing time, by roughly 10%.
It's been bugging me a bit, and I think the problem lies in the fact that I'm running a huge number of relatively cheap (<1 quantum) tasks on small data fragments which are well partitioned and probably make good use of cache locality, thus running optimally. My best guess, based on what little I know about MESI, is that the x86 LOCK prefix emitted for Interlocked.Increment forces the counter's cache line into Exclusive/Modified state on the incrementing core, invalidating the copies on every other core, so every single parallel pass pays a cache miss and reload just for the sake of incrementing this counter. With a roughly 100 ns penalty per miss and my workload, it seems to add up. (Then again, I could be wrong.)
Now, I don't see a way around it, but maybe I am missing something obvious. I even considered using n counters (one per degree of parallelism) and incrementing each on a specific core, but that seems unfeasible: detecting which core I am on would probably cost more than the increment itself, not to mention the elaborate if/then/else structure it would need and the mess it would make of the execution pipeline. Any ideas on how to break this beast? :)
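The per-core counter idea can be approximated without detecting which core you are on: the Parallel.For overload with localInit/localFinally gives each worker task its own subtotal, so the hot path is a plain local increment and the contended Interlocked.Add runs only once per task rather than once per item. A minimal sketch, assuming an integer-indexed workload (the item array and the per-item work are placeholders):

```csharp
using System.Threading;
using System.Threading.Tasks;

class CounterDemo
{
    static long ProcessAll(int[] items)
    {
        long total = 0;

        Parallel.For(0, items.Length,
            () => 0L,                         // localInit: fresh per-task subtotal
            (i, loopState, subtotal) =>
            {
                // ... process items[i] here (placeholder workload) ...
                return subtotal + 1;          // cheap, contention-free local increment
            },
            // localFinally: one contended add per task, not per item
            subtotal => Interlocked.Add(ref total, subtotal));

        return total;
    }
}
```

Each subtotal lives in the task's own local state, so the shared cache line holding `total` is touched only a handful of times instead of on every iteration.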
Interlocked.Add instead of Interlocked.Increment? – Klara