C# .NET multithreading: Parallel.ForEach returns fewer results than expected when adding to a List

I am experimenting on optimizing some mathematical operations using C#.net within a package called Grasshopper (part of Rhino3D). The operation is quite simple but the list on which it has to be performed is big and may get much bigger.

I am using Parallel.ForEach with a List<double> in my C# script, and the number of final results I get back is lower than expected. This is most likely because List<T>.Add is not thread-safe (concurrent calls can corrupt the list's internal state), rather than anything specific to the software I'm building on top of.

  private void RunScript(double z, int x, List<double> y, ref object A)
  {
    List<double> temp = new List<double>();
    double r;
    System.Threading.Tasks.Parallel.ForEach(y, numb =>
    {
      r = Math.Pow((numb * x), z);
      temp.Add(r);
    });
    A = temp;
  }

Please help me figure out a simple and efficient way to run this math operation over several hundred (and potentially many more) values using CPU multithreading (or, if you have suggestions, GPU computing with CUDA).

I hope the obscure, domain-specific host software does not put you off; as far as I know, its C# scripts behave identically to ordinary C#/.NET, Python, or VB.NET.

Environs answered 17/4, 2015 at 21:41 Comment(3)
Yes, most likely the List.Add is not thread-safe, and concurrent calls can corrupt the list's internal state. Another issue is that you are sharing the local variable r between the executing threads without any synchronization. The variable declaration should be inside the loop body, or better yet, just inline the expression into your List.Add call.Crosscurrent
Instead of List, you could try ConcurrentBag<T>, which is thread-safe: see MSDN.Reisman
Just a suggestion: Try to write a more specific question title. The content of the question is fine and perfectly valid, but the title matters too.Leeds

You surmise correctly: List<T> is not thread-safe. You must synchronize access to any instance of it that is shared between threads.

One option is to simply synchronize in each task:

private void RunScript(double z, int x, List<double> y, ref object A)
{
    List<double> temp = new List<double>();
    object l = new object();
    System.Threading.Tasks.Parallel.ForEach(y, numb =>
    {
      double r = Math.Pow((numb * x), z); // compute outside the lock
      lock (l) temp.Add(r);               // hold the lock only for the Add
    });
    A = temp;
}

Note: your code had another bug in it also. You were sharing the same r variable amongst all the tasks, which could lead to the same value being added two or more times to the result, while other values were left out. I fixed the bug by simply moving the variable declaration to the body of the anonymous method used for the ForEach() call.


Another option is to recognize that you know in advance how many results you will have, and so can simply initialize an array large enough to contain all the results:

private void RunScript(double z, int x, List<double> y, ref object A)
{
    double[] results = new double[y.Count];
    System.Threading.Tasks.Parallel.For(0, y.Count, i =>
    {
      // read-only access of `y` is thread-safe:
      results[i] = Math.Pow((y[i] * x), z);
    });
    A = new List<double>(results);
}

No two threads will ever write to the same element in the results array, and the array itself will never change (i.e. be reallocated), so this is thread-safe without any locking.

The above assumes that you really do need a List<double> as the output object. Of course, if an array is satisfactory, then you can just assign results to A instead of passing it to the List<T> constructor to create a whole new object at the end.
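If you do need to build a list incrementally inside the loop, a better pattern than locking on every Add is the Parallel.ForEach overload with thread-local state: each worker fills its own private list, and the shared lock is taken only once per worker at the end. The following is a sketch of that pattern (my addition, not part of the original answer); note that, unlike the array version, the order of results is not preserved.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ThreadLocalSketch
{
    public static List<double> Compute(List<double> y, int x, double z)
    {
        var results = new List<double>();
        object gate = new object();
        Parallel.ForEach(
            y,
            () => new List<double>(),                // localInit: one private list per worker
            (numb, state, local) =>
            {
                local.Add(Math.Pow(numb * x, z));    // no locking in the hot path
                return local;
            },
            local => { lock (gate) results.AddRange(local); }); // localFinally: one lock per worker
        return results;
    }
}
```

For example, `Compute(new List<double> { 1, 2, 3 }, 2, 2)` yields 4, 16, and 36 in some order.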

Ectoblast answered 17/4, 2015 at 22:12 Comment(2)
That lock around the entire body of the loop completely defeats the purpose of Parallel.ForEach.Glomma
@LucasTrzesniewski: yes, you're right...I wish someone had noticed that earlier. It wound up that way when I got rid of r, and I moved the computation into the lock (it wasn't there before) without thinking it through. Thanks for pointing that out. I'll note that even doing it "right", the synchronization is still harmful, hence the second example. Any "free-threaded" implementation will beat a synchronized one easily, and this "simple" approach is the worst.Ectoblast

A simpler solution would probably be to use .AsParallel() and work on the resulting ParallelEnumerable instead:

private void RunScript(double z, int x, List<double> y, ref object A)
{
    A = y
        .AsParallel().AsOrdered()
        .Select(elem => Math.Pow((elem * x), z))
        .ToList();
}
Tarsal answered 17/4, 2015 at 22:30 Comment(3)
It's interesting if you take out the .AsOrdered() that the computation is 3 to 4 times slower.Indigestive
@Indigestive you are also not guaranteed that the items in A correspond to the same indexes as the items in y, by default AsParallel() does not need to keep the input order the same.Dao
@ScottChamberlain - Yes, I am aware of that, but I found it interesting that if we do want order then the computation is 3 to 4 times faster. That seemed counter intuitive to me.Indigestive

Here is another option:

    private void RunScript(double z, int x, List<double> y, ref object A) {
        var temp = new System.Collections.Concurrent.BlockingCollection<double>();
        System.Threading.Tasks.Parallel.ForEach(y, numb => {
            double r = Math.Pow((numb * x), z);
            temp.Add(r);
        });
        A = temp; // if needed you can do A = temp.ToList();
    }

Peter did a good job of outlining the issues with your code, and I think the second function he suggests is probably your best option. Still, it's nice to see alternatives and to know that the .NET Framework ships thread-safe collections in System.Collections.Concurrent.
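As another sketch (my addition, not part of the original answer), the ConcurrentBag<double> suggested in the question's comments drops in the same way, with no explicit locking; like the other unordered approaches, it does not preserve input order.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class ConcurrentBagSketch
{
    public static List<double> Compute(List<double> y, int x, double z)
    {
        // ConcurrentBag<T> is safe for concurrent Add calls from many threads.
        var bag = new ConcurrentBag<double>();
        Parallel.ForEach(y, numb => bag.Add(Math.Pow(numb * x, z)));
        // Enumeration order is unspecified, so sort or reorder afterwards if order matters.
        return bag.ToList();
    }
}
```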

Interlope answered 17/4, 2015 at 22:20 Comment(0)

Thanks very much for your input! In case you are interested, the profiler output is as follows:

Peter Duniho 1st option : 330ms

Peter Duniho 2nd option : 207ms

Dweeberly option: 335ms

Mattias Buelens option: 376ms

This is very strange: supposedly .NET scripts should run faster in Grasshopper (because it is .NET), yet none of these solutions beats the parallel Python computation at 129 ms!

Anyway, thanks to all of you for the detailed answers! You are great!

Environs answered 17/4, 2015 at 23:14 Comment(11)
It's hard to compare apples-to-apples. However, one of the biggest factors here is that, at least for the problem you've described, parallelizing it doesn't buy you much. I don't know your exact inputs, but I created a scenario that completed in about 200ms. I found that about 25% of the cost of the algorithm was garbage-collection (switching to a version with 0 intermediate allocations improved the time by 25%). Running it single-threaded only increased the time by 50% on my 2-core (hyperthreaded) CPU. Python's math might not be as precise as .NET, etc.Ectoblast
Without complete examples for both C# and the Python implementation, it's impossible to know for sure where the time differences come from. And without those details, the actual time comparisons aren't very useful. It's a classic code optimization issue.Ectoblast
To be honest, I am not that concerned with precision ... as long as it is within a 0.001 tolerance. It is a sketchy script and so is the math in it. To put it in one sentence: what is ultimately the quickest way to multiply a large number of values (let's say more than 20k) and raise them to a power? The Python option shows a 2-3x improvement over the C# options so far, and since Python should be inherently slower in this environment, I still think there is a more optimal way of doing it with C#.Environs
The first step is to make sure you have apples-to-apples comparisons. I ran a bunch of different tests on various implementations. One of the biggest factors in the tests was how garbage-collection is managed and how the run-time was measured, neither of which have much if anything to do with the real world scenario. .NET and Python have fairly different execution environments, so making apples-to-apples tests is even harder.Ectoblast
I will note that I did one test in which I managed the concurrency explicitly, and gained another 25% improvement over my no-GC solution. Depending on how you ran your tests, my no-GC solution is either equivalent to your test of my 2nd option, or is 25% faster than it, so the explicitly-managed one would either produce a result of ~160ms or ~120ms in your test environment.Ectoblast
Also note: until you do a single-threaded comparison between .NET and Python to see what the performance difference in just the math computation is, comparisons between different multi-threaded implementations are particularly not useful.Ectoblast
I believe that all the scripting languages in the software on top of which I'm scripting are .Net. C#.Net,VB.NET, IronPython... I'm not sure if this answers the apple to apple comparison.Environs
Sorry...I thought you were using straight Python. I agree IronPython should take some of the differences out of it, but again...without seeing complete examples of both the C# and Python implementations, it's not possible to comment on why your measurements came out differently. There are too many possibilities, the most likely being simply that the performance tests aren't really equivalent (writing a good, apples-to-apples performance comparison is surprisingly hard).Ectoblast
I don't mind showing you the python script but I am afraid it might not help much since it is using obscure library. snag.gy/GoJyN.jpgEnvirons
What does the "obscure library" have to do with the question at hand? The only relevant comparison between C# and IronPython would be programs that only perform the computation here. No "obscure library" should be involved, on either side, no any other code at all. This is what I'm saying: the only valid performance comparison is one that isolates just the piece of the computation that you actually want to test. It's way too easy for other code to interfere with the results.Ectoblast
Ah, ok, I understand the point now! Thanks for the effort to spell it out for me!Environs

I was also looking at changing the input a little: splitting the data into separate branches, computing each branch on a separate thread, and then recombining them at the end. However, it scores the worst at 531 ms. I understand the script is crude, but I think it shows the idea well, and if written properly it might succeed. No?

  private void RunScript(double z, int x, List<double> y, DataTree<double> u, ref object A)
  {
    System.Threading.Tasks.Task<double[]> th1 = System.Threading.Tasks.Task<double[]>.Factory.StartNew(() => mP(u.Branch(0).ToArray(), x, z));
    System.Threading.Tasks.Task<double[]> th2 = System.Threading.Tasks.Task<double[]>.Factory.StartNew(() => mP(u.Branch(1).ToArray(), x, z));
    System.Threading.Tasks.Task<double[]> th3 = System.Threading.Tasks.Task<double[]>.Factory.StartNew(() => mP(u.Branch(2).ToArray(), x, z));
    System.Threading.Tasks.Task<double[]> th4 = System.Threading.Tasks.Task<double[]>.Factory.StartNew(() => mP(u.Branch(3).ToArray(), x, z));
    System.Threading.Tasks.Task<double[]> th5 = System.Threading.Tasks.Task<double[]>.Factory.StartNew(() => mP(u.Branch(4).ToArray(), x, z));
    System.Threading.Tasks.Task<double[]> th6 = System.Threading.Tasks.Task<double[]>.Factory.StartNew(() => mP(u.Branch(5).ToArray(), x, z));
    System.Threading.Tasks.Task<double[]> th7 = System.Threading.Tasks.Task<double[]>.Factory.StartNew(() => mP(u.Branch(6).ToArray(), x, z));
    System.Threading.Tasks.Task<double[]> th8 = System.Threading.Tasks.Task<double[]>.Factory.StartNew(() => mP(u.Branch(7).ToArray(), x, z));

    List<double> list = new List<double>();

    list.AddRange(th1.Result);
    list.AddRange(th2.Result);
    list.AddRange(th3.Result);
    list.AddRange(th4.Result);
    list.AddRange(th5.Result);
    list.AddRange(th6.Result);
    list.AddRange(th7.Result);
    list.AddRange(th8.Result);

    A = list;
  }

Sorry, I cannot add anything to the script's using directives, hence the fully qualified names.
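The same fan-out can be written more compactly as a loop, so it works for any number of branches. The following is only a sketch of that idea (my addition): mP here is a hypothetical stand-in for the helper referenced above, and the using directives assume an ordinary C# project rather than the Grasshopper script editor.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BranchSketch
{
    // Hypothetical stand-in for the mP helper referenced above:
    // raises each value times x to the power z.
    public static double[] mP(double[] branch, int x, double z)
        => branch.Select(v => Math.Pow(v * x, z)).ToArray();

    public static List<double> Compute(IReadOnlyList<double[]> branches, int x, double z)
    {
        // Start one task per branch instead of eight hand-written variables.
        Task<double[]>[] tasks = branches
            .Select(b => Task.Run(() => mP(b, x, z)))
            .ToArray();
        Task.WaitAll(tasks);

        // Recombine the results in branch order.
        var list = new List<double>();
        foreach (var t in tasks)
            list.AddRange(t.Result);
        return list;
    }
}
```

Because the results are appended task by task, branch order is preserved in the output.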

Environs answered 18/4, 2015 at 22:45 Comment(1)
Part of the problem here is that the computation itself is so simple, that any attempt to micro-manage the concurrency, especially if you have to create intermediate objects and especially if you have to actually make new copies of the data, will have the tendency to dominate the computational cost. And since that micro-management is roughly proportional to the degree of concurrency, it will strongly offset any benefits of concurrency. Remember: my worst case, single-threaded, 0 concurrency solution was only 300 ms, compared to a best-case (so far) of 120 ms.Ectoblast
