The speed of .NET in numerical computing

Asked 2/12, 2009 at 8:9 Answered 8/12, 2015 at 20:16

Solved c#.net managed-c++managed-code nmath

In my experience, .NET is 2 to 3 times slower than native code. (I implemented L-BFGS for multivariate optimization).

I have traced the ads on stackoverflow to http://www.centerspace.net/products/

the speed is really amazing, the speed is close to native code. How can they do that? They said that:

Q. Is NMath "pure" .NET?

A. The answer depends somewhat on your definition of "pure .NET". NMath is written in C#, plus a small Managed C++ layer. For better performance of basic linear algebra operations, however, NMath does rely on the native Intel Math Kernel Library (included with NMath). But there are no COM components, no DLLs--just .NET assemblies. Also, all memory allocated in the Managed C++ layer and used by native code is allocated from the managed heap.

Can someone explain more to me?

Ventriloquism answered 2/12, 2009 at 8:9 Comment(4)

I could explain more, but I don't what to give up the recipe to our special 'secret sauce' :). I will say, as many know, careful memory management is critical for high performance. – Inlaw 14/12, 2009 at 4:41

Is your L-BFGS code available? – Mooring 1/7, 2010 at 22:53

I second @JonHarrop here. I have a requirement for the LBFGS algo. and would appreciate a look at your code if it is open to us? – Leavelle 6/2, 2014 at 18:57

Common guys, let's quit those lame 'how can they be that great' self advertising, no? – Pheon 9/1, 2016 at 17:25

The point about C++/CLI is correct. To complete the picture, just two additional interesting points:

.NET memory management (garbage collector) obviously is not the problem here, as NMath still depends on it
The performance advantage is actually provided by Intel MKL, which offers implementations extremely optimized for many CPUs. From my point of view, this is the crucial point. Using straight-forward, naiv C/C++ code wont necessarily give you superior performance over C#/.NET, it's sometimes even worse. However C++/CLI allows you to exploit all the "dirty" optimization options.

Wehner answered 2/12, 2009 at 8:51 Comment(1)

Correct. In NMath, we are using Intel's MKL wherever possible. We create memory on the managed heap, call MKL without copying data. So, the use gets the benefits of .NET code versus BLAS or LAPACK but, in many cases, the execution time is almost all inside of MKL. – Tanatanach 7/11, 2013 at 22:20

How can they do that?

Like most of the numerical libraries for .NET, NMath is little more than a wrapper over an Intel MKL embedded in the .NET assembly, probably by linking with C++/CLI to create a mixed assembly. You've probably just benchmarked those bits that are not actually written in .NET.

The F#.NET Journal articles Numerical Libraries: special functions, interpolation and random numbers (16th March 2008) and Numerical Libraries: linear algebra and spectral methods (16th April 2008) tested quite a bit of functionality and NMath was actually the slowest of all the commercial libraries. Their PRNG was slower than all others and 50% slower than the free Math.NET library, some basic functionality was missing (e.g. the ability to calculate Gamma(-0.5)) and other basic functionality (the Gamma-related functions they did provide) was broken. Both Extreme Optimization and Bluebit beat NMath at the eigensolver benchmark. NMath didn't even provide a Fourier Transform at the time.

Even more surprisingly, the performance discrepancies were sometimes huge. The most expensive commercial numerical library we tested (IMSL) was over 500× slower than the free FFTW library at the FFT benchmark and none of the libraries made any use of multiple cores at the time.

In fact, it was precisely the poor quality of these libraries that encouraged us to commercialize our own F# for Numerics library (which is 100% pure F# code).

Mooring answered 2/7, 2010 at 18:40 Comment(0)

I am one of the lead developers of ILNumerics. So I am biased, obviously ;) But we are more disclosed regarding our internals, so I will give some insights into our speed 'secrets'.

It all depends on how system resources are utilized! If you are about pure speed and need to handle large arrays, you will make sure to (ordered by importance, most important first)

Manage your memory appropriately! 'Naive' memory management will lead to bad performance, since it stresses the GC badly, causes memory fragmentation and degrades memory locality (hence cache performance). In a garbage collected environment like .NET, this boils down to preventing from frequent memory allocations. In ILNumerics, we implemented a high performance memory pool in order to archieve this goal (and deterministic disposal of temporary arrays to get a nice, comfortable syntax without clumsy function semantics).
Utilize parallelism! This targets both: thread level parallelism and data level parallelism. Multiple cores are utilized by threading computation intensive parts of the calculation. On X86/X64 CPUs SIMD/multimedia extensions like SSE.XX and AVX allow a small but effective vectorization. They are not directly addressable by current .NET languages. And this is the only reason, why MKL may is still faster than 'pure' .NET code. (But solutions are already rising.)
To archieve the speed of highly optimized languages like FORTRAN and C++, the same optimizations must get applied to your code as done for them. C# offers the option do do so.

Note, these precautions should be followed in that order! It does not make sense to care about SSE extensions or even bound check removal, if the bottleneck is the memory bandwith and the processor(s) spend most the time waiting for new data. Also, for many simple operations it does not even pays of to invest huge efforts to archieve the very last tiny scale up to peak performance! Consider the common example of the LAPACK function DAXPY. It adds the elements of a vector X to the corresponding element of another vector Y. If this is done for the first time, all the memory for X and Y will have to get fetched from the main memory. There is little to nothing you can do about it. And memory is the bottleneck! So regardless if the addition at the end is done the naive way in C#

for (int i = 0; i < C.Length; i++) {
    C[i] = X[i] + Y[i]; 
}

or done by using vectorization strategies - it will have to wait for memory!

I know, this answer does somehow 'over answers' the question, since most of these strategies are currently not utilized from the mentioned product (yet?). By following thoses points, you would eventually end up with much better performance than every naive implementation in a 'native' language.

If you are interested, you could disclose your implementation of L-BFGS? I'll be happy to convert it to ILNumerics and post comparison results and I am sure, other libraries listed here would like to follow. (?)

Pheon answered 17/2, 2012 at 12:12 Comment(0)

The point about C++/CLI is correct. To complete the picture, just two additional interesting points:

.NET memory management (garbage collector) obviously is not the problem here, as NMath still depends on it
The performance advantage is actually provided by Intel MKL, which offers implementations extremely optimized for many CPUs. From my point of view, this is the crucial point. Using straight-forward, naiv C/C++ code wont necessarily give you superior performance over C#/.NET, it's sometimes even worse. However C++/CLI allows you to exploit all the "dirty" optimization options.

Wehner answered 2/12, 2009 at 8:51 Comment(1)

I've posted up a blog article addressing this question.

Inlaw answered 21/12, 2009 at 20:43 Comment(0)

The key is C++/CLI. It allows you to compile C++ code into a managed .NET assembly.

Albright answered 2/12, 2009 at 8:29 Comment(3)

My experience C++/CLI is 2 - 3 times slower. I first coded my Logistic Regression using L-BFGS as optimization algorithm. Then I made a CLI wapper to it. So that the library is callable from F# and C#. But the speed is slower. – Ventriloquism 2/12, 2009 at 8:37

I suppose NMath uses P/Invoke or C++/CLI to call Intel Math Kernel Library native functions which is where the most intensive calculations are done and which is why it is so fast. – Albright 2/12, 2009 at 8:51

Exactly, Darin. The time is spent in decomposition methods inside of Intel's MKL. No copying of data is required, either. So, it's not an issue of whether CLI is fast or not. It's about where the execution happens. – Tanatanach 7/11, 2013 at 22:26

Today it's industry standard to make mixed .Net/native libraries in order to take advantages of both platforms for performance optimization. Not only NMath, many commercial and free libraries with .net interface working like this. For example: Math.NET Numerics, dnAnalytics, Extreme Optimization, FinMath and many others. Integration with MKL is extremely popular for .net numerical libraries, and most of them just use Managed C++ assembly as an intermediate level. But this solution has a number of drawbacks:

Intel MKL is a proprietary software and it's a bit expensive. But some libraries like dnAnalytics provides a free replacement of MKL functionality with pure .net code. Off course, it's much slower, but it's free and fully functional.
It reduces your compatibility you need to have heavy managed C++ kernel dlls for both 32bit and 64bit mode.
Managed to native calls need performing marshaling which slow down performance of fast frequently called operations such as Gamma or NormalCDF.

Last two problems solved in RTMath FinMath library. I don't really know how they did it, but they provide single pure .net dll which compiled for Any CPU platform and supports 32bit and 64bit. Also I didn't seen any performance degradation against MKL when I need to call NormalCDF billions times.

Albers answered 4/11, 2011 at 20:40 Comment(1)

The 32-bit and 64-bit can be determined and loaded at runtime. There's a one-time hit at startup. Function calls do not need to copy data. You can create memory on the managed heap and pass a pointer to underlying numerical libraries such as Intel's Math Kernel Library. – Tanatanach 7/11, 2013 at 22:25

Since the (native) Intel MKL is doing the math, you're actually not doing the math in managed code. You're merely using the memory manager from .Net, so the outcomes are easily used by .Net code.

Preponderant answered 2/12, 2009 at 11:23 Comment(0)

I learnt more form @Darin Dimitrov's comment to his answer and @Trevor Misfeldt's comment to @Darin's comment. Hence posting it as an answer, for future readers.

NMath uses P/Invoke or C++/CLI to call Intel Math Kernel Library native functions which is where the most intensive calculations are done and which is why it is so fast.

The time is spent in decomposition methods inside of Intel's MKL. No copying of data is required, either. So, it's not an issue of whether CLI is fast or not. It's about where the execution happens.

Also @Paul's blog is also a good read. Here's the summary.

C# is Fast, Memory Allocation Is Not. Reuse the variables as ref or out parameters, instead of returning new variables from methods. Allocating a new variable consumes memory and slows down execution. @Haymo Kutschbach has explained this well.

If the precision is not necessary, the performance gain in switching from double to single precision is considerable (not to mention the memory saving for the data storage).

For many short computations, to call a C++/cli routine from C#, pinning all pointers to data allocated in the managed space, and then call the Intel library is generally better than using P/Invoke to call the library directly from C#, due to the cost of marshaling the data . As mentioned by @Haymo Kutschbach in comments, for blittable types however, no difference between C++/CLI and C#. Arrays of blittable types and classes that contain only blittable members are pinned instead of copied during marshaling. Refer https://msdn.microsoft.com/en-us/library/75dwhxf7(v=vs.110).aspx for a list of blittable and non-blittable types.

Literate answered 8/12, 2015 at 20:16 Comment(5)

The last paragraph is not quite right. For blittable arrays no marshalling is required. No difference between C++/CLI and C# here. You can verify by looking at the IL code generated. – Pheon 9/1, 2016 at 17:21

Not really. I don't see your point yet. Why should it be "generally better" to go C# -> C++/CLI -> MKL (unmanaged) instead of C# -> MKL. The only way to 'measure' any difference is to look at the generated IL and to profile carefully. – Pheon 10/1, 2016 at 19:4

C# -> C++/CLI -> MKL would be better than C# -> MKL because the variables are not copied in the former unlike the latter. Copying variables is slow. Does it make sense? Please correct me if I am wrong. This answer should help readers. – Literate 10/1, 2016 at 21:18

Yes, wrong. There is no copy from C# calling the MKL. Also, Marshalling is a good thing. It helps to get the unmanaged interface right. Something which otherwise the caller has to care about herself. Introducing C++/CLI simply includes yet another toolchan... – Pheon 12/1, 2016 at 11:11

@HaymoKutschbach I am summarizing Paul's blog (link in his answer above). But feel free to edit my answer. This answers should help future readers. Be clear to mention its your edit. – Literate 14/1, 2016 at 19:44

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags