Math optimization in C#
N

25

65

I've been profiling an application all day long and, having optimized a couple of bits of code, I'm left with this on my todo list. It's the activation function for a neural network, which gets called over 100 million times. According to dotTrace, it amounts to about 60% of the overall function time.

How would you optimize this?

public static float Sigmoid(double value) {
    return (float) (1.0 / (1.0 + Math.Pow(Math.E, -value)));
}
Nerland answered 5/1, 2009 at 0:57 Comment(9)
What's the range of input values?Vagary
Also, how precise are the input values? How many decimal digits are important to you?Vagary
The more the better, really, but I'd say about 6-7 decimal digitsNerland
What's the range of the value input?Deposit
It varies from run to run, but generally from -10.00000 to +10.000000. Changed it to float and works, aside from some casting in some classesNerland
Is there an easy way to ensure that the method is inlined? Perhaps final modifier?Riddance
p.s. I'm a C# beginner, so I am just guessing.Riddance
+1 for profiling before you determine that you need to optimize!Quick
@jjnguy: I'm not sure there's a way to force the JIT execution to inline a method.Lunna
D
63

Try:

public static float Sigmoid(double value) {
    return 1.0f / (1.0f + (float) Math.Exp(-value));
}

EDIT: I did a quick benchmark. On my machine, the above code is about 43% faster than your method, and this mathematically-equivalent code is the teeniest bit faster (46% faster than the original):

public static float Sigmoid(double value) {
    float k = (float) Math.Exp(value);
    return k / (1.0f + k);
}

EDIT 2: I'm not sure how much overhead C# functions have, but if you can drop down to C (#include <math.h>), you should be able to use expf, the single-precision exp function, which might be a little faster. The equivalent would look something like this:

public static float Sigmoid(double value) {
    float k = expf((float) value);
    return k / (1.0f + k);
}

Also if you're doing millions of calls, the function-calling overhead might be a problem. Try making an inline function and see if that's any help.
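C# has no inline keyword and the JIT decides inlining on its own, but on newer runtimes (.NET 4.5 and later) you can at least hint at it. A minimal sketch, assuming such a runtime is available:

using System;
using System.Runtime.CompilerServices;

public static class Activation {
    // A hint (not a guarantee) that the JIT should inline this method at call sites.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static float Sigmoid(double value) {
        float k = (float) Math.Exp(value);
        return k / (1.0f + k);
    }
}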

Deposit answered 5/1, 2009 at 1:2 Comment(6)
Do you know they range that the value parameter could be? If so, consider generating a lookup table.Caret
You changed the sign of value; my math is rusty, but I don't think it's the same thing... you should have Math.Exp(-value) according to the initial code.Dandridge
@Marcel: Nope, he changed e^-value to 1/(e^value), then added with the 1.0 and swapped numerator/denominator.Basrelief
Please forgive me, why convert to float? Is it not the case that float is derived from a double-precision number? Seems if this is the case it would be better to use double?Heliochrome
So is it 1 / (1+k) or k / (1+k)?Septum
The first is when k=exp(-value), the second is when k=exp(+value). Mathematically equivalent.Anarchism
A
31

If it's for an activation function, does it matter terribly much if the calculation of e^x is completely accurate?

For example, if you use the approximation (1+x/256)^256, on my Pentium testing in Java (I'm assuming C# essentially compiles to the same processor instructions) this is about 7-8 times faster than e^x (Math.exp()), and is accurate to 2 decimal places up to about x of +/-1.5, and within the correct order of magnitude across the range you stated. (Obviously, to raise to the 256th power, you actually square the number 8 times -- don't use Math.Pow for this!) In Java:

double eapprox = (1d + x / 256d);
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;

Keep doubling or halving 256 (and adding/removing a multiplication) depending on how accurate you want the approximation to be. Even with n=4, it still gives about 1.5 decimal places of accuracy for values of x between -0.5 and 0.5 (and appears a good 15 times faster than Math.exp()).

P.S. I forgot to mention -- you should obviously not really divide by 256: multiply by a constant 1/256. Java's JIT compiler makes this optimisation automatically (at least, Hotspot does), and I assume C# does too.
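For reference, a rough C# port of the same trick (untested sketch; the name SigmoidApprox is just for illustration):

public static float SigmoidApprox(double x) {
    // e^x approximated by (1 + x/256)^256: multiply by the constant 1/256, then square 8 times.
    double e = 1.0 + x * (1.0 / 256.0);
    e *= e; e *= e; e *= e; e *= e;
    e *= e; e *= e; e *= e; e *= e;
    // sigmoid(x) = e^x / (1 + e^x)
    return (float) (e / (1.0 + e));
}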

Assibilate answered 5/1, 2009 at 1:59 Comment(4)
Whoa. This lowered it even further!Nerland
If you're multiplying or dividing by powers of two, shift left or right (<< and >>), instead of using multiplication/division, it will be far faster.Nosepiece
@Nosepiece -- that would work for integer cases, though on modern processors isn't necessarily faster than just multiplying. But you really may as well let the compiler perform that kind of optimisation.Assibilate
But don't assume that your notions of processor timings and optimisations from 20 years ago still apply. You may well find that your processor can do an FP multiplication in the same time as an integer shift...Assibilate
C
24

Have a look at this post. It has an approximation for e^x written in Java; this should be the C# code for it (untested):

public static double Exp(double val) {  
    long tmp = (long) (1512775 * val + 1072632447);  
    return BitConverter.Int64BitsToDouble(tmp << 32);  
}

In my benchmarks this is more than 5 times faster than Math.exp() (in Java). The approximation is based on the paper "A Fast, Compact Approximation of the Exponential Function" which was developed exactly to be used in neural nets. It is basically the same as a lookup table of 2048 entries and linear approximation between the entries, but all this with IEEE floating point tricks.

EDIT: According to Special Sauce this is ~3.25x faster than the CLR implementation. Thanks!
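For completeness, a hedged sketch of plugging the approximation straight into the sigmoid (SigmoidFast is my name for it; mind the limited input range mentioned in the comments below):

public static float SigmoidFast(double value) {
    // Approximate exp(-value) with the bit trick above, then apply 1 / (1 + e).
    long tmp = (long) (1512775 * -value + 1072632447);
    double e = BitConverter.Int64BitsToDouble(tmp << 32);
    return (float) (1.0 / (1.0 + e));
}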

Cussedness answered 5/1, 2009 at 12:38 Comment(7)
Curious: Can you simplify (1072693248 - 60801) as 1072632447? Also, can you take it outside the long cast, so it's not added to a double, for speed? Does this affect accuracy and/or performance?Escamilla
(Hindsight: I realize the subtraction is probably optimized by the compiler, but worth a shot regardless.)Escamilla
@Escamilla right, this is surely optimized by the compiler. I left it like this because it was partly how the formula was developed, but you can replace it with 1072632447.Cussedness
@martinus, Have you tried the other technique? Rewrite as: long tmp = (long)(1512775 * val) + 1072632447;Escamilla
Using this in Henrik's Sigmoid1/Sigmoid2 post offers a ~50% improvement (900ms vs 500ms). Haven't checked accuracy though.Nerland
I benchmarked this very simple function in C# and found it ~3.25x faster than the CLR implementation. To get an idea of the level of error, here are five random example pairs of (CLR result, approximated result) at different orders of magnitude: (0.0007242, 0.0007376), (1.55306, 1.57713), (307.78015, 309.18896), (1093286.54660, 1050935.0), (9.76825E+30, 9.57295E+30).Melendez
@Cussedness it's also important to note that according to the paper only input values approximately within the (-700,700) are supported by this method. I have noticed that even for input values 800+ I see unpredictable behavior. Alternative link to the paper: schraudolph.org/pubs/Schraudolph99.pdfHupp
F
13
  1. Remember that any changes to this activation function come at the cost of different behavior. This even includes switching to float (and thus lowering the precision) or using activation substitutes. Only experimenting with your use case will show the right way.
  2. In addition to the simple code optimizations, I would also recommend considering parallelization of the computations (i.e. leveraging multiple cores of your machine, or even machines in the Windows Azure cloud) and improving the training algorithms; a sketch of the first point follows below.
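For the parallelization point, a minimal sketch in C# (assuming the activations live in plain arrays named inputs and outputs; adapt to your own data layout):

using System;
using System.Threading.Tasks;

static void SigmoidAll(double[] inputs, float[] outputs) {
    // Split the activations across all available cores.
    Parallel.For(0, inputs.Length, i => {
        float k = (float) Math.Exp(inputs[i]);
        outputs[i] = k / (1.0f + k);
    });
}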

UPDATE: Post on lookup tables for ANN activation functions

UPDATE2: I removed the point on LUTs since I've confused these with the complete hashing. Thanks go to Henrik Gustafsson for putting me back on the track. So the memory is not an issue, although the search space still gets messed up with local extrema a bit.

Fiddlestick answered 5/1, 2009 at 5:37 Comment(12)
I'm already parallelizing parts of the code, some using Parallel.For and some others using PLINQ. So far, it's worked greatNerland
Yes, I would suspect. But the whole fun starts when the training algo (I'm using evolutionary one that allows to pick the network structure) is being run on multiple machines))Fiddlestick
Using a LUT for each possible value of the double range would eat all your memory, but if floats are used and some loss of precision can be acceptable, a table could be fit into less than 4k of memory. Your statement 1 is false.Panne
If an error of ~0.1% in the activation function can be accepted, the C-example I posted below would work if you fix it up a little (first two points in the TODO should do it :)Panne
I'll give it a try. I'm also trying to minimize memory usage (not related to your post) - apparently .NET thinks it's a nice idea to pass a huge array around... blah..Nerland
I wouldn't say 4k is a lot either way as it is only once per process :) Why I tried for 4k is of course to keep within a page boundary. That is easy to guarantee in C, but harder in C# I suppose; ymmv on thatPanne
4k is a rather mild understatement)) rabdullin.com/journal/2009/1/5/…Fiddlestick
You're not seriously using a hash-table? That version has already been discussed and rejected several times here already. Did you even read my code? Oh, and read this while you're at it: en.wikipedia.org/wiki/Lookup_tablePanne
Aliasing might just break stuff badly, but there might be ways of fixing that, and the gains of a LUT is potentially big, and are usually worth at least looking into. As it would be used often it would almost certainly be hot in the cache.Panne
Hm... mea culpa. I initially confused LUT with complete hashing. But still, using this kind of approach just did not work well in my scenarios and resulted in worse training results.Fiddlestick
"Worse training results" == it takes more attempts to get a reasonably good training result.Fiddlestick
I posted a new answer with a C# implementation; I also included some links to some papers dealing with the errors introduced. I haven't had the time to read them all, but in general they seem to say that quantization has good and bad features, but most are manageable using appropriate techniques.Panne
L
8

FWIW, here are my C# benchmarks for the answers already posted. (Empty is a function that just returns 0, to measure the function-call overhead.)

Empty Function:       79ms   0
Original:             1576ms 0.7202294
Simplified: (soprano) 681ms  0.7202294
Approximate: (Neil)   441ms  0.7198783
Bit Manip: (martinus) 836ms  0.72318
Taylor: (Rex Logan)   261ms  0.7202305
Lookup: (Henrik)      182ms  0.7204863
public static object[] Time(Func<double, float> f) {
    var testvalue = 0.9456;
    var sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < 1e7; i++)
        f(testvalue);
    return new object[] { sw.ElapsedMilliseconds, f(testvalue) };
}
public static void Main(string[] args) {
    Console.WriteLine("Empty:       {0,10}ms {1}", Time(Empty));
    Console.WriteLine("Original:    {0,10}ms {1}", Time(Original));
    Console.WriteLine("Simplified:  {0,10}ms {1}", Time(Simplified));
    Console.WriteLine("Approximate: {0,10}ms {1}", Time(ExpApproximation));
    Console.WriteLine("Bit Manip:   {0,10}ms {1}", Time(BitBashing));
    Console.WriteLine("Taylor:      {0,10}ms {1}", Time(TaylorExpansion));
    Console.WriteLine("Lookup:      {0,10}ms {1}", Time(LUT));
}
Lunna answered 5/1, 2009 at 0:58 Comment(1)
Good stuff! Since F# is .NET, you think it would be possible to include that as well? research.microsoft.com/en-us/downloads/…Panne
P
8

At 100 million calls, I'd start to wonder if profiler overhead isn't skewing your results. Replace the calculation with a no-op and see if it is still reported to consume 60% of the execution time...

Or better yet, create some test data and use a stopwatch timer to profile a million or so calls.

Perigordian answered 5/1, 2009 at 1:5 Comment(0)
A
8

If you're able to interop with C++, you could consider storing all the values in an array and loop over them using SSE like this:

void sigmoid_sse(float *a_Values, float *a_Output, size_t a_Size){
    __m128* l_Output = (__m128*)a_Output;
    __m128* l_Start  = (__m128*)a_Values;
    __m128* l_End    = (__m128*)(a_Values + a_Size);

    const __m128 l_One        = _mm_set_ps1(1.f);
    const __m128 l_Half       = _mm_set_ps1(1.f / 2.f);
    const __m128 l_OneOver6   = _mm_set_ps1(1.f / 6.f);
    const __m128 l_OneOver24  = _mm_set_ps1(1.f / 24.f);
    const __m128 l_OneOver120 = _mm_set_ps1(1.f / 120.f);
    const __m128 l_OneOver720 = _mm_set_ps1(1.f / 720.f);
    const __m128 l_MinOne     = _mm_set_ps1(-1.f);

    for(__m128 *i = l_Start; i < l_End; i++){
        // 1.0 / (1.0 + Math.Pow(Math.E, -value))
        // 1.0 / (1.0 + Math.Exp(-value))

        // value = *i so we need -value
        __m128 value = _mm_mul_ps(l_MinOne, *i);

        // exp expressed as infinite series 1 + x + (x ^ 2 / 2!) + (x ^ 3 / 3!) ...
        __m128 x = value;

        // result in l_Exp
        __m128 l_Exp = l_One; // = 1

        l_Exp = _mm_add_ps(l_Exp, x); // += x

        x = _mm_mul_ps(x, x); // = x ^ 2
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_Half, x)); // += (x ^ 2 * (1 / 2))

        x = _mm_mul_ps(value, x); // = x ^ 3
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver6, x)); // += (x ^ 3 * (1 / 6))

        x = _mm_mul_ps(value, x); // = x ^ 4
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver24, x)); // += (x ^ 4 * (1 / 24))

#ifdef MORE_ACCURATE

        x = _mm_mul_ps(value, x); // = x ^ 5
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver120, x)); // += (x ^ 5 * (1 / 120))

        x = _mm_mul_ps(value, x); // = x ^ 6
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver720, x)); // += (x ^ 6 * (1 / 720))

#endif

        // we've calculated exp of -i
        // now we only need to do the '1.0 / (1.0 + ...' part
        *l_Output++ = _mm_rcp_ps(_mm_add_ps(l_One,  l_Exp));
    }
}

However, remember that the arrays you'll be using should be allocated using _aligned_malloc(some_size * sizeof(float), 16) because SSE requires memory aligned to a boundary.

Using SSE, I can calculate the result for all 100 million elements in around half a second. However, allocating that much memory at a time will cost you nearly two-thirds of a gigabyte, so I'd suggest processing more but smaller arrays at a time. You might even want to consider using a double buffering approach with 100K elements or more.

Also, if the number of elements starts to grow considerably you might want to choose to process these things on the GPU (just create a 1D float4 texture and run a very trivial fragment shader).

Amphiarthrosis answered 5/1, 2009 at 11:15 Comment(1)
+1 for thinking outside the normal bounds and using hardware acceleration.Nosepiece
P
5

Note: This is a follow-up to this post.

Edit: Update to calculate the same thing as this and this, taking some inspiration from this.

Now look what you made me do! You made me install Mono!

$ gmcs -optimize test.cs && mono test.exe
Max deviation is 0.001663983
10^7 iterations using Sigmoid1() took 1646.613 ms
10^7 iterations using Sigmoid2() took 237.352 ms

C is hardly worth the effort anymore, the world is moving forward :)

So, just over a factor of 6 faster. Someone with a Windows box gets to investigate the memory usage and performance using MS-stuff :)

Using LUTs for activation functions is not so uncommon, especially when implemented in hardware. There are many well-proven variants of the concept out there if you are willing to include those types of tables. However, as has already been pointed out, aliasing might turn out to be a problem, but there are ways around that too. Some further reading:

Some gotchas with this:

  • The error goes up when you reach outside the table (but converges to 0 at the extremes); for x approx +-7.0. This is due to the chosen scaling factor. Larger values of SCALE give higher errors in the middle range, but smaller at the edges.
  • This is generally a very stupid test, and I don't know C#; it's just a plain conversion of my C code :)
  • Rinat Abdullin is very much correct that aliasing and precision loss might cause problems, but since I have not seen the variables for that I can only advise you to try this. In fact, I agree with everything he says except for the issue of lookup tables.

Pardon the copy-paste coding...

using System;
using System.Diagnostics;

class LUTTest {
    private const float SCALE = 320.0f;
    private const int RESOLUTION = 2047;
    private const float MIN = -RESOLUTION / SCALE;
    private const float MAX = RESOLUTION / SCALE;
    
    private static readonly float[] lut = InitLUT();

    private static float[] InitLUT() {
      var lut = new float[RESOLUTION + 1];
      
      for (int i = 0; i < RESOLUTION + 1; i++) {
        lut[i] = (float)(1.0 / (1.0 + Math.Exp(-i / SCALE)));
      }
      return lut;
    }
    
    public static float Sigmoid1(double value) {
        return (float) (1.0 / (1.0 + Math.Exp(-value)));
    }
    
    public static float Sigmoid2(float value) {
      if (value <= MIN) return 0.0f;
      if (value >= MAX) return 1.0f;
      if (value >= 0) return lut[(int)(value * SCALE + 0.5f)];
      return 1.0f - lut[(int)(-value * SCALE + 0.5f)];
    }
    
    public static float error(float v0, float v1) {
      return Math.Abs(v1 - v0);
    }
    
    public static float TestError() {
        float emax = 0.0f;
        for (float x = -10.0f; x < 10.0f; x+= 0.00001f) {
          float v0 = Sigmoid1(x);
          float v1 = Sigmoid2(x);
          float e = error(v0, v1);
          if (e > emax) emax = e;
        }
        return emax;
    }

    public static double TestPerformancePlain() {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                Sigmoid1(x);
            }
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }    
    
    public static double TestPerformanceLUT() {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                Sigmoid2(x);
            }
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }    
    
    static void Main() {
        Console.WriteLine("Max deviation is {0}", TestError());
        Console.WriteLine("10^7 iterations using Sigmoid1() took {0} ms", TestPerformancePlain());
        Console.WriteLine("10^7 iterations using Sigmoid2() took {0} ms", TestPerformanceLUT());
    }
}
Panne answered 5/1, 2009 at 19:10 Comment(7)
Thanks for the update. Note, that you might want to pick a different error measurement, since otherwise changing -6.0f; x < 6.0f to 7f kicks the error to 100%.Fiddlestick
Additionally compiling your code in release mode and running it without debugger attached yielded results: 1195ms vs 41ms. That's more than 10 times faster))Fiddlestick
But fixing the Sigmoid1 brought the speed advantage down to 10x. Additionally it is possible to improve Sigmoid2 by 2ms by saving intermediate value. See: rabdullin.com/journal/2009/1/5/…Fiddlestick
I didn't bother getting smart there. I thought it better to keep that one real simple and close to en.wikipedia.org/wiki/Approximation_error It would take something completely different to handle v1=0 well, but that is an inherent and well known and understood weakness of that measurement.Panne
About the 'local extremas', one way we used to get rid of the effects of that in a project (completely unrelated to machine learning) was to add some noise to the signal. I think adding that touch of nondeterminism would help the learning algorithm not getting stuck there.Panne
Re Noise in training - yeah, that's known trick. I've heard of it being used in training hardware ANNs (real-time environment analysis and reaction) with some decent results.Fiddlestick
PS: F# implementation posted. Its even faster.Fiddlestick
F
5

F# Has Better Performance than C# in .NET math algorithms. So rewriting the neural network in F# might improve the overall performance.

If we re-implement the LUT benchmarking snippet (I've been using a slightly tweaked version) in F#, then the resulting code:

  • executes sigmoid1 benchmark in 588.8ms instead of 3899.2ms
  • executes sigmoid2 (LUT) benchmark in 156.6ms instead of 411.4ms

More details could be found in the blog post. Here's the F# snippet JIC:

#light

let Scale = 320.0f;
let Resolution = 2047;

let Min = -single(Resolution)/Scale;
let Max = single(Resolution)/Scale;

let range step a b =
  let count = int((b-a)/step);
  seq { for i in 0 .. count -> single(i)*step + a };

let lut = [| 
  for x in 0 .. Resolution ->
    single(1.0/(1.0 +  exp(-double(x)/double(Scale))))
  |]

let sigmoid1 value = 1.0f/(1.0f + exp(-value));

let sigmoid2 v = 
  if (v <= Min) then 0.0f;
  elif (v>= Max) then 1.0f;
  else
    let f = v * Scale;
    if (v>0.0f) then lut.[int (f + 0.5f)]
    else 1.0f - lut.[int(0.5f - f)];

let getError f = 
  let test = range 0.00001f -10.0f 10.0f;
  let errors = seq { 
    for v in test -> 
      abs(sigmoid1(single(v)) - f(single(v)))
  }
  Seq.max errors;

open System.Diagnostics;

let test f = 
  let sw = Stopwatch.StartNew(); 
  let mutable m = 0.0f;
  let result = 
    for t in 1 .. 10 do
      for x in 1 .. 1000000 do
        m <- f(single(x)/100000.0f-5.0f);
  sw.Elapsed.TotalMilliseconds;

printf "Max deviation is %f\n" (getError sigmoid2)
printf "10^7 iterations using sigmoid1: %f ms\n" (test sigmoid1)
printf "10^7 iterations using sigmoid2: %f ms\n" (test sigmoid2)

let c = System.Console.ReadKey(true);

And the output (Release compilation against F# 1.9.6.2 CTP with no debugger):

Max deviation is 0.001664
10^7 iterations using sigmoid1: 588.843700 ms
10^7 iterations using sigmoid2: 156.626700 ms

UPDATE: updated benchmarking to use 10^7 iterations to make results comparable with C

UPDATE2: here are the performance results of the C implementation from the same machine to compare with:

Max deviation is 0.001664
10^7 iterations using sigmoid1: 628 ms
10^7 iterations using sigmoid2: 157 ms
Fiddlestick answered 6/1, 2009 at 11:30 Comment(5)
I'll update the C-code with time-keeping, but I can't do F#, and our machines differ so much I think it's better if you ran the test. Stay tunedPanne
Hey, you're doing 10^7 iterations, are you not?Panne
Duh, yes. I've fixed this typo in the blog post and forgot about this snippet. Thanks. Re C snippet, I'll run it on my machine a bit later. Just need to get some C compiler.Fiddlestick
These were my numbers in mono: 10^7 iterations using sigmoid1: 1661.244000 ms 10^7 iterations using sigmoid2: 732.762000 msPanne
@Rinat Abdullin: Your benchmark is wrong. The effect you are observing is using float as a counter in the for loop in C#. If you use int as the counter and a delegate in C# to execute the sigmoid algorithm as you did in F#, C# is marginally faster. thoughtfulcode.wordpress.com/2010/12/30/…Hakan
L
4

First thought: How about some stats on the value variable?

  • Are the values of "value" typically small -10 <= value <= 10?

If not, you can probably get a boost by testing for out of bounds values

if(value < -10)  return 0;
if(value > 10)  return 1;
  • Are the values repeated often?

If so, you can probably get some benefit from Memoization (probably not, but it doesn't hurt to check....)

if (sigmoidCache.ContainsKey(value)) return sigmoidCache[value];

If neither of these can be applied, then as some others have suggested, maybe you can get away with lowering the accuracy of your sigmoid...

Labelle answered 5/1, 2009 at 2:5 Comment(3)
Memoization in this case is probably pretty expensive.Panne
Henrik: Quite possibly, yes. It may still be worthwhile depending on how often the same value is being passed to the function. I'm not sure how the rest of the algorithm uses this function, but it might re-call it many times, unnecessarily.Labelle
We're dealing with floats and neural networks, I assume the values will be all over the place :)Panne
C
4

Soprano had some nice optimizations for your call:

public static float Sigmoid(double value) 
{
    float k = (float) Math.Exp(value);
    return k / (1.0f + k);
}

If you try a lookup table and find it uses too much memory, you could always look at the value of your parameter for each successive call and employ some caching technique.

For example, try caching the last value and result. If the next call has the same value as the previous one, you don't need to calculate it, as you'd have cached the last result. If the current call was the same as the previous call even 1 out of 100 times, you could potentially save yourself 1 million calculations.

Or, you may find that within 10 successive calls, the value parameter is on average the same 2 times, so you could then try caching the last 10 values/answers.
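A minimal sketch of the single-entry variant (field names are illustrative; note it is not thread-safe, so it would need per-thread state if the calls are parallelized):

private static double lastValue = double.NaN;
private static float lastResult;

public static float SigmoidCached(double value) {
    // Reuse the previous result when the same input arrives twice in a row.
    if (value == lastValue) return lastResult;
    lastValue = value;
    lastResult = (float) (1.0 / (1.0 + Math.Exp(-value)));
    return lastResult;
}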

Caret answered 5/1, 2009 at 3:35 Comment(0)
H
4

Off the top of my head, this paper explains a way of approximating the exponential by abusing floating point (click the link in the top right for the PDF), but I don't know if it'll be of much use to you in .NET.

Also, another point: for the purpose of training large networks quickly, the logistic sigmoid you're using is pretty terrible. See section 4.4 of Efficient Backprop by LeCun et al and use something zero-centered (actually, read that whole paper, it's immensely useful).

Hibbert answered 5/1, 2009 at 11:30 Comment(1)
Your paper link appears broken now.Lampas
E
2

Idea: Perhaps you can make a (large) lookup table with the values pre-calculated?

Elberfeld answered 5/1, 2009 at 1:0 Comment(6)
I'll give that a try. Hopefully the table won't grow to titanic proportions.Nerland
hb: this could backfire very badly. If you're not sure about the maximum size you'd have to implement a structure with a limited size (kind of like a cache) and this is not a trivial task.Hypothesis
I know - gave it a quick test, got OutOfMemoryException. soprano's function helped heapsNerland
Lookup time will be faster than a Math.Exp time?Tiernan
Perhaps a table combined with interpolation, using fixed-point math (i.e. scaled integers), would work. In the old days, the 88000 DSP processor supported that in machine code.Oculomotor
Large lookup has lesser chance of fitting CPU cacheFiddlestick
P
2

This is slightly off topic, but just out of curiosity, I did the same implementation as the one in C, C# and F# in Java. I'll just leave this here in case someone else is curious.

Result:

$ javac LUTTest.java && java LUTTest
Max deviation is 0.001664
10^7 iterations using sigmoid1() took 1398 ms
10^7 iterations using sigmoid2() took 177 ms

I suppose the improvement over C# in my case is due to Java being better optimized than Mono for OS X. On a similar MS .NET-implementation (vs. Java 6 if someone wants to post comparative numbers) I suppose the results would be different.

Code:

public class LUTTest {
    private static final float SCALE = 320.0f;
    private static final  int RESOLUTION = 2047;
    private static final  float MIN = -RESOLUTION / SCALE;
    private static final  float MAX = RESOLUTION / SCALE;

    private static final float[] lut = initLUT();

    private static float[] initLUT() {
        float[] lut = new float[RESOLUTION + 1];

        for (int i = 0; i < RESOLUTION + 1; i++) {
            lut[i] = (float)(1.0 / (1.0 + Math.exp(-i / SCALE)));
        }
        return lut;
    }

    public static float sigmoid1(double value) {
        return (float) (1.0 / (1.0 + Math.exp(-value)));
    }

    public static float sigmoid2(float value) {
        if (value <= MIN) return 0.0f;
        if (value >= MAX) return 1.0f;
        if (value >= 0) return lut[(int)(value * SCALE + 0.5f)];
        return 1.0f - lut[(int)(-value * SCALE + 0.5f)];
    }

    public static float error(float v0, float v1) {
        return Math.abs(v1 - v0);
    }

    public static float testError() {
        float emax = 0.0f;
        for (float x = -10.0f; x < 10.0f; x+= 0.00001f) {
            float v0 = sigmoid1(x);
            float v1 = sigmoid2(x);
            float e = error(v0, v1);
            if (e > emax) emax = e;
        }
        return emax;
    }

    public static long sigmoid1Perf() {
        float y = 0.0f;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                y = sigmoid1(x);
            }
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("",y);
        return t1 - t0;
    }    

    public static long sigmoid2Perf() {
        float y = 0.0f;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                y = sigmoid2(x);
            }
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("",y);
        return t1 - t0;
    }    

    public static void main(String[] args) {

        System.out.printf("Max deviation is %f\n", testError());
        System.out.printf("10^7 iterations using sigmoid1() took %d ms\n", sigmoid1Perf());
        System.out.printf("10^7 iterations using sigmoid2() took %d ms\n", sigmoid2Perf());
    }
}
Panne answered 6/1, 2009 at 17:42 Comment(1)
Just want it to be complete :)Panne
H
2

I realize that it has been a year since this question popped up, but I ran across it because of the discussion of F# and C performance relative to C#. I played with some of the samples from other responders and discovered that delegates appear to execute faster than a regular method invocation, but there is no apparent performance advantage to F# over C#.

  • C: 166ms
  • C# (delegate): 275ms
  • C# (method): 431ms
  • C# (method, float counter): 2,656ms
  • F#: 404ms

The C# with a float counter was a straight port of the C code. It is much faster to use an int in the for loop.
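To make the difference concrete, a hedged sketch of the two loop styles being compared (Sigmoid stands for whichever implementation is being timed):

// Straight port of the C benchmark: a float loop counter, much slower in C#.
for (float x = -5.0f; x < 5.0f; x += 0.00001f)
    Sigmoid(x);

// The same 10^6 evaluations driven by an int counter: considerably faster.
for (int i = 0; i < 1000000; i++)
    Sigmoid(i * 0.00001f - 5.0f);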

Hakan answered 31/12, 2010 at 0:15 Comment(0)
M
2

There are much faster functions that do very similar things:

x / (1 + abs(x)) – fast replacement for TANH

And similarly:

x / (2 + 2 * abs(x)) + 0.5 - fast replacement for SIGMOID

Compare plots with actual sigmoid
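In C# these would look roughly like the following (a sketch; the method names are mine):

// Fast tanh-like activation: x / (1 + |x|), output in (-1, 1).
public static float FastTanh(float x) {
    return x / (1.0f + Math.Abs(x));
}

// Fast sigmoid-like activation: x / (2 + 2|x|) + 0.5, output in (0, 1).
public static float FastSigmoid(float x) {
    return x / (2.0f + 2.0f * Math.Abs(x)) + 0.5f;
}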

Marcin answered 29/10, 2016 at 17:32 Comment(0)
P
1

(Updated with performance measurements)(Updated again with real results :)

I think a lookup table solution would get you very far when it comes to performance, at a negligible memory and precision cost.

The following snippet is an example implementation in C (I don't speak C# fluently enough to dry-code it). It runs and performs well enough, but I'm sure there's a bug in it :)

#include <math.h>
#include <stdio.h>
#include <time.h>

#define SCALE 320.0f
#define RESOLUTION 2047
#define MIN -RESOLUTION / SCALE
#define MAX RESOLUTION / SCALE

static float sigmoid_lut[RESOLUTION + 1];

void init_sigmoid_lut(void) {
    int i;    
    for (i = 0; i < RESOLUTION + 1; i++) {
        sigmoid_lut[i] =  (1.0 / (1.0 + exp(-i / SCALE)));
    }
}

static float sigmoid1(const float value) {
    return (1.0f / (1.0f + expf(-value)));
}

static float sigmoid2(const float value) {
    if (value <= MIN) return 0.0f;
    if (value >= MAX) return 1.0f;
    if (value >= 0) return sigmoid_lut[(int)(value * SCALE + 0.5f)];
    return 1.0f-sigmoid_lut[(int)(-value * SCALE + 0.5f)];
}

float test_error() {
    float x;
    float emax = 0.0;

    for (x = -10.0f; x < 10.0f; x+=0.00001f) {
        float v0 = sigmoid1(x);
        float v1 = sigmoid2(x);
        float error = fabsf(v1 - v0);
        if (error > emax) { emax = error; }
    } 
    return emax;
}

int sigmoid1_perf() {
    clock_t t0, t1;
    int i;
    float x, y = 0.0f;

    t0 = clock();
    for (i = 0; i < 10; i++) {
        for (x = -5.0f; x <= 5.0f; x+=0.00001f) {
            y = sigmoid1(x);
        }
    }
    t1 = clock();
    printf("", y); /* To avoid sigmoidX() calls being optimized away */
    return (t1 - t0) / (CLOCKS_PER_SEC / 1000);
}

int sigmoid2_perf() {
    clock_t t0, t1;
    int i;
    float x, y = 0.0f;
    t0 = clock();
    for (i = 0; i < 10; i++) {
        for (x = -5.0f; x <= 5.0f; x+=0.00001f) {
            y = sigmoid2(x);
        }
    }
    t1 = clock();
    printf("", y); /* To avoid sigmoidX() calls being optimized away */
    return (t1 - t0) / (CLOCKS_PER_SEC / 1000);
}

int main(void) {
    init_sigmoid_lut();
    printf("Max deviation is %0.6f\n", test_error());
    printf("10^7 iterations using sigmoid1: %d ms\n", sigmoid1_perf());
    printf("10^7 iterations using sigmoid2: %d ms\n", sigmoid2_perf());

    return 0;
}

Previous results were due to the optimizer doing its job and optimizing away the calculations. Making it actually execute the code yields slightly different and much more interesting results (on my way slow MB Air):

$ gcc -O2 test.c -o test && ./test
Max deviation is 0.001664
10^7 iterations using sigmoid1: 571 ms
10^7 iterations using sigmoid2: 113 ms



TODO:

There are things to improve and ways to remove weaknesses; how to do so is left as an exercise for the reader :)

  • Tune the range of the function to avoid the jump where the table starts and ends.
  • Add a slight noise function to hide the aliasing artifacts.
  • As Rex said, interpolation could get you quite a bit further precision-wise while being rather cheap performance-wise.
Panne answered 5/1, 2009 at 0:58 Comment(6)
What you are doing is fixed point math. I prefer to use a scale factor that is 2^n like 256, 1024, 65536. The ideal ones are 2^8 or 2^16, since then you can just grab bytes to get at the integer part. You can probably get an even bigger boost if tables are used with fixed point throughout the entire app.Ovoid
Well, I intentionally did not use fixed point math. I could, of course, but I would rather avoid that additional source of aliasing.Panne
Also I did not use any of the more magic bit-twiddling optimizations since the code is supposed to be 1. readable, 2. easy to translate to C# (where most of it would be useless anyway).Panne
Of course, using 32 bit ints everywhere would probably be fastest, but given the nature of the sigmoid function fixed point math might not be the best choice. (I don't like the comment limit :)Panne
Another method I have used is to generate a table that is an array of floats and then interpolate between the points. To get rid of the quantization effects. Depending on how you build it it can still be fast. Might work well for a function like this that is squeezing the range down.Ovoid
Interpolation would certainly work very well here, but I don't think it's worth the extra complexity. Using just the 2k data points there was an error of <0.25% within the range of the table. Just outside it jumps up to almost 0.5%, but that could be fixed by adjusting the table coverage.Panne
C
1

You might also consider experimenting with alternative activation functions which are cheaper to evaluate. For example:

f(x) = (3x - x**3)/2

(which could be factored as

f(x) = x*(3 - x*x)/2

for one less multiplication). This function has odd symmetry, and its derivative is trivial. Using it for a neural network requires normalizing the sum-of-inputs by dividing by the total number of inputs (limiting the domain to [-1..1], which is also the range).
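A quick C# sketch of this activation and its derivative (method names are mine; it assumes the input has already been normalized into [-1, 1]):

// f(x) = x * (3 - x*x) / 2, mapping [-1, 1] onto [-1, 1].
public static float Cubic(float x) {
    return x * (3.0f - x * x) * 0.5f;
}

// f'(x) = 3 * (1 - x*x) / 2 -- cheap to evaluate during backpropagation.
public static float CubicDerivative(float x) {
    return 1.5f * (1.0f - x * x);
}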

Canaliculus answered 5/1, 2009 at 4:1 Comment(0)
D
1

A mild variation on Soprano's theme:

public static float Sigmoid(double value) {
    float v = (float) value;
    float k = (float) Math.Exp(v);
    return k / (1.0f + k);
}

Since you're only after a single-precision result, why make the Math.Exp function calculate a double? Any exponent calculator that uses an iterative summation (see the series expansion of e^x) will take longer for more precision, each and every time. And double is twice the work of single! This way, you convert to single first, then do your exponential.

But the expf function should be faster still. I don't see the need for soprano's (float) cast in passing to expf though, unless C# doesn't do implicit float-double conversion.

Otherwise, just use a real language, like FORTRAN...

Dostie answered 5/1, 2009 at 12:1 Comment(1)
When did Math.Exp take floats?Lunna
P
1

There are a lot of good answers here. I would suggest running it through this technique, just to make sure

  • You're not calling it any more times than you need to.
    (Sometimes functions get called more than necessary, just because they are so easy to call.)
  • You're not calling it repeatedly with the same arguments
    (where you could use memoization)

BTW the function you have is the inverse logit function,
or the inverse of the log-odds-ratio function log(f/(1-f)).

Potassium answered 29/11, 2009 at 1:51 Comment(0)
W
0

Doing a Google search, I found an alternative implementation of the Sigmoid function.

public double Sigmoid(double x)
{
   return 2 / (1 + Math.Exp(-2 * x)) - 1;
}

Is that correct for your needs? Is it faster? (Note that this is the bipolar sigmoid, mathematically equal to tanh(x), so its output ranges from -1 to 1 rather than from 0 to 1.)

http://dynamicnotions.blogspot.com/2008/09/sigmoid-function-in-c.html

Windswept answered 5/1, 2009 at 1:18 Comment(1)
Giving this a try atm, I'll post back in a minuteNerland
C
0

1) Do you call this from only one place? If so, you may gain a small amount of performance by moving the code out of that function and just putting it right where you would normally have called the Sigmoid function. I don't like this idea in terms of code readability and organization but when you need to get every last performance gain, this might help because I think function calls require a push/pop of registers on the stack, which could be avoided if your code was all inline.

2) I have no idea if this might help, but try making your function parameter a ref parameter. See if it's faster. I would have suggested making it const (which would have been an optimization if this were in C++) but C# doesn't support const parameters.

Caret answered 5/1, 2009 at 3:56 Comment(0)
Q
0

If you need a giant speed boost, you could probably look into parallelizing the function using the (ge)force. IOW, use DirectX to get the graphics card to do it for you. I have no idea how to do this, but I've seen people use graphics cards for all kinds of calculations.

Quick answered 5/1, 2009 at 17:6 Comment(0)
M
0

I've seen that a lot of people around here are trying to use approximation to make Sigmoid faster. However, it is important to know that Sigmoid can also be expressed using tanh, not only exp. Calculating Sigmoid this way is around 5 times faster than with exponential, and by using this method you are not approximating anything, thus the original behaviour of Sigmoid is kept as-is.

    public static double Sigmoid(double value)
    {
        return 0.5d + 0.5d * Math.Tanh(value/2);
    }

Of course, parallelization would be the next step to performance improvement, but as far as the raw calculation is concerned, using Math.Tanh is faster than Math.Exp.

Massasoit answered 9/7, 2017 at 16:26 Comment(0)
C
0

Remember, Sigmoid constrains results to the range between 0 and 1. Values smaller than about -10 return a value very, very close to 0.0. Values greater than about 10 return a value very, very close to 1.

Back in the old days, when computers couldn't handle arithmetic overflow/underflow that well, putting in if conditions to limit the calculation was usual. If I were really concerned about its performance (or basically Math's performance), I would change your code to the old-fashioned way (and mind the limits) so that it does not call Math unnecessarily:

public double Sigmoid(double value)
{
    if (value < -45.0) return 0.0;
    if (value > 45.0) return 1.0;
    return 1.0 / (1.0 + Math.Exp(-value));
}

I realize anyone reading this answer may be involved in some sort of NN development. Be mindful of how the above affects the other aspects of your training scores.

Corncob answered 11/5, 2020 at 9:57 Comment(0)
