Julia vs Mathematica: Numerical integration performance
Julia absolute newcomer here with a question for you.

I'm teaching myself some Julia by porting some of my Mathematica and Python code (mostly scientific computations in physics etc.) and seeing what's what. So far things have been pretty smooth. And fast. Until now.

Now I'm simulating an elementary lock-in amplifier which, in essence, takes a possibly very complicated time-dependent signal, Uin(t), and produces an output, Uout(t), phase-locked at some reference frequency fref (that is, it highlights the component of Uin(t) that has a certain phase relationship with a reference sine wave). The details matter little; what matters is that it basically does this by calculating the integral (I'm omitting the phase here for clarity):

    Uout(t) = (1/T) ∫[t−T, t] Uin(s) sin(2π fref s) ds

So, I set out and tested this in Mathematica and Julia: I define a mockup Uin(t), pass some parameter values, and then build an array of Uout(t), at time t = 0, for a range of fref.

  • Julia: I used the QuadGK package for numerical integration.

    T = 0.1
    f = 100.
    Uin(t) = sin(2pi * f * t) + sin(2pi * 2f *t)
    Uout(t, fref)::Float64 = quadgk(s -> Uin(s) * sin(2pi * fref * s), t-T, t, rtol=1E-3)[1]/T
    frng = 80.:1.:120.
    print(@time broadcast(Uout, 0., frng))
    
  • Mathematica

    T = 0.1;
    f = 100.;
    Uin[t_] := Sin[2 π f t] + Sin[2 π 2 f t]
    Uout[t_, fref_] := NIntegrate[Sin[2 π fref s] Uin[s], {s, t - T, t}]/T
    frng = Table[i, {i, 80, 120, 1}];
    Timing[Table[Uout[0, fr], {fr, frng}]]
    

Results:

Julia timed the operation anywhere between 45 and 55 seconds, on an i7-5xxx laptop on battery power, which is a lot, while Mathematica did it in ~2 seconds. The difference is abysmal and, honestly, hard to believe. I know Mathematica has some pretty sweet and refined algorithms in its kernel, but Julia is Julia. So, question is: what gives?

P.S.: setting f and T as const does reduce Julia's time to ~8-10 seconds, but f and T cannot be consts in the actual program. Other than that, is there something obvious I'm missing?

EDIT Feb 2, 2020:

The slowdown seems to be due to the algorithm trying to hunt down precision when the value is near zero, e.g. see below: for fref = 95 the calculation takes a whole second(!), while for adjacent frequency values it computes instantly (the returned result is a tuple of (res, error)). It seems the quadgk function stalls at very small values:

  0.000124 seconds (320 allocations: 7.047 KiB)
fref = 94.0 (-0.08637214864144352, 9.21712218998258e-6)

  1.016830 seconds (6.67 M allocations: 139.071 MiB, 14.35% gc time)
fref = 95.0 (-6.088184966010742e-16, 1.046186419361636e-16)

  0.000124 seconds (280 allocations: 6.297 KiB)
fref = 96.0 (0.1254003757465191, 0.00010132083518769636)

Notes: this happens regardless of what tolerance I request. Also, Mathematica generally hits machine-precision tolerances by default, while slowing down somewhat at near-zeros, and numpy/scipy just fly through the whole thing but produce less precise results than Mathematica (at default settings; I didn't look much into this).
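For reference, per-frequency timings like those above can be reproduced with a loop along these lines (a sketch based on the definitions earlier in the question, not the exact script I ran):

```julia
using QuadGK

T = 0.1
f = 100.0
Uin(t) = sin(2pi * f * t) + sin(2pi * 2 * f * t)

# Time each reference frequency separately to expose the slow near-zero cases
for fref in 94.0:1.0:96.0
    res = @time quadgk(s -> Uin(s) * sin(2pi * fref * s), -T, 0.0, rtol=1e-3)
    println("fref = $fref ", res)
end
```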

Cryptozoite answered 7/2, 2020 at 22:8 Comment(7)
You can enclose the variables instead of having them as globals.Homolographic
I strongly recommend reading the performance tips section of the manual: docs.julialang.org/en/v1/manual/performance-tips. In particular, read performance tip number one.Augmentative
@ChrisRackauckas enclose? as in use them as arguments? Yes, see the response below by ARamirez.Cryptozoite
@Augmentative I did :( . However, as I said, these can't be constants, and in the REPL they are globals. However, I also did it in code, where I enclosed them in the local scope of a function, so probably not globals. Will figure it out eventually, thanks.Cryptozoite
Are the tolerance and bit width the same in Mathematica?Toilsome
@FredrikBagge Mathematica, as far as I understand, uses $MachinePrecision numbers (that would be Float64 for my laptop) if input is also $MachinePrecision, otherwise it will use precision that returns only computationally correct digits based on the input (this is quite a bit slower than $MachinePrecision, though). Tolerance, at 1E-3 seems pretty timid, really.Cryptozoite
Why don't you make an appropriate comparison setting Method -> "GaussKronrodRule" in Mathematica (see here for documentation)? It looks like the Mathematica code can be made significantly faster as well.Mysterious
A
13

Your problem is related to the choice of error tolerance. A relative error of 1e-3 doesn't sound so bad, but it actually is when the integral is close to zero. In particular, this happens when fref = 80.0 (and 85, 90, 95, but not 100, 105, etc.):

julia> Uout(0.0, 80.0, f, T)
1.2104987553880609e-16

To quote from the docstring of quadgk:

(Note that it is useful to specify a positive atol in cases where norm(I) may be zero.)
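The effect is easy to reproduce in isolation: the integral of sin over a full period is exactly zero, so the requested bound rtol * |I| is also zero, and the adaptive rule keeps subdividing until its error estimate bottoms out near machine precision, whereas a modest atol lets it stop early (a minimal sketch, not part of the original answer):

```julia
using QuadGK

# Integral is exactly 0, so the stopping criterion error <= rtol*|I| == 0
# can essentially never be met early:
I1, E1 = @time quadgk(sin, 0, 2pi, rtol=1e-3)

# With an absolute tolerance, the rule stops as soon as the
# error estimate drops below atol:
I2, E2 = @time quadgk(sin, 0, 2pi, rtol=1e-3, atol=1e-6)
```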

Let's try to set an absolute tolerance, for example 1e-6, and compare. First the code (using the code from @ARamirez):

Uin(t, f) = sin(2π * f * t) + sin(4π * f * t)

function Uout(t, fref, f , T)
    quadgk(s -> Uin(s, f) * sin(2π * fref * s), t-T, t, rtol=1e-3)[1]/T
end
function Uout_new(t, fref, f , T) # with atol
    quadgk(s -> Uin(s, f) * sin(2π * fref * s), t-T, t, rtol=1e-3, atol=1e-6)[1]/T
end

Then the benchmarking (use BenchmarkTools for this)

using BenchmarkTools
T = 0.1
f = 100.0
freqs = 80.0:1.0:120.0

@btime Uout.(0.0, $freqs, $f, $T);
6.302 s (53344283 allocations: 1.09 GiB)

@btime Uout_new.(0.0, $freqs, $f, $T);
1.270 ms (11725 allocations: 262.08 KiB)

OK, that's 5000 times faster. Is that ok?

Augmentative answered 9/2, 2020 at 21:18 Comment(1)
@Augmentative Excellent sir, answer accepted! What's funny is that I just realized the same thing myself, that the quadgk function stalls at near zero integral values, and edited the description of the issue. Just now! But I didn't find out about atol, and you did, thank you.Cryptozoite
V
9

The first problem I see with your code is that it is type unstable. This is because you are using global variables (see performance tip number one in the Julia Performance Tips): the compiler cannot know the types of f and T, which you use inside your functions, so it cannot compile them efficiently. This is also why the performance improves when you mark them as const: the compiler then has a guarantee that they will not change their type, so it can compile your two functions efficiently.

How to see that your code is unstable

If you run your first function with the macro @code_warntype like this:

@code_warntype Uin(0.1)

You will see an output like this:

julia> @code_warntype Uin(0.1)
Variables
  #self#::Core.Compiler.Const(Uin, false)
  t::Float64

Body::Any
1 ─ %1 = (2.0 * Main.pi)::Core.Compiler.Const(6.283185307179586, false)
│   %2 = (%1 * Main.f * t)::Any
│   %3 = Main.sin(%2)::Any
│   %4 = (2.0 * Main.pi)::Core.Compiler.Const(6.283185307179586, false)
│   %5 = (2.0 * Main.f)::Any
│   %6 = (%4 * %5 * t)::Any
│   %7 = Main.sin(%6)::Any
│   %8 = (%3 + %7)::Any
└──      return %8

All those Anys tell you that the compiler doesn't know the type of the output at any step.

How to fix

You can redefine your functions to take in f and T as variables:

Uin(t, f) = sin(2.0pi * f * t) + sin(2.0pi * 2.0 * f * t)
Uout(t, fref, f, T)::Float64 = quadgk(s -> Uin(s, f) * sin(2pi * fref * s), t-T, t, rtol=1E-3)[1]/T

With these redefinitions, your code runs much faster. If you try to check them with @code_warntype you will see that now the compiler correctly infers the type of everything.
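For comparison, here is a quick way to confirm that the two-argument version is inferred (a sketch; note that @code_warntype lives in InteractiveUtils when used outside the REPL):

```julia
using InteractiveUtils  # provides @code_warntype in scripts

Uin(t, f) = sin(2.0pi * f * t) + sin(2.0pi * 2.0 * f * t)

# The Body line now reports Float64 instead of Any:
@code_warntype Uin(0.1, 100.0)
```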

For further performance improvements, check out the Julia Performance Tips.

In particular, the generally advised way to measure performance is @btime from the BenchmarkTools package rather than @time, because @time also measures compilation time. (Another option is to run @time twice; the second measurement will be accurate, since by then all functions have had a chance to compile.)
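As a sketch of the measurement itself, with the two-argument functions from above (interpolating the globals with $ so the benchmark is not penalized by dynamic dispatch; fref = 94 is chosen to avoid the near-zero stall discussed elsewhere on this page):

```julia
using BenchmarkTools, QuadGK

Uin(t, f) = sin(2pi * f * t) + sin(2pi * 2 * f * t)
Uout(t, fref, f, T) = quadgk(s -> Uin(s, f) * sin(2pi * fref * s), t - T, t, rtol=1e-3)[1] / T

T = 0.1
f = 100.0

@time  Uout(0.0, 94.0, f, T)    # first call: includes compilation
@btime Uout(0.0, 94.0, $f, $T)  # compiles once, then samples many runs
```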

Virgy answered 7/2, 2020 at 23:2 Comment(2)
Thanks for that. Ok, I thought the compiler would infer that f and T were floats, since I had already assigned values to them, but it makes sense and I should have checked; nothing prevents me from redefining them later and the compiler can't know that. Fair.Cryptozoite
After all the adjustments, execution time drops to about 7.5 - 8 seconds, still a factor of 4 slower than Mathematica. The obvious elephant in the room though is the ~50M allocations of ~1Gb!Cryptozoite
T
5

There are various things you can do to speed it up further. Changing the order of the integration helped a bit, using Float32 instead of Float64 made a small improvement, and using @fastmath made a further small improvement. One can also make use of SLEEFPirates.sin_fast.

using QuadGK, ChangePrecision

@changeprecision Float32 begin
    T = 0.1
    f = 100.
    @inline @fastmath Uin(t,f) = sin(2pi * f * t) + sin(2pi * 2f *t)
    @fastmath Uout(t, fref,f,T) = first(quadgk(s -> Uin(s,f) * sin(2pi * fref * s), t-T, t, rtol=1e-2, order=10))/T

    frng = 80.:1.:120.
    @time Uout.(0., frng, f, T)
end
Toilsome answered 8/2, 2020 at 4:31 Comment(7)
Thank you kind stranger, you are hereby awarded the golden bagge (no one's ever made that pun on your name, right?). Ok, "@inline" or reducing the tolerance didn't make much difference, "@fastmath" I'd rather avoid, "order = 10" dropped a couple of seconds, but... "@changeprecision" Float32 dropped execution time from 8 seconds to 0.25 seconds, and the 50M allocations of ~1 GB to about 700k allocations of 37 MB! What sorcery is this? Recasting the whole thing explicitly as Float32 had made no difference before; how does "@changeprecision" work?Cryptozoite
ChangePrecision only sees the syntax the macro is applied to, so it can only modify numeric literals. My guess is that you missed a literal somewhere before when specifying Float32 manually. The tool Cthulhu.jl can help you figure out what types are used inside functions arbitrarily deep down the call stack.Toilsome
Thanks for the follow-up. The reason I haven't accepted this as the answer is that (a) it still feels like kind of a hack (even the readme at the github page of ChangePrecision says it's meant for quick hacks), especially given that Mathematica does 1.5-2 sec at 64-bit precision, and a numpy/scipy quick hack is even faster (0.2 sec), and (b) The 30x enhancement going from 64- to 32-bit floats seems outlandish, as is the diff in memory allocations (1Gb to 40Mb). In your response you say "using Float32 instead of Float64 made a small improvement". Would you share how much of an improvement?Cryptozoite
For me the improvement was much smaller, I didn't get it down to under 5s. I ran on a very old laptop so there might be differences in the processor architecture, SIMD width etc.Toilsome
I ran the code through Traceur.jl and it did not complain about any instabilities or other issues.Toilsome
Apparently the function was iterating deeply to meet relative tolerance requirements for zero values (fref = 80, 85, ...). At these frequencies, each computation was taking orders of magnitude longer. Passing a small absolute tolerance value to the function sped up execution by a factor of > 1000. See @DNF's accepted solution above.Cryptozoite
A factor of 1000 is just very good stuff :) Then you can perhaps forget about all the @inline and Float32 stuff to make the code cleaner.Toilsome
