Julia Memory Allocation for Addition of Two Matrices in place

Asked 17/2, 2016 at 19:19 Answered 28/5, 2020 at 0:50

I'm curious why Julias implementation of matrix addition appears to make a copy. Heres an example:

foo1=rand(1000,1000)
foo2=rand(1000,1000)
foo3=rand(1000,1000)

julia> @time foo1=foo2+foo3;
  0.001719 seconds (9 allocations: 7.630 MB)
julia> sizeof(foo1)/10^6
8.0

The amount of memory allocated is roughly the same as the memory required by a matrix of these dimensions.

It looks like in order to process foo2+foo3 memory is allocated to store the result and then foo1 is assigned to it by reference.

Does this mean that for most linear algebra operations we need to call BLAS and LAPACK functions directly to do things in place?

Aristotle answered 17/2, 2016 at 19:19 Comment(2)

sizeof(foo1)/(1024*1024) does in fact round to 7.630. – Ally 17/2, 2016 at 20:11

It's worth noting that Eigen C++ library does not do this eigen.tuxfamily.org/dox/TopicLazyEvaluation.html – Aristotle 18/2, 2016 at 15:35

To understand what is going on here, let's consider what foo1 = foo2 + foo3 actually does.

First it evaluates foo2 + foo3. To do this it will allocate a new temporary array to hold the output
Then it will bind the name foo1 to this new temporary array, undoing all effort you put in to pre-allocate the output array.

In short, you see that memory usage is about that of the resultant array because the routine is indeed allocating new memory for an array of that size.

Here are some alternatives:

write a loop
use broadcast!
We could try do do copy!(foo1, foo2+foo3) and then the array you pre-allocated will be filled, but it will still allocate the temporary (see below)
The original version posted here

Here's some code for those 4 cases

julia> function with_loop!(foo1, foo2, foo3)
           for i in eachindex(foo2)
              foo1[i] = foo2[i] + foo3[i]
           end
       end

julia> function with_broadcast!(foo1, foo2, foo3)
           broadcast!(+, foo1, foo2, foo3)
       end

julia> function with_copy!(foo1, foo2, foo3)
           copy!(foo1, foo2+foo3)
       end

julia> function original(foo1, foo2, foo3)
           foo1 = foo2 + foo3
       end

Now let's time these functions

julia> for f in [:with_broadcast!, :with_loop!, :with_copy!, :original]
           @eval $f(foo1, foo2, foo3) # compile
           println("timing $f")
           @eval @time $f(foo1, foo2, foo3)
       end
timing with_broadcast!
  0.001787 seconds (5 allocations: 192 bytes)
timing with_loop!
  0.001783 seconds (4 allocations: 160 bytes)
timing with_copy!
  0.003604 seconds (9 allocations: 7.630 MB)
timing original
  0.002702 seconds (9 allocations: 7.630 MB, 97.91% gc time)

You can see that with_loop! and broadcast! do about the same and both are much faster and more efficient than the others. with_copy! and original are both slower and use more memory.

In general, to do inplace operations I'd recommend starting out by writing a loop

Himeji answered 17/2, 2016 at 20:8 Comment(1)

it looks like this is a pretty informative answer, perhaps the original poster could accept it? – Mandibular 15/6, 2016 at 2:53

First, read @spencerlyon2's answer. Another approach is to use Dahua Lin's package Devectorize.jl. It defines the @devec macro which automates the translations of vector (array) expressions into looped code.

In this example we will define with_devec!(foo1,foo2,foo3) as follows,

julia> using Devectorize      # install with Pkg.add("Devectorize")

julia> with_devec!(foo1,foo2,foo3) = @devec foo1[:]=foo2+foo3

Running the benchmark achieves the 4 allocations results.

Slipslop answered 17/2, 2016 at 22:39 Comment(0)

You can use axpy! function from the LinearAlgebra package.

using LinearAlgebra

julia> @time BLAS.axpy!(1., foo2, foo3)
  0.002187 seconds (4 allocations: 160 bytes)

Claybourne answered 28/5, 2020 at 0:50 Comment(2)

This is a bad answer as for almost all cases, foo1 .= foo2 .+ foo3 is more appropriate – Swashbuckler 28/5, 2020 at 0:58

More appropriate in what aspect? If you benchmark, the median runtime of axpy! is 1.2ms whereas your suggestion is 1.8ms. – Claybourne 31/5, 2020 at 18:4

Recommended topics

Hot tags