Numpy fusing multiply and add to avoid wasting memory

Asked 19/7, 2017 at 20:19 Answered 2/5, 2021 at 16:31

Solved python numpy memory multiplication

Is it possible to multiply two ndarray A, and B and add the result to C, without creating a large intermediate array for A times B?

Numpy has the out keyword parameter for the case of C = A times B:

numpy.multiply(A, B, out=C)

How about the case of C += A times B?

Subterranean answered 19/7, 2017 at 20:19 Comment(5)

Then use numpy.add(... out = C)? No extra array created that way. – Remington 19/7, 2017 at 20:21

@Remington that still creates an intermediate for A times B – Subterranean 19/7, 2017 at 20:21

Are you worried about memory or performance or both? In the comments and question being tagged as memory suggests its memory that you are focussing on. If that's the case, numpy.add uses no extra memory, as mentioned earlier as well. – Remington 19/7, 2017 at 20:27

Are you hitting memory errors? Otherwise I question whether it's worth your time to micromanage numpy's use of memory. If it's speed you are worried about, try several options on realistic arrays and see it they make any difference. – Stockbroker 19/7, 2017 at 20:44

Writing and commenting on SO takes time. Yes, it is out of memory error. I really did a calculation on #elements multiplied by 1e-9 and 4 to get memory used in Gb, and know the machine only has 8Gb memory... It sucks. – Subterranean 19/7, 2017 at 20:45

Numpy only supports operations one at a time. With that said, there are several workarounds.

In place operations

The most simple solution is to use in-place operations via += and *=

import numpy as np

n = 100
b = 5.0

x = np.random.rand(n)
y = np.random.rand(n)

z = b * x
z += y

BLAS

You can access the underlying BLAS programs and apply them manually. Sadly, there is no multiply add instruction, but there is the "AXPY" instruction, which performs

y <- a * x + y

This can be called via:

import scipy

axpy = scipy.linalg.blas.get_blas_funcs('axpy', arrays=(x, y))
axpy(x, y, n, b)

Numexpr

Another option is to use some package like numexpr which allows you to compile expressions:

import numexpr

z = numexpr.evaluate('b * x + y')

Theano

Recently several machine-learning packages have started supporting compiled expressions, one such package is theano. You could do something like:

import theano

x = theano.tensor.vector()         # declare variable
y = theano.tensor.vector()         # declare variable

out = b * x + y                    # build symbolic expression
f = theano.function([x, y], out)   # compile function

z = f(x, y)

Acknowledgment answered 19/7, 2017 at 20:33 Comment(7)

linalg is nice, but not for numpy array with dimension > 2. Will try numexpr. Hope that works. – Subterranean 19/7, 2017 at 20:39

Please report back with status! Added some other options as well – Acknowledgment 19/7, 2017 at 20:47

i guess numpy could have an add_to keyword. But I haven't read the code base and don't know if adding that feature is difficult. – Subterranean 19/7, 2017 at 20:51

Numpy does not have a add-to keyword sadly – Acknowledgment 19/7, 2017 at 20:51

it works.

import numpy as np; import numexpr as ne; a = np.ones(int(1.9e9),dtype=np.int32).reshape((2,4,5,-1)); print(ne.evaluate("a*a+a",out=a))

didn't use swap. – Subterranean 19/7, 2017 at 21:32

Does += actually solve the problem? I believe it still creates a temporary result on the RHS of the operation. – Baptistry 27/7, 2017 at 18:57

It does, but it helps in related cases. Numexpr style stuff is the only full solution here. – Acknowledgment 27/7, 2017 at 20:12

I compared different variants and found that you're not going wrong with SciPy's BLAS interface

scipy.linalg.blas.daxpy(x, y, len(x), a)

Code to reproduce the plot:

import numexpr
import numpy as np
import perfplot
import scipy.linalg
import theano

a = 1.36

# theano preps
x = theano.tensor.vector()
y = theano.tensor.vector()
out = a * x + y
f = theano.function([x, y], out)


def setup(n):
    x = np.random.rand(n)
    y = np.random.rand(n)
    return x, y


def manual_axpy(data):
    x, y = data
    return a * x + y


def manual_axpy_inplace(data):
    x, y = data
    out = a * x
    out += y
    return out


def scipy_axpy(data):
    x, y = data
    n = len(x)
    axpy = scipy.linalg.blas.get_blas_funcs("axpy", arrays=(x, y))
    axpy(x, y, n, a)
    return y


def scipy_daxpy(data):
    x, y = data
    return scipy.linalg.blas.daxpy(x, y, len(x), a)


def numpexpr_evaluate(data):
    x, y = data
    return numexpr.evaluate("a * x + y")


def theano_function(data):
    x, y = data
    return f(x, y)


b = perfplot.bench(
    setup=setup,
    kernels=[
        manual_axpy,
        manual_axpy_inplace,
        scipy_axpy,
        scipy_daxpy,
        numpexpr_evaluate,
        theano_function,
    ],
    n_range=[2 ** k for k in range(24)],
    equality_check=None,
    xlabel="len(x), len(y)",
)
# b.save("out.png")
b.show()

Generic answered 2/5, 2021 at 16:31 Comment(0)

-1

As far as I understand, NumPy array operations can only be done one at a time, but by putting it inside of a function you can ensure it is not in memory, as the commenter suggested.

Bloodhound answered 19/7, 2017 at 20:23 Comment(8)

erh. I guess we are talking about numpy.add(A*B, out=C). That needs memory to hold A times B, and memory to hold the original value of C. That add function alone doesn't do multiplication. A times B will be evaluated by python interpreter first, and put into an intermediate array. After that, the intermediate is added to C. – Subterranean 19/7, 2017 at 20:25

Yes, but then A times B would no longer be in memory. – Bloodhound 19/7, 2017 at 20:27

Well. When the intermediate exists, we need memory. If there isn't enough memory for both C and the intermediate at that point, the program will get out of memory error. And not enough memory is the problem. – Subterranean 19/7, 2017 at 20:28

You have a case where you cannot hold the multiple of two matrices, but you can hold those two added to a third matrix with the same size? Doesn't seem probable to me. – Bloodhound 19/7, 2017 at 20:29

Yes. Outer product of two 3 element vector gives a 9 element array. A and B are much smaller than C, but A*B and C have same size. I have enough memory to hold A, B, and C, but not enough to hold A, B, C, and the intermediate. – Subterranean 19/7, 2017 at 20:30

we shouldn't need that intermediate, but if numpy creates that intermediate first before adding the whole intermeddiate array to C. Well, that is a waste of memory. – Subterranean 19/7, 2017 at 20:33

is there a way to do this operation mathematically without an intermediate? Then you can do the same thing. – Bloodhound 19/7, 2017 at 20:35

the point of using a library is not to do everything yourself. doing everything yourself is the last resort. – Subterranean 19/7, 2017 at 21:10

In place operations

BLAS

Numexpr

Theano

Recommended topics

Hot tags