Broadcast views with irregular duplication in numpy
Assume I have a numpy array of size (n, m) where n is very large, but with a lot of duplication, i.e. rows 0:n1 are identical, rows n1:n2 are identical, etc. (with n2 % n1 != 0, i.e. not at regular intervals). Is there a way to store only one copy of each set of duplicated values while still having a view of the entire array?

example:

unique_values = np.array([[1,1,1], [2,2,2], [3,3,3]])  # these are the values I want to store in memory
index_mapping = np.array([0,0,1,1,1,2,2])  # maps each row of the desired view to a row of unique_values

unique_values_view = np.array([[1,1,1],[1,1,1],[2,2,2],[2,2,2],[2,2,2],[3,3,3],[3,3,3]])  # this is what I want the view to look like, for broadcasting reasons
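For reference, fancy indexing with index_mapping reproduces the expanded array exactly, though numpy materializes it as a full copy rather than a view:

```python
import numpy as np

unique_values = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
index_mapping = np.array([0, 0, 1, 1, 1, 2, 2])

# Fancy indexing builds the (n, m) expanded array, but as a copy, not a view.
expanded = unique_values[index_mapping]
```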

I plan to multiply the array (view) by another array of size (1, m), then take the dot product of the result with an array of size (m, 1):

other_array1 = np.arange(unique_values.shape[1]).reshape(1,-1) # (1,m)
other_array2 = 2*np.ones((unique_values.shape[1],1)) # (m,1)
output = np.dot(unique_values_view * other_array1, other_array2).squeeze()

The output is a 1D array of length n.
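Putting the pieces together, a minimal runnable version of the computation (using the small example arrays above, with the expanded array built as a copy for illustration):

```python
import numpy as np

unique_values = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
index_mapping = np.array([0, 0, 1, 1, 1, 2, 2])

# The expanded (n, m) array; here materialized as a copy for illustration.
unique_values_view = unique_values[index_mapping]

other_array1 = np.arange(unique_values.shape[1]).reshape(1, -1)  # (1, m)
other_array2 = 2 * np.ones((unique_values.shape[1], 1))          # (m, 1)

# Elementwise product, then a dot product, then squeeze to a length-n vector.
output = np.dot(unique_values_view * other_array1, other_array2).squeeze()
```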

Disproportion answered 1/6, 2018 at 7:47 Comment(9)
How do you plan to use the output? FYI: a view at the first stage doesn't guarantee that there won't be a forced copy later on.Syringomyelia
@Syringomyelia True. I plan to multiply with another array of shape (1,m), then store the dot product with another array. The main consideration is fitting the array into memory. I could do the last step in chunks if copying is forced.Disproportion
Could you add mcve for the entire process, i.e. sample for that (1,m) array and then the final desired output?Syringomyelia
That multiplication with 2 doesn't seem like a generic operation. Could you explain its significance?Syringomyelia
Or could you add a slightly more generic minimal sample, say index_mapping with a bigger range of numbers and unique_values with random numbers?Syringomyelia
By "identical" do you mean they are integers? Or just very close floats? (i.e can you do bitwise equvalence)Doubletongued
@DanielF If you look at unique_values_view, the zeroth and first inner list/array are both from unique_values[0], the second, third and fourth are from unique_values[1] etc. The values will typically be floats, but I don't see how the type would matter. I want to duplicate array subsets into one large array, which doesn't store copies of all the duplicates.Disproportion
@Syringomyelia If you are able to show a solution to this minimal example, I will upvote it. If you can show that it doesn't require significantly more memory than unique_values using some kind of memory profiling, I will accept it right away.Disproportion
@Disproportion Yakym Pirozhenko's solution seems to work for the listed minimal case. But again, I can't see what the generic case would be.Syringomyelia

Your expression admits two significant optimizations:

  • do the indexing last
  • multiply other_array1 with other_array2 first and then with unique_values

Let's apply these optimizations:

>>> output_pp = (unique_values @ (other_array1.ravel() * other_array2.ravel()))[index_mapping]

# check for correctness
>>> (output == output_pp).all()
True

# and compare it to @Yakym Pirozhenko's approach
>>> from timeit import timeit
>>> print("yp:", timeit("np.dot(unique_values * other_array1, other_array2).squeeze()[index_mapping]", globals=globals()))
yp: 3.9105667411349714
>>> print("pp:", timeit("(unique_values @ (other_array1.ravel() * other_array2.ravel()))[index_mapping]", globals=globals()))
pp: 2.2684884609188884

These optimizations are easy to spot if we observe two things:

(1) if A is an m×n matrix and b is an n-vector, then

A * b == A @ diag(b)
A.T * b[:, None] == diag(b) @ A.T
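Both identities are easy to verify numerically; a quick sketch on random data with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # an m x n matrix
b = rng.standard_normal(3)       # an n-vector

# Broadcasting multiplies each column j of A by b[j] -- same as A @ diag(b).
lhs = A * b
rhs = A @ np.diag(b)
assert np.allclose(lhs, rhs)

# Broadcasting scales each row i of A.T by b[i] -- same as diag(b) @ A.T.
lhs2 = A.T * b[:, None]
rhs2 = np.diag(b) @ A.T
assert np.allclose(lhs2, rhs2)
```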

(2) if A is an m×n matrix and I is a k-vector of integers from range(m), then

A[I] == onehot(I) @ A

onehot can be defined as

def onehot(I, m, dtype=int):
    out = np.zeros((I.size, m), dtype=dtype)
    out[np.arange(I.size), I] = 1
    return out
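Fact (2) can likewise be checked numerically with this definition (a sketch; shapes chosen arbitrarily):

```python
import numpy as np

def onehot(I, m, dtype=int):
    out = np.zeros((I.size, m), dtype=dtype)
    out[np.arange(I.size), I] = 1
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # an m x n matrix with m = 5
I = np.array([0, 2, 2, 4, 1])    # a k-vector of indices from range(m)

# Fancy indexing A[I] equals a one-hot row-selection matrix times A.
assert np.allclose(A[I], onehot(I, 5) @ A)
```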

Using these facts and abbreviating unique_values, index_mapping, other_array1 and other_array2 as uv, im, oa1 and oa2, we can write

uv[im] * oa1 @ oa2 == onehot(im) @ uv @ diag(oa1) @ oa2

The above optimizations are now simply a matter of choosing the best order for these matrix multiplications, which is

onehot(im) @ (uv @ (diag(oa1) @ oa2))

Using (1) and (2) backwards on this we obtain the optimized expression from the beginning of this post.
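An end-to-end check on random data (a sketch; the variable names follow the abbreviations above, and the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
uv = rng.standard_normal((3, 4))      # unique_values, shape (k, m)
im = np.array([0, 0, 1, 2, 2, 2, 1])  # index_mapping, length n
oa1 = rng.standard_normal((1, 4))     # other_array1, shape (1, m)
oa2 = rng.standard_normal((4, 1))     # other_array2, shape (m, 1)

# Original formulation: expand first, then multiply and contract.
original = np.dot(uv[im] * oa1, oa2).squeeze()

# Optimized formulation: contract first, index last.
optimized = (uv @ (oa1.ravel() * oa2.ravel()))[im]

assert np.allclose(original, optimized)
```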

Corridor answered 10/6, 2018 at 21:22 Comment(0)

Based on your example, you can simply factor the index mapping through the computation to the very end:

output2 = np.dot(unique_values * other_array1, other_array2).squeeze()[index_mapping]

assert (output == output2).all()
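The memory benefit is that this factored expression never materializes an (n, m) intermediate: the expensive operations run on the small (k, m) array, and the final fancy-indexing step allocates only the length-n result. A rough illustration with larger (randomly generated) data:

```python
import numpy as np

k, m, n = 3, 1000, 1_000_000
unique_values = np.random.rand(k, m)
index_mapping = np.random.randint(0, k, size=n)
other_array1 = np.arange(m).reshape(1, -1)
other_array2 = 2 * np.ones((m, 1))

# The expensive part works on the (k, m) array only ...
small = np.dot(unique_values * other_array1, other_array2).squeeze()  # shape (k,)

# ... and the final indexing step allocates just the (n,) result.
output2 = small[index_mapping]
```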
Divorce answered 4/6, 2018 at 11:2 Comment(0)
