There isn't really a clear-cut case where broadcasting is bad. Broadcasting is often the simplest, most readable solution, which is probably what you want. Only if that turns out to be too slow after benchmarking would I consider optimising the broadcasts or sequential operations away.
As many of the existing comments say, there is often a tradeoff between memory and compute when it comes to broadcasting. However, if your algorithm is designed poorly, you can end up hurting both.
The biggest problem I find is that while numpy may try to optimise individual steps so that they use views of an array, it often won't be able to do so for broadcasts or sequences of operations. Clever use of numpy may still not get around this, so it can be worthwhile rewriting your program with explicit loops so that you can merge operations together manually. This minimises memory usage and maximises performance. Doing this in plain python would be extremely slow, but fortunately we have tools like numba which can JIT (just in time) compile an annotated function down into efficient machine code.
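As a rough sketch of what I mean (this example is not from your question, and the function names are just illustrative): a chained numpy expression like ((a - b) ** 2).sum() allocates full-size temporary arrays for each step, while a numba loop can fuse everything into a single pass:
import numpy as np
import numba

# chained numpy version: (a - b) and (a - b) ** 2 each allocate a
# full-size temporary array before the final reduction
def chained_sq_dist(a, b):
    return ((a - b) ** 2).sum()

# fused loop version: a single pass over the data, no temporary arrays
@numba.njit()
def fused_sq_dist(a, b):
    total = 0.0
    for i in range(len(a)):
        d = a[i] - b[i]
        total += d * d
    return total

a, b = np.random.random(100_000), np.random.random(100_000)
print(chained_sq_dist(a, b), fused_sq_dist(a, b))  # same result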
Another problem is very large arrays: broadcasting can rapidly increase memory usage, often jumping from O(n) to O(n^2) or even worse when the arrays have different shapes (I don't mean broadcasting against a scalar). While this may be fine for small arrays, it quickly becomes an issue as the arrays grow. Broadcasting against a scalar at most roughly doubles memory usage, which isn't nearly as bad.
a = np.arange(512, dtype='float64')
# temp. array memory usage -- A: ~4KB, B: ~2MB, C: ~1GB
A: float = a.sum()
B: float = (a[:, None] + a[None, :]).sum()
C: float = (a[:, None, None] + a[None, :, None] + a[None, None, :]).sum()
# If the size of a were small, then we would be OK
# eg. a.shape == (16,) gives -- a: ~128B, b: ~2KB, c: ~32KB
The example above, while quite readable, is not efficient as the arrays grow, because of the temporary arrays created for the reduction. It could be converted to a loop-based form that only uses O(n) memory. If the output you actually wanted was the broadcast array itself rather than the reduction, then broadcasting would be very near optimal.
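For instance, B above could be rewritten along these lines (pairwise_sum is just an illustrative name, using numba as mentioned earlier), giving the same result while only ever reading the original O(n) array:
import numpy as np
import numba

@numba.njit()
def pairwise_sum(a):
    # same result as (a[:, None] + a[None, :]).sum(), but the O(n^2)
    # temporary array is never materialised
    total = 0.0
    for i in range(len(a)):
        for j in range(len(a)):
            total += a[i] + a[j]
    return total

a = np.arange(512, dtype='float64')
assert np.isclose(pairwise_sum(a), (a[:, None] + a[None, :]).sum())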
Example: I recently ran into this problem myself. I had a 1D binary mask that I needed to broadcast with itself into a 2D matrix so that I could use it to extract elements from a large pre-computed distance matrix, with the extra condition that I also had to exclude the diagonal (and I did not have access to the original 1D positions).
Naturally this would look as follows:
import numpy as np
def broadcast_masked_tril_total(dists2d, mask):
    # broadcast the 1d mask with itself into a 2d mask
    # - this can be very slow, moving from O(N) to O(N^2) memory
    mask2d = mask[None, :] & mask[:, None]
    # ignore the diagonal
    # - the 2d array needs to exist in memory to make these edits, a view cannot work
    np.fill_diagonal(mask2d, False)
    # index the distance matrix with the 2d mask
    # - the 2d mask means O(N^2) memory is read, instead of O(N)
    total = dists2d[mask2d].sum()
    # elements are repeated, so divide by 2
    return total / 2
The problem was, of course, the memory usage and the intermediate storage of values.
There may be a clever numpy fix, but the obvious solution is, as you say, to convert it to loops where you don't need the broadcasting at all. As a bonus, you can identify which operations can be merged together instead of chaining them one after another as numpy does.
A general rule of thumb is that the fewer memory accesses and intermediate storage locations, the faster the code will run.
import numba

@numba.njit()
def efficient_masked_tril_total(dists2d, mask):
    total = 0.
    for y, y_val in enumerate(mask):
        # we only ever need to read from the same O(N) mask memory
        # we can immediately skip invalid rows
        if not y_val:
            continue
        # skip the diagonal and one triangle of the distance matrix
        # - can't do this efficiently with numpy broadcasting and a
        #   mask, intermediate storage of the 2d mask was required
        for x in range(y + 1, len(mask)):
            # again accessing the same O(N) mask without broadcasting
            if not mask[x]:
                continue
            total += dists2d[y, x]
    return total
For example, benchmarking this:
N = int(np.sqrt((10*1024**2)/(64/8))) # enough elems for 10.0 MB
# make a random mask and distance matrix
mask = np.random.random(N) > 0.5
positions = np.random.random(N)
# again we broadcast, note that we could further optimise
# our efficient approach by just passing in the positions directly.
dists2d = np.abs(positions[:, None] - positions[None, :])
# warmup
print(broadcast_masked_tril_total(dists2d, mask))
print(efficient_masked_tril_total(dists2d, mask))
# timeit
import timeit
print(timeit.timeit(lambda: broadcast_masked_tril_total(dists2d, mask), number=1000))
print(timeit.timeit(lambda: efficient_masked_tril_total(dists2d, mask), number=1000))
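As a sketch of that further optimisation mentioned in the comment above (positions_masked_tril_total is just an illustrative name), the distances can be computed on the fly from the 1d positions, so the O(N^2) dists2d matrix never needs to exist at all:
@numba.njit()
def positions_masked_tril_total(positions, mask):
    # compute each distance directly from the 1d positions,
    # so the O(N^2) distance matrix is never materialised
    total = 0.
    for y in range(len(mask)):
        if not mask[y]:
            continue
        for x in range(y + 1, len(mask)):
            if not mask[x]:
                continue
            total += abs(positions[y] - positions[x])
    return total

print(positions_masked_tril_total(positions, mask))  # same total as above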
tl;dr: I would suggest that you always use the simplest, most readable solution first. Only if it becomes a performance problem should you spend time benchmarking and optimising your approach. Just remember that "premature optimisation is the root of all evil."
So there isn't really a specific case where broadcasting is a bad idea. Often it is the simplest solution, and that is a good thing. Sometimes broadcasting will even be the optimal solution, if what you need is the actual broadcast array. If you are combining the broadcast with some secondary operation such as a reduction, you can probably optimise the combined operation by converting it to loops. Just remember that broadcasting isn't necessarily bad; it's only a problem if performance becomes an issue. Smaller arrays are usually not a problem, but with much larger ones broadcasting can easily cause memory issues too.