Iterating in Python can be quite slow. It's always best to "vectorise" and use numpy operations on whole arrays wherever possible, which passes the work to numpy's fast low-level implementation.
cosine_similarity is already vectorised. An ideal solution would therefore simply be cosine_similarity(A, B), where A and B are your first and second arrays. Unfortunately the result of that call is a 500,000 by 160,000 matrix (at 8 bytes per float64 value, roughly 640 GB), which is far too large to hold in memory, so it throws an error.
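For small inputs the one-liner would look roughly like the sketch below; I'm assuming cosine_similarity here is sklearn's sklearn.metrics.pairwise.cosine_similarity, since the answer doesn't say which implementation is meant.

from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(A, B)   # shape (len(A), len(B)) -- too big for your data
best = sim.max(axis=1)          # best match score for each row of A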
The next best solution, then, is to split A (by rows) into large blocks (instead of individual rows) so that each partial result still fits in memory, and iterate over those blocks. For your data I find that blocks of 100 rows fit in memory; much larger and it runs out. Then we simply take .max over each block's result, getting 100 maxes per iteration, and collect them all together at the end; a rough sketch follows.
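Something like this, still calling cosine_similarity on each block (again assuming sklearn's implementation, and that A and B are your two arrays):

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

block = 100
maxes = []
for start in range(0, A.shape[0], block):
    # Similarities for one block of A against all of B: shape (block, len(B))
    sim = cosine_similarity(A[start:start + block], B)
    maxes.append(sim.max(axis=1))   # best match for each row in the block
result = np.concatenate(maxes)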
Iterating this way strongly suggests an additional time saving, though. The cosine similarity of two vectors is u.v / |u||v|, i.e. the cosine of the angle between them. Because we're iterating, we recalculate the lengths of the rows of B on every block and then throw the result away. A nice way around this is the fact that cosine similarity does not change if you rescale the vectors (the angle is the same). So we can compute all the row lengths once up front and divide by them to make every row a unit vector. The cosine similarity is then simply u.v, which for whole arrays is just a matrix multiplication. I did a quick test of this and it was about 3 times faster.
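To sanity-check that the rescaling doesn't change the answer, a quick comparison like this (again assuming sklearn's cosine_similarity, on small random data) should print True:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.random([5, 10])
b = np.random.random([7, 10])

an = a / np.linalg.norm(a, axis=1, keepdims=True)   # rows scaled to unit length
bn = b / np.linalg.norm(b, axis=1, keepdims=True)

print(np.allclose(an.dot(bn.T), cosine_similarity(a, b)))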
Putting it all together:
import numpy as np
# Example data
A = np.random.random([500000, 100])
B = np.random.random([160000, 100])
# There may be a proper numpy method for this function, but it won't be much faster.
def normalise(A):
    # Divide each row by its length so that every row becomes a unit vector
    lengths = (A**2).sum(axis=1, keepdims=True)**.5
    return A/lengths
A = normalise(A)
B = normalise(B)
results = []
rows_in_slice = 100
slice_start = 0
slice_end = slice_start + rows_in_slice
while slice_start < A.shape[0]:
    # Dot products of unit vectors are cosine similarities; take the best
    # match for each row of A in this slice. The loop covers every row of A,
    # including a final partial slice if the row count isn't a multiple
    # of rows_in_slice.
    results.append(A[slice_start:slice_end].dot(B.T).max(axis=1))
    slice_start += rows_in_slice
    slice_end = slice_start + rows_in_slice
result = np.concatenate(results)
This takes me about 2 seconds per 1,000 rows of A to run. So it should be about 1,000 seconds for your data.
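On the comment above normalise: numpy does have a built-in for the length calculation, np.linalg.norm, and a drop-in equivalent of the function would be:

def normalise(A):
    # Same result as above, using numpy's built-in norm
    return A / np.linalg.norm(A, axis=1, keepdims=True)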