Python: Cosine similarity between two large numpy arrays

I have two numpy arrays:

Array 1: 500,000 rows x 100 cols

Array 2: 160,000 rows x 100 cols

For each row in Array 1, I would like to find the largest cosine similarity with any row in Array 2. In other words, I compute the cosine similarities between the first row in Array 1 and all the rows in Array 2 and take the maximum, then do the same for the second row in Array 1, and so on for the rest of Array 1.

I currently use sklearn's cosine_similarity() function and do the following, but it is extremely slow. I wonder if there is a faster way that doesn't involve multiprocessing/multithreading to accomplish what I want to do. Also, the arrays I have are not sparse.

import numpy
from sklearn.metrics.pairwise import cosine_similarity as cosine

results = []
for i in range(Array1.shape[0]):
    results.append(numpy.max(cosine(Array1[None, i, :], Array2)))
Pro answered 26/8, 2018 at 23:18
Unless I've misunderstood the problem, you know that's always going to require 80,000,000,000 operations on rows? – Hydrogenous
Yes... that's why it's so slow. The nature of the task is as follows: Array2 holds numeric representations of 160k documents, and Array1 holds numeric representations of 500k docs. I want to find which of the 160k documents each of the 500k docs is most similar to, hence the use of cosine similarity. – Pro
Okay. My point is just that, regardless of the optimisation, it's always going to take a long time to do this. The problem is probably not with cosine_similarity. – Hydrogenous
It's an interesting problem though, I will give it a try. – Hydrogenous
Honestly, what you're doing seems like a good approach given your chosen tool: you are doing the calculation vectorised between one row and the entire second array. Maybe consider this post? #47625937 – Birchard
@Pro Do you understand the benefits of vectorisation? The solution I'm testing simply involves slicing A up into small enough chunks (but not rows). – Hydrogenous

Iterating in Python can be quite slow. It's always best to "vectorise" and use numpy operations on whole arrays as much as possible; that passes the work to numpy's fast low-level implementation.

cosine_similarity is already vectorised. An ideal solution would therefore simply be cosine_similarity(A, B), where A and B are your first and second arrays. Unfortunately the resulting matrix is 500,000 by 160,000, which is far too large to hold in memory (it throws an error).
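
To get a feel for the scale (assuming 8-byte float64 entries, which is an assumption on my part rather than something stated in the question), you can work out the sizes directly:

# Rough memory estimates, assuming float64 (8 bytes per entry)
full_matrix_bytes = 500_000 * 160_000 * 8   # 640,000,000,000 bytes, i.e. roughly 640 GB
one_block_bytes = 100 * 160_000 * 8         # one 100-row block (used below): about 128 MB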

The next best solution, then, is to split A (by rows) into large blocks, rather than individual rows, so that each partial result still fits in memory, and iterate over them. For your data I find that blocks of 100 rows fit in memory; much more than that and it doesn't work. Then we simply call .max to get the 100 maxima for each block, and collect them all together at the end.

This approach also suggests an additional time saving. The formula for the cosine similarity of two vectors is u.v / (|u||v|), i.e. the cosine of the angle between them. Because we are iterating, we would keep recalculating the lengths of the rows of B on every iteration and throwing the result away. A nice way around this is to use the fact that cosine similarity does not change if you scale the vectors (the angle stays the same). So we can calculate all the row lengths just once, divide each row by its length to make it a unit vector, and then the cosine similarity is simply u.v, which for whole arrays is a single matrix multiplication. I did a quick test of this and it was about 3 times faster.
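
As a quick sanity check (a small sketch on toy random arrays, not the data from the question), normalising the rows and taking a plain matrix product reproduces sklearn's cosine_similarity:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
u = rng.random((5, 100))
v = rng.random((7, 100))

# Scale each row to unit length; cosine similarity then reduces to a dot product.
u_hat = u / np.linalg.norm(u, axis=1, keepdims=True)
v_hat = v / np.linalg.norm(v, axis=1, keepdims=True)

assert np.allclose(u_hat.dot(v_hat.T), cosine_similarity(u, v))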

Putting it all together:

import numpy as np

# Example data
A = np.random.random([500000, 100])
B = np.random.random([160000, 100])

# np.linalg.norm(A, axis=1, keepdims=True) would also work here, but it won't be much faster.
def normalise(A):
    # Divide each row by its Euclidean length, so every row becomes a unit vector.
    lengths = (A**2).sum(axis=1, keepdims=True)**.5
    return A/lengths

A = normalise(A)
B = normalise(B)

results = []

rows_in_slice = 100

slice_start = 0
slice_end = slice_start + rows_in_slice

# Process A in blocks of rows so that each partial similarity matrix fits in memory.
# Looping on slice_start (rather than slice_end) also handles a final partial block
# if the number of rows is not a multiple of rows_in_slice.
while slice_start < A.shape[0]:

    results.append(A[slice_start:slice_end].dot(B.T).max(axis=1))

    slice_start += rows_in_slice
    slice_end = slice_start + rows_in_slice

result = np.concatenate(results)

This takes me about 2 seconds per 1,000 rows of A to run. So it should be about 1,000 seconds for your data.
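
If you prefer, the same chunked loop can also be written with np.array_split, which cuts A into roughly equal blocks and handles a ragged final block automatically (a small equivalent sketch, reusing A, B and rows_in_slice from above):

# Split A into ~rows_in_slice-row blocks and take the per-row maxima block by block.
n_blocks = max(1, A.shape[0] // rows_in_slice)
result = np.concatenate([block.dot(B.T).max(axis=1) for block in np.array_split(A, n_blocks)])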

Hydrogenous answered 27/8, 2018 at 1:03
Thanks a lot. This helped me reduce the memory needed for my cosine similarity on large arrays. – Goer

Just adding a Numba version, which compiles the logic down to fast machine code.

I wrote explicit for loops rather than relying on NumPy broadcasting, because broadcasting allocates temporary memory and this problem is probably already memory-bound.

I have simply re-written the cosine logic in Numba. You can also parallelise it by passing parallel=True to the njit options (nb.prange only actually runs across threads when that option is set). Note that, as written, the function returns the single (row of A_1, row of A_2) index pair with the highest cosine similarity, rather than the per-row maxima that the sklearn loop below collects.

Whether Numba beats NumPy depends on the problem, but parallelising pure NumPy code is difficult.

import numpy as np
import numba as nb

A_1 = np.random.random((500, 100))
A_2 = np.random.random((160, 100))

# Eagerly compile for 2-D, C-contiguous float64 arrays.
@nb.njit((nb.float64[:, ::1], nb.float64[:, ::1]))
def max_cos(a, b):
    # Pre-compute the Euclidean norm of every row of a and of b.
    norm_a = np.empty((a.shape[0],), dtype=np.float64)
    norm_b = np.empty((b.shape[0],), dtype=np.float64)

    for i in nb.prange(a.shape[0]):
        sq_norm = 0.0
        for j in range(a.shape[1]):
            sq_norm += a[i, j] ** 2
        norm_a[i] = sq_norm ** 0.5

    for i in nb.prange(b.shape[0]):
        sq_norm = 0.0
        for j in range(b.shape[1]):
            sq_norm += b[i, j] ** 2
        norm_b[i] = sq_norm ** 0.5

    # Track the (row of a, row of b) pair with the largest cosine similarity.
    max_pair = (0, 0)
    max_sim = -1e308
    for i in nb.prange(a.shape[0]):
        best_j = 0
        best_sim = -1e308
        for j in range(b.shape[0]):
            dot_ij = 0.0
            for k in range(a.shape[1]):
                dot_ij += a[i, k] * b[j, k]
            # Dividing by norm_b[j] is enough to rank the rows of b for this i;
            # norm_a[i] is a constant factor within the row.
            dot_ij /= norm_b[j]
            if dot_ij > best_sim:
                best_sim = dot_ij
                best_j = j
        best_sim /= norm_a[i]
        if best_sim > max_sim:
            max_sim = best_sim
            max_pair = (i, best_j)
    return max_pair
%%timeit
max_cos(A_1, A_2)
# 6.03 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
from sklearn.metrics.pairwise import cosine_similarity as cosine

results = []
for i in range(A_1.shape[0]):
    results.append(np.max(cosine(A_1[None, i, :], A_2)))
# 115 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
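
For illustration only (a small sketch, not part of the benchmark above), this is how the parallel=True option mentioned earlier is passed to njit so that nb.prange actually runs across threads; here it just computes row norms in parallel:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def row_norms(a):
    # Each prange iteration is independent, so Numba can split the loop across threads.
    out = np.empty(a.shape[0], dtype=np.float64)
    for i in nb.prange(a.shape[0]):
        sq = 0.0
        for j in range(a.shape[1]):
            sq += a[i, j] ** 2
        out[i] = sq ** 0.5
    return out

norms = row_norms(np.random.random((1000, 100)))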
Chimere answered 9/9, 2021 at 13:36
