For a pure numpy solution you could do something like this:
Use np.unique
to get the unique values and corresponding indices in x
and y
separately:
# sorted unique values in x and y and the indices corresponding to their first
# occurrences, such that u_x == x[u_idx_x]
u_x, u_idx_x = np.unique(x, return_index=True)
u_y, u_idx_y = np.unique(y, return_index=True)
Find the intersection of the unique values using np.intersect1d
:
# we can assume_unique, which can be faster for large arrays
i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
Finally, use np.in1d
to select only the indices that correspond to unique values in x
or y
that also happen to be in the intersection of x
and y
:
# it is also safe to assume_unique here
i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
To pull all that together into a single function:
def intersect_indices(x, y):
u_x, u_idx_x = np.unique(x, return_index=True)
u_y, u_idx_y = np.unique(y, return_index=True)
i_xy = np.intersect1d(u_x, u_y, assume_unique=True)
i_idx_x = u_idx_x[np.in1d(u_x, i_xy, assume_unique=True)]
i_idx_y = u_idx_y[np.in1d(u_y, i_xy, assume_unique=True)]
return i_idx_x, i_idx_y
For example:
x = np.array([4, 1, 10, 5, 8, 13, 11])
y = np.array([20, 5, 4, 9, 11, 7, 25])
i_idx_x, i_idx_y = intersect_indices(x, y)
print(i_idx_x, i_idx_y)
# (array([0, 3, 6]), array([2, 1, 4]))
Speed test:
In [1]: k = 1000000
In [2]: %%timeit x, y = np.random.randint(k, size=(2, k))
intersect_indices(x, y)
....:
1 loops, best of 3: 597 ms per loop
Update:
I initially missed the fact that in your case both x
and y
contain only unique values. Taking that into account, it's possible to do slightly better by using an indirect sort:
def intersect_indices_unique(x, y):
u_idx_x = np.argsort(x)
u_idx_y = np.argsort(y)
i_xy = np.intersect1d(x, y, assume_unique=True)
i_idx_x = u_idx_x[x[u_idx_x].searchsorted(i_xy)]
i_idx_y = u_idx_y[y[u_idx_y].searchsorted(i_xy)]
return i_idx_x, i_idx_y
Here's a more realistic test case, where x
and y
both contain unique (but partially overlapping) values:
In [1]: n, k = 10000000, 1000000
In [2]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices(x, y)
....:
1 loops, best of 3: 593 ms per loop
In [3]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
intersect_indices_unique(x, y)
....:
1 loops, best of 3: 453 ms per loop
@Divakar's solution is very similar in terms of performance:
In [4]: %%timeit x, y = (np.random.choice(n, size=k, replace=False) for _ in range(2))
searchsorted_based(x, y)
....:
1 loops, best of 3: 472 ms per loop