slicing sparse (scipy) matrix
Asked Answered
N

4

13

I would appreciate any help, to understand following behavior when slicing a lil_matrix (A) from the scipy.sparse package.

Actually, I would like to extract a submatrix based on an arbitrary index list for both rows and columns.

When I used this two lines of code:

x1 = A[list 1,:]
x2 = x1[:,list 2]

Everything was fine and I could extract the right submatrix.

When I tried to do this in one line, it failed (The returning matrix was empty)

x=A[list 1,list 2]

Why is this so? Overall, I have used a similar command in matlab and there it works. So, why not use the first, since it works? It seems to be quite time consuming. Since I have to go through a large amount of entries, I would like to speed it up using a single command. Maybe I use the wrong sparse matrix type...Any idea?

Nevels answered 30/9, 2011 at 10:30 Comment(2)
what does list1 and list2 contents ? What gives A[list1:list2] ??Brenan
list1 and list 2 are are python list objects containing integers e.g. [1,4,6,8] A[list1:list2] is empty (<1x3 sparse matrix of type '<type 'numpy.int32'>' with 0 stored elements in LInked List format>Nevels
R
17

The method you are already using,

A[list1, :][:, list2]

seems to be the fastest way to select the desired values from a spares matrix. See below for a benchmark.

However, to answer your question about how to select values from arbitrary rows and columns of A with a single index, you would need to use so-called "advanced indexing":

A[np.array(list1)[:,np.newaxis], np.array(list2)]

With advanced indexing, if arr1 and arr2 are NDarrays, the (i,j) component of A[arr1, arr2] equals

A[arr1[i,j], arr2[i,j]]

Thus you would want arr1[i,j] to equal list1[i] for all j, and arr2[i,j] to equal list2[j] for all i.

That can be arranged with the help of broadcasting (see below) by setting arr1 = np.array(list1)[:,np.newaxis], and arr2 = np.array(list2).

The shape of arr1 is (len(list1), 1) while the shape of arr2 is (len(list2), ) which broadcasts to (1, len(list2)) since new axes are added on the left automatically when needed.

Each array can be further broadcasted to shape (len(list1),len(list2)). This is exactly what we want for A[arr1[i,j],arr2[i,j]] to make sense, since we want (i,j) to run over all possible indices for a result array of shape (len(list1),len(list2)).


Here is a microbenchmark for one test case which suggests that A[list1, :][:, list2] is the fastest option:

In [32]: %timeit orig(A, list1, list2)
10 loops, best of 3: 110 ms per loop

In [34]: %timeit using_listener(A, list1, list2)
1 loop, best of 3: 1.29 s per loop

In [33]: %timeit using_advanced_indexing(A, list1, list2)
1 loop, best of 3: 1.8 s per loop

Here is the setup I used for the benchmark:

import numpy as np
import scipy.sparse as sparse
import random
random.seed(1)

def setup(N):
    A = sparse.rand(N, N, .1, format='lil')
    list1 = np.random.choice(N, size=N//10, replace=False).tolist()
    list2 = np.random.choice(N, size=N//20, replace=False).tolist()
    return A, list1, list2

def orig(A, list1, list2):
    return A[list1, :][:, list2]

def using_advanced_indexing(A, list1, list2):
    B = A.tocsc()  # or `.tocsr()`
    B = B[np.array(list1)[:, np.newaxis], np.array(list2)]
    return B

def using_listener(A, list1, list2):
    """https://mcmap.net/q/861724/-slicing-sparse-scipy-matrix (listener)"""
    B = A.tocsr()[list1, :].tocsc()[:, list2]
    return B

N = 10000
A, list1, list2 = setup(N)
B = orig(A, list1, list2)
C = using_advanced_indexing(A, list1, list2)
D = using_listener(A, list1, list2)
assert np.allclose(B.toarray(), C.toarray())
assert np.allclose(B.toarray(), D.toarray())
Reflect answered 30/9, 2011 at 12:23 Comment(6)
Thanks. This seems quite elegant, but keep in mind that I am using a sparse matrix from the scipy.sparse package. Unfortunatelly, this kind of indexing does not work. It gives an IndexError.Nevels
Hm. Indeed, it does not work with lil_matrix, but it does work with csc_matrix or csr_matrix.Reflect
It seems to me that something like A[list1,:][:,list2] gives the same result but operates much faster on sparse matrices.Andel
@unutbu, please consider including my new answer to this topic in your answer as you did with listener's solution.Washhouse
@yogabonito: Thanks for your comment. Upon re-benchmarking the available options, it seems @Andel is right. A[list1, :][:, list2] is much faster than A[np.array(list1)[:,np.newaxis], np.array(list2)] or A.tocsr()[list1, :].tocsc()[:, list2]. I've edited my answer accordingly. I'd be interest to know if A[list1, :][:, list2] is not the fastest option on your machine for your setup.Reflect
@unutbu: You're right :) +1 for taking indexing with [list1, :][:, list2] into account. Just updated my answer to include the %timeit-results for this solution too.Washhouse
T
4

for me the solution from unutbu works well, but is slow.

I found as a fast alternative,

A = B.tocsr()[np.array(list1),:].tocsc()[:,np.array(list2)]

You can see that row'S and col's get cut separately, but each one converted to the fastest sparse format, to get index this time.

In my test environment this code is 1000 times faster than the other one.

I hope, I don't tell something wrong or make a mistake.

Tubulate answered 27/10, 2014 at 16:59 Comment(0)
W
2

Simultaneous indexing as in B[arr1, arr2] does work and it's faster than listener's solution on my machine. See In [5] in the Jupyter example below. To compare it with the mentioned answer refer to In [6]. Furthermore, my solution doesn't need the .tocsc() conversion, making it more readable IMO.

Please note that for B[arr1, arr2] to work, arr1 and arr2 must be broadcastable numpy arrays.

A much faster solution, however, is using B[list1][:, list2] as pointed out by unutbu. See In [7] below.

In [1]: from scipy import sparse
      : import numpy as np
      : 
      : 

In [2]: B = sparse.rand(1000, 1000, .1, format='lil')
      : list1=[1,4,6,8]
      : list2=[2,4]
      : 
      : 

In [3]: arr1 = np.array(list1)[:, None]  # make arr1 a (n x 1)-array
      : arr1
      : 
      : 
Out[3]: 
array([[1],
       [4],
       [6],
       [8]])

In [4]: arr2 = np.array(list2)[None, :]  # make arr2 a (1 x m)-array
      : arr2
      : 
      : 
Out[4]: array([[2, 4]])

In [5]: %timeit A = B.tocsr()[arr1, arr2]
100 loops, best of 3: 13.1 ms per loop

In [6]: %timeit A = B.tocsr()[np.array(list1),:].tocsc()[:,np.array(list2)]
100 loops, best of 3: 14.6 ms per loop

In [7]: %timeit B[list1][:, list2]
1000 loops, best of 3: 205 µs per loop
Washhouse answered 4/9, 2017 at 20:25 Comment(0)
B
-1

slicing happens with this syntax :

a[1:4]

for a = array([1,2,3,4,5,6,7,8,9]), the result is

array([2, 3, 4])

The first parameter of the tuple indicates the first value to be retained, and the second parameter indicates the first value not to be retained.

If you use lists on both sides, it means that your array has as many dimensions as the lists length.

So, with your syntax, you will probably need something like this :

x = A[list1,:,list2]

depending on the shape of A.

Hope it did help you.

Brenan answered 30/9, 2011 at 12:15 Comment(1)
The question wasn't about array().Cuirassier

© 2022 - 2024 — McMap. All rights reserved.