Iterating through a scipy.sparse vector (or matrix)
I'm wondering what the best way is to iterate nonzero entries of sparse matrices with scipy.sparse. For example, if I do the following:

from scipy.sparse import lil_matrix

x = lil_matrix( (20,1) )
x[13,0] = 1
x[15,0] = 2

c = 0
for i in x:
  print c, i
  c = c+1

the output is

0 
1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13   (0, 0) 1.0
14 
15   (0, 0) 2.0
16 
17 
18 
19  

so it appears the iterator is touching every element, not just the nonzero entries. I've had a look at the API

http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.lil_matrix.html

and searched around a bit, but I can't seem to find a solution that works.

Creigh answered 30/11, 2010 at 21:46 Comment(0)

Edit: bbtrb's method (using coo_matrix) is much faster than my original suggestion, using nonzero. Sven Marnach's suggestion to use itertools.izip also improves the speed. Current fastest is using_tocoo_izip:

import scipy.sparse
import random
import itertools

def using_nonzero(x):
    rows,cols = x.nonzero()
    for row,col in zip(rows,cols):
        ((row,col), x[row,col])

def using_coo(x):
    cx = scipy.sparse.coo_matrix(x)    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo(x):
    cx = x.tocoo()    
    for i,j,v in zip(cx.row, cx.col, cx.data):
        (i,j,v)

def using_tocoo_izip(x):
    cx = x.tocoo()    
    for i,j,v in itertools.izip(cx.row, cx.col, cx.data):
        (i,j,v)

N=200
x = scipy.sparse.lil_matrix( (N,N) )
for _ in xrange(N):
    x[random.randint(0,N-1),random.randint(0,N-1)]=random.randint(1,100)

yields these timeit results:

% python -mtimeit -s'import test' 'test.using_tocoo_izip(test.x)'
1000 loops, best of 3: 670 usec per loop
% python -mtimeit -s'import test' 'test.using_tocoo(test.x)'
1000 loops, best of 3: 706 usec per loop
% python -mtimeit -s'import test' 'test.using_coo(test.x)'
1000 loops, best of 3: 802 usec per loop
% python -mtimeit -s'import test' 'test.using_nonzero(test.x)'
100 loops, best of 3: 5.25 msec per loop
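On Python 3, `itertools.izip` is gone and the built-in `zip` is already lazy, so the fastest variant can be written as follows (a sketch of the same logic as `using_tocoo_izip`, not a new benchmark; the function name is made up):

```python
import scipy.sparse

def using_tocoo_py3(x):
    # Python 3: the built-in zip is already lazy, so izip is unnecessary
    cx = x.tocoo()
    for i, j, v in zip(cx.row, cx.col, cx.data):
        yield int(i), int(j), float(v)

x = scipy.sparse.lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2
print(list(using_tocoo_py3(x)))  # [(13, 0, 1.0), (15, 0, 2.0)]
```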
Manna answered 30/11, 2010 at 21:57 Comment(6)
How about using izip() instead of zip()? Should be faster for big matrices.Willtrude
nice, didn't know about izip(). Actually I'm a bit surprised that tocoo() is faster than the coo_matrix() constructor...Rugose
The main point here is to use tocoo, but I strongly recommend not to iterate! All of x.row, x.col and x.data are numpy arrays which can be used as-is in most cases. Simple example, setting each matrix element's value to the product of its indices: x.data[:] = x.col*x.rowBlase
For python3 you can just use the built-in zip function: python2to3Ravelment
x.nonzero() first does x.tocoo(), and then masks with data != 0 to ensure that there aren't any 'stray' 0s in the matrix. That's why it's slower.Scrappy
@hpaulj: Thanks very much for the explanation. For others who are curious, here is the definition of nonzero for sparse matrices.Manna
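The vectorized point from the comment above, as a small runnable sketch (the matrix shape and values here are made up for illustration):

```python
import scipy.sparse

x = scipy.sparse.lil_matrix((5, 5))
x[1, 2] = 9.0
x[3, 4] = 7.0

cx = x.tocoo()
cx.data[:] = cx.row * cx.col   # value at (i, j) becomes i*j -- no Python loop
dense = cx.toarray()
print(dense[1, 2], dense[3, 4])  # 2.0 12.0
```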

The fastest way should be by converting to a coo_matrix:

cx = scipy.sparse.coo_matrix(x)

for i,j,v in zip(cx.row, cx.col, cx.data):
    print "(%d, %d), %s" % (i,j,v)
Rugose answered 30/11, 2010 at 22:5 Comment(3)
Is it faster to convert and then iterate, or is this assuming that I can change to work with coo_matrix?Creigh
@scandido: this depends on what you are going to achieve. coo_matrix is a very simple format and very fast to construct and access but might be ill-suited for other tasks. Here's an overview over the different matrix formats scipy.org/SciPyPackages/Sparse and especially section "constructing from scratch faster, with coo_matrix"Rugose
A warning: cx.row may be int32 (see this question), so a subprogram that expects int64 can misbehave. For example, pyglpk gives "TypeError: row/col indices must be ints or strings" -- meaning int64. The fix is simple: zip( cx.row.astype(int), cx.col.astype(int), cx.data )Inhibitory
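A minimal sketch of the "constructing from scratch with coo_matrix" pattern mentioned in the comment above, reusing the question's (20, 1) example values:

```python
import numpy as np
import scipy.sparse

# build the matrix directly from (data, (row, col)) triplets -- no incremental insertion
row = np.array([13, 15])
col = np.array([0, 0])
data = np.array([1.0, 2.0])
cx = scipy.sparse.coo_matrix((data, (row, col)), shape=(20, 1))

for i, j, v in zip(cx.row, cx.col, cx.data):
    print("(%d, %d), %s" % (i, j, v))
```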

To loop over a variety of sparse matrix formats from scipy.sparse I would use this small wrapper function (note that for Python 2 you are encouraged to use xrange and izip for better performance on large matrices):

from scipy.sparse import *
def iter_spmatrix(matrix):
    """ Iterator for iterating the elements in a ``scipy.sparse.*_matrix`` 

    This will always yield triplets:
    ``(row, column, matrix-element)``

    Currently this can iterate `coo`, `csc`, `csr` and `lil` matrices; others may easily be added.

    Parameters
    ----------
    matrix : ``scipy.sparse.spmatrix``
      the sparse matrix whose non-zero elements to iterate
    """
    if isspmatrix_coo(matrix):
        for r, c, m in zip(matrix.row, matrix.col, matrix.data):
            yield r, c, m

    elif isspmatrix_csc(matrix):
        for c in range(matrix.shape[1]):
            for ind in range(matrix.indptr[c], matrix.indptr[c+1]):
                yield matrix.indices[ind], c, matrix.data[ind]

    elif isspmatrix_csr(matrix):
        for r in range(matrix.shape[0]):
            for ind in range(matrix.indptr[r], matrix.indptr[r+1]):
                yield r, matrix.indices[ind], matrix.data[ind]

    elif isspmatrix_lil(matrix):
        for r in range(matrix.shape[0]):
            for c, d in zip(matrix.rows[r], matrix.data[r]):
                yield r, c, d

    else:
        raise NotImplementedError("The iterator for this sparse matrix has not been implemented")
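A quick sanity check of the csr branch above, written as a standalone function so it runs without the full wrapper (`iter_csr_elements` is an illustrative name, not part of the answer):

```python
import scipy.sparse

def iter_csr_elements(m):
    # walk indptr row by row, exactly as the csr branch of the wrapper does
    for r in range(m.shape[0]):
        for ind in range(m.indptr[r], m.indptr[r + 1]):
            yield r, m.indices[ind], m.data[ind]

m = scipy.sparse.csr_matrix([[0, 1.0, 0],
                             [0, 0, 2.0]])
print([(int(r), int(c), float(v)) for r, c, v in iter_csr_elements(m)])
# [(0, 1, 1.0), (1, 2, 2.0)]
```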
Mumford answered 6/3, 2017 at 12:29 Comment(0)

tocoo() materializes the entire matrix into a different structure, which may be undesirable when memory is tight. You can also consider this iterator, which is especially useful for large matrices.

from itertools import chain, repeat
def iter_csr(matrix):
  for (row, col, val) in zip(
    chain(*(
          repeat(i, r)
          for (i,r) in enumerate(matrix.indptr[1:] - matrix.indptr[:-1])
    )),
    matrix.indices,
    matrix.data
  ):
    yield (row, col, val)

I have to admit that I'm using a lot of python-constructs which possibly should be replaced by numpy-constructs (especially enumerate).
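As a sketch of that replacement: the chain/repeat/enumerate construction can be expressed with np.repeat and np.diff instead (`iter_csr_np` is an illustrative name, not part of the answer):

```python
import numpy as np
import scipy.sparse

def iter_csr_np(matrix):
    # np.diff(indptr) gives per-row element counts; np.repeat expands them into row indices
    rows = np.repeat(np.arange(matrix.shape[0]), np.diff(matrix.indptr))
    yield from zip(rows, matrix.indices, matrix.data)

m = scipy.sparse.csr_matrix([[0, 5.0], [3.0, 0]])
print([(int(r), int(c), float(v)) for r, c, v in iter_csr_np(m)])
# [(0, 1, 5.0), (1, 0, 3.0)]
```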

NB:

In [43]: t=time.time(); sum(1 for x in rather_dense_sparse_matrix.data); print(time.time()-t)
52.48686504364014
In [44]: t=time.time(); sum(1 for x in enumerate(rather_dense_sparse_matrix.data)); print(time.time()-t)
70.19013023376465
In [45]: rather_dense_sparse_matrix
<99829x99829 sparse matrix of type '<class 'numpy.float16'>'
with 757622819 stored elements in Compressed Sparse Row format>

So yes, enumerate is somewhat slow(ish)

For the iterator:

In [47]: it = iter_csr(rather_dense_sparse_matrix)
In [48]: t=time.time(); sum(1 for x in it); print(time.time()-t)
113.something something

So you decide whether this overhead is acceptable; in my case, tocoo caused MemoryErrors.

IMHO: such an iterator should be part of the csr_matrix interface, similar to items() in a dict() :)

Johny answered 6/7, 2015 at 11:11 Comment(3)
That's supposed to be matrix.indptr rather than comparisons.indptr, isn't it?Indetermination
No, comparisons.indptr[1:] - comparisons.indptr[:-1] is a vector with the lengths of the rows :) The chain / repeat part is just to iterate through the row indices, which are 0 0 0 ... 0 1 1 ... 1 2 2 ... 2 .... n n .... n (that is how csr works), so I simply repeat the row as much times as the number of elements in the row, and chain these all together.Johny
Uh, there's no variable 'comparisons' in scope!Indetermination

I had the same problem and actually, if your concern is only speed, the fastest way (more than 1 order of magnitude faster) is to convert the sparse matrix to a dense one (x.todense()) and iterate over the nonzero elements of the dense matrix. (Though, of course, this approach requires a lot more memory.)
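A sketch of that approach, assuming the matrix fits in memory (reusing the question's example values):

```python
import numpy as np
import scipy.sparse

x = scipy.sparse.lil_matrix((20, 1))
x[13, 0] = 1
x[15, 0] = 2

dense = np.asarray(x.todense())   # requires memory for every element, zero or not
rows, cols = np.nonzero(dense)    # indices of the nonzero entries only
for i, j in zip(rows, cols):
    print((int(i), int(j)), float(dense[i, j]))
```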

Devilmaycare answered 28/12, 2010 at 16:18 Comment(2)
Are you proposing using dense matrices all the time, or converting to dense and then iterating? I can't imagine the latter would be faster. But, of course using a dense matrix will be much, much faster if you have enough memory.Creigh
I guess it depends on the scenario and the kind of data. I've been doing some profiling on a script that iterates over matrices containing at least 50-100M boolean elements. When iterating, converting to dense and then iterating requires way less time than iterating using the 'best solution' from unutbu's answer. But of course the memory usage increases A LOT.Devilmaycare

Try filter(lambda x:x, x) instead of x.

Inviolate answered 30/11, 2010 at 21:56 Comment(0)
