How to find indices of non zero elements in large sparse matrix?

Asked 9/9, 2011 at 12:12 Answered 9/9, 2011 at 15:20

Solved python algorithm r sparse-matrix indices

i have two sq matrix (a, b) of size in order of 100000 X 100000. I have to take difference of these two matrix (c = a-b). Resultant matrix 'c' is a sparse matrix. I want to find the indices of all non-zero elements. I have to do this operation many times (>100).

Simplest way is to use two for loops. But that's computationally intensive. Can you tell me any algorithm or package/library preferably in R/python/c to do this as quickly as possible?

Towle answered 9/9, 2011 at 12:12 Comment(7)

It seems to me that any way you do this, you'd have to touch every entry anyway. But then, the same is true for subtraction; so what if, while subtracting, you immediately store the resulting matrix in one of the sparse matrix data-structures? Can you do that? – Domenicadomenico 9/9, 2011 at 12:36

Do you know if there is any relationship between the locations of the non-zero entries in A and B, such as they tend to be in the same positions? Otherwise, C will have a union of their two locations. – Eightieth 9/9, 2011 at 12:44

Are your square matrices dense? – Cookout 9/9, 2011 at 13:2

@harold yeah that can be one of the solution. thanks. – Towle 9/9, 2011 at 13:14

@Eightieth right now, i dont have knowledge for the relationship – Towle 9/9, 2011 at 13:15

@David yeah, it is dense matrix – Towle 9/9, 2011 at 13:16

Where do you get your matrices from? – Denial 9/9, 2011 at 16:2

Since you have two dense matrices then the double for loop is the only option you have. You don't need a sparse matrix class at all since you only want to know the list of indices (i,j) for which a[i,j] != b[i,j].

In languages like R and Python the double for loop will perform poorly. I'd probably write this in native code for a double for loop and add the indices to a list object. But no doubt the wizards of interpreted code (i.e. R, Python etc.) know efficient ways to do it without resorting to native coding.

Cookout answered 9/9, 2011 at 13:21 Comment(2)

AFAIK, the only other method is vectorization. :) – Eightieth 9/9, 2011 at 14:10

@Eightieth well yes, but that's essentially the same method – Cookout 9/9, 2011 at 14:19

In R, if you use the Matrix package, and sparseMatrix for the conversion from the coordinate list to the sparse matrix, then you can convert back to the 3 column via:

TmpX <- as(M, "dgTMatrix")
X3col <- matrix(c(TmpX@i, TmpX@j, TmpX@val), ncol = 3)

This will give you the coordinates and values in the sparse matrix.

Depending on the locations of non-zero entries in A and B, you may find it much better to work with the coordinate list than the sparse matrix representation (there are, by the way, dozens of sparse matrix representations), as you can take direct advantage of vectorized operations, rather than rely upon your sparse matrix package to perform optimally. I tend to alternate between using the COO or sparse matrix support in different languages, depending on how I will get the fastest performance for the algorithm of interest.

Update 1: I was unaware that your two matrices, A and B, are dense. As such, the easiest solution for finding non-zero entries in C is quite simply to not even subtract at first - just compare the entries of A and B. A logical comparison should be faster than a subtraction. First, find the entries of A and B where A != B, then subtract just those entries. Next, you simply need to convert from the vectorization of indices in A and B to their (row, col) representation. This is similar to ind2sub and sub2ind of Matlab - take a look at this R reference for the calculations.

Eightieth answered 9/9, 2011 at 12:54 Comment(0)

You could use c.nonzero() method:

>>> from scipy.sparse import lil_eye
>>> c = lil_eye((4, 10)) # as an example
>>> c
<4x10 sparse matrix of type '<type 'numpy.float64'>'
        with 4 stored elements in LInked List format>
>>> c.nonzero()
(array([0, 1, 2, 3], dtype=int32), array([0, 1, 2, 3], dtype=int32))
>>> import numpy as np
>>> np.ascontiguousarray(c)
array([  (0, 0) 1.0
  (1, 1)        1.0
  (2, 2)        1.0
  (3, 3)        1.0], dtype=object)

You don't need to calculate c matrix to find out indexes of non-zero elements in c = a - b; you could do (a != b).nonzero():

>>> a = np.random.random_integers(2, size=(4,4))
>>> b = np.random.random_integers(2, size=(4,4))
>>> (a != b).nonzero()
(array([0, 0, 1, 1, 1, 2, 3]), array([1, 2, 1, 2, 3, 2, 0]))
>>> a - b
array([[ 0,  1,  1,  0],
       [ 0,  1, -1, -1],
       [ 0,  0,  1,  0],
       [-1,  0,  0,  0]])

Mightily answered 9/9, 2011 at 12:25 Comment(0)

have a look at numpy it have everything you ask for and more!

See this for sparse matrix support

Cosmism answered 9/9, 2011 at 12:15 Comment(1)

thnks Fredrik. i had modified my question.I have to repeat the process for more than 100 times. do u know which algorithm are used. Because i might have to use other language & then i can code myself. – Towle 9/9, 2011 at 12:35

This code takes less then 0.1s.

m <- matrix(rpois(1000000,0.01),ncol=1000)
m0 <- lapply(seq(NCOL(m)),function(x) which(m[,x] != 0))

EDIT: For sparse matrices of any size (which fits memory).

DATA

library(data.table)

N <- 1e+5
n <- 1e+6

ta <- data.table(r=sample(seq(N), n,replace=TRUE),
                 c=sample(seq(N), n,replace=TRUE),
                 a=sample(1:20,n,replace=TRUE))
tb <- data.table(r=sample(seq(N), n,replace=TRUE),
                 c=sample(seq(N), n,replace=TRUE),
                 b=sample(1:20,n,replace=TRUE))
setkey(ta,r,c)
setkey(tb,r,c)

CODE

system.time(tw <- ta[tb][is.na(a)|is.na(b)|(a-b != 0),list(r=r,c=c)])

Zwick answered 9/9, 2011 at 13:27 Comment(4)

But he got 100000x100000 matrix. – Mathew 9/9, 2011 at 16:6

More: this code k<-100000; m<-rpois(k, 0.01); system.time(replicate(k, which(m != 0))) takes ~1m40s. If OP need to repeat this 100 times than you need over 1h to complete. – Mathew 9/9, 2011 at 16:16

Max vector size is 2^32~5*10^9<10*10^9, so it is impossible to construct matrix of that size in R. – Zwick 10/9, 2011 at 5:37

Vector size of sparse-matrix <not-equal> product of dimensions. – Corenecoreopsis 29/9, 2012 at 15:34

I haven't timed it, but the simplest code is

all.indices<- which (C>0, arr.ind=T)

Forfeiture answered 9/9, 2011 at 15:20 Comment(1)

This works well on sparse matrices created with Matrix package. – Corenecoreopsis 29/9, 2012 at 15:32

Recommended topics

Hot tags