Constructing sparse matrix from list of lists of tuples
Asked Answered
A

5

13

I have a Python list of row information for a sparse matrix. Each row is represented as a list of (column, value) tuples. Call it alist:

alist = [[(1,10), (3,-3)],
         [(2,12)]]

How can I efficiently construct a scipy sparse matrix from this list of lists, resulting in a matrix like this:

0  10   0  -3
0   0  12   0

The obvious approach is to make a scipy.sparse.lil_matrix, which internally has this "list of lists" structure. But from the scipy.sparse.lil_matrix — SciPy v0.19.0 Reference Guide I see just three ways of constructing them:

  • starting from a dense array
  • starting from another sparse array
  • just constructing an empty array

So the only way to get fresh data in is either to solve this problem with some other sparse matrix representation, or to start with a dense array, neither of which address the initial problem, and both of which seem likely to be less efficient representations than lil_matrix itself for this data.

I guess I can make an empty one, and use a loop to add values, but surely I'm missing something.

The scipy documentation is really frustrating when it comes to sparse matrices.

Academic answered 25/4, 2017 at 19:4 Comment(1)
See also, github.com/pandas-dev/pandas/issues/21909Alvera
R
8

Your data layout is an unusual one. Here's my first stab at using it.

In [565]: M = sparse.lil_matrix((2,4), dtype=int)
In [566]: M
Out[566]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 0 stored elements in LInked List format>
In [567]: for i,row in enumerate(alist):
     ...:     for col in row:
     ...:         M[i, col[0]] = col[1]
     ...:         
In [568]: M
Out[568]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in LInked List format>
In [569]: M.A
Out[569]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

Yes, it is iterative; and lil is the best format for that purpose.

Or using the common coo style of inputs:

In [580]: data,col,row = [],[],[]
In [581]: for i, rr in enumerate(alist):
     ...:     for cc in rr:
     ...:         row.append(i)
     ...:         col.append(cc[0])
     ...:         data.append(cc[1])
     ...:         
In [582]: data,col,row
Out[582]: ([10, -3, 12], [1, 3, 2], [0, 0, 1])
In [583]: M1=sparse.coo_matrix((data,(row,col)),shape=(2,4))
In [584]: M1
Out[584]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in COOrdinate format>
In [585]: M1.A
Out[585]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

Another option is to create the blank lil matrix, and directly fill in its attributes:

In other words, start with:

In [591]: m.data
Out[591]: array([[], []], dtype=object)
In [592]: m.rows
Out[592]: array([[], []], dtype=object)

and change them to:

In [587]: M.data
Out[587]: array([[10, -3], [12]], dtype=object)
In [588]: M.rows
Out[588]: array([[1, 3], [2]], dtype=object)

It would still require the 2 level iteration on your alist structure.

In [593]: for i, rr in enumerate(alist):
     ...:     for cc in rr:
     ...:         m.rows[i].append(cc[0])
     ...:         m.data[i].append(cc[1])       
In [594]: m
Out[594]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in LInked List format>
In [595]: m.A
Out[595]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

In another comment you mentioned the difficulty in understanding the csr indptr. The easiest way to get that is to convert one these formats:

In [597]: Mr=M.tocsr()
In [598]: Mr.indptr
Out[598]: array([0, 2, 3], dtype=int32)
In [599]: Mr.data
Out[599]: array([10, -3, 12])
In [600]: Mr.indices
Out[600]: array([1, 3, 2], dtype=int32)
Rationalism answered 25/4, 2017 at 19:30 Comment(3)
Such a clear, helpful, detailed answer - thank you! The constructor based on the COO format seems most natural, and I could come up with some generators to produce it and achieve a memory- and time-efficient input pipeline. I hope the scipy folks add some examples like this in a way that people would find them. This is the format my data came in, and given the number of systems that support all these different sparse formats, as discussed at Sparse array - Wikipedia, I'd think more people would exchange data using them.Academic
I first worked with sparse matrices in MATLAB for finite element problems. There the coo style of input is the only option, though internally it stores the data as csc (at least that's format it saves to .mat files). Most of sparse math was developed for linear algebra problems. scipy adds a number of formats (note the links in the Wiki article). Now much of the sparse matrix interest is coming from big-data problems, sparse feature matrices, machine learning, linguistics, and such. scikit learn for example adds some of its own compiled sparse matrix utilities.Rationalism
I find it surprising the original data structure of the OP would be considered unusual (contrary to a comment above)Alvera
A
9

If you have the whole alist before creating the sparse matrix, there is no need to use lil_matrix, since that is optimized for incrementally updating the sparse matrix.

If you want to do any sort of arithmetic with the matrix afterwords, csr_matrix is probably your best choice. You can construct the csr_matrix directly by using (data, (row, column)) format, like this:

In [40]: alist = [[(1,10), (3,-3)],
    ...:          [(2,12)]]

In [41]: i, j, data = zip(*((i, t[0], t[1]) for i, row in enumerate(alist) for t in row))

In [42]: (i, j, data)
Out[42]: ((0, 0, 1), (1, 3, 2), (10, -3, 12))

In [43]: csr_matrix((data, (i, j)), shape=(2, 4)).todense()
Out[43]: 
matrix([[ 0, 10,  0, -3],
        [ 0,  0, 12,  0]], dtype=int64)

If efficiency is a real concern, you can create the csr_matrix internal format (using indptr) directly:

In [57]: indptr = np.cumsum([0] + [len(row) for row in alist])

In [58]: j, data = zip(*(t for row in alist for t in row))

In [59]: csr_matrix((data, j, indptr), shape=(2, 4)).todense()
Out[59]: 
matrix([[ 0, 10,  0, -3],
        [ 0,  0, 12,  0]])

If you're converting to pandas afterwords, coo_matrix is the way to go, since pandas converts to coo_matrix anyway:

In [41]: i, j, data = zip(*((i, t[0], t[1]) for i, row in enumerate(alist) for t in row))

In [43]: coo_matrix((data, (i, j)), shape=(2, 4))
Aerobiosis answered 13/7, 2018 at 21:23 Comment(4)
Thanks for the deliberation! and actually, in my particular case, I just need it as a sparse feature matrix alongside other dataframe columns, that can be lightly munged and looked at with pandas, before fitting an sklearn model (.fit) over it.Alvera
In that case coo_matrix is the way to go, since pandas anyway converts to that formatAerobiosis
Is providing the shape to coo_matrix performance-wise helpful/necessary? it works even without that!Alvera
If you don't provide the shape, coo_matrix will deduce it from shape = (max(i), max(j)). This can lead to problems if for instance the last column or last row is all zeros, as you will lose that row/column from the matrix.Aerobiosis
R
8

Your data layout is an unusual one. Here's my first stab at using it.

In [565]: M = sparse.lil_matrix((2,4), dtype=int)
In [566]: M
Out[566]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 0 stored elements in LInked List format>
In [567]: for i,row in enumerate(alist):
     ...:     for col in row:
     ...:         M[i, col[0]] = col[1]
     ...:         
In [568]: M
Out[568]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in LInked List format>
In [569]: M.A
Out[569]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

Yes, it is iterative; and lil is the best format for that purpose.

Or using the common coo style of inputs:

In [580]: data,col,row = [],[],[]
In [581]: for i, rr in enumerate(alist):
     ...:     for cc in rr:
     ...:         row.append(i)
     ...:         col.append(cc[0])
     ...:         data.append(cc[1])
     ...:         
In [582]: data,col,row
Out[582]: ([10, -3, 12], [1, 3, 2], [0, 0, 1])
In [583]: M1=sparse.coo_matrix((data,(row,col)),shape=(2,4))
In [584]: M1
Out[584]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in COOrdinate format>
In [585]: M1.A
Out[585]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

Another option is to create the blank lil matrix, and directly fill in its attributes:

In other words, start with:

In [591]: m.data
Out[591]: array([[], []], dtype=object)
In [592]: m.rows
Out[592]: array([[], []], dtype=object)

and change them to:

In [587]: M.data
Out[587]: array([[10, -3], [12]], dtype=object)
In [588]: M.rows
Out[588]: array([[1, 3], [2]], dtype=object)

It would still require the 2 level iteration on your alist structure.

In [593]: for i, rr in enumerate(alist):
     ...:     for cc in rr:
     ...:         m.rows[i].append(cc[0])
     ...:         m.data[i].append(cc[1])       
In [594]: m
Out[594]: 
<2x4 sparse matrix of type '<class 'numpy.int32'>'
    with 3 stored elements in LInked List format>
In [595]: m.A
Out[595]: 
array([[ 0, 10,  0, -3],
       [ 0,  0, 12,  0]])

In another comment you mentioned the difficulty in understanding the csr indptr. The easiest way to get that is to convert one these formats:

In [597]: Mr=M.tocsr()
In [598]: Mr.indptr
Out[598]: array([0, 2, 3], dtype=int32)
In [599]: Mr.data
Out[599]: array([10, -3, 12])
In [600]: Mr.indices
Out[600]: array([1, 3, 2], dtype=int32)
Rationalism answered 25/4, 2017 at 19:30 Comment(3)
Such a clear, helpful, detailed answer - thank you! The constructor based on the COO format seems most natural, and I could come up with some generators to produce it and achieve a memory- and time-efficient input pipeline. I hope the scipy folks add some examples like this in a way that people would find them. This is the format my data came in, and given the number of systems that support all these different sparse formats, as discussed at Sparse array - Wikipedia, I'd think more people would exchange data using them.Academic
I first worked with sparse matrices in MATLAB for finite element problems. There the coo style of input is the only option, though internally it stores the data as csc (at least that's format it saves to .mat files). Most of sparse math was developed for linear algebra problems. scipy adds a number of formats (note the links in the Wiki article). Now much of the sparse matrix interest is coming from big-data problems, sparse feature matrices, machine learning, linguistics, and such. scikit learn for example adds some of its own compiled sparse matrix utilities.Rationalism
I find it surprising the original data structure of the OP would be considered unusual (contrary to a comment above)Alvera
O
4

You can create a dict of positions and values from the list of (column, value) tuples alist and then use dok_matrix to construct the sparse matrix

>>> d = {(i,j):v for i,l in enumerate(alist) for j,v in l}
>>> d
{(0, 1): 10, (0, 3): -3, (1, 2): 12}
>>> 
>>> from operator import itemgetter
>>> m = max(d.keys(), key=itemgetter(0))[0] + 1
>>> n = max(d.keys(), key=itemgetter(1))[1] + 1
>>> m,n
(2, 4)
>>>
>>> from scipy.sparse import dok_matrix
>>> S = dok_matrix((m,n), dtype=int)
>>> for pos,v in d.items():
...     S[pos] = v
... 
>>> S.todense()
matrix([[ 0, 10,  0, -3],
        [ 0,  0, 12,  0]])
>>> 
Orth answered 13/7, 2018 at 23:43 Comment(4)
When end with .todense() though?Alvera
Any reason absolutely requiring itemgetter here?Alvera
@matanster. .todense() is just to show how the matrix looks likeOrth
@matanster. itemgetter is not necessary. We just need to get the dimensions automatically. You can replace it with lambda t: t[0]. So it would look like m = max(d.keys(), key=lambda t: t[0])[0] + 1 and n = max(d.keys(), key=lambda t: t[1])[1] + 1Orth
O
3

Just wanted to post another answer using coo_matrix, It is a fast format for constructing sparse matrices.

>>> alist = [[(1, 10), (3, -3)], [(2, 12)]]
>>> row, col, data = zip(*((i,j,v) for i,l in enumerate(alist) for j,v in l))
>>>
>>> from scipy.sparse import coo_matrix
>>> S = coo_matrix((data, (row, col)), (max(row)+1,max(col)+1), dtype=np.int8)
>>> S.todense()
matrix([[ 0, 10,  0, -3],
        [ 0,  0, 12,  0]], dtype=int8)
>>> 
Orth answered 14/7, 2018 at 16:18 Comment(0)
S
0

Posted solutions can be summarized in three steps: add row numbers, explode row data, extract coordinates and data. Here is a bit more elaborative implementation with iterators:

alist = [[(1,10), (3,-3)],[(2,12)]]

import itertools
from scipy import sparse

# add row ids

it = enumerate(alist)
# explode row data by id
it = map(lambda t: itertools.product([t[0]],*t[1:]), it)
it = itertools.chain(*it)
# flatten tupples
it = map(lambda t: (t[0],*t[1]), it)
# extract coordinates and data 
i,j,data=zip(*it)
# use coo format
sparse.coo_matrix((data,(i,j)))
Staffan answered 4/6, 2023 at 13:21 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.