Storing numpy sparse matrix in HDF5 (PyTables)

I am having trouble storing a SciPy csr_matrix with PyTables. I'm getting this error:

TypeError: objects of type ``csr_matrix`` are not supported in this context, sorry; supported objects are: NumPy array, record or scalar; homogeneous list or tuple, integer, float, complex or string

My code:

f = tables.openFile(path,'w')

atom = tables.Atom.from_dtype(self.count_vector.dtype)
ds = f.createCArray(f.root, 'count', atom, self.count_vector.shape)
ds[:] = self.count_vector
f.close()

Any ideas?

Thanks

Enrico answered 20/6, 2012 at 23:6 Comment(2)

Are you worried about the size of the data on disk? I think hdf5 files can be stored in compressed format, in which case you might get away with just storing the dense matrix. – Slype 20/6, 2012 at 23:26

See #8895620, it looks like there is no pytables support for sparse matrices. – Slype 20/6, 2012 at 23:30

A CSR matrix can be fully reconstructed from its data, indices and indptr attributes. These are just regular NumPy arrays, so there should be no problem storing them as three separate arrays in PyTables and then passing them back to the csr_matrix constructor. See the SciPy docs.

Edit: Pietro's answer has pointed out that the shape attribute should also be stored.
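
To make the reconstruction concrete, here is a minimal pure-Python sketch (no SciPy or PyTables required) that rebuilds a dense matrix from the three arrays plus the shape. The function name `csr_to_dense` and the sample values are made up for illustration; in practice you would simply hand the arrays to the csr_matrix constructor.

```python
def csr_to_dense(data, indices, indptr, shape):
    """Rebuild a dense list-of-lists matrix from CSR components."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row in range(n_rows):
        # The stored entries of this row are data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            # indices[k] is the column of data[k]; duplicates are summed.
            dense[row][indices[k]] += data[k]
    return dense


# Hypothetical 3x4 matrix with three stored values.
data = [10, 20, 30]
indices = [0, 2, 1]      # column index of each stored value
indptr = [0, 2, 2, 3]    # row i spans data[indptr[i]:indptr[i + 1]]

dense = csr_to_dense(data, indices, indptr, shape=(3, 4))
```

Nothing beyond these four pieces is read, which is why writing them out as plain arrays should be enough to restore the matrix later.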

Lareine answered 21/6, 2012 at 0:56 Comment(3)

I believe the point, though, is to use it like a dense matrix. How could I convert a csr_matrix to a "dense-format" pytables instance? – Nickelplate 1/6, 2013 at 9:58

You can convert a csr_matrix to a dense array using its member function toarray, which can then be saved in pytables. Of course, this can potentially waste a lot of file space, although hdf5 has file compression options which may help. – Lareine 3/6, 2013 at 0:16

toarray() can't handle converting gigantic matrices to dense. I was hoping to construct the table directly from CSR. – Nickelplate 3/6, 2013 at 0:39
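
One way around the memory limit raised in this comment thread might be to densify only a few rows at a time and write each block to disk before building the next. The sketch below shows the idea in pure Python; `iter_dense_row_blocks` and the `block_rows` parameter are hypothetical names, and the sample matrix is made up.

```python
def iter_dense_row_blocks(data, indices, indptr, n_cols, block_rows):
    """Yield (start, stop, block), where block holds dense rows start..stop-1."""
    n_rows = len(indptr) - 1
    for start in range(0, n_rows, block_rows):
        stop = min(start + block_rows, n_rows)
        # Only this block is ever held in dense form.
        block = [[0] * n_cols for _ in range(stop - start)]
        for row in range(start, stop):
            for k in range(indptr[row], indptr[row + 1]):
                block[row - start][indices[k]] += data[k]
        yield start, stop, block


# Hypothetical 3x4 CSR matrix, densified two rows at a time.
data = [10, 20, 30]
indices = [0, 2, 1]
indptr = [0, 2, 2, 3]

blocks = list(iter_dense_row_blocks(data, indices, indptr, n_cols=4, block_rows=2))
```

With SciPy and PyTables the same idea would presumably come down to something like `ds[start:stop, :] = m[start:stop].toarray()` inside a loop over row ranges, where `ds` is the on-disk array and `m` the csr_matrix. That variant is untested here.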

The answer by DaveP is almost right... but can cause problems for very sparse matrices: if the last column(s) are empty (or the last row(s), for a CSC matrix), they are dropped, because the shape is then inferred from the stored indices. So to be sure that everything works, the "shape" attribute must be stored too.
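
A small pure-Python sketch of why the shape gets lost. It assumes the shape is guessed from the arrays alone, the way a constructor has to when no shape is given; `infer_csr_shape` is a hypothetical helper written for this illustration, not a SciPy function.

```python
def infer_csr_shape(indices, indptr):
    """Guess a CSR shape when none was stored alongside the arrays."""
    n_rows = len(indptr) - 1                      # indptr has one entry per row, plus one
    n_cols = max(indices) + 1 if indices else 0   # largest column index actually used
    return n_rows, n_cols


# Hypothetical 3x5 matrix whose last two columns hold no values.
true_shape = (3, 5)
indices = [0, 2, 1]
indptr = [0, 2, 2, 3]

guessed = infer_csr_shape(indices, indptr)
```

Here the guess comes out narrower than `true_shape`, since nothing in the arrays records that the empty trailing columns ever existed.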

This is the code I regularly use:

import tables as tb
from numpy import array
from scipy import sparse

def store_sparse_mat(m, name, store='store.h5'):
    msg = "This code only works for csr matrices"
    assert(m.__class__ == sparse.csr.csr_matrix), msg
    with tb.openFile(store,'a') as f:
        for par in ('data', 'indices', 'indptr', 'shape'):
            full_name = '%s_%s' % (name, par)
            try:
                n = getattr(f.root, full_name)
                n._f_remove()
            except AttributeError:
                pass

            arr = array(getattr(m, par))
            atom = tb.Atom.from_dtype(arr.dtype)
            ds = f.createCArray(f.root, full_name, atom, arr.shape)
            ds[:] = arr

def load_sparse_mat(name, store='store.h5'):
    with tb.openFile(store) as f:
        pars = []
        for par in ('data', 'indices', 'indptr', 'shape'):
            pars.append(getattr(f.root, '%s_%s' % (name, par)).read())
    m = sparse.csr_matrix(tuple(pars[:3]), shape=pars[3])
    return m

It is trivial to adapt it to csc matrices.

Sosa answered 23/3, 2014 at 9:23 Comment(2)

What does the name variable correspond to in the above answer? – Trottier 13/4, 2017 at 10:0

@Rama: just a key under which the object is stored. It is arbitrary; you only need it to retrieve the object later (in the same HDF store, you can store tons of different objects). – Sosa 14/4, 2017 at 6:49

I have updated Pietro Battiston's excellent answer for Python 3.6 and PyTables 3.x, as some PyTables function names have changed in the upgrade from 2.x.

import numpy as np
from scipy import sparse
import tables

def store_sparse_mat(M, name, filename='store.h5'):
    """
    Store a csr matrix in HDF5

    Parameters
    ----------
    M : scipy.sparse.csr.csr_matrix
        sparse matrix to be stored

    name: str
        node prefix in HDF5 hierarchy

    filename: str
        HDF5 filename
    """
    assert isinstance(M, sparse.csr_matrix), 'M must be a csr matrix'
    with tables.open_file(filename, 'a') as f:
        for attribute in ('data', 'indices', 'indptr', 'shape'):
            full_name = f'{name}_{attribute}'

            # remove existing nodes
            try:  
                n = getattr(f.root, full_name)
                n._f_remove()
            except AttributeError:
                pass

            # add nodes
            arr = np.array(getattr(M, attribute))
            atom = tables.Atom.from_dtype(arr.dtype)
            ds = f.create_carray(f.root, full_name, atom, arr.shape)
            ds[:] = arr

def load_sparse_mat(name, filename='store.h5'):
    """
    Load a csr matrix from HDF5

    Parameters
    ----------
    name: str
        node prefix in HDF5 hierarchy

    filename: str
        HDF5 filename

    Returns
    -------
    M : scipy.sparse.csr.csr_matrix
        loaded sparse matrix
    """
    with tables.open_file(filename) as f:

        # get nodes
        attributes = []
        for attribute in ('data', 'indices', 'indptr', 'shape'):
            attributes.append(getattr(f.root, f'{name}_{attribute}').read())

    # construct sparse matrix
    M = sparse.csr_matrix(tuple(attributes[:3]), shape=attributes[3])
    return M

Thursday answered 31/5, 2017 at 10:46 Comment(0)
