Is there an efficient way of concatenating scipy.sparse matrices?
I'm working with some rather large sparse matrices (from 5000x5000 to 20000x20000) and need to find an efficient way to concatenate matrices in a flexible way in order to construct a stochastic matrix from separate parts.

Right now I'm using the following way to concatenate four matrices, but it's horribly inefficient. Is there any better way to do this that doesn't involve converting to a dense matrix?

rmat[0:m1.shape[0],0:m1.shape[1]] = m1
rmat[m1.shape[0]:rmat.shape[0],m1.shape[1]:rmat.shape[1]] = m2
rmat[0:m1.shape[0],m1.shape[1]:rmat.shape[1]] = bridge
rmat[m1.shape[0]:rmat.shape[0],0:m1.shape[1]] = bridge.transpose()
Banas answered 27/7, 2011 at 13:20 Comment(0)

The sparse library now has hstack and vstack for respectively concatenating matrices horizontally and vertically.
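A minimal sketch of the API (the matrix values are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack, vstack

a = csr_matrix(np.array([[1, 0], [0, 2]]))
b = csr_matrix(np.array([[0, 3], [4, 0]]))

h = hstack([a, b])  # requires equal row counts; result is 2x4
v = vstack([a, b])  # requires equal column counts; result is 4x2
```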

Proudhon answered 11/5, 2012 at 19:2 Comment(2)
Make sure you use scipy.sparse.hstack instead of numpy.hstack. – Overheat
It should be added to this answer that hstack concatenates sparse matrices with the same number of rows (horizontal concatenation) and vstack concatenates sparse matrices with the same number of columns (vertical concatenation). – Marta

Using hstack, vstack, or concatenate is dramatically slower than concatenating the inner data objects themselves. The reason is that hstack/vstack convert the sparse matrix to coo format, which can be very slow when the matrix is very large and not already in coo format. Here is the code for concatenating csc matrices; a similar method can be used for csr matrices:

import numpy as np
from scipy.sparse import csc_matrix

def concatenate_csc_matrices_by_columns(matrix1, matrix2):
    # Both matrices must have the same number of rows.
    new_data = np.concatenate((matrix1.data, matrix2.data))
    new_indices = np.concatenate((matrix1.indices, matrix2.indices))
    # Shift matrix2's column pointers past matrix1's nonzeros, drop the
    # leading 0, and append to matrix1's pointer array.
    new_ind_ptr = matrix2.indptr + len(matrix1.data)
    new_ind_ptr = new_ind_ptr[1:]
    new_ind_ptr = np.concatenate((matrix1.indptr, new_ind_ptr))

    # Pass the shape explicitly; otherwise the row count is inferred
    # from the indices and can come out too small.
    return csc_matrix((new_data, new_indices, new_ind_ptr),
                      shape=(matrix1.shape[0], matrix1.shape[1] + matrix2.shape[1]))
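The csr analogue mentioned above can be sketched with the same indptr trick; this helper is not from the answer, just an illustration of row-wise concatenation under the same assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def concatenate_csr_matrices_by_rows(matrix1, matrix2):
    # Both matrices must have the same number of columns.
    new_data = np.concatenate((matrix1.data, matrix2.data))
    new_indices = np.concatenate((matrix1.indices, matrix2.indices))
    # Shift matrix2's row pointers past matrix1's nonzeros and drop
    # the leading 0 before appending.
    new_indptr = np.concatenate((matrix1.indptr,
                                 matrix2.indptr[1:] + matrix1.indptr[-1]))
    return csr_matrix((new_data, new_indices, new_indptr),
                      shape=(matrix1.shape[0] + matrix2.shape[0], matrix1.shape[1]))

# Example: stack a 2x2 identity on top of a 1x2 row of ones.
stacked = concatenate_csr_matrices_by_rows(csr_matrix(np.eye(2)),
                                           csr_matrix(np.ones((1, 2))))
```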
Godesberg answered 21/10, 2015 at 12:39 Comment(4)
Was just looking for a fast way of appending new rows to a CSR matrix. This is exactly what I need. Thanks @amos. – Cooper
If you use this method you need to specify the shape in return csc_matrix((new_data, new_indices, new_ind_ptr)), i.e.: return csc_matrix((new_data, new_indices, new_ind_ptr), shape=(matrix1.shape[0], matrix1.shape[1] + matrix2.shape[1])). – Liverpudlian
What would be the code for csr matrices? Is the native scipy implementation really faster now? I have to concatenate four submatrices (upper-left, upper-right, lower-left, lower-right) and I am not satisfied with the result. It takes less time to recompute the entire matrix, although I would only have to compute the upper-right and lower-left blocks. So this slowness essentially makes tabulation useless in my case. It annoys me because I think you would just have to change some pointers in C if both the matrix and the operation were optimally implemented. – Rainer
Although I am not sure whether the index pointers are stored in a list or in an array in C. If it were a list, would you not just have to reset one pointer at the end of the list? The way it is now, the larger the matrix, the longer the stacking takes... – Rainer

Amos's answer is no longer necessary. SciPy now does something similar to this internally if the input matrices are in csr or csc format and the requested output format is either None or the same format as the inputs. It is efficient to vertically stack matrices in csr format, or to horizontally stack matrices in csc format, using scipy.sparse.vstack or scipy.sparse.hstack respectively.
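A small sketch of the fast paths described above, with illustrative matrices; on a recent SciPy both results come back in the input format rather than coo:

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix, vstack, hstack

a = csr_matrix(np.eye(3))
b = csr_matrix(np.ones((2, 3)))
v = vstack([a, b])  # both inputs csr -> efficient, result stays csr

c = csc_matrix(np.eye(3))
d = csc_matrix(np.ones((3, 2)))
h = hstack([c, d])  # both inputs csc -> efficient, result stays csc
```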

Nancinancie answered 31/8, 2017 at 21:6 Comment(3)
Which version does "now" refer to? Do you have any reference for this? – Ezraezri
The relevant code is this snippet from scipy.sparse.bmat, which both vstack and hstack use. This hack was originally added in 2013. It looks like it was originally included in scipy 1.0.0. – Nancinancie
Actually, I was wrong about that. It was originally included in 0.14. – Nancinancie

Okay, I found the answer. Using scipy.sparse.coo_matrix is much, much faster than using lil_matrix. I converted the matrices to coo (painless and fast) and then just concatenated the data, rows and columns after adding the right offsets.

import numpy as np
from scipy.sparse import coo_matrix

# m1S, m2S, bridgeS and bridgeTS are the four blocks in coo format.
# Block layout: [[m1S, bridgeS], [bridgeTS, m2S]].
data = np.concatenate((m1S.data, bridgeS.data, bridgeTS.data, m2S.data))
rows = np.concatenate((m1S.row, bridgeS.row, bridgeTS.row + m1S.shape[0], m2S.row + m1S.shape[0]))
cols = np.concatenate((m1S.col, bridgeS.col + m1S.shape[1], bridgeTS.col, m2S.col + m1S.shape[1]))

rmat = coo_matrix((data, (rows, cols)),
                  shape=(m1S.shape[0] + m2S.shape[0], m1S.shape[1] + m2S.shape[1]))
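For reference, the same approach run end-to-end on tiny illustrative blocks (the values are hypothetical), building the block matrix [[m1, bridge], [bridge.T, m2]]:

```python
import numpy as np
from scipy.sparse import coo_matrix

m1 = coo_matrix(np.array([[1, 0], [0, 2]]))      # top-left, 2x2
m2 = coo_matrix(np.array([[3]]))                 # bottom-right, 1x1
bridge = coo_matrix(np.array([[0], [4]]))        # top-right, 2x1
bridgeT = bridge.transpose().tocoo()             # bottom-left, 1x2

data = np.concatenate((m1.data, bridge.data, bridgeT.data, m2.data))
rows = np.concatenate((m1.row, bridge.row, bridgeT.row + m1.shape[0], m2.row + m1.shape[0]))
cols = np.concatenate((m1.col, bridge.col + m1.shape[1], bridgeT.col, m2.col + m1.shape[1]))

rmat = coo_matrix((data, (rows, cols)),
                  shape=(m1.shape[0] + m2.shape[0], m1.shape[1] + m2.shape[1]))
```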
Banas answered 28/7, 2011 at 3:46 Comment(1)
Thanks for coming back and commenting on how you did it quickly. I needed it for my NLP class. – Producer

Amos' solution only works for two matrices and fails if one of them is a zero matrix, which happens very often in my use case.

hstack is not efficient either if there are zero matrices (they have no indptr attribute).

This version does not have these limitations:

import numpy as np
from scipy.sparse import csc_matrix

def concatenate_sparse_csc_matrices(list_of_matrices):
    # All matrices must have the same number of rows.
    num_rows = list_of_matrices[0].shape[0]
    num_cols = sum(m.shape[1] for m in list_of_matrices)

    data = []
    indices = []
    # Build indptr from per-column nonzero counts, so empty matrices
    # (which may lack a usable indptr) are handled as well.
    indptr_diff = [[0]]

    for m in list_of_matrices:
        if m.nnz > 0:
            data.append(m.data)
            indices.append(m.indices)
            indptr_diff.append(np.diff(m.indptr))
        else:
            indptr_diff.append([0] * m.shape[1])

    if data:
        data = np.concatenate(data)

    if indices:
        indices = np.concatenate(indices)

    indptr = np.cumsum(np.concatenate(indptr_diff))
    concatenated_matrix = csc_matrix((data, indices, indptr), shape=(num_rows, num_cols))

    return concatenated_matrix
Merit answered 1/3, 2024 at 12:36 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.