I have a very large and also sparse matrix (531K x 315K), the number of total cells is ~167 Billion. The non-zero values are only 1s. Total number of non-zero values are around 45K. Is there an efficient NMF package to solve my problem? I know there are couple of packages for that and they are working well only for small size of data matrix. Any idea helps. Thanks in advance.
Very Large and Very Sparse Non Negative Matrix factorization
Asked Answered
Which lib did you try? What's your model? And how many components you are looking for? –
Gaylenegayler
I tried scikit-learn lib and it failed. Also, tried nipunbatra.github.io/blog/2017/nnmf-tensorflow.html and it also failed due to the out of memory issue. The latent dimension is 50. –
Mada
scikit-learn will handle this easily!
Code:
from time import perf_counter as pc
import numpy as np
import scipy.sparse as sps
from sklearn.decomposition import NMF
""" Create sparse data """
nnz_i, nnz_j, nnz_val = np.random.choice(531000, size=45000), \
np.random.choice(315000, size=45000), \
np.random.random(size=45000)
X = sps.csr_matrix((nnz_val, (nnz_i, nnz_j)), shape=(531000, 315000))
print('X-shape: ', X.shape, ' X nnzs: ', X.nnz)
print('type(X): ', type(X))
# <class 'scipy.sparse.csr.csr_matrix'> # !!!!!!!!!!
""" NMF """
model = NMF(n_components=50, init='random', random_state=0, verbose=True)
start_time = pc()
W = model.fit_transform(X)
end_time = pc()
print('Used (secs): ', end_time - start_time)
print(model.reconstruction_err_)
print(model.n_iter_)
Output:
X-shape: (531000, 315000) X nnzs: 45000
type(X): <class 'scipy.sparse.csr.csr_matrix'>
violation: 1.0
violation: 0.2318929397542804
violation: 0.11045394409727402
violation: 0.08104138988253409
...
violation: 9.659665625799714e-05
Converged at iteration 71
Used (secs): 247.94092973091756
122.27109041
70
Remarks:
- Make sure you use sparse-matrices as input or you can't exploit sparsity
- I'm using version 0.19.1, so the multiplicative-update solver is used (>= 0.19)
- But the older CD-based solver should handle this too!
- The above is using < 800 MB of memory
Additional Constraints
As mentioned in the comments, OP wants to add additional constraints, while still not specifying these formally.
This will need a whole new implementation of some optimization-procedure including some theory-footwork (depending on the constraints).
As an alternative, this can be solved by general-purpose Convex-Programming solvers. E.g. formulated by cvxpy and solved by SCS. Of course the alternating-minimization procedure needs to be done too (as the joint-problem is non-convex) and it will scale worse than this specialized sklearn-implementation. But it might work for OPs data.
thanks for the example but also I need to penalize some of the cells to be zero. Sklearn doesn't allow me to modify the cost function. –
Mada
@Mada And where was that specified in your question? Why did you try and mention sklearn then when i asked? There is no detailed/formal model in your question! And if so, you probably have to implement it yourself (the whole learning-procedure). –
Gaylenegayler
What ways are there to plot a three-dimension solution in an 3D axis? Thanks. –
Waller
@Emmanuel plotly is a good place to start. plotly.com/python/pca-visualization –
Ilka
© 2022 - 2024 — McMap. All rights reserved.