I am trying to apply PCA to a huge sparse matrix. The following link says that sklearn's RandomizedPCA can handle sparse matrices in scipy sparse format: Apply PCA on very large sparse matrix
However, I always get an error. Can someone point out what I am doing wrong?
Input matrix 'X_train' contains numbers in float64:
>>> type(X_train)
<class 'scipy.sparse.csr.csr_matrix'>
>>> X_train.shape
(2365436, 1617899)
>>> X_train.ndim
2
>>> X_train[0]
<1x1617899 sparse matrix of type '<type 'numpy.float64'>'
        with 81 stored elements in Compressed Sparse Row format>
I am trying to do:
>>> from sklearn.decomposition import RandomizedPCA
>>> pca = RandomizedPCA()
>>> pca.fit(X_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/decomposition/pca.py", line 567, in fit
    self._fit(check_array(X))
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 334, in check_array
    copy, force_all_finite)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/sklearn/utils/validation.py", line 239, in _ensure_sparse_format
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.
If I try to convert it to a dense matrix, I run out of memory:
>>> pca.fit(X_train.toarray())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 949, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/coo.py", line 274, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/RT11/.pyenv/versions/2.7.9/lib/python2.7/site-packages/scipy/sparse/base.py", line 800, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
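For scale, here is a rough back-of-the-envelope estimate of why the dense conversion cannot fit in memory. The shape and dtype come from the output above; the figure of 81 stored values per row is only an assumption extrapolated from X_train[0], and the 4-byte index size is likewise assumed. The dense array alone would need roughly 30 TB.

```python
# Shape and dtype as reported in the question.
n_rows, n_cols = 2365436, 1617899
bytes_per_value = 8  # float64

# Dense storage: every cell is materialised, zero or not.
dense_bytes = n_rows * n_cols * bytes_per_value

# ASSUMPTION: about 81 stored values per row, extrapolated from X_train[0] only.
assumed_nnz_per_row = 81
nnz = n_rows * assumed_nnz_per_row

# CSR keeps one value and one column index per stored element,
# plus a row-pointer array of length n_rows + 1 (4-byte indices assumed).
csr_bytes = nnz * (bytes_per_value + 4) + (n_rows + 1) * 4

print("dense: %.1f TB" % (dense_bytes / 1e12))
print("CSR (assumed density): %.1f GB" % (csr_bytes / 1e9))
```

So any approach that calls toarray() on the full matrix is ruled out, whatever the estimator.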
Have you tried TruncatedSVD and it did not work? If so, please document that in your question as well. – Electrolyze
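For reference, a minimal sketch of the TruncatedSVD route the comment asks about. It runs on a small random stand-in matrix, not on the real X_train; the shape, density, and n_components=10 are made-up values for illustration only. TruncatedSVD accepts scipy sparse input directly, but it does not center the data, so its result is not identical to PCA.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Hypothetical small stand-in for X_train (the real matrix is far larger).
X_small = sp.random(1000, 500, density=0.01, format="csr",
                    dtype=np.float64, random_state=0)

# n_components=10 is an arbitrary illustrative choice.
svd = TruncatedSVD(n_components=10, algorithm="randomized", random_state=0)

# fit_transform works on the sparse matrix without calling toarray().
X_reduced = svd.fit_transform(X_small)

print(X_reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

Whether this scales to a 2365436 x 1617899 matrix on a given machine would still need to be checked; the sketch only shows that the sparse input is accepted as-is.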