How to make TF-IDF matrix dense?

I am using TfidfVectorizer to convert a collection of raw documents into a matrix of TF-IDF features, which I then plan to feed into a k-means algorithm (which I will implement myself). In that algorithm I will have to compute distances between centroids (categories of articles) and data points (articles). I am going to use Euclidean distance, so I need these two entities to be of the same dimension, in my case max_features. Here is what I have:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10, strip_accents='unicode', analyzer='word', stop_words=stop_words.extra_stopwords, lowercase=True, use_idf=True)
X = tfidf.fit_transform(data['Content'])  # the matrix articles x max_features(=words)
for i, row in enumerate(X):
    print X[i]

However X seems to be a sparse(?) matrix, since the output is:

  (0, 9)    0.723131915847
  (0, 8)    0.090245047798
  (0, 6)    0.117465276892
  (0, 4)    0.379981697363
  (0, 3)    0.235921470645
  (0, 2)    0.0968780456528
  (0, 1)    0.495689001273

  (0, 9)    0.624910843051
  (0, 8)    0.545911131362
  (0, 7)    0.160545991411
  (0, 5)    0.49900042174
  (0, 4)    0.191549050212

  ...

I think the (0, col) part states the column index in the matrix, which acts like an array where every cell points to a list.

How do I convert this matrix to a dense one (so that every row has the same number of columns)?


>>> print type(X)
<class 'scipy.sparse.csr.csr_matrix'>
Iraidairan answered 31/1, 2016 at 1:44 Comment(2)
Can you print type(X)?Lothario
With pleasure @Will, I updated my question.Iraidairan

This should be as simple as:

dense = X.toarray()

TfidfVectorizer.fit_transform() returns a SciPy csr_matrix (Compressed Sparse Row matrix), which has a toarray() method just for this purpose. There are several sparse matrix formats in SciPy, but they all have a .toarray() method.

Note that for a large matrix, this will use a tremendous amount of memory compared to a sparse matrix, so generally it's a good approach to leave it sparse for as long as possible.
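
If the only reason for densifying is the Euclidean distance step of your k-means implementation, the conversion may not be needed at all: scikit-learn's pairwise helpers accept sparse input. Here is a minimal sketch of that approach (k, the random initialization, and the variable names are illustrative assumptions, not from your question):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# X is the sparse TF-IDF matrix returned by fit_transform(); k and the
# centroid initialization below are illustrative assumptions.
k = 3
rng = np.random.RandomState(0)
centroids = X[rng.choice(X.shape[0], k, replace=False)].toarray()

# euclidean_distances accepts sparse input directly, so X never needs toarray()
distances = euclidean_distances(X, centroids)   # shape: (n_documents, k)
labels = distances.argmin(axis=1)               # nearest centroid per document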

Lothario answered 31/1, 2016 at 2:21 Comment(5)
Then maybe I should leave it sparse and alter my distance function to treat missing entries as 0, but I am not sure how to do that, so I will use the dense format to implement the k-means algorithm first!Iraidairan
Yeah, for bigger datasets you'll need to stay sparse as much as possible. For your case of iterating through rows of a sparse matrix, try some of the approaches here (see the sketch after these comments). You can iterate over them, but you'll need some kind of generator that returns 0s for rows/cells that aren't populated.Lothario
@Iraidairan don't expect good results from k-means on such data. (You can run k-means on sparse data)Flutist
@Anony-Mousse, that's mostly for getting a feel of Hadoop, so yes I know. :/ Thanks Will!Iraidairan
@Lothario You are right Will, for a large matrix a sparse matrix is best. However, I was trying to use Affinity Propagation for clustering, and it throws an error if I provide a sparse matrix, hence I had to go with toarray(). But the main problem is that it uses a lot of RAM and kills my process. How do I overcome such problems?Florrieflorry
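
A rough sketch of the row-by-row iteration suggested in the comments above — each row is densified on demand, so missing cells become 0s without ever converting the whole matrix (the helper name dense_rows is made up for illustration):

def dense_rows(sparse_matrix):
    """Yield one dense 1-D row at a time from a SciPy sparse matrix."""
    for i in range(sparse_matrix.shape[0]):
        # getrow() keeps the slice sparse; toarray().ravel() densifies only
        # this row, filling in zeros without materializing the full matrix
        yield sparse_matrix.getrow(i).toarray().ravel()

# usage: every yielded row has exactly max_features entries, zeros included
for row in dense_rows(X):
    pass  # e.g. compute the distance of this article to each centroid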
