How to make TF-IDF matrix dense?

I am using TfidfVectorizer to convert a collection of raw documents into a matrix of TF-IDF features, which I then plan to feed into a k-means algorithm (which I will implement myself). In that algorithm I will have to compute distances between centroids (categories of articles) and data points (articles). I am going to use Euclidean distance, so I need these two entities to be of the same dimension, in my case max_features. Here is what I have:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10, strip_accents='unicode', analyzer='word', stop_words=stop_words.extra_stopwords, lowercase=True, use_idf=True)
X = tfidf.fit_transform(data['Content'])  # the matrix articles x max_features(=words)
for i, row in enumerate(X):
    print X[i]

However X seems to be a sparse(?) matrix, since the output is:

  (0, 9)    0.723131915847
  (0, 8)    0.090245047798
  (0, 6)    0.117465276892
  (0, 4)    0.379981697363
  (0, 3)    0.235921470645
  (0, 2)    0.0968780456528
  (0, 1)    0.495689001273

  (0, 9)    0.624910843051
  (0, 8)    0.545911131362
  (0, 7)    0.160545991411
  (0, 5)    0.49900042174
  (0, 4)    0.191549050212

  ...

I think the (0, col) part states the column index in the matrix, which acts like an array where every cell points to a list.

How do I convert this matrix to a dense one (so that every row has the same number of columns)?


>>> print type(X)
<class 'scipy.sparse.csr.csr_matrix'>
Iraidairan answered 31/1, 2016 at 1:44 Comment(2)
Can you print type(X)?Lothario
With pleasure @Will, I updated my question.Iraidairan

This should be as simple as:

dense = X.toarray()

TfidfVectorizer.fit_transform() returns a SciPy csr_matrix (Compressed Sparse Row matrix), which has a toarray() method just for this purpose. There are several sparse matrix formats in SciPy, but they all have a .toarray() method.

Note that for a large matrix, this will use a tremendous amount of memory compared to a sparse matrix, so generally it's a good approach to leave it sparse for as long as possible.
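
If the only reason for densifying is the Euclidean distance step of your k-means implementation, the conversion may not be needed at all: scikit-learn's pairwise helpers accept sparse input. Here is a minimal sketch of that approach (k, the random initialization, and the variable names are illustrative assumptions, not from your question):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# X is the sparse TF-IDF matrix returned by fit_transform(); k and the
# centroid initialization below are illustrative assumptions.
k = 3
rng = np.random.RandomState(0)
centroids = X[rng.choice(X.shape[0], k, replace=False)].toarray()

# euclidean_distances accepts sparse input directly, so X never needs toarray()
distances = euclidean_distances(X, centroids)   # shape: (n_documents, k)
labels = distances.argmin(axis=1)               # nearest centroid per document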

Lothario answered 31/1, 2016 at 2:21 Comment(5)
Then maybe I should leave it sparse and alter my distance function to treat missing entries as 0, but I am not sure how to do that, so I will use the dense format to implement the k-means algorithm first!Iraidairan
Yeah, for bigger datasets you'll need to stay sparse as much as possible. For your case of iterating through rows of a sparse matrix, try some of the approaches here (see the sketch after these comments). You can iterate over them, but you'll need some kind of generator that returns 0s for rows/cells that aren't populated.Lothario
@Iraidairan don't expect good results from k-means on such data. (You can run k-means on sparse data)Flutist
@Anony-Mousse, that's mostly for getting a feel of Hadoop, so yes I know. :/ Thanks Will!Iraidairan
@Lothario You are right Will, for a large matrix a sparse matrix is best. However, I was trying to use Affinity Propagation for clustering, and it throws an error if I provide a sparse matrix, hence I had to go with toarray(). But the main problem is that it uses a lot of RAM and kills my process. How do I overcome such problems?Florrieflorry
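
A rough sketch of the row-by-row iteration suggested in the comments above — each row is densified on demand, so missing cells become 0s without ever converting the whole matrix (the helper name dense_rows is made up for illustration):

def dense_rows(sparse_matrix):
    """Yield one dense 1-D row at a time from a SciPy sparse matrix."""
    for i in range(sparse_matrix.shape[0]):
        # getrow() keeps the slice sparse; toarray().ravel() densifies only
        # this row, filling in zeros without materializing the full matrix
        yield sparse_matrix.getrow(i).toarray().ravel()

# usage: every yielded row has exactly max_features entries, zeros included
for row in dense_rows(X):
    pass  # e.g. compute the distance of this article to each centroid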
