I am using TfidfVectorizer to convert a collection of raw documents to a matrix of TF-IDF features, which I then plan to feed into a k-means algorithm (which I will implement myself). In that algorithm I will have to compute distances between centroids (categories of articles) and data points (articles). I am going to use Euclidean distance, so I need both to have the same dimension, in my case max_features. Here is what I have:
tfidf = TfidfVectorizer(max_features=10, strip_accents='unicode', analyzer='word', stop_words=stop_words.extra_stopwords, lowercase=True, use_idf=True)
X = tfidf.fit_transform(data['Content']) # the matrix articles x max_features(=words)
for i, row in enumerate(X):
    print X[i]
However, X seems to be a sparse(?) matrix, since the output is:
(0, 9) 0.723131915847
(0, 8) 0.090245047798
(0, 6) 0.117465276892
(0, 4) 0.379981697363
(0, 3) 0.235921470645
(0, 2) 0.0968780456528
(0, 1) 0.495689001273
(0, 9) 0.624910843051
(0, 8) 0.545911131362
(0, 7) 0.160545991411
(0, 5) 0.49900042174
(0, 4) 0.191549050212
...
I think the col in each (0, col) pair is the column index in the matrix, which actually seems to behave like an array in which every cell points to a list.
How do I convert this matrix to a dense one (so that every row has the same number of columns)?
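To make the goal concrete, here is a minimal sketch of the kind of conversion being asked about. It assumes X is a SciPy sparse matrix (which is what TfidfVectorizer.fit_transform returns); the tiny matrix and the centroid below are made-up stand-ins, not real TF-IDF output.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for tfidf.fit_transform(...):
# 2 documents x 4 features; a sparse matrix stores only the non-zero entries.
X = csr_matrix(np.array([[0.0, 0.6, 0.0, 0.8],
                         [1.0, 0.0, 0.0, 0.0]]))

# toarray() returns a regular 2-D NumPy array, so every row
# has exactly X.shape[1] columns, with zeros filled in.
dense = X.toarray()
print(dense.shape)  # (2, 4)

# Euclidean distance from every row (article) to a made-up centroid.
centroid = np.array([0.5, 0.3, 0.0, 0.4])
distances = np.sqrt(((dense - centroid) ** 2).sum(axis=1))
print(distances)
```

With max_features=10 a dense array is small, but for large vocabularies the dense copy can use far more memory than the sparse one.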
>print type(X)
<class 'scipy.sparse.csr.csr_matrix'>
print type(X)? – Lothario