I am using sklearn in Python to do some clustering. I've trained on 200,000 data points, and the code below works well.
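from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
corpus = open("token_from_xml.txt")  # iterating the file yields one document per line
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(n_clusters=30)
kmresult = km.fit_predict(tfidf)  # cluster label for each training document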
But when I have new test content, I'd like to assign it to the existing clusters I've already trained. So I'm wondering how to save the IDF result, so that I can compute TF-IDF for the new test content and make sure the resulting vectors have the same length as the training vectors.
Thanks in advance.
UPDATE
I may need to save the "transformer" or "tfidf" variable to a file (txt or another format), if one of them contains the trained IDF result.
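A minimal sketch of one way to persist the fitted objects, assuming Python's standard pickle module is acceptable for storage. The toy documents, the cluster count of 2, and the file name tfidf_kmeans.pkl are hypothetical and only there to keep the example self-contained. The point is that the vocabulary lives in the fitted vectorizer and the IDF weights live in the fitted transformer, so both are saved along with the k-means model.

```python
import os
import pickle
import tempfile

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus standing in for token_from_xml.txt.
train_docs = [
    "apple banana cherry",
    "apple banana date",
    "cherry date elder",
    "banana elder fig",
]

vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(train_docs))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

# Save all three fitted objects together (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "tfidf_kmeans.pkl")
with open(path, "wb") as f:
    pickle.dump((vectorizer, transformer, km), f)

# Later, possibly in another process: load and reuse them.
with open(path, "rb") as f:
    loaded_vectorizer, loaded_transformer, loaded_km = pickle.load(f)

# Use transform (not fit_transform) so the stored vocabulary and IDF are reused.
new_tfidf = loaded_transformer.transform(
    loaded_vectorizer.transform(["apple cherry date"])
)
label = loaded_km.predict(new_tfidf)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so this is a sketch for the same environment rather than a long-term storage format.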
UPDATE
For example, I have the training data:
["a", "b", "c"]
["a", "b", "d"]
After TF-IDF, the result contains 4 features (a, b, c, d).
When I test:
["a", "c", "d"]
to see which cluster (already built by k-means) it belongs to, TF-IDF only gives a result with 3 features (a, c, d), so the k-means prediction will fail. (If I test ["a", "b", "e"], there may be other problems.)
So how do I store the feature list for the test data (or, even better, store it in a file)?
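A small sketch of the toy case above, assuming each document is joined into a space-separated string. It suggests that the feature count is fixed by the fitted vectorizer as long as new data goes through transform rather than fit_transform. The token_pattern override is an assumption needed only because the default pattern drops one-letter tokens such as "a".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["a b c", "a b d"]

# Widen the token pattern so single-character tokens are kept (toy data only).
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
transformer = TfidfTransformer()
train_tfidf = transformer.fit_transform(vectorizer.fit_transform(train_docs))

# transform reuses the vocabulary learned during fitting.
test_tfidf = transformer.transform(vectorizer.transform(["a c d", "a b e"]))

# Feature names in column order, recovered from the fitted vocabulary.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(features)           # ['a', 'b', 'c', 'd']
print(train_tfidf.shape)  # (2, 4)
print(test_tfidf.shape)   # (2, 4)
```

In this sketch the unseen token "e" is simply ignored, so the second test row has non-zero weights only for "a" and "b".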
UPDATE
Solved, see answers below.