Keep TF-IDF result for predicting new content using scikit-learn in Python

I am using sklearn in Python to do some clustering. I've trained on 200,000 documents, and the code below works well.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = open("token_from_xml.txt")  # one document per line
vectorizer = CountVectorizer(decode_error="replace")
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
km = KMeans(30)
kmresult = km.fit(tfidf).predict(tfidf)

But when I have new testing content, I'd like to assign it to the existing clusters I trained. So I'm wondering how to save the IDF result, so that I can compute TF-IDF for the new testing content and make sure the result for the new content has the same array length.

Thanks in advance.

UPDATE

I may need to save the "transformer" or "tfidf" variable to a file (txt or other format), if one of them contains the trained IDF result.

UPDATE

For example, I have the training data:

["a", "b", "c"]
["a", "b", "d"]

After doing TF-IDF, the result will contain 4 features (a, b, c, d).

When I TEST:

["a", "c", "d"]

to see which cluster (already built by k-means) it belongs to. TF-IDF will only give a result with 3 features (a, c, d), so the k-means clustering will fail. (If I test ["a", "b", "e"], there may be other problems.)

So how do I store the feature list for the testing data (and, ideally, store it in a file)?
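A minimal sketch of the mismatch (the data is illustrative; multi-character tokens are used because CountVectorizer's default tokenizer drops single-character words):

from sklearn.feature_extraction.text import CountVectorizer

train = ["aaa bbb ccc", "aaa bbb ddd"]
test = ["aaa ccc ddd"]

# Fitting on the training data learns 4 features: aaa, bbb, ccc, ddd
print(CountVectorizer().fit_transform(train).shape)  # (2, 4)

# Re-fitting a fresh vectorizer on the test data learns only 3 features,
# so the resulting vectors are incompatible with a model trained on 4
print(CountVectorizer().fit_transform(test).shape)   # (1, 3)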

UPDATE

Solved, see answers below.

Godolphin answered 22/4, 2015 at 4:55 Comment(4)
By new content, what do you mean? New testing content or new training content? – Subpoena
New testing content @Subpoena – Godolphin
I guess you might not be able to append new training content to previously trained content. You have to train at least once with the entire training data; then you can pickle that trained model, which can be used later to eliminate the training delay. But when you get new training content, you would have to train at least once. – Subpoena
@Subpoena Thank you for your reply. I updated my question. I am not going to append new training content to previously trained content, but to TEST new content to see which cluster it belongs to. Will that be possible? – Godolphin

I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ddd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)

# Save vectorizer.vocabulary_
pickle.dump(vectorizer.vocabulary_, open("feature.pkl", "wb"))

# Load it later
transformer = TfidfTransformer()
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(open("feature.pkl", "rb")))
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))

That works: tfidf will have the same feature length as the training data.
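From there, assigning new content to the trained clusters is one more step; a sketch, assuming the KMeans model from the question was also pickled after training (the filename km.pkl is hypothetical):

# Hypothetical: km is the KMeans model trained in the question,
# saved with pickle after training and reloaded here.
km = pickle.load(open("km.pkl", "rb"))
print(km.predict(tfidf))  # cluster id(s) for the new content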

Godolphin answered 22/4, 2015 at 9:44 Comment(6)
Will this save the trained tfidf model along with the generated tfidf matrix? – Pulliam
The "Load it later" section is wrong! Why is it fit_transform? It should be only transform, technically, if you are transforming new/unseen data. – Spectatress
If anyone is interested, here is the explanation of what @Spectatress said: the fit() method calculates the parameters for a transformation, while the transform() method just transforms the dataset based on the parameters calculated by fit(). fit_transform() simply does one after the other in an optimized way. But in machine learning, we calculate parameters on the train set; while testing, we don't calculate any new parameters, we just apply the parameters calculated from the train set to transform the test data. So, while testing, we should only use transform() (see the sketch after these comments). – Homograph
I tried this example because they deleted an apparently similar question of mine, but I always get "The TF-IDF vectorizer is not fitted". – Pazpaza
@Homograph if you replace fit_transform, this code doesn't work. – Pazpaza
@Spectatress if you replace fit_transform, this code doesn't work. – Pazpaza
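To make the fit()/transform() distinction from the comments concrete, here is a minimal sketch with hypothetical count matrices:

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrices sharing the same 4 feature columns
train_counts = np.array([[1, 1, 1, 0], [1, 1, 0, 1]])
test_counts = np.array([[1, 0, 1, 1]])

transformer = TfidfTransformer()
transformer.fit(train_counts)                      # fit(): learn idf weights from training data
train_tfidf = transformer.transform(train_counts)  # transform(): apply the learned weights
test_tfidf = transformer.transform(test_counts)    # same weights reused; nothing is re-fitted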

Instead of using CountVectorizer for storing the vocabulary, the vocabulary of the TfidfVectorizer can be used directly.

Training phase:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf based vectors
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english",
                     lowercase=True, max_features=500000)

# Fit the model on the training corpus
tf_transformer = tf.fit(corpus)

# Dump the fitted vectorizer to disk
pickle.dump(tf_transformer, open("tfidf1.pkl", "wb"))

# Testing phase
tf1 = pickle.load(open("tfidf1.pkl", "rb"))

# Create a new TfidfVectorizer with the old vocabulary
tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english",
                          lowercase=True, max_features=500000, vocabulary=tf1.vocabulary_)
X_tf1 = tf1_new.fit_transform(new_corpus)

The fit_transform works here because we are using the old vocabulary. If you were not storing the vectorizer, you would just have used transform on the test data. Even when you do a transform there, the new documents from the test data are being "fitted" to the vocabulary of the vectorizer from training. That is exactly what we are doing here. The only thing we can store and reuse for a tfidf vectorizer is the vocabulary.
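Note that the object restored from the pickle is itself the fully fitted vectorizer, so an alternative sketch is to skip the re-fit entirely and transform with it directly:

# Alternative: the unpickled vectorizer is already fitted, so it can
# transform new documents directly, reusing both vocabulary and idf weights.
tf1 = pickle.load(open("tfidf1.pkl", "rb"))
X_tf1 = tf1.transform(new_corpus)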

Labradorite answered 14/9, 2018 at 18:42 Comment(2)
Why is everybody here saying the final fit_transform is not required when everybody posting code is using it? – Pazpaza
Btw, this is the only thing that works. – Pazpaza

If you want to store the feature list of the testing data for future use, you can do this:

import pickle

tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

# Store the result
with open("x_result.pkl", "wb") as handle:
    pickle.dump(tfidf, handle)

# Load the result
tfidf = pickle.load(open("x_result.pkl", "rb"))
Subpoena answered 22/4, 2015 at 9:27 Comment(1)
tfidf does not contain the feature list. I've successfully saved the feature list for reuse and answered this myself. Thank you for inspiring me. – Godolphin

A simpler solution: just use the joblib library, as the documentation says:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.externals import joblib  # in scikit-learn >= 0.23, use `import joblib` instead

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
feature_name = vectorizer.get_feature_names()  # get_feature_names_out() in newer versions
tfidf = TfidfTransformer()
tfidf.fit(X)

# save your model to disk
joblib.dump(tfidf, 'tfidf.pkl')

# load your model
tfidf = joblib.load('tfidf.pkl')
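Note that the dump above saves only the TfidfTransformer; to go from raw text to tf-idf vectors later, the fitted CountVectorizer has to be saved as well. A sketch (the filename and new_texts are illustrative):

# Also persist the fitted CountVectorizer; its vocabulary is needed
# to turn raw text into counts with the same columns.
joblib.dump(vectorizer, 'vectorizer.pkl')

# Later: load both and transform new raw documents
vectorizer = joblib.load('vectorizer.pkl')
tfidf = joblib.load('tfidf.pkl')
new_tfidf = tfidf.transform(vectorizer.transform(new_texts))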
Zoography answered 21/4, 2018 at 9:9 Comment(0)

You can do the vectorization and the TF-IDF transformation in one stage:

vec = TfidfVectorizer()

then fit and transform on the training data:

tfidf = vec.fit_transform(training_data)

and use the fitted vectorizer to transform the unseen data:

unseen_tfidf = vec.transform(unseen_data)
km = KMeans(30)
kmresult = km.fit(tfidf).predict(unseen_tfidf)
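As for the persistence question raised in the comments below: the fitted vec can itself be saved; a minimal sketch with pickle (the filename is illustrative):

import pickle

# Save the fitted vectorizer (vocabulary and idf weights included)
with open("vec.pkl", "wb") as f:
    pickle.dump(vec, f)

# Later: reload it and transform unseen data without re-fitting
with open("vec.pkl", "rb") as f:
    vec = pickle.load(f)
unseen_tfidf = vec.transform(unseen_data)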
Strophanthus answered 22/4, 2015 at 5:38 Comment(2)
Thanks. But I want to save the tfidf result to a file (txt or something) and load it later. You mean to reuse the "vec" variable, but can it be saved? – Godolphin
This is the standard use of TfidfVectorizer. OP had a requirement to save and reload the vectorizer. While it might sound strange, it is useful for people using services such as Amazon SageMaker, where training and prediction run on separate EC2 instances. – Manzano
