Problems using a custom vocabulary with scikit-learn's TfidfVectorizer

I'm trying to use a custom vocabulary in scikit-learn for some clustering tasks and I'm getting very weird results.

The program runs fine when I don't use a custom vocabulary, and I'm satisfied with the clusters it creates. However, I have already identified a group of around 24,000 words that I would like to use as a custom vocabulary.

The words are stored in a SQL Server table. So far I have tried two approaches, but I get the same results with both: the first creates a list, the second a dictionary. The code that builds both looks like this:

import re

myvocab = {}      # dictionary variant: term -> column index
vocabulary = []   # list variant: plain list of terms

count = 0

for row in results:  # rows fetched from the SQL Server table
    # Replace HTML entities such as &amp; or &#39; with a space
    skillName = re.sub(r'&#?[a-z0-9]+;', ' ', row['SkillName'])
    skillName = unicode(skillName, "utf-8")   # Python 2 decoding
    vocabulary.append(skillName)              # list variant
    myvocab[str(skillName)] = count           # dictionary variant
    count += 1

I then use the vocabulary (either the list or the dictionary; both give the same result in the end) in the TfidfVectorizer as follows:

vectorizer = TfidfVectorizer(max_df=0.8, stop_words='english',
                             ngram_range=(1, 2), vocabulary=myvocab)
X = vectorizer.fit_transform(dataset2)

The shape of X is (651, 24321) as I have 651 instances to cluster and 24321 words in the vocabulary.

If I print the contents of X, this is what I get:

(14, 11462)     1.0
(20, 10218)     1.0
(34, 11462)     1.0
(40, 11462)     0.852815313278
(40, 10218)     0.52221264006
(50, 11462)     1.0
(81, 11462)     1.0
(84, 11462)     1.0
(85, 11462)     1.0
(99, 10218)     1.0
(127, 11462)    1.0
(129, 10218)    1.0
(132, 11462)    1.0
(136, 11462)    1.0
(138, 11462)    1.0
(150, 11462)    1.0
(158, 11462)    1.0
(186, 11462)    1.0
(210, 11462)    1.0

:   :

As can be seen, for most instances only one word from the vocabulary is present (which is wrong, as there are at least 10), and for a lot of instances not even one word is found. Also, the words that are found tend to be the same across instances, which doesn't make sense.
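
A quick way to quantify this, as a sketch (X is the sparse matrix returned by fit_transform above; getnnz counts the stored entries per row):

import numpy as np

# Sketch: count how many vocabulary terms were matched in each document.
matches_per_doc = X.getnnz(axis=1)
print("%d documents match no vocabulary term at all" % (matches_per_doc == 0).sum())
print("median matches per document: %s" % np.median(matches_per_doc))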

If I print the feature names using:

import numpy as np
feature_names = np.asarray(vectorizer.get_feature_names())

I get:

['.NET' '10K' '21 CFR Part 11' ..., 'Zend Studio' 'Zendesk' 'Zenworks']

I must say that the program ran perfectly when the vocabulary was the one learned from the input documents, so I strongly suspect the problem is related to using a custom vocabulary.
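
For reference, a sketch of one way to compare the two vocabularies (it reuses dataset2 and myvocab from the code above):

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: fit once WITHOUT the custom vocabulary and measure the overlap
# between the learned terms and the custom list; a small overlap would
# explain why so few custom terms are ever matched.
auto = TfidfVectorizer(max_df=0.8, stop_words='english', ngram_range=(1, 2))
auto.fit(dataset2)
learned = set(auto.get_feature_names())  # get_feature_names_out() in newer releases
print("%d of %d custom terms were learned" % (len(learned & set(myvocab)), len(myvocab)))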

Does anyone have a clue what's happening?

(I'm not using a pipeline, so this problem can't be related to a previous bug that has already been fixed.)

Insignificancy answered 20/2, 2013 at 18:7 Comment(1)
I am also getting different results: the TF-IDF values for a given term and document change when I change the custom vocabulary, even though I am using a binary TF. Did you find a solution or a culprit? – Rueful

One thing that strikes me as unusual is that you specify ngram_range=(1,2) when you create the vectorizer. That means the standard tokenizer can never produce the feature '21 CFR Part 11' (a 4-gram). I suspect the 'missing' features are n-grams with n > 2. How many of your pre-selected vocabulary items are unigrams or bigrams?
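
A quick way to check, as a sketch (myvocab is the dict from the question; terms are assumed to split on whitespace):

from collections import Counter

# Sketch: bucket the custom vocabulary by n-gram length. Any term longer
# than 2 tokens can never be produced with ngram_range=(1, 2).
lengths = Counter(len(term.split()) for term in myvocab)
print(lengths)
print("%d terms can never match with ngram_range=(1, 2)"
      % sum(v for n, v in lengths.items() if n > 2))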

Sherrylsherurd answered 20/2, 2013 at 18:12 Comment(1)
I have tried different variations of ngram_range, from (1,1) to (1,5), and I always get the same results. The vocabulary contains 9692 1-grams, 13215 2-grams, 1337 3-grams and 77 4-grams, so I don't think that's where the problem lies. – Insignificancy

I am pretty sure this is caused by the (arguably confusing) default value of min_df=2, which cuts any feature from the vocabulary that does not occur at least twice in the dataset. Can you please confirm by explicitly setting min_df=1 in your code?
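
The suggested change, as a sketch (it reuses the names from the question; note that recent scikit-learn releases document min_df and max_df as ignored when an explicit vocabulary is supplied):

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch of the suggested change: force min_df=1 explicitly.
vectorizer = TfidfVectorizer(max_df=0.8, min_df=1, stop_words='english',
                             ngram_range=(1, 2), vocabulary=myvocab)
X = vectorizer.fit_transform(dataset2)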

Carrara answered 20/2, 2013 at 21:57 Comment(3)
I changed the value to min_df=1 and the result is exactly the same. If I print the TfidfVectorizer this is what I get: TfidfVectorizer(analyzer=word, binary=False, charset=utf-8, charset_error=strict, dtype=<type 'long'>, input=content, lowercase=True, max_df=0.8, max_features=None, max_n=None, min_df=1, min_n=None, ngram_range=(1, 2), norm=l2, preprocessor=None, smooth_idf=True, stop_words=english, strip_accents=None, sublinear_tf=False, token_pattern=(?u)\b\w\w+\b, tokenizer=None, use_idf=True, vocabulary=None) – Insignificancy
Then maybe your dataset2 is not what TfidfVectorizer expects? Check the input parameter in the TfidfVectorizer documentation. You can call TfidfVectorizer(**params).build_analyzer() on each document to check the result of the text-analysis step (preprocessing, tokenization and n-gram extraction); see the sketch after these comments. – Carrara
I have used that dataset extensively for classification and clustering; it's only now that I'm trying to use my own vocabulary that I've run into problems. Still, I will do what you suggest and report on the results. – Insignificancy
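
For reference, a sketch of the suggested check, with the question's parameters written out:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: build_analyzer() returns a callable that maps ONE document to its
# tokens and n-grams, so apply it per document rather than to the collection.
analyzer = TfidfVectorizer(max_df=0.8, stop_words='english',
                           ngram_range=(1, 2)).build_analyzer()
for doc in dataset2[:3]:   # a few documents are enough to eyeball
    print(analyzer(doc))

Comparing this output with the entries in myvocab should show whether the custom terms can ever be produced by the analysis step.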

In a Python for-in loop you cannot use count += 1 to increment count on each iteration; you could use for i in range(n): instead, because count's value would stay at 1.
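
For what it's worth, the idiomatic way to carry a running index in a Python loop is enumerate(); a sketch applied to the question's loop:

# Sketch: enumerate() yields (index, item) pairs, removing the manual
# counter from the question's loop entirely.
myvocab = {}
for count, row in enumerate(results):
    myvocab[row['SkillName']] = count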

Holbein answered 7/7, 2015 at 16:18 Comment(0)
