How do I properly combine numerical features with text (bag of words) in scikit-learn?
I am writing a classifier for web pages, so I have a mixture of numerical features and text that I also want to classify. I am using the bag-of-words approach to transform the text into a (large) numerical vector. The code ends up looking like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np

numerical_features = [
  [1, 0],
  [1, 1],
  [0, 0],
  [0, 1]
]
corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one',
  'Is this the first document?',
]
bag_of_words_vectorizer = CountVectorizer(min_df=1)
X = bag_of_words_vectorizer.fit_transform(corpus)
word_counts = X.toarray()
tfidf_transformer = TfidfTransformer()
tfidf = tfidf_transformer.fit_transform(word_counts)

# Inspect the vocabulary (get_feature_names() was renamed to
# get_feature_names_out() in scikit-learn 1.0)
feature_names = bag_of_words_vectorizer.get_feature_names_out()
combined_features = np.hstack([numerical_features, tfidf.toarray()])

This works, but I'm concerned about the accuracy. Notice that there are 4 objects but only two numerical features. Even the simplest text results in a vector with nine features (because there are nine distinct words in the corpus). Obviously, with real text there will be hundreds or thousands of distinct words, so the final feature vector would consist of fewer than 10 numerical features but more than 1,000 word-based ones.

Because of this, won't the classifier (SVM) heavily weight the words over the numerical features, by a factor of 100 to 1? If so, how can I compensate to make sure the bag of words is weighted equally against the numerical features?
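One common way to compensate (a sketch, not part of the original question) is to multiply the tf-idf block by a scalar weight before stacking, so the text block's overall magnitude can be tuned relative to the numerical features. The `text_weight` value below is a hypothetical tuning knob, not a recommended setting:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)

# TfidfVectorizer combines CountVectorizer + TfidfTransformer in one step
tfidf = TfidfVectorizer().fit_transform(corpus).toarray()

# Down-weight the text block so both feature groups contribute comparably.
# 0.5 is an arbitrary starting point; in practice you would tune this
# (e.g. by cross-validation).
text_weight = 0.5
combined = np.hstack([numerical_features, text_weight * tfidf])
print(combined.shape)  # (4, 11): 2 numerical + 9 tf-idf columns
```

Whether equal weighting is actually optimal depends on the data; treating the weight as a hyperparameter is usually safer than forcing a fixed ratio.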

Cowbell answered 12/9, 2016 at 7:12 Comment(3)
You can reduce the dimensionality of your word features using TruncatedSVD in scikit-learn. scikit-learn.org/stable/modules/generated/…Imponderabilia
Did you find how to handle this? I'm doing a similar thing with Spark.Octavie
I don't know much about the subject, but I was looking for the same thing, and it seems what you are looking for is a FeatureUnion - #39445551Hardin
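The FeatureUnion idea mentioned in the comment above can be sketched with scikit-learn's `ColumnTransformer`, which plays the same role (apply different transformers to different columns and stack the results) with less boilerplate. This is an illustrative sketch using the question's toy data, not code from the linked answer:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    'text': ['This is the first document.',
             'This is the second second document.',
             'And the third one',
             'Is this the first document?'],
    'num_a': [1, 1, 0, 0],
    'num_b': [0, 1, 0, 1],
})

# Vectorize the text column, pass the numeric columns through unchanged,
# and stack everything into a single feature matrix.
ct = ColumnTransformer([
    ('tfidf', TfidfVectorizer(), 'text'),
    ('nums', 'passthrough', ['num_a', 'num_b']),
])
features = ct.fit_transform(df)
print(features.shape)  # (4, 11)
```

A `ColumnTransformer` can then be placed at the front of a `Pipeline` ending in the SVM, so the whole thing fits and predicts in one call.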
I think your concern is totally valid: naively representing sparse text tokens (as multi-hot vectors) produces a dimension far higher than that of the numerical features. You could tackle that with the two approaches below. Both produce a low-dimensional vector (for example, 100 dimensions) from the text, and the dimension does not grow as your vocabulary grows.
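One such approach, also suggested in the comments above, is latent semantic analysis via `TruncatedSVD`: project the sparse tf-idf matrix down to a fixed number of components before stacking it with the numerical features. A minimal sketch on the question's toy corpus (`n_components=3` only because the corpus is tiny; something like 100 is more typical for real text):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]
tfidf = TfidfVectorizer().fit_transform(corpus)  # sparse, shape (4, 9)

# Project the sparse tf-idf matrix down to a fixed, small dimension (LSA).
svd = TruncatedSVD(n_components=3, random_state=0)
text_lowdim = svd.fit_transform(tfidf)  # dense, shape (4, 3)

numerical_features = np.array([[1, 0], [1, 1], [0, 0], [0, 1]], dtype=float)
combined = np.hstack([numerical_features, text_lowdim])
print(combined.shape)  # (4, 5)
```

After the projection, the text contributes a fixed number of columns regardless of vocabulary size, so the numerical features are no longer swamped by sheer dimensionality.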

Cootch answered 18/6, 2020 at 3:12 Comment(0)
