How to add another feature (length of text) to current bag of words classification? Scikit-learn

I am using a bag-of-words model to classify text. It works well, but I am wondering how to add another feature which is not a word.

Here is my sample code.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = [[0],[0],[0],[0],[1],[1],[1],[1]]

X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                   ])   
target_names = ['Class 1', 'Class 2']

classifier = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1,max_df=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
for item, labels in zip(X_test, predicted):
    print('%s => %s' % (item, ', '.join(target_names[x] for x in labels)))

Now it is clear that the text about London tends to be much longer than the text about New York. How would I add the length of the text as a feature? Do I have to use another way of classification and then combine the two predictions? Is there any way of doing it along with the bag of words? Some sample code would be great -- I'm very new to machine learning and scikit-learn.

Petrosal answered 24/8, 2016 at 10:42 Comment(4)
Your code does not run, namely because you are using OneVsRestClassifier when there is only a single target.Inspection
The following link does almost exactly what you are after, using sklearn's FeatureUnion: zacstewart.com/2014/08/05/…Inspection
take a look at the answer for this question #39002456Kellar
Does this answer your question? use Featureunion in scikit-learn to combine two pandas columns for tfidfParamecium

As shown in the comments, this is a combination of a FunctionTransformer, a Pipeline, and a FeatureUnion.

import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import FunctionTransformer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                    "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                    "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                    "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
y_train = np.array([[0],[0],[0],[0],[1],[1],[1],[1]])

X_test = np.array(["it's a nice day in nyc",
                   'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                   ])   
target_names = ['Class 1', 'Class 2']


def get_text_length(x):
    # Character length of each document, as a single-column 2-D array (one row per document).
    return np.array([len(t) for t in x]).reshape(-1, 1)

classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('vectorizer', CountVectorizer(min_df=1,max_df=2)),
            ('tfidf', TfidfTransformer()),
        ])),
        ('length', Pipeline([
            ('count', FunctionTransformer(get_text_length, validate=False)),
        ]))
    ])),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, y_train)
predicted = classifier.predict(X_test)
print(predicted)

This will add the length of the text to the features used by the classifier.
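To convince yourself that the extra column is really there, one quick check (a minimal sketch, assuming the fitted classifier from the code above) is to transform the training data with the FeatureUnion step alone and compare the number of columns with the vocabulary size:

# Inspect the combined feature matrix produced by the FeatureUnion.
features = classifier.named_steps['features']
combined = features.transform(X_train)

vectorizer = features.transformer_list[0][1].named_steps['vectorizer']
print(len(vectorizer.vocabulary_))   # number of tf-idf columns
print(combined.shape)                # (8, len(vocabulary) + 1): tf-idf columns plus one length column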

Lotte answered 18/9, 2017 at 12:19 Comment(2)
I would like to do something similar, but where the feature to be added is not a function of the text itself, but external, e.g. from a pandas DataFrame column. How could I add this to a pipeline? It seems FunctionTransformer has no way of getting the index of X_train, which would be needed to insert the data.Televisor
@Televisor Three options I know of. 1. make sure the new data is in the same order as the text (split columns just before training), and just use FeatureUnion to join them together. 2. Use the whole dataframe as an input but use ColumnSelector from mlxtend to select the text and the additional info in the two branches of the FeatureUnion. 3. Have a look at sklearn-pandas which makes sklearn dataframe-aware.Lotte
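A minimal sketch of that idea using scikit-learn's own ColumnTransformer (available since 0.20) instead of mlxtend's ColumnSelector; the DataFrame and its 'text' and 'extra' columns are hypothetical:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical DataFrame: the text plus an external numeric column.
df = pd.DataFrame({
    'text': ['new york is a hell of a town', 'london is in the uk'],
    'extra': [0.3, 0.9],
})
y = [0, 1]

features = ColumnTransformer([
    # A single column name (not a list) gives TfidfVectorizer the 1-D input it expects.
    ('text', TfidfVectorizer(), 'text'),
    # Pass the external numeric column through unchanged.
    ('extra', 'passthrough', ['extra']),
])

model = Pipeline([('features', features), ('clf', LinearSVC())])
model.fit(df, y)
print(model.predict(df))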

I assume that the new feature you want to add is numeric. Here is my logic. First, transform the text into a sparse matrix using TfidfVectorizer or something similar. Then convert the sparse representation to a pandas DataFrame and add your new numeric column. At the end, you may want to convert the DataFrame back to a sparse matrix using scipy or any other module you feel comfortable with. I assume that your data is in a pandas DataFrame called dataset, containing a 'Text Column' and a 'Numeric Column'. Here is some code.

import pandas as pd

dataset = pd.DataFrame({'Text Column': ['Sample Text1', 'Sample Text2'], 'Numeric Column': [2, 1]})
dataset.head()

        Numeric Column   Text Column
0                   2    Sample Text1
1                   1    Sample Text2

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse

tv = TfidfVectorizer(min_df = 0.05, max_df = 0.5, stop_words = 'english')
X = tv.fit_transform(dataset['Text Column'])
vocab = tv.get_feature_names()

X1 = pd.DataFrame(X.toarray(), columns = vocab)
X1['Numeric Column'] = dataset['Numeric Column']


X_sparse = sparse.csr_matrix(X1.values)

Finally, you may want to:

print(X_sparse.shape)
print(X.shape)

to ensure that the new column was successfully added. I hope this helps.
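Once the shapes match, X_sparse behaves like any other feature matrix; a minimal sketch of fitting a classifier on it, with hypothetical labels for the two sample rows:

from sklearn.svm import LinearSVC

y = [0, 1]            # hypothetical labels for the two sample rows above

clf = LinearSVC()
clf.fit(X_sparse, y)  # the tf-idf columns and the numeric column are used together
print(clf.predict(X_sparse))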

Kincardine answered 25/5, 2018 at 23:16 Comment(0)
