How to combine TFIDF features with other features

I have a classic NLP problem: I have to classify news articles as fake or real.

I have created two sets of features:

A) Bigram Term Frequency-Inverse Document Frequency

B) Approximately 20 features associated with each document, obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as subjectivity of the text, polarity, number of stopwords, number of verbs, number of subjects, grammatical relations, etc.

What is the best way to combine the TF-IDF features with the other features for a single prediction? Thanks a lot to everyone.

Vindicate answered 1/2, 2018 at 23:2 Comment(4)
Please add your script, i.e. the code showing how you use it. A reference to a webpage with code is insufficient: 1) it can change over time, and 2) there is too much code there touching different topics... and you want me to guess which code you used based on that and a few lines in your question... not gonna happen!Minimize
I'm sorry, perhaps I explained it badly. Mine is a theoretical question; I'm not interested in the script code.Vindicate
..yeah.. then your question gets flagged... if you had added code to show something, then made your point and asked this, it would have been accepted without problem (SO rules changed over time). Now it will probably get downvoted... :-(Minimize
see here: datascience.stackexchange.com/questions/22813/…Stablish

Not sure if you're asking technically how to combine the two objects in code or what to do theoretically afterwards, so I will try to answer both.

Technically, your TF-IDF is just a matrix where the rows are records and the columns are features. To combine them, you can append your new features as columns to the end of that matrix. If you built it with sklearn, your matrix is probably a sparse matrix (from SciPy), so you will have to make sure your new features are a sparse matrix as well (or make the TF-IDF matrix dense).
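For example, a minimal sketch of the "append as columns" idea using scipy.sparse.hstack (the documents and the two extra feature values below are made up for illustration):

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "fake news spreads fast online",
    "real news is verified by editors",
    "fake claims spread fast everywhere",
]
# Stand-in for two dense features per document, e.g. subjectivity and polarity.
extra_feats = np.array([[0.9, 0.2], [0.1, 0.8], [0.7, 0.3]])

vec = TfidfVectorizer(ngram_range=(2, 2))   # bigram TF-IDF
X_tfidf = vec.fit_transform(docs)           # sparse CSR matrix

# Convert the dense features to sparse and append them as extra columns.
X_combined = hstack([X_tfidf, csr_matrix(extra_feats)], format="csr")
print(X_combined.shape)                     # (n_docs, n_bigrams + 2)
```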

That gives you your training data; in terms of what to do with it, things are a little more tricky. Your features from a bigram frequency matrix will be sparse (I'm not talking data structures here, I just mean you will have a lot of 0s) and largely binary (most bigrams appear at most once per document), whilst your other data is dense and continuous. Most machine learning algorithms will run on this as is, although the prediction will probably be dominated by the dense variables.

However, with a bit of feature engineering I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give boosted results (for example, a classifier that looks at Twitter profiles and classifies them as companies or people). I usually found better results when I could at least bin the dense variables into binary (or categorical, then one-hot encoded into binary) so that they didn't dominate.
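A sketch of that binning idea, using sklearn's KBinsDiscretizer on a made-up stand-in for the ~20 dense features (one option among several; the answer doesn't name a specific tool):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
X_dense = rng.rand(100, 20)   # stand-in for the pattern.en features

# Bin each continuous column into 5 quantile bins, one-hot encoded,
# so every resulting column is binary like the term-frequency ones.
binner = KBinsDiscretizer(n_bins=5, encode="onehot", strategy="quantile")
X_binned = binner.fit_transform(X_dense)
print(X_binned.shape)         # (100, 100): 20 features x 5 one-hot bins

# X_binned is sparse and binary, so it can be appended to the TF-IDF
# matrix without dominating it:
# X_combined = hstack([X_tfidf, X_binned], format="csr")
```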

Sennight answered 1/2, 2018 at 23:56 Comment(3)
Thanks a lot for the answer. So adding the pattern.en features to the TF-IDF features (resulting in one large matrix) and using a single classification model on the resulting matrix is probably not a good idea. The better way is to use 2 distinct classifiers: a classifier A for the TF-IDF features, and a classifier B for the features generated with pattern.en. Then I combine the two predictors using a third ensemble classifier, such as a random forest, to get the final result (see the sketch after these comments). Correct?Vindicate
Good idea, I've not tried that. Do you have evidence that this often gives better results? I can see it avoids the problem I listed above, but by disentangling the features you lose relationships between the dense and sparse ones. E.g. maybe one bigram with a positive polarity is a key indicator of real news, and with negative polarity of fake news, but each feature separately is only weakly correlated. One combined classifier would capture this; separating them wouldn't.Sennight
I'll try both ways and I'll let you know. Currently I have tried to train two classifiers. The classifier based on TF-IDF gave excellent results, whereas the classifier based on pattern.en gave worse results. Thank you so much for your answers.Vindicate
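A rough sketch of the two-classifier ensemble described in the comments above, with synthetic stand-ins for the data; the specific base models (logistic regression and a random forest) are illustrative assumptions, not from the thread:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins: a sparse "TF-IDF" matrix, a dense pattern.en-style
# matrix, and binary fake/real labels.
X_tfidf = sparse_random(200, 500, density=0.02, format="csr", random_state=0)
X_dense = np.random.RandomState(0).rand(200, 20)
y = np.random.RandomState(1).randint(0, 2, 200)

clf_a = LogisticRegression(max_iter=1000)         # classifier A: TF-IDF features
clf_b = RandomForestClassifier(n_estimators=100)  # classifier B: pattern.en features

# Out-of-fold probability of the positive class from each base classifier,
# so the meta-classifier is not fit on predictions it has already seen.
p_a = cross_val_predict(clf_a, X_tfidf, y, cv=5, method="predict_proba")[:, 1]
p_b = cross_val_predict(clf_b, X_dense, y, cv=5, method="predict_proba")[:, 1]

# Third ensemble classifier combining the two predictors.
meta = RandomForestClassifier(n_estimators=100)
meta.fit(np.column_stack([p_a, p_b]), y)
```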

What if you do use a classifier for the TF-IDF, but use its predictions to add a new feature, say the TF-IDF class probabilities, to get a better result? Here are pics from an AutoML blueprint showing the same; the results were > 90 percent, versus 80 percent, for this approach compared to the two-separate-classifiers one.

[Image: XGBoost model blueprint]

[Image: NN model with TensorFlow blueprint]

[Image: LightGBM blueprint]
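A hedged sketch of this answer's idea: the TF-IDF classifier's out-of-fold predicted probability is appended as one extra column next to the dense features, and a single model is trained on the result. The answer's screenshots used XGBoost, a TensorFlow NN, and LightGBM; sklearn's GradientBoostingClassifier stands in here, and all data below is synthetic:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X_tfidf = sparse_random(200, 500, density=0.02, format="csr", random_state=0)
X_dense = rng.rand(200, 20)
y = rng.randint(0, 2, 200)

# Out-of-fold probability from the TF-IDF classifier becomes one new column.
tfidf_clf = LogisticRegression(max_iter=1000)
p_tfidf = cross_val_predict(tfidf_clf, X_tfidf, y, cv=5, method="predict_proba")[:, 1]

# Single boosted model trained on the dense features plus that probability.
X_enriched = np.column_stack([X_dense, p_tfidf])
model = GradientBoostingClassifier()
model.fit(X_enriched, y)
```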

Vibrate answered 11/7, 2022 at 17:57 Comment(1)
A log loss of < 0.25 as well, for all three modelsVibrate
