I run a Random Forest algorithm with TF-IDF and non-TF-IDF features.
In total there are around 130k features (after a feature selection conducted on the TF-IDF features) and around 120k observations in the training set.
Around 500 of the features are non-TF-IDF features.
The issue is that the accuracy of the Random Forest on the same test set with
- only the non-TF-IDF features is 87%
- the TF-IDF and non-TF-IDF features is 76%
This significant drop in accuracy raises some questions in my mind.
The relevant part of my code, which trains the models, is the following:
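import os
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
# df and tf_idf_feature_names_selected are defined earlier
drop_columns = ['labels', 'complete_text_1', 'complete_text_2']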
# Split to predictors and targets
X_train = df.drop(columns=drop_columns).values
y_train = df['labels'].values
# Instantiate, train and transform with tf-idf models
vectorizer_1 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_1 = vectorizer_1.fit_transform(df['complete_text_1'])
vectorizer_2 = TfidfVectorizer(analyzer="word", ngram_range=(1,2), vocabulary=tf_idf_feature_names_selected)
X_train_tf_idf_2 = vectorizer_2.fit_transform(df['complete_text_2'])
# Convert the general features to a sparse array
X_train = np.array(X_train, dtype=float)
X_train = csr_matrix(X_train)
# Concatenate the general features and tf-idf features array
X_train_all = hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2])
# Instantiate and train the model
rf_classifier = RandomForestClassifier(n_estimators=150, random_state=0, class_weight='balanced', n_jobs=os.cpu_count()-1)
rf_classifier.fit(X_train_all, y_train)
Personally, I have not found any bug in my code (neither in the piece above nor in general).
The hypothesis I have formulated to explain this decrease in accuracy is the following.
- The number of non-TF-IDF features is only 500 (out of the 130k features in total).
- This makes it fairly likely that the non-TF-IDF features are rarely picked at each split by the trees of the random forest (e.g. because of max_features etc.).
- So if the non-TF-IDF features do actually matter, this will create problems because they are not taken into account enough.
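To put rough numbers on this hypothesis, here is a back-of-the-envelope sketch. It assumes max_features is left at the square-root setting, so that int(sqrt(130000)) = 360 candidate features are drawn without replacement at each split, and it ignores scikit-learn details such as continuing the search when the drawn features give no valid split. The function name is made up for illustration.

```python
import math


def prob_no_general_candidate(n_total, n_general, n_candidates):
    """Probability that none of the general (non-TF-IDF) features is among
    the candidates drawn without replacement at a single split."""
    log_p = sum(
        math.log((n_total - n_general - i) / (n_total - i))
        for i in range(n_candidates)
    )
    return math.exp(log_p)


n_total = 130_000    # all features (figure from the post)
n_general = 500      # non-TF-IDF features (figure from the post)
n_candidates = int(math.sqrt(n_total))  # square-root rule, assumed

# Expected number of non-TF-IDF features among the candidates of one split
expected_general = n_candidates * n_general / n_total
p_none = prob_no_general_candidate(n_total, n_general, n_candidates)

print(f"candidates per split: {n_candidates}")
print(f"expected non-TF-IDF candidates per split: {expected_general:.2f}")
print(f"P(no non-TF-IDF candidate at a split): {p_none:.2f}")
```

Under these assumptions a split sees on average between one and two non-TF-IDF candidates, and roughly a quarter of the splits see none at all.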
Related to this, when I check the feature importances of the random forest after training it, I see that the importances of the non-TF-IDF features are very low (although I am not sure how reliable an indicator feature importances are, especially with TF-IDF features included).
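One way to make this check less dependent on individual features is to sum the importances per feature block, since each TF-IDF column can have a tiny importance while the block as a whole dominates. A minimal sketch, assuming the columns are ordered as in hstack([X_train, X_train_tf_idf_1, X_train_tf_idf_2]); the helper name and the toy numbers are hypothetical.

```python
def importance_by_block(importances, block_sizes):
    """Sum feature importances over consecutive column blocks.

    importances: sequence of per-feature importances, in column order
    block_sizes: dict mapping block name -> number of columns, in column order
    """
    totals = {}
    start = 0
    for name, size in block_sizes.items():
        totals[name] = float(sum(importances[start:start + size]))
        start += size
    if start != len(importances):
        raise ValueError("block sizes do not add up to the number of features")
    return totals


# Toy importances standing in for rf_classifier.feature_importances_
toy_importances = [0.05, 0.05, 0.30, 0.20, 0.10, 0.30]
toy_blocks = {"general": 2, "tfidf_1": 3, "tfidf_2": 1}

totals = importance_by_block(toy_importances, toy_blocks)
print(totals)
```

With the fitted model, the inputs would be rf_classifier.feature_importances_ and the widths of the three matrices (for example X_train.shape[1] for the general block). Impurity-based importances are known to be biased, so the block totals are best read as a rough indication only.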
Can you explain the decrease in accuracy of my classifier differently?
In any case, what would you suggest doing?
Some other ideas for combining the TF-IDF and non-TF-IDF features are the following.
One option would be to have two separate (random forest) models: one for the TF-IDF features and one for the non-TF-IDF features. The results of these two models would then be combined either by (weighted) voting or by meta-classification.
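The two-model idea with weighted soft voting could look roughly like the sketch below. It runs on small synthetic data rather than the real features, and the weights (0.7 and 0.3), the variable names and the data shapes are all assumptions chosen for illustration; in practice the weights would be tuned on a validation set.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_train = 200, 150

# Synthetic stand-ins: a small dense "general" block and a sparse "TF-IDF" block
y = rng.integers(0, 2, size=n_samples)
X_general = rng.normal(size=(n_samples, 5)) + y[:, None]
X_text = sparse_random(n_samples, 300, density=0.02, format="csr", random_state=0)

# One random forest per feature block
rf_general = RandomForestClassifier(n_estimators=50, random_state=0)
rf_general.fit(X_general[:n_train], y[:n_train])
rf_text = RandomForestClassifier(n_estimators=50, random_state=0)
rf_text.fit(X_text[:n_train], y[:n_train])

# Weighted soft voting: weighted average of the class probabilities
w_general, w_text = 0.7, 0.3  # hypothetical weights
proba = (
    w_general * rf_general.predict_proba(X_general[n_train:])
    + w_text * rf_text.predict_proba(X_text[n_train:])
)
y_pred = rf_general.classes_[np.argmax(proba, axis=1)]

print("combined accuracy:", (y_pred == y[n_train:]).mean())
```

For meta-classification, the two probability arrays would instead be stacked column-wise and fed to a second-level classifier trained on out-of-fold predictions.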
... max_features, it will use sqrt(n_features) by default, which is about 360 features any given tree will see. Even if there's no overlap in those features between different trees, 150*360 = 54k. So most of your 130k features will never be seen by the model. – Ostracoderm
... max_features in general and this is why I refer to this in my post too. However, unless I am missing something, keep in mind that a new set of features is chosen at each split and not only at each tree, based on the original random forest paper and also on the SkLearn documentation for the RandomForestClassifier (max_features {"auto", "sqrt", "log2"}, int or float, default="auto": The number of features to consider when looking for the best split). – Supersession