I am working on a text classification problem where multiple text features and need to build a model to predict salary range. Please refer the Sample dataset Most of the resources/tutorials deal with feature extraction on only one column and then predicting target. I am aware of the processes such as text pre-processing, feature extraction (CountVectorizer or TF-IDF) and then the applying algorithms.
In this problem, I have multiple input text features. How to handle text classification problems when multiple features are involved? These are the methods I have already tried but I am not sure if these are the right methods. Kindly provide your inputs/suggestion.
1) Applied data cleaning on each feature separately followed by TF-IDF and then logistic regression. Here I tried to see if I can use only one feature for classification.
2) Applied Data cleaning on all the columns separately and then applied TF-IDF for each feature and then merged the all feature vectors to create only one feature vector. Finally logistic regression.
3) Applied Data cleaning on all the columns separately and merged all the cleaned columns to create one feature 'merged_text'. Then applied TF-IDF on this merged_text and followed by logistic regression.
All these 3 methods gave me around 35-40% accuracy on cross-validation & test set. I am expecting at-least 60% accuracy on the test set which is not provided.
Also, I didn't understand how use to 'company_name' & 'experience' with text data. there are about 2000+ unique values in company_name. Please provide input/pointer on how to handle numeric data in text classification problem.