You need to come up with a model that converts your data into a list of tuples [input, expected_output], where input is a list of numbers between 0 and 1 representing the words of a sentence, and expected_output is a single number between 0 and 1 representing how closely the sentence matches what you are analyzing for (being political). For example, the sentence "The quick brown cat jumped over the lazy dog" should get a score of zero, while a sentence like "President shakes off corruption scandal" should get a score very close to one.
As you can see, your biggest challenge is actually obtaining the data and cleaning it. Converting it to the training format is easy: you could just hash words into numbers between 0 and 1, making sure to handle different casing and punctuation. You might also want to stem words to get the best results.
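As a rough sketch of that conversion step, something like the following could work. The function names (normalize, word_to_unit, encode) are made up for illustration, and MD5 is just one convenient stable hash; Python's built-in hash() is salted per process for strings, so it would give different numbers on every run. Stemming is left out here and would slot in after normalize.

```python
import hashlib
import re


def normalize(sentence):
    # Lowercase the text and keep only runs of letters, digits, and
    # apostrophes, so casing and punctuation do not affect the result.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def word_to_unit(word):
    # Map a word to a repeatable float in [0, 1) using the first
    # 6 bytes (48 bits) of its MD5 digest.
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def encode(sentence):
    # One number per word, in sentence order.
    return [word_to_unit(word) for word in normalize(sentence)]


# Hypothetical training pair in the [input, expected_output] shape.
example = (encode("President shakes off corruption scandal"), 1.0)
```

Note that hashed values carry no notion of similarity between words (two related words land on unrelated numbers), so this only gives the network a consistent identifier per word.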
One more thing: you can use a term relevance algorithm to rank the importance of words in your training data set, so that you keep only the top k relevant words in each sentence, since you need a uniform data size for each sentence.
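To make the top k idea concrete, here is a minimal sketch that assumes TF-IDF as the term relevance algorithm (the answer does not prescribe a specific one). The helper names and the "<pad>" filler token are hypothetical; the padding is there so that sentences with fewer than k distinct words still produce k entries.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercase and split into word-like tokens.
    return re.findall(r"[a-z0-9']+", sentence.lower())


def idf_table(sentences):
    # Smoothed inverse document frequency, treating each sentence
    # as one document.
    n = len(sentences)
    df = Counter()
    for sentence in sentences:
        df.update(set(tokenize(sentence)))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def top_k_words(sentence, idf, k, pad="<pad>"):
    # Score each distinct word by term frequency times IDF; words not
    # seen in the training data score 0. Ties break alphabetically.
    tf = Counter(tokenize(sentence))
    scores = {w: count * idf.get(w, 0.0) for w, count in tf.items()}
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    chosen = ranked[:k]
    # Pad short sentences so every result has exactly k entries.
    return chosen + [pad] * (k - len(chosen))


corpus = [
    "The quick brown cat jumped over the lazy dog",
    "President shakes off corruption scandal",
    "The president met the press",
]
idf = idf_table(corpus)
selected = top_k_words("President shakes off corruption scandal", idf, k=3)
```

The selected words would then be turned into numbers the same way as any other word. One trade-off of ranking like this is that the original word order is lost.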