I have a text classification problem with two types of features:
- features which are n-grams (extracted by CountVectorizer)
- other textual features (e.g. presence of a word from a given lexicon). These features are different from n-grams since they should be a part of any n-gram extracted from the text.
Both types of features are extracted from the text's tokens. I want to run tokenization only once, and then pass the resulting tokens both to CountVectorizer and to the extractor for the other presence features. So I want to pass a list of tokens to CountVectorizer, but it only accepts a string as the representation of each sample. Is there a way to pass an array of tokens instead?
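
To make the setup concrete, here is a minimal sketch of the pipeline I have in mind. One approach that seems to work is to hand CountVectorizer identity callables for `tokenizer` and `preprocessor`, so that each pre-tokenized sample passes through unchanged while n-grams are still built. The names `tokenize`, `identity`, `LEXICON` and `lexicon_presence` are placeholders for illustration, not part of scikit-learn, and the whitespace tokenizer only stands in for the real one.

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity(tokens):
    # Return the token list unchanged, so CountVectorizer neither
    # preprocesses nor re-tokenizes the sample.
    return tokens


def tokenize(text):
    # Placeholder tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


# Hypothetical lexicon used only for this illustration.
LEXICON = {"good", "great"}


def lexicon_presence(tokens):
    # 1 if any token of the sample is in the lexicon, else 0.
    return int(any(tok in LEXICON for tok in tokens))


docs = ["A good movie", "Not my kind of film"]

# Tokenize once and reuse the result for both feature extractors.
tokenized_docs = [tokenize(doc) for doc in docs]

vectorizer = CountVectorizer(
    tokenizer=identity,      # samples are already lists of tokens
    preprocessor=identity,   # skip string preprocessing
    token_pattern=None,      # unused when a tokenizer is supplied
    lowercase=False,         # lowercasing was done in tokenize()
    ngram_range=(1, 2),      # unigrams and bigrams built from the tokens
)
ngram_counts = vectorizer.fit_transform(tokenized_docs)

lexicon_flags = [lexicon_presence(tokens) for tokens in tokenized_docs]
```

Named functions are used instead of lambdas so that the fitted vectorizer can still be pickled.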