Does TfidfVectorizer identify n-grams using Python regular expressions?
While reading the documentation for scikit-learn's TfidfVectorizer, I see that the pattern it uses to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I am having trouble seeing how this works. Consider the bi-gram case. If I do:
In [1]: import re
In [2]: re.findall(u'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
Out[2]: []
I do not find any bigrams. Whereas:
In [3]: re.findall(u'(?u)\w+ \w*',u'this is a sentence! this is another one.')
Out[3]: [u'this is', u'a sentence', u'this is', u'another one']
finds some, but not all: e.g. u'is a' and every other overlapping bigram are missing, presumably because findall returns only non-overlapping matches. What am I doing wrong in my interpretation of how the \b character works?
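For what it's worth, a zero-width lookahead with a capturing group (written as a raw string) does recover the missing overlapping bigrams, which seems consistent with the non-overlapping behaviour of findall:
In [4]: re.findall(r'(?u)\b(?=(\w+ \w+))', u'this is a sentence! this is another one.')
Out[4]: [u'this is', u'is a', u'a sentence', u'this is', u'is another', u'another one']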
Note:
According to the regular expression module documentation, the \b character is supposed to:
\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
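Checking that documented behaviour with a raw string, the same pattern does match individual words in my sentence (u'a' is skipped because \w\w+ requires at least two characters), though still nothing resembling a bigram:
In [5]: re.findall(r'(?u)\b\w\w+\b', u'this is a sentence! this is another one.')
Out[5]: [u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']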
I see existing questions addressing how to identify n-grams in Python (see 1,2), so a secondary question is: should I do this myself and add joined n-grams before feeding my text to TfidfVectorizer? A sketch of what I mean follows.
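Here the helper add_bigrams is my own illustration, not from any library; it appends underscore-joined bigrams to the text, on the assumption that they would survive the vectorizer's default token pattern because _ counts as a word character (note it also naively pairs words across the sentence boundary):
In [6]: def add_bigrams(text):
   ...:     # collect the individual words, then append each adjacent pair joined by '_'
   ...:     words = re.findall(r'(?u)\b\w+\b', text)
   ...:     return text + u' ' + u' '.join(u'_'.join(p) for p in zip(words, words[1:]))
   ...:
In [7]: add_bigrams(u'this is a sentence! this is another one.')
Out[7]: u'this is a sentence! this is another one. this_is is_a a_sentence sentence_this this_is is_another another_one'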