Does TfidfVectorizer identify n-grams using Python regular expressions?
While reading the documentation for scikit-learn's TfidfVectorizer, I see that the pattern it uses to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I am having trouble seeing how this works. Consider the bi-gram case. If I do:
In [1]: import re
In [2]: re.findall(u'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
Out[2]: []
I do not find any bigrams. Whereas:
In [3]: re.findall(u'(?u)\w+ \w*',u'this is a sentence! this is another one.')
Out[3]: [u'this is', u'a sentence', u'this is', u'another one']
finds some, but not all: e.g. u'is a' and every other overlapping bigram are missing, presumably because findall returns only non-overlapping matches. What am I doing wrong in my interpretation of how the \b character works?
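For what it's worth, a zero-width lookahead with a capturing group (written as a raw string) does recover the missing overlapping bigrams, which seems consistent with the non-overlapping behaviour of findall:
In [4]: re.findall(r'(?u)\b(?=(\w+ \w+))', u'this is a sentence! this is another one.')
Out[4]: [u'this is', u'is a', u'a sentence', u'this is', u'is another', u'another one']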
Note:
According to the regular expression module documentation, the \b character is supposed to:
\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
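Checking that documented behaviour with a raw string, the same pattern does match individual words in my sentence (u'a' is skipped because \w\w+ requires at least two characters), though still nothing resembling a bigram:
In [5]: re.findall(r'(?u)\b\w\w+\b', u'this is a sentence! this is another one.')
Out[5]: [u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']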
I see existing questions addressing how to identify n-grams in Python (see 1,2), so a secondary question is: should I do this myself and add joined n-grams before feeding my text to TfidfVectorizer? A sketch of what I mean follows.
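Here the helper add_bigrams is my own illustration, not from any library; it appends underscore-joined bigrams to the text, on the assumption that they would survive the vectorizer's default token pattern because _ counts as a word character (note it also naively pairs words across the sentence boundary):
In [6]: def add_bigrams(text):
   ...:     # collect the individual words, then append each adjacent pair joined by '_'
   ...:     words = re.findall(r'(?u)\b\w+\b', text)
   ...:     return text + u' ' + u' '.join(u'_'.join(p) for p in zip(words, words[1:]))
   ...:
In [7]: add_bigrams(u'this is a sentence! this is another one.')
Out[7]: u'this is a sentence! this is another one. this_is is_a a_sentence sentence_this this_is is_another another_one'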