Token pattern for n-gram in TfidfVectorizer in python
Does TfidfVectorizer identify n-grams using python regular expressions?

While reading the documentation for scikit-learn's TfidfVectorizer, I see that the pattern used to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I am having trouble seeing how this works. Consider the bi-gram case. If I do:

    In [1]: import re
    In [2]: re.findall(u'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
    Out[2]: []

I do not find any bigrams. Whereas:

    In [2]: re.findall(u'(?u)\w+ \w*',u'this is a sentence! this is another one.')
    Out[2]: [u'this is', u'a sentence', u'this is', u'another one']

finds some (but not all; e.g. u'is a' and all the other even-positioned bigrams are missing). What am I doing wrong in my interpretation of the \b character?

Note: According to the regular expression module documentation, the \b character in re is supposed to:

\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.

I see questions addressing the issue of identifying n-grams in python (see 1,2), so a secondary question is: should I identify n-grams myself and add the joined n-grams before feeding my text to TfidfVectorizer?

Episiotomy answered 26/3, 2015 at 23:51

You should write regular expression patterns as raw string literals (prefixed with r). In a plain string literal, '\b' is the backspace control character (\x08), not a word-boundary assertion, so your pattern never matches anything. The following works:

>>> re.findall(r'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
[u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']
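To see concretely why the non-raw pattern returns an empty list, note that Python resolves the \b escape before the re module ever sees the pattern:

```python
import re

# In a plain string literal, "\b" is the backspace control character,
# so the compiled pattern searches for a literal \x08, which never occurs.
print('\b' == '\x08')  # True: the escape is consumed by the string literal

# In a raw string, the backslash survives and re treats \b as a word boundary.
pattern = r'(?u)\b\w\w+\b'
print(re.findall(pattern, u'this is a sentence! this is another one.'))
```

Note that the pattern \w\w+ requires at least two word characters, which is why the single-character token 'a' is dropped.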

This is a known bug in the documentation, but if you look at the source code they do use raw literals.

Walloping answered 3/6, 2015 at 9:33
