I'm thinking about using word n-gram techniques on raw text, but I have a doubt:
does it make sense to use word n-grams after applying lemmatization/stemming to the text? If not, why should I use word n-grams only on raw files? What are the pros and cons?
Computing word n-grams after lemmatization or stemming would be done for the same reasons as you would want to before stemming. Sometimes this gets you false positives, e.g., a stemmed n-gram can match phrases the raw n-gram would not, but it usually increases recall in such a meaningful way that you want to do it.
In some domains, e.g., short text, stemming can hurt. The best thing to do is to test; in general I would suggest stemming and case-folding, but it really depends on your domain and queries.
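To make this concrete, here is a minimal sketch of both variants (word bigrams on raw text vs. on case-folded, stemmed text). NLTK's PorterStemmer and a plain whitespace split are my own assumptions, not anything named in the answer; any stemmer and tokenizer would do:

```python
# A minimal sketch: word bigrams on raw vs. case-folded + stemmed tokens.
# NLTK's PorterStemmer and a naive whitespace split are assumptions here.
from nltk.stem import PorterStemmer
from nltk.util import ngrams

stemmer = PorterStemmer()

def word_bigrams(text, stem=False):
    tokens = text.lower().split()  # case-fold + naive whitespace tokenization
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return list(ngrams(tokens, 2))

doc = "The clerk keeps criminal records"
print(word_bigrams(doc))             # ..., ('criminal', 'records')
print(word_bigrams(doc, stem=True))  # ..., ('crimin', 'record')
```

Note how the stemmed bigram ('crimin', 'record') will match any inflected form of the phrase, which is exactly the recall gain described above.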
Q="criminal records"
It's a precision/recall tradeoff. You can increase recall by stemming (always) and you can increase precision by not stemming. But it depends on what kinds of queries you are serving. If you're running code search, for instance, you almost never want to stem or preprocess, because users expect to type in exact symbol names and then find them.
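A quick illustration of that tradeoff with the Q="criminal records" example, again assuming NLTK's Porter stemmer (my choice of library, not the answer's):

```python
# Porter stems both "records" and "recording" to "record", so the stemmed
# bigram for Q="criminal records" also matches "criminal recording".
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stemmed_bigram(phrase):
    first, second = phrase.lower().split()
    return (stemmer.stem(first), stemmer.stem(second))

print(stemmed_bigram("criminal records"))    # ('crimin', 'record')
print(stemmed_bigram("criminal recording"))  # ('crimin', 'record') -- false positive
```

The two phrases collapse to the same stemmed bigram: recall goes up, precision goes down.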