Lucene Standard Analyzer vs Snowball

3

23

Just getting started with Lucene.Net. I indexed 100,000 rows using the standard analyzer, ran some test queries, and noticed that plural queries don't return results if the original term was singular. I understand the snowball analyzer adds stemming support, which sounds nice. However, I'm wondering if there are any drawbacks to going with snowball over standard. Am I losing anything by going with it? Are there any other analyzers out there to consider?

Anglesey answered 6/10, 2010 at 17:45 Comment(1)
If you use the snowball analyzer, you should get results for singular/plural, because snowball will normalize them into the same form. Are you sure that you use the same analyzer for creating the index and for querying it? – Climatology
18

Yes, by using a stemmer such as Snowball, you are losing information about the original form of your text. Sometimes this will be useful, sometimes not.

For example, Snowball will stem "organization" into "organ", so a search for "organization" will return results with "organ", without any scoring penalty.

Whether or not this is appropriate for you depends on your content and on the type of queries you are supporting (for example, are the searches very basic, or are your users sophisticated and relying on search to filter the results down precisely?). You may also want to look into less aggressive stemmers, such as KStem.
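To make the trade-off concrete, here is a minimal sketch in plain Python rather than Lucene.Net. Everything in it is an assumption made for illustration: `toy_stem` is a crude suffix stripper and not the Snowball algorithm, and `build_index`, `search`, and the three documents are hypothetical. It only shows the mechanism: once index terms and query terms go through the same stemmer, "kangaroos" finds "kangaroo", but "organization" also finds "organ".

```python
from collections import defaultdict

# Hypothetical, deliberately crude suffix rules. Real Snowball (Porter2)
# applies many more rules and conditions than this.
SUFFIXES = ("izations", "ization", "s")


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text, stem):
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [toy_stem(t) if stem else t for t in tokens]


def build_index(docs, stem):
    """Build a term -> set of doc ids inverted index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text, stem):
            index[term].add(doc_id)
    return index


def search(index, query, stem):
    """Return ids of docs matching any query term.

    The query goes through the same analysis as the indexed text.
    """
    results = set()
    for term in analyze(query, stem):
        results |= index.get(term, set())
    return results


DOCS = {
    1: "the kangaroo crossed the road",
    2: "organ donation saves lives",
    3: "a nonprofit organization",
}

plain = build_index(DOCS, stem=False)
stemmed = build_index(DOCS, stem=True)

print(search(plain, "kangaroos", stem=False))       # no match without stemming
print(search(stemmed, "kangaroos", stem=True))      # matches doc 1
print(search(plain, "organization", stem=False))    # matches doc 3 only
print(search(stemmed, "organization", stem=True))   # matches docs 2 and 3
```

The stemmed index no longer contains the term "organization" at all, which is the lost information mentioned above. Analyzing the index one way and the query another way would break matching in either direction.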

Halfcaste answered 6/10, 2010 at 17:52 Comment(4)
I just figured out you can also do a fuzzy search like this "kangaroos~" that will return singular versions of the word as well, although it seems to take a bit longer to process the query. – Anglesey
@alchemical: I would really recommend against doing that. ~ is a very slow operator, and if your user does stuff like search for a phrase you're kinda screwed. Why is it so bad if "kangaroos" is stored as "kangaroo"? – Chanell
OK, that's good to know. To use KStem, do you need Solr? Do you need to work with the Lucene source code to integrate it? – Anglesey
I know this is a bit old, but do you know if the standard analyzer does any stemming at all, or does it only handle stop words? I wasn't able to figure it out :( – Geier
6

The snowball analyzer will increase your recall, because it is much more aggressive than the standard analyzer, but that usually comes at some cost in precision. So you need to evaluate your search results to see whether, for your data, you care more about recall or about precision.
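As a rough sketch of what that evaluation could look like, the snippet below computes precision and recall for one query. The document ids and relevance judgments are made up for illustration (they are not measurements of any analyzer), and `precision_recall` is a hypothetical helper, not a Lucene API.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved doc ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Assumed judgments for the query "kangaroos": docs 1 and 4 are relevant.
RELEVANT = {1, 4}

# Assumed result sets: exact matching finds only doc 4, while a stemmed
# index also finds doc 1 plus one irrelevant doc (5).
EXACT_HITS = {4}
STEMMED_HITS = {1, 4, 5}

exact = precision_recall(EXACT_HITS, RELEVANT)
stemmed = precision_recall(STEMMED_HITS, RELEVANT)

print("exact:   precision=%.2f recall=%.2f" % exact)
print("stemmed: precision=%.2f recall=%.2f" % stemmed)
```

Running the same comparison over a handful of real queries against your own data, with your own relevance judgments, gives a more honest picture than any single example.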

Climatology answered 10/10, 2010 at 11:8 Comment(0)
4

I just finished an analyzer that performs lemmatization. That's similar to stemming, except that it uses context to determine a word's type (noun, verb, etc.) and uses that information to derive the stem. It also keeps the original form of the word in the index. Maybe my library can be of use to you. It requires Lucene Java, though, and I'm not aware of any C#/.NET lemmatizers.

Rms answered 7/10, 2010 at 10:25 Comment(0)
