Using a Combination of Wildcards and Stemming

I'm using a Snowball analyzer to stem the titles of multiple documents. Everything works well, but there are some quirks.

Example:

A search for "valv", "valve", or "valves" returns the same number of results. This makes sense since the snowball analyzer reduces everything down to "valv".

I run into problems when using a wildcard. A search for "valve*" or "valves*" does not return any results. Searching for "valv*" works as expected.

I understand why this is happening, but I don't know how to fix it.

I thought about writing an analyzer that stores the stemmed and non-stemmed tokens. Basically applying two analyzers and combining the two token streams. But I'm not sure if this is a practical solution.

I also thought about using the AnalyzingQueryParser, but I don't know how to apply it to a multi-field query. Also, using the AnalyzingQueryParser would return results for "valve" when searching for "valves*", and that's not the expected behavior.

Is there a "preferred" way of utilizing both wildcards and stemming algorithms?

Mediocre answered 1/2, 2012 at 21:17 Comment(0)

I have used two different approaches to solve this before:

  1. Use two fields: one that contains stemmed terms, and another containing terms generated by, say, the StandardAnalyzer. When you parse the search query, if it's a wildcard query, search in the "standard" field; if not, use the field with stemmed terms. This may be harder to use if you have users enter their queries directly into Lucene's QueryParser.

  2. Write a custom analyzer and index overlapping tokens. It basically consists of indexing the original term and the stem at the same position in the index using the PositionIncrementAttribute. You can look into SynonymFilter for an example of how to use the PositionIncrementAttribute correctly.
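Approach #1 above boils down to a routing decision at query time. As a sketch (the field names `title_standard` and `title_stemmed` are illustrative, not from the original answer):

```python
def choose_field(query):
    """Route wildcard queries to the unstemmed 'standard' field;
    everything else goes to the field with stemmed terms."""
    if "*" in query or "?" in query:
        return "title_standard"
    return "title_stemmed"

# A query like "valves*" is searched against the unstemmed field,
# where the literal prefix "valves" can actually match indexed terms.
print(choose_field("valves*"))
print(choose_field("valves"))
```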

I prefer solution #2.
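In real Lucene, solution #2 is a custom TokenFilter that uses PositionIncrementAttribute. As a language-neutral sketch of what ends up in the index, here is a small Python simulation; `toy_stem` is a crude stand-in for the Snowball stemmer, not its actual behavior:

```python
def toy_stem(term):
    """Crude stand-in for a Snowball stemmer: strip a common suffix."""
    for suffix in ("es", "s", "e"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def overlapping_tokens(terms):
    """Yield (term, position_increment) pairs. The original term advances
    the position by 1; when the stem differs, it is emitted at the SAME
    position (increment 0), which is what PositionIncrementAttribute
    achieves in a SynonymFilter-style TokenFilter."""
    out = []
    for term in terms:
        out.append((term, 1))      # original token, advances position
        stem = toy_stem(term)
        if stem != term:
            out.append((stem, 0))  # stem overlaps the original token
    return out

# Both "valves" and "valv" are now in the index at the same position,
# so "valves*", "valv*", and a stemmed search for "valve" can all match.
print(overlapping_tokens(["pressure", "valves"]))
```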

March answered 2/2, 2012 at 19:8 Comment(2)
+1 for the second solution, it's the most natural way of doing this. – Christlike
The KeywordRepeatFilter of Lucene 4.7.2+ (released in 2014) does exactly what solution 2 describes and is compatible with the official stemming filters. – Sherborn

This is the simplest solution, and it works:

Add solr.KeywordRepeatFilterFactory to your 'index' analyzer:

http://lucene.apache.org/core/4_8_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/KeywordRepeatFilterFactory.html

Then add RemoveDuplicatesTokenFilterFactory at the end of the same 'index' analyzer chain.

Now your index will always contain both the stemmed and the non-stemmed form of each token at the same position, and you are good to go.
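Assuming a Solr schema, such an 'index' analyzer chain might look like the following (the field type name and the tokenizer/stemmer choices are illustrative, not prescribed by the answer):

```xml
<!-- KeywordRepeatFilterFactory emits each token twice, marking one copy as a
     keyword so the stemmer leaves it untouched; RemoveDuplicatesTokenFilterFactory
     drops the second copy when stemming did not change the term. -->
<fieldType name="text_stem_keep" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```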

Dianndianna answered 27/5, 2014 at 19:44 Comment(0)

I don't think there is an easy (and correct) way to do this.

My solution would be writing a custom query parser that finds the longest string common to the terms in the index and to your search criteria.

class MyQueryParser : Lucene.Net.QueryParsers.QueryParser
{
    IndexReader _reader;
    Analyzer _analyzer;

    public MyQueryParser(string field, Analyzer analyzer,IndexReader indexReader) : base(field, analyzer)
    {
        _analyzer = analyzer;
        _reader = indexReader;
    }

    // Shorten the prefix one character at a time until it matches at least
    // one indexed term: this finds the longest prefix common to the index
    // and the search criteria.
    public override Query GetPrefixQuery(string field, string termStr)
    {
        for (string longestStr = termStr; longestStr.Length > 2; longestStr = longestStr.Substring(0, longestStr.Length - 1))
        {
            TermEnum te = _reader.Terms(new Term(field, longestStr));
            Term term = te.Term();
            te.Close();

            // The enumerator is positioned at the first term >= longestStr;
            // verify it is actually a continuation of our prefix.
            if (term != null && term.Field() == field && term.Text().StartsWith(longestStr))
            {
                return base.GetPrefixQuery(field, longestStr);
            }
        }

        return base.GetPrefixQuery(field, termStr);
    }
}

You can also try calling your analyzer inside GetPrefixQuery, since analyzers are not normally applied to prefix queries:

TokenStream ts = _analyzer.TokenStream(field, new StringReader(termStr));
Lucene.Net.Analysis.Token token = ts.Next();   // first token produced by the analyzer
ts.Close();
if (token == null)                             // analyzer produced no tokens; fall back
    return base.GetPrefixQuery(field, termStr);
return base.GetPrefixQuery(field, token.TermText());

But, be aware that you can always find a case where the returned results are not correct. This is why Lucene doesn't take analyzers into account when using wildcards.

Hightension answered 1/2, 2012 at 22:32 Comment(1)
I'd really like to find a way to merge two token streams so I could have a stemmed and non-stemmed set of tokens... I'm going to look into this for a bit. I'll update if I find a way. – Mediocre

The only idea I have beyond the other answers is to use dismax against the two fields, so you can simply set the relative weights of the two fields. The only caveat is that some versions of dismax did not handle wildcards, and some dismax parsers are Solr-specific.
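Assuming Solr's edismax parser (which, unlike early dismax, supports wildcards) and the illustrative field names used above, the relative weighting might look like:

```
q=valves*&defType=edismax&qf=title_stemmed^2.0 title_standard^1.0
```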

Panther answered 11/2, 2012 at 3:53 Comment(0)

Another runtime solution would be to process the search term through the query parser before searching. Something like this:

  1. detect wildcard terms and remove wildcards

    searchTerm = "valves*".Replace("*", ""); -> "valves"

  2. pass the search term through your query parser

    searchTerm = queryParser.Parse(searchTerm).ToString().Split(':')[1]; -> "valv"

  3. add wildcard to the search term:

    searchTerm = searchTerm + "*";
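The three steps above can be sketched as follows; here `toy_stem` is a crude stand-in for the stemming round trip through queryParser.Parse, not the real Snowball algorithm:

```python
def toy_stem(term):
    """Crude stand-in for the query parser's stemming analyzer."""
    for suffix in ("es", "s", "e"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def rewrite_wildcard(search_term):
    """Strip a trailing wildcard, stem the bare term, re-append the wildcard."""
    if not search_term.endswith("*"):
        return search_term
    bare = search_term.rstrip("*")   # 1. remove the wildcard
    stemmed = toy_stem(bare)         # 2. pass through the (stemming) analyzer
    return stemmed + "*"             # 3. add the wildcard back

# "valves*" becomes "valv*", which matches the stemmed terms in the index.
print(rewrite_wildcard("valves*"))
```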

Castled answered 16/5 at 13:30 Comment(0)