How to perform a 'contains' search rather than 'starts with' using Lucene.Net

We use Lucene.NET to implement full-text search on a client's website. The search itself already works, but we now want to modify it.

Currently a * is appended to every term, which leads Lucene to perform what I would classify as a StartsWith search.

In the future we would like to have a search that performs something like a Contains rather than a StartsWith.

We use

  • Lucene.Net 2.9.2.2
  • StandardAnalyzer
  • default QueryParser

Samples:

(Title:Orch*) matches: Orchestra

but:

(Title:rch*) does not match: Orchestra

We want the first and the second one to both match Orchestra.

Basically I want the exact opposite of what was asked in this question; I'm not sure why Lucene performed a Contains rather than a StartsWith by default for that person:
Why is this Lucene query a "contains" instead of a "startsWith"?

How can we make this happen?
I have the feeling it has something to do with the Analyzer but I'm not sure.

Iapetus answered 30/3, 2011 at 10:15 Comment(0)

First off, I assume you're using StandardAnalyzer or something similar. The question you linked misses the point that you search for terms: in that case a* matches "Fleet Africa" because the text is tokenized into "fleet" and "africa", and "africa" starts with "a".
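
To illustrate why a prefix query can match in the middle of a multi-word value, here is a minimal Python sketch. It is a conceptual model, not Lucene code: `tokenize` is a hypothetical stand-in that only lowercases and splits on non-alphanumeric characters, which approximates what StandardAnalyzer does for simple English text.

```python
import re

def tokenize(text):
    """Hypothetical analyzer stand-in: lowercase, then split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def prefix_matches(field_value, prefix):
    """A prefix query matches when ANY indexed term starts with the prefix."""
    return any(term.startswith(prefix.lower()) for term in tokenize(field_value))

print(tokenize("Fleet Africa"))             # the terms that end up in the index
print(prefix_matches("Fleet Africa", "a"))  # matches via the term "africa"
print(prefix_matches("Orchestra", "rch"))   # no term starts with "rch"
```

The match is decided per term, so "starts with" applies to each token rather than to the whole field value.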

You need to call QueryParser.SetAllowLeadingWildcard(true) to be able to write queries like field:*value*. Are you actually changing the string that's passed to QueryParser?

You could parse the query as usual, and then implement a QueryVisitor that rewrites every TermQuery into a WildcardQuery while leaving other query types untouched. That way you still support phrase searches.
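
The rewrite idea can be sketched as a small tree transformation. The Python classes below are hypothetical stand-ins that only borrow Lucene's class names. They are not the Lucene.Net API, and `to_contains` is an assumed helper name.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TermQuery:
    field: str
    text: str

@dataclass(frozen=True)
class WildcardQuery:
    field: str
    pattern: str

@dataclass(frozen=True)
class PhraseQuery:
    field: str
    terms: tuple

@dataclass(frozen=True)
class BooleanQuery:
    clauses: tuple

def to_contains(query):
    """Rewrite each TermQuery into a *text* WildcardQuery; keep everything else."""
    if isinstance(query, TermQuery):
        return WildcardQuery(query.field, "*" + query.text + "*")
    if isinstance(query, BooleanQuery):
        # Recurse into nested clauses so terms at any depth get rewritten.
        return BooleanQuery(tuple(to_contains(c) for c in query.clauses))
    return query  # phrase queries and existing wildcard queries pass through

parsed = BooleanQuery((TermQuery("Title", "rch"),
                       PhraseQuery("Title", ("fleet", "africa"))))
print(to_contains(parsed))
```

Because only single-term nodes are touched, a quoted phrase keeps its original semantics.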

I see nothing good in rewriting queries into prefix or wildcard queries. There is very little shared between an orc, or a chest, and an Orchestra, yet a contains search for either word would match it. Instead, hook up your customer with an analyzer that supports stemming and synonyms, and provide a spell-correction feature to fix simple searching mistakes.

Weslee answered 30/3, 2011 at 11:17 Comment(4)
As search engine specs are often "do it like Google", you can point out that Google doesn't seem to allow this either. Try searching for "chestra" ;) - Ekaterinodar
Thanks, that was exactly what I was looking for. Regarding the task itself: as so often, the client wants exactly this and is resistant to arguments ;) Also, it's actually a non-natural-language search over band names and event descriptions, so things like stemming and synonyms will most likely not be ideal in this case. Anyway, it works great now, thanks! - Iapetus
Watch out for the serious performance penalty of using SetAllowLeadingWildcard(true); the answer above only hints at it. Also, depending on your intended usage, you might want to look into n-grams (ShingleFilter), as Xodarap suggested in another answer. - Somnus
Yes, allowing leading wildcards can carry a huge performance penalty. Queries like f:cat* are rewritten into something like f:(cat cats category ...) by using a TermsEnum. A leading wildcard means Lucene needs to iterate over all terms instead of a small range. A similar issue exists with SQL Server indexes: they can't be used with LIKE '%value%'. - Weslee
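
The cost difference described in the last comment can be sketched with a sorted term list. This is a simplified Python model of a term dictionary, not Lucene's actual TermsEnum. The function names and the `examined` counter are assumptions added for illustration.

```python
from bisect import bisect_left

def prefix_terms(sorted_terms, prefix):
    """Seek to the prefix, then scan only the contiguous matching range."""
    start = bisect_left(sorted_terms, prefix)
    matches, examined = [], 0
    for term in sorted_terms[start:]:
        examined += 1
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches, examined

def contains_terms(sorted_terms, fragment):
    """A leading wildcard gives no seek point, so every term is examined."""
    matches = [t for t in sorted_terms if fragment in t]
    return matches, len(sorted_terms)

terms = sorted(["africa", "cat", "category", "cats", "chest",
                "fleet", "orc", "orchestra"])
print(prefix_terms(terms, "cat"))    # touches only a small range
print(contains_terms(terms, "rch"))  # touches the whole dictionary
```

With a real index the dictionary may hold millions of terms, so the full scan is where the penalty comes from.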

@Simon Svensson probably gave the better answer (i.e. you don't need this), but if you do, you should use a Shingle Filter.

Note that this will make your index massively larger, since instead of just storing "orchestra", you will store "orc", "rch", "che", "hes"... But just having a plain term query with leading wildcards will be massively slow. It will essentially have to look through every single term in your corpus.
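
The trade-off can be sketched with a character trigram index in plain Python. This is a conceptual model only: `char_ngrams`, `build_index` and `contains_lookup` are hypothetical helpers, not Lucene classes. As far as I know, Lucene's ShingleFilter builds word-level n-grams, and character n-grams like the ones listed above come from the n-gram tokenizer and filter classes, so check which of these your Lucene.Net version ships.

```python
from collections import defaultdict

def char_ngrams(term, n=3):
    """All overlapping character n-grams of a term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

def build_index(terms, n=3):
    """Map each n-gram to the set of terms containing it (the larger index)."""
    index = defaultdict(set)
    for term in terms:
        for gram in char_ngrams(term, n):
            index[gram].add(term)
    return index

def contains_lookup(index, fragment, n=3):
    """Find terms containing the fragment via exact n-gram lookups, no full scan."""
    grams = char_ngrams(fragment, n)
    if not grams:
        raise ValueError("fragment must be at least n characters long")
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all grams does not guarantee adjacency, so verify each candidate.
    return sorted(t for t in candidates if fragment in t)

index = build_index(["orchestra", "orc", "chest"])
print(char_ngrams("orchestra"))
print(contains_lookup(index, "rch"))
```

One term of nine characters becomes seven index entries here, which is the index growth the answer warns about. In exchange, a contains query turns into ordinary exact-term lookups.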

Luttrell answered 30/3, 2011 at 15:11 Comment(1)
Can you explain how to configure/use shingle filter for this case? - Deoxidize
