Document search on partial words

I am looking for a document search engine (such as Xapian, Whoosh, Lucene, Solr, Sphinx or others) that can search on partial terms.

For example, when searching for the term "brit", the search engine should return documents containing "britney" or "britain", or in general any document containing a word that matches the pattern *brit*.

Tangentially, I noticed that most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which are based on full terms rather than partial terms. Are there any other techniques besides TF-IDF that have been successfully implemented for document retrieval?

Syncrisis answered 26/4, 2011 at 5:32 Comment(3)
I recommend that you add a search-engine tag to your question: lucene, Xapian, or at least search-engine. Search is a general tag, and people who are into search engines may get tired of reading all sorts of requests that have nothing to do with search engines. Good luck! - Dietsche
Thanks for the suggestion, shelter. Added more tags. - Syncrisis
Is there any reason you have not read the documentation of the various engines? Lucene (and therefore Solr) supports wildcard searches: wiki.apache.org/lucene-java/… - Mincey

With Lucene you can implement this in several ways:

1.) You can use wildcard queries such as *brit*. (You would have to configure your query parser to allow leading wildcards.)

2.) You can create an additional field containing n-grams of all the terms. This results in a larger index, but in many cases searches will be faster.

3.) You can use fuzzy search to handle typing mistakes in the query, e.g. someone typed "britnei" but wanted to find "britney".
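
To make option 2 more concrete, here is a minimal sketch of an n-gram index in plain Python. It is not Lucene's implementation or API: the function names (`ngrams`, `build_index`, `search`), the choice of trigrams, and the whitespace tokenisation are all assumptions made for illustration. The idea is that every term is indexed under each of its character n-grams, so a partial query can be answered by intersecting posting sets instead of scanning every term.

```python
from collections import defaultdict


def ngrams(term, n=3):
    """Return the set of character n-grams of a term (the whole term if it is not longer than n)."""
    term = term.lower()
    if len(term) <= n:
        return {term}
    return {term[i:i + n] for i in range(len(term) - n + 1)}


def build_index(docs, n=3):
    """Map each n-gram to the set of (doc_id, term) pairs whose term contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenisation (assumption)
            for gram in ngrams(term, n):
                index[gram].add((doc_id, term))
    return index


def search(index, query, n=3):
    """Return the sorted ids of documents with a term containing `query` as a substring."""
    query = query.lower()
    if len(query) < n:
        raise ValueError(f"query must have at least {n} characters")
    # A matching term must contain every n-gram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in ngrams(query, n)))
    # Sharing all n-grams is necessary but not sufficient, so verify the substring.
    return sorted({doc_id for doc_id, term in candidates if query in term})


# Hypothetical sample documents
docs = {1: "britney spears", 2: "great britain", 3: "bright lights"}
index = build_index(docs)
print(search(index, "brit"))  # → [1, 2]
```

The extra postings are what makes the index larger, and the set intersection is what avoids the full scan of the term dictionary that a leading wildcard would otherwise need.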

For wildcard queries and fuzzy search, have a look at the query syntax docs.
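
As a rough illustration of what fuzzy search does with the "britnei" example, the sketch below matches a query against a vocabulary by edit distance. It uses plain Levenshtein distance as a stand-in and is not how Lucene implements fuzzy queries internally; `edit_distance`, `fuzzy_match`, and the `max_edits` parameter are hypothetical names chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(query, vocabulary, max_edits=1):
    """Return the sorted terms within `max_edits` edits of the query."""
    query = query.lower()
    return sorted(t for t in vocabulary if edit_distance(query, t.lower()) <= max_edits)


# Hypothetical vocabulary of indexed terms
vocabulary = ["britney", "britain", "bright"]
print(fuzzy_match("britnei", vocabulary))  # → ['britney']
```

Comparing the query with every term, as done here, only suits small vocabularies; it is meant to show the matching rule, not an efficient way to evaluate it.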

Loar answered 27/4, 2011 at 22:9 Comment(2)
How can you use "*" at the beginning of the query? - Ozonolysis
You have to tell the query parser to allow these kinds of queries. Use the setAllowLeadingWildcard method to do that. lucene.apache.org/core/4_0_0/queryparser/org/apache/lucene/… - Loar
