Sitecore Lucene index: match a search term containing a space against the same word without the space

This seems so simple that I'm convinced I must be overlooking something. I cannot establish how to do the following in Lucene:

The problem

  • I'm searching for place names.
  • I have a field called Name
  • It is using Lucene.Net.Analysis.Standard.StandardAnalyzer
  • It is TOKENIZED
  • The value of Name contains one space: halong bay.
  • The search term may or may not contain an extra space due to culturally different spellings or genuine spelling mistakes. E.g. ha long bay instead of halong bay.
  • If I use the term halong bay I get a hit.
  • If I use the term ha long bay I do not get a hit.

The attempted solution

Here's the code I'm using to build my predicate using LINQ to Lucene from Sitecore:

var searchContext = ContentSearchManager.GetIndex("my_index").CreateSearchContext();
var term = "ha long bay";
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Name == term);
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);

I have also tried a fuzzy match using the .Like() extension:

var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Name.Like(term));

This also yields no results for ha long bay.

How do I configure Lucene in Sitecore to return a hit for both the halong bay and ha long bay search terms, ideally without having to do anything fancy with the input term (e.g. stripping spaces, adding wildcards, etc.)?

Note: I recognise that this would also allow the term h a l o n g b a y to produce a hit, but I don't think I have a problem with this.

Outworn asked 17/8, 2016 at 16:30 Comment(2)
For misspellings of phrases it's common to use synonyms instead of getting your search logic to cover all bases. Take a look at this post on setting it up with Sitecore. Might be worth considering if you have more of these types of scenarios - firebreaksice.com/sitecore-synonym-search-with-lucene - Dashed
Thanks for the heads up about synonyms. I might actually implement that for other types of searches. However, to my mind this isn't a synonym. It's the same word but with whitespace added. Perhaps I'm being pedantic, but the reason to have synonyms is to specify totally different words which have nothing mathematically in common even though they have the same meaning, e.g. "fast" and "quick" have zero common letters. - Outworn

A TOKENIZED field means that the analyzer splits the field value into tokens (on whitespace in this case) and adds the resulting terms to the index dictionary. If you index "halong bay" in such a field, it will create the "halong" and "bay" terms.

It's normal for the search engine to fail to retrieve this result for the "ha long" search query, because the index contains no "ha" or "long" terms.
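
The mismatch described above can be illustrated with a rough, self-contained sketch. The Tokenize helper below is hypothetical and only approximates an analyzer by lowercasing and splitting on spaces; the real StandardAnalyzer does more (punctuation handling, stop words), so treat this as an illustration rather than what Lucene actually runs:

```csharp
using System;
using System.Linq;

public static class TokenizerSketch
{
    // Hypothetical stand-in for an analyzer: lowercase, then split on spaces.
    public static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        var indexed = Tokenize("halong bay");  // terms stored in the index
        var query = Tokenize("ha long bay");   // terms looked up at search time

        Console.WriteLine(string.Join(", ", indexed)); // halong, bay
        Console.WriteLine(string.Join(", ", query));   // ha, long, bay

        // Only "bay" is shared; "ha" and "long" never made it into the index.
        Console.WriteLine(string.Join(", ", query.Intersect(indexed))); // bay
    }
}
```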

A manual approach would be to define all the other ways to write the place name in another multi-value computed index field named AlternateNames. Then you could issue this kind of query: Name==query OR AlternateNames==query.

An automatic approach would be to also index the place names without spaces in a separate computed index field named CompactName. Then you could issue this kind of query: Name==query OR CompactName==compactedQueryWithoutSpaces
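
As a minimal sketch of the compacting idea, here is an in-memory version that does not touch Sitecore at all. The Compact and Search helpers are hypothetical names; in a real index the compacted value would come from the CompactName computed field, and the comparison would be expressed through the search predicate rather than LINQ to Objects:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CompactNameSketch
{
    // Hypothetical helper: lowercase the value and drop all whitespace.
    public static string Compact(string value)
    {
        return new string(value.ToLowerInvariant()
                               .Where(c => !char.IsWhiteSpace(c))
                               .ToArray());
    }

    // In-memory stand-in for: Name == query OR CompactName == compactedQuery.
    public static IEnumerable<string> Search(IEnumerable<string> names, string query)
    {
        var compactQuery = Compact(query);
        return names.Where(name =>
            string.Equals(name, query, StringComparison.OrdinalIgnoreCase) ||
            Compact(name) == compactQuery);
    }

    public static void Main()
    {
        var names = new[] { "halong bay", "long beach" };
        Console.WriteLine(string.Join(", ", Search(names, "ha long bay"))); // halong bay
        Console.WriteLine(string.Join(", ", Search(names, "halong bay")));  // halong bay
    }
}
```

Both spellings compact to "halongbay", so either one finds the place, while "long beach" is not matched. The same Compact logic would need to run at index time and on the incoming query.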

I hope this helps

Jeff

Cloistered answered 19/8, 2016 at 17:20 Comment(1)
Thanks for the answer. I'm wondering if changing it to untokenized would allow a match to be made without manipulating the input term? I'm trying to not have to write a separate list of alternative names just to take account of whitespace. - Outworn

Something like this might do the trick:

var predicate = PredicateBuilder.False<MySearchResultItemClass>();
foreach (var t in term.Split(' '))
{
    var tempTerm = t;
    predicate = predicate.Or(p => p.Name.Contains(tempTerm));
}
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);

It does split your input string, but I guess that is not 'fancy' ;)

Alcantara answered 18/8, 2016 at 10:06 Comment(1)
I'm concerned that this will match anything with the word "bay" or "ha" or "long", which isn't what I'm after. - Outworn
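
The over-matching concern can be seen in a small in-memory sketch. MatchAnyWord is a hypothetical stand-in for OR-ing one Contains clause per search word; it uses plain substring matching on made-up sample names, which is only an approximation of how the Sitecore provider would translate Contains for Lucene:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrMatchSketch
{
    // Hypothetical stand-in: a name matches if it contains ANY of the search words.
    public static IEnumerable<string> MatchAnyWord(IEnumerable<string> names, string term)
    {
        var words = term.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return names.Where(name => words.Any(w => name.Contains(w)));
    }

    public static void Main()
    {
        var names = new[] { "halong bay", "long beach", "hudson bay", "hanoi", "kyoto" };

        // Every name sharing any one of "ha", "long" or "bay" comes back.
        Console.WriteLine(string.Join(", ", MatchAnyWord(names, "ha long bay")));
        // halong bay, long beach, hudson bay, hanoi
    }
}
```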

© 2022 - 2024 — McMap. All rights reserved.